Bio

Dr. Dennis P. Wall, PhD, is Associate Professor of Pediatrics, Psychiatry and Biomedical Data Sciences at Stanford Medical School. He leads a lab in Pediatric Innovation focused on developing methods in biomedical informatics to disentangle complex conditions that originate in childhood and perpetuate through the life course, including autism and related developmental delays. For over a decade, first on faculty at Harvard and now at Stanford University, and as healthcare has shifted increasingly to the use of digital technologies for data capture and finer resolutions of genomic scale, Dr. Wall has innovated, adapted and deployed bioinformatic strategies to enable precise and personalized interpretation of high resolution molecular and phenotypic data. Dr. Wall has pioneered the use of machine learning and artificial intelligence for fast, quantitative and mobile detection of neurodevelopmental disorders in children, as well as the use of machine learning systems on wearable devices, such as Google Glass, for real-time “exclinical” therapy. These same precision health approaches enable quantitative tracking of progress during treatment throughout an individual’s life, enabling big data generation of a type and scale never before possible, and have defined a new paradigm for behavioral detection and therapy that has won Dr. Wall several awards including a spot in the top ten of the world’s top 30 autism researchers. Dr. Wall has acted as science advisor to several biotechnology and pharmaceutical companies, has created and advised on cutting-edge approaches to cloud computing, and has received numerous awards, including the Fred R. Cagle Award for Outstanding Achievement in Biology, the Vice Chancellor's Award for Research, three awards for excellence in teaching, the Harvard Medical School Leadership award, and the Slifka/Ritvo Clinical Innovation in Autism Research Award for outstanding advancements in clinical translation.
He completed his PhD at the University of California, Berkeley and a National Science Foundation postdoctoral fellowship in Computational Genetics at Stanford University before joining the faculty at Harvard Medical School.

Academic Appointments


Professional Education


  • Fellow, Stanford University, Bioinformatics and Computational Genetics (2003)
  • Ph.D., University of California, Berkeley, Integrative Biology (2001)

Research & Scholarship

Current Research and Scholarly Interests


Systems biology for design of clinical solutions that detect and treat disease

Teaching

2018-19 Courses


Stanford Advisees


Graduate and Fellowship Programs


Publications

All Publications


  • Machine learning approach for early detection of autism by combining questionnaire and home video screening JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Abbas, H., Garberson, F., Glover, E., Wall, D. P. 2018; 25 (8): 1000–1007

    Abstract

    Existing screening tools for early detection of autism are expensive, cumbersome, time-intensive, and sometimes fall short in predictive value. In this work, we sought to apply Machine Learning (ML) to gold standard clinical data obtained across thousands of children at-risk for autism spectrum disorder to create a low-cost, quick, and easy to apply autism screening tool. Two algorithms are trained to identify autism, one based on short, structured parent-reported questionnaires and the other on tagging key behaviors from short, semi-structured home videos of children. A combination algorithm is then used to combine the results into a single assessment of higher accuracy. To overcome the scarcity, sparsity, and imbalance of training data, we apply novel feature selection, feature engineering, and feature encoding techniques. We allow for inconclusive determination where appropriate in order to boost screening accuracy when conclusive. The performance is then validated in a controlled clinical study. A multi-center clinical study of n = 162 children is performed to ascertain the performance of these algorithms and their combination. We demonstrate a significant accuracy improvement over standard screening tools in measurements of AUC, sensitivity, and specificity. These findings suggest that a mobile, machine learning process is a reliable method for detection of autism outside of clinical settings. A variety of confounding factors in the clinical analysis are discussed along with the solutions engineered into the algorithms. Final results are statistically limited and will benefit from future clinical studies to extend the sample size.

    View details for DOI 10.1093/jamia/ocy039

    View details for Web of Science ID 000440955200010

    View details for PubMedID 29741630

  • Brain-specific functional relationship networks inform autism spectrum disorder gene prediction TRANSLATIONAL PSYCHIATRY Duda, M., Zhang, H., Li, H., Wall, D. P., Burmeister, M., Guan, Y. 2018; 8: 56

    Abstract

    Autism spectrum disorder (ASD) is a neuropsychiatric disorder with strong evidence of genetic contribution, and increased research efforts have resulted in an ever-growing list of ASD candidate genes. However, only a fraction of the hundreds of nominated ASD-related genes have identified de novo or transmitted loss of function (LOF) mutations that can be directly attributed to the disorder. For this reason, a means of prioritizing candidate genes for ASD would help filter out false-positive results and allow researchers to focus on genes that are more likely to be causative. Here we constructed a machine learning model by leveraging a brain-specific functional relationship network (FRN) of genes to produce a genome-wide ranking of ASD risk genes. We rigorously validated our gene ranking using results from two independent sequencing experiments, together representing over 5000 simplex and multiplex ASD families. Finally, through functional enrichment analysis on our highly prioritized candidate gene network, we identified a small number of pathways that are key in early neural development, providing further support for their potential role in ASD.

    View details for DOI 10.1038/s41398-018-0098-6

    View details for Web of Science ID 000428350300002

    View details for PubMedID 29507298

    View details for PubMedCentralID PMC5838237

  • Feasibility Testing of a Wearable Behavioral Aid for Social Learning in Children with Autism APPLIED CLINICAL INFORMATICS Daniels, J., Haber, N., Voss, C., Schwartz, J., Tamura, S., Fazel, A., Kline, A., Washington, P., Phillips, J., Winograd, T., Feinstein, C., Wall, D. P. 2018; 9 (1): 129–40

    Abstract

    Recent advances in computer vision and wearable technology have created an opportunity to introduce mobile therapy systems for autism spectrum disorders (ASD) that can respond to the increasing demand for therapeutic interventions; however, feasibility questions must be answered first. We studied the feasibility of a prototype therapeutic tool for children with ASD using Google Glass, examining whether children with ASD would wear such a device, if providing the emotion classification will improve emotion recognition, and how emotion recognition differs between ASD participants and neurotypical controls (NC). We ran a controlled laboratory experiment with 43 children: 23 with ASD and 20 NC. Children identified static facial images on a computer screen with one of 7 emotions in 3 successive batches: the first with no information about emotion provided to the child, the second with the correct classification from the Glass labeling the emotion, and the third again without emotion information. We then trained a logistic regression classifier on the emotion confusion matrices generated by the two information-free batches to predict ASD versus NC. All 43 children were comfortable wearing the Glass. ASD and NC participants who completed the computer task with Glass providing audible emotion labeling (n = 33) showed increased accuracies in emotion labeling, and the logistic regression classifier achieved an accuracy of 72.7%. Further analysis suggests that the ability to recognize surprise, fear, and neutrality may distinguish ASD cases from NC. This feasibility study supports the utility of a wearable device for social affective learning in ASD children and demonstrates subtle differences in how ASD and NC children perform on an emotion recognition task.

    View details for DOI 10.1055/s-0038-1626727

    View details for Web of Science ID 000428690000006

    View details for PubMedID 29466819

    View details for PubMedCentralID PMC5821509

  • A Low Rank Model for Phenotype Imputation in Autism Spectrum Disorder. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science Paskov, K. M., Wall, D. P. 2018; 2017: 178–87

    Abstract

    Autism Spectrum Disorder is a highly heterogeneous condition currently diagnosed using behavioral symptoms. A better understanding of the phenotypic subtypes of autism is a necessary component of the larger goal of mapping autism genotype to phenotype. However, as with most clinical records describing human disease, the phenotypic data available for autism contains varying levels of noise and incompleteness that complicate analysis. Here we analyze behavioral data from 16,291 subjects using 250 items from three gold standard diagnostic instruments. We apply a low-rank model to impute missing entries and entire missing instruments with high fidelity, showing that we can complete clinical records for all subjects. Finally, we analyze the low-rank representation of our subjects to identify plausible subtypes of autism, setting the stage for genome-to-phenome prediction experiments. These procedures can be adapted and used with other similarly structured clinical records to enable a more complete mapping between genome and phenome.

    View details for PubMedID 29888068

  • Sparsifying machine learning models identify stable subsets of predictive features for behavioral detection of autism MOLECULAR AUTISM Levy, S., Duda, M., Haber, N., Wall, D. P. 2017; 8: 65

    Abstract

    Autism spectrum disorder (ASD) diagnosis can be delayed due in part to the time required for administration of standard exams, such as the Autism Diagnostic Observation Schedule (ADOS). Shorter and potentially mobilized approaches would help to alleviate bottlenecks in the healthcare system. Previous work using machine learning suggested that a subset of the behaviors measured by ADOS can achieve clinically acceptable levels of accuracy. Here we expand on this initial work to build sparse models that have higher potential to generalize to the clinical population. We assembled a collection of score sheets for two ADOS modules, one for children with phrased speech (Module 2; 1319 ASD cases, 70 controls) and the other for children with verbal fluency (Module 3; 2870 ASD cases, 273 controls). We used sparsity/parsimony enforcing regularization techniques in a nested cross validation grid search to select features for 17 unique supervised learning models, encoding missing values as additional indicator features. We augmented our feature sets with gender and age to train minimal and interpretable classifiers capable of robust detection of ASD from non-ASD. By applying 17 unique supervised learning methods across 5 classification families tuned for sparse use of features and to be within 1 standard error of the optimal model, we find reduced sets of 10 and 5 features used in a majority of models. We tested the performance of the most interpretable of these sparse models, including Logistic Regression with L2 regularization or Linear SVM with L1 regularization. We obtained an area under the ROC curve of 0.95 for ADOS Module 3 and 0.93 for ADOS Module 2 with less than or equal to 10 features. The resulting models provide improved stability over previous machine learning efforts to minimize the time complexity of autism detection due to regularization and a small parameter space. These robustness techniques yield classifiers that are sparse, interpretable and that have potential to generalize to alternative modes of autism screening, diagnosis and monitoring, possibly including analysis of short home videos.

    View details for DOI 10.1186/s13229-017-0180-6

    View details for Web of Science ID 000418669700001

    View details for PubMedID 29270283

    View details for PubMedCentralID PMC5735531

  • The GapMap project: a mobile surveillance system to map diagnosed autism cases and gaps in autism services globally MOLECULAR AUTISM Daniels, J., Schwartz, J., Albert, N., Du, M., Wall, D. P. 2017; 8: 55

    Abstract

    Although the number of autism diagnoses is on the rise, we have no evidence-based tracking of size and severity of gaps in access to autism-related resources, nor do we have methods to geographically triangulate the locations of the widest gaps in either the US or elsewhere across the globe. To combat these related issues of (1) mapping diagnosed cases of autism and (2) quantifying gaps in access to key intervention services, we have constructed a crowd-based mobile platform called "GapMap" (http://gapmap.stanford.edu) for real-time tracking of autism prevalence and autism-related resources that can be accessed from any mobile device with cellular or wireless connectivity. Now in beta, our aim is for this Android/iOS compatible mobile tool to simultaneously crowd-enroll the massive and growing community of families with autism to capture geographic, diagnostic, and resource usage information while automatically computing prevalence at granular geographical scales to yield a more complete and dynamic understanding of autism resource epidemiology.

    View details for DOI 10.1186/s13229-017-0163-7

    View details for Web of Science ID 000413375000001

    View details for PubMedID 29075431

    View details for PubMedCentralID PMC5651585

  • Human Genome Sequencing at the Population Scale: A Primer on High-Throughput DNA Sequencing and Analysis AMERICAN JOURNAL OF EPIDEMIOLOGY Goldfeder, R. L., Wall, D. P., Khoury, M. J., Ioannidis, J. A., Ashley, E. A. 2017; 186 (8): 1000–1009

    Abstract

    Most human diseases have underlying genetic causes. To better understand the impact of genes on disease and its implications for medicine and public health, researchers have pursued methods for determining the sequences of individual genes, then all genes, and now complete human genomes. Massively parallel high-throughput sequencing technology, where DNA is sheared into smaller pieces, sequenced, and then computationally reordered and analyzed, enables fast and affordable sequencing of full human genomes. As the price of sequencing continues to decline, more and more individuals are having their genomes sequenced. This may facilitate better population-level disease subtyping and characterization, as well as individual-level diagnosis and personalized treatment and prevention plans. In this review, we describe several massively parallel high-throughput DNA sequencing technologies and their associated strengths, limitations, and error modes, with a focus on applications in epidemiologic research and precision medicine. We detail the methods used to computationally process and interpret sequence data to inform medical or preventative action.

    View details for DOI 10.1093/aje/kww224

    View details for Web of Science ID 000412798300013

    View details for PubMedID 29040395

  • ONE IN THREE DE NOVO VARIANTS SEEN IN AUTISM SPECTRUM DISORDER PROBANDS ARE PRESENT AS STANDING VARIATION IN A COHORT OF MORE THAN 60,000 NON-ASD INDIVIDUALS Kosmicki, J., Samocha, K., Lek, M., MacArthur, D., Wall, D., Robinson, E., Daly, M. ELSEVIER SCIENCE BV. 2017: S280–S281
  • Crowdsourced validation of a machine-learning classification system for autism and ADHD. Translational psychiatry Duda, M., Haber, N., Daniels, J., Wall, D. P. 2017; 7 (5)

    Abstract

    Autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD) together affect >10% of the children in the United States, but considerable behavioral overlaps between the two disorders can often complicate differential diagnosis. Currently, there is no screening test designed to differentiate between the two disorders, and with waiting times from initial suspicion to diagnosis upwards of a year, methods to quickly and accurately assess risk for these and other developmental disorders are desperately needed. In a previous study, we found that four machine-learning algorithms were able to accurately (area under the curve (AUC)>0.96) distinguish ASD from ADHD using only a small subset of items from the Social Responsiveness Scale (SRS). Here, we expand upon our prior work by including a novel crowdsourced data set of responses to our predefined top 15 SRS-derived questions from parents of children with ASD (n=248) or ADHD (n=174) to improve our model's capability to generalize to new, 'real-world' data. By mixing these novel survey data with our initial archival sample (n=3417) and performing repeated cross-validation with subsampling, we created a classification algorithm that performs with AUC=0.89±0.01 using only 15 questions.

    View details for DOI 10.1038/tp.2017.86

    View details for PubMedID 28509905

  • Cross-disorder comparative analysis of comorbid conditions reveals novel autism candidate genes BMC GENOMICS Diaz-Beltran, L., Esteban, F. J., Varma, M., Ortuzk, A., David, M., Wall, D. P. 2017; 18

    Abstract

    Numerous studies have highlighted the elevated degree of comorbidity associated with autism spectrum disorder (ASD). These comorbid conditions may add further impairments to individuals with autism and are substantially more prevalent compared to neurotypical populations. These high rates of comorbidity are not surprising taking into account the overlap of symptoms that ASD shares with other pathologies. From a research perspective, this suggests common molecular mechanisms involved in these conditions. Therefore, identifying crucial genes in the overlap between ASD and these comorbid disorders may help unravel the common biological processes involved and, ultimately, shed some light in the understanding of autism etiology. In this work, we used a two-fold systems biology approach specially focused on biological processes and gene networks to conduct a comparative analysis of autism with 31 frequently comorbid disorders in order to define a multi-disorder subcomponent of ASD and predict new genes of potential relevance to ASD etiology. We validated our predictions by determining the significance of our candidate genes in high throughput transcriptome expression profiling studies. Using prior knowledge of disease-related biological processes and the interaction networks of the disorders related to autism, we identified a set of 19 genes not previously linked to ASD that were significantly differentially regulated in individuals with autism. In addition, these genes were of potential etiologic relevance to autism, given their enriched roles in neurological processes crucial for optimal brain development and function, learning and memory, cognition and social behavior. Taken together, our approach represents a novel perspective of autism from the point of view of related comorbid disorders and proposes a model by which prior knowledge of interaction networks may enlighten and focus the genome-wide search for autism candidate genes to better define the genetic heterogeneity of ASD.

    View details for DOI 10.1186/s12864-017-3667-9

    View details for Web of Science ID 000400623900008

    View details for PubMedID 28427329

  • Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples. Nature genetics Kosmicki, J. A., Samocha, K. E., Howrigan, D. P., Sanders, S. J., Slowikowski, K., Lek, M., Karczewski, K. J., Cutler, D. J., Devlin, B., Roeder, K., Buxbaum, J. D., Neale, B. M., MacArthur, D. G., Wall, D. P., Robinson, E. B., Daly, M. J. 2017

    Abstract

    Recent research has uncovered an important role for de novo variation in neurodevelopmental disorders. Using aggregated data from 9,246 families with autism spectrum disorder, intellectual disability, or developmental delay, we found that ∼1/3 of de novo variants are independently present as standing variation in the Exome Aggregation Consortium's cohort of 60,706 adults, and these de novo variants do not contribute to neurodevelopmental risk. We further used a loss-of-function (LoF)-intolerance metric, pLI, to identify a subset of LoF-intolerant genes containing the observed signal of associated de novo protein-truncating variants (PTVs) in neurodevelopmental disorders. LoF-intolerant genes also carry a modest excess of inherited PTVs, although the strongest de novo-affected genes contribute little to this excess, thus suggesting that the excess of inherited risk resides in lower-penetrant genes. These findings illustrate the importance of population-based reference cohorts for the interpretation of candidate pathogenic variants, even for analyses of complex diseases and de novo variation.

    View details for DOI 10.1038/ng.3789

    View details for PubMedID 28191890

  • MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC bioinformatics Elshazly, H., Souilmi, Y., Tonellato, P. J., Wall, D. P., Abouelhoda, M. 2017; 18 (1): 49-?

    Abstract

    Next Generation Genome sequencing techniques became affordable for massive sequencing efforts devoted to clinical characterization of human diseases. However, the cost of providing cloud-based data analysis of the mounting datasets remains a concerning bottleneck for providing cost-effective clinical services. To address this computational problem, it is important to optimize the variant analysis workflow and the analysis tools used to reduce the overall computational processing time, and concomitantly reduce the processing cost. Furthermore, it is important to capitalize on recent developments in the cloud computing market, which has witnessed more providers competing in terms of products and prices. In this paper, we present a new package called MC-GenomeKey (Multi-Cloud GenomeKey) that efficiently executes the variant analysis workflow for detecting and annotating mutations using cloud resources from different commercial cloud providers. Our package supports Amazon, Google, and Azure clouds, as well as any other cloud platform based on OpenStack. Our package allows different scenarios of execution with different levels of sophistication, up to the one where a workflow can be executed using a cluster whose nodes come from different clouds. MC-GenomeKey also supports scenarios to exploit the spot instance model of Amazon in combination with the use of other cloud platforms to provide significant cost reduction. To the best of our knowledge, this is the first solution that optimizes the execution of the workflow using computational resources from different cloud providers. MC-GenomeKey provides an efficient multicloud based solution to detect and annotate mutations. The package can run in different commercial cloud platforms, which enables the user to seize the best offers. The package also provides a reliable means to make use of the low-cost spot instance model of Amazon, as it provides an efficient solution to the sudden termination of spot machines as a result of a sudden price increase. The package has a web-interface and it is available for free for academic use.

    View details for DOI 10.1186/s12859-016-1454-2

    View details for PubMedID 28107819

    View details for PubMedCentralID PMC5248509

  • Machine learning for early detection of autism (and other conditions) using a parental questionnaire and home video screening Abbas, H., Garberson, F., Glover, E., Wall, D. P., Nie, J. Y., Obradovic, Z., Suzumura, T., Ghosh, R., Nambiar, R., Wang, C., Zang, H., BaezaYates, R., Hu, Kepner, J., Cuzzocrea, A., Tang, J., Toyoda, M. IEEE. 2017: 3558–61
  • Can we accelerate autism discoveries through crowdsourcing? RESEARCH IN AUTISM SPECTRUM DISORDERS David, M. M., Babineau, B. A., Wall, D. P. 2016; 32: 80-83
  • Comorbid Analysis of Genes Associated with Autism Spectrum Disorders Reveals Differential Evolutionary Constraints PLOS ONE David, M. M., Enard, D., Ozturk, A., Daniels, J., Jung, J., Diaz-Beltran, L., Wall, D. P. 2016; 11 (7)

    Abstract

    The burden of comorbidity in Autism Spectrum Disorder (ASD) is substantial. The symptoms of autism overlap with many other human conditions, reflecting common molecular pathologies suggesting that cross-disorder analysis will help prioritize autism gene candidates. Genes in the intersection between autism and related conditions may represent nonspecific indicators of dysregulation while genes unique to autism may play a more causal role. Thorough literature review allowed us to extract 125 ICD-9 codes comorbid to ASD that we mapped to 30 specific human disorders. In the present work, we performed an automated extraction of genes associated with ASD and its comorbid disorders, and found 1031 genes involved in ASD, among which 262 are involved in ASD only, with the remaining 779 involved in ASD and at least one comorbid disorder. A pathway analysis revealed 13 pathways not involved in any other comorbid disorders and therefore unique to ASD, all associated with basal cellular functions. These pathways differ from the pathways associated with both ASD and its comorbid conditions, with the latter being more specific to neural function. To determine whether the sequences of these genes have been subjected to differential evolutionary constraints, we studied long-term constraints by looking into Genomic Evolutionary Rate Profiling, and showed that genes involved in several comorbid disorders seem to have undergone more purifying selection than the genes involved in ASD only. This result was corroborated by a higher dN/dS ratio for genes unique to ASD as compared to those that are shared between ASD and its comorbid disorders. Short-term evolutionary constraints showed the same trend, as the pN/pS ratio indicates that genes unique to ASD were under significantly less evolutionary constraint than the genes associated with all other disorders.

    View details for DOI 10.1371/journal.pone.0157937

    View details for Web of Science ID 000379579500015

    View details for PubMedID 27414027

    View details for PubMedCentralID PMC4945013

  • Clinical Evaluation of a Novel and Mobile Autism Risk Assessment JOURNAL OF AUTISM AND DEVELOPMENTAL DISORDERS Duda, M., Daniels, J., Wall, D. P. 2016; 46 (6): 1953-1961

    Abstract

    The Mobile Autism Risk Assessment (MARA) is a new, electronically administered, 7-question autism spectrum disorder (ASD) screen to triage those at highest risk for ASD. Children 16 months-17 years (N = 222) were screened during their first visit in a developmental-behavioral pediatric clinic. MARA scores were compared to diagnosis from the clinical encounter. Participant median age was 5.8 years, 76.1% were male, and most participants had an intelligence/developmental quotient score >85; 69 of the participants (31%) received a clinical diagnosis of ASD. The sensitivity of the MARA in detecting ASD was 89.9% [95% CI = 82.7-97]; the specificity was 79.7% [95% CI = 73.4-86.1]. In a high-risk clinical setting, the MARA shows promise as a screen to distinguish ASD from other developmental/behavioral disorders.

    View details for DOI 10.1007/s10803-016-2718-4

    View details for Web of Science ID 000376100200007

    View details for PubMedID 26873142

    View details for PubMedCentralID PMC4860199
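
The sensitivity and specificity figures in the MARA abstract above can be reproduced with a short calculation. This is a sketch under stated assumptions, not the authors' analysis: the abstract gives only N = 222 and 69 ASD diagnoses, so the counts of 62 true positives and 122 true negatives are inferred here from the reported percentages, and the normal-approximation (Wald) interval is an assumed method that happens to agree with the reported bounds.

```python
import math


def proportion_with_wald_ci(successes, total, z=1.96):
    """Return (proportion, lower, upper) using a normal-approximation (Wald) interval."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# From the abstract: 222 children screened, 69 with a clinical ASD diagnosis.
n_total = 222
n_asd = 69
n_non_asd = n_total - n_asd  # 153

# Assumed counts (not stated in the abstract), inferred from the reported
# 89.9% sensitivity and 79.7% specificity.
true_positives = 62
true_negatives = 122

sensitivity = proportion_with_wald_ci(true_positives, n_asd)
specificity = proportion_with_wald_ci(true_negatives, n_non_asd)

for name, (p, lower, upper) in (("sensitivity", sensitivity), ("specificity", specificity)):
    print(f"{name}: {100 * p:.1f}% [95% CI = {100 * lower:.1f}-{100 * upper:.1f}]")
```

With these assumed counts the intervals round to the values quoted in the abstract. A Wilson or exact interval would give slightly different bounds, which is one reason to treat the Wald choice as a guess about the original method.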

  • Automated integration of continuous glucose monitor data in the electronic health record using consumer technology. Journal of the American Medical Informatics Association Kumar, R. B., Goren, N. D., Stark, D. E., Wall, D. P., Longhurst, C. A. 2016; 23 (3): 532-537

    Abstract

    The diabetes healthcare provider plays a key role in interpreting blood glucose trends, but few institutions have successfully integrated patient home glucose data in the electronic health record (EHR). Published implementations to date have required custom interfaces, which limit wide-scale replication. We piloted automated integration of continuous glucose monitor data in the EHR using widely available consumer technology for 10 pediatric patients with insulin-dependent diabetes. Establishment of a passive data communication bridge via a patient's/parent's smartphone enabled automated integration and analytics of patient device data within the EHR between scheduled clinic visits. It is feasible to utilize available consumer technology to assess and triage home diabetes device data within the EHR, and to engage patients/parents and improve healthcare provider workflow.

    View details for DOI 10.1093/jamia/ocv206

    View details for PubMedID 27018263

  • Characterisation of agricultural drainage ditch sediments along the phosphorus transfer continuum in two contrasting headwater catchments JOURNAL OF SOILS AND SEDIMENTS Shore, M., Jordan, P., Mellander, P., Kelly-Quinn, M., Daly, K., Sims, J. T., Wall, D. P., Melland, A. R. 2016; 16 (5): 1643-1654
  • A research roadmap for next-generation sequencing informatics SCIENCE TRANSLATIONAL MEDICINE Altman, R. B., Prabhu, S., Sidow, A., Zook, J. M., Goldfeder, R., Litwack, D., Ashley, E., Asimenos, G., Bustamante, C. D., Donigan, K., Giacomini, K. M., Johansen, E., Khuri, N., Lee, E., Liang, X. S., Salit, M., Serang, O., Tezak, Z., Wall, D. P., Mansfield, E., Kass-Hout, T. 2016; 8 (335)

    Abstract

    Next-generation sequencing technologies are fueling a wave of new diagnostic tests. Progress on a key set of nine research challenge areas will help generate the knowledge required to advance effectively these diagnostics to the clinic.

    View details for DOI 10.1126/scitranslmed.aaf7314

    View details for Web of Science ID 000374412300003

    View details for PubMedID 27099173

  • A Complex Systems Approach to Causal Discovery in Psychiatry PLOS ONE Saxe, G. N., Statnikov, A., Fenyo, D., Ren, J., Li, Z., Prasad, M., Wall, D., Bergman, N., Briggs, E. C., Aliferis, C. 2016; 11 (3)

    Abstract

    Conventional research methodologies and data analytic approaches in psychiatric research are unable to reliably infer causal relations without experimental designs, or to make inferences about the functional properties of the complex systems in which psychiatric disorders are embedded. This article describes a series of studies to validate a novel hybrid computational approach, the Complex Systems-Causal Network (CS-CN) method, designed to integrate causal discovery within a complex systems framework for psychiatric research. The CS-CN method was first applied to an existing dataset on psychopathology in 163 children hospitalized with injuries (validation study). Next, it was applied to a much larger dataset of traumatized children (replication study). Finally, the CS-CN method was applied in a controlled experiment using a 'gold standard' dataset for causal discovery and compared with other methods for accurately detecting causal variables (resimulation controlled experiment). The CS-CN method successfully detected a causal network of 111 variables and 167 bivariate relations in the initial validation study. This causal network had well-defined adaptive properties and a set of variables was found that disproportionally contributed to these properties. Modeling the removal of these variables resulted in significant loss of adaptive properties. The CS-CN method was successfully applied in the replication study and performed better than traditional statistical methods, and similarly to state-of-the-art causal discovery algorithms in the causal detection experiment. The CS-CN method was validated, replicated, and yielded both novel and previously validated findings related to risk factors and potential treatments of psychiatric disorders. The novel approach yields both fine-grain (micro) and high-level (macro) insights and thus represents a promising approach for complex systems-oriented research in psychiatry.

    View details for DOI 10.1371/journal.pone.0151174

    View details for Web of Science ID 000373116500019

    View details for PubMedID 27028297

    View details for PubMedCentralID PMC4814084

  • A common molecular signature in ASD gene expression: following Root 66 to autism TRANSLATIONAL PSYCHIATRY Diaz-Beltran, L., Esteban, F. J., Wall, D. P. 2016; 6

    Abstract

    Several gene expression experiments on autism spectrum disorders have been conducted using both blood and brain tissue. Individually, these studies have advanced our understanding of the molecular systems involved in the molecular pathology of autism and have formed the bases of ongoing work to build autism biomarkers. In this study, we conducted an integrated systems biology analysis of 9 independent gene expression experiments covering 657 autism, 9 mental retardation and developmental delay and 566 control samples to determine if a common signature exists and to test whether regulatory patterns in the brain relevant to autism can also be detected in blood. We constructed a matrix of differentially expressed genes from these experiments and used a Jaccard coefficient to create a gene-based phylogeny, validated by bootstrap. As expected, experiments and tissue types clustered together with high statistical confidence. However, we discovered a statistically significant subgrouping of 3 blood and 2 brain data sets from 3 different experiments rooted by a highly correlated regulatory pattern of 66 genes. This Root 66 appeared to be non-random and of potential etiologic relevance to autism, given their enriched roles in neurological processes key for normal brain growth and function, learning and memory, neurodegeneration, social behavior and cognition. Our results suggest that there is a detectable autism signature in the blood that may be a molecular echo of autism-related dysregulation in the brain.

    View details for DOI 10.1038/tp.2015.112

    View details for Web of Science ID 000368549500005

    View details for PubMedID 26731442
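
    The abstract above compares experiments by applying a Jaccard coefficient to their sets of differentially expressed genes. As a minimal sketch of that coefficient only (not the authors' pipeline), the following assumes each experiment is reduced to a plain set of gene identifiers; the function name and the gene labels are hypothetical placeholders, not data from the study.

```python
def jaccard(genes_a, genes_b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        # Two empty sets share nothing; treating this as 0.0 is a convention chosen here.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical placeholder gene lists (illustrative only, not from the paper).
blood_experiment = {"GENE_A", "GENE_B", "GENE_C"}
brain_experiment = {"GENE_B", "GENE_C", "GENE_D"}

similarity = jaccard(blood_experiment, brain_experiment)  # 2 shared / 4 total
distance = 1.0 - similarity  # a distance of this kind could feed a clustering step
```

    A pairwise matrix of such distances is the sort of input a tree-building or clustering method would take; the bootstrap validation mentioned in the abstract is not sketched here.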

  • The Quantified Brain: A Framework for Mobile Device-Based Assessment of Behavior and Neurological Function. Applied clinical informatics Stark, D. E., Kumar, R. B., Longhurst, C. A., Wall, D. P. 2016; 7 (2): 290–98

    View details for DOI 10.4338/ACI-2015-12-LE-0176

    View details for PubMedID 27437041

    View details for PubMedCentralID PMC4941840

  • A Practical Approach to Real-Time Neutral Feature Subtraction for Facial Expression Recognition Haber, N., Voss, C., Fazel, A., Winograd, T., Wall, D. P. IEEE. 2016
  • De Novo Mutations in Autism Implicate the Synaptic Elimination Network. Pacific Symposium on Biocomputing Ram Venkataraman, G., O'Connell, C., Egawa, F., Kashef-Haghighi, D., Wall, D. P. 2016; 22: 521-532

    Abstract

    Autism has been shown to have a major genetic risk component; the architecture of documented autism in families has been over and again shown to be passed down for generations. While inherited risk plays an important role in the autistic nature of children, de novo (germline) mutations have also been implicated in autism risk. Here we find that autism de novo variants verified and published in the literature are Bonferroni-significantly enriched in a gene set implicated in synaptic elimination. Additionally, several of the genes in this synaptic elimination set that were enriched in protein-protein interactions (CACNA1C, SHANK2, SYNGAP1, NLGN3, NRXN1, and PTEN) have been previously confirmed as genes that confer risk for the disorder. The results demonstrate that autism-associated de novos are linked to proper synaptic pruning and density, hinting at the etiology of autism and suggesting pathophysiology for downstream correction and treatment.

    View details for PubMedID 27897003

  • Use of machine learning for behavioral distinction of autism and ADHD. Translational psychiatry Duda, M., Ma, R., Haber, N., Wall, D. P. 2016; 6

    Abstract

    Although autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD) continue to rise in prevalence, together affecting >10% of today's pediatric population, the methods of diagnosis remain subjective, cumbersome and time intensive. With gaps upward of a year between initial suspicion and diagnosis, valuable time where treatments and behavioral interventions could be applied is lost as these disorders remain undetected. Methods to quickly and accurately assess risk for these, and other, developmental disorders are necessary to streamline the process of diagnosis and provide families access to much-needed therapies sooner. Using forward feature selection, as well as undersampling and 10-fold cross-validation, we trained and tested six machine learning models on complete 65-item Social Responsiveness Scale score sheets from 2925 individuals with either ASD (n=2775) or ADHD (n=150). We found that five of the 65 behaviors measured by this screening tool were sufficient to distinguish ASD from ADHD with high accuracy (area under the curve=0.965). These results support the hypotheses that (1) machine learning can be used to discern between autism and ADHD with high accuracy and (2) this distinction can be made using a small number of commonly measured behaviors. Our findings show promise for use as an electronically administered, caregiver-directed resource for preliminary risk evaluation and/or pre-clinical screening and triage that could help to speed the diagnosis of these disorders.

    View details for DOI 10.1038/tp.2015.221

    View details for PubMedID 26859815

    View details for PubMedCentralID PMC4872425

  • Identification of Human Neuronal Protein Complexes Reveals Biochemical Activities and Convergent Mechanisms of Action in Autism Spectrum Disorders CELL SYSTEMS Li, J., Ma, Z., Shi, M., Malty, R. H., Aoki, H., Minic, Z., Phanse, S., Jin, K., Wall, D. P., Zhang, Z., Urban, A. E., Hallmayer, J., Babu, M., Snyder, M. 2015; 1 (5): 361-374

    Abstract

    The prevalence of autism spectrum disorders (ASDs) is rapidly growing, yet its molecular basis is poorly understood. We used a systems approach in which ASD candidate genes were mapped onto the ubiquitous human protein complexes and the resulting complexes were characterized. The studies revealed the role of histone deacetylases (HDAC1/2) in regulating the expression of ASD orthologs in the embryonic mouse brain. Proteome-wide screens for the co-complexed subunits with HDAC1 and six other key ASD proteins in neuronal cells revealed a protein interaction network, which displayed preferential expression in fetal brain development, exhibited increased deleterious mutations in ASD cases, and were strongly regulated by FMRP and MECP2 causal for Fragile X and Rett syndromes, respectively. Overall, our study reveals molecular components in ASD, suggests a shared mechanism between the syndromic and idiopathic forms of ASDs, and provides a systems framework for analyzing complex human diseases.

    View details for DOI 10.1016/j.cels.2015.11.002

    View details for Web of Science ID 000209926300009

    View details for PubMedID 26949739

    View details for PubMedCentralID PMC4776331

  • Scalable and cost-effective NGS genotyping in the cloud BMC MEDICAL GENOMICS Souilmi, Y., Lancaster, A. K., Jung, J., Rizzo, E., Hawkins, J. B., Powles, R., Amzazi, S., Ghazal, H., Tonellato, P. J., Wall, D. P. 2015; 8

    Abstract

    While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at scales of economy in the 10's of dollars. We take a step towards addressing this challenge, by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows making optimal use of high-performance compute clusters. Here we show that the Amazon Web Service (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Our systematic benchmarking reveals important new insights and considerations to produce clinical turn-around of whole genome analysis optimization and workflow management including strategic batching of individual genomes and efficient cluster resource configuration.

    View details for DOI 10.1186/s12920-015-0134-9

    View details for Web of Science ID 000362868300001

    View details for PubMedID 26470712

    View details for PubMedCentralID PMC4608296

  • A transgenic resource for conditional competitive inhibition of conserved Drosophila microRNAs NATURE COMMUNICATIONS Fulga, T. A., McNeill, E. M., Binari, R., Yelick, J., Blanche, A., Booker, M., Steinkraus, B. R., Schnall-Levin, M., Zhao, Y., Deluca, T., Bejarano, F., Han, Z., Lai, E. C., Wall, D. P., Perrimon, N., Van Vactor, D. 2015; 6

    Abstract

    Although the impact of microRNAs (miRNAs) in development and disease is well established, understanding the function of individual miRNAs remains challenging. Development of competitive inhibitor molecules such as miRNA sponges has allowed the community to address individual miRNA function in vivo. However, the application of these loss-of-function strategies has been limited. Here we offer a comprehensive library of 141 conditional miRNA sponges targeting well-conserved miRNAs in Drosophila. Ubiquitous miRNA sponge delivery and consequent systemic miRNA inhibition uncovers a relatively small number of miRNA families underlying viability and gross morphogenesis, with false discovery rates in the 4-8% range. In contrast, tissue-specific silencing of muscle-enriched miRNAs reveals a surprisingly large number of novel miRNA contributions to the maintenance of adult indirect flight muscle structure and function. A strong correlation between miRNA abundance and physiological relevance is not observed, underscoring the importance of unbiased screens when assessing the contributions of miRNAs to complex biological processes.

    View details for DOI 10.1038/ncomms8279

    View details for Web of Science ID 000357170800006

    View details for PubMedID 26081261

    View details for PubMedCentralID PMC4471878

  • Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning TRANSLATIONAL PSYCHIATRY Kosmicki, J. A., Sochat, V., Duda, M., Wall, D. P. 2015; 5

    Abstract

    Although the prevalence of autism spectrum disorder (ASD) has risen sharply in the last few years reaching 1 in 68, the average age of diagnosis in the United States remains close to 4, well past the developmental window when early intervention has the largest gains. This emphasizes the importance of developing accurate methods to detect risk faster than the current standards of care. In the present study, we used machine learning to evaluate one of the best and most widely used instruments for clinical assessment of ASD, the Autism Diagnostic Observation Schedule (ADOS), to test whether only a subset of behaviors can differentiate between children on and off the autism spectrum. ADOS relies on behavioral observation in a clinical setting and consists of four modules, with module 2 reserved for individuals with some vocabulary and module 3 for higher levels of cognitive functioning. We ran eight machine learning algorithms using stepwise backward feature selection on score sheets from modules 2 and 3 from 4540 individuals. We found that 9 of the 28 behaviors captured by items from module 2, and 12 of the 28 behaviors captured by module 3 are sufficient to detect ASD risk with 98.27% and 97.66% accuracy, respectively. A greater than 55% reduction in the number of behaviors with negligible loss of accuracy across both modules suggests a role for computational and statistical methods to streamline ASD risk detection and screening. These results may help enable development of mobile and parent-directed methods for preliminary risk evaluation and/or clinical triage that reach a larger percentage of the population and help to lower the average age of detection and diagnosis.

    View details for DOI 10.1038/tp.2015.7

    View details for Web of Science ID 000367652200002

  • COSMOS: cloud enabled NGS analysis Souilmi, Y., Jung, J., Lancaster, A., Gafni, E., Amzazi, S., Ghazal, H., Wall, D., Tonellato, P. BIOMED CENTRAL LTD. 2015
  • Rising interdisciplinary collaborations refine our understanding of autisms and give hope to more personalized solutions PERSONALIZED MEDICINE Duda, M., Wall, D. P. 2015; 12 (4): 359-369

    View details for DOI 10.2217/PME.15.8

    View details for Web of Science ID 000358945300006

  • Translational Meta-analytical Methods to Localize the Regulatory Patterns of Neurological Disorders in the Human Brain. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Sochat, V., David, M., Wall, D. P. 2015; 2015: 2073-2082

    Abstract

    The task of mapping neurological disorders in the human brain must be informed by multiple measurements of an individual's phenotype - neuroimaging, genomics, and behavior. We developed a novel meta-analytical approach to integrate disparate resources and generated transcriptional maps of neurological disorders in the human brain yielding a purely computational procedure to pinpoint the brain location of transcribed genes likely to be involved in either onset or maintenance of the neurological condition.

    View details for PubMedID 26958307

    View details for PubMedCentralID PMC4765688

  • Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Translational psychiatry Duda, M., Kosmicki, J. A., Wall, D. P. 2015; 5

    View details for DOI 10.1038/tp.2015.51

    View details for PubMedID 25918993

  • COSMOS: Python library for massively parallel workflows BIOINFORMATICS Gafni, E., Luquette, L. J., Lancaster, A. K., Hawkins, J. B., Jung, J., Souilmi, Y., Wall, D. P., Tonellato, P. J. 2014; 30 (20): 2956-2958

    Abstract

    Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services. Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu. Contact: dpwall@stanford.edu or peter_tonellato@hms.harvard.edu. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btu385

    View details for Web of Science ID 000343083600015

    View details for PubMedID 24982428

    View details for PubMedCentralID PMC4184253

  • A framework for the interpretation of de novo mutation in human disease NATURE GENETICS Samocha, K. E., Robinson, E. B., Sanders, S. J., Stevens, C., Sabo, A., McGrath, L. M., Kosmicki, J. A., Rehnstrom, K., Mallick, S., Kirby, A., Wall, D. P., MacArthur, D. G., Gabriel, S. B., DePristo, M., Purcell, S. M., Palotie, A., Boerwinkle, E., Buxbaum, J. D., Cook, E. H., Gibbs, R. A., Schellenberg, G. D., Sutcliffe, J. S., Devlin, B., Roeder, K., Neale, B. M., Daly, M. J. 2014; 46 (9): 944-?

    Abstract

    Spontaneously arising (de novo) mutations have an important role in medical genetics. For diseases with extensive locus heterogeneity, such as autism spectrum disorders (ASDs), the signal from de novo mutations is distributed across many genes, making it difficult to distinguish disease-relevant mutations from background variation. Here we provide a statistical framework for the analysis of excesses in de novo mutation per gene and gene set by calibrating a model of de novo mutation. We applied this framework to de novo mutations collected from 1,078 ASD family trios, and, whereas we affirmed a significant role for loss-of-function mutations, we found no excess of de novo loss-of-function mutations in cases with IQ above 100, suggesting that the role of de novo mutations in ASDs might reside in fundamental neurodevelopmental processes. We also used our model to identify ∼1,000 genes that are significantly lacking in functional coding variation in non-ASD samples and are enriched for de novo loss-of-function mutations identified in ASD cases.

    View details for DOI 10.1038/ng.3050

    View details for Web of Science ID 000341579400007

    View details for PubMedID 25086666

  • Evaluating the critical source area concept of phosphorus loss from soils to water-bodies in agricultural catchments. The Science of the total environment Shore, M., Jordan, P., Mellander, P., Kelly-Quinn, M., Wall, D. P., Murphy, P. N., Melland, A. R. 2014; 490: 405-415

    Abstract

    Using data collected from six basins located across two hydrologically contrasting agricultural catchments, this study investigated whether transport metrics alone provide better estimates of storm phosphorus (P) loss from basins than critical source area (CSA) metrics which combine source factors as well. Concentrations and loads of P in quickflow (QF) were measured at basin outlets during four storm events and were compared with dynamic (QF magnitude) and static (extent of highly-connected, poorly-drained soils) transport metrics and a CSA metric (extent of highly-connected, poorly-drained soils with excess plant-available P). Pairwise comparisons between basins with similar CSA risks but contrasting QF magnitudes showed that QF flow-weighted mean TRP (total molybdate-reactive P) concentrations and loads were frequently (at least 11 of 14 comparisons) more than 40% higher in basins with the highest QF magnitudes. Furthermore, static transport metrics reliably discerned relative QF magnitudes between these basins. However, particulate P (PP) concentrations were often (6 of 14 comparisons) higher in basins with the lowest QF magnitudes, most likely due to soil-management activities (e.g. ploughing), in these predominantly arable basins at these times. Pairwise comparisons between basins with contrasting CSA risks and similar QF magnitudes showed that TRP and PP concentrations and loads did not reflect trends in CSA risk or QF magnitude. Static transport metrics did not discern relative QF magnitudes between these basins. In basins with contrasting transport risks, storm TRP concentrations and loads were well differentiated by dynamic or static transport metrics alone, regardless of differences in soil P. In basins with similar transport risks, dynamic transport metrics and P source information additional to soil P may be required to predict relative storm TRP concentrations and loads. Regardless of differences in transport risk, information on land use and management may be required to predict relative differences in storm PP concentrations between these agricultural basins.

    View details for DOI 10.1016/j.scitotenv.2014.04.122

    View details for PubMedID 24863139

  • A literature search tool for intelligent extraction of disease-associated genes JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Jung, J., DeLuca, T. F., Nelson, T. H., Wall, D. P. 2014; 21 (3): 399-405

    Abstract

    To extract disorder-associated genes from the scientific literature in PubMed with greater sensitivity for literature-based support than existing methods. We developed a PubMed query to retrieve disorder-related, original research articles. Then we applied a rule-based text-mining algorithm with keyword matching to extract target disorders, genes with significant results, and the type of study described by the article. We compared our resulting candidate disorder genes and supporting references with existing databases. We demonstrated that our candidate gene set covers nearly all genes in manually curated databases, and that the references supporting the disorder-gene link are more extensive and accurate than other general purpose gene-to-disorder association databases. We implemented a novel publication search tool to find target articles, specifically focused on links between disorders and genotypes. Through comparison against gold-standard manually updated gene-disorder databases and comparison with automated databases of similar functionality we show that our tool can search through the entirety of PubMed to extract the main gene findings for human diseases rapidly and accurately.

    View details for DOI 10.1136/amiajnl-2012-001563

    View details for Web of Science ID 000334611600003

    View details for PubMedID 23999671

    View details for PubMedCentralID PMC3994846

  • The Potential of Accelerating Early Detection of Autism through Content Analysis of YouTube Videos. PloS one Fusaro, V. A., Daniels, J., Duda, M., DeLuca, T. F., D'Angelo, O., Tamburello, J., Maniscalco, J., Wall, D. P. 2014; 9 (4)

    Abstract

    Autism is on the rise, with 1 in 88 children receiving a diagnosis in the United States, yet the process for diagnosis remains cumbersome and time consuming. Research has shown that home videos of children can help increase the accuracy of diagnosis. However, the use of videos in the diagnostic process is uncommon. In the present study, we assessed the feasibility of applying a gold-standard diagnostic instrument to brief and unstructured home videos and tested whether video analysis can enable more rapid detection of the core features of autism outside of clinical environments. We collected 100 public videos from YouTube of children ages 1-15 with either a self-reported diagnosis of an ASD (N = 45) or not (N = 55). Four non-clinical raters independently scored all videos using one of the most widely adopted tools for behavioral diagnosis of autism, the Autism Diagnostic Observation Schedule-Generic (ADOS). The classification accuracy was 96.8%, with 94.1% sensitivity and 100% specificity, the inter-rater correlation for the behavioral domains on the ADOS was 0.88, and the diagnoses matched a trained clinician in all but 3 of 22 randomly selected video cases. Despite the diversity of videos and non-clinical raters, our results indicate that it is possible to achieve high classification accuracy, sensitivity, and specificity as well as clinically acceptable inter-rater reliability with nonclinical personnel. Our results also demonstrate the potential for video-based detection of autism in short, unstructured home videos and further suggest that at least a percentage of the effort associated with detection and monitoring of autism may be mobilized and moved outside of traditional clinical environments.

    View details for DOI 10.1371/journal.pone.0093533

    View details for PubMedID 24740236

    View details for PubMedCentralID PMC3989176

  • Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Translational psychiatry Duda, M., Kosmicki, J. A., Wall, D. P. 2014; 4

    Abstract

    Current approaches for diagnosing autism have high diagnostic validity but are time consuming and can contribute to delays in arriving at an official diagnosis. In a pilot study, we used machine learning to derive a classifier that represented a 72% reduction in length from the gold-standard Autism Diagnostic Observation Schedule-Generic (ADOS-G), while retaining >97% statistical accuracy. The pilot study focused on a relatively small sample of children with and without autism. The present study sought to further test the accuracy of the classifier (termed the observation-based classifier (OBC)) on an independent sample of 2616 children scored using ADOS from five data repositories and including both spectrum (n=2333) and non-spectrum (n=283) individuals. We tested OBC outcomes against the outcomes provided by the original and current ADOS algorithms, the best estimate clinical diagnosis, and the comparison score severity metric associated with ADOS-2. The OBC was significantly correlated with the ADOS-G (r=-0.814) and ADOS-2 (r=-0.779) and exhibited >97% sensitivity and >77% specificity in comparison to both ADOS algorithm scores. The correspondence to the best estimate clinical diagnosis was also high (accuracy=96.8%), with sensitivity of 97.1% and specificity of 83.3%. The correlation between the OBC score and the comparison score was significant (r=-0.628), suggesting that the OBC provides both a classification as well as a measure of severity of the phenotype. These results further demonstrate the accuracy of the OBC and suggest that reductions in the process of detecting and monitoring autism are possible.

    View details for DOI 10.1038/tp.2014.65

    View details for PubMedID 25116834

    View details for PubMedCentralID PMC4150240
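
    Several abstracts in this list report classifier accuracy, sensitivity and specificity. As a minimal sketch of how those three figures follow from a confusion matrix, the function below uses the standard definitions; the function name and the counts are hypothetical and are not taken from any of the studies listed here.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard definitions computed from confusion-matrix counts.

    tp: affected cases classified as affected
    fn: affected cases classified as unaffected
    tn: unaffected cases classified as unaffected
    fp: unaffected cases classified as affected
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,     # fraction of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # fraction of affected cases detected
        "specificity": tn / (tn + fp),     # fraction of unaffected cases ruled out
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=40, fp=10)
```

    With a heavily imbalanced sample, accuracy is driven mostly by the larger group, which is one reason sensitivity and specificity are reported alongside it.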

  • Responding to a Diagnosis of Localized Prostate Cancer: Men's Experiences of Normal Distress During the First 3 Postdiagnostic Months CANCER NURSING Wall, D. P., Kristjanson, L. J., Fisher, C., Boldy, D., Kendall, G. E. 2013; 36 (6): E44-E50

    Abstract

    Men experience localized prostate cancer (PCa) as aversive and distressing. Little research has studied the distress men experience as a normal response to PCa, or how they manage this distress during the early stages of the illness. The objective of this study was to explore the experience of men diagnosed with localized PCa during their first postdiagnostic year. This constructivist qualitative study interviewed 8 men between the ages of 44 and 77 years, in their homes, on 2 occasions during the first 3 postdiagnostic months. Individual, in-depth semistructured interviews were used to collect the data. After an initial feeling of shock, the men in this study worked diligently to camouflage their experience of distress through hiding and attenuating their feelings and minimizing the severity of PCa. Men silenced distress because they believed it was expected of them. Maintaining silence allowed men to protect their strong and stoic self-image. This stereotype, of the strong and stoic man, prevented men from expressing their feelings of distress and from seeking support from family and friends and health professionals. It is important for nurses to acknowledge and recognize the normal distress experienced by men as a result of a PCa diagnosis. Hence, nurses must learn to identify the ways in which men avoid expressing their distress and develop early supportive relationships that encourage them to express and subsequently manage it.

    View details for DOI 10.1097/NCC.0b013e3182747bef

    View details for Web of Science ID 000326532000006

    View details for PubMedID 23154517

  • Quantification of Phosphorus Transport from a Karstic Agricultural Watershed to Emerging Spring Water ENVIRONMENTAL SCIENCE & TECHNOLOGY Mellander, P., Jordan, P., Melland, A. R., Murphy, P. N., Wall, D. P., Mechan, S., Meehan, R., Kelly, C., Shine, O., Shortle, G. 2013; 47 (12): 6111-6119

    Abstract

    The degree to which waters in a given watershed will be affected by nutrient export can be defined as that watershed's nutrient vulnerability. This study applied concepts of specific phosphorus (P) vulnerability to develop intrinsic groundwater vulnerability risk assessments in a 32 km² karst watershed (spring zone of contribution) in a relatively intensive agricultural landscape. To explain why emergent spring water was below an ecological impairment threshold, concepts of P attenuation potential were investigated along the nutrient transfer continuum based on soil P buffering, depth to bedrock, and retention within the aquifer. Surface karst features, such as enclosed depressions, were reclassified based on P attenuation potential in soil at the base. New techniques of high temporal resolution monitoring of P loads in the emergent spring made it possible to estimate P transfer pathways and retention within the aquifer and indicated small-medium fissure flows to be the dominant pathway, delivering 52-90% of P loads during storm events. Annual total P delivery to the main emerging spring was 92.7 and 138.4 kg total P (and 52.4 and 91.3 kg as total reactive P) for two monitored years, respectively. A revised groundwater vulnerability assessment was used to produce a specific P vulnerability map that used the soil and hydrogeological P buffering potential of the watershed as key assumptions in moderating P export to the emergent spring. Using this map and soil P data, the definition of critical source areas in karst landscapes was demonstrated.

    View details for DOI 10.1021/es304909y

    View details for Web of Science ID 000320749000007

    View details for PubMedID 23672730

  • Systems biology as a comparative approach to understand complex gene expression in neurological diseases. Behavioral sciences (Basel, Switzerland) Diaz-Beltran, L., Cano, C., Wall, D. P., Esteban, F. J. 2013; 3 (2): 253-272

    Abstract

    Systems biology interdisciplinary approaches have become an essential analytical tool that may yield novel and powerful insights about the nature of human health and disease. Complex disorders are known to be caused by the combination of genetic, environmental, immunological or neurological factors. Thus, to understand such disorders, it becomes necessary to address the study of this complexity from a novel perspective. Here, we present a review of integrative approaches that help to understand the underlying biological processes involved in the etiopathogenesis of neurological diseases, for example, those related to autism and autism spectrum disorders (ASD) endophenotypes. Furthermore, we highlight the role of systems biology in the discovery of new biomarkers or therapeutic targets in complex disorders, a key step in the development of personalized medicine, and we demonstrate the role of systems approaches in the design of classifiers that can shorten the time for behavioral diagnosis of autism.

    View details for DOI 10.3390/bs3020253

    View details for PubMedID 25379238

    View details for PubMedCentralID PMC4217627

  • Haplotype structure enables prioritization of common markers and candidate genes in autism spectrum disorder TRANSLATIONAL PSYCHIATRY Vardarajan, B. N., Eran, A., Jung, J., KUNKEL, L. M., Wall, D. P. 2013; 3

    Abstract

    Autism spectrum disorder (ASD) is a neurodevelopmental condition that results in behavioral, social and communication impairments. ASD has a substantial genetic component, with 88-95% trait concordance among monozygotic twins. Efforts to elucidate the causes of ASD have uncovered hundreds of susceptibility loci and candidate genes. However, owing to its polygenic nature and clinical heterogeneity, only a few of these markers represent clear targets for further analyses. In the present study, we used the linkage structure associated with published genetic markers of ASD to simultaneously improve candidate gene detection while providing a means of prioritizing markers of common genetic variation in ASD. We first mined the literature for linkage and association studies of single-nucleotide polymorphisms, copy-number variations and multi-allelic markers in Autism Genetic Resource Exchange (AGRE) families. From markers that reached genome-wide significance, we calculated male-specific genetic distances, in light of the observed strong male bias in ASD. Four of 67 autism-implicated regions, 3p26.1, 3p26.3, 3q25-27 and 5p15, were enriched with differentially expressed genes in blood and brain from individuals with ASD. Of 30 genes differentially expressed across multiple expression data sets, 21 were within 10 cM of an autism-implicated locus. Among them, CNTN4, CADPS2, SUMF1, SLC9A9, NTRK3 have been previously implicated in autism, whereas others have been implicated in neurological disorders comorbid with ASD. This work leverages the rich multimodal genomic information collected on AGRE families to present an efficient integrative strategy for prioritizing autism candidates and improving our understanding of the relationships among the vast collection of past genetic studies.

    View details for DOI 10.1038/tp.2013.38

    View details for Web of Science ID 000321184400008

    View details for PubMedID 23715297

    View details for PubMedCentralID PMC3669925

  • Genomics-Informed Pathology SCIENTIST Wall, D. P., Tonellato, P. J. 2013; 27 (1): 22-23
  • Autworks: a cross-disease analysis application for Autism and related disorders. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science Wall, D. 2013; 2013: 42–43

    View details for PubMedID 24303295

  • Genetic Networks of Complex Disorders: from a Novel Search Engine for PubMed Article Database. AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science Jung, J., Wall, D. P. 2013; 2013: 99-?

    Abstract

    Finding genetic risk factors of complex disorders may involve reviewing hundreds of genes or thousands of research articles iteratively, but few tools have been available to facilitate this procedure. In this work, we built a novel publication search engine that can identify target-disorder specific, genetics-oriented research articles and extract the genes with significant results. Preliminary test results showed that the output of this engine has better coverage in terms of genes or publications, than other existing applications. We consider it as an essential tool for understanding genetic networks of complex disorders.

    View details for PubMedID 24303309

  • Streaming Support for Data Intensive Cloud-Based Sequence Analysis BIOMED RESEARCH INTERNATIONAL Issa, S. A., Kienzler, R., El-Kalioby, M., Tonellato, P. J., Wall, D., Bruggmann, R., Abouelhoda, M. 2013

    Abstract

    Cloud computing provides a promising solution to the genomics data deluge problem resulting from the advent of next-generation sequencing (NGS) technology. Based on the concepts of "resources-on-demand" and "pay-as-you-go", scientists with no or limited infrastructure can have access to scalable and cost-effective computational resources. However, the large size of NGS data causes a significant data transfer latency from the client's site to the cloud, which presents a bottleneck for using cloud computing services. In this paper, we provide a streaming-based scheme to overcome this problem, where the NGS data is processed while being transferred to the cloud. Our scheme targets the wide class of NGS data analysis tasks, where the NGS sequences can be processed independently from one another. We also provide the elastream package that supports the use of this scheme with individual analysis programs or with workflow systems. Experiments presented in this paper show that our solution mitigates the effect of data transfer latency and saves both time and cost of computation.

    View details for DOI 10.1155/2013/791051

    View details for Web of Science ID 000318725500001

    View details for PubMedID 23710461

    View details for PubMedCentralID PMC3655485

  • Personalized cloud-based bioinformatics services for research and education: use cases and the elasticHPC package Asia Pacific Bioinformatics Network (APBioNet) 11th International Conference on Bioinformatics (InCoB) El-Kalioby, M., Abouelhoda, M., Krueger, J., Giegerich, R., Sczyrba, A., Wall, D. P., Tonellato, P. BIOMED CENTRAL LTD. 2012

    Abstract

    Bioinformatics services have been traditionally provided in the form of a web-server that is hosted at institutional infrastructure and serves multiple users. This model, however, is not flexible enough to cope with the increasing number of users, increasing data size, and new requirements in terms of speed and availability of service. The advent of cloud computing suggests a new service model that provides an efficient solution to these problems, based on the concepts of "resources-on-demand" and "pay-as-you-go". However, cloud computing has not yet been introduced within bioinformatics servers due to the lack of usage scenarios and software layers that address the requirements of the bioinformatics domain. In this paper, we provide different use case scenarios for providing cloud computing based services, considering both the technical and financial aspects of the cloud computing service model. These scenarios are for individual users seeking computational power as well as bioinformatics service providers aiming at provision of personalized bioinformatics services to their users. We also present elasticHPC, a software package and a library that facilitates the use of high performance cloud computing resources in general and the implementation of the suggested bioinformatics scenarios in particular. Concrete examples that demonstrate the suggested use case scenarios with whole bioinformatics servers and major sequence analysis tools like BLAST are presented. Experimental results with large datasets are also included to show the advantages of the cloud model. Our use case scenarios and the elasticHPC package are steps towards the provision of cloud based bioinformatics services, which would help in overcoming the data challenge of recent biological research. All resources related to elasticHPC and its web-interface are available at http://www.elasticHPC.org.

    View details for DOI 10.1186/1471-2105-13-S17-S22

    View details for Web of Science ID 000317183600002

    View details for PubMedID 23281941

    View details for PubMedCentralID PMC3521398

  • Autworks: a cross-disease network biology application for Autism and related disorders BMC MEDICAL GENOMICS Nelson, T. H., Jung, J., DeLuca, T. F., Hinebaugh, B. K., St Gabriel, K. C., Wall, D. P. 2012; 5

    Abstract

    The genetic etiology of autism is heterogeneous. Multiple disorders share genotypic and phenotypic traits with autism. Network based cross-disorder analysis can aid in the understanding and characterization of the molecular pathology of autism, but there are few tools that enable us to conduct cross-disorder analysis and to visualize the results. We have designed Autworks as a web portal to bring together gene interaction and gene-disease association data on autism to enable network construction, visualization, and network comparisons with numerous other related neurological conditions and disorders. Users may examine the structure of gene interactions within a set of disorder-associated genes, compare networks of disorder/disease genes with those of other disorders/diseases, and upload their own sets for comparative analysis. Autworks is a web application that provides an easy-to-use resource for researchers of varied backgrounds to analyze the autism gene network structure within and between disorders. http://autworks.hms.harvard.edu/

    View details for DOI 10.1186/1755-8794-5-56

    View details for Web of Science ID 000313043800001

    View details for PubMedID 23190929

    View details for PubMedCentralID PMC3533944

  • Cross-pollination of research findings, although uncommon, may accelerate discovery of human disease genes BMC MEDICAL GENETICS Duda, M., Nelson, T., Wall, D. P. 2012; 13

    Abstract

    Technological leaps in genome sequencing have resulted in a surge in discovery of human disease genes. These discoveries have led to increased clarity on the molecular pathology of disease and have also demonstrated considerable overlap in the genetic roots of human diseases. In light of this large genetic overlap, we tested whether cross-disease research approaches lead to faster, more impactful discoveries. We leveraged several gene-disease association databases to calculate a Mutual Citation Score (MCS) for 10,853 pairs of genetically related diseases to measure the frequency of cross-citation between research fields. To assess the importance of cooperative research, we computed an Individual Disease Cooperation Score (ICS) and the average publication rate for each disease. For all disease pairs with one gene in common, we found that the degree of genetic overlap was a poor predictor of cooperation (r² = 0.3198) and that the vast majority of disease pairs (89.56%) never cited previous discoveries of the same gene in a different disease, irrespective of the level of genetic similarity between the diseases. A fraction (0.25%) of the pairs demonstrated cross-citation in greater than 5% of their published genetic discoveries and 0.037% cross-referenced discoveries more than 10% of the time. We found strong positive correlations between ICS and publication rate (r² = 0.7931), and an even stronger correlation between the publication rate and the number of cross-referenced diseases (r² = 0.8585). These results suggested that cross-disease research may have the potential to yield novel discoveries at a faster pace than singular disease research. Our findings suggest that the frequency of cross-disease study is low despite the high level of genetic similarity among many human diseases, and that collaborative methods may accelerate and increase the impact of new genetic discoveries. Until we have a better understanding of the taxonomy of human diseases, cross-disease research approaches should become the rule rather than the exception.

    View details for DOI 10.1186/1471-2350-13-114

    View details for Web of Science ID 000312866300001

    View details for PubMedID 23190421

    View details for PubMedCentralID PMC3532152

  • Use of Artificial Intelligence to Shorten the Behavioral Diagnosis of Autism PLOS ONE Wall, D. P., Dally, R., Luyster, R., Jung, J., DeLuca, T. F. 2012; 7 (8)

    Abstract

    The Autism Diagnostic Interview-Revised (ADI-R) is one of the most commonly used instruments for assisting in the behavioral diagnosis of autism. The exam consists of 93 questions that must be answered by a care provider within a focused session that often spans 2.5 hours. We used machine learning techniques to study the complete sets of answers to the ADI-R available at the Autism Genetic Resource Exchange (AGRE) for 891 individuals diagnosed with autism and 75 individuals who did not meet the criteria for an autism diagnosis. Our analysis showed that 7 of the 93 items contained in the ADI-R were sufficient to classify autism with 99.9% statistical accuracy. We further tested the accuracy of this 7-question classifier against complete sets of answers from two independent sources, a collection of 1654 individuals with autism from the Simons Foundation and a collection of 322 individuals with autism from the Boston Autism Consortium. In both cases, our classifier performed with nearly 100% statistical accuracy, properly categorizing all but one of the individuals from these two resources who previously had been diagnosed with autism through the standard ADI-R. Our ability to measure specificity was limited by the small numbers of non-spectrum cases in the research data used; however, both real and simulated data demonstrated a range in specificity from 99% to 93.8%. With incidence rates rising, the capacity to diagnose autism quickly and effectively requires careful design of behavioral assessment methods. Ours is an initial attempt to retrospectively analyze large data repositories to derive an accurate, but significantly abbreviated approach that may be used for rapid detection and clinical prioritization of individuals likely to have an autism spectrum disorder. Such a tool could assist in streamlining the clinical diagnostic process overall, leading to faster screening and earlier treatment of individuals with autism.

    View details for DOI 10.1371/journal.pone.0043855

    View details for Web of Science ID 000308044800067

    View details for PubMedID 22952789

    View details for PubMedCentralID PMC3428277

  • Delivery and impact bypass in a karst aquifer with high phosphorus source and pathway potential WATER RESEARCH Mellander, P., Jordan, P., Wall, D. P., Melland, A. R., Meehan, R., Kelly, C., Shortle, G. 2012; 46 (7): 2225-2236

    Abstract

    Conduit and other karstic flows to aquifers, connecting agricultural soils and farming activities, are considered to be the main hydrological mechanisms that transfer phosphorus from the land surface to the groundwater body of a karstified aquifer. In this study, soil source and pathway components of the phosphorus (P) transfer continuum were defined at a high spatial resolution; field-by-field soil P status and mapping of all surface karst features was undertaken in a > 30 km² spring contributing zone. Additionally, P delivery and water discharge were monitored in the emergent spring on a sub-hourly basis for over 12 months. Despite moderate to intensive agriculture, varying soil P status with a high proportion of elevated soil P concentrations and a high karstic connectivity potential, background P concentrations in the emergent groundwater were low and indicative of being insufficient to increase the surface water P status of receiving surface waters. However, episodic P transfers via the conduit system increased the P concentrations in the spring during storm events (but not >0.035 mg total reactive P L⁻¹) and this process is similar to other catchments where the predominant transfer is via episodic, surface flow pathways; but with high buffering potential over karst due to delayed and attenuated runoff. These data suggest that the current definitions of risk and vulnerability for P delivery to receiving surface waters should be re-evaluated as high source risk need not necessarily result in a water quality impact. Also, inclusion of conduit flows from sparse water quality data in these systems may over-emphasise their influence on the overall status of the groundwater body.

    View details for DOI 10.1016/j.watres.2012.01.048

    View details for Web of Science ID 000302645300020

    View details for PubMedID 22377147

  • Deriving clinical action from whole-genome analysis PERSONALIZED MEDICINE Wall, D. P., Tonellato, P. J. 2012; 9 (3): 247–52

    View details for DOI 10.2217/PME.12.32

    View details for Web of Science ID 000303702400004

    View details for PubMedID 29758797

  • Systems analysis of inflammatory bowel disease based on comprehensive gene information BMC MEDICAL GENETICS Suzuki, S., Takai-Igarashi, T., Fukuoka, Y., Wall, D. P., Tanaka, H., Tonellato, P. J. 2012; 13

    Abstract

    The rise of systems biology and availability of highly curated gene and molecular information resources has promoted a comprehensive approach to study disease as the cumulative deleterious function of a collection of individual genes and networks of molecules acting in concert. These "human disease networks" (HDN) have revealed novel candidate genes and pharmaceutical targets for many diseases and identified fundamental HDN features conserved across diseases. A network-based analysis is particularly vital for a study on polygenic diseases where many interactions between molecules should be simultaneously examined and elucidated. We employ a new knowledge driven HDN gene and molecular database systems approach to analyze Inflammatory Bowel Disease (IBD), whose pathogenesis remains largely unknown. Based on drug indications for IBD, we determined sibling diseases of mild and severe states of IBD. Approximately 1,000 genes associated with the sibling diseases were retrieved from four databases. After ranking the genes by the frequency of records in the databases, we obtained 250 and 253 genes highly associated with the mild and severe IBD states, respectively. We then calculated functional similarities of these genes with known drug targets and examined and presented their interactions as PPI networks. The results demonstrate that this knowledge-based systems approach, predicated on functionally similar genes important to sibling diseases, is an effective method to identify important components of the IBD human disease network. Our approach elucidates a previously unknown biological distinction between mild and severe IBD states.

    View details for DOI 10.1186/1471-2350-13-25

    View details for Web of Science ID 000305184200001

    View details for PubMedID 22480395

    View details for PubMedCentralID PMC3368714

  • Use of machine learning to shorten observation-based screening and diagnosis of autism TRANSLATIONAL PSYCHIATRY Wall, D. P., Kosmicki, J., DeLuca, T. F., Harstad, E., Fusaro, V. A. 2012; 2

    Abstract

    The Autism Diagnostic Observation Schedule-Generic (ADOS) is one of the most widely used instruments for behavioral evaluation of autism spectrum disorders. It is composed of four modules, each tailored for a specific group of individuals based on their language and developmental level. On average, a module takes between 30 and 60 min to deliver. We used a series of machine-learning algorithms to study the complete set of scores from Module 1 of the ADOS available at the Autism Genetic Resource Exchange (AGRE) for 612 individuals with a classification of autism and 15 non-spectrum individuals from both AGRE and the Boston Autism Consortium (AC). Our analysis indicated that 8 of the 29 items contained in Module 1 of the ADOS were sufficient to classify autism with 100% accuracy. We further validated the accuracy of this eight-item classifier against complete sets of scores from two independent sources, a collection of 110 individuals with autism from AC and a collection of 336 individuals with autism from the Simons Foundation. In both cases, our classifier performed with nearly 100% sensitivity, correctly classifying all but two of the individuals from these two resources with a diagnosis of autism, and with 94% specificity on a collection of observed and simulated non-spectrum controls. The classifier contained several elements found in the ADOS algorithm, demonstrating high test validity, and also resulted in a quantitative score that measures classification confidence and extremeness of the phenotype. With incidence rates rising, the ability to classify autism effectively and quickly requires careful design of assessment and diagnostic tools. Given the brevity, accuracy and quantitative nature of the classifier, results from this study may prove valuable in the development of mobile tools for preliminary evaluation and clinical prioritization (in particular those focused on assessment of short home videos of children) that speed the pace of initial evaluation and broaden the reach to a significantly larger percentage of the population at risk.

    View details for DOI 10.1038/tp.2012.10

    View details for Web of Science ID 000306218400003

    View details for PubMedID 22832900

    View details for PubMedCentralID PMC3337074

  • Roundup 2.0: enabling comparative genomics for over 1800 genomes BIOINFORMATICS DeLuca, T. F., Cui, J., Jung, J., Gabriel, K. C., Wall, D. P. 2012; 28 (5): 715-716

    Abstract

    Roundup is an online database of gene orthologs for over 1800 genomes, including 226 Eukaryota, 1447 Bacteria, 113 Archaea and 21 Viruses. Orthologs are inferred using the Reciprocal Smallest Distance algorithm. Users may query Roundup for single-linkage clusters of orthologous genes based on any group of genomes. Annotated query results may be viewed in a variety of ways including as clusters of orthologs and as phylogenetic profiles. Genomic results may be downloaded in formats suitable for functional as well as phylogenetic analysis, including the recent OrthoXML standard. In addition, gene IDs can be retrieved using FASTA sequence search. All source code and orthologs are freely available. http://roundup.hms.harvard.edu

    View details for DOI 10.1093/bioinformatics/bts006

    View details for Web of Science ID 000300986600017

    View details for PubMedID 22247275

    View details for PubMedCentralID PMC3289913

  • Cloud Computing for Comparative Genomics with Windows Azure Platform EVOLUTIONARY BIOINFORMATICS Kim, I., Jung, J., DeLuca, T. F., Nelson, T. H., Wall, D. P. 2012; 8: 527-534

    Abstract

    Cloud computing services have emerged as a cost-effective alternative for cluster systems as the number of genomes and required computation power to analyze them increased in recent years. Here we introduce the Microsoft Azure platform with detailed execution steps and a cost comparison with Amazon Web Services.

    View details for DOI 10.4137/EBO.S9946

    View details for Web of Science ID 000308500500001

    View details for PubMedID 23032609

    View details for PubMedCentralID PMC3433929

  • The future of genomics in pathology. F1000 medicine reports Wall, D. P., Tonellato, P. J. 2012; 4: 14-?

    Abstract

    The recent advances in technology and the promise of cheap and fast whole genomic data offer the possibility to revolutionise the discipline of pathology. This should allow pathologists in the near future to diagnose disease rapidly and early to change its course, and to tailor treatment programs to the individual. This review outlines some of these technical advances and the changes needed to make this revolution a reality.

    View details for DOI 10.3410/M4-14

    View details for PubMedID 22802873

  • Phylogenetically informed logic relationships improve detection of biological network organization BMC BIOINFORMATICS Cui, J., DeLuca, T. F., Jung, J., Wall, D. P. 2011; 12

    Abstract

    A "phylogenetic profile" refers to the presence or absence of a gene across a set of organisms, and it has been proven valuable for understanding gene functional relationships and network organization. Despite this success, few studies have attempted to search beyond just pairwise relationships among genes. Here we search for logic relationships involving three genes, and explore its potential application in gene network analyses. Taking advantage of a phylogenetic matrix constructed from the large orthologs database Roundup, we invented a method to create balanced profiles for individual triplets of genes that guarantee equal weight on the different phylogenetic scenarios of coevolution between genes. When we applied this idea to LAPP, the method to search for logic triplets of genes, the balanced profiles resulted in significant performance improvement and the discovery of hundreds of thousands more putative triplets than unadjusted profiles. We found that logic triplets detected biological network organization and identified key proteins and their functions, ranging from neighbouring proteins in local pathways, to well separated proteins in the whole pathway, and to the interactions among different pathways at the system level. Finally, our case study suggested that the directionality in a logic relationship and the profile of a triplet could disclose the connectivity between the triplet and surrounding networks. Balanced profiles are superior to the raw profiles employed by traditional methods of phylogenetic profiling in searching for high order gene sets. Gene triplets can provide valuable information in detection of biological network organization and identification of key genes at different levels of cellular interaction.

    View details for DOI 10.1186/1471-2105-12-476

    View details for Web of Science ID 000299824500001

    View details for PubMedID 22172058

    View details for PubMedCentralID PMC3402364

  • Identification of autoimmune gene signatures in autism TRANSLATIONAL PSYCHIATRY Jung, J., Kohane, I. S., Wall, D. P. 2011; 1

    Abstract

    The role of the immune system in neuropsychiatric diseases, including autism spectrum disorder (ASD), has long been hypothesized. This hypothesis has mainly been supported by family cohort studies and the immunological abnormalities found in ASD patients, but had limited findings in genetic association testing. Two cross-disorder genetic association tests were performed on the genome-wide data sets of ASD and six autoimmune disorders. In the polygenic score test, we examined whether ASD risk alleles with low effect sizes work collectively in specific autoimmune disorders and show significant association statistics. In the genetic variation score test, we tested whether allele-specific associations between ASD and autoimmune disorders can be found using nominally significant single-nucleotide polymorphisms. In both tests, we found that ASD is probabilistically linked to ankylosing spondylitis (AS) and multiple sclerosis (MS). Association coefficients showed that ASD and AS were positively associated, meaning that autism susceptibility alleles may have a similar collective effect in AS. The association coefficients were negative between ASD and MS. Significant associations between ASD and two autoimmune disorders were identified. This genetic association supports the idea that specific immunological abnormalities may underlie the etiology of autism, at least in a number of cases.

    View details for DOI 10.1038/tp.2011.62

    View details for Web of Science ID 000306217100007

    View details for PubMedID 22832355

    View details for PubMedCentralID PMC3309496

  • Detecting biological network organization and functional gene orthologs BIOINFORMATICS Cui, J., DeLuca, T. F., Jung, J., Wall, D. P. 2011; 27 (20): 2919-2920

    Abstract

    We developed a package TripletSearch to compute relationships within triplets of genes based on Roundup, an orthologous gene database containing >1500 genomes. These relationships, derived from the coevolution of genes, provide valuable information in the detection of biological network organization from the local to the system level, in the inference of protein functions and in the identification of functional orthologs. To run the computation, users need to provide the GI IDs of the genes of interest. http://wall.hms.harvard.edu/sites/default/files/tripletSearch.tar.gz dpwall@hms.harvard.edu Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btr485

    View details for Web of Science ID 000295680600025

    View details for PubMedID 21856738

    View details for PubMedCentralID PMC3187654

  • Biomedical Cloud Computing With Amazon Web Services PLOS COMPUTATIONAL BIOLOGY Fusaro, V. A., Patil, P., Gafni, E., Wall, D. P., Tonellato, P. J. 2011; 7 (8)

    Abstract

    In this overview to biomedical computing in the cloud, we discussed two primary ways to use the cloud (a single instance or cluster), provided a detailed example using NGS mapping, and highlighted the associated costs. While many users new to the cloud may assume that entry is as straightforward as uploading an application and selecting an instance type and storage options, we illustrated that there is substantial up-front effort required before an application can make full use of the cloud's vast resources. Our intention was to provide a set of best practices and to illustrate how those apply to a typical application pipeline for biomedical informatics, but also general enough for extrapolation to other types of computational problems. Our mapping example was intended to illustrate how to develop a scalable project and not to compare and contrast alignment algorithms for read mapping and genome assembly. Indeed, with a newer aligner such as Bowtie, it is possible to map the entire African genome using one m2.2xlarge instance in 48 hours for a total cost of approximately $48 in computation time. In our example, we were not concerned with data transfer rates, which are heavily influenced by the amount of available bandwidth, connection latency, and network availability. When transferring large amounts of data to the cloud, bandwidth limitations can be a major bottleneck, and in some cases it is more efficient to simply mail a storage device containing the data to AWS (http://aws.amazon.com/importexport/). More information about cloud computing, detailed cost analysis, and security can be found in references.

    View details for DOI 10.1371/journal.pcbi.1002147

    View details for Web of Science ID 000294299700022

    View details for PubMedID 21901085

    View details for PubMedCentralID PMC3161908

  • Using game theory to detect genes involved in Autism Spectrum Disorder TOP Esteban, F. J., Wall, D. P. 2011; 19 (1): 121-129
  • The semantic organization of the animal category: evidence from semantic verbal fluency and network theory COGNITIVE PROCESSING Goni, J., Arrondo, G., Sepulcre, J., Martincorena, I., Velez de Mendizabal, N., Corominas-Murtra, B., Bejarano, B., Ardanza-Trevijano, S., Peraita, H., Wall, D. P., Villoslada, P. 2011; 12 (2): 183-196

    Abstract

    Semantic memory is the subsystem of human memory that stores knowledge of concepts or meanings, as opposed to life-specific experiences. How humans organize semantic information remains poorly understood. In an effort to better understand this issue, we conducted a verbal fluency experiment on 200 participants with the aim of inferring and representing the conceptual storage structure of the natural category of animals as a network. This was done by formulating a statistical framework for co-occurring concepts that aims to infer significant concept-concept associations and represent them as a graph. The resulting network was analyzed and enriched by means of a missing links recovery criterion based on modularity. Both network models were compared to a thresholded co-occurrence approach. They were evaluated using a random subset of verbal fluency tests and comparing the network outcomes (linked pairs are clustering transitions and disconnected pairs are switching transitions) to the outcomes of two expert human raters. Results show that the network models proposed in this study overcome a thresholded co-occurrence approach, and their outcomes are in high agreement with human evaluations. Finally, the interplay between conceptual structure and retrieval mechanisms is discussed.

    View details for DOI 10.1007/s10339-010-0372-x

    View details for Web of Science ID 000289685000005

    View details for PubMedID 20938799

  • Genotator: A disease-agnostic tool for genetic annotation of disease BMC MEDICAL GENOMICS Wall, D. P., Pivovarov, R., Tong, M., Jung, J., Fusaro, V. A., DeLuca, T. F., Tonellato, P. J. 2010; 3

    Abstract

    Disease-specific genetic information has been increasing at rapid rates as a consequence of recent improvements and massive cost reductions in sequencing technologies. Numerous systems designed to capture and organize this mounting sea of genetic data have emerged, but these resources differ dramatically in their disease coverage and genetic depth. With few exceptions, researchers must manually search a variety of sites to assemble a complete set of genetic evidence for a particular disease of interest, a process that is both time-consuming and error-prone. We designed a real-time aggregation tool that provides both comprehensive coverage and reliable gene-to-disease rankings for any disease. Our tool, called Genotator, automatically integrates data from 11 externally accessible clinical genetics resources and uses these data in a straightforward formula to rank genes in order of disease relevance. We tested the accuracy of coverage of Genotator in three separate diseases for which there exist specialty curated databases, Autism Spectrum Disorder, Parkinson's Disease, and Alzheimer Disease. Genotator is freely available at http://genotator.hms.harvard.edu. Genotator demonstrated that most of the 11 selected databases contain unique information about the genetic composition of disease, with 2514 genes found in only one of the 11 databases. These findings confirm that the integration of these databases provides a more complete picture than would be possible from any one database alone. Genotator successfully identified at least 75% of the top ranked genes for all three of our use cases, including a 90% concordance with the top 40 ranked candidates for Alzheimer Disease. As a meta-query engine, Genotator provides high coverage of both historical genetic research as well as recent advances in the genetic understanding of specific diseases. As such, Genotator provides a real-time aggregation of ranked data that remains current with the pace of research in the disease fields. Genotator's algorithm appropriately transforms query terms to match the input requirements of each targeted database and accurately resolves named synonyms to ensure full coverage of the genetic results with official nomenclature. Genotator generates an excel-style output that is consistent across disease queries and readily importable to other applications.

    View details for DOI 10.1186/1755-8794-3-50

    View details for Web of Science ID 000284541000001

    View details for PubMedID 21034472

    View details for PubMedCentralID PMC2990725

  • Cloud computing for comparative genomics BMC BIOINFORMATICS Wall, D. P., Kudtarkar, P., Fusaro, V. A., Pivovarov, R., Patil, P., Tonellato, P. J. 2010; 11

    Abstract

    Large comparative genomics studies and tools are becoming increasingly more compute-expensive as the number of available genome sequences continues to rise. The capacity and cost of local computing infrastructures are likely to become prohibitive with the increase, especially as the breadth of questions continues to rise. Alternative computing architectures, in particular cloud computing environments, may help alleviate this increasing pressure and enable fast, large-scale, and cost-effective comparative genomics strategies going forward. To test this, we redesigned a typical comparative genomics algorithm, the reciprocal smallest distance algorithm (RSD), to run within Amazon's Elastic Computing Cloud (EC2). We then employed the RSD-cloud for ortholog calculations across a wide selection of fully sequenced genomes. We ran more than 300,000 RSD-cloud processes within the EC2. These jobs were farmed simultaneously to 100 high capacity compute nodes using the Amazon Web Service Elastic Map Reduce and included a wide mix of large and small genomes. The total computation time took just under 70 hours and cost a total of $6,302 USD. The effort to transform existing comparative genomics algorithms from local compute infrastructures is not trivial. However, the speed and flexibility of cloud computing environments provides a substantial boost with manageable cost. The procedure designed to transform the RSD algorithm into a cloud-ready application is readily adaptable to similar comparative genomics problems.

    View details for DOI 10.1186/1471-2105-11-259

    View details for Web of Science ID 000279730300001

    View details for PubMedID 20482786

    View details for PubMedCentralID PMC3098063

  • Cost-Effective Cloud Computing: A Case Study Using the Comparative Genomics Tool, Roundup EVOLUTIONARY BIOINFORMATICS Kudtarkar, P., DeLuca, T. F., Fusaro, V. A., Tonellato, P. J., Wall, D. P. 2010; 6: 197-203

    Abstract

    Comparative genomics resources, such as ortholog detection tools and repositories, are rapidly increasing in scale and complexity. Cloud computing is an emerging technological paradigm that enables researchers to dynamically build a dedicated virtual cluster and may represent a valuable alternative for large computational tools in bioinformatics. In the present manuscript, we optimize the computation of a large-scale comparative genomics resource, Roundup, using cloud computing, describe the proper operating principles required to achieve computational efficiency on the cloud, and detail important procedures for improving cost-effectiveness to ensure maximal computation at minimal costs. Utilizing the comparative genomics tool, Roundup, as a case study, we computed orthologs among 902 fully sequenced genomes on Amazon's Elastic Compute Cloud. For managing the ortholog processes, we designed a strategy to deploy the web service, Elastic MapReduce, and maximize the use of the cloud while simultaneously minimizing costs. Specifically, we created a model to estimate cloud runtime based on the size and complexity of the genomes being compared that determines in advance the optimal order of the jobs to be submitted. We computed orthologous relationships for 245,323 genome-to-genome comparisons on Amazon's computing cloud, a computation that required just over 200 hours and cost $8,000 USD, at least 40% less than expected under a strategy in which genome comparisons were submitted to the cloud randomly with respect to runtime. Our cost savings projections were based on a model that not only demonstrates the optimal strategy for deploying RSD to the cloud, but also finds the optimal cluster size to minimize waste and maximize usage. Our cost-reduction model is readily adaptable for other comparative genomics tools and potentially of significant benefit to labs seeking to take advantage of the cloud as an alternative to local computing infrastructure.

    View details for DOI 10.4137/EBO.S6259

    View details for Web of Science ID 000288866900009

    View details for PubMedID 21258651

    View details for PubMedCentralID PMC3023304
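
    The abstract above notes that ordering jobs by estimated runtime, rather than submitting them randomly, reduced cost. As a rough illustration of why submission order matters, the sketch below compares total wall-clock time for two orderings on a small cluster. It uses the generic longest-job-first heuristic, which is an assumption chosen for illustration and not the authors' published runtime model; the function name, the runtime estimates, and the node count are all hypothetical.

```python
import heapq


def makespan(runtimes, n_nodes):
    """Wall-clock time when each job, in the given order, goes to the node that frees up first."""
    nodes = [0.0] * n_nodes          # time at which each node becomes free
    heapq.heapify(nodes)
    for runtime in runtimes:
        earliest_free = heapq.heappop(nodes)
        heapq.heappush(nodes, earliest_free + runtime)
    return max(nodes)


# Hypothetical runtime estimates (hours) for five genome-to-genome comparisons.
estimated = [1.0, 1.0, 1.0, 1.0, 4.0]

as_given = makespan(estimated, n_nodes=2)                            # long job submitted last
longest_first = makespan(sorted(estimated, reverse=True), n_nodes=2)  # long job submitted first

print(as_given, longest_first)
```

    In this toy case, submitting the long job last leaves one node idle while the other finishes, so the cluster is billed for more hours than when the long job is submitted first.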

  • Collaborative text-annotation resource for disease-centered relation extraction from biomedical text JOURNAL OF BIOMEDICAL INFORMATICS Cano, C., Monaghan, T., Blanco, A., Wall, D. P., Peshkin, L. 2009; 42 (5): 967-977

    Abstract

    Agglomerating results from studies of individual biological components has shown the potential to produce biomedical discovery and the promise of therapeutic development. Such knowledge integration could be tremendously facilitated by automated text mining for relation extraction in the biomedical literature. Relation extraction systems cannot be developed without substantial datasets annotated with ground truth for benchmarking and training. The creation of such datasets is hampered by the absence of a resource for launching a distributed annotation effort, as well as by the lack of a standardized annotation schema. We have developed an annotation schema and an annotation tool which can be widely adopted so that the resulting annotated corpora from a multitude of disease studies could be assembled into a unified benchmark dataset. The contribution of this paper is threefold. First, we provide an overview of available benchmark corpora and derive a simple annotation schema for specific binary relation extraction problems such as protein-protein and gene-disease relation extraction. Second, we present BioNotate: an open source annotation resource for the distributed creation of a large corpus. Third, we present and make available the results of a pilot annotation effort of the autism disease network.

    View details for DOI 10.1016/j.jbi.2009.02.001

    View details for Web of Science ID 000270870500021

    View details for PubMedID 19232400

    View details for PubMedCentralID PMC2757509

  • Reply to the "Letter to the Editors" by Steven Buyske NEUROGENETICS Abu-Elneel, K., Liu, T., Gazzaniga, F. S., Nishimura, Y., Wall, D. P., Geschwind, D. H., Lao, K., Kosik, K. S. 2009; 10 (2): 169-170
  • Comparative analysis of neurological disorders focuses genome-wide search for autism genes GENOMICS Wall, D. P., Esteban, F. J., DeLuca, T. F., Huyck, M., Monaghan, T., de Mendizabal, N. V., Goni, J., Kohane, I. S. 2009; 93 (2): 120-129

    Abstract

    The behaviors of autism overlap with a diverse array of other neurological disorders, suggesting common molecular mechanisms. We conducted a large comparative analysis of the network of genes linked to autism with those of 432 other neurological diseases to circumscribe a multi-disorder subcomponent of autism. We leveraged the biological process and interaction properties of these multi-disorder autism genes to overcome the across-the-board multiple hypothesis corrections that a purely data-driven approach requires. Using prior knowledge of biological process, we identified 154 genes not previously linked to autism of which 42% were significantly differentially expressed in autistic individuals. Then, using prior knowledge from interaction networks of disorders related to autism, we uncovered 334 new genes that interact with published autism genes, of which 87% were significantly differentially regulated in autistic individuals. Our analysis provided a novel picture of autism from the perspective of related neurological disorders and suggested a model by which prior knowledge of interaction networks can inform and focus genome-scale studies of complex neurological disorders.

    View details for DOI 10.1016/j.ygeno.2008.09.015

    View details for Web of Science ID 000263227600003

    View details for PubMedID 18950700

  • Heterogeneous dysregulation of microRNAs across the autism spectrum NEUROGENETICS Abu-Elneel, K., Liu, T., Gazzaniga, F. S., Nishimura, Y., Wall, D. P., Geschwind, D. H., Lao, K., Kosik, K. S. 2008; 9 (3): 153-161

    Abstract

    microRNAs (miRNAs) are approximately 21 nt transcripts capable of regulating the expression of many mRNAs and are abundant in the brain. miRNAs have a role in several complex diseases including cancer as well as some neurological diseases such as Tourette's syndrome and Fragile X syndrome. As a genetically complex disease, dysregulation of miRNA expression might be a feature of autism spectrum disorders (ASDs). Using multiplex quantitative polymerase chain reaction (PCR), we compared the expression of 466 human miRNAs from postmortem cerebellar cortex tissue of individuals with ASD (n = 13) and a control set of non-autistic cerebellar samples (n = 13). While most miRNA levels showed little variation across all samples, suggesting that autism does not induce global dysfunction of miRNA expression, some miRNAs among the autistic samples were expressed at significantly different levels compared to the mean control value. Twenty-eight miRNAs were expressed at significantly different levels compared to the non-autism control set in at least one of the autism samples. To validate the finding, we reversed the analysis and compared each non-autism control to a single mean value for each miRNA across all autism cases. In this analysis, the number of dysregulated miRNAs fell from 28 to 9 miRNAs. Among the predicted targets of dysregulated miRNAs are genes that are known genetic causes of autism such as Neurexin and SHANK3. This study finds that altered miRNA expression levels are observed in postmortem cerebellar cortex from autism patients, a finding which suggests that dysregulation of miRNAs may contribute to autism spectrum phenotype.

    View details for DOI 10.1007/s10048-008-0133-5

    View details for Web of Science ID 000257216200001

    View details for PubMedID 18563458

  • Testing the Accuracy of Eukaryotic Phylogenetic Profiles for Prediction of Biological Function EVOLUTIONARY BIOINFORMATICS Singh, S., Wall, D. P. 2008; 4: 217-223

    Abstract

    A phylogenetic profile captures the pattern of gene gain and loss throughout evolutionary time. Proteins that interact directly or indirectly within the cell to perform a biological function will often co-evolve, and this co-evolution should be well reflected within their phylogenetic profiles. Thus similar phylogenetic profiles are commonly used for grouping proteins into functional groups. However, it remains unclear how the size and content of the phylogenetic profile impacts the ability to predict function, particularly in Eukaryotes. Here we developed a straightforward approach to address this question by constructing a complete set of phylogenetic profiles for 31 fully sequenced Eukaryotes. Using Gene Ontology as our gold standard, we compared the accuracy of functional predictions made by a comprehensive array of permutations on the complete set of genomes. Our permutations showed that phylogenetic profiles containing between 25 and 31 Eukaryotic genomes performed equally well and significantly better than all other permuted genome sets, with one exception: we uncovered a core group of 18 genomes that achieved statistically identical accuracy. This core group contained genomes from each branch of the eukaryotic phylogeny, but also contained several groups of closely related organisms, suggesting that a balance between phylogenetic breadth and depth may improve our ability to use Eukaryotic specific phylogenetic profiles for functional annotations.

    View details for Web of Science ID 000264677700019

    View details for PubMedID 19204819

    View details for PubMedCentralID PMC2614202

  • Ortholog detection using the reciprocal smallest distance algorithm. Methods in molecular biology (Clifton, N.J.) Wall, D. P., Deluca, T. 2007; 396: 95-110

    Abstract

    All protein coding genes have a phylogenetic history that when understood can lead to deep insights into the diversification or conservation of function, the evolution of developmental complexity, and the molecular basis of disease. One important part to reconstructing the relationships among genes in different organisms is an accurate method to find orthologs as well as an accurate measure of evolutionary diversification. The present chapter details such a method, called the reciprocal smallest distance algorithm (RSD). This approach improves upon the common procedure of taking reciprocal best Basic Local Alignment Search Tool hits (RBH) in the identification of orthologs by using global sequence alignment and maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes. RSD finds many putative orthologs missed by RBH because it is less likely to be misled by the presence of close paralogs in genomes. The package offers a tremendous amount of flexibility in investigating parameter settings allowing the user to search for increasingly distant orthologs between highly divergent species, among other advantages. The flexibility of this tool makes it a unique and powerful addition to other available approaches for ortholog detection.

    View details for PubMedID 18025688
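
    The reciprocity step described in the abstract above can be sketched in a few lines. This is a deliberately simplified illustration, assuming the pairwise evolutionary distances have already been estimated; the actual RSD procedure also involves BLAST searches, global alignment, and maximum likelihood distance estimation, none of which is shown here. The function name and the toy distances are hypothetical.

```python
def reciprocal_smallest_distance(dist):
    """Return (gene_a, gene_b, distance) triples that are mutually closest.

    dist maps (gene_a, gene_b) pairs to a precomputed evolutionary distance.
    """
    best_for_a = {}  # gene in genome A -> (closest gene in genome B, distance)
    best_for_b = {}  # gene in genome B -> (closest gene in genome A, distance)
    for (a, b), d in dist.items():
        if a not in best_for_a or d < best_for_a[a][1]:
            best_for_a[a] = (b, d)
        if b not in best_for_b or d < best_for_b[b][1]:
            best_for_b[b] = (a, d)
    # Keep a pair only if each gene is the other's smallest-distance partner.
    return sorted(
        (a, b, d) for a, (b, d) in best_for_a.items() if best_for_b[b][0] == a
    )


# Toy distances between three genes in genome A and two genes in genome B.
toy = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.40,
    ("a2", "b1"): 0.15, ("a2", "b2"): 0.30,
    ("a3", "b2"): 0.35,
}
pairs = reciprocal_smallest_distance(toy)
print(pairs)
```

    Here a2 is closest to b1, but b1 is closer to a1, so only the a1 and b1 pair passes the reciprocity check.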

  • Roundup: a multi-genome repository of orthologs and evolutionary distances BIOINFORMATICS DeLuca, T. F., Wu, I., Pu, J., Monaghan, T., Peshkin, L., Singh, S., Wall, D. P. 2006; 22 (16): 2044-2046

    Abstract

    We have created a tool for ortholog and phylogenetic profile retrieval called Roundup. Roundup is backed by a massive repository of orthologs and associated evolutionary distances that was built using the reciprocal smallest distance algorithm, an approach that has been shown to improve upon alternative approaches of ortholog detection, such as reciprocal blast. Presently, the Roundup repository contains all possible pair-wise comparisons for over 250 genomes, including 32 Eukaryotes, more than doubling the coverage of any similar resource. The orthologs are accessible through an intuitive web interface that allows searches by genome or gene identifier, presenting results as phylogenetic profiles together with gene and molecular function annotations. Results may be downloaded as phylogenetic matrices for subsequent analysis, including the construction of whole-genome phylogenies based on gene-content data. http://rodeo.med.harvard.edu/tools/roundup.

    View details for DOI 10.1093/bioinformatics/btl286

    View details for Web of Science ID 000239900200016

    View details for PubMedID 16777906

  • Heparan sulfate proteoglycans and the emergence of neuronal connectivity CURRENT OPINION IN NEUROBIOLOGY Van Vactor, D., Wall, D. P., Johnson, K. G. 2006; 16 (1): 40-51

    Abstract

    With the identification of the molecular determinants of neuronal connectivity, our understanding of the extracellular information that controls axon guidance and synapse formation has evolved from single factors towards the complexity that neurons face in a living organism. As we move in this direction - ready to see the forest for the trees - attention is returning to one of the most ancient regulators of cell-cell interaction: the extracellular matrix. Among many matrix components that influence neuronal connectivity, recent studies of the heparan sulfate proteoglycans suggest that these ancient molecules function as versatile extracellular scaffolds that both sculpt the landscape of extracellular cues and modulate the way that neurons perceive the world around them.

    View details for DOI 10.1016/j.conb.2006.01.011

    View details for Web of Science ID 000236136200007

    View details for PubMedID 16417999

  • The role of selection in the evolution of human mitochondrial genomes GENETICS Kivisild, T., Shen, P. D., Wall, D. P., Do, B., Sung, R., Davis, K., Passarino, G., Underhill, P. A., Scharfe, C., Torroni, A., Scozzari, R., Modiano, D., Coppa, A., de Knijff, P., Feldman, M., Cavalli-Sforza, L. L., Oefner, P. J. 2006; 172 (1): 373-387

    Abstract

    High mutation rate in mammalian mitochondrial DNA generates a highly divergent pool of alleles even within species that have dispersed and expanded in size recently. Phylogenetic analysis of 277 human mitochondrial genomes revealed a significant (P < 0.01) excess of rRNA and nonsynonymous base substitutions among hotspots of recurrent mutation. Most hotspots involved transitions from guanine to adenine that, with thymine-to-cytosine transitions, illustrate the asymmetric bias in codon usage at synonymous sites on the heavy-strand DNA. The mitochondrion-encoded tRNAThr varied significantly more than any other tRNA gene. Threonine and valine codons were involved in 259 of the 414 amino acid replacements observed. The ratio of nonsynonymous changes from and to threonine and valine differed significantly (P = 0.003) between populations with neutral (22/58) and populations with significantly negative Tajima's D values (70/76), independent of their geographic location. In contrast to a recent suggestion that the excess of nonsilent mutations is characteristic of Arctic populations, implying their role in cold adaptation, we demonstrate that the surplus of nonsynonymous mutations is a general feature of the young branches of the phylogenetic tree, affecting also those that are found only in Africa. We introduce a new calibration method of the mutation rate of synonymous transitions to estimate the coalescent times of mtDNA haplogroups.

    View details for DOI 10.1534/genetics.105.043901

    View details for Web of Science ID 000235197700033

    View details for PubMedID 16172508

  • Converging on a general model of protein evolution TRENDS IN BIOTECHNOLOGY Herbeck, J. T., Wall, D. P. 2005; 23 (10): 485-487

    Abstract

    The availability of high-throughput genomic databases that establish protein dispensability, expression and interaction networks enables rigorous tests of competing models of protein evolution. Recent research utilizing these new data sets shows that protein evolution is more complex than was previously thought. Several variables, including protein dispensability, expression, functional density, and genetic modularity, appear to have independent effects on the evolutionary rate of proteins, suggesting that proteomes have evolved via an assembly of selectional regimes. These results indicate that a general model of protein evolution will emerge as more functional genomic data from a diversity of organisms accumulate.

    View details for DOI 10.1016/j.tibtech.2005.07.009

    View details for Web of Science ID 000232605900001

    View details for PubMedID 16054255

  • Origin and rapid diversification of a tropical moss EVOLUTION Wall, D. P. 2005; 59 (7): 1413-1424

    Abstract

    Molecular sequences rarely evolve at a constant rate. Yet, even in instances where a clock can be assumed or approximated for a particular set of sequences, fossils or clear patterns of vicariance are rarely available to calibrate the clock. Thus, obtaining absolute timing for diversification of natural lineages can prove difficult. Unfortunately, without absolute time we cannot develop a complete understanding of important evolutionary processes, including adaptive radiations and key innovations. In the present study, the coding sequence of the nuclear gene, glyceraldehyde 3-phosphate dehydrogenase (gpd), extracted from the paleotropical moss, Mitthyridium, was found to exhibit clocklike behavior and used to reconstruct the history of 80 distinct molecular lineages that cover the full geographic range of Mitthyridium. Two separate clades endemic to two geographically distinct oceanic archipelagos were revealed by this phylogenetic analysis. This allowed the use of island age (as derived from potassium-argon dating) as a maximum age of origin of each monophyletic group, providing two independent time anchors for the clock found in gpd, the final piece needed to study absolute time. Based on results from both maximum age calibrations, which separately yielded highly consistent estimates, the ancestor of this moss group arose approximately 8 million years ago, and then diversified at the rapid rate of 0.56 +/- 0.004 new lineages per million years. Such a rate is on par with the highest diversification rates reported in the literature including rapidly radiating insular groups like the Hawaiian silversword alliance, a classic example of an adaptive radiation. Using independent sources of data, it was found that neither the age nor diversification estimates were affected by the use of molecular lineages rather than species as the operational taxonomic units. Identifying the cause for this rapid diversification requires further testing, but it appears to be related to a general shift in reproductive strategy from sexual to asexual, which may be a key innovation for this young group.

    View details for Web of Science ID 000230975600004

    View details for PubMedID 16153028

  • Functional genomic analysis of the rates of protein evolution PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Wall, D. P., Hirsh, A. E., Fraser, H. B., Kumm, J., Giaever, G., Eisen, M. B., Feldman, M. W. 2005; 102 (15): 5483-5488

    Abstract

    The evolutionary rates of proteins vary over several orders of magnitude. Recent work suggests that analysis of large data sets of evolutionary rates in conjunction with the results from high-throughput functional genomic experiments can identify the factors that cause proteins to evolve at such dramatically different rates. To this end, we estimated the evolutionary rates of >3,000 proteins in four species of the yeast genus Saccharomyces and investigated their relationship with levels of expression and protein dispensability. Each protein's dispensability was estimated by the growth rate of mutants deficient for the protein. Our analyses of these improved evolutionary and functional genomic data sets yield three main results. First, dispensability and expression have independent, significant effects on the rate of protein evolution. Second, measurements of expression levels in the laboratory can be used to filter data sets of dispensability estimates, removing variates that are unlikely to reflect real biological effects. Third, structural equation models show that although we may reasonably infer that dispensability and expression have significant effects on protein evolutionary rate, we cannot yet accurately estimate the relative strengths of these effects.

    View details for DOI 10.1073/pnas.0501761102

    View details for Web of Science ID 000228376600036

    View details for PubMedID 15800036

  • Conservation of the RB1 gene in human and primates HUMAN MUTATION Sivakumaran, T. A., Shen, P. D., Wall, D. P., Do, B. H., Kucheria, K., Oefner, P. J. 2005; 25 (4): 396-409

    Abstract

    Mutations in the RB1 gene are associated with retinoblastoma, which has served as an important model for understanding hereditary predisposition to cancer. Despite the great scrutiny that RB1 has enjoyed as the prototypical tumor suppressor gene, it has never been the object of a comprehensive survey of sequence variation in diverse human populations and primates. Therefore, we analyzed the coding (2,787 bp) and adjacent intronic and untranslated (7,313 bp) sequences of RB1 in 137 individuals from a wide range of ethnicities, including 19 Asian Indian hereditary retinoblastoma cases, and five primate species. Aside from nine apparently disease-associated mutations, 52 variants were identified. They included six singleton, coding variants that comprised five amino acid replacements and one silent site. Nucleotide diversity of the coding region (pi=0.0763+/-1.35 x 10^-4) was 52 times lower than that of the noncoding regions (pi=3.93+/-5.26 x 10^-4), indicative of significant sequence conservation. The occurrence of purifying selection was corroborated by phylogeny-based maximum likelihood analysis of the RB1 sequences of human and five primates, which yielded an estimated ratio of replacement to silent substitutions (omega) of 0.095 across all lineages. RB1 displayed extensive linkage disequilibrium over 174 kb, and only four unique recombination events, two in Africa and one each in Europe and Southwest Asia, were observed. Using a parsimony approach, 15 haplotypes could be inferred. Ten were found in Africa, though only 12.4% of the 274 chromosomes screened were of African origin. In non-Africans, a single haplotype accounted for from 63 to 84% of all chromosomes, most likely the consequence of natural selection and a significant bottleneck in effective population size during the colonization of the non-African continents.

    View details for DOI 10.1002/humu.20154

    View details for Web of Science ID 000228099600009

    View details for PubMedID 15776430

  • Adjusting for selection on synonymous sites in estimates of evolutionary distance MOLECULAR BIOLOGY AND EVOLUTION Hirsh, A. E., Fraser, H. B., Wall, D. P. 2005; 22 (1): 174-177

    Abstract

    Evolution at silent sites is often used to estimate the pace of selectively neutral processes or to infer differences in divergence times of genes. However, silent sites are subject to selection in favor of preferred codons, and the strength of such selection varies dramatically across genes. Here, we use the relationship between codon bias and synonymous divergence observed in four species of the genus Saccharomyces to provide a simple correction for selection on silent sites.

    View details for DOI 10.1093/molbev/msh265

    View details for Web of Science ID 000225730100018

    View details for PubMedID 15371530

  • Improved haematopoietic recovery following transplantation with ex vivo-expanded mobilized blood cells 45th Annual Meeting and Exhibition of the American-Society-of-Hematology Prince, H. M., Simmons, P. J., Whitty, G., Wall, D. P., Barber, L., Toner, G. C., Seymour, J. F., Richardson, G., Mrongovius, R., Haylock, D. N. WILEY-BLACKWELL PUBLISHING, INC. 2004: 536-545

    Abstract

    Infusions of ex vivo-expanded (EXE) mobilized blood cells have been explored to enhance haematopoietic recovery following high dose chemotherapy (HDT). However, prior studies have not consistently demonstrated improvements in trilineage haematopoietic recovery. Three cohorts of three patients with breast cancer received three cycles of repetitive HDT supported by either unmanipulated (UM) and/or EXE cells. Efficacy was assessed by an internal comparison of each patient's consecutive HDT cycles, and to 106 historical UM infusions. Twenty-one cycles were supported by EXE cells and six by UM cells alone. Infusions of EXE cells resulted in fewer days with an absolute neutrophil count (ANC) <0.1 x 10^9/l (median 2 vs. 4 d, P = 0.002) and 3 d faster ANC recovery to >0.1 x 10^9/l (median 5 vs. 8 d, P = 0.0002). This resulted in a major reduction in the incidence of febrile neutropenia compared with UM cycles (0% vs. 83%; P = 0.008) and in 66% of historical UM cycles (P = 0.01) and a marked reduction in hospital re-admission. There were also fewer platelet transfusions required (43% vs. 100%; P = 0.009). We conclude that EXE cells enhance both neutrophil and platelet recovery and reduce febrile neutropenia, platelet transfusion and hospital re-admission.

    View details for DOI 10.1111/j.1365-2141.2004.05081.x

    View details for Web of Science ID 000223036300011

    View details for PubMedID 15287947

  • Coevolution of gene expression among interacting proteins PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Fraser, H. B., Hirsh, A. E., Wall, D. P., Eisen, M. B. 2004; 101 (24): 9033-9038

    Abstract

    Physically interacting proteins or parts of proteins are expected to evolve in a coordinated manner that preserves proper interactions. Such coevolution at the amino acid-sequence level is well documented and has been used to predict interacting proteins, domains, and amino acids. Interacting proteins are also often precisely coexpressed with one another, presumably to maintain proper stoichiometry among interacting components. Here, we show that the expression levels of physically interacting proteins coevolve. We estimate average expression levels of genes from four closely related fungi of the genus Saccharomyces using the codon adaptation index and show that expression levels of interacting proteins exhibit coordinated changes in these different species. We find that this coevolution of expression is a more powerful predictor of physical interaction than is coevolution of amino acid sequence. These results demonstrate that gene expression levels can coevolve, adding another dimension to the study of the coevolution of interacting proteins and underscoring the importance of maintaining coexpression of interacting proteins over evolutionary time. Our results also suggest that expression coevolution can be used for computational prediction of protein-protein interactions.

    View details for DOI 10.1073/pnas.0402591101

    View details for Web of Science ID 000222104900038

    View details for PubMedID 15175431
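
    The abstract above uses the codon adaptation index (CAI) as a proxy for expression level. As a rough illustration only, CAI is commonly defined as the geometric mean of per-codon weights, where each weight is a codon's frequency in a reference set of highly expressed genes divided by the frequency of the most used synonymous codon. The sketch below follows that textbook definition; the function names, the two-family codon table, and the toy reference counts are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def relative_adaptiveness(reference_codons, synonymous_families):
    """Weight of each codon = its count / count of the most used synonym.

    reference_codons: codons pooled from a (hypothetical) reference gene set.
    synonymous_families: tuples of codons that encode the same amino acid.
    Codons absent from the reference are skipped here, which is a
    simplification of how zero counts are usually handled.
    """
    counts = Counter(reference_codons)
    weights = {}
    for family in synonymous_families:
        best = max(counts[codon] for codon in family)
        for codon in family:
            if best > 0 and counts[codon] > 0:
                weights[codon] = counts[codon] / best
    return weights


def codon_adaptation_index(gene_codons, weights):
    """Geometric mean of the weights of the codons in one gene."""
    logs = [math.log(weights[c]) for c in gene_codons if c in weights]
    if not logs:
        raise ValueError("no codon in the gene has a defined weight")
    return math.exp(sum(logs) / len(logs))


# Toy data (assumed): lysine codons AAA/AAG and glutamate codons GAA/GAG.
toy_families = [("AAA", "AAG"), ("GAA", "GAG")]
toy_reference = ["AAA"] * 1 + ["AAG"] * 4 + ["GAA"] * 4 + ["GAG"] * 2
toy_weights = relative_adaptiveness(toy_reference, toy_families)

cai_preferred = codon_adaptation_index(["AAG", "GAA"], toy_weights)
cai_rare = codon_adaptation_index(["AAA", "GAG"], toy_weights)
```

    A gene built only from the preferred codons scores at the top of the scale, while a gene built from the rarer synonyms scores lower, which is why the index can stand in for expression level when direct measurements are unavailable.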

  • Gene expression level influences amino acid usage, but not codon usage, in the tsetse fly endosymbiont Wigglesworthia MICROBIOLOGY-SGM Herbeck, J. T., Wall, D. P., Wernegreen, J. J. 2003; 149: 2585-2596

    Abstract

    Wigglesworthia glossinidia brevipalpis, the obligate bacterial endosymbiont of the tsetse fly Glossina brevipalpis, is characterized by extreme genome reduction and AT nucleotide composition bias. Here, multivariate statistical analyses are used to test the hypothesis that mutational bias and genetic drift shape synonymous codon usage and amino acid usage of Wigglesworthia. The results show that synonymous codon usage patterns vary little across the genome and do not distinguish genes of putative high and low expression levels, thus indicating a lack of translational selection. Extreme AT composition bias across the genome also drives relative amino acid usage, but predicted high-expression genes (ribosomal proteins and chaperonins) use GC-rich amino acids more frequently than do low-expression genes. The levels and configuration of amino acid differences between Wigglesworthia and Escherichia coli were compared to test the hypothesis that the relatively GC-rich amino acid profiles of high-expression genes reflect greater amino acid conservation at these loci. This hypothesis is supported by reduced levels of protein divergence at predicted high-expression Wigglesworthia genes and similar configurations of amino acid changes across expression categories. Combined, the results suggest that codon and amino acid usage in the Wigglesworthia genome reflect a strong AT mutational bias and elevated levels of genetic drift, consistent with expected effects of an endosymbiotic lifestyle and repeated population bottlenecks. However, these impacts of mutation and drift are apparently attenuated by selection on amino acid composition at high-expression genes.

    View details for DOI 10.1099/mic.0.26381-0

    View details for Web of Science ID 000185342900027

    View details for PubMedID 12949182

  • Detecting putative orthologs BIOINFORMATICS Wall, D. P., Fraser, H. B., Hirsh, A. E. 2003; 19 (13): 1710-1711

    Abstract

    We developed an algorithm that improves upon the common procedure of taking reciprocal best blast hits (rbh) in the identification of orthologs. The method, the reciprocal smallest distance algorithm (rsd), relies on global sequence alignment and maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes. rsd finds many putative orthologs missed by rbh because it is less likely than rbh to be misled by the presence of a close paralog.

    View details for DOI 10.1093/bioinformatics/btg213

    View details for Web of Science ID 000185310600016

    View details for PubMedID 15593400
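
    The reciprocity test described in the abstract above can be sketched as follows. This is a minimal illustration, not the published implementation: it assumes the candidate hit lists and the pairwise evolutionary distances have already been computed elsewhere (the paper obtains distances from global alignment and maximum likelihood estimation), and every name and number here is hypothetical toy data.

```python
def smallest_distance_hit(query, candidates, distance):
    """Return the candidate with the smallest distance to the query."""
    if not candidates:
        return None
    return min(candidates, key=lambda candidate: distance(query, candidate))


def reciprocal_smallest_distance(genes_a, hits_a_to_b, hits_b_to_a, distance):
    """Pair gene a with gene b only if each is the other's closest hit.

    hits_a_to_b / hits_b_to_a: dicts mapping a gene to its candidate hits
    in the other genome (assumed to come from a prior similarity search).
    distance: callable returning a precomputed evolutionary distance.
    """
    orthologs = {}
    for a in genes_a:
        b = smallest_distance_hit(a, hits_a_to_b.get(a, []), distance)
        if b is None:
            continue
        back = smallest_distance_hit(b, hits_b_to_a.get(b, []), distance)
        if back == a:
            orthologs[a] = b
    return orthologs


# Toy distances (assumed values), keyed by unordered gene pair.
toy_distances = {
    frozenset(("a1", "b1")): 0.1,
    frozenset(("a1", "b2")): 0.3,
    frozenset(("a2", "b1")): 0.4,
    frozenset(("a2", "b2")): 0.2,
    frozenset(("a3", "b1")): 0.5,
}


def toy_distance(x, y):
    return toy_distances[frozenset((x, y))]


toy_hits_a_to_b = {"a1": ["b1", "b2"], "a2": ["b1", "b2"], "a3": ["b1"]}
toy_hits_b_to_a = {"b1": ["a1", "a2", "a3"], "b2": ["a1", "a2"]}

toy_orthologs = reciprocal_smallest_distance(
    ["a1", "a2", "a3"], toy_hits_a_to_b, toy_hits_b_to_a, toy_distance
)
```

    In this toy example a3 is left unpaired: its closest hit is b1, but b1 is closer to a1, so the reciprocity condition fails. Ranking candidates by estimated distance rather than by search score is what separates this approach from reciprocal best hits.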

  • Evolutionary patterns of codon usage in the chloroplast gene rbcL JOURNAL OF MOLECULAR EVOLUTION Wall, D. P., Herbeck, J. T. 2003; 56 (6): 673-688

    Abstract

    In this study we reconstruct the evolution of codon usage bias in the chloroplast gene rbcL using a phylogeny of 92 green-plant taxa. We employ a measure of codon usage bias that accounts for chloroplast genomic nucleotide content, as an attempt to limit plausible explanations for patterns of codon bias evolution to selection- or drift-based processes. This measure uses maximum likelihood-ratio tests to compare the performance of two models, one in which a single codon is overrepresented and one in which two codons are overrepresented. The measure allowed us to analyze both the extent of bias in each lineage and the evolution of codon choice across the phylogeny. Despite predictions based primarily on the low G + C content of the chloroplast and the high functional importance of rbcL, we found large differences in the extent of bias, suggesting differential molecular selection that is clade specific. The seed plants and simple leafy liverworts each independently derived a low level of bias in rbcL, perhaps indicating relaxed selectional constraint on molecular changes in the gene. Overrepresentation of a single codon was typically plesiomorphic, and transitions to overrepresentation of two codons occurred commonly across the phylogeny, possibly indicating biochemical selection. The total codon bias in each taxon, when regressed against the total bias of each amino acid, suggested that twofold amino acids play a strong role in inflating the level of codon usage bias in rbcL, despite the fact that twofolds compose a minority of residues in this gene. Those amino acids that contributed most to the total codon usage bias of each taxon are known through amino acid knockout and replacement to be of high functional importance. This suggests that codon usage bias may be constrained by particular amino acids and, thus, may serve as a good predictor of what residues are most important for protein fitness.

    View details for DOI 10.1007/s00239-002-2436-8

    View details for Web of Science ID 000183129100004

    View details for PubMedID 12911031

  • A simple dependence between protein evolution rate and the number of protein-protein interactions BMC EVOLUTIONARY BIOLOGY Fraser, H. B., Wall, D. P., Hirsh, A. E. 2003; 3

    Abstract

    It has been shown for an evolutionarily distant genomic comparison that the number of protein-protein interactions a protein has correlates negatively with their rates of evolution. However, the generality of this observation has recently been challenged. Here we examine the problem using protein-protein interaction data from the yeast Saccharomyces cerevisiae and genome sequences from two other yeast species. In contrast to a previous study that used an incomplete set of protein-protein interactions, we observed a highly significant correlation between number of interactions and evolutionary distance to either Candida albicans or Schizosaccharomyces pombe. This study differs from the previous one in that it includes all known protein interactions from S. cerevisiae, and a larger set of protein evolutionary rates. In both evolutionary comparisons, a simple monotonic relationship was found across the entire range of the number of protein-protein interactions. In agreement with our earlier findings, this relationship cannot be explained by the fact that proteins with many interactions tend to be important to yeast. The generality of these correlations in other kingdoms of life unfortunately cannot be addressed at this time, due to the incompleteness of protein-protein interaction data from organisms other than S. cerevisiae. Protein-protein interactions tend to slow the rate at which proteins evolve. This may be due to structural constraints that must be met to maintain interactions, but more work is needed to definitively establish the mechanism(s) behind the correlations we have observed.

    View details for Web of Science ID 000188122100011

    View details for PubMedID 12769820

  • Use of the nuclear gene glyceraldehyde 3-phosphate dehydrogenase for phylogeny reconstruction of recently diverged lineages in Mitthyridium (Musci : Calymperaceae) MOLECULAR PHYLOGENETICS AND EVOLUTION Wall, D. P. 2002; 25 (1): 10-26

    Abstract

    A portion of the nuclear gene glyceraldehyde 3-phosphate dehydrogenase (gpd) was sequenced in 26 representatives of the paleotropical moss, Mitthyridium, and a group of 20 outgroup taxa to assess its utility for phylogenetic reconstruction compared with the better understood chloroplast markers, rps4 and trnL. Primers based on plant and fungal sequences were designed to amplify gpd in plants universally with the exclusion of fungal contaminants. The piece amplified spanned 4 introns and 3 of 9 exons, based on comparisons with complete sequence from Arabidopsis. Size variation in gpd ranged from 891 to 1007 bp, in part attributable to 6 indels of variable length found within the introns. Intron 6 contributed most of the length variation and contained a variable purine-repeat motif of possible use as a microsatellite. Phylogenetic analyses of the full gpd amplicon yielded well-resolved trees that were in nearly full accord with the trees derived from the cpDNA partitions for analyses of both the ingroup and ingroup + outgroup taxon sets. Pairwise nucleotide substitution rates of gpd were as much as 2.2 times higher than those in rps4 and 2.8 times higher than in trnL. Excision of the introns left suitable numbers of parsimony informative characters and demonstrated that the full gpd amplicon could be compartmentalized to provide resolution for both shallow and deep phylogenetic branches. Exons of gpd were found to behave in a clock-like fashion for the 26 ingroup taxa and select outgroups. In general, gpd was found to hold great promise not only for improving resolution of chloroplast-derived phylogenies, but also for phylogenetic reconstruction of recent, diversifying lineages.

    View details for Web of Science ID 000179028400002

    View details for PubMedID 12383747

  • SELLING EXPERIMENT TREATMENT HASTINGS CENTER REPORT Oldham, R. K. 1990; 20 (6): 43-44

    View details for Web of Science ID A1990EK67800024

    View details for PubMedID 2283290