Academic Appointments

Administrative Appointments

  • Scientific Program Chair, AMIA Summit on Translational Bioinformatics (2011 - 2012)
  • Member, Cancer Institute Informatics Steering Committee (2011 - Present)
  • Advisory Committee Member, Stanford Center for Clinical Informatics (2011 - 2012)
  • Advisory Board Member, Medicine X (2011 - 2012)
  • Assistant Director, Stanford Center for Biomedical Research (BMIR) (2013 - Present)

Honors & Awards

  • Outstanding paper award, Summit on Translational Bioinformatics (March 2008)
  • Outstanding paper award, AMIA Summit on Translational Bioinformatics (March 2009)
  • Distinguished paper award, AMIA Summit on Translational Bioinformatics (March 2011)
  • Biosciences Faculty Award recognizing outstanding teaching contributions, Stanford School of Medicine (June 2012)
  • Ramoni Best paper award, AMIA Summit on Translational Bioinformatics (March 2013)
  • New Investigator Award, American Medical Informatics Association (AMIA) (November 2013)

Professional Education

  • Postdoctoral, Stanford University, Biomedical Informatics (2007)
  • PhD, The Pennsylvania State University, Molecular Medicine (2005)
  • MBBS, Baroda Medical College, Medicine (1999)

Research & Scholarship

Current Research and Scholarly Interests

We develop methods to analyze large unstructured data sets for data-driven medicine. We use ontologies to annotate, index and analyze Big Data in biomedicine for enabling data-driven decision making in medicine and health care. Our research group is part of the Center for Biomedical Informatics Research at Stanford and the National Center for Biomedical Ontology.

Data driven medicine: The goal is to combine machine learning, text-mining, and prior knowledge in medical ontologies to discover hidden trends, build risk models, drive data driven decision making, and comparative effectiveness studies. We have developed methods that transform unstructured patient notes into a de-identified, temporally ordered, patient-feature matrix (Imagine it as row = patient, column = medical concept, 1 = present, 0 = absent). With the resulting high-throughput data, we can monitor for adverse drug events, learn drug-drug interactions, identify off-label drug usage, generate practice-based evidence for difficult-to-test clinical hypotheses, and generate phenotypic fingerprints as well as build predictive models. We have efforts around combining multiple information sources for drug safety surveillance, which were recently the focus of a commentary titled Advancing the Science of Pharmacovigilance.

Annotation Analytics: In order to understand the “gene lists” from analysis of high-throughput data, researchers routinely use Gene Ontology based analyses. With available methods for automated annotation and the existence of over 200 biomedical ontologies, we can stop using just GO and move to enrichment analysis using disease ontologies.


2015-16 Courses

Stanford Advisees

Graduate and Fellowship Programs


All Publications

  • Comment on: "Zoo or savannah? Choice of training ground for evidence-based pharmacovigilance". Drug safety Harpaz, R., DuMouchel, W., Shah, N. H. 2015; 38 (1): 113-114

    View details for DOI 10.1007/s40264-014-0245-9

    View details for PubMedID 25432779

  • Bringing cohort studies to the bedside: framework for a "green button' to support clinical decision-making JOURNAL OF COMPARATIVE EFFECTIVENESS RESEARCH Gallego, B., Walter, S. R., Day, R. O., Dunn, A. G., Sivaraman, V., Shah, N., Longhurst, C. A., Coiera, E. 2015; 4 (3): 191-197

    View details for DOI 10.2217/cer.15.12

    View details for Web of Science ID 000355701500002

  • Analyzing search behavior of healthcare professionals for drug safety surveillance. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Odgers, D. J., Harpaz, R., Callahan, A., Stiglic, G., Shah, N. H. 2015; 20: 306-317


    Post-market drug safety surveillance is hugely important and is a significant challenge despite the existence of adverse event (AE) reporting systems. Here we describe a preliminary analysis of search logs from healthcare professionals as a source for detecting adverse drug events. We annotate search log query terms with biomedical terminologies for drugs and events, and then perform a statistical analysis to identify associations among drugs and events within search sessions. We evaluate our approach using two different types of reference standards consisting of known adverse drug events (ADEs) and negative controls. Our approach achieves a discrimination accuracy of 0.85 in terms of the area under the receiver operator curve (AUC) for the reference set of well-established ADEs and an AUC of 0.68 for the reference set of recently labeled ADEs. We also find that the majority of associations in the reference sets have support in the search log data. Despite these promising results additional research is required to better understand users' search behavior, biasing factors, and the overall utility of analyzing healthcare professional search logs for drug safety surveillance.

    View details for PubMedID 25592591

  • Functional evaluation of out-of-the-box text-mining tools for data-mining tasks JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Jung, K., LePendu, P., Iyer, S., Bauer-Mehren, A., Percha, B., Shah, N. H. 2015; 22 (1): 121-131


    The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications.We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publically available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks.There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets.For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population level analyses can advance adoption of NLP in practice.

    View details for DOI 10.1136/amiajnl-2014-002902

    View details for Web of Science ID 000352771100014

    View details for PubMedID 25336595

  • Toward personalizing treatment for depression: predicting diagnosis and severity JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Huang, S. H., LePendu, P., Iyer, S. V., Tai-Seale, M., Carrell, D., Shah, N. H. 2014; 21 (6): 1069-1075
  • Repurposing cAMP-Modulating Medications to Promote beta-Cell Replication MOLECULAR ENDOCRINOLOGY Zhao, Z., Low, Y. S., Armstrong, N. A., Ryu, J. H., Sun, S. A., Arvanites, A. C., Hollister-Lock, J., Shah, N. H., Weir, G. C., Annes, J. P. 2014; 28 (10): 1682-1697


    Loss of β-cell mass is a cardinal feature of diabetes. Consequently, developing medications to promote β-cell regeneration is a priority. 3'-5'-Cyclic adenosine monophosphate (cAMP) is an intracellular second messenger that modulates β-cell replication. We investigated whether medications that increase cAMP stability or synthesis selectively stimulate β-cell growth. To identify cAMP stabilizing medications that promote β-cell replication we performed high-content screening of a phosphodiesterase-inhibitor (PDE-I) library. PDE3,4 and 10 inhibitors, including dipyridamole, were found to promote β-cell replication in an adenosine receptor-dependent manner. Dipyridamole's action is specific for β-cells and not α-cells. Next we demonstrated that norepinephrine (NE), a physiologic suppressor of cAMP synthesis in β-cells, impairs β-cell replication via activation of α2-adrenergic receptors. Accordingly, mirtazapine, an α2-adrenergic receptor antagonist and antidepressant, prevents NE-dependent suppression of β-cell replication. Interestingly, NE's growth-suppressive effect is modulated by endogenously expressed catecholamine-inactivating enzymes (COMT and MAO) and is dominant over the growth-promoting effects of PDE-Is. Treatment with dipyridamole and/or mirtazapine promote β-cell replication in mice and treatment with dipyridamole is associated with reduced glucose levels in humans. This work provides new mechanistic insights into cAMP-dependent growth regulation of β-cells and highlights the potential of commonly prescribed medications to influence β-cell growth.

    View details for DOI 10.1210/me.2014-1120

    View details for Web of Science ID 000346837000010

  • Text Mining for Adverse Drug Events: the Promise, Challenges, and State of the Art DRUG SAFETY Harpaz, R., Callahan, A., Tamang, S., Low, Y., Odgers, D., Finlayson, S., Jung, K., LePendu, P., Shah, N. H. 2014; 37 (10): 777-790
  • Toward enhanced pharmacovigilance using patient-generated data on the internet. Clinical pharmacology & therapeutics WHITE, R. W., Harpaz, R., Shah, N. H., Dumouchel, W., Horvitz, E. 2014; 96 (2): 239-246


    The promise of augmenting pharmacovigilance with patient-generated data drawn from the Internet was called out by a scientific committee charged with conducting a review of the current and planned pharmacovigilance practices of the US Food and Drug Administration (FDA). To this end, we present a study on harnessing behavioral data drawn from Internet search logs to detect adverse drug reactions (ADRs). By analyzing search queries collected from 80 million consenting users and by using a widely recognized benchmark of ADRs, we found that the performance of ADR detection via search logs is comparable and complementary to detection based on the FDA's adverse event reporting system (AERS). We show that by jointly leveraging data from the AERS and search logs, the accuracy of ADR detection can be improved by 19% relative to the use of each data source independently. The results suggest that leveraging nontraditional sources such as online search logs could supplement existing pharmacovigilance approaches.

    View details for DOI 10.1038/clpt.2014.77

    View details for PubMedID 24713590

  • A 'green button' for using aggregate patient data at the point of care. Health affairs Longhurst, C. A., Harrington, R. A., Shah, N. H. 2014; 33 (7): 1229-1235


    Randomized controlled trials have traditionally been the gold standard against which all other sources of clinical evidence are measured. However, the cost of conducting these trials can be prohibitive. In addition, evidence from the trials frequently rests on narrow patient-inclusion criteria and thus may not generalize well to real clinical situations. Given the increasing availability of comprehensive clinical data in electronic health records (EHRs), some health system leaders are now advocating for a shift away from traditional trials and toward large-scale retrospective studies, which can use practice-based evidence that is generated as a by-product of clinical processes. Other thought leaders in clinical research suggest that EHRs should be used to lower the cost of trials by integrating point-of-care randomization and data capture into clinical processes. We believe that a successful learning health care system will require both approaches, and we suggest a model that resolves this escalating tension: a "green button" function within EHRs to help clinicians leverage aggregate patient data for decision making at the point of care. Giving clinicians such a tool would support patient care decisions in the absence of gold-standard evidence and would help prioritize clinical questions for which EHR-enabled randomization should be carried out. The privacy rule in the Health Insurance Portability and Accountability Act (HIPAA) of 1996 may require revision to support this novel use of patient data.

    View details for DOI 10.1377/hlthaff.2014.0099

    View details for PubMedID 25006150

  • Response to letters regarding article, "unexpected effect of proton pump inhibitors: elevation of the cardiovascular risk factor asymmetric dimethylarginine". Circulation Ghebremariam, Y. T., Lee, J. C., LePendu, P., Erlanson, D. A., Slaviero, A., Shah, N. H., Leiper, J. M., Cooke, J. P. 2014; 129 (13)

    View details for DOI 10.1161/CIRCULATIONAHA.114.009343

    View details for PubMedID 24687654

  • Mining clinical text for signals of adverse drug-drug interactions JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Iyer, S. V., Harpaz, R., LePendu, P., Bauer-Mehren, A., Shah, N. H. 2014; 21 (2): 353-362


    Electronic health records (EHRs) are increasingly being used to complement the FDA Adverse Event Reporting System (FAERS) and to enable active pharmacovigilance. Over 30% of all adverse drug reactions are caused by drug-drug interactions (DDIs) and result in significant morbidity every year, making their early identification vital. We present an approach for identifying DDI signals directly from the textual portion of EHRs.We recognize mentions of drug and event concepts from over 50 million clinical notes from two sites to create a timeline of concept mentions for each patient. We then use adjusted disproportionality ratios to identify significant drug-drug-event associations among 1165 drugs and 14 adverse events. To validate our results, we evaluate our performance on a gold standard of 1698 DDIs curated from existing knowledge bases, as well as with signaling DDI associations directly from FAERS using established methods.Our method achieves good performance, as measured by our gold standard (area under the receiver operator characteristic (ROC) curve >80%), on two independent EHR datasets and the performance is comparable to that of signaling DDIs from FAERS. We demonstrate the utility of our method for early detection of DDIs and for identifying alternatives for risky drug combinations. Finally, we publish a first of its kind database of population event rates among patients on drug combinations based on an EHR corpus.It is feasible to identify DDI signals and estimate the rate of adverse events among patients on drug combinations, directly from clinical text; this could have utility in prioritizing drug interaction surveillance as well as in clinical decision support.

    View details for DOI 10.1136/amiajnl-2013-001612

    View details for Web of Science ID 000331263600023

  • Automated Detection of Off-Label Drug Use PLOS ONE Jung, K., LePendu, P., Chen, W. S., Iyer, S. V., Readhead, B., Dudley, J. T., Shah, N. H. 2014; 9 (2)
  • Automated detection of off-label drug use. PloS one Jung, K., LePendu, P., Chen, W. S., Iyer, S. V., Readhead, B., Dudley, J. T., Shah, N. H. 2014; 9 (2)


    Off-label drug use, defined as use of a drug in a manner that deviates from its approved use defined by the drug's FDA label, is problematic because such uses have not been evaluated for safety and efficacy. Studies estimate that 21% of prescriptions are off-label, and only 27% of those have evidence of safety and efficacy. We describe a data-mining approach for systematically identifying off-label usages using features derived from free text clinical notes and features extracted from two databases on known usage (Medi-Span and DrugBank). We trained a highly accurate predictive model that detects novel off-label uses among 1,602 unique drugs and 1,472 unique indications. We validated 403 predicted uses across independent data sources. Finally, we prioritize well-supported novel usages for further investigation on the basis of drug safety and cost.

    View details for DOI 10.1371/journal.pone.0089324

    View details for PubMedID 24586689

  • Profiling risk factors for chronic uveitis in juvenile idiopathic arthritis: a new model for EHR-based research PEDIATRIC RHEUMATOLOGY Cole, T. S., Frankovich, J., Iyer, S., LePendu, P., Bauer-Mehren, A., Shah, N. H. 2013; 11
  • Mining the ultimate phenome repository NATURE BIOTECHNOLOGY Shah, N. H. 2013; 31 (12): 1095-1097

    View details for DOI 10.1038/nbt.2757

    View details for Web of Science ID 000328251900020

    View details for PubMedID 24316646

  • Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Lyalina, S., Percha, B., LePendu, P., Iyer, S. V., Altman, R. B., Shah, N. H. 2013; 20 (E2): E297-E305
  • Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records. Journal of the American Medical Informatics Association Lyalina, S., Percha, B., LePendu, P., Iyer, S. V., Altman, R. B., Shah, N. H. 2013; 20 (e2): e297-305


    Mental illness is the leading cause of disability in the USA, but boundaries between different mental illnesses are notoriously difficult to define. Electronic medical records (EMRs) have recently emerged as a powerful new source of information for defining the phenotypic signatures of specific diseases. We investigated how EMR-based text mining and statistical analysis could elucidate the phenotypic boundaries of three important neuropsychiatric illnesses-autism, bipolar disorder, and schizophrenia.We analyzed the medical records of over 7000 patients at two facilities using an automated text-processing pipeline to annotate the clinical notes with Unified Medical Language System codes and then searching for enriched codes, and associations among codes, that were representative of the three disorders. We used dimensionality-reduction techniques on individual patient records to understand individual-level phenotypic variation within each disorder, as well as the degree of overlap among disorders.We demonstrate that automated EMR mining can be used to extract relevant drugs and phenotypes associated with neuropsychiatric disorders and characteristic patterns of associations among them. Patient-level analyses suggest a clear separation between autism and the other disorders, while revealing significant overlap between schizophrenia and bipolar disorder. They also enable localization of individual patients within the phenotypic 'landscape' of each disorder.Because EMRs reflect the realities of patient care rather than idealized conceptualizations of disease states, we argue that automated EMR mining can help define the boundaries between different mental illnesses, facilitate cohort building for clinical and genomic studies, and reveal how clear expert-defined disease boundaries are in practice.

    View details for DOI 10.1136/amiajnl-2013-001933

    View details for PubMedID 23956017

  • Response to "Logistic regression in signal detection: another piece added to the puzzle". Clinical pharmacology & therapeutics Harpaz, R., Dumouchel, W., Lependu, P., Bauer-Mehren, A., Ryan, P., Shah, N. H. 2013; 94 (3): 313-?

    View details for DOI 10.1038/clpt.2013.125

    View details for PubMedID 23756371

  • Unexpected Effect of Proton Pump Inhibitors: Elevation of the Cardiovascular Risk Factor Asymmetric Dimethylarginine CIRCULATION Ghebremariam, Y. T., LePendu, P., Lee, J. C., Erlanson, D. A., Slaviero, A., Shah, N. H., Leiper, J., Cooke, J. P. 2013; 128 (8): 845-853


    Proton pump inhibitors (PPIs) are gastric acid-suppressing agents widely prescribed for the treatment of gastroesophageal reflux disease. Recently, several studies in patients with acute coronary syndrome have raised the concern that use of PPIs in these patients may increase their risk of major adverse cardiovascular events. The mechanism of this possible adverse effect is not known. Whether the general population might also be at risk has not been addressed.Plasma asymmetrical dimethylarginine (ADMA) is an endogenous inhibitor of nitric oxide synthase. Elevated plasma ADMA is associated with increased risk for cardiovascular disease, likely because of its attenuation of the vasoprotective effects of endothelial nitric oxide synthase. We find that PPIs elevate plasma ADMA levels and reduce nitric oxide levels and endothelium-dependent vasodilation in a murine model and ex vivo human tissues. PPIs increase ADMA because they bind to and inhibit dimethylarginine dimethylaminohydrolase, the enzyme that degrades ADMA.We present a plausible biological mechanism to explain the association of PPIs with increased major adverse cardiovascular events in patients with unstable coronary syndromes. Of concern, this adverse mechanism is also likely to extend to the general population using PPIs. This finding compels additional clinical investigations and pharmacovigilance directed toward understanding the cardiovascular risk associated with the use of the PPIs in the general population.

    View details for DOI 10.1161/CIRCULATIONAHA.113.003602

    View details for Web of Science ID 000323651700018

    View details for PubMedID 23825361

  • Performance of Pharmacovigilance Signal-Detection Algorithms for the FDA Adverse Event Reporting System CLINICAL PHARMACOLOGY & THERAPEUTICS Harpaz, R., Dumouchel, W., Lependu, P., Bauer-Mehren, A., Ryan, P., Shah, N. H. 2013; 93 (6): 539-546


    Signal-detection algorithms (SDAs) are recognized as vital tools in pharmacovigilance. However, their performance characteristics are generally unknown. By leveraging a unique gold standard recently made public by the Observational Medical Outcomes Partnership (OMOP) and by conducting a unique systematic evaluation, we provide new insights into the diagnostic potential and characteristics of SDAs that are routinely applied to the US Food and Drug Administration (FDA) Adverse Event Reporting System (AERS). We find that SDAs can attain reasonable predictive accuracy in signaling adverse events. Two performance classes emerge, indicating that the class of approaches that address confounding and masking effects benefits safety surveillance. Our study shows that not all events are equally detectable, suggesting that specific events might be monitored more effectively using other data sources. We provide performance guidelines for several operating scenarios to inform the trade-off between sensitivity and specificity for specific use cases. We also propose an approach and demonstrate its application in identifying optimal signaling thresholds, given specific misclassification tolerances.

    View details for DOI 10.1038/clpt.2013.24

    View details for Web of Science ID 000319304800024

    View details for PubMedID 23571771

  • Pharmacovigilance Using Clinical Notes CLINICAL PHARMACOLOGY & THERAPEUTICS LePendu, P., Iyer, S. V., Bauer-Mehren, A., Harpaz, R., MORTENSEN, J. M., Podchiyska, T., Ferris, T. A., Shah, N. H. 2013; 93 (6): 547-555


    With increasing adoption of electronic health records (EHRs), there is an opportunity to use the free-text portion of EHRs for pharmacovigilance. We present novel methods that annotate the unstructured clinical notes and transform them into a deidentified patient-feature matrix encoded using medical terminologies. We demonstrate the use of the resulting high-throughput data for detecting drug-adverse event associations and adverse events associated with drug-drug interactions. We show that these methods flag adverse events early (in most cases before an official alert), allow filtering of spurious signals by adjusting for potential confounding, and compile prevalence information. We argue that analyzing large volumes of free-text clinical notes enables drug safety surveillance using a yet untapped data source. Such data mining can be used for hypothesis generation and for rapid analysis of suspected adverse event risk.

    View details for DOI 10.1038/clpt.2013.47

    View details for Web of Science ID 000319304800025

  • Practice-Based Evidence: Profiling the Safety of Cilostazol by Text-Mining of Clinical Notes PLOS ONE Leeper, N. J., Bauer-Mehren, A., Iyer, S. V., LePendu, P., Olson, C., Shah, N. H. 2013; 8 (5)
  • Combing signals from spontaneous reports and electronic health records for detection of adverse drug reactions JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Harpaz, R., Vilar, S., DuMouchel, W., Salmasian, H., Haerian, K., Shah, N. H., Chase, H. S., Friedman, C. 2013; 20 (3): 413-419


    Data-mining algorithms that can produce accurate signals of potentially novel adverse drug reactions (ADRs) are a central component of pharmacovigilance. We propose a signal-detection strategy that combines the adverse event reporting system (AERS) of the Food and Drug Administration and electronic health records (EHRs) by requiring signaling in both sources. We claim that this approach leads to improved accuracy of signal detection when the goal is to produce a highly selective ranked set of candidate ADRs.Our investigation was based on over 4 million AERS reports and information extracted from 1.2 million EHR narratives. Well-established methodologies were used to generate signals from each source. The study focused on ADRs related to three high-profile serious adverse reactions. A reference standard of over 600 established and plausible ADRs was created and used to evaluate the proposed approach against a comparator.The combined signaling system achieved a statistically significant large improvement over AERS (baseline) in the precision of top ranked signals. The average improvement ranged from 31% to almost threefold for different evaluation categories. Using this system, we identified a new association between the agent, rasburicase, and the adverse event, acute pancreatitis, which was supported by clinical review.The results provide promising initial evidence that combining AERS with EHRs via the framework of replicated signaling can improve the accuracy of signal detection for certain operating scenarios. The use of additional EHR data is required to further evaluate the capacity and limits of this system and to extend the generalizability of these results.

    View details for DOI 10.1136/amiajnl-2012-000930

    View details for Web of Science ID 000317477500003

    View details for PubMedID 23118093

  • Selected papers from the 15th Annual Bio-Ontologies Special Interest Group Meeting. Journal of biomedical semantics Soldatova, L. N., Sansone, S., Dumontier, M., Shah, N. H. 2013; 4: I1-?


    Over the 15 years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in the bio-ontologies development, its applications to biomedicine and more generally the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The seven papers and the commentary selected for this supplement span a wide range of topics including: web-based querying over multiple ontologies, integration of data, annotating patent records, NCBO Web services, ontology developments for probabilistic reasoning and for physiological processes, and analysis of the progress of annotation and structural GO changes.

    View details for DOI 10.1186/2041-1480-4-S1-I1

    View details for PubMedID 23735191

  • STOP using just GO: a multi-ontology hypothesis generation tool for high throughput experimentation BMC BIOINFORMATICS Wittkop, T., Teravest, E., Evani, U. S., Fleisch, K. M., Berman, A. E., Powell, C., Shah, N. H., Mooney, S. D. 2013; 14


    Gene Ontology (GO) enrichment analysis remains one of the most common methods for hypothesis generation from high throughput datasets. However, we believe that researchers strive to test other hypotheses that fall outside of GO. Here, we developed and evaluated a tool for hypothesis generation from gene or protein lists using ontological concepts present in manually curated text that describes those genes and proteins.As a consequence we have developed the method Statistical Tracking of Ontological Phrases (STOP) that expands the realm of testable hypotheses in gene set enrichment analyses by integrating automated annotations of genes to terms from over 200 biomedical ontologies. While not as precise as manually curated terms, we find that the additional enriched concepts have value when coupled with traditional enrichment analyses using curated terms.Multiple ontologies have been developed for gene and protein annotation, by using a dataset of both manually curated GO terms and automatically recognized concepts from curated text we can expand the realm of hypotheses that can be discovered. The web application STOP is available at

    View details for DOI 10.1186/1471-2105-14-53

    View details for Web of Science ID 000318030400001

    View details for PubMedID 23409969

  • Profiling risk factors for chronic uveitis in juvenile idiopathic arthritis: a new model for EHR-based research. Pediatric rheumatology online journal Cole, T. S., Frankovich, J., Iyer, S., LePendu, P., Bauer-Mehren, A., Shah, N. H. 2013; 11 (1): 45-?


    Juvenile idiopathic arthritis is the most common rheumatic disease in children. Chronic uveitis is a common and serious comorbid condition of juvenile idiopathic arthritis, with insidious presentation and potential to cause blindness. Knowledge of clinical associations will improve risk stratification. Based on clinical observation, we hypothesized that allergic conditions are associated with chronic uveitis in juvenile idiopathic arthritis patients.This study is a retrospective cohort study using Stanford's clinical data warehouse containing data from Lucile Packard Children's Hospital from 2000-2011 to analyze patient characteristics associated with chronic uveitis in a large juvenile idiopathic arthritis cohort. Clinical notes in patients under 16 years of age were processed via a validated text analytics pipeline. Bivariate-associated variables were used in a multivariate logistic regression adjusted for age, gender, and race. Previously reported associations were evaluated to validate our methods. The main outcome measure was presence of terms indicating allergy or allergy medications use overrepresented in juvenile idiopathic arthritis patients with chronic uveitis. Residual text features were then used in unsupervised hierarchical clustering to compare clinical text similarity between patients with and without uveitis.Previously reported associations with uveitis in juvenile idiopathic arthritis patients (earlier age at arthritis diagnosis, oligoarticular-onset disease, antinuclear antibody status, history of psoriasis) were reproduced in our study. Use of allergy medications and terms describing allergic conditions were independently associated with chronic uveitis. The association with allergy drugs when adjusted for known associations remained significant (OR 2.54, 95% CI 1.22-5.4).This study shows the potential of using a validated text analytics pipeline on clinical data warehouses to examine practice-based evidence for evaluating hypotheses formed during patient care. Our study reproduces four known associations with uveitis development in juvenile idiopathic arthritis patients, and reports a new association between allergic conditions and chronic uveitis in juvenile idiopathic arthritis patients.

    View details for DOI 10.1186/1546-0096-11-45

    View details for PubMedID 24299016

  • Automated Detection of Systematic Off-label Drug Use in Free Text of Electronic Medical Records. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Jung, K., LePendu, P., Shah, N. 2013; 2013: 94-98


    Off-label use of a drug occurs when it is used in a manner that deviates from its FDA label. Studies estimate that 21% of prescriptions are off-label, with only 27% of those uses supported by evidence of safety and efficacy. We have developed methods to detect population level off-label usage using computationally efficient annotation of free text from clinical notes to generate features encoding empirical information about drug-disease mentions. By including additional features encoding prior knowledge about drugs, diseases, and known usage, we trained a highly accurate predictive model that was used to detect novel candidate off-label usages in a very large clinical corpus. We show that the candidate uses are plausible and can be prioritized for further analysis in terms of safety and efficacy.

    View details for PubMedID 24303308

  • Network analysis of unstructured EHR data for clinical research. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Bauer-Mehren, A., LePendu, P., Iyer, S. V., Harpaz, R., Leeper, N. J., Shah, N. H. 2013; 2013: 14-18


    In biomedical research, network analysis provides a conceptual framework for interpreting data from high-throughput experiments. For example, protein-protein interaction networks have been successfully used to identify candidate disease genes. Recently, advances in clinical text processing and the increasing availability of clinical data have enabled analogous analyses on data from electronic medical records. We constructed networks of diseases, drugs, medical devices and procedures using concepts recognized in clinical notes from the Stanford clinical data warehouse. We demonstrate the use of the resulting networks for clinical research informatics in two ways-cohort construction and outcomes analysis-by examining the safety of cilostazol in peripheral artery disease patients as a use case. We show that the network-based approaches can be used for constructing patient cohorts as well as for analyzing differences in outcomes by comparing with standard methods, and discuss the advantages offered by network-based approaches.

    View details for PubMedID 24303229

  • Practice-based evidence: profiling the safety of cilostazol by text-mining of clinical notes. PloS one Leeper, N. J., Bauer-Mehren, A., Iyer, S. V., LePendu, P., Olson, C., Shah, N. H. 2013; 8 (5)


    Peripheral arterial disease (PAD) is a growing problem with few available therapies. Cilostazol is the only FDA-approved medication with a class I indication for intermittent claudication, but carries a black box warning due to concerns for increased cardiovascular mortality. To assess the validity of this black box warning, we employed a novel text-analytics pipeline to quantify the adverse events associated with Cilostazol use in a clinical setting, including patients with congestive heart failure (CHF).We analyzed the electronic medical records of 1.8 million subjects from the Stanford clinical data warehouse spanning 18 years using a novel text-mining/statistical analytics pipeline. We identified 232 PAD patients taking Cilostazol and created a control group of 1,160 PAD patients not taking this drug using 1∶5 propensity-score matching. Over a mean follow up of 4.2 years, we observed no association between Cilostazol use and any major adverse cardiovascular event including stroke (OR = 1.13, CI [0.82, 1.55]), myocardial infarction (OR = 1.00, CI [0.71, 1.39]), or death (OR = 0.86, CI [0.63, 1.18]). Cilostazol was not associated with an increase in any arrhythmic complication. We also identified a subset of CHF patients who were prescribed Cilostazol despite its black box warning, and found that it did not increase mortality in this high-risk group of patients.This proof of principle study shows the potential of text-analytics to mine clinical data warehouses to uncover 'natural experiments' such as the use of Cilostazol in CHF patients. We envision this method will have broad applications for examining difficult to test clinical hypotheses and to aid in post-marketing drug safety surveillance. Moreover, our observations argue for a prospective study to examine the validity of a drug safety warning that may be unnecessarily limiting the use of an efficacious therapy.

    View details for DOI 10.1371/journal.pone.0063499

    View details for PubMedID 23717437

  • Mining Biomedical Ontologies and Data Using RDF Hypergraphs 2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2013), VOL 1 Liu, H., Dou, D., Jin, R., LePendu, P., Shah, N. 2013: 141-146
  • Chapter 9: Analyses Using Disease Ontologies PLOS COMPUTATIONAL BIOLOGY Shah, N. H., Cole, T., Musen, M. A. 2012; 8 (12)


    Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of "significant genes." One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene-set, and is widely used to makes sense of the results of high-throughput experiments. The canonical example of enrichment analysis is when the output dataset is a list of genes differentially expressed in some condition. To determine the biological relevance of a lengthy gene list, the usual solution is to perform enrichment analysis with the GO. We can aggregate the annotating GO concepts for each gene in this list, and arrive at a profile of the biological processes or mechanisms affected by the condition under study. While GO has been the principal target for enrichment analysis, the methods of enrichment analysis are generalizable. We can conduct the same sort of profiling along other ontologies of interest. Just as scientists can ask "Which biological process is over-represented in my set of interesting genes or proteins?" we can also ask "Which disease (or class of diseases) is over-represented in my set of interesting genes or proteins?". For example, by annotating known protein mutations with disease terms from the ontologies in BioPortal, Mort et al. recently identified a class of diseases--blood coagulation disorders--that were associated with a 14-fold depletion in substitutions at O-linked glycosylation sites. With the availability of tools for automatic annotation of datasets with terms from disease ontologies, there is no reason to restrict enrichment analyses to the GO. In this chapter, we will discuss methods to perform enrichment analysis using any ontology available in the biomedical domain. We will review the general methodology of enrichment analysis, the associated challenges, and discuss the novel translational analyses enabled by the existence of public, national computational infrastructure and by the use of disease ontologies in such analyses.

    View details for DOI 10.1371/journal.pcbi.1002827

    View details for Web of Science ID 000312901500032

    View details for PubMedID 23300417

  • Mining the pharmacogenomics literature-a survey of the state of the art BRIEFINGS IN BIOINFORMATICS Hahn, U., Cohen, K. B., Garten, Y., Shah, N. H. 2012; 13 (4): 460-494


    This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.

    View details for DOI 10.1093/bib/bbs018

    View details for Web of Science ID 000306925000007

    View details for PubMedID 22833496

  • The coming age of data-driven medicine: translational bioinformatics' next frontier JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Shah, N. H., Tenenbaum, J. D. 2012; 19 (E1): E2-E4

    View details for DOI 10.1136/amiajnl-2012-000969

    View details for Web of Science ID 000314151400002

    View details for PubMedID 22718035

  • Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Wu, S. T., Liu, H., Li, D., Tao, C., Musen, M. A., Chute, C. G., Shah, N. H. 2012; 19 (E1): E149-E156


    To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources.Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data.For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106426 and 94788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms.The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.

    View details for DOI 10.1136/amiajnl-2011-000744

    View details for Web of Science ID 000314151400025

    View details for PubMedID 22493050

  • Novel Data-Mining Methodologies for Adverse Drug Event Discovery and Analysis CLINICAL PHARMACOLOGY & THERAPEUTICS Harpaz, R., Dumouchel, W., Shah, N. H., Madigan, D., Ryan, P., Friedman, C. 2012; 91 (6): 1010-1021


    An important goal of the health system is to identify new adverse drug events (ADEs) in the postapproval period. Datamining methods that can transform data into meaningful knowledge to inform patient safety have proven essential for this purpose. New opportunities have emerged to harness data sources that have not been used within the traditional framework. This article provides an overview of recent methodological innovations and data sources used to support ADE discovery and analysis.

    View details for DOI 10.1038/clpt.2012.50

    View details for Web of Science ID 000304245800019

    View details for PubMedID 22549283

  • Using ontology-based annotation to profile disease research JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Liu, Y., Coulet, A., LePendu, P., Shah, N. H. 2012; 19 (E1): E177-E186


    Profiling the allocation and trend of research activity is of interest to funding agencies, administrators, and researchers. However, the lack of a common classification system hinders the comprehensive and systematic profiling of research activities. This study introduces ontology-based annotation as a method to overcome this difficulty. Analyzing over a decade of funding data and publication data, the trends of disease research are profiled across topics, across institutions, and over time.This study introduces and explores the notions of research sponsorship and allocation and shows that leaders of research activity can be identified within specific disease areas of interest, such as those with high mortality or high sponsorship. The funding profiles of disease topics readily cluster themselves in agreement with the ontology hierarchy and closely mirror the funding agency priorities. Finally, four temporal trends are identified among research topics.This work utilizes disease ontology (DO)-based annotation to profile effectively the landscape of biomedical research activity. By using DO in this manner a use-case driven mechanism is also proposed to evaluate the utility of classification hierarchies.

    View details for DOI 10.1136/amiajnl-2011-000631

    View details for Web of Science ID 000314151400029

    View details for PubMedID 22494789

  • The National Center for Biomedical Ontology JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Musen, M. A., Noy, N. F., Shah, N. H., Whetzel, P. L., Chute, C. G., Story, M., Smith, B. 2012; 19 (2): 190-195


    The National Center for Biomedical Ontology is now in its seventh year. The goals of this National Center for Biomedical Computing are to: create and maintain a repository of biomedical ontologies and terminologies; build tools and web services to enable the use of ontologies and terminologies in clinical and translational research; educate their trainees and the scientific community broadly about biomedical ontology and ontology-based technology and best practices; and collaborate with a variety of groups who develop and use ontologies and terminologies in biomedicine. The centerpiece of the National Center for Biomedical Ontology is a web-based resource known as BioPortal. BioPortal makes available for research in computationally useful forms more than 270 of the world's biomedical ontologies and terminologies, and supports a wide range of web services that enable investigators to use the ontologies to annotate and retrieve data, to generate value sets and special-purpose lexicons, and to perform advanced analytics on a wide range of biomedical data.

    View details for DOI 10.1136/amiajnl-2011-000523

    View details for Web of Science ID 000300768100010

    View details for PubMedID 22081220

  • Translational bioinformatics embraces big data. Yearbook of medical informatics Shah, N. H. 2012; 7 (1): 130-134


    We review the latest trends and major developments in translational bioinformatics in the year 2011-2012. Our emphasis is on highlighting the key events in the field and pointing at promising research areas for the future. The key take-home points are: • Translational informatics is ready to revolutionize human health and healthcare using large-scale measurements on individuals. • Data-centric approaches that compute on massive amounts of data (often called "Big Data") to discover patterns and to make clinically relevant predictions will gain adoption. • Research that bridges the latest multimodal measurement technologies with large amounts of electronic healthcare data is increasing; and is where new breakthroughs will occur.

    View details for PubMedID 22890354

  • Using temporal patterns in medical records to discern adverse drug events from indications. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science Liu, Y., LePendu, P., Iyer, S., Shah, N. H. 2012; 2012: 47-56


    Researchers estimate that electronic health record systems record roughly 2-million ambulatory adverse drug events and that patients suffer from adverse drug events in roughly 30% of hospital stays. Some have used structured databases of patient medical records and health insurance claims recently-going beyond the current paradigm of using spontaneous reporting systems like AERS-to detect drug-safety signals. However, most efforts do not use the free-text from clinical notes in monitoring for drug-safety signals. We hypothesize that drug-disease co-occurrences, extracted from ontology-based annotations of the clinical notes, can be examined for statistical enrichment and used for drug safety surveillance. When analyzing such co-occurrences of drugs and diseases, one major challenge is to differentiate whether the disease in a drug-disease pair represents an indication or an adverse event. We demonstrate that it is possible to make this distinction by combining the frequency distribution of the drug, the disease, and the drug-disease pair as well as the temporal ordering of the drugs and diseases in each pair across more than one million patients.

    View details for PubMedID 22779050

  • Selected papers from the 14th Annual Bio-Ontologies Special Interest Group Meeting. Journal of biomedical semantics Soldatova, L. N., Sansone, S., Dumontier, M., Shah, N. H. 2012; 3: I1-?


    Over the 14 years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in the bio-ontologies development, its applications to biomedicine and more generally the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The seven papers selected for this supplement span a wide range of topics including: web-based querying over multiple ontologies, integration of data from wikis, innovative methods of annotating and mining electronic health records, advances in annotating web documents and biomedical literature, quality control of ontology alignments, and the ontology support for predictive models about toxicity and open access to the toxicity data.

    View details for DOI 10.1186/2041-1480-3-S1-I1

    View details for PubMedID 22541591

  • Annotation Analysis for Testing Drug Safety Signals using Unstructured Clinical Notes. Journal of biomedical semantics LePendu, P., Iyer, S. V., Fairon, C., Shah, N. H. 2012; 3: S5-?


    The electronic surveillance for adverse drug events is largely based upon the analysis of coded data from reporting systems. Yet, the vast majority of electronic health data lies embedded within the free text of clinical notes and is not gathered into centralized repositories. With the increasing access to large volumes of electronic medical data-in particular the clinical notes-it may be possible to computationally encode and to test drug safety signals in an active manner.We describe the application of simple annotation tools on clinical text and the mining of the resulting annotations to compute the risk of getting a myocardial infarction for patients with rheumatoid arthritis that take Vioxx. Our analysis clearly reveals elevated risks for myocardial infarction in rheumatoid arthritis patients taking Vioxx (odds ratio 2.06) before 2005.Our results show that it is possible to apply annotation analysis methods for testing hypotheses about drug safety using electronic medical records.

    View details for DOI 10.1186/2041-1480-3-S1-S5

    View details for PubMedID 22541596

  • Analyzing patterns of drug use in clinical notes for patient safety. AMIA Summits on Translational Science proceedings AMIA Summit on Translational Science LePendu, P., Liu, Y., Iyer, S., Udell, M. R., Shah, N. H. 2012; 2012: 63-70


    Doctors prescribe drugs for indications that are not FDA approved. Research indicates that 21% of prescriptions filled are for off-label indications. Of those, more than 73% lack supporting scientific evidence. Traditional drug safety alerts may not cover usages that are not FDA approved. Therefore, analyzing patterns of off-label drug usage in the clinical setting is an important step toward reducing the incidence of adverse events and for improving patient safety. We applied term extraction tools on the clinical notes of a million patients to compile a database of statistically significant patterns of drug use. We validated some of the usage patterns learned from the data against sources of known on-label and off-label use. Given our ability to quantify adverse event risks using the clinical notes, this will enable us to address patient safety because we can now rank-order off-label drug use and prioritize the search for their adverse event profiles.

    View details for PubMedID 22779054

  • Enabling enrichment analysis with the Human Disease Ontology. Journal of biomedical informatics LePendu, P., Musen, M. A., Shah, N. H. 2011; 44: S31-8


    Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of "significant genes." One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene set, and is widely used to make sense of the results of high-throughput experiments. Our goal is to develop and apply general enrichment analysis methods to profile other sets of interest, such as patient cohorts from the electronic medical record, using a variety of ontologies including SNOMED CT, MedDRA, RxNorm, and others. Although it is possible to perform enrichment analysis using ontologies other than the GO, a key pre-requisite is the availability of a background set of annotations to enable the enrichment calculation. In the case of the GO, this background set is provided by the Gene Ontology Annotations. In the current work, we describe: (i) a general method that uses hand-curated GO annotations as a starting point for creating background datasets for enrichment analysis using other ontologies; and (ii) a gene-disease background annotation set - that enables disease-based enrichment - to demonstrate feasibility of our method.

    View details for DOI 10.1016/j.jbi.2011.04.007

    View details for PubMedID 21550421

  • NCBO Resource Index: Ontology-based search and mining of biomedical resources JOURNAL OF WEB SEMANTICS Jonquet, C., LePendu, P., Falconer, S., Coulet, A., Noy, N. F., Musen, M. A., Shah, N. H. 2011; 9 (3): 316-324


    The volume of publicly available data in biomedicine is constantly increasing. However, these data are stored in different formats and on different platforms. Integrating these data will enable us to facilitate the pace of medical discoveries by providing scientists with a unified view of this diverse information. Under the auspices of the National Center for Biomedical Ontology (NCBO), we have developed the Resource Index-a growing, large-scale ontology-based index of more than twenty heterogeneous biomedical resources. The resources come from a variety of repositories maintained by organizations from around the world. We use a set of over 200 publicly available ontologies contributed by researchers in various domains to annotate the elements in these resources. We use the semantics that the ontologies encode, such as different properties of classes, the class hierarchies, and the mappings between ontologies, in order to improve the search experience for the Resource Index user. Our user interface enables scientists to search the multiple resources quickly and efficiently using domain terms, without even being aware that there is semantics "under the hood."

    View details for DOI 10.1016/j.websem.2011.06.005

    View details for Web of Science ID 000300169800007

    View details for PubMedID 21918645

  • BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications NUCLEIC ACIDS RESEARCH Whetzel, P. L., Noy, N. F., Shah, N. H., Alexander, P. R., Nyulas, C., Tudorache, T., Musen, M. A. 2011; 39: W541-W545


    The National Center for Biomedical Ontology (NCBO) is one of the National Centers for Biomedical Computing funded under the NIH Roadmap Initiative. Contributing to the national computing infrastructure, NCBO has developed BioPortal, a web portal that provides access to a library of biomedical ontologies and terminologies ( via the NCBO Web services. BioPortal enables community participation in the evaluation and evolution of ontology content by providing features to add mappings between terms, to add comments linked to specific ontology terms and to provide ontology reviews. The NCBO Web services ( enable this functionality and provide a uniform mechanism to access ontologies from a variety of knowledge representation formats, such as Web Ontology Language (OWL) and Open Biological and Biomedical Ontologies (OBO) format. The Web services provide multi-layered access to the ontology content, from getting all terms in an ontology to retrieving metadata about a term. Users can easily incorporate the NCBO Web services into software applications to generate semantically aware applications and to facilitate structured data collection.

    View details for DOI 10.1093/nar/gkr469

    View details for Web of Science ID 000292325300088

    View details for PubMedID 21672956

  • Computationally translating molecular discoveries into tools for medicine: translational bioinformatics articles now featured in JAMIA JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Butte, A. J., Shah, N. H. 2011; 18 (4): 352-353

    View details for DOI 10.1136/amiajnl-2011-000343

    View details for Web of Science ID 000292061700002

    View details for PubMedID 21672904

  • Mapping between the OBO and OWL ontology languages. Journal of biomedical semantics Tirmizi, S. H., Aitken, S., Moreira, D. A., Mungall, C., Sequeda, J., Shah, N. H., Miranker, D. P. 2011; 2: S3-?


    Ontologies are commonly used in biomedicine to organize concepts to describe domains such as anatomies, environments, experiment, taxonomies etc. NCBO BioPortal currently hosts about 180 different biomedical ontologies. These ontologies have been mainly expressed in either the Open Biomedical Ontology (OBO) format or the Web Ontology Language (OWL). OBO emerged from the Gene Ontology, and supports most of the biomedical ontology content. In comparison, OWL is a Semantic Web language, and is supported by the World Wide Web consortium together with integral query languages, rule languages and distributed infrastructure for information interchange. These features are highly desirable for the OBO content as well. A convenient method for leveraging these features for OBO ontologies is by transforming OBO ontologies to OWL.We have developed a methodology for translating OBO ontologies to OWL using the organization of the Semantic Web itself to guide the work. The approach reveals that the constructs of OBO can be grouped together to form a similar layer cake. Thus we were able to decompose the problem into two parts. Most OBO constructs have easy and obvious equivalence to a construct in OWL. A small subset of OBO constructs requires deeper consideration. We have defined transformations for all constructs in an effort to foster a standard common mapping between OBO and OWL. Our mapping produces OWL-DL, a Description Logics based subset of OWL with desirable computational properties for efficiency and correctness. Our Java implementation of the mapping is part of the official Gene Ontology project source.Our transformation system provides a lossless roundtrip mapping for OBO ontologies, i.e. an OBO ontology may be translated to OWL and back without loss of knowledge. In addition, it provides a roadmap for bridging the gap between the two ontology languages in order to enable the use of ontology content in a language independent manner.

    View details for DOI 10.1186/2041-1480-2-S1-S3

    View details for PubMedID 21388572

  • Selected papers from the 13th Annual Bio-Ontologies Special Interest Group Meeting. Journal of biomedical semantics Soldatova, L. N., Sansone, S., Stephens, S. M., Shah, N. H. 2011; 2: I1-?


    Over the years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in the application of ontologies and more generally the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The ten papers selected for this supplement are extended versions of the original papers presented at the 2010 SIG. The papers span a wide range of topics including practical solutions for data and knowledge integration for translational medicine, hypothesis based querying , understanding kidney and urinary pathways, mining the pharmacogenomics literature; theoretical research into the orthogonality of biomedical ontologies, the representation of diseases, the representation of research hypotheses, the combination of ontologies and natural language processing for an annotation framework, the generation of textual definitions, and the discovery of gene interaction networks.

    View details for DOI 10.1186/2041-1480-2-S2-I1

    View details for PubMedID 21624154

  • HyQue: evaluating hypotheses using Semantic Web technologies. Journal of biomedical semantics Callahan, A., Dumontier, M., Shah, N. H. 2011; 2: S3-?


    Key to the success of e-Science is the ability to computationally evaluate expert-composed hypotheses for validity against experimental data. Researchers face the challenge of collecting, evaluating and integrating large amounts of diverse information to compose and evaluate a hypothesis. Confronted with rapidly accumulating data, researchers currently do not have the software tools to undertake the required information integration tasks.We present HyQue, a Semantic Web tool for querying scientific knowledge bases with the purpose of evaluating user submitted hypotheses. HyQue features a knowledge model to accommodate diverse hypotheses structured as events and represented using Semantic Web languages (RDF/OWL). Hypothesis validity is evaluated against experimental and literature-sourced evidence through a combination of SPARQL queries and evaluation rules. Inference over OWL ontologies (for type specifications, subclass assertions and parthood relations) and retrieval of facts stored as Bio2RDF linked data provide support for a given hypothesis. We evaluate hypotheses of varying levels of detail about the genetic network controlling galactose metabolism in Saccharomyces cerevisiae to demonstrate the feasibility of deploying such semantic computing tools over a growing body of structured knowledge in Bio2RDF.HyQue is a query-based hypothesis evaluation system that can currently evaluate hypotheses about the galactose metabolism in S. cerevisiae. Hypotheses as well as the supporting or refuting data are represented in RDF and directly linked to one another allowing scientists to browse from data to hypothesis and vice versa. HyQue hypotheses and data are available at

    View details for DOI 10.1186/2041-1480-2-S2-S3

    View details for PubMedID 21624158

  • Integration and publication of heterogeneous text-mined relationships on the Semantic Web. Journal of biomedical semantics Coulet, A., Garten, Y., Dumontier, M., Altman, R. B., Musen, M. A., Shah, N. H. 2011; 2: S10-?


    Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and the complexity of natural language in expressing similar relationships causes the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using these for data mining or question answering.We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network.The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at

    View details for DOI 10.1186/2041-1480-2-S2-S10

    View details for PubMedID 21624156

  • Using text to build semantic networks for pharmacogenomics JOURNAL OF BIOMEDICAL INFORMATICS Coulet, A., Shah, N. H., Garten, Y., Musen, M., Altman, R. B. 2010; 43 (6): 1009-1019


    Most pharmacogenomics knowledge is contained in the text of published studies, and is thus not available for automated computation. Natural Language Processing (NLP) techniques for extracting relationships in specific domains often rely on hand-built rules and domain-specific ontologies to achieve good performance. In a new and evolving field such as pharmacogenomics (PGx), rules and ontologies may not be available. Recent progress in syntactic NLP parsing in the context of a large corpus of pharmacogenomics text provides new opportunities for automated relationship extraction. We describe an ontology of PGx relationships built starting from a lexicon of key pharmacogenomic entities and a syntactic parse of more than 87 million sentences from 17 million MEDLINE abstracts. We used the syntactic structure of PGx statements to systematically extract commonly occurring relationships and to map them to a common schema. Our extracted relationships have a 70-87.7% precision and involve not only key PGx entities such as genes, drugs, and phenotypes (e.g., VKORC1, warfarin, clotting disorder), but also critical entities that are frequently modified by these key entities (e.g., VKORC1 polymorphism, warfarin response, clotting disorder treatment). The result of our analysis is a network of 40,000 relationships between more than 200 entity types with clear semantics. This network is used to guide the curation of PGx knowledge and provide a computable resource for knowledge discovery.

    View details for DOI 10.1016/j.jbi.2010.08.005

    View details for Web of Science ID 000285036700017

    View details for PubMedID 20723615

  • A UIMA wrapper for the NCBO annotator BIOINFORMATICS Roeder, C., Jonquet, C., Shah, N. H., Baumgartner, W. A., Verspoor, K., Hunter, L. 2010; 26 (14): 1800-1801


    The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator-an ontology-based annotation service-to make it available as a component in UIMA workflows.This wrapper is freely available on the web at as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows.

    View details for DOI 10.1093/bioinformatics/btq250

    View details for Web of Science ID 000279474400025

    View details for PubMedID 20505005

  • In Silico Functional Profiling of Human Disease-Associated and Polymorphic Amino Acid Substitutions HUMAN MUTATION Mort, M., Evani, U. S., Krishnan, V. G., Kamati, K. K., Baenziger, P. H., Bagchi, A., Peters, B. J., Sathyesh, R., Li, B., Sun, Y., Xue, B., Shah, N. H., Kann, M. G., Cooper, D. N., Radivojac, P., Mooney, S. D. 2010; 31 (3): 335-346


    An important challenge in translational bioinformatics is to understand how genetic variation gives rise to molecular changes at the protein level that can precipitate both monogenic and complex disease. To this end, we compiled datasets of human disease-associated amino acid substitutions (AAS) in the contexts of inherited monogenic disease, complex disease, functional polymorphisms with no known disease association, and somatic mutations in cancer, and compared them with respect to predicted functional sites in proteins. Using the sequence homology-based tool SIFT to estimate the proportion of deleterious AAS in each dataset, only complex disease AAS were found to be indistinguishable from neutral polymorphic AAS. Investigation of monogenic disease AAS predicted to be nondeleterious by SIFT were characterized by a significant enrichment for inherited AAS within solvent accessible residues, regions of intrinsic protein disorder, and an association with the loss or gain of various posttranslational modifications. Sites of structural and/or functional interest were therefore surmised to constitute useful additional features with which to identify the molecular disruptions caused by deleterious AAS. A range of bioinformatic tools, designed to predict structural and functional sites in protein sequences, were then employed to demonstrate that intrinsic biases exist in terms of the distribution of different types of human AAS with respect to specific structural, functional and pathological features. Our Web tool, designed to potentiate the functional profiling of novel AAS, has been made available at

    View details for DOI 10.1002/humu.21192

    View details for Web of Science ID 000275419900014

    View details for PubMedID 20052762

  • Selected papers from the 12th annual Bio-Ontologies meeting. Journal of biomedical semantics Soldatova, L. N., Lord, P., Sansone, S., Stephens, S. M., Shah, N. H. 2010; 1: I1-?

    View details for DOI 10.1186/2041-1480-1-S1-I1

    View details for PubMedID 20626920

  • Optimize First, Buy Later: Analyzing Metrics to Ramp-Up Very Large Knowledge Bases SEMANTIC WEB-ISWC 2010, PT I LePendu, P., Noy, N. F., Jonquet, C., Alexander, P. R., Shah, N. H., Musen, M. A. 2010; 6496: 486-501
  • The Lexicon Builder Web service: Building Custom Lexicons from two hundred Biomedical Ontologies. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Parai, G. K., Jonquet, C., xu, r., Musen, M. A., Shah, N. H. 2010; 2010: 587-591


    Domain specific biomedical lexicons are extensively used by researchers for natural language processing tasks. Currently these lexicons are created manually by expert curators and there is a pressing need for automated methods to compile such lexicons. The Lexicon Builder Web service addresses this need and reduces the investment of time and effort involved in lexicon maintenance. The service has three components: Inclusion - selects one or several ontologies (or its branches) and includes preferred names and synonym terms; Exclusion - filters terms based on the term's Medline frequency, syntactic type, UMLS semantic type and match with stopwords; Output - aggregates information, handles compression and output formats. Evaluation demonstrates that the service has high accuracy and runtime performance. It is currently being evaluated for several use cases to establish its utility in biomedical information processing tasks. The Lexicon Builder promotes collaboration, sharing and standardization of lexicons amongst researchers by automating the creation, maintainence and cross referencing of custom lexicons.

    View details for PubMedID 21347046

  • Building a biomedical ontology recommender web service. Journal of biomedical semantics Jonquet, C., Musen, M. A., Shah, N. H. 2010; 1: S1-?


    Researchers in biomedical informatics use ontologies and terminologies to annotate their data in order to facilitate data integration and translational discoveries. As the use of ontologies for annotation of biomedical datasets has risen, a common challenge is to identify ontologies that are best suited to annotating specific datasets. The number and variety of biomedical ontologies is large, and it is cumbersome for a researcher to figure out which ontology to use.We present the Biomedical Ontology Recommender web service. The system uses textual metadata or a set of keywords describing a domain of interest and suggests appropriate ontologies for annotating or representing the data. The service makes a decision based on three criteria. The first one is coverage, or the ontologies that provide most terms covering the input text. The second is connectivity, or the ontologies that are most often mapped to by other ontologies. The final criterion is size, or the number of concepts in the ontologies. The service scores the ontologies as a function of scores of the annotations created using the National Center for Biomedical Ontology (NCBO) Annotator web service. We used all the ontologies from the UMLS Metathesaurus and the NCBO BioPortal.We compare and contrast our Recommender by an exhaustive functional comparison to previously published efforts. We evaluate and discuss the results of several recommendation heuristics in the context of three real world use cases. The best recommendations heuristics, rated 'very relevant' by expert evaluators, are the ones based on coverage and connectivity criteria. The Recommender service (alpha version) is available to the community and is embedded into BioPortal.

    View details for DOI 10.1186/2041-1480-1-S1-S1

    View details for PubMedID 20626921

  • An ontology-neutral framework for enrichment analysis. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Tirrell, R., Evani, U., Berman, A. E., Mooney, S. D., Musen, M. A., Shah, N. H. 2010; 2010: 797-801


    Advanced statistical methods used to analyze high-throughput data (e.g. gene-expression assays) result in long lists of "significant genes." One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene-set, and is relevant for and extensible to data analysis with other high-throughput measurement modalities such as proteomics, metabolomics, and tissue-microarray assays. With the availability of tools for automatic ontology-based annotation of datasets with terms from biomedical ontologies besides the GO, we need not restrict enrichment analysis to the GO. We describe, RANSUM - Rich Annotation Summarizer - which performs enrichment analysis using any ontology in the National Center for Biomedical Ontology's (NCBO) BioPortal. We outline the methodology of enrichment analysis, the associated challenges, and discuss novel analyses enabled by RANSUM.

    View details for PubMedID 21347088

  • Extraction of genotype-phenotype-drug relationships from text: from entity recognition to bioinformatics application. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Coulet, A., Shah, N., Hunter, L., Barral, C., Altman, R. B. 2010: 485-487


    Advances in concept recognition and natural language parsing have led to the development of various tools that enable the identification of biomedical entities and relationships between them in text. The aim of the Genotype-Phenotype-Drug Relationship Extraction from Text workshop (or GPD-Rx workshop) is to examine the current state of art and discuss the next steps for making the extraction of relationships between biomedical entities integral to the curation and knowledge management workflow in Pharmacogenomics. The workshop will focus particularly on the extraction of Genotype-Phenotype, Genotype-Drug, and Phenotype-Drug relationships that are of interest to Pharmacogenomics. Extracting and structuring such text-mined relationships is a key to support the evaluation and the validation of multiple hypotheses that emerge from high throughput translational studies spanning multiple measurement modalities. In order to advance this agenda, it is essential that existing relationship extraction methods be compared to one another and that a community wide benchmark corpus emerges; against which future methods can be compared. The workshop aims to bring together researchers working on the automatic or semi-automatic extraction of relationships between biomedical entities from research literature in order to identify the key groups interested in creating such a benchmark.

    View details for PubMedID 19904832

  • A Comprehensive Analysis of Five Million UMLS Metathesaurus Terms Using Eighteen Million MEDLINE Citations. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium xu, r., Musen, M. A., Shah, N. H. 2010; 2010: 907-911


    The Unified Medical Language System (UMLS) Metathesaurus is widely used for biomedical natural language processing (NLP) tasks. In this study, we systematically analyzed UMLS Metathesaurus terms by analyzing their occurrences in over 18 million MEDLINE abstracts. Our goals were: 1. analyze the frequency and syntactic distribution of Metathesaurus terms in MEDLINE; 2. create a filtered UMLS Metathesaurus based on the MEDLINE analysis; 3. augment the UMLS Metathesaurus where each term is associated with metadata on its MEDLINE frequency and syntactic distribution statistics. After MEDLINE frequency-based filtering, the augmented UMLS Metathesaurus contains 518,835 terms and is roughly 13% of its original size. We have shown that the syntactic and frequency information is useful to identify errors in the Metathesaurus. This filtered and augmented UMLS Metathesaurus can potentially be used to improve efficiency and precision of UMLS-based information retrieval and NLP tasks.

    View details for PubMedID 21347110

  • BioPortal: ontologies and integrated data resources at the click of a mouse NUCLEIC ACIDS RESEARCH Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D. L., Storey, M., Chute, C. G., Musen, M. A. 2009; 37: W170-W173


    Biomedical ontologies provide essential domain knowledge to drive data integration, information retrieval, data annotation, natural-language processing and decision support. BioPortal ( is an open repository of biomedical ontologies that provides access via Web services and Web browsers to ontologies developed in OWL, RDF, OBO format and Protégé frames. BioPortal functionality includes the ability to browse, search and visualize ontologies. The Web interface also facilitates community-based participation in the evaluation and evolution of ontology content by providing features to add notes to ontology terms, mappings between terms and ontology reviews based on criteria such as usability, domain coverage, quality of content, and documentation and support. BioPortal also enables integrated search of biomedical data resources such as the Gene Expression Omnibus (GEO),, and ArrayExpress, through the annotation and indexing of these resources with ontologies in BioPortal. Thus, BioPortal not only provides investigators, clinicians, and developers 'one-stop shopping' to programmatically access biomedical ontologies, but also provides support to integrate data from a variety of biomedical resources.

    View details for DOI 10.1093/nar/gkp440

    View details for Web of Science ID 000267889100031

    View details for PubMedID 19483092

  • Ontology-driven indexing of public datasets for translational bioinformatics BMC BIOINFORMATICS Shah, N. H., Jonquet, C., Chiang, A. P., Butte, A. J., Chen, R., Musen, M. A. 2009; 10


    The volume of publicly available genomic scale data is increasing. Genomic datasets in public repositories are annotated with free-text fields describing the pathological state of the studied sample. These annotations are not mapped to concepts in any ontology, making it difficult to integrate these datasets across repositories. We have previously developed methods to map text-annotations of tissue microarrays to concepts in the NCI thesaurus and SNOMED-CT. In this work we generalize our methods to map text annotations of gene expression datasets to concepts in the UMLS. We demonstrate the utility of our methods by processing annotations of datasets in the Gene Expression Omnibus. We demonstrate that we enable ontology-based querying and integration of tissue and gene expression microarray data. We enable identification of datasets on specific diseases across both repositories. Our approach provides the basis for ontology-driven data integration for translational research on gene and protein expression data. Based on this work we have built a prototype system for ontology based annotation and indexing of biomedical data. The system processes the text metadata of diverse resource elements such as gene expression data sets, descriptions of radiology images, clinical-trial reports, and PubMed article abstracts to annotate and index them with concepts from appropriate ontologies. The key functionality of this system is to enable users to locate biomedical data resources related to particular ontology concepts.

    View details for DOI 10.1186/1471-2105-10-S2-S1

    View details for Web of Science ID 000265602500002

    View details for PubMedID 19208184

  • The open biomedical annotator. Summit on translational bioinformatics Jonquet, C., Shah, N. H., Musen, M. A. 2009; 2009: 56-60


    The range of publicly available biomedical data is enormous and is expanding fast. This expansion means that researchers now face a hurdle to extracting the data they need from the large numbers of data that are available. Biomedical researchers have turned to ontologies and terminologies to structure and annotate their data with ontology concepts for better search and retrieval. However, this annotation process cannot be easily automated and often requires expert curators. Plus, there is a lack of easy-to-use systems that facilitate the use of ontologies for annotation. This paper presents the Open Biomedical Annotator (OBA), an ontology-based Web service that annotates public datasets with biomedical ontology concepts based on their textual metadata ( The biomedical community can use the annotator service to tag datasets automatically with ontology terms (from UMLS and NCBO BioPortal ontologies). Such annotations facilitate translational discoveries by integrating annotated data.[1].

    View details for PubMedID 21347171

  • What Four Million Mappings Can Tell You about Two Hundred Ontologies SEMANTIC WEB - ISWC 2009, PROCEEDINGS Ghazvinian, A., Noy, N. F., Jonquet, C., Shah, N., Musen, M. A. 2009; 5823: 229-242
  • Comparison of concept recognizers for building the Open Biomedical Annotator BMC BIOINFORMATICS Shah, N. H., Bhatia, N., Jonquet, C., Rubin, D., Chiang, A. P., Musen, M. A. 2009; 10


    The National Center for Biomedical Ontology (NCBO) is developing a system for automated, ontology-based access to online biomedical resources (Shah NH, et al.: Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics 2009, 10(Suppl 2):S1). The system's indexing workflow processes the text metadata of diverse resources such as datasets from GEO and ArrayExpress to annotate and index them with concepts from appropriate ontologies. This indexing requires the use of a concept-recognition tool to identify ontology concepts in the resource's textual metadata. In this paper, we present a comparison of two concept recognizers - NLM's MetaMap and the University of Michigan's Mgrep. We utilize a number of data sources and dictionaries to evaluate the concept recognizers in terms of precision, recall, speed of execution, scalability and customizability. Our evaluations demonstrate that Mgrep has a clear edge over MetaMap for large-scale service oriented applications. Based on our analysis we also suggest areas of potential improvements for Mgrep. We have subsequently used Mgrep to build the Open Biomedical Annotator service. The Annotator service has access to a large dictionary of biomedical terms derived from the United Medical Language System (UMLS) and NCBO ontologies. The Annotator also leverages the hierarchical structure of the ontologies and their mappings to expand annotations. The Annotator service is available to the community as a REST Web service for creating ontology-based annotations of their data.

    View details for DOI 10.1186/1471-2105-10-S9-S14

    View details for Web of Science ID 000270371700015

    View details for PubMedID 19761568

  • BioPortal: ontologies and data resources with the click of a mouse. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Musen, M. A., Shah, N. H., Noy, N. F., Dai, B. Y., Dorf, M., Griffith, N., Buntrok, J., Jonquet, C., Montegut, M. J., Rubin, D. L. 2008: 1223-1224

    View details for PubMedID 18999306

  • A system for ontology-based annotation of biomedical data DATA INTEGRATION IN THE LIFE SCIENCES, PROCEEDINGS Jonquet, C., Musen, M. A., Shah, N. 2008; 5109: 144-152
  • Comparison of ontology-based semantic-similarity measures. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Lee, W., Shah, N., Sundlass, K., Musen, M. 2008: 384-388


    Semantic-similarity measures quantify concept similarities in a given ontology. Potential applications for these measures include search, data mining, and knowledge discovery in database or decision-support systems that utilize ontologies. To date, there have not been comparisons of the different semantic-similarity approaches on a single ontology. Such a comparison can offer insight on the validity of different approaches. We compared 3 approaches to semantic similarity-metrics (which rely on expert opinion, ontologies only, and information content) with 4 metrics applied to SNOMED-CT. We found that there was poor agreement among those metrics based on information content with the ontology only metric. The metric based only on the ontology structure correlated most with expert opinion. Our results suggest that metrics based on the ontology only may be preferable to information-content-based metrics, and point to the need for more research on validating the different approaches.

    View details for PubMedID 18999312

  • UMLS-Query: a perl module for querying the UMLS. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Shah, N. H., Muse, M. A. 2008: 652-656


    The Metathesaurus from the Unified Medical Language System (UMLS) is a widely used ontology resource, which is mostly used in a relational database form for terminology research, mapping and information indexing. A significant section of UMLS users use a MySQL installation of the metathesaurus and Perl programming language as their access mechanism. We describe UMLS-Query, a Perl module that provides functions for retrieving concept identifiers, mapping text-phrases to Metathesaurus concepts and graph traversal in the Metathesaurus stored in a MySQL database. UMLS-Query can be used to build applications for semi-automated sample annotation, terminology based browsers for tissue sample databases and for terminology research. We describe the results of such uses of UMLS-Query and present the module for others to use.

    View details for PubMedID 18998805

  • The Stanford Tissue Microarray Database NUCLEIC ACIDS RESEARCH Marinelli, R. J., Montgomery, K., Liu, C. L., Shah, N. H., Prapong, W., Nitzberg, M., Zachariah, Z. K., Sherlock, G. J., Natkunam, Y., West, R. B., van de Rijn, M., Brown, P. O., Ball, C. A. 2008; 36: D871-D877


    The Stanford Tissue Microarray Database (TMAD; is a public resource for disseminating annotated tissue images and associated expression data. Stanford University pathologists, researchers and their collaborators worldwide use TMAD for designing, viewing, scoring and analyzing their tissue microarrays. The use of tissue microarrays allows hundreds of human tissue cores to be simultaneously probed by antibodies to detect protein abundance (Immunohistochemistry; IHC), or by labeled nucleic acids (in situ hybridization; ISH) to detect transcript abundance. TMAD archives multi-wavelength fluorescence and bright-field images of tissue microarrays for scoring and analysis. As of July 2007, TMAD contained 205 161 images archiving 349 distinct probes on 1488 tissue microarray slides. Of these, 31 306 images for 68 probes on 125 slides have been released to the public. To date, 12 publications have been based on these raw public data. TMAD incorporates the NCI Thesaurus ontology for searching tissues in the cancer domain. Image processing researchers can extract images and scores for training and testing classification algorithms. The production server uses the Apache HTTP Server, Oracle Database and Perl application code. Source code is available to interested researchers under a no-cost license.

    View details for DOI 10.1093/nar/gkm861

    View details for Web of Science ID 000252545400154

    View details for PubMedID 17989087

  • Biomedical ontologies: a functional perspective BRIEFINGS IN BIOINFORMATICS Rubin, D. L., Shah, N. H., Noy, N. F. 2008; 9 (1): 75-90


    The information explosion in biology makes it difficult for researchers to stay abreast of current biomedical knowledge and to make sense of the massive amounts of online information. Ontologies--specifications of the entities, their attributes and relationships among the entities in a domain of discourse--are increasingly enabling biomedical researchers to accomplish these tasks. In fact, bio-ontologies are beginning to proliferate in step with accruing biological data. The myriad of ontologies being created enables researchers not only to solve some of the problems in handling the data explosion but also introduces new challenges. One of the key difficulties in realizing the full potential of ontologies in biomedical research is the isolation of various communities involved: some workers spend their career developing ontologies and ontology-related tools, while few researchers (biologists and physicians) know how ontologies can accelerate their research. The objective of this review is to give an overview of biomedical ontology in practical terms by providing a functional perspective--describing how bio-ontologies can and are being used. As biomedical scientists begin to recognize the many different ways ontologies enable biomedical research, they will drive the emergence of new computer applications that will help them exploit the wealth of research data now at their fingertips.

    View details for DOI 10.1093/bib/bbm059

    View details for Web of Science ID 000251864600008

    View details for PubMedID 18077472

  • The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration NATURE BIOTECHNOLOGY Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S., Scheuermann, R. H., Shah, N., Whetzel, P. L., Lewis, S. 2007; 25 (11): 1251-1255


    The value of any kind of data is greatly enhanced when it exists in a form that allows it to be integrated with other data. One approach to integration is through the annotation of multiple bodies of data using common controlled vocabularies or 'ontologies'. Unfortunately, the very success of this approach has led to a proliferation of ontologies, which itself creates obstacles to integration. The Open Biomedical Ontologies (OBO) consortium is pursuing a strategy to overcome this problem. Existing OBO ontologies, including the Gene Ontology, are undergoing coordinated reform, and new ontologies are being created on the basis of an evolving set of shared principles governing ontology development. The result is an expanding family of ontologies designed to be interoperable and logically well formed and to incorporate accurate representations of biological reality. We describe this OBO Foundry initiative and provide guidelines for those who might wish to become involved.

    View details for DOI 10.1038/nbt1346

    View details for Web of Science ID 000251086500025

    View details for PubMedID 17989687

  • Current progress in network research: toward reference networks for key model organisms BRIEFINGS IN BIOINFORMATICS Srinivasan, B. S., Shah, N. H., Flannick, J. A., Abeliuk, E., Novak, A. F., Batzoglou, S. 2007; 8 (5): 318-332


    The collection of multiple genome-scale datasets is now routine, and the frontier of research in systems biology has shifted accordingly. Rather than clustering a single dataset to produce a static map of functional modules, the focus today is on data integration, network alignment, interactive visualization and ontological markup. Because of the intrinsic noisiness of high-throughput measurements, statistical methods have been central to this effort. In this review, we briefly survey available datasets in functional genomics, review methods for data integration and network alignment, and describe recent work on using network models to guide experimental validation. We explain how the integration and validation steps spring from a Bayesian description of network uncertainty, and conclude by describing an important near-term milestone for systems biology: the construction of a set of rich reference networks for key model organisms.

    View details for DOI 10.1093/bib/bbm038

    View details for Web of Science ID 000251034700005

    View details for PubMedID 17728341

  • Annotation and query of tissue microarray data using the NCI Thesaurus BMC BIOINFORMATICS Shah, N. H., Rubin, D. L., Espinosa, I., Montgomery, K., Musen, M. A. 2007; 8


    The Stanford Tissue Microarray Database (TMAD) is a repository of data serving a consortium of pathologists and biomedical researchers. The tissue samples in TMAD are annotated with multiple free-text fields, specifying the pathological diagnoses for each sample. These text annotations are not structured according to any ontology, making future integration of this resource with other biological and clinical data difficult.We developed methods to map these annotations to the NCI thesaurus. Using the NCI-T we can effectively represent annotations for about 86% of the samples. We demonstrate how this mapping enables ontology driven integration and querying of tissue microarray data. We have deployed the mapping and ontology driven querying tools at the TMAD site for general use.We have demonstrated that we can effectively map the diagnosis-related terms describing a sample in TMAD to the NCI-T. The NCI thesaurus terms have a wide coverage and provide terms for about 86% of the samples. In our opinion the NCI thesaurus can facilitate integration of this resource with other biological data.

    View details for DOI 10.1186/1471-2105-8-296

    View details for Web of Science ID 000249734300001

    View details for PubMedID 17686183

  • Interpretation errors related to the GO annotation file format. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Moreira, D. A., Shah, N. H., Musen, M. A. 2007: 538-542


    The Gene Ontology (GO) is the most widely used ontology for creating biomedical annotations. GO annotations are statements associating a biological entity with a GO term. These statements comprise a large dataset of biological knowledge that is used widely in biomedical research. GO Annotations are available as "gene association files" from the GO website in a tab-delimited file format (GO Annotation File Format) composed of rows of 15 tab-delimited fields. This simple format lacks the knowledge representation (KR) capabilities to represent unambiguously semantic relationships between each field. This paper demonstrates that this KR shortcoming leads users to interpret the files in ways that can be erroneous. We propose a complementary format to represent GO annotation files as knowledge bases using the W3C recommended Web Ontology Language (OWL).

    View details for PubMedID 18693894

  • Using annotations from controlled vocabularies to find meaningful associations DATA INTEGRATION IN THE LIFE SCIENCES, PROCEEDINGS Lee, W., Raschid, L., Srinivasan, P., Shah, N., Rubin, D., Noy, N. 2007; 4544: 247-263
  • Searching Ontologies Based on Content: Experiments in the Biomedical Domain K-CAP'07: PROCEEDINGS OF THE FOURTH INTERNATIONAL CONFERENCE ON KNOWLEDGE CAPTURE Alani, H., Noy, N. F., Shah, N., Shadbolt, N., Musen, M. A. 2007: 55-62
  • A case study in pathway knowledgebase verification BMC BIOINFORMATICS Racunas, S. A., Shah, N. H., Fedoroff, N. V. 2006; 7


    Biological databases and pathway knowledge-bases are proliferating rapidly. We are developing software tools for computer-aided hypothesis design and evaluation, and we would like our tools to take advantage of the information stored in these repositories. But before we can reliably use a pathway knowledge-base as a data source, we need to proofread it to ensure that it can fully support computer-aided information integration and inference.We design a series of logical tests to detect potential problems we might encounter using a particular knowledge-base, the Reactome database, with a particular computer-aided hypothesis evaluation tool, HyBrow. We develop an explicit formal language from the language implicit in the Reactome data format and specify a logic to evaluate models expressed using this language. We use the formalism of finite model theory in this work. We then use this logic to formulate tests for desirable properties (such as completeness, consistency, and well-formedness) for pathways stored in Reactome. We apply these tests to the publicly available Reactome releases (releases 10 through 14) and compare the results, which highlight Reactome's steady improvement in terms of decreasing inconsistencies. We also investigate and discuss Reactome's potential for supporting computer-aided inference tools.The case study described in this work demonstrates that it is possible to use our model theory based approach to identify problems one might encounter using a knowledge-base to support hypothesis evaluation tools. The methodology we use is general and is in no way restricted to the specific knowledge-base employed in this case study. Future application of this methodology will enable us to compare pathway resources with respect to the generic properties such resources will need to possess if they are to support automated reasoning.

    View details for DOI 10.1186/1471-2105-7-196

    View details for Web of Science ID 000239302400001

    View details for PubMedID 16603083

  • Ontology-based annotation and query of tissue microarray data. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Shah, N. H., Rubin, D. L., Supekar, K. S., Musen, M. A. 2006: 709-713


    The Stanford Tissue Microarray Database (TMAD) is a repository of data amassed by a consortium of pathologists and biomedical researchers. The TMAD data are annotated with multiple free-text fields, specifying the pathological diagnoses for each tissue sample. These annotations are spread out over multiple text fields and are not structured according to any ontology, making it difficult to integrate this resource with other biological and clinical data. We developed methods to map these annotations to the NCI thesaurus and the SNOMED-CT ontologies. Using these two ontologies we can effectively represent about 80% of the annotations in a structured manner. This mapping offers the ability to perform ontology driven querying of the TMAD data. We also found that 40% of annotations can be mapped to terms from both ontologies, providing the potential to align the two ontologies based on experimental data. Our approach provides the basis for a data-driven ontology alignment by mapping annotations of experimental data.

    View details for PubMedID 17238433

  • Temporal evolution of the Arabidopsis oxidative stress response PLANT MOLECULAR BIOLOGY Mahalingam, R., Shah, N., Scrymgeour, A., Fedoroff, N. 2005; 57 (5): 709-730


    We have carried out a detailed analysis of the changes in gene expression levels in Arabidopsis thaliana ecotype Columbia (Col-0) plants during and for 6 h after exposure to ozone (O3) at 350 parts per billion (ppb) for 6 h. This O3 exposure is sufficient to induce a marked transcriptional response and an oxidative burst, but not to cause substantial tissue damage in Col-0 wild-type plants and is within the range encountered in some major metropolitan areas. We have developed analytical and visualization tools to automate the identification of expression profile groups with common gene ontology (GO) annotations based on the sub-cellular localization and function of the proteins encoded by the genes, as well as to automate promoter analysis for such gene groups. We describe application of these methods to identify stress-induced genes whose transcript abundance is likely to be controlled by common regulatory mechanisms and summarized our findings in a temporal model of the stress response.

    View details for DOI 10.1007/s11103-005-2860-4

    View details for Web of Science ID 000231220400007

    View details for PubMedID 15988565

  • HyBrow: a prototype system for computer-aided hypothesis evaluation. Bioinformatics Racunas, S. A., Shah, N. H., Albert, I., Fedoroff, N. V. 2004; 20: i257-64


    Experimental design, hypothesis-testing and model-building in the current data-rich environment require the biologists' to collect, evaluate and integrate large amounts of information of many disparate kinds. Developing a unified framework for the representation and conceptual integration of biological data and processes is a major challenge in bioinformatics because of the variety of available data and the different levels of detail at which biological processes can be considered.We have developed the HyBrow (Hypothesis Browser) system as a prototype bioinformatics tool for designing hypotheses and evaluating them for consistency with existing knowledge. HyBrow consists of a modeling framework with the ability to accommodate diverse biological information sources, an event-based ontology for representing biological processes at different levels of detail, a database to query information in the ontology and programs to perform hypothesis design and evaluation. We demonstrate the HyBrow prototype using the galactose gene network in Saccharomyces cerevisiae as our test system, and evaluate alternative hypotheses for consistency with stored

    View details for PubMedID 15262807

  • HyBrow: a prototype system for computer-aided hypothesis evaluation BIOINFORMATICS Racunas, S. A., Shah, N. H., Albert, I., Fedoroff, N. V. 2004; 20: 257-264
  • CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology BIOINFORMATICS Shah, N. H., Fedoroff, N. V. 2004; 20 (7): 1196-1197


    Analysis of microarray data most often produces lists of genes with similar expression patterns, which are then subdivided into functional categories for biological interpretation. Such functional categorization is most commonly accomplished using Gene Ontology (GO) categories. Although there are several programs that identify and analyze functional categories for human, mouse and yeast genes, none of them accept Arabidopsis thaliana data. In order to address this need for A.thaliana community, we have developed a program that retrieves GO annotations for A.thaliana genes and performs functional category analysis for lists of genes selected by the user.

    View details for DOI 10.1093/bioinformatics/bth056

    View details for Web of Science ID 000221139700024

    View details for PubMedID 14764555

  • A finite model theory for biological hypotheses 2004 IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE, PROCEEDINGS Racunas, S., Griffin, C., Shah, N. 2004: 616-620
  • A tool-kit for cDNA microarray and promoter analysis BIOINFORMATICS Shah, N. H., King, D. C., Shah, P. N., Fedoroff, N. V. 2003; 19 (14): 1846-1848


    We describe two sets of programs for expediting routine tasks in analysis of cDNA microarray data and promoter sequences. The first set permits bad data points to be flagged with respect to a number of parameters and performs normalization in three different ways. It allows combining of result files into comprehensive data sets, evaluation of the quality of both technical and biological replicates and row and/or column standardization of data matrices. The second set supports mapping ESTs in the genome, identifying the corresponding genes and recovering their promoters, analyzing promoters for transcription factor binding sites, and visual representation of the results. The programs are designed primarily for Arabidopsis thaliana researchers, but can be adapted readily for other model systems. Availability and Supplementary information:

    View details for DOI 10.1093/bioinformatics/btg253

    View details for Web of Science ID 000185701100017

    View details for PubMedID 14512358

  • A contradiction-based framework for testing gene regulation hypotheses PROCEEDINGS OF THE 2003 IEEE BIOINFORMATICS CONFERENCE Racunas, S., Shah, N., Fedoroff, N. V. 2003: 634-638
  • Characterizing the stress/defense transcriptome of Arabidopsis GENOME BIOLOGY Mahalingam, R., Gomez-Buitrago, A., Eckardt, N., Shah, N., Guevara-Garcia, A., Day, P., Raina, R., Fedoroff, N. V. 2003; 4 (3)


    To understand the gene networks that underlie plant stress and defense responses, it is necessary to identify and characterize the genes that respond both initially and as the physiological response to the stress or pathogen develops. We used PCR-based suppression subtractive hybridization to identify Arabidopsis genes that are differentially expressed in response to ozone, bacterial and oomycete pathogens and the signaling molecules salicylic acid (SA) and jasmonic acid.We identified a total of 1,058 differentially expressed genes from eight stress cDNA libraries. Digital northern analysis revealed that 55% of the stress-inducible genes are rarely transcribed in unstressed plants and 17% of them were not previously represented in Arabidopsis expressed sequence tag databases. More than two-thirds of the genes in the stress cDNA collection have not been identified in previous studies as stress/defense response genes. Several stress-responsive cis-elements showed a statistically significant over-representation in the promoters of the genes in the stress cDNA collection. These include W- and G-boxes, the SA-inducible element, the abscisic acid response element and the TGA motif.The stress cDNA collection comprises a broad repertoire of stress-responsive genes encoding proteins that are involved in both the initial and subsequent stages of the physiological response to abiotic stress and pathogens. This set of stress-, pathogen- and hormone-modulated genes is an important resource for understanding the genetic interactions underlying stress signaling and responses and may contribute to the characterization of the stress transcriptome through the construction of standardized specialized arrays.

    View details for Web of Science ID 000182694200009

    View details for PubMedID 12620105

  • StressDB: A locally installable web-based relational microarray database designed for small user communities COMPARATIVE AND FUNCTIONAL GENOMICS Mitra, M., Shah, N., Mueller, L., Pin, S., Fedoroff, N. 2002; 3 (2): 91-96


    We have built a microarray database, StressDB, for management of microarray data from our studies on stress-modulated genes in Arabidopsis. StressDB provides small user groups with a locally installable web-based relational microarray database. It has a simple and intuitive architecture and has been designed for cDNA microarray technology users. StressDB uses Windows 2000 as the centralized database server with Oracle 8i as the relational database management system. It allows users to manage microarray data and data-related biological information over the Internet using a web browser. The source-code is currently available on request from the authors and will soon be made freely available for downloading from our website at

    View details for DOI 10.1002/cfg.153

    View details for Web of Science ID 000175388900001

    View details for PubMedID 18628845

Stanford Medicine Resources: