Russ Biagio Altman is a professor of bioengineering, genetics, and medicine (and of computer science, by courtesy) and past chairman of the Bioengineering Department at Stanford University. His primary research interests are in the application of computing and informatics technologies to problems relevant to medicine. He is particularly interested in methods for understanding drug action at molecular, cellular, organism and population levels. His lab studies how human genetic variation impacts drug response. Other work focuses on the analysis of biological molecules to understand the action, interaction and adverse events of drugs. Dr. Altman holds an A.B. from Harvard College, an M.D. from Stanford Medical School, and a Ph.D. in Medical Information Sciences from Stanford. He received the U.S. Presidential Early Career Award for Scientists and Engineers and a National Science Foundation CAREER Award. He is a fellow of the American College of Physicians (ACP), the American College of Medical Informatics (ACMI), the American Institute for Medical and Biological Engineering (AIMBE), and the American Association for the Advancement of Science (AAAS). He is a member of the Institute of Medicine of the National Academies. He is a past-President, founding board member, and a Fellow of the International Society for Computational Biology (ISCB), and a past-President of the American Society for Clinical Pharmacology & Therapeutics (ASCPT). He has chaired the Science Board advising the FDA Commissioner, and currently serves on the NIH Director’s Advisory Committee. He is an organizer of the annual Pacific Symposium on Biocomputing and a founder of Personalis, Inc. Dr. Altman is board certified in Internal Medicine and in Clinical Informatics. He received the Stanford Medical School graduate teaching award in 2000 and its mentorship award in 2014.

Administrative Appointments

  • Member, Biomedical Library and Informatics Research Committee Study Section (NIH) (2002 - 2005)
  • President, International Society for Computational Biology (2000 - 2001)
  • President, American Society for Clinical Pharmacology and Therapeutics (2013 - 2014)
  • Director, Biomedical Informatics Training Program (2000 - Present)
  • Chairman, Department of Bioengineering (2007 - 2012)
  • Chair, FDA Science Board (2013 - 2014)
  • Member, Advisory Committee to the Director (ACD), NIH (2013 - 2016)

Honors & Awards

  • Fellow, American College of Medical Informatics (1998)
  • Award for Excellence in Graduate Teaching, Stanford Medical School (2000)
  • Post-Doctoral Fellowship, Howard Hughes Medical Institute (1991)
  • U.S. Presidential Early Career Award for Scientists & Engineers, NIH (1997)
  • Fellow, American College of Physicians (1998)
  • Fellow, American Institute for Medical and Biological Engineering (2007)
  • Member, Institute of Medicine of the National Academies (2009)
  • Fellow, International Society for Computational Biology (2010)
  • Fellow, American Association for the Advancement of Science (2014)
  • Stanford Medical School Mentorship Award, Stanford Medical School (2014)

Boards, Advisory Committees, Professional Organizations

  • Co-Editor-in-Chief, Annual Reviews of Biomedical Data Science (2016 - Present)
  • Advisor, Vanderbilt University Medical School (2014 - Present)
  • Advisor, NIH Advisory Committee to the Director (ACD) (2013 - Present)
  • Member, FDA Commissioner Science Board (2011 - 2014)
  • Co-Organizer, Pacific Symposium on Biocomputing (1995 - Present)

Professional Education

  • AB, Harvard College, Biochemistry & Molecular Biology (1983)
  • PhD, Stanford University, Medical Information Sciences (1989)
  • MD, Stanford University, Medicine (1990)

Community and International Work

  • Principal Investigator, Cape Town, South Africa


    Informatics capacity building in Africa

    Partnering Organization(s)

    NIH Fogarty Center

  • Attending Physician, Medicine, Menlo Park, CA


    Supervision of medical residents.

    Partnering Organization(s)

    Willow Clinic, San Mateo County

    Populations Served

    Local population


    Bay Area

Current Research and Scholarly Interests

I am interested in the application of computational technologies to problems in molecular biology of relevance to medicine. In particular, my laboratory focuses on drug response at the molecular level, working in three areas. First, we are building a comprehensive pharmacogenomics knowledge base that provides access to information relating genotype to phenotype (in particular, how variation in genetics leads to variation in response to drugs). We are interested in collaboratively discovering and applying new pharmacogenomics knowledge. Second, we are interested in the analysis of three-dimensional biological structures. We have methods for analyzing protein structures to recognize and annotate active sites and binding sites, particularly in the context of interactions with small-molecule drugs. We are also interested in physics-based simulation of biological structures to understand how their dynamics impact their function. Finally, we are interested in computational methods for analyzing functional genomics information. We use natural language processing techniques for extracting and summarizing information in the literature, chemoinformatics methods for understanding small-molecule function, and machine learning and data mining techniques to understand the molecular responses to drugs.

All Publications

  • Learning the Structure of Biomedical Relationships from Unstructured Text. PLoS computational biology Percha, B., Altman, R. B. 2015; 11 (7)


    The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.

    View details for DOI 10.1371/journal.pcbi.1004216

    View details for PubMedID 26219079

  • Relating Essential Proteins to Drug Side-Effects Using Canonical Component Analysis: A Structure-Based Approach JOURNAL OF CHEMICAL INFORMATION AND MODELING Liu, T., Altman, R. B. 2015; 55 (7): 1483-1494


    The molecular mechanism of many drug side-effects is unknown and difficult to predict. Previous methods for explaining side-effects have focused on known drug targets and their pathways. However, low affinity binding to proteins that are not usually considered drug targets may also drive side-effects. In order to assess these alternative targets, we used the 3D structures of 563 essential human proteins systematically to predict binding to 216 drugs. We first benchmarked our affinity predictions with available experimental data. We then combined singular value decomposition and canonical component analysis (SVD-CCA) to predict side-effects based on these novel target profiles. Our method predicts side-effects with good accuracy (average AUC: 0.82 for side effects present in <50% of drug labels). We also noted that side-effect frequency is the most important feature for prediction and can confound efforts at elucidating mechanism; our method allows us to remove the contribution of frequency and isolate novel biological signals. In particular, our analysis produces 2768 triplet associations between 50 essential proteins, 99 drugs, and 77 side-effects. Although experimental validation is difficult because many of our essential proteins do not have validated assays, we nevertheless attempted to validate a subset of these associations using experimental assay data. Our focus on essential proteins allows us to find potential associations that would likely be missed if we used recognized drug targets. Our associations provide novel insights about the molecular mechanisms of drug side-effects and highlight the need for expanded experimental efforts to investigate drug binding to proteins more broadly.

    View details for DOI 10.1021/acs.jcim.5b00030

    View details for Web of Science ID 000358821300020

  • Translational Bioinformatics: Linking the Molecular World to the Clinical World CLINICAL PHARMACOLOGY & THERAPEUTICS Altman, R. B. 2012; 91 (6): 994-1000


    Translational bioinformatics represents the union of translational medicine and bioinformatics. Translational medicine moves basic biological discoveries from the research bench into the patient-care setting and uses clinical observations to inform basic biology. It focuses on patient care, including the creation of new diagnostics, prognostics, prevention strategies, and therapies based on biological discoveries. Bioinformatics involves algorithms to represent, store, and analyze basic biological data, including DNA sequence, RNA expression, and protein and small-molecule abundance within cells. Translational bioinformatics spans these two fields; it involves the development of algorithms to analyze basic molecular and cellular data with an explicit goal of affecting clinical care.

    View details for DOI 10.1038/clpt.2012.49

    View details for Web of Science ID 000304245800017

    View details for PubMedID 22549287

  • Data-Driven Prediction of Drug Effects and Interactions SCIENCE TRANSLATIONAL MEDICINE Tatonetti, N. P., Ye, P. P., Daneshjou, R., Altman, R. B. 2012; 4 (125)


    Adverse drug events remain a leading cause of morbidity and mortality around the world. Many adverse events are not detected during clinical trials before a drug receives approval for use in the clinic. Fortunately, as part of postmarketing surveillance, regulatory agencies and other institutions maintain large collections of adverse event reports, and these databases present an opportunity to study drug effects from patient population data. However, confounding factors such as concomitant medications, patient demographics, patient medical histories, and reasons for prescribing a drug often are uncharacterized in spontaneous reporting systems, and these omissions can limit the use of quantitative signal detection methods used in the analysis of such data. Here, we present an adaptive data-driven approach for correcting these factors in cases for which the covariates are unknown or unmeasured and combine this approach with existing methods to improve analyses of drug effects using three test data sets. We also present a comprehensive database of drug effects (Offsides) and a database of drug-drug interaction side effects (Twosides). To demonstrate the biological use of these new resources, we used them to identify drug targets, predict drug indications, and discover drug class interactions. We then corroborated 47 (P < 0.0001) of the drug class interactions using an independent analysis of electronic medical records. Our analysis suggests that combined treatment with selective serotonin reuptake inhibitors and thiazides is associated with significantly increased incidence of prolonged QT intervals. We conclude that confounding effects from covariates in observational clinical data can be controlled in data analyses and thus improve the detection and prediction of adverse drug effects and interactions.

    View details for DOI 10.1126/scitranslmed.3003377

    View details for Web of Science ID 000301538300005

    View details for PubMedID 22422992

  • Clinical assessment incorporating a personal genome LANCET Ashley, E. A., Butte, A. J., Wheeler, M. T., Chen, R., Klein, T. E., Dewey, F. E., Dudley, J. T., Ormond, K. E., Pavlovic, A., Morgan, A. A., Pushkarev, D., Neff, N. F., Hudgins, L., Gong, L., Hodges, L. M., Berlin, D. S., Thorn, C. F., Sangkuhl, K., Hebert, J. M., Woon, M., Sagreiya, H., Whaley, R., Knowles, J. W., Chou, M. F., Thakuria, J. V., Rosenbaum, A. M., Zaranek, A. W., Church, G. M., Greely, H. T., Quake, S. R., Altman, R. B. 2010; 375 (9725): 1525-1535


    The cost of genomic information has fallen steeply, but the clinical translation of genetic risk estimates remains unclear. We aimed to undertake an integrated analysis of a complete human genome in a clinical context. We assessed a patient with a family history of vascular disease and early sudden death. Clinical assessment included analysis of this patient's full genome sequence, risk prediction for coronary artery disease, screening for causes of sudden cardiac death, and genetic counselling. Genetic analysis included the development of novel methods for the integration of whole genome and clinical risk. Disease and risk analysis focused on prediction of genetic risk of variants associated with mendelian disease, recognised drug responses, and pathogenicity for novel variants. We queried disease-specific mutation databases and pharmacogenomics databases to identify genes and mutations with known associations with disease and drug response. We estimated post-test probabilities of disease by applying likelihood ratios derived from integration of multiple common variants to age-appropriate and sex-appropriate pre-test probabilities. We also accounted for gene-environment interactions and conditionally dependent risks. Analysis of 2.6 million single nucleotide polymorphisms and 752 copy number variations showed increased genetic risk for myocardial infarction, type 2 diabetes, and some cancers. We discovered rare variants in three genes that are clinically associated with sudden cardiac death: TMEM43, DSP, and MYBPC3. A variant in LPA was consistent with a family history of coronary artery disease. The patient had a heterozygous null mutation in CYP2C19 suggesting probable clopidogrel resistance, several variants associated with a positive response to lipid-lowering therapy, and variants in CYP4F2 and VKORC1 that suggest he might have a low initial dosing requirement for warfarin. Many variants of uncertain importance were reported. Although challenges remain, our results suggest that whole-genome sequencing can yield useful and clinically relevant information for individual patients. Funding: National Institute of General Medical Sciences; National Heart, Lung And Blood Institute; National Human Genome Research Institute; Howard Hughes Medical Institute; National Library of Medicine, Lucile Packard Foundation for Children's Health; Hewlett Packard Foundation; Breetwor Family Foundation.

    View details for Web of Science ID 000277655100025

    View details for PubMedID 20435227

  • Estimation of the Warfarin Dose with Clinical and Pharmacogenetic Data NEW ENGLAND JOURNAL OF MEDICINE Klein, T. E., Altman, R. B., Eriksson, N., Gage, B. F., Kimmel, S. E., Lee, M. M., Limdi, N. A., Page, D., Roden, D. M., Wagner, M. J., Caldwell, M. D., Johnson, J. A., Chen, Y. T., Wen, M. S., Caraco, Y., Achache, I., Blotnick, S., Muszkat, M., Shin, J. G., Kim, H. S., Suarez-Kurtz, G., Perini, J. A., Silva-Assuncao, E., Anderson, J. L., Horne, B. D., Carlquist, J. F., Caldwell, M. D., Berg, R. L., Burmester, J. K., Goh, B. C., Lee, S. C., Kamali, F., Sconce, E., Daly, A. K., Wu, A. H., Langaee, T. Y., Feng, H., Cavallari, L., Momary, K., Pirmohamed, M., Jorgensen, A., Toh, C. H., Williamson, P., McLeod, H., Evans, J. P., Weck, K. E., Brensinger, C., Nakamura, Y., Mushiroda, T., Veenstra, D., Meckley, L., Rieder, M. J., Rettie, A. E., Wadelius, M., Melhus, H., Stein, C. M., Schwartz, U., Kurnik, D., Deych, E., Lenzini, P., Eby, C., Chen, L. Y., Deloukas, P., Motsinger-Reif, A., Sagreiya, H., Srinivasan, B. S., Lantz, E., Chang, T., Ritchie, M., Lu, L. S., Shin, J. G. 2009; 360 (8): 753-764


    Genetic variability among patients plays an important role in determining the dose of warfarin that should be used when oral anticoagulation is initiated, but practical methods of using genetic information have not been evaluated in a diverse and large population. We developed and used an algorithm for estimating the appropriate warfarin dose that is based on both clinical and genetic data from a broad population base. Clinical and genetic data from 4043 patients were used to create a dose algorithm that was based on clinical variables only and an algorithm in which genetic information was added to the clinical variables. In a validation cohort of 1009 subjects, we evaluated the potential clinical value of each algorithm by calculating the percentage of patients whose predicted dose of warfarin was within 20% of the actual stable therapeutic dose; we also evaluated other clinically relevant indicators. In the validation cohort, the pharmacogenetic algorithm accurately identified larger proportions of patients who required 21 mg of warfarin or less per week and of those who required 49 mg or more per week to achieve the target international normalized ratio than did the clinical algorithm (49.4% vs. 33.3%, P<0.001, among patients requiring ≤21 mg per week; and 24.8% vs. 7.2%, P<0.001, among those requiring ≥49 mg per week). The use of a pharmacogenetic algorithm for estimating the appropriate initial dose of warfarin produces recommendations that are significantly closer to the required stable therapeutic dose than those derived from a clinical algorithm or a fixed-dose approach. The greatest benefits were observed in the 46.2% of the population that required 21 mg or less of warfarin per week or 49 mg or more per week for therapeutic anticoagulation.

    View details for Web of Science ID 000263411300005

    View details for PubMedID 19228618

  • Large-scale extraction of gene interactions from full-text literature using DeepDive BIOINFORMATICS Mallory, E. K., Zhang, C., Re, C., Altman, R. B. 2016; 32 (1): 106-113
  • Human Germline CRISPR-Cas Modification: Toward a Regulatory Framework. American journal of bioethics Evitt, N. H., Mascharak, S., Altman, R. B. 2015; 15 (12): 25-29


    CRISPR germline editing therapies (CGETs) hold unprecedented potential to eradicate hereditary disorders. However, the prospect of altering the human germline has sparked a debate over the safety, efficacy, and morality of CGETs, triggering a funding moratorium by the NIH. There is an urgent need for practical paths for the evaluation of these capabilities. We propose a model regulatory framework for CGET research, clinical development, and distribution. Our model takes advantage of existing legal and regulatory institutions but adds elevated scrutiny at each stage of CGET development to accommodate the unique technical and ethical challenges posed by germline editing.

    View details for DOI 10.1080/15265161.2015.1104160

    View details for PubMedID 26632357

  • Unmet needs: Research helps regulators do their jobs SCIENCE TRANSLATIONAL MEDICINE Altman, R. B., Khuri, N., Salit, M., Giacomini, K. M. 2015; 7 (315)
  • Sequence to Medical Phenotypes: A Framework for Interpretation of Human Whole Genome DNA Sequence Data. PLoS genetics Dewey, F. E., Grove, M. E., Priest, J. R., Waggott, D., Batra, P., Miller, C. L., Wheeler, M., Zia, A., Pan, C., Karzcewski, K. J., Miyake, C., Whirl-Carrillo, M., Klein, T. E., Datta, S., Altman, R. B., Snyder, M., Quertermous, T., Ashley, E. A. 2015; 11 (10)


    High throughput sequencing has facilitated a precipitous drop in the cost of genomic sequencing, prompting predictions of a revolution in medicine via genetic personalization of diagnostic and therapeutic strategies. There are significant barriers to realizing this goal that are related to the difficult task of interpreting personal genetic variation. A comprehensive, widely accessible application for interpretation of whole genome sequence data is needed. Here, we present a series of methods for identification of genetic variants and genotypes with clinical associations, phasing genetic data and using Mendelian inheritance for quality control, and providing predictive genetic information about risk for rare disease phenotypes and response to pharmacological therapy in single individuals and father-mother-child trios. We demonstrate application of these methods for disease and drug response prognostication in whole genome sequence data from twelve unrelated adults, and for disease gene discovery in one father-mother-child trio with apparently simplex congenital ventricular arrhythmia. In doing so we identify clinically actionable inherited disease risk and drug response genotypes in pre-symptomatic individuals. We also nominate a new candidate gene in congenital arrhythmia, ATP2B4, and provide experimental evidence of a regulatory role for variants discovered using this framework.

    View details for DOI 10.1371/journal.pgen.1005496

    View details for PubMedID 26448358

  • PharmGKB summary: peginterferon-alpha pathway PHARMACOGENETICS AND GENOMICS Shuldiner, S. R., Gong, L., Muir, A. J., Altman, R. B., Klein, T. E. 2015; 25 (9): 465-474

    View details for DOI 10.1097/FPC.0000000000000158

    View details for Web of Science ID 000359645700006

    View details for PubMedID 26111151

  • High Resolution Prediction of Calcium-Binding Sites in 3D Protein Structures Using FEATURE. Journal of chemical information and modeling Zhou, W., Tang, G. W., Altman, R. B. 2015; 55 (8): 1663-1672


    Metal-binding proteins are ubiquitous in biological systems ranging from enzymes to cell surface receptors. Among the various biologically active metal ions, calcium plays a large role in regulating cellular and physiological changes. With the increasing number of high-quality crystal structures of proteins associated with their metal ion ligands, many groups have built models to identify Ca(2+) sites in proteins, utilizing information such as structure, geometry, or homology to do the inference. We present a FEATURE-based approach in building such a model and show that our model is able to discriminate between nonsites and calcium-binding sites with a very high precision of more than 98%. We demonstrate the high specificity of our model by applying it to test sets constructed from other ions. We also introduce an algorithm to convert high scoring regions into specific site predictions and demonstrate the usage by scanning a test set of 91 calcium-binding protein structures (190 calcium sites). The algorithm has a recall of more than 93% on the test set with predictions found within 3 Å of the actual sites.

    View details for DOI 10.1021/acs.jcim.5b00367

    View details for PubMedID 26226489

  • PharmGKB summary: pathways of acetaminophen metabolism at the therapeutic versus toxic doses PHARMACOGENETICS AND GENOMICS Mazaleuskaya, L. L., Sangkuhl, K., Thorn, C. F., FitzGerald, G. A., Altman, R. B., Klein, T. E. 2015; 25 (8): 416-426
  • An ontology for Autism Spectrum Disorder (ASD) to infer ASD phenotypes from Autism Diagnostic Interview-Revised data. Journal of biomedical informatics Mugzach, O., Peleg, M., Bagley, S. C., Guter, S. J., Cook, E. H., Altman, R. B. 2015; 56: 333-347


    Our goal is to create an ontology that will allow data integration and reasoning with subject data to classify subjects, and based on this classification, to infer new knowledge on Autism Spectrum Disorder (ASD) and related neurodevelopmental disorders (NDD). We take a first step toward this goal by extending an existing autism ontology to allow automatic inference of ASD phenotypes and Diagnostic & Statistical Manual of Mental Disorders (DSM) criteria based on subjects' Autism Diagnostic Interview-Revised (ADI-R) assessment data. Knowledge regarding diagnostic instruments, ASD phenotypes and risk factors was added to augment an existing autism ontology via Ontology Web Language class definitions and semantic web rules. We developed a custom Protégé plugin for enumerating combinatorial OWL axioms to support the many-to-many relations of ADI-R items to diagnostic categories in the DSM. We utilized a reasoner to infer whether 2642 subjects, whose data was obtained from the Simons Foundation Autism Research Initiative, meet DSM-IV-TR (DSM-IV) and DSM-5 diagnostic criteria based on their ADI-R data. We extended the ontology by adding 443 classes and 632 rules that represent phenotypes, along with their synonyms, environmental risk factors, and frequency of comorbidities. Applying the rules on the data set showed that the method produced accurate results: the true positive and true negative rates for inferring autistic disorder diagnosis according to DSM-IV criteria were 1 and 0.065, respectively; the true positive rate for inferring ASD based on DSM-5 criteria was 0.94. The ontology allows automatic inference of subjects' disease phenotypes and diagnosis with high accuracy. The ontology may benefit future studies by serving as a knowledge base for ASD. In addition, by adding knowledge of related NDDs, commonalities and differences in manifestations and risk factors could be automatically inferred, contributing to the understanding of ASD pathophysiology.

    View details for DOI 10.1016/j.jbi.2015.06.026

    View details for PubMedID 26151311

  • Assessment of the Radiation Effects of Cardiac CT Angiography Using Protein and Genetic Biomarkers JACC-CARDIOVASCULAR IMAGING Nguyen, P. K., Lee, W. H., Li, Y. F., Hong, W. X., Hu, S., Chan, C., Liang, G., Nguyen, I., Ong, S., Churko, J., Wang, J., Altman, R. B., Fleischmann, D., Wu, J. C. 2015; 8 (8): 873-884
  • PharmGKB summary: Efavirenz pathway, pharmacokinetics. Pharmacogenetics and genomics McDonagh, E. M., Lau, J. L., Alvarellos, M. L., Altman, R. B., Klein, T. E. 2015; 25 (7): 363-376

    View details for DOI 10.1097/FPC.0000000000000145

    View details for PubMedID 25966836

  • Evidence for Clinical Implementation of Pharmacogenomics in Cardiac Drugs. Mayo Clinic proceedings Kaufman, A. L., Spitz, J., Jacobs, M., Sorrentino, M., Yuen, S., Danahey, K., Saner, D., Klein, T. E., Altman, R. B., Ratain, M. J., O'Donnell, P. H. 2015; 90 (6): 716-729


    To comprehensively assess the pharmacogenomic evidence of routinely used drugs for clinical utility. Between January 2, 2011, and May 31, 2013, we assessed 71 drugs by identifying all drug/genetic variant combinations with published clinical pharmacogenomic evidence. Literature supporting each drug/variant pair was assessed for study design and methods, outcomes, statistical significance, and clinical relevance. Proposed clinical summaries were formally scored using a modified AGREE (Appraisal of Guidelines for Research and Evaluation) II instrument, including recommendation for or against guideline implementation. Positive pharmacogenomic findings were identified for 51 of 71 cardiovascular drugs (71.8%), representing 884 unique drug/variant pairs from 597 publications. After analysis for quality and clinical relevance, 92 drug/variant pairs were proposed for translation into clinical summaries, encompassing 23 drugs (32.4% of drugs reviewed). All were recommended for clinical implementation using AGREE II, with mean ± SD overall quality scores of 5.18±0.91 (of 7.0; range, 3.67-7.0). Drug guidelines had highest mean ± SD scores in AGREE II domain 1 (Scope) (91.9±6.1 of 100) and moderate but still robust mean ± SD scores in domain 3 (Rigor) (73.1±11.1), domain 4 (Clarity) (67.8±12.5), and domain 5 (Applicability) (65.8±10.0). Clopidogrel (CYP2C19), metoprolol (CYP2D6), simvastatin (rs4149056), dabigatran (rs2244613), hydralazine (rs1799983, rs1799998), and warfarin (CYP2C9/VKORC1) were distinguished by the highest scores. Seven of the 9 most commonly prescribed drugs warranted translation guidelines summarizing clinical pharmacogenomic information. Considerable clinically actionable pharmacogenomic information for cardiovascular drugs exists, supporting the idea that consideration of such information when prescribing is warranted.

    View details for DOI 10.1016/j.mayocp.2015.03.016

    View details for PubMedID 26046407

  • PharmGKB summary: very important pharmacogene information for human leukocyte antigen B PHARMACOGENETICS AND GENOMICS Barbarino, J. M., Kroetz, D. L., Klein, T. E., Altman, R. B. 2015; 25 (4): 205-221
  • PharmGKB summary: very important pharmacogene information for CFTR PHARMACOGENETICS AND GENOMICS McDonagh, E. M., Clancy, J. P., Altman, R. B., Klein, T. E. 2015; 25 (3): 149-156
  • Ranking Adverse Drug Reactions With Crowdsourcing JOURNAL OF MEDICAL INTERNET RESEARCH Gottlieb, A., Hoehndorf, R., Dumontier, M., Altman, R. B. 2015; 17 (3)

    View details for DOI 10.2196/jmir.3962

    View details for Web of Science ID 000356780900020

  • Variations in the binding pocket of an inhibitor of the bacterial division protein FtsZ across genotypes and species. PLoS computational biology Miguel, A., Hsin, J., Liu, T., Tang, G., Altman, R. B., Huang, K. C. 2015; 11 (3)


    The recent increase in antibiotic resistance in pathogenic bacteria calls for new approaches to drug-target selection and drug development. Targeting the mechanisms of action of proteins involved in bacterial cell division bypasses problems associated with increasingly ineffective variants of older antibiotics; to this end, the essential bacterial cytoskeletal protein FtsZ is a promising target. Recent work on its allosteric inhibitor, PC190723, revealed in vitro activity on Staphylococcus aureus FtsZ and in vivo antimicrobial activities. However, the mechanism of drug action and its effect on FtsZ in other bacterial species are unclear. Here, we examine the structural environment of the PC190723 binding pocket using PocketFEATURE, a statistical method that scores the similarity between pairs of small-molecule binding sites based on 3D structure information about the local microenvironment, and molecular dynamics (MD) simulations. We observed that species and nucleotide-binding state have significant impacts on the structural properties of the binding site, with substantially disparate microenvironments for bacterial species not from the Staphylococcus genus. Based on PocketFEATURE analysis of MD simulations of S. aureus FtsZ bound to GTP or with mutations that are known to confer PC190723 resistance, we predict that PC190723 strongly prefers to bind Staphylococcus FtsZ in the nucleotide-bound state. Furthermore, MD simulations of an FtsZ dimer indicated that polymerization may enhance PC190723 binding. Taken together, our results demonstrate that a drug-binding pocket can vary significantly across species, genetic perturbations, and in different polymerization states, yielding important information for the further development of FtsZ inhibitors.

    View details for DOI 10.1371/journal.pcbi.1004117

    View details for PubMedID 25811761

  • PharmGKB summary: ibuprofen pathways PHARMACOGENETICS AND GENOMICS Mazaleuskaya, L. L., Theken, K. N., Gong, L., Thorn, C. F., FitzGerald, G. A., Altman, R. B., Klein, T. E. 2015; 25 (2): 96-106
  • Enabling the curation of your pharmacogenetic study. Clinical pharmacology & therapeutics McDonagh, E. M., Whirl-Carrillo, M., Altman, R. B., Klein, T. E. 2015; 97 (2): 116-119


    As pharmacogenomics becomes integrated into clinical practice, curation of published studies becomes increasingly important. At the Pharmacogenomics Knowledgebase (PharmGKB), pharmacogenetic associations reported in published articles are manually curated and evaluated. Standard terminologies are used, making findings uniform and unambiguous. Lack of information, clarity, or standards in the original report can make it difficult or impossible to curate. We provide 10 rules to help authors ensure that their results are accurately captured and integrated.

    View details for DOI 10.1002/cpt.15

    View details for PubMedID 25670512

  • Using "big data" to dissect clinical heterogeneity. Circulation Altman, R. B., Ashley, E. A. 2015; 131 (3): 232-233

    View details for DOI 10.1161/CIRCULATIONAHA.114.014106

    View details for PubMedID 25601948

  • PharmGKB summary: very important pharmacogene information for CYP4F2 PHARMACOGENETICS AND GENOMICS Alvarellos, M. L., Sangkuhl, K., Daneshjou, R., Whirl-Carrillo, M., Altman, R. B., Klein, T. E. 2015; 25 (1): 41-47
  • Achieving high-sensitivity for clinical applications using augmented exome sequencing. Genome medicine Patwardhan, A., Harris, J., Leng, N., Bartha, G., Church, D. M., Luo, S., Haudenschild, C., Pratt, M., Zook, J., Salit, M., Tirch, J., Morra, M., Chervitz, S., Li, M., Clark, M., Garcia, S., Chandratillake, G., Kirk, S., Ashley, E., Snyder, M., Altman, R., Bustamante, C., Butte, A. J., West, J., Chen, R. 2015; 7 (1): 71-?


    Whole exome sequencing is increasingly used for the clinical evaluation of genetic disease, yet the variation of coverage and sensitivity over medically relevant parts of the genome remains poorly understood. Several sequencing-based assays continue to provide coverage that is inadequate for clinical assessment. Using sequence data obtained from the NA12878 reference sample and pre-defined lists of medically-relevant protein-coding and noncoding sequences, we compared the breadth and depth of coverage obtained among four commercial exome capture platforms and whole genome sequencing. In addition, we evaluated the performance of an augmented exome strategy, ACE, that extends coverage in medically relevant regions and enhances coverage in areas that are challenging to sequence. Leveraging reference call-sets, we also examined the effects of improved coverage on variant detection sensitivity. We observed coverage shortfalls with each of the conventional exome-capture and whole-genome platforms across several medically interpretable genes. These gaps included areas of the genome required for reporting recently established secondary findings (ACMG) and known disease-associated loci. The augmented exome strategy recovered many of these gaps, resulting in improved coverage in these areas. At clinically-relevant coverage levels (100 % bases covered at ≥20×), ACE improved coverage among genes in the medically interpretable genome (>90 % covered relative to 10-78 % with other platforms), the set of ACMG secondary finding genes (91 % covered relative to 4-75 % with other platforms) and a subset of variants known to be associated with human disease (99 % covered relative to 52-95 % with other platforms). Improved coverage translated into improvements in sensitivity, with ACE variant detection sensitivities (>97.5 % SNVs, >92.5 % InDels) exceeding those observed with conventional whole-exome and whole-genome platforms. Clinicians should consider analytical performance when making clinical assessments, given that even a few missed variants can lead to reporting false negative results. An augmented exome strategy provides a level of coverage not achievable with other platforms, thus addressing concerns regarding the lack of sensitivity in clinically important regions. In clinical applications where comprehensive coverage of medically interpretable areas of the genome requires higher localized sequencing depth, an augmented exome approach offers both cost and performance advantages over other sequencing-based tests.

    View details for DOI 10.1186/s13073-015-0197-4

    View details for PubMedID 26269718

  • Genomics in the clinic: ethical and policy challenges in clinical next-generation sequencing programs at early adopter USA institutions PERSONALIZED MEDICINE Milner, L. C., Garrison, N. A., Cho, M. K., Altman, R. B., Hudgins, L., Galli, S. J., Lowe, H. J., Schrijver, I., Magnus, D. C. 2015; 12 (3): 269-282

    View details for DOI 10.2217/PME.14.88

    View details for Web of Science ID 000355751600011

  • Ranking adverse drug reactions with crowdsourcing. Journal of medical Internet research Gottlieb, A., Hoehndorf, R., Dumontier, M., Altman, R. B. 2015; 17 (3)


    There is no publicly available resource that provides the relative severity of adverse drug reactions (ADRs). Such a resource would be useful for several applications, including assessment of the risks and benefits of drugs and improvement of patient-centered care. It could also be used to triage predictions of drug adverse events. The intent of the study was to rank ADRs according to severity. We used Internet-based crowdsourcing to rank ADRs according to severity. We assigned 126,512 pairwise comparisons of ADRs to 2589 Amazon Mechanical Turk workers and used these comparisons to rank order 2929 ADRs. There is good correlation (rho=.53) between the mortality rates associated with ADRs and their rank. Our ranking highlights severe drug-ADR predictions, such as cardiovascular ADRs for raloxifene and celecoxib. It also triages genes associated with severe ADRs such as epidermal growth-factor receptor (EGFR), associated with glioblastoma multiforme, and SCN1A, associated with epilepsy. ADR ranking lays a first stepping stone in personalized drug risk assessment. Ranking of ADRs using crowdsourcing may have useful clinical and financial implications, and should be further investigated in the context of health care decision making.

    View details for DOI 10.2196/jmir.3962

    View details for PubMedID 25800813

  • A twentieth anniversary tribute to PSB. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Hewett, D., Whirl-Carrillo, M., Hunter, L. E., Altman, R. B., Klein, T. E. 2015; 20: 1-7


    PSB brings together top researchers from around the world to exchange research results and address open issues in all aspects of computational biology. PSB 2015 marks the twentieth anniversary of PSB. Reaching a milestone year is an accomplishment well worth celebrating. It is long enough to have seen big changes occur, but recent enough to be relevant for today. As PSB celebrates twenty years of service, we would like to take this opportunity to congratulate the PSB community for your success. We would also like the community to join us in a time of celebration and reflection on this accomplishment.

    View details for PubMedID 25592562

  • PharmGKB summary: gemcitabine pathway PHARMACOGENETICS AND GENOMICS Alvarellos, M. L., Lamba, J., Sangkuhl, K., Thorn, C. F., Wang, L., Klein, D. J., Altman, R. B., Klein, T. E. 2014; 24 (11): 564-574
  • Genetic variant in folate homeostasis is associated with lower warfarin dose in African Americans BLOOD Daneshjou, R., Gamazon, E. R., Burkley, B., Cavallari, L. H., Johnson, J. A., Klein, T. E., Limdi, N., Hillenmeyer, S., Percha, B., Karczewski, K. J., Langaee, T., Patel, S. R., Bustamante, C. D., Altman, R. B., Perera, M. A. 2014; 124 (14): 2298-2305
  • PharmGKB summary: uric acid-lowering drugs pathway, pharmacodynamics PHARMACOGENETICS AND GENOMICS McDonagh, E. M., Thorn, C. F., Callaghan, J. T., Altman, R. B., Klein, T. E. 2014; 24 (9): 464-476
  • PharmGKB summary: very important pharmacogene information for N-acetyltransferase 2 PHARMACOGENETICS AND GENOMICS McDonagh, E. M., Boukouvala, S., Aklillu, E., Hein, D. W., Altman, R. B., Klein, T. E. 2014; 24 (8): 409-425
  • Interpreting the CYP2D6 results from the International Tamoxifen Pharmacogenetics Consortium. Clinical pharmacology & therapeutics Province, M. A., Altman, R. B., Klein, T. E. 2014; 96 (2): 144-146

    View details for DOI 10.1038/clpt.2014.100

    View details for PubMedID 25056393

  • PharmGKB summary: tramadol pathway PHARMACOGENETICS AND GENOMICS Gong, L., Stamer, U. M., Tzvetkov, M. V., Altman, R. B., Klein, T. E. 2014; 24 (7): 374-380
  • Integrating Systems Biology Sources Illuminates Drug Action CLINICAL PHARMACOLOGY & THERAPEUTICS Gottlieb, A., Altman, R. B. 2014; 95 (6): 663-669


    There are significant gaps in our understanding of the pathways by which drugs act. This incomplete knowledge limits our ability to use mechanistic molecular information rationally to repurpose drugs, understand their side effects, and predict their interactions with other drugs. Here, we present DrugRouter, a novel method for generating drug-specific pathways of action by linking target genes, disease genes, and pharmacogenes using gene interaction networks. We construct pathways for more than a hundred drugs and show that the genes included in our pathways (i) co-occur with the query drug in the literature, (ii) significantly overlap or are adjacent to known drug-response pathways, and (iii) are adjacent to genes that are hits in genome-wide association studies assessing drug response. Finally, these computed pathways suggest novel drug-repositioning opportunities (e.g., statins for follicular thyroid cancer), gene-side effect associations, and gene-drug interactions. Thus, DrugRouter generates hypotheses about drug actions using systems biology data.

    View details for DOI 10.1038/clpt.2014.51

    View details for Web of Science ID 000336415300030

    View details for PubMedID 24577151

  • PharmGKB summary: very important pharmacogene information for SLC22A1 PHARMACOGENETICS AND GENOMICS Goswami, S., Gong, L., Giacomini, K., Altman, R. B., Klein, T. E. 2014; 24 (6): 324-328
  • Reconstruction of the Mouse Otocyst and Early Neuroblast Lineage at Single-Cell Resolution CELL Durruthy-Durruthy, R., Gottlieb, A., Hartman, B. H., Waldhaus, J., Laske, R. D., Altman, R., Heller, S. 2014; 157 (4): 964-978


    The otocyst harbors progenitors for most cell types of the mature inner ear. Developmental lineage analyses and gene expression studies suggest that distinct progenitor populations are compartmentalized to discrete axial domains in the early otocyst. Here, we conducted highly parallel quantitative RT-PCR measurements on 382 individual cells from the developing otocyst and neuroblast lineages to assay 96 genes representing established otic markers, signaling-pathway-associated transcripts, and novel otic-specific genes. By applying multivariate cluster, principal component, and network analyses to the data matrix, we were able to readily distinguish the delaminating neuroblasts and to describe progressive states of gene expression in this population at single-cell resolution. It further established a three-dimensional model of the otocyst in which each individual cell can be precisely mapped into spatial expression domains. Our bioinformatic modeling revealed spatial dynamics of different signaling pathways active during early neuroblast development and prosensory domain specification.

    View details for DOI 10.1016/j.cell.2014.03.036

    View details for Web of Science ID 000335765500022

  • PharmGKB summary: abacavir pathway PHARMACOGENETICS AND GENOMICS Barbarino, J. M., Kroetz, D. L., Altman, R. B., Klein, T. E. 2014; 24 (5): 276-282
  • Genotype-guided dosing of vitamin K antagonists. New England journal of medicine Daneshjou, R., Klein, T. E., Altman, R. B. 2014; 370 (18): 1762-1763

    View details for DOI 10.1056/NEJMc1402521

    View details for PubMedID 24785217

  • Guidelines for investigating causality of sequence variants in human disease NATURE MacArthur, D. G., Manolio, T. A., Dimmock, D. P., Rehm, H. L., Shendure, J., Abecasis, G. R., Adams, D. R., Altman, R. B., Antonarakis, S. E., Ashley, E. A., Barrett, J. C., Biesecker, L. G., Conrad, D. F., Cooper, G. M., Cox, N. J., Daly, M. J., Gerstein, M. B., Goldstein, D. B., Hirschhorn, J. N., Leal, S. M., Pennacchio, L. A., Stamatoyannopoulos, J. A., Sunyaev, S. R., Valle, D., Voight, B. F., Winckler, W., Gunter, C. 2014; 508 (7497): 469-476


    The discovery of rare genetic variants is accelerating, and clear guidelines for distinguishing disease-causing sequence variants from the many potentially functional variants present in any human genome are urgently needed. Without rigorous standards we risk an acceleration of false-positive reports of causality, which would impede the translation of genomic research findings into the clinical diagnostic setting and hinder biological understanding of disease. Here we discuss the key challenges of assessing sequence variants in human disease, integrating both gene-level and variant-level support for causality. We propose guidelines for summarizing confidence in variant pathogenicity and highlight several areas that require further resource development.

    View details for DOI 10.1038/nature13127

    View details for Web of Science ID 000334741600026

  • Knowledge-based fragment binding prediction. PLoS computational biology Tang, G. W., Altman, R. B. 2014; 10 (4)


    Target-based drug discovery must assess many drug-like compounds for potential activity. Focusing on low-molecular-weight compounds (fragments) can dramatically reduce the chemical search space. However, approaches for determining protein-fragment interactions have limitations. Experimental assays are time-consuming, expensive, and not always applicable. At the same time, computational approaches using physics-based methods have limited accuracy. With increasing high-resolution structural data for protein-ligand complexes, there is now an opportunity for data-driven approaches to fragment binding prediction. We present FragFEATURE, a machine learning approach to predict small molecule fragments preferred by a target protein structure. We first create a knowledge base of protein structural environments annotated with the small molecule substructures they bind. These substructures have low-molecular weight and serve as a proxy for fragments. FragFEATURE then compares the structural environments within a target protein to those in the knowledge base to retrieve statistically preferred fragments. It merges information across diverse ligands with shared substructures to generate predictions. Our results demonstrate FragFEATURE's ability to rediscover fragments corresponding to the ligand bound with 74% precision and 82% recall on average. For many protein targets, it identifies high scoring fragments that are substructures of known inhibitors. FragFEATURE thus predicts fragments that can serve as inputs to fragment-based drug design or serve as refinement criteria for creating target-specific compound libraries for experimental or computational screening.

    View details for DOI 10.1371/journal.pcbi.1003589

    View details for PubMedID 24762971

  • Clinical interpretation and implications of whole-genome sequencing. JAMA Dewey, F. E., Grove, M. E., Pan, C., Goldstein, B. A., Bernstein, J. A., Chaib, H., Merker, J. D., Goldfeder, R. L., Enns, G. M., David, S. P., Pakdaman, N., Ormond, K. E., Caleshu, C., Kingham, K., Klein, T. E., Whirl-Carrillo, M., Sakamoto, K., Wheeler, M. T., Butte, A. J., Ford, J. M., Boxer, L., Ioannidis, J. P., Yeung, A. C., Altman, R. B., Assimes, T. L., Snyder, M., Ashley, E. A., Quertermous, T. 2014; 311 (10): 1035-1045


    Whole-genome sequencing (WGS) is increasingly applied in clinical medicine and is expected to uncover clinically significant findings regardless of sequencing indication. To examine coverage and concordance of clinically relevant genetic variation provided by WGS technologies; to quantitate inherited disease risk and pharmacogenomic findings in WGS data and resources required for their discovery and interpretation; and to evaluate clinical action prompted by WGS findings. An exploratory study of 12 adult participants recruited at Stanford University Medical Center who underwent WGS between November 2011 and March 2012. A multidisciplinary team reviewed all potentially reportable genetic findings. Five physicians proposed initial clinical follow-up based on the genetic findings. Genome coverage and sequencing platform concordance in different categories of genetic disease risk, person-hours spent curating candidate disease-risk variants, interpretation agreement between trained curators and disease genetics databases, burden of inherited disease risk and pharmacogenomic findings, and burden and interrater agreement of proposed clinical follow-up. Depending on sequencing platform, 10% to 19% of inherited disease genes were not covered to accepted standards for single nucleotide variant discovery. Genotype concordance was high for previously described single nucleotide genetic variants (99%-100%) but low for small insertion/deletion variants (53%-59%). Curation of 90 to 127 genetic variants in each participant required a median of 54 minutes (range, 5-223 minutes) per genetic variant, resulted in moderate classification agreement between professionals (Gross κ, 0.52; 95% CI, 0.40-0.64), and reclassified 69% of genetic variants cataloged as disease causing in mutation databases to variants of uncertain or lesser significance. Two to 6 personal disease-risk findings were discovered in each participant, including 1 frameshift deletion in the BRCA1 gene implicated in hereditary breast and ovarian cancer. Physician review of sequencing findings prompted consideration of a median of 1 to 3 initial diagnostic tests and referrals per participant, with fair interrater agreement about the suitability of WGS findings for clinical follow-up (Fleiss κ, 0.24; P < .001). In this exploratory study of 12 volunteer adults, the use of WGS was associated with incomplete coverage of inherited disease genes, low reproducibility of detection of genetic variation with the highest potential clinical effects, and uncertainty about clinically reportable findings. In certain cases, WGS will identify clinically actionable genetic variants warranting early medical intervention. These issues should be considered when determining the role of WGS in clinical medicine.

    View details for DOI 10.1001/jama.2014.1717

    View details for PubMedID 24618965

  • PharmGKB summary: very important pharmacogene information for UGT1A1 PHARMACOGENETICS AND GENOMICS Barbarino, J. M., Haidar, C. E., Klein, T. E., Altman, R. B. 2014; 24 (3): 177-183
  • Environmental and State-Level Regulatory Factors Affect the Incidence of Autism and Intellectual Disability PLOS COMPUTATIONAL BIOLOGY Rzhetsky, A., Bagley, S. C., Wang, K., Lyttle, C. S., Cook, E. H., Altman, R. B., Gibbons, R. D. 2014; 10 (3)


    Many factors affect the risks for neurodevelopmental maladies such as autism spectrum disorders (ASD) and intellectual disability (ID). To compare environmental, phenotypic, socioeconomic and state-policy factors in a unified geospatial framework, we analyzed the spatial incidence patterns of ASD and ID using an insurance claims dataset covering nearly one third of the US population. Following epidemiologic evidence, we used the rate of congenital malformations of the reproductive system as a surrogate for environmental exposure of parents to unmeasured developmental risk factors, including toxins. Adjusted for gender, ethnic, socioeconomic, and geopolitical factors, the ASD incidence rates were strongly linked to population-normalized rates of congenital malformations of the reproductive system in males (an increase in ASD incidence by 283% for every percent increase in incidence of malformations, 95% CI: [91%, 576%], p<6×10⁻⁵). Such congenital malformations were barely significant for ID (94% increase, 95% CI: [1%, 250%], p = 0.0384). Other congenital malformations in males (excluding those affecting the reproductive system) appeared to significantly affect both phenotypes: 31.8% ASD rate increase (CI: [12%, 52%], p<6×10⁻⁵), and 43% ID rate increase (CI: [23%, 67%], p<6×10⁻⁵). Furthermore, the state-mandated rigor of diagnosis of ASD by a pediatrician or clinician for consideration in the special education system was predictive of a considerable decrease in ASD and ID incidence rates (98.6%, CI: [28%, 99.99%], p = 0.02475 and 99% CI: [68%, 99.99%], p = 0.00637 respectively). Thus, the observed spatial variability of both ID and ASD rates is associated with environmental and state-level regulatory factors; the magnitude of influence of compound environmental predictors was approximately three times greater than that of state-level incentives. The estimated county-level random effects exhibited marked spatial clustering, strongly indicating existence of as yet unidentified localized factors driving apparent disease incidence. Finally, we found that the rates of ASD and ID at the county level were weakly but significantly correlated (Pearson product-moment correlation 0.0589, p = 0.00101), while for females the correlation was much stronger (0.197, p<2.26×10⁻¹⁶).

    View details for DOI 10.1371/journal.pcbi.1003518

    View details for Web of Science ID 000336509000034

    View details for PubMedID 24625521

  • Coherent functional modules improve transcription factor target identification, cooperativity prediction, and disease association. PLoS genetics Karczewski, K. J., Snyder, M., Altman, R. B., Tatonetti, N. P. 2014; 10 (2)


    Transcription factors (TFs) are fundamental controllers of cellular regulation that function in a complex and combinatorial manner. Accurate identification of a transcription factor's targets is essential to understanding the role that factors play in disease biology. However, due to a high false positive rate, identifying coherent functional target sets is difficult. We have created an improved mapping of targets by integrating ChIP-Seq data with 423 functional modules derived from 9,395 human expression experiments. We identified 5,002 TF-module relationships, significantly improved TF target prediction, and found 30 high-confidence TF-TF associations, of which 14 are known. Importantly, we also connected TFs to diseases through these functional modules and identified 3,859 significant TF-disease relationships. As an example, we found a link between MEF2A and Crohn's disease, which we validated in an independent expression dataset. These results show the power of combining expression data and ChIP-Seq data to remove noise and better extract the associations between TFs, functional modules, and disease.

    View details for DOI 10.1371/journal.pgen.1004122

    View details for PubMedID 24516403

  • CYP2D6 Genotype and Adjuvant Tamoxifen: Meta-Analysis of Heterogeneous Study Populations CLINICAL PHARMACOLOGY & THERAPEUTICS Province, M. A., Goetz, M. P., Brauch, H., Flockhart, D. A., Hebert, J. M., Whaley, R., Suman, V. J., Schroth, W., Winter, S., Zembutsu, H., Mushiroda, T., Newman, W. G., Lee, M. M., Ambrosone, C. B., Beckmann, M. W., Choi, J., Dieudonne, A., Fasching, P. A., Ferraldeschi, R., Gong, L., Haschke-Becher, E., Howel, A., Jordan, L. B., Hamann, U., Kiyotani, K., Krippl, P., Lambrechts, D., Latif, A., Langsenlehner, U., Lorizio, W., Neven, P., Nguyen, A. T., Park, B., Purdie, C. A., Quinlan, P., Renner, W., Schmidt, M., Schwab, M., Shin, J., Stingl, J. C., Wegman, P., Wingren, S., Wu, A. H., Ziv, E., Zirpoli, G., Thompson, A. M., Jordan, V. C., Nakamura, Y., Altman, R. B., Ames, M. M., Weinshilboum, R. M., Eichelbaum, M., Ingle, J. N., Klein, T. E. 2014; 95 (2): 216-227


    The International Tamoxifen Pharmacogenomics Consortium was established to address the controversy regarding cytochrome P450 2D6 (CYP2D6) status and clinical outcomes in tamoxifen therapy. We performed a meta-analysis on data from 4,973 tamoxifen-treated patients (12 globally distributed sites). Using strict eligibility requirements (postmenopausal women with estrogen receptor-positive breast cancer, receiving 20 mg/day tamoxifen for 5 years, criterion 1), CYP2D6 poor metabolizer status was associated with poorer invasive disease-free survival (IDFS: hazard ratio = 1.25; 95% confidence interval = 1.06, 1.47; P = 0.009). However, CYP2D6 status was not statistically significant when tamoxifen duration, menopausal status, and annual follow-up were not specified (criterion 2, n = 2,443; P = 0.25) or when no exclusions were applied (criterion 3, n = 4,935; P = 0.38). Although CYP2D6 is a strong predictor of IDFS using strict inclusion criteria, because the results are not robust to inclusion criteria (these were not defined a priori), prospective studies are necessary to fully establish the value of CYP2D6 genotyping in tamoxifen therapy.

    View details for DOI 10.1038/clpt.2013.186

    View details for Web of Science ID 000330151100026

    View details for PubMedID 24060820

  • PharmGKB summary: ifosfamide pathways, pharmacokinetics and pharmacodynamics PHARMACOGENETICS AND GENOMICS Lowenberg, D., Thorn, C. F., Desta, Z., Flockhart, D. A., Altman, R. B., Klein, T. E. 2014; 24 (2): 133-138

    View details for DOI 10.1097/FPC.0000000000000019

    View details for Web of Science ID 000330879300006

    View details for PubMedID 24401834

  • PharmGKB summary: venlafaxine pathway PHARMACOGENETICS AND GENOMICS Sangkuhl, K., Stingl, J. C., Turpeinen, M., Altman, R. B., Klein, T. E. 2014; 24 (1): 62-72
  • High precision prediction of functional sites in protein structures. PloS one Buturovic, L., Wong, M., Tang, G. W., Altman, R. B., Petkovic, D. 2014; 9 (3)


    We address the problem of assigning biological function to solved protein structures. Computational tools play a critical role in identifying potential active sites and informing screening decisions for further lab analysis. A critical parameter in the practical application of computational methods is the precision, or positive predictive value. Precision measures the level of confidence the user should have in a particular computed functional assignment. Low precision annotations lead to futile laboratory investigations and waste scarce research resources. In this paper we describe an advanced version of the protein function annotation system FEATURE, which achieved 99% precision and average recall of 95% across 20 representative functional sites. The system uses a Support Vector Machine classifier operating on the microenvironment of physicochemical features around an amino acid. We also compared performance of our method with state-of-the-art sequence-level annotator Pfam in terms of precision, recall and localization. To our knowledge, no other functional site annotator has been rigorously evaluated against these key criteria. The software and predictive models are incorporated into the WebFEATURE service.

    View details for DOI 10.1371/journal.pone.0091240

    View details for PubMedID 24632601

  • Identifying druggable targets by protein microenvironments matching: application to transcription factors. CPT: pharmacometrics & systems pharmacology Liu, T., Altman, R. B. 2014; 3


    Druggability of a protein is its potential to be modulated by drug-like molecules. It is important in the target selection phase. We hypothesize that: (i) known drug-binding sites contain advantageous physicochemical properties for drug binding, or "druggable microenvironments" and (ii) given a target, the presence of multiple druggable microenvironments similar to those seen previously is associated with a high likelihood of druggability. We developed DrugFEATURE to quantify druggability by assessing the microenvironments in potential small-molecule binding sites. We benchmarked DrugFEATURE using two data sets. One data set measures druggability using NMR-based screening. DrugFEATURE correlates well with this metric. The second data set is based on historical drug discovery outcomes. Using the DrugFEATURE cutoffs derived from the first, we accurately discriminated druggable and difficult targets in the second. We further identified novel druggable transcription factors with implications for cancer therapy. DrugFEATURE provides useful insight for drug discovery, by evaluating druggability and suggesting specific regions for interacting with drug-like molecules. CPT: Pharmacometrics Systems Pharmacology (2014) 3, e93; doi:10.1038/psp.2013.66; published online 22 January 2014.

    View details for DOI 10.1038/psp.2013.66

    View details for PubMedID 24452614

  • PATH-SCAN: a reporting tool for identifying clinically actionable variants. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Daneshjou, R., Zappala, Z., Kukurba, K., Boyle, S. M., Ormond, K. E., Klein, T. E., Snyder, M., Bustamante, C. D., Altman, R. B., Montgomery, S. B. 2014; 19: 229-240


    The American College of Medical Genetics and Genomics (ACMG) recently released guidelines regarding the reporting of incidental findings in sequencing data. Given the availability of Direct to Consumer (DTC) genetic testing and the falling cost of whole exome and genome sequencing, individuals will increasingly have the opportunity to analyze their own genomic data. We have developed a web-based tool, PATH-SCAN, which annotates individual genomes and exomes for ClinVar designated pathogenic variants found within the genes from the ACMG guidelines. Because mutations in these genes predispose individuals to conditions with actionable outcomes, our tool will allow individuals or researchers to identify potential risk variants in order to consult physicians or genetic counselors for further evaluation. Moreover, our tool allows individuals to anonymously submit their pathogenic burden, so that we can crowd source the collection of quantitative information regarding the frequency of these variants. We tested our tool on 1092 publicly available genomes from the 1000 Genomes project, 163 genomes from the Personal Genome Project, and 15 genomes from a clinical genome sequencing research project. Excluding the most commonly seen variant in 1000 Genomes, about 20% of all genomes analyzed had a ClinVar designated pathogenic variant that required further evaluation.

    View details for PubMedID 24297550

  • PharmGKB summary: mycophenolic acid pathway PHARMACOGENETICS AND GENOMICS Lamba, V., Sangkuhl, K., Sanghavi, K., Fish, A., Altman, R. B., Klein, T. E. 2014; 24 (1): 73-79

    View details for DOI 10.1097/FPC.0000000000000010

    View details for Web of Science ID 000328629800009

    View details for PubMedID 24220207

  • Cloud-based simulations on Google Exacycle reveal ligand modulation of GPCR activation pathways NATURE CHEMISTRY Kohlhoff, K. J., Shukla, D., Lawrenz, M., Bowman, G. R., Konerding, D. E., Belov, D., Altman, R. B., Pande, V. S. 2014; 6 (1): 15-21


    Simulations can provide tremendous insight into the atomistic details of biological mechanisms, but micro- to millisecond timescales are historically only accessible on dedicated supercomputers. We demonstrate that cloud computing is a viable alternative that brings long-timescale processes within reach of a broader community. We used Google's Exacycle cloud-computing platform to simulate two milliseconds of dynamics of a major drug target, the G-protein-coupled receptor β2AR. Markov state models aggregate independent simulations into a single statistical model that is validated by previous computational and experimental results. Moreover, our models provide an atomistic description of the activation of a G-protein-coupled receptor and reveal multiple activation pathways. Agonists and inverse agonists interact differentially with these pathways, with profound implications for drug design.

    View details for DOI 10.1038/NCHEM.1821

    View details for Web of Science ID 000328951000007

    View details for PubMedID 24345941

  • Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Lyalina, S., Percha, B., LePendu, P., Iyer, S. V., Altman, R. B., Shah, N. H. 2013; 20 (E2): E297-E305
  • PharmGKB summary: very important pharmacogene information for cytochrome P450, family 2, subfamily C, polypeptide 8 PHARMACOGENETICS AND GENOMICS Aquilante, C. L., Niemi, M., Gong, L., Altman, R. B., Klein, T. E. 2013; 23 (12): 721-728

    View details for DOI 10.1097/FPC.0b013e3283653b27

    View details for Web of Science ID 000326971400009

    View details for PubMedID 23962911

  • Genome Wide Analysis of Drug-Induced Torsades de Pointes: Lack of Common Variants with Large Effect Sizes PLOS ONE Behr, E. R., Ritchie, M. D., Tanaka, T., Kaeaeb, S., Crawford, D. C., Nicoletti, P., Floratos, A., Sinner, M. F., Kannankeril, P. J., Wilde, A. A., Bezzina, C. R., Schulze-Bahr, E., Zumhagen, S., Guicheney, P., Bishopric, N. H., Marshall, V., Shakir, S., Dalageorgou, C., Bevan, S., Jamshidi, Y., Bastiaenen, R., Myerburg, R. J., Schott, J., Camm, A. J., Steinbeck, G., Norris, K., Altman, R. B., Tatonetti, N. P., Jeffery, S., Kubo, M., Nakamura, Y., Shen, Y., George, A. L., Roden, D. M. 2013; 8 (11)


    Marked prolongation of the QT interval on the electrocardiogram associated with the polymorphic ventricular tachycardia Torsades de Pointes is a serious adverse event during treatment with antiarrhythmic drugs and other culprit medications, and is a common cause for drug relabeling and withdrawal. Although clinical risk factors have been identified, the syndrome remains unpredictable in an individual patient. Here we used genome-wide association analysis to search for common predisposing genetic variants. Cases of drug-induced Torsades de Pointes (diTdP), treatment-tolerant controls, and general population controls were ascertained across multiple sites using common definitions, and genotyped on the Illumina 610k or 1M-Duo BeadChips. Principal Components Analysis was used to select 216 Northwestern European diTdP cases and 771 ancestry-matched controls, including treatment-tolerant and general population subjects. With these sample sizes, there is 80% power to detect a variant at genome-wide significance with minor allele frequency of 10% and conferring an odds ratio of ≥2.7. Tests of association were carried out for each single nucleotide polymorphism (SNP) by logistic regression adjusting for gender and population structure. No SNP reached genome-wide significance; the variant with the lowest P value was rs2276314, a non-synonymous coding variant in C18orf21 (p = 3×10⁻⁷, odds ratio = 2, 95% confidence interval: 1.5-2.6). The haplotype formed by rs2276314 and a second SNP, rs767531, was significantly more frequent in controls than cases (p = 3×10⁻⁹). Expanding the number of controls and a gene-based analysis did not yield significant associations. This study argues that common genomic variants do not contribute importantly to risk for drug-induced Torsades de Pointes across multiple drugs.

    View details for DOI 10.1371/journal.pone.0078511

    View details for Web of Science ID 000326656200047

    View details for PubMedID 24223155

  • PharmGKB summary: tamoxifen pathway, pharmacokinetics PHARMACOGENETICS AND GENOMICS Klein, D. J., Thorn, C. F., Desta, Z., Flockhart, D. A., Altman, R. B., Klein, T. E. 2013; 23 (11): 643-647
  • PharmGKB summary: very important pharmacogene information for the epidermal growth factor receptor PHARMACOGENETICS AND GENOMICS Hodoglugil, U., Carrillo, M. W., Hebert, J. M., Karachaliou, N., Rosell, R. C., Altman, R. B., Klein, T. E. 2013; 23 (11): 636-642
  • Using molecular features of xenobiotics to predict hepatic gene expression response. Journal of chemical information and modeling Fernald, G. H., Altman, R. B. 2013; 53 (10): 2765-2773


    Despite recent advances in molecular medicine and rational drug design, many drugs still fail because toxic effects arise at the cellular and tissue level. In order to better understand these effects, cellular assays can generate high-throughput measurements of gene expression changes induced by small molecules. However, our understanding of how the chemical features of small molecules influence gene expression is very limited. Therefore, we investigated the extent to which chemical features of small molecules can reliably be associated with significant changes in gene expression. Specifically, we analyzed the gene expression response of rat liver cells to 170 different drugs and searched for genes whose expression could be related to chemical features alone. Surprisingly, we can predict the up-regulation of 87 genes (increased expression of at least 1.5 times compared to controls). We show an average cross-validation predictive area under the receiver operating characteristic curve (AUROC) of 0.7 or greater for each of these 87 genes. We applied our method to an external data set of rat liver gene expression response to a novel drug and achieved an AUROC of 0.7. We also validated our approach by predicting up-regulation of Cytochrome P450 1A2 (CYP1A2) in three drugs known to induce CYP1A2 that were not in our data set. Finally, a detailed analysis of the CYP1A2 predictor allowed us to identify which fragments made significant contributions to the predictive scores.

    View details for DOI 10.1021/ci3005868

    View details for PubMedID 24010729

  • PharmGKB summary: cyclosporine and tacrolimus pathways PHARMACOGENETICS AND GENOMICS Barbarino, J. M., Staatz, C. E., Venkataramanan, R., Klein, T. E., Altman, R. B. 2013; 23 (10): 563-585

    View details for DOI 10.1097/FPC.0b013e328364db84

    View details for Web of Science ID 000324527600007

    View details for PubMedID 23922006

  • A method for inferring medical diagnoses from patient similarities BMC MEDICINE Gottlieb, A., Stein, G. Y., Ruppin, E., Altman, R. B., Sharan, R. 2013; 11
  • PharmGKB summary: methylene blue pathway PHARMACOGENETICS AND GENOMICS McDonagh, E. M., Bautista, J. M., Youngster, I., Altman, R. B., Klein, T. E. 2013; 23 (9): 498-508

    View details for DOI 10.1097/FPC.0b013e32836498f4

    View details for Web of Science ID 000323220200007

    View details for PubMedID 23913015

  • Genetic variants associated with warfarin dose in African-American individuals: a genome-wide association study LANCET Perera, M. A., Cavallari, L. H., Limdi, N. A., Gamazon, E. R., Konkashbaev, A., Daneshjou, R., Pluzhnikov, A., Crawford, D. C., Wang, J., Liu, N., Tatonetti, N., Bourgeois, S., Takahashi, H., Bradford, Y., Burkley, B. M., Desnick, R. J., Halperin, J. L., Khalifa, S. I., Langaee, T. Y., Lubitz, S. A., Nutescu, E. A., Oetjens, M., Shahin, M. H., Patel, S. R., Sagreiya, H., Tector, M., Weck, K. E., Rieder, M. J., Scott, S. A., Wu, A. H., Burmester, J. K., Wadelius, M., Deloukas, P., Wagner, M. J., Mushiroda, T., Kubo, M., Roden, D. M., Cox, N. J., Altman, R. B., Klein, T. E., Nakamura, Y., Johnson, J. A. 2013; 382 (9894): 790-796


    BACKGROUND: VKORC1 and CYP2C9 are important contributors to warfarin dose variability, but explain less variability for individuals of African descent than for those of European or Asian descent. We aimed to identify additional variants contributing to warfarin dose requirements in African Americans. METHODS: We did a genome-wide association study of discovery and replication cohorts. Samples from African-American adults (aged ≥18 years) who were taking a stable maintenance dose of warfarin were obtained at International Warfarin Pharmacogenetics Consortium (IWPC) sites and the University of Alabama at Birmingham (Birmingham, AL, USA). Patients enrolled at IWPC sites but who were not used for discovery made up the independent replication cohort. All participants were genotyped. We did a stepwise conditional analysis, conditioning first for VKORC1 -1639G→A, followed by the composite genotype of CYP2C9*2 and CYP2C9*3. We prespecified a genome-wide significance threshold of p<5×10⁻⁸ in the discovery cohort and p<0·0038 in the replication cohort. FINDINGS: The discovery cohort contained 533 participants and the replication cohort 432 participants. After the prespecified conditioning in the discovery cohort, we identified an association between a novel single nucleotide polymorphism in the CYP2C cluster on chromosome 10 (rs12777823) and warfarin dose requirement that reached genome-wide significance (p=1·51×10⁻⁸). This association was confirmed in the replication cohort (p=5·04×10⁻⁵); analysis of the two cohorts together produced a p value of 4·5×10⁻¹². Individuals heterozygous for the rs12777823 A allele need a dose reduction of 6·92 mg/week and those homozygous 9·34 mg/week. Regression analysis showed that the inclusion of rs12777823 significantly improves warfarin dose variability explained by the IWPC dosing algorithm (21% relative improvement). INTERPRETATION: A novel CYP2C single nucleotide polymorphism exerts a clinically relevant effect on warfarin dose in African Americans, independent of CYP2C9*2 and CYP2C9*3. Incorporation of this variant into pharmacogenetic dosing algorithms could improve warfarin dose prediction in this population. FUNDING: National Institutes of Health, American Heart Association, Howard Hughes Medical Institute, Wisconsin Network for Health Research, and the Wellcome Trust.

    View details for DOI 10.1016/S0140-6736(13)60681-9

    View details for Web of Science ID 000324239200029

  • PharmGKB summary: diuretics pathway, pharmacodynamics PHARMACOGENETICS AND GENOMICS Thorn, C. F., Ellison, D. H., Turner, S. T., Altman, R. B., Klein, T. E. 2013; 23 (8): 449-453

    View details for DOI 10.1097/FPC.0b013e3283636822

    View details for Web of Science ID 000323226500009

    View details for PubMedID 23788015

  • Challenges in the Pharmacogenomic Annotation of Whole Genomes CLINICAL PHARMACOLOGY & THERAPEUTICS Altman, R. B., Whirl-Carrillo, M., Klein, T. E. 2013; 94 (2): 211-213

    View details for DOI 10.1038/clpt.2013.111

    View details for Web of Science ID 000322064400019

    View details for PubMedID 23708745

  • K-Means for Parallel Architectures Using All-Prefix-Sum Sorting and Updating Steps IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS Kohlhoff, K. J., Pande, V. S., Altman, R. B. 2013; 24 (8): 1602-1612
  • Collective judgment predicts disease-associated single nucleotide variants BMC GENOMICS Capriotti, E., Altman, R. B., Bromberg, Y. 2013; 14


    In recent years the number of human genetic variants deposited into the publicly available databases has been increasing exponentially. The latest version of dbSNP, for example, contains ~50 million validated Single Nucleotide Variants (SNVs). SNVs make up most of human variation and are often the primary causes of disease. The non-synonymous SNVs (nsSNVs) result in single amino acid substitutions and may affect protein function, often causing disease. Although several methods for the detection of nsSNV effects have already been developed, the consistent increase in annotated data is offering the opportunity to improve prediction accuracy. Here we present a new approach for the detection of disease-associated nsSNVs (Meta-SNP) that integrates four existing methods: PANTHER, PhD-SNP, SIFT and SNAP. We first tested the accuracy of each method using a dataset of 35,766 disease-annotated mutations from 8,667 proteins extracted from the SwissVar database. The four methods reached overall accuracies of 64%-76% with a Matthews correlation coefficient (MCC) of 0.38-0.53. We then used the outputs of these methods to develop a machine learning based approach that discriminates between disease-associated and polymorphic variants (Meta-SNP). In testing, the combined method reached 79% overall accuracy and 0.59 MCC, ~3% higher accuracy and ~0.05 higher correlation with respect to the best-performing method. Moreover, for the hardest-to-define subset of nsSNVs, i.e. variants for which half of the predictors disagreed with the other half, Meta-SNP attained 8% higher accuracy than the best predictor. Here we find that the Meta-SNP algorithm achieves better performance than the best single predictor. This result suggests that the methods used for the prediction of variant-disease associations are orthogonal, encoding different biologically relevant relationships. Careful combination of predictions from various resources is therefore a good strategy for the selection of high reliability predictions. Indeed, for the subset of nsSNVs where all predictors were in agreement (46% of all nsSNVs in the set), our method reached 87% overall accuracy and 0.73 MCC. Meta-SNP server is freely accessible at

    View details for DOI 10.1186/1471-2164-14-S3-S2

    View details for Web of Science ID 000319869500002

    View details for PubMedID 23819846

  • WS-SNPs&GO: a web server for predicting the deleterious effect of human protein variants using functional annotation BMC GENOMICS Capriotti, E., Calabrese, R., Fariselli, P., Martelli, P. L., Altman, R. B., Casadio, R. 2013; 14


    SNPs&GO is a method for the prediction of deleterious Single Amino acid Polymorphisms (SAPs) using protein functional annotation. In this work, we present the web server implementation of SNPs&GO (WS-SNPs&GO). The server is based on Support Vector Machines (SVM) and, for a given protein, its input comprises the sequence and/or its three-dimensional structure (when available), a set of target variations, and its functional Gene Ontology (GO) terms. The output of the server provides, for each protein variation, the probability of association with human disease. The server consists of two main components, including updated versions of the sequence-based SNPs&GO (recently scored as one of the best algorithms for predicting deleterious SAPs) and of the structure-based SNPs&GO(3d) programs. Sequence and structure based algorithms are extensively tested on a large set of annotated variations extracted from the SwissVar database. Selecting a balanced dataset with more than 38,000 SAPs, the sequence-based approach achieves 81% overall accuracy, 0.61 correlation coefficient and an Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve of 0.88. For the subset of ~6,600 variations mapped on protein structures available at the Protein Data Bank (PDB), the structure-based method scores with 84% overall accuracy, 0.68 correlation coefficient, and 0.91 AUC. When tested on a new blind set of variations, the results of the server are 79% and 83% overall accuracy for the sequence-based and structure-based inputs, respectively. WS-SNPs&GO is a valuable tool that includes in a unique framework information derived from protein sequence, structure, evolutionary profile, and protein function. WS-SNPs&GO is freely available at

    View details for DOI 10.1186/1471-2164-14-S3-S6

    View details for Web of Science ID 000319869500006

    View details for PubMedID 23819482

  • Web-scale pharmacovigilance: listening to signals from the crowd JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION White, R. W., Tatonetti, N. P., Shah, N. H., Altman, R. B., Horvitz, E. 2013; 20 (3): 404-408


    Adverse drug events cause substantial morbidity and mortality and are often discovered after a drug comes to market. We hypothesized that Internet users may provide early clues about adverse drug events via their online information-seeking. We conducted a large-scale study of Web search log data gathered during 2010. We pay particular attention to the specific drug pairing of paroxetine and pravastatin, whose interaction was reported to cause hyperglycemia after the time period of the online logs used in the analysis. We also examine sets of drug pairs known to be associated with hyperglycemia and those not associated with hyperglycemia. We find that anonymized signals on drug interactions can be mined from search logs. Compared to analyses of other sources such as electronic health records (EHR), logs are inexpensive to collect and mine. The results demonstrate that logs of the search activities of populations of computer users can contribute to drug safety surveillance.

    View details for DOI 10.1136/amiajnl-2012-001482

    View details for Web of Science ID 000317477500001

    View details for PubMedID 23467469

  • Valproic acid pathway: pharmacokinetics and pharmacodynamics PHARMACOGENETICS AND GENOMICS Ghodke-Puranik, Y., Thorn, C. F., Lamba, J. K., Leeder, J. S., Song, W., Birnbaum, A. K., Altman, R. B., Klein, T. E. 2013; 23 (4): 236-241

    View details for DOI 10.1097/FPC.0b013e32835ea0b2

    View details for Web of Science ID 000316109700008

    View details for PubMedID 23407051

  • Informatics confronts drug-drug interactions TRENDS IN PHARMACOLOGICAL SCIENCES Percha, B., Altman, R. B. 2013; 34 (3): 178-184


    Drug-drug interactions (DDIs) are an emerging threat to public health. Recent estimates indicate that DDIs cause nearly 74000 emergency room visits and 195000 hospitalizations each year in the USA. Current approaches to DDI discovery, which include Phase IV clinical trials and post-marketing surveillance, are insufficient for detecting many DDIs and do not alert the public to potentially dangerous DDIs before a drug enters the market. Recent work has applied state-of-the-art computational and statistical methods to the problem of DDIs. Here we review recent developments that encompass a range of informatics approaches in this domain, from the construction of databases for efficient searching of known DDIs to the prediction of novel DDIs based on data from electronic medical records, adverse event reports, scientific abstracts, and other sources. We also explore why DDIs are so difficult to detect and what the future holds for informatics-based approaches to DDI discovery.

    View details for DOI 10.1016/

    View details for Web of Science ID 000316833900008

    View details for PubMedID 23414686

  • Personal genomic measurements: the opportunity for information integration. Clinical pharmacology & therapeutics Altman, R. B. 2013; 93 (1): 21-23


    High-throughput genomic measurements initially emerged for research purposes but are now entering the clinic. The challenge for clinicians is to integrate imperfect genomic measurements with other information sources so as to estimate as closely as possible the probabilities of clinical events (diagnoses, treatment responses, prognoses). Population-based data provide a priori probabilities that can be combined with individual measurements to compute a posteriori estimates using Bayes' rule. Thus, the integration of population science with individual genomic measurements will enable the practice of personalized medicine.

    View details for DOI 10.1038/clpt.2012.203

    View details for PubMedID 23241835

  • Inferring the semantic relationships of words within an ontology using random indexing: applications to pharmacogenomics. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Percha, B., Altman, R. B. 2013; 2013: 1123-1132


    The biomedical literature presents a uniquely challenging text mining problem. Sentences are long and complex, the subject matter is highly specialized with a distinct vocabulary, and producing annotated training data for this domain is time consuming and expensive. In this environment, unsupervised text mining methods that do not rely on annotated training data are valuable. Here we investigate the use of random indexing, an automated method for producing vector-space semantic representations of words from large, unlabeled corpora, to address the problem of term normalization in sentences describing drugs and genes. We show that random indexing produces similarity scores that capture some of the structure of PHARE, a manually curated ontology of pharmacogenomics concepts. We further show that random indexing can be used to identify likely word candidates for inclusion in the ontology, and can help localize these new labels among classes and roles within the ontology.

    View details for PubMedID 24551397

  • Proceedings of Pacific Symposium on Biocomputing 2011. edited by Altman, R., Dunker, K., Hunter, L. 2013
  • PharmGKB: the Pharmacogenomics Knowledge Base. Methods in molecular biology (Clifton, N.J.) Thorn, C. F., Klein, T. E., Altman, R. B. 2013; 1015: 311-320


    The Pharmacogenomics Knowledge Base, PharmGKB, is an interactive tool for researchers investigating how genetic variation affects drug response. The PharmGKB Web site displays genotype, molecular, and clinical knowledge integrated into pathway representations and Very Important Pharmacogene (VIP) summaries with links to additional external resources. Users can search and browse the knowledgebase by genes, variants, drugs, diseases, and pathways. Registration is free to the entire research community, but subject to agreement to use for research purposes only and not to redistribute. Registered users can access and download data to aid in the design of future pharmacogenetics and pharmacogenomics studies.

    View details for DOI 10.1007/978-1-62703-435-7_20

    View details for PubMedID 23824865

  • Pathway analysis of genome-wide data improves warfarin dose prediction. BMC genomics Daneshjou, R., Tatonetti, N. P., Karczewski, K. J., Sagreiya, H., Bourgeois, S., Drozda, K., Burmester, J. K., Tsunoda, T., Nakamura, Y., Kubo, M., Tector, M., Limdi, N. A., Cavallari, L. H., Perera, M., Johnson, J. A., Klein, T. E., Altman, R. B. 2013; 14: S11


    Many genome-wide association studies focus on associating single loci with target phenotypes. However, in the setting of rare variation, accumulating sufficient samples to assess these associations can be difficult. Moreover, multiple variations in a gene or a set of genes within a pathway may all contribute to the phenotype, suggesting that the aggregation of variations found over the gene or pathway may be useful for improving the power to detect associations. Here, we present a method for aggregating single nucleotide polymorphisms (SNPs) along biologically relevant pathways in order to seek genetic associations with phenotypes. Our method uses all available genetic variants and does not remove those in linkage disequilibrium (LD). Instead, it uses a novel SNP weighting scheme to down-weight the contributions of correlated SNPs. We apply our method to three cohorts of patients taking warfarin: two European descent cohorts and an African American cohort. Although the clinical covariates and key pharmacogenetic loci for warfarin have been characterized, our association metric identifies a significant association with mutations distributed throughout the pathway of warfarin metabolism. We improve dose prediction after using all known clinical covariates and pharmacogenetic variants in VKORC1 and CYP2C9. In particular, we find that at least 1% of the missing heritability in warfarin dose may be due to the aggregated effects of variations in the warfarin metabolic pathway, even though the SNPs do not individually show a significant association. Our method allows researchers to study aggregative SNP effects in an unbiased manner by not preselecting SNPs. It retains all the available information by accounting for LD-structure through weighting, which eliminates the need for LD pruning.

    View details for DOI 10.1186/1471-2164-14-S3-S11

    View details for PubMedID 23819817

  • Introduction to Translational Bioinformatics Collection PLOS COMPUTATIONAL BIOLOGY Altman, R. B. 2012; 8 (12)

    View details for DOI 10.1371/journal.pcbi.1002796

    View details for Web of Science ID 000312901500006

    View details for PubMedID 23300404

  • Chapter 7: Pharmacogenomics PLOS COMPUTATIONAL BIOLOGY Karczewski, K. J., Daneshjou, R., Altman, R. B. 2012; 8 (12)


    There is great variation in drug-response phenotypes, and a "one size fits all" paradigm for drug delivery is flawed. Pharmacogenomics is the study of how human genetic information impacts drug response, and it aims to improve efficacy and reduce side effects. In this article, we provide an overview of pharmacogenetics, including pharmacokinetics (PK), pharmacodynamics (PD), gene and pathway interactions, and off-target effects. We describe methods for discovering genetic factors in drug response, including genome-wide association studies (GWAS), expression analysis, and other methods such as chemoinformatics and natural language processing (NLP). We cover the practical applications of pharmacogenomics both in the pharmaceutical industry and in a clinical setting. In drug discovery, pharmacogenomics can be used to aid lead identification, anticipate adverse events, and assist in drug repurposing efforts. Moreover, pharmacogenomic discoveries show promise as important elements of physician decision support. Finally, we consider the ethical, regulatory, and reimbursement challenges that remain for the clinical implementation of pharmacogenomics.

    View details for DOI 10.1371/journal.pcbi.1002817

    View details for Web of Science ID 000312901500023

    View details for PubMedID 23300409

  • PharmGKB summary: zidovudine pathway PHARMACOGENETICS AND GENOMICS Ghodke, Y., Anderson, P. L., Sangkuhl, K., Lamba, J., Altman, R. B., Klein, T. E. 2012; 22 (12): 891-894

    View details for DOI 10.1097/FPC.0b013e32835879a8

    View details for Web of Science ID 000311031800008

    View details for PubMedID 22960662

  • Impact of the CYP4F2 p.V433M Polymorphism on Coumarin Dose Requirement: Systematic Review and Meta-Analysis CLINICAL PHARMACOLOGY & THERAPEUTICS Danese, E., Montagnana, M., Johnson, J. A., Rettie, A. E., Zambon, C. F., Lubitz, S. A., Suarez-Kurtz, G., Cavallari, L. H., Zhao, L., Huang, M., Nakamura, Y., Mushiroda, T., Kringen, M. K., Borgiani, P., Ciccacci, C., Au, N. T., Langaee, T., Siguret, V., Loriot, M. A., Sagreiya, H., Altman, R. B., Shahin, M. H., Scott, S. A., Khalifa, S. I., Chowbay, B., Suriapranata, I. M., Teichert, M., Stricker, B. H., Taljaard, M., Botton, M. R., Zhang, J. E., Pirmohamed, M., Zhang, X., Carlquist, J. F., Horne, B. D., Lee, M. T., Pengo, V., Guidi, G. C., Minuz, P., Fava, C. 2012; 92 (6): 746-756


    A systematic review and a meta-analysis were performed to quantify the accumulated information from genetic association studies investigating the impact of the CYP4F2 rs2108622 (p.V433M) polymorphism on coumarin dose requirement. An additional aim was to explore the contribution of the CYP4F2 variant in comparison with, as well as after stratification for, the VKORC1 and CYP2C9 variants. Thirty studies involving 9,470 participants met prespecified inclusion criteria. Compared with CC homozygotes, T-allele carriers required an 8.3% (95% confidence interval (CI): 5.6-11.1%; P < 0.0001) higher mean daily coumarin dose to reach a stable international normalized ratio (INR). There was no evidence of publication bias. Heterogeneity among studies was present (I² = 43%). Our results show that the CYP4F2 p.V433M polymorphism is associated with interindividual variability in response to coumarin drugs, but with a low effect size that is confirmed to be lower than those contributed by VKORC1 and CYP2C9 polymorphisms.

    View details for DOI 10.1038/clpt.2012.184

    View details for Web of Science ID 000311283400016

    View details for PubMedID 23132553

  • Metformin pathways: pharmacokinetics and pharmacodynamics PHARMACOGENETICS AND GENOMICS Gong, L., Goswami, S., Giacomini, K. M., Altman, R. B., Klein, T. E. 2012; 22 (11): 820-827

    View details for DOI 10.1097/FPC.0b013e3283559b22

    View details for Web of Science ID 000309977100008

    View details for PubMedID 22722338

  • Very important pharmacogene summary for VDR PHARMACOGENETICS AND GENOMICS Poon, A. H., Gong, L., Brasch-Andersen, C., Litonjua, A. A., Raby, B. A., Hamid, Q., Laprise, C., Weiss, S. T., Altman, R. B., Klein, T. E. 2012; 22 (10): 758-763

    View details for DOI 10.1097/FPC.0b013e328354455c

    View details for Web of Science ID 000309115000007

    View details for PubMedID 22588316

  • Implementing Personalized Medicine: Development of a Cost-Effective Customized Pharmacogenetics Genotyping Array CLINICAL PHARMACOLOGY & THERAPEUTICS Johnson, J. A., Burkley, B. M., Langaee, T. Y., Clare-Salzler, M. J., Klein, T. E., Altman, R. B. 2012; 92 (4): 437-439


    Although there is increasing evidence to support the implementation of pharmacogenetics in certain clinical scenarios, the adoption of this approach has been limited. The advent of preemptive and inexpensive testing of critical pharmacogenetic variants may overcome barriers to adoption. We describe the design of a customized array built for the personalized-medicine programs of the University of Florida and Stanford University. We selected key variants for the array using the clinical annotations of the Pharmacogenomics Knowledgebase (PharmGKB), and we included variants in drug metabolism and transporter genes along with other pharmacogenetically important variants.

    View details for DOI 10.1038/clpt.2012.125

    View details for Web of Science ID 000309017000017

    View details for PubMedID 22910441

  • Pharmacogenomics Knowledge for Personalized Medicine CLINICAL PHARMACOLOGY & THERAPEUTICS Whirl-Carrillo, M., McDonagh, E. M., Hebert, J. M., Gong, L., Sangkuhl, K., Thorn, C. F., Altman, R. B., Klein, T. E. 2012; 92 (4): 414-417


    The Pharmacogenomics Knowledgebase (PharmGKB) is a resource that collects, curates, and disseminates information about the impact of human genetic variation on drug responses. It provides clinically relevant information, including dosing guidelines, annotated drug labels, and potentially actionable gene-drug associations and genotype-phenotype relationships. Curators assign levels of evidence to variant-drug associations using well-defined criteria based on careful literature review. Thus, PharmGKB is a useful source of high-quality information supporting personalized medicine-implementation projects.

    View details for DOI 10.1038/clpt.2012.96

    View details for Web of Science ID 000309017000009

    View details for PubMedID 22992668

  • The state of the art in text mining and natural language processing for pharmacogenomics JOURNAL OF BIOMEDICAL INFORMATICS Coulet, A., Cohen, K. B., Altman, R. B. 2012; 45 (5): 825-826
  • PharmGKB summary: very important pharmacogene information for cytochrome P-450, family 2, subfamily A, polypeptide 6 PHARMACOGENETICS AND GENOMICS McDonagh, E. M., Wassenaar, C., David, S. P., Tyndale, R. F., Altman, R. B., Whirl-Carrillo, M., Klein, T. E. 2012; 22 (9): 695-708

    View details for DOI 10.1097/FPC.0b013e3283540217

    View details for Web of Science ID 000307652600006

    View details for PubMedID 22547082

  • PharmGKB summary: very important pharmacogene information for GSTT1 PHARMACOGENETICS AND GENOMICS Thorn, C. F., Ji, Y., Weinshilboum, R. M., Altman, R. B., Klein, T. E. 2012; 22 (8): 646-651

    View details for DOI 10.1097/FPC.0b013e3283527c02

    View details for Web of Science ID 000306483500009

    View details for PubMedID 22643671

  • Bioinformatics and variability in drug response: a protein structural perspective JOURNAL OF THE ROYAL SOCIETY INTERFACE Lahti, J. L., Tang, G. W., Capriotti, E., Liu, T., Altman, R. B. 2012; 9 (72): 1409-1437


    Marketed drugs frequently perform worse in clinical practice than in the clinical trials on which their approval is based. Many therapeutic compounds are ineffective for a large subpopulation of patients to whom they are prescribed; worse, a significant fraction of patients experience adverse effects more severe than anticipated. The unacceptable risk-benefit profile for many drugs mandates a paradigm shift towards personalized medicine. However, prior to adoption of patient-specific approaches, it is useful to understand the molecular details underlying variable drug response among diverse patient populations. Over the past decade, progress in structural genomics led to an explosion of available three-dimensional structures of drug target proteins while efforts in pharmacogenetics offered insights into polymorphisms correlated with differential therapeutic outcomes. Together these advances provide the opportunity to examine how altered protein structures arising from genetic differences affect protein-drug interactions and, ultimately, drug response. In this review, we first summarize structural characteristics of protein targets and common mechanisms of drug interactions. Next, we describe the impact of coding mutations on protein structures and drug response. Finally, we highlight tools for analysing protein structures and protein-drug interactions and discuss their application for understanding altered drug responses associated with protein structural variants.

    View details for DOI 10.1098/rsif.2011.0843

    View details for Web of Science ID 000304437400001

    View details for PubMedID 22552919

  • PharmGKB summary: very important pharmacogene information for CYP3A5 PHARMACOGENETICS AND GENOMICS Lamba, J., Hebert, J. M., Schuetz, E. G., Klein, T. E., Altman, R. B. 2012; 22 (7): 555-558

    View details for DOI 10.1097/FPC.0b013e328351d47f

    View details for Web of Science ID 000305429900009

    View details for PubMedID 22407409

  • Editorial: Current progress in Bioinformatics 2012 BRIEFINGS IN BIOINFORMATICS Altman, R. B. 2012; 13 (4): 393-394

    View details for DOI 10.1093/bib/bbs042

    View details for Web of Science ID 000306925000001

    View details for PubMedID 22833494

  • PharmGKB summary: phenytoin pathway PHARMACOGENETICS AND GENOMICS Thorn, C. F., Whirl-Carrillo, M., Leeder, J. S., Klein, T. E., Altman, R. B. 2012; 22 (6): 466-470

    View details for DOI 10.1097/FPC.0b013e32834aeedb

    View details for Web of Science ID 000303769700007

    View details for PubMedID 22569204

  • PharmGKB summary: caffeine pathway PHARMACOGENETICS AND GENOMICS Thorn, C. F., Aklillu, E., McDonagh, E. M., Klein, T. E., Altman, R. B. 2012; 22 (5): 389-395

    View details for DOI 10.1097/FPC.0b013e3283505d5e

    View details for Web of Science ID 000302783800008

    View details for PubMedID 22293536

  • Using ODIN for a PharmGKB revalidation experiment DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION Rinaldi, F., Clematide, S., Garten, Y., Whirl-Carrillo, M., Gong, L., Hebert, J. M., Sangkuhl, K., Thorn, C. F., Klein, T. E., Altman, R. B. 2012


    The need for efficient text-mining tools that support curation of the biomedical literature is ever increasing. In this article, we describe an experiment aimed at verifying whether a text-mining tool capable of extracting meaningful relationships among domain entities can be successfully integrated into the curation workflow of a major biological database. We evaluate in particular (i) the usability of the system's interface, as perceived by users, and (ii) the correlation of the ranking of interactions, as provided by the text-mining system, with the choices of the curators.

    View details for DOI 10.1093/database/bas021

    View details for Web of Science ID 000304924100001

    View details for PubMedID 22529178

  • Celecoxib pathways: pharmacokinetics and pharmacodynamics PHARMACOGENETICS AND GENOMICS Gong, L., Thorn, C. F., Bertagnolli, M. M., Grosser, T., Altman, R. B., Klein, T. E. 2012; 22 (4): 310-318

    View details for DOI 10.1097/FPC.0b013e32834f94cb

    View details for Web of Science ID 000301537400010

    View details for PubMedID 22336956

  • Personal Omics Profiling Reveals Dynamic Molecular and Medical Phenotypes CELL Chen, R., Mias, G. I., Li-Pook-Than, J., Jiang, L., Lam, H. Y., Chen, R., Miriami, E., Karczewski, K. J., Hariharan, M., Dewey, F. E., Cheng, Y., Clark, M. J., Im, H., Habegger, L., Balasubramanian, S., O'Huallachain, M., Dudley, J. T., Hillenmeyer, S., Haraksingh, R., Sharon, D., Euskirchen, G., Lacroute, P., Bettinger, K., Boyle, A. P., Kasowski, M., Grubert, F., Seki, S., Garcia, M., Whirl-Carrillo, M., Gallardo, M., Blasco, M. A., Greenberg, P. L., Snyder, P., Klein, T. E., Altman, R. B., Butte, A. J., Ashley, E. A., Gerstein, M., Nadeau, K. C., Tang, H., Snyder, M. 2012; 148 (6): 1293-1307


    Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here, we present an integrative personal omics profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14 month period. Our iPOP analysis revealed various medical risks, including type 2 diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity.

    View details for DOI 10.1016/j.cell.2012.02.009

    View details for Web of Science ID 000301889500023

    View details for PubMedID 22424236

  • PharmGKB summary: very important pharmacogene information for G6PD PHARMACOGENETICS AND GENOMICS McDonagh, E. M., Thorn, C. F., Bautista, J. M., Youngster, I., Altman, R. B., Klein, T. E. 2012; 22 (3): 219-228

    View details for DOI 10.1097/FPC.0b013e32834eb313

    View details for Web of Science ID 000300409800008

    View details for PubMedID 22237549

  • Simbios: an NIH national center for physics-based simulation of biological structures JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Delp, S. L., Ku, J. P., Pande, V. S., Sherman, M. A., Altman, R. B. 2012; 19 (2): 186-189


    Physics-based simulation provides a powerful framework for understanding biological form and function. Simulations can be used by biologists to study macromolecular assemblies and by clinicians to design treatments for diseases. Simulations help biomedical researchers understand the physical constraints on biological systems as they engineer novel drugs, synthetic tissues, medical devices, and surgical interventions. Although individual biomedical investigators make outstanding contributions to physics-based simulation, the field has been fragmented. Applications are typically limited to a single physical scale, and individual investigators usually must create their own software. These conditions created a major barrier to advancing simulation capabilities. In 2004, we established a National Center for Physics-Based Simulation of Biological Structures (Simbios) to help integrate the field and accelerate biomedical research. In 6 years, Simbios has become a vibrant national center, with collaborators in 16 states and eight countries. Simbios focuses on problems at both the molecular scale and the organismal level, with a long-term goal of uniting these in accurate multiscale simulations.

    View details for DOI 10.1136/amiajnl-2011-000488

    View details for Web of Science ID 000300768100009

    View details for PubMedID 22081222

  • PharmGKB summary: very important pharmacogene information for cytochrome P450, family 2, subfamily C, polypeptide 19 PHARMACOGENETICS AND GENOMICS Scott, S. A., Sangkuhl, K., Shuldiner, A. R., Hulot, J., Thorn, C. F., Altman, R. B., Klein, T. E. 2012; 22 (2): 159-165

    View details for DOI 10.1097/FPC.0b013e32834d4962

    View details for Web of Science ID 000299310600008

    View details for PubMedID 22027650

  • PharmGKB summary: very important pharmacogene information for CYP1A2 PHARMACOGENETICS AND GENOMICS Thorn, C. F., Aklillu, E., Klein, T. E., Altman, R. B. 2012; 22 (1): 73-77

    View details for DOI 10.1097/FPC.0b013e32834c6efd

    View details for Web of Science ID 000298249500009

    View details for PubMedID 21989077

  • Interpretome: a freely available, modular, and secure personal genome interpretation engine. Karczewski, K. J., Tirrell, R. P., Cordero, P., Tatonetti, N. P., Dudley, J. T., Salari, K., Altman, R. B. 2012
  • Discovery and explanation of drug-drug interactions via text mining. Percha, B., Garten, Y., Altman, R. B. 2012
  • Chapter 7: Pharmacogenomics. PLoS Comput Biol., PMCID: PMC3531317. Karczewski, K. J., Daneshjou, R., Altman, R. B. 2012; 8 (12): e1002817
  • Mice lacking the beta 2 adrenergic receptor have a unique genetic profile before and after focal brain ischaemia ASN NEURO White, R. E., Palm, C., Xu, L., Ling, E., Ginsburg, M., Daigle, B. J., Han, R., Patterson, A., Altman, R. B., Giffard, R. G. 2012; 4 (5): 343-356

    View details for DOI 10.1042/AN20110020

    View details for Web of Science ID 000308887200005

  • Interpretome: a freely available, modular, and secure personal genome interpretation engine. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Karczewski, K. J., Tirrell, R. P., Cordero, P., Tatonetti, N. P., Dudley, J. T., Salari, K., Snyder, M., Altman, R. B., Kim, S. K. 2012: 339-350


    The decreasing cost of genotyping and genome sequencing has ushered in an era of genomic personalized medicine. More than 100,000 individuals have been genotyped by direct-to-consumer genetic testing services, which offer a glimpse into the interpretation and exploration of a personal genome. However, these interpretations, which require extensive manual curation, are subject to the preferences of the company and are not customizable by the individual. Academic institutions teaching personalized medicine, as well as genetic hobbyists, may prefer to customize their analysis and have full control over the content and method of interpretation. We present the Interpretome, a system for private genome interpretation, which contains all genotype information in client-side interpretation scripts, supported by server-side databases. We provide state-of-the-art analyses for teaching clinical implications of personal genomics, including disease risk assessment and pharmacogenomics. Additionally, we have implemented client-side algorithms for ancestry inference, demonstrating the power of these methods without excessive computation. Finally, the modular nature of the system allows for plugin capabilities for custom analyses. This system will allow for personal genome exploration without compromising privacy, facilitating hands-on courses in genomics and personalized medicine.

    View details for PubMedID 22174289

  • A novel signal detection algorithm for identifying hidden drug-drug interactions in adverse event reports JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Tatonetti, N. P., Fernald, G. H., Altman, R. B. 2012; 19 (1): 79-85


    Adverse drug events (ADEs) are common and account for 770,000 injuries and deaths each year, and drug interactions account for as much as 30% of these ADEs. Spontaneous reporting systems routinely collect ADEs from patients on complex combinations of medications and provide an opportunity to discover unexpected drug interactions. Unfortunately, current algorithms for such "signal detection" are limited by underreporting of interactions that are not expected. We present a novel method to identify latent drug interaction signals in the case of underreporting. We identified eight clinically significant adverse events. We used the FDA's Adverse Event Reporting System to build profiles for these adverse events based on the side effects of drugs known to produce them. We then looked for pairs of drugs that match these single-drug profiles in order to predict potential interactions. We evaluated these interactions in two independent data sets and also through a retrospective analysis of the Stanford Hospital electronic medical records. We identified 171 novel drug interactions (for eight adverse event categories) that are significantly enriched for known drug interactions (p=0.0009) and used the electronic medical record for independently testing drug interaction hypotheses using multivariate statistical models with covariates. Our method provides an option for detecting hidden interactions in spontaneous reporting systems by using side effect profiles to infer the presence of unreported adverse events.

    View details for DOI 10.1136/amiajnl-2011-000214

    View details for Web of Science ID 000298848100012

    View details for PubMedID 21676938

  • Discovery and explanation of drug-drug interactions via text mining. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Percha, B., Garten, Y., Altman, R. B. 2012: 410-421


    Drug-drug interactions (DDIs) can occur when two drugs interact with the same gene product. Most available information about gene-drug relationships is contained within the scientific literature, but is dispersed over a large number of publications, with thousands of new publications added each month. In this setting, automated text mining is an attractive solution for identifying gene-drug relationships and aggregating them to predict novel DDIs. In previous work, we have shown that gene-drug interactions can be extracted from Medline abstracts with high fidelity - we extract not only the genes and drugs, but also the type of relationship expressed in individual sentences (e.g. metabolize, inhibit, activate and many others). We normalize these relationships and map them to a standardized ontology. In this work, we hypothesize that we can combine these normalized gene-drug relationships, drawn from a very broad and diverse literature, to infer DDIs. Using a training set of established DDIs, we have trained a random forest classifier to score potential DDIs based on the features of the normalized assertions extracted from the literature that relate two drugs to a gene product. The classifier recognizes the combinations of relationships, drugs and genes that are most associated with the gold standard DDIs, correctly identifying 79.8% of assertions relating interacting drug pairs and 78.9% of assertions relating noninteracting drug pairs. Most significantly, because our text processing method captures the semantics of individual gene-drug relationships, we can construct mechanistic pharmacological explanations for the newly-proposed DDIs. We show how our classifier can be used to explain known DDIs and to uncover new DDIs that have not yet been reported.

    View details for PubMedID 22174296

  • From pharmacogenomic knowledge acquisition to clinical applications: the PharmGKB as a clinical pharmacogenomic biomarker resource BIOMARKERS IN MEDICINE McDonagh, E. M., Whirl-Carrillo, M., Garten, Y., Altman, R. B., Klein, T. E. 2011; 5 (6): 795-806


    The mission of the Pharmacogenomics Knowledge Base (PharmGKB) is to collect, encode and disseminate knowledge about the impact of human genetic variations on drug responses. It is an important worldwide resource of clinical pharmacogenomic biomarkers available to all. The PharmGKB website has evolved to highlight our knowledge curation and aggregation over our previous emphasis on collecting primary data. This review summarizes the methods we use to drive this expanded scope of 'Knowledge Acquisition to Clinical Applications', the new features available on our website and our future goals.

    View details for DOI 10.2217/BMM.11.94

    View details for Web of Science ID 000298488200009

    View details for PubMedID 22103613

  • Using Multiple Microenvironments to Find Similar Ligand-Binding Sites: Application to Kinase Inhibitor Binding PLOS COMPUTATIONAL BIOLOGY Liu, T., Altman, R. B. 2011; 7 (12)


    The recognition of cryptic small-molecular binding sites in protein structures is important for understanding off-target side effects and for recognizing potential new indications for existing drugs. Current methods focus on the geometry and detailed chemical interactions within putative binding pockets, but may not recognize distant similarities where dynamics or modified interactions allow one ligand to bind apparently divergent binding pockets. In this paper, we introduce an algorithm that seeks similar microenvironments within two binding sites, and assesses overall binding site similarity by the presence of multiple shared microenvironments. The method has relatively weak geometric requirements (to allow for conformational change or dynamics in both the ligand and the pocket) and uses multiple biophysical and biochemical measures to characterize the microenvironments (to allow for diverse modes of ligand binding). We term the algorithm PocketFEATURE, since it focuses on pockets using the FEATURE system for characterizing microenvironments. We validate PocketFEATURE first by showing that it can better discriminate sites that bind similar ligands from those that do not, and by showing that we can recognize FAD-binding sites on a proteome scale with Area Under the Curve (AUC) of 92%. We then apply PocketFEATURE to evolutionarily distant kinases, for which the method recognizes several proven distant relationships, and predicts unexpected shared ligand binding. Using experimental data from ChEMBL and Ambit, we show that at high significance level, 40 kinase pairs are predicted to share ligands. Some of these pairs offer new opportunities for inhibiting two proteins in a single pathway.

    View details for DOI 10.1371/journal.pcbi.1002326

    View details for Web of Science ID 000299167800043

    View details for PubMedID 22219723

  • PharmGKB summary: carbamazepine pathway PHARMACOGENETICS AND GENOMICS Thorn, C. F., Leckband, S. G., Kelsoe, J., Leeder, J. S., Mueller, D. J., Klein, T. E., Altman, R. B. 2011; 21 (12): 906-910

    View details for DOI 10.1097/FPC.0b013e328348c6f2

    View details for Web of Science ID 000296799900016

    View details for PubMedID 21738081

  • PharmGKB summary: citalopram pharmacokinetics pathway PHARMACOGENETICS AND GENOMICS Sangkuhl, K., Klein, T. E., Altman, R. B. 2011; 21 (11): 769-772

    View details for DOI 10.1097/FPC.0b013e328346063f

    View details for Web of Science ID 000296146400010

    View details for PubMedID 21546862

  • PharmGKB summary: methotrexate pathway PHARMACOGENETICS AND GENOMICS Mikkelsen, T. S., Thorn, C. F., Yang, J. J., Ulrich, C. M., French, D., Zaza, G., Dunnenberger, H. M., Marsh, S., McLeod, H. L., Giacomini, K., Becker, M. L., Gaedigk, R., Leeder, J. S., Kager, L., Relling, M. V., Evans, W., Klein, T. E., Altman, R. B. 2011; 21 (10): 679-686

    View details for DOI 10.1097/FPC.0b013e328343dd93

    View details for Web of Science ID 000294808900008

    View details for PubMedID 21317831

  • Clinical Pharmacogenetics Implementation Consortium Guidelines for CYP2C9 and VKORC1 Genotypes and Warfarin Dosing CLINICAL PHARMACOLOGY & THERAPEUTICS Johnson, J. A., Gong, L., Whirl-Carrillo, M., Gage, B. F., Scott, S. A., Stein, C. M., Anderson, J. L., Kimmel, S. E., Lee, M. T., Pirmohamed, M., Wadelius, M., Klein, T. E., Altman, R. B. 2011; 90 (4): 625-629


    Warfarin is a widely used anticoagulant with a narrow therapeutic index and large interpatient variability in the dose required to achieve target anticoagulation. Common genetic variants in the cytochrome P450-2C9 (CYP2C9) and vitamin K-epoxide reductase complex (VKORC1) enzymes, in addition to known nongenetic factors, account for ~50% of warfarin dose variability. The purpose of this article is to assist in the interpretation and use of CYP2C9 and VKORC1 genotype data for estimating therapeutic warfarin dose to achieve an INR of 2-3, should genotype results be available to the clinician. The Clinical Pharmacogenetics Implementation Consortium (CPIC) of the National Institutes of Health Pharmacogenomics Research Network develops peer-reviewed gene-drug guidelines that are published and updated periodically based on new developments in the field.

    View details for DOI 10.1038/clpt.2011.185

    View details for Web of Science ID 000295119200035

    View details for PubMedID 21900891

  • A new disease-specific machine learning approach for the prediction of cancer-causing missense variants GENOMICS Capriotti, E., Altman, R. B. 2011; 98 (4): 310-317


    High-throughput genotyping and sequencing techniques are rapidly and inexpensively providing large amounts of human genetic variation data. Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability and have been implicated in several human diseases, including cancer. Amino acid mutations resulting from non-synonymous SNPs in coding regions may generate protein functional changes that affect cell proliferation. In this study, we developed a machine learning approach to predict cancer-causing missense variants. We present a Support Vector Machine (SVM) classifier trained on a set of 3163 cancer-causing variants and an equal number of neutral polymorphisms. The method achieves 93% overall accuracy, a correlation coefficient of 0.86, and an area under the ROC curve of 0.98. When compared with other previously developed algorithms such as SIFT and CHASM, our method results in higher prediction accuracy and correlation coefficient in identifying cancer-causing variants.

    View details for DOI 10.1016/j.ygeno.2011.06.010

    View details for Web of Science ID 000295896300011

    View details for PubMedID 21763417

  • PharmGKB summary: very important pharmacogene information for PTGS2 PHARMACOGENETICS AND GENOMICS Thorn, C. F., Grosser, T., Klein, T. E., Altman, R. B. 2011; 21 (9): 607-613

    View details for DOI 10.1097/FPC.0b013e3283415515

    View details for Web of Science ID 000293731200012

    View details for PubMedID 21063235

  • Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence PLOS GENETICS Dewey, F. E., Chen, R., Cordero, S. P., Ormond, K. E., Caleshu, C., Karczewski, K. J., Whirl-Carrillo, M., Wheeler, M. T., Dudley, J. T., Byrnes, J. K., Cornejo, O. E., Knowles, J. W., Woon, M., Sangkuhl, K., Gong, L., Thorn, C. F., Hebert, J. M., Capriotti, E., David, S. P., Pavlovic, A., West, A., Thakuria, J. V., Ball, M. P., Zaranek, A. W., Rehm, H. L., Church, G. M., West, J. S., Bustamante, C. D., Snyder, M., Altman, R. B., Klein, T. E., Butte, A. J., Ashley, E. A. 2011; 7 (9)


    Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (< 1,000 base pairs). We use family inheritance state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.

    View details for DOI 10.1371/journal.pgen.1002280

    View details for Web of Science ID 000295419100031

    View details for PubMedID 21935354

  • Fast Flexible Modeling of RNA Structure Using Internal Coordinates IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Flores, S. C., Sherman, M. A., Bruns, C. M., Eastman, P., Altman, R. B. 2011; 8 (5): 1247-1257


    Modeling the structure and dynamics of large macromolecules remains a critical challenge. Molecular dynamics (MD) simulations are expensive because they model every atom independently, and are difficult to combine with experimentally derived knowledge. Assembly of molecules using fragments from libraries relies on the database of known structures and thus may not work for novel motifs. Coarse-grained modeling methods have yielded good results on large molecules but can suffer from difficulties in creating more detailed full atomic realizations. There is therefore a need for molecular modeling algorithms that remain chemically accurate and economical for large molecules, do not rely on fragment libraries, and can incorporate experimental information. RNABuilder works in the internal coordinate space of dihedral angles and thus has time requirements proportional to the number of moving parts rather than the number of atoms. It provides accurate physics-based response to applied forces, but also allows user-specified forces for incorporating experimental information. A particular strength of RNABuilder is that all Leontis-Westhof basepairs can be specified as primitives by the user to be satisfied during model construction. We apply RNABuilder to predict the structure of an RNA molecule with 160 bases from its secondary structure, as well as experimental information. Our model matches the known structure to 10.2 Angstroms RMSD and has low computational expense.

    View details for DOI 10.1109/TCBB.2010.104

    View details for Web of Science ID 000292681800008

    View details for PubMedID 21778523

  • CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms BIOINFORMATICS Kohlhoff, K. J., Sosnick, M. H., Hsu, W. T., Pande, V. S., Altman, R. B. 2011; 27 (16): 2322-2323


    Data clustering techniques are an essential component of a good data analysis toolbox. Many current bioinformatics applications are inherently compute-intense and work with very large datasets. Sequential algorithms are inadequate for providing the necessary performance. For this reason, we have created Clustering Algorithms for Massively Parallel Architectures, Including GPU Nodes (CAMPAIGN), a central resource for data clustering algorithms and tools that are implemented specifically for execution on massively parallel processing architectures. CAMPAIGN is a library of data clustering algorithms and tools, written in 'C for CUDA' for Nvidia GPUs. The library provides up to two orders of magnitude speed-up over respective CPU-based clustering algorithms and is intended as an open-source resource. New modules from the community will be accepted into the library and the layout of it is such that it can easily be extended to promising future platforms such as OpenCL. Releases of the CAMPAIGN library are freely available for download under the LGPL. Source code can also be obtained through anonymous subversion access.

    View details for DOI 10.1093/bioinformatics/btr386

    View details for Web of Science ID 000293620800028

    View details for PubMedID 21712246

  • Cooperative transcription factor associations discovered using regulatory variation PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Karczewski, K. J., Tatonetti, N. P., Landt, S. G., Yang, X., Slifer, T., Altman, R. B., Snyder, M. 2011; 108 (32): 13353-13358


    Regulation of gene expression at the transcriptional level is achieved by complex interactions of transcription factors operating at their target genes. Dissecting the specific combination of factors that bind each target is a significant challenge. Here, we describe in detail the Allele Binding Cooperativity test, which uses variation in transcription factor binding among individuals to discover combinations of factors and their targets. We developed the ALPHABIT (a large-scale process to hunt for allele binding interacting transcription factors) pipeline, which includes statistical analysis of binding sites followed by experimental validation, and demonstrate that this method predicts transcription factors that associate with NFκB. Our method successfully identifies factors that have been known to work with NFκB (E2A, STAT1, IRF2), but whose global coassociation and sites of cooperative action were not known. In addition, we identify a unique coassociation (EBF1) that had not been reported previously. We present a general approach for discovering combinatorial models of regulation and advance our understanding of the genetic basis of variation in transcription factor binding.

    View details for DOI 10.1073/pnas.1103105108

    View details for Web of Science ID 000293691400076

    View details for PubMedID 21828005

  • Platelet aggregation pathway PHARMACOGENETICS AND GENOMICS Sangkuhl, K., Shuldiner, A. R., Klein, T. E., Altman, R. B. 2011; 21 (8): 516-521

    View details for DOI 10.1097/FPC.0b013e3283406323

    View details for Web of Science ID 000292634200009

    View details for PubMedID 20938371

  • RNA molecules with conserved catalytic cores but variable peripheries fold along unique energetically optimized pathways RNA-A PUBLICATION OF THE RNA SOCIETY Mitra, S., Laederach, A., Golden, B. L., Altman, R. B., Brenowitz, M. 2011; 17 (8): 1589-1603


    Functional and kinetic constraints must be efficiently balanced during the folding process of all biopolymers. To understand how homologous RNA molecules with different global architectures fold into a common core structure we determined, under identical conditions, the folding mechanisms of three phylogenetically divergent group I intron ribozymes. These ribozymes share a conserved functional core defined by topologically equivalent tertiary motifs but differ in their primary sequence, size, and structural complexity. Time-resolved hydroxyl radical probing of the backbone solvent accessible surface and catalytic activity measurements integrated with structural-kinetic modeling reveal that each ribozyme adopts a unique strategy to attain the conserved functional fold. The folding rates are not dictated by the size or the overall structural complexity, but rather by the strength of the constituent tertiary motifs which, in turn, govern the structure, stability, and lifetime of the folding intermediates. A fundamental general principle of RNA folding emerges from this study: The dominant folding flux always proceeds through an optimally structured kinetic intermediate that has sufficient stability to act as a nucleating scaffold while retaining enough conformational freedom to avoid kinetic trapping. Our results also suggest a potential role of naturally selected peripheral A-minor interactions in balancing RNA structural stability with folding efficiency.

    View details for DOI 10.1261/rna.2694811

    View details for Web of Science ID 000292843000016

    View details for PubMedID 21712400

  • Improving the prediction of disease-related variants using protein three-dimensional structure Capriotti, E., Altman, R. B. BIOMED CENTRAL LTD. 2011


    Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability. Non-synonymous SNPs occurring in coding regions result in single amino acid polymorphisms (SAPs) that may affect protein function and lead to pathology. Several methods attempt to estimate the impact of SAPs using different sources of information. Although sequence-based predictors have shown good performance, the quality of these predictions can be further improved by introducing new features derived from three-dimensional protein structures. In this paper, we present a structure-based machine learning approach for predicting disease-related SAPs. We have trained a Support Vector Machine (SVM) on a set of 3,342 disease-related mutations and 1,644 neutral polymorphisms from 784 protein chains. We use SVM input features derived from the protein's sequence, structure, and function. After dataset balancing, the structure-based method (SVM-3D) reaches an overall accuracy of 85%, a correlation coefficient of 0.70, and an area under the receiver operating characteristic curve (AUC) of 0.92. When compared with a similar sequence-based predictor, SVM-3D results in an increase of the overall accuracy and AUC by 3%, and correlation coefficient by 0.06. The robustness of this improvement has been tested on different datasets and in all the cases SVM-3D performs better than previously developed methods, even when compared with PolyPhen2, which explicitly takes protein structure information as input. This work demonstrates that structural information can increase the accuracy of disease-related SAP identification. Our results also quantify the magnitude of improvement on a large dataset. This improvement is in agreement with previously observed results, where structure information enhanced the prediction of protein stability changes upon mutation. Although the limited structural information contained in the Protein Data Bank restricts the application and the performance of our structure-based method, we expect that SVM-3D will achieve higher accuracy when more structural data become available.

    View details for DOI 10.1186/1471-2105-12-S4-S3

    View details for Web of Science ID 000303930500003

    View details for PubMedID 21992054

  • Doxorubicin pathways: pharmacodynamics and adverse effects PHARMACOGENETICS AND GENOMICS Thorn, C. F., Oshiro, C., Marsh, S., Hernandez-Boussard, T., McLeod, H., Klein, T. E., Altman, R. B. 2011; 21 (7): 440-446

    View details for DOI 10.1097/FPC.0b013e32833ffb56

    View details for Web of Science ID 000291633300011

    View details for PubMedID 21048526

  • Bioinformatics challenges for personalized medicine BIOINFORMATICS Fernald, G. H., Capriotti, E., Daneshjou, R., Karczewski, K. J., Altman, R. B. 2011; 27 (13): 1741-1748


    Widespread availability of low-cost, full genome sequencing will introduce new challenges for bioinformatics. This review outlines recent developments in sequencing technologies and genome analysis methods for application in personalized medicine. New methods are needed in four areas to realize the potential of personalized medicine: (i) processing large-scale robust genomic data; (ii) interpreting the functional effect and the impact of genomic variation; (iii) integrating systems data to relate complex genetic interactions with phenotypes; and (iv) translating these discoveries into medical

    View details for DOI 10.1093/bioinformatics/btr295

    View details for Web of Science ID 000291752600050

    View details for PubMedID 21596790

  • Detecting Drug Interactions From Adverse-Event Reports: Interaction Between Paroxetine and Pravastatin Increases Blood Glucose Levels CLINICAL PHARMACOLOGY & THERAPEUTICS Tatonetti, N. P., Denny, J. C., Murphy, S. N., Fernald, G. H., Krishnan, G., Castro, V., Yue, P., Tsau, P. S., Kohane, I., Roden, D. M., Altman, R. B. 2011; 90 (1): 133-142


    The lipid-lowering agent pravastatin and the antidepressant paroxetine are among the most widely prescribed drugs in the world. Unexpected interactions between them could have important public health implications. We mined the US Food and Drug Administration's (FDA's) Adverse Event Reporting System (AERS) for side-effect profiles involving glucose homeostasis and found a surprisingly strong signal for comedication with pravastatin and paroxetine. We retrospectively evaluated changes in blood glucose in 104 patients with diabetes and 135 without diabetes who had received comedication with these two drugs, using data in electronic medical record (EMR) systems of three geographically distinct sites. We assessed the mean random blood glucose levels before and after treatment with the drugs. We found that pravastatin and paroxetine, when administered together, had a synergistic effect on blood glucose. The average increase was 19 mg/dl (1.0 mmol/l) overall, and in those with diabetes it was 48 mg/dl (2.7 mmol/l). In contrast, neither drug administered singly was associated with such changes in glucose levels. An increase in glucose levels is not a general effect of combined therapy with selective serotonin reuptake inhibitors (SSRIs) and statins.

    View details for DOI 10.1038/clpt.2011.83

    View details for Web of Science ID 000291853800023

    View details for PubMedID 21613990

  • 2010 Translational bioinformatics year in review JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Altman, R. B., Miller, K. S. 2011; 18 (4): 358-366


    A review of 2010 research in translational bioinformatics provides much to marvel at. We have seen notable advances in personal genomics, pharmacogenetics, and sequencing. At the same time, the infrastructure for the field has burgeoned. While acknowledging that, according to researchers, the members of this field tend to be overly optimistic, the authors predict a bright future.

    View details for DOI 10.1136/amiajnl-2011-000328

    View details for Web of Science ID 000292061700004

    View details for PubMedID 21672905

  • PharmGKB summary: dopamine receptor D2 PHARMACOGENETICS AND GENOMICS Mi, H., Thomas, P. D., Ring, H. Z., Jiang, R., Sangkuhl, K., Klein, T. E., Altman, R. B. 2011; 21 (6): 350-356

    View details for DOI 10.1097/FPC.0b013e32833ee605

    View details for Web of Science ID 000290431200007

    View details for PubMedID 20736885

  • PharmGKB summary: cytochrome P450, family 2, subfamily J, polypeptide 2: CYP2J2 PHARMACOGENETICS AND GENOMICS Berlin, D. S., Sangkuhl, K., Klein, T. E., Altman, R. B. 2011; 21 (5): 308-311

    View details for DOI 10.1097/FPC.0b013e32833d1011

    View details for Web of Science ID 000289460200009

    View details for PubMedID 20739908

  • Databases in the Area of Pharmacogenetics HUMAN MUTATION Sim, S. C., Altman, R. B., Ingelman-Sundberg, M. 2011; 32 (5): 526-531


    In the area of pharmacogenetics and personalized health care it is obvious that databases, providing important information of the occurrence and consequences of variant genes encoding drug metabolizing enzymes, drug transporters, drug targets, and other proteins of importance for drug response or toxicity, are of critical value for scientists, physicians, and industry. The primary outcome of the pharmacogenomic field is the identification of biomarkers that can predict drug toxicity and drug response, thereby individualizing and improving drug treatment of patients. The drug in question and the polymorphic gene exerting the impact are the main issues to be searched for in the databases. Here, we review the databases that provide useful information in this respect, of benefit for the development of the pharmacogenomic field.

    View details for DOI 10.1002/humu.21454

    View details for Web of Science ID 000289984100006

    View details for PubMedID 21309040

  • Remote Thioredoxin Recognition Using Evolutionary Conservation and Structural Dynamics STRUCTURE Tang, G. W., Altman, R. B. 2011; 19 (4): 461-470


    The thioredoxin family of oxidoreductases plays an important role in redox signaling and control of protein function. Not only are thioredoxins linked to a variety of disorders, but their stable structure has also seen application in protein engineering. Both sequence-based and structure-based tools exist for thioredoxin identification, but remote homolog detection remains a challenge. We developed a thioredoxin predictor using the approach of integrating sequence with structural information. We combined a sequence-based Hidden Markov Model (HMM) with a molecular dynamics enhanced structure-based recognition method (dynamic FEATURE, DF). This hybrid method (HMMDF) has high precision and recall (0.90 and 0.95, respectively) compared with HMM (0.92 and 0.87, respectively) and DF (0.82 and 0.97, respectively). Dynamic FEATURE is sensitive but struggles to resolve closely related protein families, while HMM identifies these evolutionary differences by compromising sensitivity. Our method applied to structural genomics targets makes a strong prediction of a novel thioredoxin.
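
One way to compare the three precision/recall pairs quoted in this abstract on a single scale is the F1 score (the harmonic mean of precision and recall). The abstract itself does not report F1; using it here is an assumption made for illustration, and the function and variable names below are hypothetical. A minimal sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs exactly as quoted in the abstract.
reported = {
    "HMMDF": (0.90, 0.95),
    "HMM": (0.92, 0.87),
    "DF": (0.82, 0.97),
}

# F1 is NOT reported in the abstract; it is computed here only as a
# convenient single-number summary of each pair.
f1_by_method = {name: f1_score(p, r) for name, (p, r) in reported.items()}
best_method = max(f1_by_method, key=f1_by_method.get)

for name, f1 in f1_by_method.items():
    print(f"{name}: F1 = {f1:.3f}")
print("Highest F1:", best_method)
```

Under this summary the hybrid method scores highest, which is consistent with the abstract's description of HMMDF as balancing the sensitivity of DF against the precision of HMM.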

    View details for DOI 10.1016/j.str.2011.02.007

    View details for Web of Science ID 000289592600005

    View details for PubMedID 21481770

  • PharmGKB summary: fluoropyrimidine pathways PHARMACOGENETICS AND GENOMICS Thorn, C. F., Marsh, S., Carrillo, M. W., McLeod, H. L., Klein, T. E., Altman, R. B. 2011; 21 (4): 237-242

    View details for DOI 10.1097/FPC.0b013e32833c6107

    View details for Web of Science ID 000288444500010

    View details for PubMedID 20601926

  • Very important pharmacogene summary: ABCB1 (MDR1, P-glycoprotein) PHARMACOGENETICS AND GENOMICS Hodges, L. M., Markova, S. M., Chinn, L. W., Gow, J. M., Kroetz, D. L., Klein, T. E., Altman, R. B. 2011; 21 (3): 152-161

    View details for DOI 10.1097/FPC.0b013e3283385a1c

    View details for Web of Science ID 000286971900007

    View details for PubMedID 20216335

  • Pharmacogenomics: "Noninferiority" Is Sufficient for Initial Implementation CLINICAL PHARMACOLOGY & THERAPEUTICS Altman, R. B. 2011; 89 (3): 348-350


    Recent clinical annotation of a whole-genome sequence suggests that pharmacogenomics (PGx) may be ready for clinical implementation now. This conclusion rests on the recognition that PGx has greatly mitigated risks as compared with using genomics for assessment of disease risk. Failure to recognize these differences can produce unrealistic cost-benefit scenarios and impractical standards of evidence. In many cases, pharmacogenetic tests need only reach reasonable expectations of noninferiority (compared with current prescribing practices) to merit use.

    View details for DOI 10.1038/clpt.2010.310

    View details for Web of Science ID 000287439600011

    View details for PubMedID 21326263

  • PharmGKB: very important pharmacogene - HMGCR PHARMACOGENETICS AND GENOMICS Medina, M. W., Sangkuhl, K., Klein, T. E., Altman, R. B. 2011; 21 (2): 98-101

    View details for DOI 10.1097/FPC.0b013e328336c81b

    View details for Web of Science ID 000286096000006

    View details for PubMedID 20084049

  • Pharmacogenomics: will the promise be fulfilled? NATURE REVIEWS GENETICS Altman, R. B., Kroemer, H. K., McCarty, C. A., Ratain, M. J., Roden, D. 2011; 12 (1): 69-73


    Tools such as genome resequencing and genome-wide association studies have recently been used to uncover a number of variants that affect drug toxicity and efficacy, as well as potential drug targets. But how much closer are we to incorporating pharmacogenomics into routine clinical practice? Five experts discuss how far we have come, and highlight the technological, informatics, educational and practical obstacles that stand in the way of realizing genome-driven medicine.

    View details for DOI 10.1038/nrg2920

    View details for Web of Science ID 000285410500012

    View details for PubMedID 21116304

  • Structural insights into pre-translocation ribosome motions. Pacific Symposium on Biocomputing Flores, S. C., Altman, R. 2011: 205-211


    Subsequent to the peptidyl transfer step of the translation elongation cycle, the initially formed pre-translocation ribosome, which we refer to here as R(1), undergoes a ratchet-like intersubunit rotation in order to sample a rotated conformation, referred to here as R(F), that is an obligatory intermediate in the translocation of tRNAs and mRNA through the ribosome during the translocation step of the translation elongation cycle. R(F) and the R(1) to R(F) transition are currently the subject of intense research, driven in part by the potential for developing novel antibiotics which trap R(F) or confound the R(1) to R(F) transition. Currently lacking a 3D atomic structure of the R(F) endpoint of the transition, as well as a preliminary conformational trajectory connecting R(1) and R(F), the dynamics of the mechanistically crucial R(1) to R(F) transition remain elusive. The current literature reports fitting of only a few ribosomal RNA (rRNA) and ribosomal protein (r-protein) components into cryogenic electron microscopy (cryo-EM) reconstructions of the Escherichia coli ribosome in R(F). In this work we now fit the entire Thermus thermophilus 16S and 23S rRNAs and most of the remaining T. thermophilus r-proteins into a cryo-EM reconstruction of the E. coli ribosome in R(F) in order to build an almost complete model of the T. thermophilus ribosome in R(F) thus allowing a more detailed view of this crucial conformation. The resulting model validates key predictions from the published literature; in particular it recovers intersubunit bridges known to be maintained throughout the R(1) to R(F) transition and results in new intersubunit bridges that are predicted to exist only in R(F). In addition, we use a recently reported E. coli ribosome structure, apparently trapped in an intermediate state along the R(1) to R(F) transition pathway, referred to here as R(2), as a guide to generate a T. thermophilus ribosome in the R(2) state. This demonstrates a multiresolution method for morphing large complexes and provides us with a structural model of R(2) in the species of interest. The generated structural models form the basis for probing the motion of the deacylated tRNA bound at the peptidyl-tRNA binding site (P site) of the pre-translocation ribosome as it moves from its so-called classical P/P configuration to its so-called hybrid P/E configuration as part of the R(1) to R(F) transition. We create a dynamic model of this process which provides structural insights into the functional significance of R(2) as well as detailed atomic information to guide the design of further experiments. The results suggest extensibility to other steps of protein synthesis as well as to spatially larger systems.

    View details for PubMedID 21121048

  • Structural insights into pre-translocation ribosome motions. Flores, S. C., Altman, R. B. 2011
  • Cooperative transcription factor associations discovered using regulatory variation. Karczewski, K. J., Tatonetti, N. P., Landt, S. G., Yang, X., Slifer, T., Altman, R. B. 2011
  • Improving the prediction of disease-related variants using protein three-dimensional structure. BMC Bioinformatics; 12 Suppl 4: S3. Epub 2011 Jul 5. PMCID: PMC3194195. Capriotti, E., Altman, R. B. 2011
  • Perspective: 2010 Translational bioinformatics year in review. JAMIA. PMCID: PMC3128418. Altman, R. B., Miller, K. S. 2011; 18 (4): 358-366
  • Integration and publication of heterogeneous text-mined relationships on the Semantic Web. Journal of biomedical semantics Coulet, A., Garten, Y., Dumontier, M., Altman, R. B., Musen, M. A., Shah, N. H. 2011; 2: S10-?


    Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and the complexity of natural language in expressing similar relationships causes the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using these for data mining or question answering. We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network. The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at

    View details for DOI 10.1186/2041-1480-2-S2-S10

    View details for PubMedID 21624156

  • Bisphosphonates pathway PHARMACOGENETICS AND GENOMICS Gong, L., Altman, R. B., Klein, T. E. 2011; 21 (1): 50-53

    View details for DOI 10.1097/FPC.0b013e328335729c

    View details for Web of Science ID 000285331700007

    View details for PubMedID 20023594

  • Content-based microarray search using differential expression profiles BMC BIOINFORMATICS Engreitz, J. M., Morgan, A. A., Dudley, J. T., Chen, R., Thathoo, R., Altman, R. B., Butte, A. J. 2010; 11


    With the expansion of public repositories such as the Gene Expression Omnibus (GEO), we are rapidly cataloging cellular transcriptional responses to diverse experimental conditions. Methods that query these repositories based on gene expression content, rather than textual annotations, may enable more effective experiment retrieval as well as the discovery of novel associations between drugs, diseases, and other perturbations. We develop methods to retrieve gene expression experiments that differentially express the same transcriptional programs as a query experiment. Avoiding thresholds, we generate differential expression profiles that include a score for each gene measured in an experiment. We use existing and novel dimension reduction and correlation measures to rank relevant experiments in an entirely data-driven manner, allowing emergent features of the data to drive the results. A combination of matrix decomposition and p-weighted Pearson correlation proves the most suitable for comparing differential expression profiles. We apply this method to index all GEO DataSets, and demonstrate the utility of our approach by identifying pathways and conditions relevant to transcription factors Nanog and FoxO3. Content-based gene expression search generates relevant hypotheses for biological inquiry. Experiments across platforms, tissue types, and protocols inform the analysis of new datasets.

    View details for DOI 10.1186/1471-2105-11-603

    View details for Web of Science ID 000286192100001

    View details for PubMedID 21172034

  • KCNH2 pharmacogenomics summary PHARMACOGENETICS AND GENOMICS Oshiro, C., Thorn, C. F., Roden, D. M., Klein, T. E., Altman, R. B. 2010; 20 (12): 775-777

    View details for DOI 10.1097/FPC.0b013e3283349e9c

    View details for Web of Science ID 000284148300006

    View details for PubMedID 20150828

  • Independent component analysis: Mining microarray data for fundamental human gene expression modules JOURNAL OF BIOMEDICAL INFORMATICS Engreitz, J. M., Daigle, B. J., Marshall, J. J., Altman, R. B. 2010; 43 (6): 932-944


    As public microarray repositories rapidly accumulate gene expression data, these resources contain increasingly valuable information about cellular processes in human biology. This presents a unique opportunity for intelligent data mining methods to extract information about the transcriptional modules underlying these biological processes. Modeling cellular gene expression as a combination of functional modules, we use independent component analysis (ICA) to derive 423 fundamental components of human biology from a 9395-array compendium of heterogeneous expression data. Annotation using the Gene Ontology (GO) suggests that while some of these components represent known biological modules, others may describe biology not well characterized by existing manually-curated ontologies. In order to understand the biological functions represented by these modules, we investigate the mechanism of the preclinical anti-cancer drug parthenolide (PTL) by analyzing the differential expression of our fundamental components. Our method correctly identifies known pathways and predicts that N-glycan biosynthesis and T-cell receptor signaling may contribute to PTL response. The fundamental gene modules we describe have the potential to provide pathway-level insight into new gene expression datasets.

    View details for DOI 10.1016/j.jbi.2010.07.001

    View details for Web of Science ID 000285036700009

    View details for PubMedID 20619355

  • Using text to build semantic networks for pharmacogenomics JOURNAL OF BIOMEDICAL INFORMATICS Coulet, A., Shah, N. H., Garten, Y., Musen, M., Altman, R. B. 2010; 43 (6): 1009-1019


    Most pharmacogenomics knowledge is contained in the text of published studies, and is thus not available for automated computation. Natural Language Processing (NLP) techniques for extracting relationships in specific domains often rely on hand-built rules and domain-specific ontologies to achieve good performance. In a new and evolving field such as pharmacogenomics (PGx), rules and ontologies may not be available. Recent progress in syntactic NLP parsing in the context of a large corpus of pharmacogenomics text provides new opportunities for automated relationship extraction. We describe an ontology of PGx relationships built starting from a lexicon of key pharmacogenomic entities and a syntactic parse of more than 87 million sentences from 17 million MEDLINE abstracts. We used the syntactic structure of PGx statements to systematically extract commonly occurring relationships and to map them to a common schema. Our extracted relationships have a 70-87.7% precision and involve not only key PGx entities such as genes, drugs, and phenotypes (e.g., VKORC1, warfarin, clotting disorder), but also critical entities that are frequently modified by these key entities (e.g., VKORC1 polymorphism, warfarin response, clotting disorder treatment). The result of our analysis is a network of 40,000 relationships between more than 200 entity types with clear semantics. This network is used to guide the curation of PGx knowledge and provide a computable resource for knowledge discovery.

    View details for DOI 10.1016/j.jbi.2010.08.005

    View details for Web of Science ID 000285036700017

    View details for PubMedID 20723615

  • SLC19A1 pharmacogenomics summary PHARMACOGENETICS AND GENOMICS Yee, S. W., Gong, L., Badagnani, I., Giacomini, K. M., Klein, T. E., Altman, R. B. 2010; 20 (11): 708-715

    View details for DOI 10.1097/FPC.0b013e32833eca92

    View details for Web of Science ID 000282965100006

    View details for PubMedID 20811316

  • An integrative method for scoring candidate genes from association studies: application to warfarin dosing Tatonetti, N. P., Dudley, J. T., Sagreiya, H., Butte, A. J., Altman, R. B. BIOMED CENTRAL LTD. 2010


    A key challenge in pharmacogenomics is the identification of genes whose variants contribute to drug response phenotypes, which can include severe adverse effects. Pharmacogenomics GWAS attempt to elucidate genotypes predictive of drug response. However, the size of these studies has severely limited their power and potential application. We propose a novel knowledge integration and SNP aggregation approach for identifying genes impacting drug response. Our SNP aggregation method characterizes the degree to which uncommon alleles of a gene are associated with drug response. We first use pre-existing knowledge sources to rank pharmacogenes by their likelihood to affect drug response. We then define a summary score for each gene based on allele frequencies and train linear and logistic regression classifiers to predict drug response phenotypes. We applied our method to a published warfarin GWAS data set comprising 181 individuals. We find that our method can increase the power of the GWAS to identify both VKORC1 and CYP2C9 as warfarin pharmacogenes, where the original analysis had only identified VKORC1. Additionally, we find that our method can be used to discriminate between low-dose (AUROC=0.886) and high-dose (AUROC=0.764) responders. Our method offers a new route for candidate pharmacogene discovery from pharmacogenomics GWAS, and serves as a foundation for future work in methods for predictive pharmacogenomics.

    View details for DOI 10.1186/1471-2105-11-S9-S9

    View details for Web of Science ID 000290218700009

    View details for PubMedID 21044367

  • The utility of general purpose versus specialty clinical databases for research: Warfarin dose estimation from extracted clinical variables JOURNAL OF BIOMEDICAL INFORMATICS Sagreiya, H., Altman, R. B. 2010; 43 (5): 747-751


    There is debate about the utility of clinical data warehouses for research. Using a clinical warfarin dosing algorithm derived from research-quality data, we evaluated the data quality of both a general-purpose database and a coagulation-specific database. We evaluated the functional utility of these repositories by using data extracted from them to predict warfarin dose. We reasoned that high-quality clinical data would predict doses nearly as accurately as research data, while poor-quality clinical data would predict doses less accurately. We evaluated the Mean Absolute Error (MAE) in predicted weekly dose as a metric of data quality. The MAE was comparable between the clinical gold standard (10.1 mg/wk) and the specialty database (10.4 mg/wk), but the MAE for the clinical warehouse was 40% greater (14.1 mg/wk). Our results indicate that the research utility of clinical data collected in focused clinical settings is greater than that of data collected during general-purpose clinical care.
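
The MAE metric and the "40% greater" comparison in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the toy dose values are hypothetical, and only the three MAE figures (10.1, 10.4, 14.1 mg/wk) come from the abstract.

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over paired observations."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_increase(value, baseline):
    """Fractional increase of value over baseline (0.40 means 40% greater)."""
    return (value - baseline) / baseline


# Hypothetical weekly doses (mg/wk), made up purely to show the metric.
toy_predicted = [35.0, 28.0, 42.0]
toy_actual = [30.0, 31.5, 49.0]
toy_mae = mean_absolute_error(toy_predicted, toy_actual)

# MAE values quoted in the abstract (mg/wk).
gold_mae, specialty_mae, warehouse_mae = 10.1, 10.4, 14.1
specialty_excess = relative_increase(specialty_mae, gold_mae)
warehouse_excess = relative_increase(warehouse_mae, gold_mae)

print(f"toy MAE: {toy_mae:.2f} mg/wk")
print(f"specialty database vs. gold standard: +{specialty_excess:.0%}")
print(f"clinical warehouse vs. gold standard: +{warehouse_excess:.0%}")
```

Relative to the 10.1 mg/wk gold standard, the specialty database's MAE is about 3% higher and the warehouse's about 40% higher, matching the figures in the abstract.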

    View details for DOI 10.1016/j.jbi.2010.03.014

    View details for Web of Science ID 000281927200010

    View details for PubMedID 20363365

  • VKORC1 Pharmacogenomics Summary PHARMACOGENETICS AND GENOMICS Owen, R. P., Gong, L., Sagreiya, H., Klein, T. E., Altman, R. B. 2010; 20 (10): 642-644

    View details for DOI 10.1097/FPC.0b013e32833433b6

    View details for Web of Science ID 000281830900010

    View details for PubMedID 19940803

  • Recent progress in automatically extracting information from the pharmacogenomic literature PHARMACOGENOMICS Garten, Y., Coulet, A., Altman, R. B. 2010; 11 (10): 1467-1489


    The biomedical literature holds our understanding of pharmacogenomics, but it is dispersed across many journals. In order to integrate our knowledge, connect important facts across publications and generate new hypotheses we must organize and encode the contents of the literature. By creating databases of structured pharmacogenomic knowledge, we can make the value of the literature much greater than the sum of the individual reports. We can, for example, generate candidate gene lists or interpret surprising hits in genome-wide association studies. Text mining automatically adds structure to the unstructured knowledge embedded in millions of publications, and recent years have seen a surge in work on biomedical text mining, some specific to pharmacogenomics literature. These methods enable extraction of specific types of information and can also provide answers to general, systemic queries. In this article, we describe the main tasks of text mining in the context of pharmacogenomics, summarize recent applications and anticipate the next phase of text mining applications.

    View details for DOI 10.2217/PGS.10.136

    View details for Web of Science ID 000284199500014

    View details for PubMedID 21047206

  • Thiopurine pathway PHARMACOGENETICS AND GENOMICS Zaza, G., Cheok, M., Krynetskaia, N., Thorn, C., Stocco, G., Hebert, J. M., McLeod, H., Weinshilboum, R. M., Relling, M. V., Evans, W. E., Klein, T. E., Altman, R. B. 2010; 20 (9): 573-574

    View details for DOI 10.1097/FPC.0b013e328334338f

    View details for Web of Science ID 000281295500008

    View details for PubMedID 19952870

  • Turning limited experimental information into 3D models of RNA RNA-A PUBLICATION OF THE RNA SOCIETY Flores, S. C., Altman, R. B. 2010; 16 (9): 1769-1778


    Our understanding of RNA functions in the cell is evolving rapidly. As for proteins, the detailed three-dimensional (3D) structure of RNA is often key to understanding its function. Although crystallography and nuclear magnetic resonance (NMR) can determine the atomic coordinates of some RNA structures, many 3D structures present technical challenges that make these methods difficult to apply. The great flexibility of RNA, its charged backbone, dearth of specific surface features, and propensity for kinetic traps all conspire with its long folding time to challenge in silico methods for physics-based folding. On the other hand, base-pairing interactions (either in runs to form helices or isolated tertiary contacts) and motifs are often available from relatively low-cost experiments or informatics analyses. We present RNABuilder, a novel code that uses internal coordinate mechanics to satisfy user-specified base pairing and steric forces under chemical constraints. The code recapitulates the topology and characteristic L-shape of tRNA and obtains an accurate noncrystallographic structure of the Tetrahymena ribozyme P4/P6 domain. The algorithm scales nearly linearly with molecule size, opening the door to the modeling of significantly larger structures.

    View details for DOI 10.1261/rna.2112110

    View details for Web of Science ID 000281003900006

    View details for PubMedID 20651028

  • Maternal-fetal and neonatal pharmacogenomics: a review of current literature JOURNAL OF PERINATOLOGY Blumenfeld, Y. J., Reynolds-May, M. F., Altman, R. B., El-Sayed, Y. Y. 2010; 30 (9): 571-579


    Pharmacogenomics, the study of specific genetic variations and their effect on drug response, will likely give rise to many applications in maternal-fetal and neonatal medicine; yet, an understanding of these applications in the field of obstetrics and gynecology and neonatal pediatrics is not widespread. This review describes the underpinnings of the field of pharmacogenomics and summarizes the current pharmacogenomic inquiries in relation to maternal-fetal medicine, including studies on various fetal and neonatal genetic cytochrome P450 (CYP) enzyme variants and their role in drug toxicities (for example, codeine metabolism, sepsis and selective serotonin reuptake inhibitor (SSRI) toxicity). Potential future directions, including alternative drug classification, improvements in drug efficacy and non-invasive pharmacogenomic testing, will also be explored.

    View details for DOI 10.1038/jp.2009.183

    View details for Web of Science ID 000281388500002

    View details for PubMedID 19924131

  • PharmGKB summary: very important pharmacogene information for CYP2B6 PHARMACOGENETICS AND GENOMICS Thorn, C. F., Lamba, J. K., Lamba, V., Klein, T. E., Altman, R. B. 2010; 20 (8): 520-523

    View details for DOI 10.1097/FPC.0b013e32833947c2

    View details for Web of Science ID 000279865400007

    View details for PubMedID 20648701

  • Extending and evaluating a warfarin dosing algorithm that includes CYP4F2 and pooled rare variants of CYP2C9 PHARMACOGENETICS AND GENOMICS Sagreiya, H., Berube, C., Wen, A., Ramakrishnan, R., Mir, A., Hamilton, A., Altman, R. B. 2010; 20 (7): 407-413


    Warfarin dosing remains challenging because of its narrow therapeutic window and large variability in dose response. We sought to analyze new factors involved in its dosing and to evaluate eight dosing algorithms, including two developed by the International Warfarin Pharmacogenetics Consortium (IWPC). We enrolled 108 patients on chronic warfarin therapy and obtained complete clinical and pharmacy records; we genotyped single nucleotide polymorphisms relevant to the VKORC1, CYP2C9, and CYP4F2 genes using integrated fluidic circuits made by Fluidigm. When applying the IWPC pharmacogenetic algorithm to our cohort of patients, the percentage of patients within 1 mg/d of the therapeutic warfarin dose increases to 63%, from 54% using clinical factors only or from 38% using a fixed-dose approach. CYP4F2 adds 4% to the fraction of the variability in dose (R²) explained by the IWPC pharmacogenetic algorithm (P<0.05). Importantly, we show that pooling rare variants substantially increases the R² for CYP2C9 (rare variants: P=0.0065, R²=6%; common variants: P=0.0034, R²=7%; rare and common variants: P=0.00018, R²=12%), indicating that relatively rare variants not genotyped in genome-wide association studies may be important. In addition, the IWPC pharmacogenetic algorithm and the Gage (2008) algorithm perform best (IWPC: R²=50%; Gage: R²=49%), and all pharmacogenetic algorithms outperform the IWPC clinical equation (R²=22%). VKORC1 and CYP2C9 genotypes did not affect long-term variability in dose. Finally, the Fluidigm platform, a novel warfarin genotyping method, showed 99.65% concordance between different operators and instruments. CYP4F2 and pooled rare variants of CYP2C9 significantly improve the ability to estimate warfarin dose.

    View details for DOI 10.1097/FPC.0b013e328338bac2

    View details for Web of Science ID 000278879400001

    View details for PubMedID 20442691
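
    The two evaluation measures quoted in the warfarin abstract above (the share of patients whose predicted dose falls within 1 mg/d of the therapeutic dose, and the fraction of dose variability explained) can be sketched as follows. This is only an illustration: the function names are hypothetical, the doses in the usage note are made up, and R² is computed here as the ordinary coefficient of determination, which may differ from the exact statistic used in the paper.

```python
def fraction_within(predicted, actual, tolerance=1.0):
    """Share of patients whose predicted dose is within `tolerance` mg/d of the actual dose."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)


def r_squared(predicted, actual):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up doses in mg/d, for illustration only (not data from the study).
actual_doses = [5.0, 3.0, 7.0, 4.0]
predicted_doses = [4.5, 4.5, 6.0, 4.0]
print(fraction_within(predicted_doses, actual_doses))
print(r_squared(predicted_doses, actual_doses))
```

    Comparing algorithms then amounts to calling both functions on each algorithm's predictions for the same cohort.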

  • Clopidogrel pathway PHARMACOGENETICS AND GENOMICS Sangkuhl, K., Klein, T. E., Altman, R. B. 2010; 20 (7): 463-465

    View details for DOI 10.1097/FPC.0b013e3283385420

    View details for Web of Science ID 000278879400009

    View details for PubMedID 20440227

  • Very important pharmacogene summary: thiopurine S-methyltransferase PHARMACOGENETICS AND GENOMICS Wang, L., Pelleymounter, L., Weinshilboum, R., Johnson, J. A., Hebert, J. M., Altman, R. B., Klein, T. E. 2010; 20 (6): 401-405

    View details for DOI 10.1097/FPC.0b013e3283352860

    View details for Web of Science ID 000277594800007

    View details for PubMedID 20154640

  • Clinical implementation of pharmacogenomics: overcoming genetic exceptionalism LANCET ONCOLOGY Relling, M. V., Altman, R. B., Goetz, M. P., Evans, W. E. 2010; 11 (6): 507-509
  • Challenges in the clinical application of whole-genome sequencing LANCET Ormond, K. E., Wheeler, M. T., Hudgins, L., Klein, T. E., Butte, A. J., Altman, R. B., Ashley, E. A., Greely, H. T. 2010; 375 (9727): 1749-1751
  • Vascular endothelial growth factor pathway PHARMACOGENETICS AND GENOMICS Maitland, M. L., Lou, X. J., Ramirez, J., Desai, A. A., Berlin, D. S., McLeod, H. L., Weichselbaum, R. R., Ratain, M. J., Altman, R. B., Klein, T. E. 2010; 20 (5): 346-349

    View details for DOI 10.1097/FPC.0b013e3283364ed7

    View details for Web of Science ID 000276704800009

    View details for PubMedID 20124951

  • Cytochrome P450 2C9-CYP2C9 PHARMACOGENETICS AND GENOMICS Van Booven, D., Marsh, S., McLeod, H., Carrillo, M. W., Sangkuhl, K., Klein, T. E., Altman, R. B. 2010; 20 (4): 277-281

    View details for DOI 10.1097/FPC.0b013e3283349e84

    View details for Web of Science ID 000276373800008

    View details for PubMedID 20150829

  • Teaching computers to read the pharmacogenomics literature ... so you don't have to PHARMACOGENOMICS Garten, Y., Altman, R. B. 2010; 11 (4): 515-518

    View details for DOI 10.2217/PGS.10.48

    View details for Web of Science ID 000276769300010

    View details for PubMedID 20350132

  • Pharmacogenomics and bioinformatics: PharmGKB PHARMACOGENOMICS Thorn, C. F., Klein, T. E., Altman, R. B. 2010; 11 (4): 501-505


    The NIH initiated the PharmGKB in April 2000. The primary mission was to create a repository of primary data, to build tools to track associations between genes and drugs, and to catalog the location and frequency of genetic variations known to impact drug response. Over the past 10 years, new technologies have shifted research from candidate gene pharmacogenetics to phenotype-based pharmacogenomics with a consequent explosion of data. PharmGKB has refocused on curating knowledge rather than housing primary genotype and phenotype data, and now captures more complex relationships between genes, variants, drugs, diseases and pathways. Going forward, the challenges are to provide the tools and knowledge to plan and interpret genome-wide pharmacogenomics studies, predict gene-drug relationships based on shared mechanisms and support data-sharing consortia investigating clinical applications of pharmacogenomics.

    View details for DOI 10.2217/PGS.10.15

    View details for Web of Science ID 000276769300008

    View details for PubMedID 20350130

  • DNATwist: A Web-Based Tool for Teaching Middle and High School Students About Pharmacogenomics CLINICAL PHARMACOLOGY & THERAPEUTICS Berlin, D. S., Person, M. G., Mittal, A., Oppezzo, M. A., Chin, D. B., Starr, B., Klein, T. E., Schwartz, D. L., Altman, R. B. 2010; 87 (4): 393-395


    DNATwist is a Web-based learning tool that explains pharmacogenomics concepts to middle- and high-school students. Its features include (i) a focus on drug responses of interest to teenagers (e.g., alcohol intolerance), (ii) reusable graphical interfaces that reduce extension costs, and (iii) explanations of molecular and cellular drug responses. In testing, students found the tool and topic understandable and engaging. The tool is being modified for use at the Tech Museum of Innovation in California.

    View details for DOI 10.1038/clpt.2009.303

    View details for Web of Science ID 000276506900009

    View details for PubMedID 20305671

  • Using Pre-existing Microarray Datasets to Increase Experimental Power: Application to Insulin Resistance PLOS COMPUTATIONAL BIOLOGY Daigle, B. J., Deng, A., McLaughlin, T., Cushman, S. W., Cam, M. C., Reaven, G., Tsao, P. S., Altman, R. B. 2010; 6 (3)


    Although they have become a widely used experimental technique for identifying differentially expressed (DE) genes, DNA microarrays are notorious for generating noisy data. A common strategy for mitigating the effects of noise is to perform many experimental replicates. This approach is often costly and sometimes impossible given limited resources; thus, analytical methods are needed which increase accuracy at no additional cost. One inexpensive source of microarray replicates comes from prior work: to date, data from hundreds of thousands of microarray experiments are in the public domain. Although these data assay a wide range of conditions, they cannot be used directly to inform any particular experiment and are thus ignored by most DE gene methods. We present the SVD Augmented Gene expression Analysis Tool (SAGAT), a mathematically principled, data-driven approach for identifying DE genes. SAGAT increases the power of a microarray experiment by using observed coexpression relationships from publicly available microarray datasets to reduce uncertainty in individual genes' expression measurements. We tested the method on three well-replicated human microarray datasets and demonstrate that use of SAGAT increased effective sample sizes by as many as 2.72 arrays. We applied SAGAT to unpublished data from a microarray study investigating transcriptional responses to insulin resistance, resulting in a 50% increase in the number of significant genes detected. We evaluated 11 (58%) of these genes experimentally using qPCR, confirming the directions of expression change for all 11 and statistical significance for three. Use of SAGAT revealed coherent biological changes in three pathways: inflammation, differentiation, and fatty acid synthesis, furthering our molecular understanding of a type 2 diabetes risk factor. We envision SAGAT as a means to maximize the potential for biological discovery from subtle transcriptional responses, and we provide it as a freely available software package that is immediately applicable to any human microarray study.

    View details for DOI 10.1371/journal.pcbi.1000718

    View details for Web of Science ID 000278125200026

    View details for PubMedID 20361040

  • PharmGKB very important pharmacogene: SLCO1B1 PHARMACOGENETICS AND GENOMICS Oshiro, C., Mangravite, L., Klein, T., Altman, R. 2010; 20 (3): 211-216

    View details for DOI 10.1097/FPC.0b013e328333b99c

    View details for Web of Science ID 000275061200007

    View details for PubMedID 19952871

  • Identification of recurring protein structure microenvironments and discovery of novel functional sites around CYS residues BMC STRUCTURAL BIOLOGY Wu, S., Liu, T., Altman, R. B. 2010; 10


    The emergence of structural genomics presents significant challenges in the annotation of biologically uncharacterized proteins. Unfortunately, our ability to analyze these proteins is restricted by the limited catalog of known molecular functions and their associated 3D motifs. In order to identify novel 3D motifs that may be associated with molecular functions, we employ an unsupervised, two-phase clustering approach that combines k-means and hierarchical clustering with knowledge-informed cluster selection and annotation methods. We applied the approach to approximately 20,000 cysteine-based protein microenvironments (3D regions 7.5 Å in radius) and identified 70 interesting clusters, some of which represent known motifs (e.g. metal binding and phosphatase activity), and some of which are novel, including several zinc binding sites. Detailed annotation results are available online for all 70 clusters. The use of microenvironments instead of backbone geometric criteria enables flexible exploration of protein function space, and detection of recurring motifs that are discontinuous in sequence and diverse in structure. Clustering microenvironments may thus help to functionally characterize novel proteins and better understand the protein structure-function relationship.

    View details for DOI 10.1186/1472-6807-10-4

    View details for Web of Science ID 000275410900001

    View details for PubMedID 20122268

  • PharmGKB summary: very important pharmacogene information for angiotensin-converting enzyme PHARMACOGENETICS AND GENOMICS Thorn, C. F., Klein, T. E., Altman, R. B. 2010; 20 (2): 143-146

    View details for DOI 10.1097/FPC.0b013e3283339bf3

    View details for Web of Science ID 000274306700011

    View details for PubMedID 19898265

  • Extraction of genotype-phenotype-drug relationships from text: from entity recognition to bioinformatics application. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Coulet, A., Shah, N., Hunter, L., Barral, C., Altman, R. B. 2010: 485-487


    Advances in concept recognition and natural language parsing have led to the development of various tools that enable the identification of biomedical entities and relationships between them in text. The aim of the Genotype-Phenotype-Drug Relationship Extraction from Text workshop (or GPD-Rx workshop) is to examine the current state of the art and discuss the next steps for making the extraction of relationships between biomedical entities integral to the curation and knowledge management workflow in Pharmacogenomics. The workshop will focus particularly on the extraction of Genotype-Phenotype, Genotype-Drug, and Phenotype-Drug relationships that are of interest to Pharmacogenomics. Extracting and structuring such text-mined relationships is key to supporting the evaluation and validation of multiple hypotheses that emerge from high-throughput translational studies spanning multiple measurement modalities. In order to advance this agenda, it is essential that existing relationship extraction methods be compared to one another and that a community-wide benchmark corpus emerge, against which future methods can be compared. The workshop aims to bring together researchers working on the automatic or semi-automatic extraction of relationships between biomedical entities from research literature in order to identify the key groups interested in creating such a benchmark.

    View details for PubMedID 19904832

  • Predicting RNA structure by multiple template homology modeling. Flores, S. C., Wan, Y., Russell, R., Altman, R. B. edited by Altman, R., Dunker, K., Hunter, L. 2010
  • Improving the prediction of pharmacogenes using text-derived drug-gene relationships. Garten, Y., Tatonetti, N. P., Altman, R. B. edited by Altman, R., Dunker, K., Hunter, L. 2010
  • Proceedings of Pacific Symposium on Biocomputing 2010. edited by Altman, R., Dunker, K., Hunter, L. 2010
  • Extraction of genotype-phenotype-drug relationships from text: from entity recognition to bioinformatics application. Coulet, A., Shah, N., Hunter, L., Barral, C., Altman, R. B. edited by Altman, R., Dunker, K., Hunter, L. 2010
  • An integrative method for scoring candidate genes from association studies: application to warfarin dosing BMC BIOINFORMATICS Tatonetti, N. P., Dudley, J. T., Sagreiya, H., Butte, A. J., Altman, R. B. 2010; 11 (Suppl 9): S9 (PMCID: PMC2967750)
  • Predicting RNA structure by multiple template homology modeling. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Flores, S. C., Wan, Y., Russell, R., Altman, R. B. 2010: 216-227


    Despite the importance of 3D structure to understand the myriad functions of RNAs in cells, most RNA molecules remain out of reach of crystallographic and NMR methods. However, certain structural information such as base pairing and some tertiary contacts can be determined readily for many RNAs by bioinformatics or relatively low cost experiments. Further, because RNA structure is highly modular, it is possible to deduce local 3D structure from the solved structures of evolutionarily related RNAs or even unrelated RNAs that share the same module. RNABuilder is a software package that generates model RNA structures by treating the kinematics and forces at separate, multiple levels of resolution. Kinematically, bonds in bases, certain stretches of residues, and some entire molecules are rigid while other bonds remain flexible. Forces act on the rigid bases and selected individual atoms. Here we use RNABuilder to predict the structure of the 200-nucleotide Azoarcus group I intron by homology modeling against fragments of the distantly-related Twort and Tetrahymena group I introns and by incorporating base pairing forces where necessary. In the absence of any information from the solved Azoarcus intron crystal structure, the model accurately depicts the global topology, secondary and tertiary connections, and gives an overall RMSD value of 4.6 Å relative to the crystal structure. The accuracy of the model is even higher in the intron core (RMSD = 3.5 Å), whereas deviations are modestly larger for peripheral regions that differ more substantially between the different introns. These results lay the groundwork for using this approach for larger and more diverse group I introns, as well as for still larger RNAs and RNA-protein complexes such as group II introns and the ribosomal subunits.

    View details for PubMedID 19908374
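
    The RMSD values quoted in the abstract above measure the average distance between matched atoms of a model and a reference structure. A minimal sketch of that calculation follows; it assumes the two coordinate lists are already superimposed and list the same atoms in the same order, and it does not perform the optimal superposition that structure comparison normally includes. The function name and the coordinates in the usage note are hypothetical.

```python
import math


def rmsd(model_coords, reference_coords):
    """Root-mean-square deviation between two equally long lists of (x, y, z) points."""
    if len(model_coords) != len(reference_coords):
        raise ValueError("coordinate lists must have the same length")
    squared = [
        (mx - rx) ** 2 + (my - ry) ** 2 + (mz - rz) ** 2
        for (mx, my, mz), (rx, ry, rz) in zip(model_coords, reference_coords)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Made-up coordinates in Å: each atom is displaced by exactly 3 Å.
model = [(0.0, 0.0, 3.0), (1.0, 3.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(rmsd(model, reference))
```

    Restricting both lists to a subset of atoms (for example, only those in the intron core) gives a regional RMSD of the kind the abstract reports.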

  • Very important pharmacogene summary ADRB2 PHARMACOGENETICS AND GENOMICS Litonjua, A. A., Gong, L., Duan, Q. L., Shin, J., Moore, M. J., Weiss, S. T., Johnson, J. A., Klein, T. E., Altman, R. B. 2010; 20 (1): 64-69

    View details for DOI 10.1097/FPC.0b013e328333dae6

    View details for Web of Science ID 000273307600008

    View details for PubMedID 19927042

  • Editorial: Current progress in Bioinformatics 2010 BRIEFINGS IN BIOINFORMATICS Altman, R. B. 2010; 11 (1): 1-2

    View details for DOI 10.1093/bib/bbq001

    View details for Web of Science ID 000273866500001

    View details for PubMedID 20097719

  • Improving the prediction of pharmacogenes using text-derived drug-gene relationships. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Garten, Y., Tatonetti, N. P., Altman, R. B. 2010: 305-314


    A critical goal of pharmacogenomics research is to identify genes that can explain variation in drug response. We have previously reported a method that creates a genome-scale ranking of genes likely to interact with a drug. The algorithm uses information about drug structure and indications of use to rank the genes. Although the algorithm has good performance, its performance depends on a curated set of drug-gene relationships that is expensive to create and difficult to maintain. In this work, we assess the utility of text mining in extracting a network of drug-gene relationships automatically. This provides a valuable aggregate source of knowledge, subsequently used as input into the algorithm that ranks potential pharmacogenes. Using a drug-gene network created from sentence-level co-occurrence in the full text of scientific articles, we compared the performance to that of a network created by manual curation of those articles. Under a wide range of conditions, we show that a knowledge base derived from text-mining the literature performs as well as, and sometimes better than, a high-quality, manually curated knowledge base. We conclude that we can use relationships mined automatically from the literature as a knowledgebase for pharmacogenomics relationships. Additionally, when relationships are missed by text mining, our system can accurately extrapolate new relationships with 77.4% precision.

    View details for PubMedID 19908383
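
    The sentence-level co-occurrence network described in the abstract above can be illustrated with a small sketch: a drug and a gene are linked whenever both are mentioned in the same sentence, and the edge weight counts how many sentences mention the pair. The lexicons, the naive regex sentence splitter, and the function name are assumptions made for illustration; they are not the authors' pipeline.

```python
import re
from collections import Counter


def cooccurrence_network(text, drugs, genes):
    """Count, per (drug, gene) pair, the sentences in which both terms appear."""
    drug_lookup = {d.lower(): d for d in drugs}
    gene_lookup = {g.lower(): g for g in genes}
    edges = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = {t.lower() for t in re.findall(r"[A-Za-z0-9]+", sentence)}
        found_drugs = [drug_lookup[t] for t in tokens if t in drug_lookup]
        found_genes = [gene_lookup[t] for t in tokens if t in gene_lookup]
        for drug in found_drugs:
            for gene in found_genes:
                edges[(drug, gene)] += 1
    return edges


# Made-up example text and lexicons.
sample_text = (
    "Warfarin dose is influenced by CYP2C9 and VKORC1. "
    "Clopidogrel is activated by CYP2C19. "
    "CYP2C9 variants alter warfarin clearance."
)
network = cooccurrence_network(
    sample_text,
    drugs=["warfarin", "clopidogrel"],
    genes=["CYP2C9", "VKORC1", "CYP2C19"],
)
print(network.most_common())
```

    A real system would need synonym handling and a more careful tokenizer; exact single-token matching is used here only to keep the sketch short.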

  • Knowledge-based instantiation of full atomic detail into coarse-grain RNA 3D structural models BIOINFORMATICS Jonikas, M. A., Radmer, R. J., Altman, R. B. 2009; 25 (24): 3259-3266


    The recent development of methods for modeling RNA 3D structures using coarse-grain approaches creates a need to bridge low- and high-resolution modeling methods. Although they contain topological information, coarse-grain models lack atomic detail, which limits their utility for some applications. We have developed a method for adding full atomic detail to coarse-grain models of RNA 3D structures. Our method [Coarse to Atomic (C2A)] uses geometries observed in known RNA crystal structures. Our method rebuilds full atomic detail from ideal coarse-grain backbones taken from crystal structures to within 1.87-3.31 Å RMSD of the full atomic crystal structure. When starting from coarse-grain models generated by the modeling tool NAST, our method builds full atomic structures that are within 1.00 Å RMSD of the starting structure. The resulting full atomic structures can be used as starting points for higher resolution modeling, thus bridging high- and low-resolution approaches to modeling RNA 3D structure. Code for the C2A method, as well as the examples discussed in this article, are freely available.

    View details for DOI 10.1093/bioinformatics/btp576

    View details for Web of Science ID 000272464000008

    View details for PubMedID 19812110

  • Prediction of calcium-binding sites by combining loop-modeling with machine learning BMC STRUCTURAL BIOLOGY Liu, T., Altman, R. B. 2009; 9


    Protein ligand-binding sites in the apo state exhibit structural flexibility. This flexibility often frustrates methods for structure-based recognition of these sites because it leads to the absence of electron density for these critical regions, particularly when they are in surface loops. Methods for recognizing functional sites in these missing loops would be useful for recovering additional functional information. We report a hybrid approach for recognizing calcium-binding sites in disordered regions. Our approach combines loop modeling with a machine learning method (FEATURE) for structure-based site recognition. For validation, we compared the performance of our method on known calcium-binding sites for which there are both holo and apo structures. When loops in the apo structures are rebuilt using modeling methods, FEATURE identifies 14 out of 20 crystallographically proven calcium-binding sites. It only recognizes 7 out of 20 calcium-binding sites in the initial apo crystal structures. We applied our method to unstructured loops in proteins from SCOP families known to bind calcium in order to discover potential cryptic calcium binding sites. We built 2745 missing loops and evaluated them for potential calcium binding. We made 102 predictions of calcium-binding sites. Ten predictions are consistent with independent experimental verifications. We found indirect experimental evidence for 14 other predictions. The remaining 78 predictions are novel predictions, some with intriguing potential biological significance. In particular, we see an enrichment of beta-sheet folds with predicted calcium binding sites in the connecting loops on the surface that may be important for calcium-mediated function switches. Protein crystal structures are a potentially rich source of functional information. When loops are missing in these structures, we may be losing important information about binding sites and active sites. We have shown that limited loop modeling (e.g. loops less than 17 residues) combined with pattern matching algorithms can recover functions and propose putative conformations associated with these functions.

    View details for DOI 10.1186/1472-6807-9-72

    View details for Web of Science ID 000273849100001

    View details for PubMedID 20003365

  • Taxane pathway PHARMACOGENETICS AND GENOMICS Oshiro, C., Marsh, S., McLeod, H., Carrillo, M. W., Klein, T., Altman, R. 2009; 19 (12): 979-983

    View details for DOI 10.1097/FPC.0b013e3283335277

    View details for Web of Science ID 000272310800008

    View details for PubMedID 21151855

  • Selective serotonin reuptake inhibitors pathway PHARMACOGENETICS AND GENOMICS Sangkuhl, K., Klein, T. E., Altman, R. B. 2009; 19 (11): 907-909

    View details for DOI 10.1097/FPC.0b013e32833132cb

    View details for Web of Science ID 000271602800010

    View details for PubMedID 19741567

  • Generating Genome-Scale Candidate Gene Lists for Pharmacogenomics CLINICAL PHARMACOLOGY & THERAPEUTICS Hansen, N. T., Brunak, S., Altman, R. B. 2009; 86 (2): 183-189


    A critical task in pharmacogenomics is identifying genes that may be important modulators of drug response. High-throughput experimental methods are often plagued by false positives and do not take advantage of existing knowledge. Candidate gene lists can usefully summarize existing knowledge, but they are expensive to generate manually and may therefore have incomplete coverage. We have developed a method that ranks 12,460 genes in the human genome on the basis of their potential relevance to a specific query drug and its putative indications. Our method uses known gene-drug interactions, networks of gene-gene interactions, and available measures of drug-drug similarity. It ranks genes by building a local network of known interactions and assessing the similarity of the query drug (by both structure and indication) with drugs that interact with gene products in the local network. In a comprehensive benchmark, our method achieves an overall area under the curve of 0.82. To showcase our method, we found novel gene candidates for warfarin, gefitinib, carboplatin, and gemcitabine, and we provide the molecular hypotheses for these predictions.

    View details for DOI 10.1038/clpt.2009.42

    View details for Web of Science ID 000268565100019

    View details for PubMedID 19369935
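
    The benchmark figure in the abstract above (an area under the curve of 0.82) summarizes how well a ranking places true candidate genes ahead of the rest. The abstract does not say how the value was computed, so the sketch below uses the standard pairwise-ranking formulation of the area under the ROC curve as an assumption; the function name and the scores in the usage note are hypothetical.

```python
def ranking_auc(positive_scores, negative_scores):
    """Probability that a randomly chosen positive outscores a randomly chosen negative.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up relevance scores: known drug-related genes versus all other genes.
known_gene_scores = [0.9, 0.4]
other_gene_scores = [0.5, 0.3, 0.1]
print(ranking_auc(known_gene_scores, other_gene_scores))
```

    A value of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every known gene above every other gene.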

  • A Double-Blind, Randomized, Saline-Controlled Study of the Efficacy and Safety of EUFLEXXA (R) for Treatment of Painful Osteoarthritis of the Knee, With an Open-Label Safety Extension (The FLEXX Trial) SEMINARS IN ARTHRITIS AND RHEUMATISM Altman, R. D., Rosen, J. E., Bloch, D. A., Hatoum, H. T., Korner, P. 2009; 39 (1): 1-9


    To report the FLEXX trial, the first well-controlled study assessing the safety and efficacy of Euflexxa (1% sodium hyaluronate; IA-BioHA) therapy for knee osteoarthritis (OA) at 26 weeks. This was a randomized, double-blind, multicenter, saline-controlled study. Subjects with chronic knee OA were randomized to 3 weekly intra-articular (IA) injections of either buffered saline (IA-SA) or IA-BioHA (20 mg/2 ml). The primary efficacy outcome was subject recorded difference in least-squares means between IA-BioHA and IA-SA in subjects' change from baseline to week 26 following a 50-foot walk test, measured via 100-mm visual analog scale (VAS). Secondary outcome measures included Osteoarthritis Research Society International responder index, Western Ontario McMaster University Osteoarthritis Index VA 3.1 subscales, patient global assessment, rescue medication, and health-related quality of life (HRQoL) by the SF-36. Safety was assessed by monitoring and reporting vital signs, physical examination of the target knee following injection, adverse events, and concomitant medications. Five hundred eighty-eight subjects were randomized to either IA-BioHA (n = 293) or IA-SA (n = 295), with an 88% 26-week completion rate. No statistical differences were noted between the treatment groups at baseline. In the IA-BioHA group, mean VAS scores decreased by 25.7 mm, compared with 18.5 mm in the IA-SA group. This corresponded to a median reduction of 53% from baseline for IA-BioHA and a 38% reduction for IA-SA. The difference in least-squares means was -6.6 mm (P = 0.002). Secondary outcome measures were consistent with significant improvement in Osteoarthritis Research Society International responder index, HRQoL, and function. Both IA-SA and IA-BioHA injections were well tolerated, with a low incidence of adverse events that were equally distributed between groups. Injection-site reactions were reported by 1 (<1%) subject in the IA-SA group and 2 (1%) in the IA-BioHA group. IA-BioHA therapy resulted in significant OA knee pain relief at 26 weeks compared with IA-SA. Subjects treated with IA-BioHA also experienced significant improvements in joint function, treatment satisfaction, and HRQoL.

    View details for DOI 10.1016/j.semarthrit.2009.04.001

    View details for Web of Science ID 000268735900001

    View details for PubMedID 19539353

  • Improving Structure-Based Function Prediction Using Molecular Dynamics STRUCTURE Glazer, D. S., Radmer, R. J., Altman, R. B. 2009; 17 (7): 919-929


    The number of molecules with solved three-dimensional structure but unknown function is increasing rapidly. Particularly problematic are novel folds with little detectable similarity to molecules of known function. Experimental assays can determine the functions of such molecules, but are time-consuming and expensive. Computational approaches can identify potential functional sites; however, these approaches generally rely on single static structures and do not use information about dynamics. In fact, structural dynamics can enhance function prediction: we coupled molecular dynamics simulations with structure-based function prediction algorithms that identify Ca(2+) binding sites. When applied to 11 challenging proteins, both methods showed substantial improvement in performance, revealing 22 more sites in one case and 12 more in the other, with a modest increase in apparent false positives. Thus, we show that treating molecules as dynamic entities improves the performance of structure-based function prediction methods.

    View details for DOI 10.1016/j.str.2009.05.010

    View details for Web of Science ID 000268214500004

    View details for PubMedID 19604472

  • Antiestrogen pathway (aromatase inhibitor) PHARMACOGENETICS AND GENOMICS Desta, Z., Nguyen, A., Flockhart, D., Skaar, T., Fletcher, R., Weinshilboum, R., Berlin, D. S., Klein, T. E., Altman, R. B. 2009; 19 (7): 554-555

    View details for DOI 10.1097/FPC.0b013e32832e0ec1

    View details for Web of Science ID 000267619000008

    View details for PubMedID 19512956

  • Codeine and morphine pathway PHARMACOGENETICS AND GENOMICS Thorn, C. F., Klein, T. E., Altman, R. B. 2009; 19 (7): 556-558
  • Platinum pathway PHARMACOGENETICS AND GENOMICS Marsh, S., McLeod, H., Dolan, E., Shukla, S. J., Rabik, C. A., Gong, L., Hernandez-Boussard, T., Lou, X. J., Klein, T. E., Altman, R. B. 2009; 19 (7): 563-564

    View details for DOI 10.1097/FPC.0b013e32832e0ed7

    View details for Web of Science ID 000267619000011

    View details for PubMedID 19525887

  • Cytochrome P450 2D6 PHARMACOGENETICS AND GENOMICS Owen, R. P., Sangkuhl, K., Klein, T. E., Altman, R. B. 2009; 19 (7): 559-562

    View details for DOI 10.1097/FPC.0b013e32832e0e97

    View details for Web of Science ID 000267619000010

    View details for PubMedID 19512959

  • Etoposide pathway PHARMACOGENETICS AND GENOMICS Yang, J., Bogni, A., Schuetz, E. G., Ratain, M., Dolan, M. E., McLeod, H., Gong, L., Thorn, C., Relling, M. V., Klein, T. E., Altman, R. B. 2009; 19 (7): 552-553

    View details for DOI 10.1097/FPC.0b013e32832e0e7f

    View details for Web of Science ID 000267619000007

    View details for PubMedID 19512958

  • Direct-to-Consumer Genetic Testing: Failure Is Not an Option CLINICAL PHARMACOLOGY & THERAPEUTICS Altman, R. B. 2009; 86 (1): 15-17


    Direct-to-consumer genetic testing is an unavoidable consequence of our ability to cheaply and accurately measure the genome. Some are troubled by the loss of control over how and when this information is disclosed to individuals, but it is difficult to imagine any way to prevent the wide availability of these data. Therefore, the key challenge is to set up social, educational, and technical means to support individuals who have access to their genome.

    View details for DOI 10.1038/clpt.2009.63

    View details for Web of Science ID 000267225200003

    View details for PubMedID 19536117

  • New feature: pathways and important genes from PharmGKB PHARMACOGENETICS AND GENOMICS Eichelbaum, M., Altman, R. B., Ratain, M., Klein, T. E. 2009; 19 (6): 403-403
  • Very important pharmacogene summary: sulfotransferase 1A1 PHARMACOGENETICS AND GENOMICS Hildebrandt, M., Adjei, A., Weinshilboum, R., Johnson, J. A., Berlin, D. S., Klein, T. E., Altman, R. B. 2009; 19 (6): 404-406

    View details for DOI 10.1097/FPC.0b013e32832e042e

    View details for Web of Science ID 000266575500002

    View details for PubMedID 19451861

  • Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text BMC BIOINFORMATICS Garten, Y., Altman, R. B. 2009; 10


    Pharmacogenomics studies the relationship between genetic variation and the variation in drug response phenotypes. The field is rapidly gaining importance: it promises drugs targeted to particular subpopulations based on genetic background. The pharmacogenomics literature has expanded rapidly, but is dispersed in many journals. It is challenging, therefore, to identify important associations between drugs and molecular entities, particularly genes and gene variants, and thus these critical connections are often lost. Text mining techniques can allow us to convert the free-style text to a computable, searchable format in which pharmacogenomic concepts (such as genes, drugs, polymorphisms, and diseases) are identified, and important links between these concepts are recorded. Availability of full text articles as input into text mining engines is key, as literature abstracts often do not contain sufficient information to identify these pharmacogenomic associations. Thus, building on a tool called Textpresso, we have created the Pharmspresso tool to assist in identifying important pharmacogenomic facts in full text articles. Pharmspresso parses text to find references to human genes, polymorphisms, drugs and diseases and their relationships. It presents these as a series of marked-up text fragments, in which key concepts are visually highlighted. To evaluate Pharmspresso, we used a gold standard of 45 human-curated articles. Pharmspresso identified 78%, 61%, and 74% of target gene, polymorphism, and drug concepts, respectively. Pharmspresso is a text analysis tool that extracts pharmacogenomic concepts from the literature automatically and thus captures our current understanding of gene-drug interactions in a computable form. We have made Pharmspresso available online.

    View details for DOI 10.1186/1471-2105-10-S2-S6

    View details for Web of Science ID 000265602500007

    View details for PubMedID 19208194

  • Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters RNA-A PUBLICATION OF THE RNA SOCIETY Jonikas, M. A., Radmer, R. J., Laederach, A., Das, R., Pearlman, S., Herschlag, D., Altman, R. B. 2009; 15 (2): 189-199


    Understanding the function of complex RNA molecules depends critically on understanding their structure. However, creating three-dimensional (3D) structural models of RNA remains a significant challenge. We present a protocol (the nucleic acid simulation tool [NAST]) for RNA modeling that uses an RNA-specific knowledge-based potential in a coarse-grained molecular dynamics engine to generate plausible 3D structures. We demonstrate NAST's capabilities by using only secondary structure and tertiary contact predictions to generate, cluster, and rank structures. Representative structures in the best ranking clusters averaged 8.0 ± 0.3 Å and 16.3 ± 1.0 Å RMSD for the yeast phenylalanine tRNA and the P4-P6 domain of the Tetrahymena thermophila group I intron, respectively. The coarse-grained resolution allows us to model large molecules such as the 158-residue P4-P6 or the 388-residue T. thermophila group I intron. One advantage of NAST is the ability to rank clusters of structurally similar decoys based on their compatibility with experimental data. We successfully used ideal small-angle X-ray scattering data and both ideal and experimental solvent accessibility data to select the best cluster of structures for both tRNA and P4-P6. Finally, we used NAST to build in missing loops in the crystal structures of the Azoarcus and Twort ribozymes, and to incorporate crystallographic data into the Michel-Westhof model of the T. thermophila group I intron, creating an integrated model of the entire molecule. Our software package is freely available at

    View details for DOI 10.1261/rna.1270809

    View details for Web of Science ID 000262463200001

    View details for PubMedID 19144906

  • TOWARDS A CYTOKINE-CELL INTERACTION KNOWLEDGEBASE OF THE ADAPTIVE IMMUNE SYSTEM PACIFIC SYMPOSIUM ON BIOCOMPUTING 2009 Shen-Orr, S. S., Goldberger, O., Garten, Y., Rosenberg-Hasson, Y., Lovelace, P. A., Hirschberg, D. L., Altman, R. B., Davis, M. M., Butte, A. J. 2009: 439-450


    The immune system of higher organisms is, by any standard, complex. To date, using reductionist techniques, immunologists have elucidated many of the basic principles of how the immune system functions, yet our understanding is still far from complete. In an era of high throughput measurements, it is already clear that the scientific knowledge we have accumulated has itself grown larger than our ability to cope with it, and thus it is increasingly important to develop bioinformatics tools with which to navigate the complexity of the information that is available to us. Here, we describe ImmuneXpresso, an information extraction system, tailored for parsing the primary literature of immunology and relating it to experimental data. The immune system is very much dependent on the interactions of various white blood cells with each other, either in synaptic contacts, at a distance using cytokines or chemokines, or both. Therefore, as a first approximation, we used ImmuneXpresso to create a literature derived network of interactions between cells and cytokines. Integration of cell-specific gene expression data facilitates cross-validation of cytokine mediated cell-cell interactions and suggests novel interactions. We evaluate the performance of our automatically generated multi-scale model against existing manually curated data, and show how this system can be used to guide experimentalists in interpreting multi-scale, experimental data. Our methodology is scalable and can be generalized to other systems.

    View details for Web of Science ID 000263639700041

    View details for PubMedID 19209721

  • Proceedings of Pacific Symposium on Biocomputing 2009. edited by Altman, R., Dunker, K., Hunter, L. 2009
  • The International Warfarin Pharmacogenetics Consortium. Warfarin Dosing Using Clinical and Pharmacogenetic Data. New England Journal of Medicine. Altman, R. B. 2009; 360 (8): 753-64
  • Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text. BMC Bioinformatics, 10 Suppl 2:S6. PMCID: PMC2646239. Garten, Y., Altman, R. B. 2009
  • New feature: pathways and important genes from PharmGKB. Pharmacogenetics and genomics Eichelbaum, M., Altman, R. B., Ratain, M., Klein, T. E. 2009; 19 (6): 403

    View details for PubMedID 20161212

  • Predicting drug side-effects by chemical systems biology GENOME BIOLOGY Tatonetti, N. P., Liu, T., Altman, R. B. 2009; 10 (9)


    New approaches to predicting ligand similarity and protein interactions can explain unexpected observations of drug inefficacy or side-effects.

    View details for DOI 10.1186/gb-2009-10-9-238

    View details for Web of Science ID 000271425300004

    View details for PubMedID 19723347

  • A general framework for dose optimization. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Turcott, R. G., Sagreiya, H., Ashley, E. A., Altman, R. B., Das, A. K. 2009; 2009: 656-660


    Dose optimization is a ubiquitous challenge in clinical practice and includes both pharmacologic and non-pharmacologic interventions. Methods for the statistical assessment of optimum dosing are lacking. We developed a generic framework for dose titration and demonstrated its application in two domains. Optimum warfarin dose was estimated from clinical titration data. In addition, cardiac pacemaker interval optimization was conducted using three conventional techniques. For both data types, optima were obtained from mathematical functions fit to the raw data. The precision of the estimated optima was quantified using bootstrapping. In pacing optimization, the observed precision varied significantly among the techniques, suggesting that impedance cardiography is superior to commonly used echocardiographic methods. The average 95% confidence interval of the estimated optimum warfarin dose was +/-18%, suggesting that titration within this range is of limited utility. By identifying statistically ineffective interventions, objective analysis of optimization data may both improve outcomes and reduce healthcare costs.

    View details for PubMedID 20351936
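
    The abstract above describes fitting a mathematical function to titration data, reading off its optimum, and quantifying the precision of that optimum by bootstrapping. A minimal sketch of that idea follows. It assumes a quadratic response curve and a percentile interval; the abstract does not state which functions or interval method the authors used, and the function names `fitted_optimum` and `bootstrap_interval` are hypothetical.

```python
import random


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fitted_optimum(points):
    """Vertex of the least-squares parabola y = a*x^2 + b*x + c.

    `points` is a list of (dose, response) pairs. Returns None when the
    fit is undetermined (fewer than three distinct doses) or has no vertex.
    The quadratic form is an assumption made for illustration.
    """
    xs = [x for x, _ in points]
    if len(set(xs)) < 3:
        return None
    n = len(points)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(y for _, y in points)
    t1 = sum(x * y for x, y in points)
    t2 = sum(x * x * y for x, y in points)
    # Normal equations: m @ [a, b, c] = rhs
    m = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [t2, t1, t0]
    # Cramer's rule: a = det(ma)/det(m), b = det(mb)/det(m).
    # The vertex -b/(2a) only needs the ratio det(mb)/det(ma).
    ma = [[rhs[i], m[i][1], m[i][2]] for i in range(3)]
    mb = [[m[i][0], rhs[i], m[i][2]] for i in range(3)]
    da, db = det3(ma), det3(mb)
    if da == 0:
        return None
    return -db / (2 * da)


def bootstrap_interval(points, n_boot=1000, seed=0):
    """95% percentile interval for the fitted optimum.

    Resamples the (dose, response) pairs with replacement and refits;
    resamples that do not determine a parabola are redrawn.
    """
    if fitted_optimum(points) is None:
        raise ValueError("need at least three distinct doses and a curved fit")
    rng = random.Random(seed)
    estimates = []
    while len(estimates) < n_boot:
        sample = [rng.choice(points) for _ in points]
        opt = fitted_optimum(sample)
        if opt is not None:
            estimates.append(opt)
    estimates.sort()
    lo = estimates[int(0.025 * n_boot)]
    hi = estimates[int(0.975 * n_boot) - 1]
    return lo, hi
```

    A wide interval relative to the optimum would suggest, in the spirit of the abstract, that fine titration inside that range is unlikely to be distinguishable from noise.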

  • Efficient Algorithms to Explore Conformation Spaces of Flexible Protein Loops IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Yao, P., Dhanik, A., Marz, N., Propper, R., Kou, C., Liu, G., van den Bedem, H., Latombe, J., Halperin-Landsberg, I., Altman, R. B. 2008; 5 (4): 534-545


    Several applications in biology - e.g., incorporation of protein flexibility in ligand docking algorithms, interpretation of fuzzy X-ray crystallographic data, and homology modeling - require computing the internal parameters of a flexible fragment (usually, a loop) of a protein in order to connect its termini to the rest of the protein without causing any steric clash. One must often sample many such conformations in order to explore and adequately represent the conformational range of the studied loop. While sampling must be fast, it is made difficult by the fact that two conflicting constraints - kinematic closure and clash avoidance - must be satisfied concurrently. This paper describes two efficient and complementary sampling algorithms to explore the space of closed clash-free conformations of a flexible protein loop. The "seed sampling" algorithm samples broadly from this space, while the "deformation sampling" algorithm uses seed conformations as starting points to explore the conformation space around them at a finer grain. Computational results are presented for various loops ranging from 5 to 25 residues. More specific results also show that the combination of the sampling algorithms with a functional site prediction software (FEATURE) makes it possible to compute and recognize calcium-binding loop conformations. The sampling algorithms are implemented in a toolkit (LoopTK), which is available at

    View details for DOI 10.1109/TCBB.2008.96

    View details for Web of Science ID 000260433100007

    View details for PubMedID 18989041

  • PharmGKB: an integrated resource of pharmacogenomic data and knowledge. Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.] Gong, L., Owen, R. P., Gor, W., Altman, R. B., Klein, T. E. 2008; Chapter 14: Unit14 7-?


    The PharmGKB is a publicly available online resource that aims to facilitate understanding how genetic variation contributes to variation in drug response. It is not only a repository of pharmacogenomics primary data, but it also provides fully curated knowledge including drug pathways, annotated pharmacogene summaries, and relationships among genes, drugs, and diseases. This unit describes how to navigate the PharmGKB Web site to retrieve detailed information on genes and important variants, as well as their relationship to drugs and diseases. It also includes protocols on our drug-centered pathway, annotated pharmacogene summaries, and our Web services for downloading the underlying data. A workflow for using PharmGKB to facilitate the design of a pharmacogenomic study is also described in this unit.

    View details for DOI 10.1002/0471250953.bi1407s23

    View details for PubMedID 18819074

  • The Simbios National Center: Systems biology in motion PROCEEDINGS OF THE IEEE Schmidt, J. P., Delp, S. L., Sherman, M. A., Taylor, C. A., Pande, V. S., Altman, R. B. 2008; 96 (8): 1266-1280
  • High-throughput single-nucleotide structural mapping by capillary automated footprinting analysis NUCLEIC ACIDS RESEARCH Mitra, S., Shcherbakova, I. V., Altman, R. B., Brenowitz, M., Laederach, A. 2008; 36 (11)


    The use of capillary electrophoresis with fluorescently labeled nucleic acids revolutionized DNA sequencing, effectively fueling the genomic revolution. We present an application of this technology for the high-throughput structural analysis of nucleic acids by chemical and enzymatic mapping ('footprinting'). We achieve the throughput and data quality necessary for genomic-scale structural analysis by combining fluorophore labeling of nucleic acids with novel quantitation algorithms. We implemented these algorithms in the CAFA (capillary automated footprinting analysis) open-source software, which is downloadable gratis. The accuracy, throughput and reproducibility of CAFA analysis are demonstrated using hydroxyl radical footprinting of RNA. The versatility of CAFA is illustrated by dimethyl sulfate mapping of RNA secondary structure and DNase I mapping of a protein binding to a specific sequence of DNA. Our experimental and computational approach facilitates the acquisition of high-throughput chemical probing data for solution structural analysis of nucleic acids.

    View details for DOI 10.1093/nar/gkn267

    View details for Web of Science ID 000257188700033

    View details for PubMedID 18477638

  • Interview: Russ Altman speaks to Shreeya Nanda, Commissioning Editor. Pharmacogenomics Altman, R. B. 2008; 9 (6): 663-665


    Russ Biagio Altman is a professor of bioengineering, genetics, and medicine (and of computer science by courtesy) and chairman of the Bioengineering Department at Stanford University, CA, USA. His primary research interests are in the application of computing technology to basic molecular biological problems of relevance to medicine. He is currently developing techniques for collaborative scientific computation over the internet, including novel user interfaces to biological data, particularly for pharmacogenomics. Other work focuses on the analysis of functional microenvironments within macromolecules and the application of algorithms for determining the structure, dynamics and function of biological macromolecules. Dr Altman holds an MD from Stanford Medical School, a PhD in medical information sciences from Stanford, and an AB from Harvard College, MA, USA. He has been the recipient of the US Presidential Early Career Award for Scientists and Engineers and a National Science Foundation CAREER Award. He is a fellow of the American College of Physicians and the American College of Medical Informatics. He is a past-president and founding board member of the International Society for Computational Biology and an organizer of the annual Pacific Symposium on Biocomputing. He leads one of seven NIH-supported National Centers for Biomedical Computation, focusing on physics-based simulation of biological structures. He won the Stanford Medical School graduate teaching award in 2000.

    View details for DOI 10.2217/14622416.9.6.663

    View details for Web of Science ID 000256961700006

    View details for PubMedID 18518843

  • iTools: A Framework for Classification, Categorization and Integration of Computational Biology Resources PLOS ONE Dinov, I. D., Rubin, D., Lorensen, W., Dugan, J., Ma, J., Murphy, S., Kirschner, B., Bug, W., Sherman, M., Floratos, A., Kennedy, D., Jagadish, H. V., Schmidt, J., Athey, B., Califano, A., Musen, M., Altman, R., Kikinis, R., Kohane, I., Delp, S., Parker, D. S., Toga, A. W. 2008; 3 (5)


    The advancement of the computational biology field hinges on progress in three fundamental directions--the development of new computational algorithms, the availability of informatics resource management infrastructures and the capability of tools to interoperate and synergize. There is an explosion in algorithms and tools for computational biology, which makes it difficult for biologists to find, compare and integrate such resources. We describe a new infrastructure, iTools, for managing the query, traversal and comparison of diverse computational biology resources. Specifically, iTools stores information about three types of resources--data, software tools and web-services. The iTools design, implementation and resource meta-data content reflect the broad research, computational, applied and scientific expertise available at the seven National Centers for Biomedical Computing. iTools provides a system for classification, categorization and integration of different computational biology resources across space-and-time scales, biomedical problems, computational infrastructures and mathematical foundations. A large number of resources are already iTools-accessible to the community and this infrastructure is rapidly growing. iTools includes human and machine interfaces to its resource meta-data repository. Investigators or computer programs may utilize these interfaces to search, compare, expand, revise and mine meta-data descriptions of existent computational biology resources. We propose two ways to browse and display the iTools dynamic collection of resources. The first one is based on an ontology of computational biology resources, and the second one is derived from hyperbolic projections of manifolds or complex structures onto planar discs. iTools is an open source project both in terms of the source code development as well as its meta-data content. iTools employs a decentralized, portable, scalable and lightweight framework for long-term resource management. We demonstrate several applications of iTools as a framework for integrated bioinformatics. iTools and the complete details about its specifications, usage and interfaces are available at the iTools web page

    View details for DOI 10.1371/journal.pone.0002265

    View details for Web of Science ID 000262268500012

    View details for PubMedID 18509477

  • M-BISON: Microarray-based integration of data sources using networks BMC BIOINFORMATICS Daigle, B. J., Altman, R. B. 2008; 9


    The accurate detection of differentially expressed (DE) genes has become a central task in microarray analysis. Unfortunately, the noise level and experimental variability of microarrays can be limiting. While a number of existing methods partially overcome these limitations by incorporating biological knowledge in the form of gene groups, these methods sacrifice gene-level resolution. This loss of precision can be inappropriate, especially if the desired output is a ranked list of individual genes. To address this shortcoming, we developed M-BISON (Microarray-Based Integration of data SOurces using Networks), a formal probabilistic model that integrates background biological knowledge with microarray data to predict individual DE genes. M-BISON improves signal detection on a range of simulated data, particularly when using very noisy microarray data. We also applied the method to the task of predicting heat shock-related differentially expressed genes in S. cerevisiae, using an hsf1 mutant microarray dataset and conserved yeast DNA sequence motifs. Our results demonstrate that M-BISON improves the analysis quality and makes predictions that are easy to interpret in concert with incorporated knowledge. Specifically, M-BISON increases the AUC of DE gene prediction from 0.541 to 0.623 when compared to a method using only microarray data, and M-BISON outperforms a related method, GeneRank. Furthermore, by analyzing M-BISON predictions in the context of the background knowledge, we identified YHR124W as a potentially novel player in the yeast heat shock response. This work provides a solid foundation for the principled integration of imperfect biological knowledge with gene expression data and other high-throughput data sources.

    View details for DOI 10.1186/1471-2105-9-214

    View details for Web of Science ID 000256421800001

    View details for PubMedID 18439292
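
    The AUC figures quoted in the abstract above (0.541 versus 0.623) are areas under the ROC curve. As an illustration of what that number measures, the sketch below computes AUC as the fraction of (positive, negative) pairs in which the positive item receives the higher score, counting ties as half. This is the standard rank-based definition and is not taken from the M-BISON paper; the function name `auc` and the toy inputs are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `scores` are predicted scores (higher means more likely positive) and
    `labels` are 1 for true positives (e.g. truly DE genes) and 0 otherwise.
    Runs in O(P * N) time, which is fine for small illustrative inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))
```

    Under this definition a value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported improvement.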

  • The chemical genomic portrait of yeast: Uncovering a phenotype for all genes SCIENCE Hillenmeyer, M. E., Fung, E., Wildenhain, J., Pierce, S. E., Hoon, S., Lee, W., Proctor, M., St Onge, R. P., Tyers, M., Koller, D., Altman, R. B., Davis, R. W., Nislow, C., Giaever, G. 2008; 320 (5874): 362-365


    Genetics aims to understand the relation between genotype and phenotype. However, because complete deletion of most yeast genes (approximately 80%) has no obvious phenotypic consequence in rich medium, it is difficult to study their functions. To uncover phenotypes for this nonessential fraction of the genome, we performed 1144 chemical genomic assays on the yeast whole-genome heterozygous and homozygous deletion collections and quantified the growth fitness of each deletion strain in the presence of chemical or environmental stress conditions. We found that 97% of gene deletions exhibited a measurable growth phenotype, suggesting that nearly all genes are essential for optimal growth in at least one condition.

    View details for DOI 10.1126/science.1150021

    View details for Web of Science ID 000255026100040

    View details for PubMedID 18420932

  • PharmGKB and the international warfarin pharmacogenetics consortium: The changing role for pharmacogenomic databases and single-drug pharmacogenetics HUMAN MUTATION Owen, R. P., Altman, R. B., Klein, T. E. 2008; 29 (4): 456-460


    PharmGKB, the pharmacogenetics and pharmacogenomics knowledge base, is a publicly available online resource dedicated to the dissemination of how genetic variation leads to variation in drug responses. The goals of PharmGKB are to describe relationships between genes, drugs, and diseases, and to generate knowledge to catalyze pharmacogenetic and pharmacogenomic research. PharmGKB delivers knowledge in the form of curated literature annotations, drug pathway diagrams, and very important pharmacogene (VIP) summaries. Recently, PharmGKB has embraced a new role--broker of pharmacogenomic data for data sharing consortia. In particular, we have helped create the International Warfarin Pharmacogenetics Consortium (IWPC), which is devoted to pooling genotype and phenotype data relevant to the anticoagulant warfarin. PharmGKB has embraced the challenge of continuing to maintain its original mission while taking an active role in the formation of pharmacogenetic consortia.

    View details for DOI 10.1002/humu.20731

    View details for Web of Science ID 000254800400002

    View details for PubMedID 18330919

  • Structural inference of native and partially folded RNA by high-throughput contact mapping PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Das, R., Kudaravalli, M., Jonikas, M., Laederach, A., Fong, R., Schwans, J. P., Baker, D., Piccirilli, J. A., Altman, R. B., Herschlag, D. 2008; 105 (11): 4144-4149


    The biological behaviors of ribozymes, riboswitches, and numerous other functional RNA molecules are critically dependent on their tertiary folding and their ability to sample multiple functional states. The conformational heterogeneity and partially folded nature of most of these states has rendered their characterization by high-resolution structural approaches difficult or even intractable. Here we introduce a method to rapidly infer the tertiary helical arrangements of large RNA molecules in their native and non-native solution states. Multiplexed hydroxyl radical (•OH) cleavage analysis (MOHCA) enables the high-throughput detection of numerous pairs of contacting residues via random incorporation of radical cleavage agents followed by two-dimensional gel electrophoresis. We validated this technology by recapitulating the unfolded and native states of a well studied model RNA, the P4-P6 domain of the Tetrahymena ribozyme, at subhelical resolution. We then applied MOHCA to a recently discovered third state of the P4-P6 RNA that is stabilized by high concentrations of monovalent salt and whose partial order precludes conventional techniques for structure determination. The three-dimensional portrait of a compact, non-native RNA state reveals a well ordered subset of native tertiary contacts, in contrast to the dynamic but otherwise similar molten globule states of proteins. With its applicability to nearly any solution state, we expect MOHCA to be a powerful tool for illuminating the many functional structures of large RNA molecules and RNA/protein complexes.

    View details for DOI 10.1073/pnas.0709032105

    View details for Web of Science ID 000254263300015

    View details for PubMedID 18322008

  • MScanner: a classifier for retrieving medline citations BMC BIOINFORMATICS Poulter, G. L., Rubin, D. L., Altman, R. B., Seoighe, C. 2008; 9


    Keyword searching through PubMed and other systems is the standard means of retrieving information from Medline. However, ad-hoc retrieval systems do not meet all of the needs of databases that curate information from literature, or of text miners developing a corpus on a topic that has many terms indicative of relevance. Several databases have developed supervised learning methods that operate on a filtered subset of Medline, to classify Medline records so that fewer articles have to be manually reviewed for relevance. A few studies have considered generalisation of Medline classification to operate on the entire Medline database in a non-domain-specific manner, but existing applications lack speed, available implementations, or a means to measure performance in new domains. MScanner is an implementation of a Bayesian classifier that provides a simple web interface for submitting a corpus of relevant training examples in the form of PubMed IDs and returning results ranked by decreasing probability of relevance. For maximum speed it uses the Medical Subject Headings (MeSH) and journal of publication as a concise document representation, and takes roughly 90 seconds to return results against the 16 million records in Medline. The web interface provides interactive exploration of the results, and cross validated performance evaluation on the relevant input against a random subset of Medline. We describe the classifier implementation, cross validate it on three domain-specific topics, and compare its performance to that of an expert PubMed query for a complex topic. In cross validation on the three sample topics against 100,000 random articles, the classifier achieved excellent separation of relevant and irrelevant article score distributions, ROC areas between 0.97 and 0.99, and averaged precision between 0.69 and 0.92. MScanner is an effective non-domain-specific classifier that operates on the entire Medline database, and is suited to retrieving topics for which many features may indicate relevance. Its web interface simplifies the task of classifying Medline citations, compared to building a pre-filter and classifier specific to the topic. The data sets and open source code used to obtain the results in this paper are available on-line and as supplementary material, and the web interface may be accessed at

    View details for DOI 10.1186/1471-2105-9-108

    View details for Web of Science ID 000254012100001

    View details for PubMedID 18284683
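
    The abstract above describes a Bayesian classifier that represents each citation by its MeSH terms and journal, trains on a set of relevant examples, and ranks citations by decreasing probability of relevance. The sketch below illustrates that general idea with a simplified naive Bayes log-odds score over sets of features. It is not the MScanner implementation: the smoothing scheme, the choice to score only features present in a document, the function names, and the example feature strings are all assumptions.

```python
import math
from collections import Counter


def train(relevant, background, alpha=1.0):
    """Learn a log-odds weight for every feature seen in training.

    `relevant` and `background` are lists of feature sets (for example
    MeSH headings plus a journal tag). `alpha` is an assumed Laplace
    smoothing constant that keeps unseen features from giving log(0).
    """
    rel_counts = Counter(f for doc in relevant for f in set(doc))
    bg_counts = Counter(f for doc in background for f in set(doc))
    weights = {}
    for f in set(rel_counts) | set(bg_counts):
        p_rel = (rel_counts[f] + alpha) / (len(relevant) + 2 * alpha)
        p_bg = (bg_counts[f] + alpha) / (len(background) + 2 * alpha)
        weights[f] = math.log(p_rel / p_bg)
    return weights


def score(doc, weights):
    """Sum of log-odds weights for the features present in `doc`.

    Features never seen in training contribute nothing.
    """
    return sum(weights.get(f, 0.0) for f in set(doc))


def rank(candidates, weights):
    """Sort (identifier, feature_set) pairs by decreasing relevance score."""
    return sorted(candidates, key=lambda item: score(item[1], weights),
                  reverse=True)
```

    A compact representation such as a set of headings per citation keeps scoring to one dictionary lookup per feature, which is consistent with the emphasis on speed in the abstract.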

  • An XML-based interchange format for genotype-phenotype data HUMAN MUTATION Whirl-Carrillo, M., Woon, M., Thorn, C. E., Klein, T. E., Altman, R. B. 2008; 29 (2): 212-219


    Recent advances in high-throughput genotyping and phenotyping have accelerated the creation of pharmacogenomic data. Consequently, the community requires standard formats to exchange large amounts of diverse information. To facilitate the transfer of pharmacogenomics data between databases and analysis packages, we have created a standard XML (eXtensible Markup Language) schema that describes both genotype and phenotype data as well as associated metadata. The schema accommodates information regarding genes, drugs, diseases, experimental methods, genomic/RNA/protein sequences, subjects, subject groups, and literature. The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) has used this XML schema for more than 5 years to accept and process submissions containing more than 1,814,139 SNPs on 20,797 subjects using 8,975 assays. Although developed in the context of pharmacogenomics, the schema is of general utility for exchange of genotype and phenotype data. We have written syntactic and semantic validators to check documents using this format. The schema and code for validation are available to the community online (last accessed: 8 October 2007).

    View details for DOI 10.1002/humu.20662

    View details for Web of Science ID 000253033000002

    View details for PubMedID 17994540

  • PharmGKB: UNDERSTANDING THE EFFECTS OF INDIVIDUAL GENETIC VARIANTS DRUG METABOLISM REVIEWS Sangkuhl, K., Berlin, D. S., Altman, R. B., Klein, T. E. 2008; 40 (4): 539-551


    The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) is devoted to disseminating primary data and knowledge in pharmacogenetics and pharmacogenomics. We are annotating the genes that are most important for drug response and present this information in the form of Very Important Pharmacogene (VIP) summaries, pathway diagrams, and curated literature. The PharmGKB currently contains information on over 500 drugs, 500 diseases, and 700 genes with genotyped variants. New features focus on capturing the phenotypic consequences of individual genetic variants. These features link variant genotypes to phenotypes, increase the breadth of pharmacogenomics literature curated, and visualize single-nucleotide polymorphisms on a gene's three-dimensional protein structure.

    View details for DOI 10.1080/03602530802413338

    View details for Web of Science ID 000260325500002

    View details for PubMedID 18949600

  • The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science, PMCID: PMC2794835 Hillenmeyer, M., Fung, E., Wildenhain, J., Pierce, S., Hoon, S., Lee, W., Altman, R. B. 2008; 320 (5874): 362-5
  • PharmGKB and the International Warfarin Pharmacogenetics Consortium: the changing role for pharmacogenomic databases and single-drug pharmacogenetics. Hum Mutat. Owen, R., Altman, R., Klein, T. 2008; 29 (4): 456-60
  • Combining molecular dynamics and machine learning to improve protein function recognition Glazer, D., Radmer, R., Altman, R. edited by Altman, R., Dunker, K., Hunter, L. 2008
  • Proceedings of Pacific Symposium on Biocomputing 2008. edited by Altman, R., Dunker, K., Hunter, L. 2008
  • Structural inference of native and partially folded RNA by high-throughput contact mapping. Das, R., Kudaravalli, M., Jonikas, M., Laederach, A., Fong, R., Schwans, J., Altman, R. B. 2008
  • The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications. BMC Genomics., 9 Suppl 2:S2. PMCID: PMC2559884. Halperin, I., Glazer, D., Wu, S., Altman, R. 2008
  • The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications. BMC genomics Halperin, I., Glazer, D. S., Wu, S., Altman, R. B. 2008; 9: S2-?


    Structural genomics efforts contribute new protein structures that often lack significant sequence and fold similarity to known proteins. Traditional sequence and structure-based methods may not be sufficient to annotate the molecular functions of these structures. Techniques that combine structural and functional modeling can be valuable for functional annotation. FEATURE is a flexible framework for modeling and recognition of functional sites in macromolecular structures. Here, we present an overview of the main components of the FEATURE framework, and describe the recent developments in its use. These include automating training sets selection to increase functional coverage, coupling FEATURE to structural diversity generating methods such as molecular dynamics simulations and loop modeling methods to improve performance, and using FEATURE in large-scale modeling and structure determination efforts.

    View details for DOI 10.1186/1471-2164-9-S2-S2

    View details for PubMedID 18831785

  • Semiautomated and rapid quantification of nucleic acid footprinting and structure mapping experiments NATURE PROTOCOLS Laederach, A., Das, R., Vicens, Q., Pearlman, S. M., Brenowitz, M., Herschlag, D., Altman, R. B. 2008; 3 (9): 1395-1401


    We have developed protocols for rapidly quantifying the band intensities from nucleic acid chemical mapping gels at single-nucleotide resolution. These protocols are implemented in the software SAFA (semi-automated footprinting analysis), which can be downloaded without charge. The protocols implemented in SAFA have five steps: (i) lane identification, (ii) gel rectification, (iii) band assignment, (iv) model fitting and (v) band-intensity normalization. SAFA enables the rapid quantitation of gel images containing thousands of discrete bands, thereby eliminating a bottleneck to the analysis of chemical mapping experiments. An experienced user of the software can quantify a gel image in approximately 20 min. Although SAFA was developed to analyze hydroxyl radical (•OH) footprints, it effectively quantifies the gel images obtained with other types of chemical mapping probes. We also present a series of tutorial movies that illustrate the best practices and different steps in the SAFA analysis as a supplement to this protocol.

    View details for DOI 10.1038/nprot.2008.134

    View details for Web of Science ID 000258424100003

    View details for PubMedID 18772866

  • Combining molecular dynamics and machine learning to improve protein function recognition. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Glazer, D. S., Radmer, R. J., Altman, R. B. 2008: 332-343


    As structural genomics efforts succeed in solving protein structures with novel folds, the number of proteins with known structures but unknown functions increases. Although experimental assays can determine the functions of some of these molecules, they can be expensive and time consuming. Computational approaches can assist in identifying potential functions of these molecules. Possible functions can be predicted based on sequence similarity, genomic context, expression patterns, structure similarity, and combinations of these. We investigated whether simulations of protein dynamics can expose functional sites that are not apparent to the structure-based function prediction methods in static crystal structures. Focusing on Ca2+ binding, we coupled a machine learning tool that recognizes functional sites, FEATURE, with Molecular Dynamics (MD) simulations. Treating molecules as dynamic entities can improve the ability of structure-based function prediction methods to annotate possible functional sites.

    View details for PubMedID 18229697

  • Commentaries on "Informatics and Medicine: From Molecules to Populations" METHODS OF INFORMATION IN MEDICINE Altman, R. B., Balling, R., Brinkley, J. F., Coiera, E., Consorti, F., Dhansay, M. A., Geissbuhler, A., Hersh, W., Kwankam, S. Y., Lorenzi, N. M., Martin-Sanchez, E., Mihalas, G. I., Shahar, Y., Takabayashi, K., Wiederhold, G. 2008; 47 (4): 296-317


    To discuss interdisciplinary research and education in the context of informatics and medicine by commenting on the paper of Kuhn et al. "Informatics and Medicine: From Molecules to Populations". Inviting an international group of experts in biomedical and health informatics and related disciplines to comment on this paper. The commentaries include a wide range of reasoned arguments and original position statements which, while strongly endorsing the educational needs identified by Kuhn et al., also point out fundamental challenges that are very specific to the unusual combination of scientific, technological, personal and social problems characterizing biomedical informatics. They point to the ultimate objectives of managing difficult human health problems, which are unlikely to yield to technological solutions alone. The psychological, societal, and environmental components of health and disease are emphasized by several of the commentators, setting the stage for further debate and constructive suggestions.

    View details for Web of Science ID 000258751400003

    View details for PubMedID 18690363

  • The Simbios National Center: Systems Biology in Motion. Proceedings of the IEEE. Institute of Electrical and Electronics Engineers Schmidt, J. P., Delp, S. L., Sherman, M. A., Taylor, C. A., Pande, V. S., Altman, R. B. 2008; 96 (8): 1266


    Physics-based simulation is needed to understand the function of biological structures and can be applied across a wide range of scales, from molecules to organisms. Simbios (the National Center for Physics-Based Simulation of Biological Structures) is one of seven NIH-supported National Centers for Biomedical Computation. This article provides an overview of the mission and achievements of Simbios, and describes its place within systems biology. Understanding the interactions between various parts of a biological system and integrating this information to understand how biological systems function is the goal of systems biology. Many important biological systems comprise complex structural systems whose components interact through the exchange of physical forces, and whose movement and function is dictated by those forces. In particular, systems that are made of multiple identifiable components that move relative to one another in a constrained manner are multibody systems. Simbios' focus is creating methods for their simulation. Simbios is also investigating the biomechanical forces that govern fluid flow through deformable vessels, a central problem in cardiovascular dynamics. In this application, the system is governed by the interplay of classical forces, but the motion is distributed smoothly through the materials and fluids, requiring the use of continuum methods. In addition to the research aims, Simbios is working to disseminate information, software and other resources relevant to biological systems in motion.

    View details for PubMedID 20107615

  • The SeqFEATURE library of 3D functional site models: comparison to existing methods and applications to protein function annotation GENOME BIOLOGY Wu, S., Liang, M. P., Altman, R. B. 2008; 9 (1)


    Structural genomics efforts have led to increasing numbers of novel, uncharacterized protein structures with low sequence identity to known proteins, resulting in a growing need for structure-based function recognition tools. Our method, SeqFEATURE, robustly models protein functions described by sequence motifs using a structural representation. We built a library of models that shows good performance compared to other methods. In particular, SeqFEATURE demonstrates significant improvement over other methods when sequence and structural similarity are low.

    View details for DOI 10.1186/gb-2008-9-1-r8

    View details for Web of Science ID 000253779800016

    View details for PubMedID 18197987

  • The ethics of characterizing difference: guiding principles on using racial categories in human genetics GENOME BIOLOGY Lee, S. S., Mountain, J., Koenig, B., Altman, R., Brown, M., Camarillo, A., Cavalli-Sforza, L., Cho, M., Eberhardt, J., Feldman, M., Ford, R., Greely, H., King, R., Markus, H., Satz, D., Snipp, M., Steele, C., Underhill, P. 2008; 9 (7)


    We are a multidisciplinary group of Stanford faculty who propose ten principles to guide the use of racial and ethnic categories when characterizing group differences in research into human genetic variation.

    View details for DOI 10.1186/gb-2008-9-7-404

    View details for Web of Science ID 000258773600005

    View details for PubMedID 18638359

  • The pharmacogenetics and pharmacogenomics knowledge base: accentuating the knowledge NUCLEIC ACIDS RESEARCH Hernandez-Boussard, T., Whirl-Carrillo, M., Hebert, J. M., Gong, L., Owen, R., Gong, M., Gor, W., Liu, F., Truong, C., Whaley, R., Woon, M., Zhou, T., Altman, R. B., Klein, T. E. 2008; 36: D913-D918


    PharmGKB is a knowledge base that captures the relationships between drugs, diseases/phenotypes and genes involved in pharmacokinetics (PK) and pharmacodynamics (PD). This information includes literature annotations, primary data sets, PK and PD pathways, and expert-generated summaries of PK/PD relationships between drugs, diseases/phenotypes and genes. PharmGKB's website is designed to effectively disseminate knowledge to meet the needs of our users. PharmGKB currently has literature annotations documenting the relationship of over 500 drugs, 450 diseases and 600 variant genes. In order to meet the needs of whole genome studies, PharmGKB has added new functionalities, including browsing the variant display by chromosome and cytogenetic locations, allowing the user to view variants not located within a gene. We have developed new infrastructure for handling whole genome data, including increased methods for quality control and tools for comparison across other data sources, such as dbSNP, JSNP and HapMap data. PharmGKB has also added functionality to accept, store, display and query high throughput SNP array data. These changes allow us to capture more structured information on phenotypes for better cataloging and comparison of data. PharmGKB is available at

    View details for DOI 10.1093/nar/gkm1009

    View details for Web of Science ID 000252545400160

    View details for PubMedID 18032438

  • Text mining for biology - the way forward: opinions from leading scientists GENOME BIOLOGY Altman, R. B., Bergman, C. M., Blake, J., Blaschke, C., Cohen, A., Gannon, F., Grivell, L., Hahn, U., Hersh, W., Hirschman, L., Jensen, L. J., Krallinger, M., Mons, B., O'Donoghue, S. I., Peitsch, M. C., Rebholz-Schuhmann, D., Shatkay, H., Valencia, A. 2008; 9


    This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify several broad themes, including the possibility of fusing literature and biological databases through text mining; the need for user interfaces tailored to different classes of users and supporting community-based annotation; the importance of scaling text mining technology and inserting it into larger workflows; and suggestions for additional challenge evaluations, new applications, and additional resources needed to make progress.

    View details for DOI 10.1186/gb-2008-9-S2-S7

    View details for Web of Science ID 000278173900007

    View details for PubMedID 18834498

  • Robust recognition of zinc binding sites in proteins PROTEIN SCIENCE Ebert, J. C., Altman, R. B. 2008; 17 (1): 54-65


    Metals play a variety of roles in biological processes, and hence their presence in a protein structure can yield vital functional information. Because the residues that coordinate a metal often undergo conformational changes upon binding, detection of binding sites based on simple geometric criteria in proteins without bound metal is difficult. However, aspects of the physicochemical environment around a metal binding site are often conserved even when this structural rearrangement occurs. We have developed a Bayesian classifier using known zinc binding sites as positive training examples and nonmetal binding regions that nonetheless contain residues frequently observed in zinc sites as negative training examples. In order to allow variation in the exact positions of atoms, we average a variety of biochemical and biophysical properties in six concentric spherical shells around the site of interest. At a specificity of 99.8%, this method achieves 75.5% sensitivity in unbound proteins at a positive predictive value of 73.6%. We also test its accuracy on predicted protein structures obtained by homology modeling using templates with 30%-50% sequence identity to the target sequences. At a specificity of 99.8%, we correctly identify at least one zinc binding site in 65.5% of modeled proteins. Thus, in many cases, our model is accurate enough to identify metal binding sites in proteins of unknown structure for which no high sequence identity homologs of known structure exist. Both the source code and a Web interface are available to the public at

    View details for DOI 10.1110/ps.073138508

    View details for Web of Science ID 000251834500007

    View details for PubMedID 18042678
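
    The shell-averaging step described in the abstract above (properties averaged in six concentric spherical shells around a site) can be sketched as follows. This is an illustrative sketch only: the abstract does not give the shell thickness, so the 1.25 angstrom width, the function name and the input layout are assumptions, and only a single property is averaged here rather than the full set of biochemical and biophysical properties.

```python
import math


def shell_averages(site, atoms, n_shells=6, shell_width=1.25):
    """Average one atomic property in concentric shells around a site.

    site        -- (x, y, z) coordinates of the candidate site
    atoms       -- iterable of ((x, y, z), value) pairs
    n_shells    -- number of shells (six, as stated in the abstract)
    shell_width -- shell thickness in angstroms (assumed value)
    """
    sums = [0.0] * n_shells
    counts = [0] * n_shells
    for coords, value in atoms:
        distance = math.dist(site, coords)
        shell = int(distance // shell_width)
        if shell < n_shells:  # atoms beyond the outermost shell are ignored
            sums[shell] += value
            counts[shell] += 1
    # Empty shells yield 0.0 so every site gets a fixed-length vector.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Toy input: a hypothetical per-atom property (for example, partial charge).
toy_atoms = [
    ((0.5, 0.0, 0.0), 1.0),    # innermost shell
    ((0.0, 1.0, 0.0), 3.0),    # innermost shell
    ((0.0, 0.0, 2.0), -0.5),   # second shell
    ((10.0, 0.0, 0.0), 9.0),   # outside all six shells
]
toy_features = shell_averages((0.0, 0.0, 0.0), toy_atoms)
```

    Averaging within shells, rather than recording exact atom positions, is what lets the representation tolerate the small structural rearrangements the abstract mentions. A classifier would then be trained on vectors like `toy_features`, one per property.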

  • Predicting allosteric communication in myosin via a pathway of conserved residues JOURNAL OF MOLECULAR BIOLOGY Tang, S., Liao, J., Dunn, A. R., Altman, R. B., Spudich, J. A., Schmidt, J. P. 2007; 373 (5): 1361-1373


    We present a computational method that predicts a pathway of residues that mediate protein allosteric communication. The pathway is predicted using only a combination of distance constraints between contiguous residues and evolutionary data. We applied this analysis to find pathways of conserved residues connecting the myosin ATP binding site to the lever arm. These pathway residues may mediate the allosteric communication that couples ATP hydrolysis to the lever arm recovery stroke. Having examined pre-stroke conformations of Dictyostelium, scallop, and chicken myosin II as well as Dictyostelium myosin I, we observed a conserved pathway traversing switch II and the relay helix, which is consistent with the understood need for allosteric communication in this conformation. We also examined post-rigor and rigor conformations across several myosin species. Although initial residues of these paths are more heterogeneous, all but one of these paths traverse a consistent set of relay helix residues to reach the beginning of the lever arm. We discuss our results in the context of structural elements and reported mutational experiments, which substantiate the significance of the pre-stroke pathways. Our method provides a simple, computationally efficient means of predicting a set of residues that mediate allosteric communication. We provide a refined, downloadable application and source code to share this tool with the wider community.

    View details for DOI 10.1016/j.jmb.2007.08.059

    View details for Web of Science ID 000250712600021

    View details for PubMedID 17900617

  • Ontological issues in pharmacogenomics MONIST Altman, R. B. 2007; 90 (4): 523-533
  • The education potential of the pharmacogenetics and pharmacogenomics knowledge base (PharmGKB) CLINICAL PHARMACOLOGY & THERAPEUTICS Owen, R. P., Klein, T. E., Altman, R. B. 2007; 82 (4): 472-475


    The pharmacogenetics and pharmacogenomics knowledge base (PharmGKB) is a publicly available internet resource dedicated to the integration, annotation, and aggregation of pharmacogenomic knowledge. PharmGKB is a repository for pharmacogenetic and pharmacogenomic data, and curators provide integrated knowledge in terms of gene summaries, pathways, and annotated literature. Although PharmGKB is primarily directed toward catalyzing new research, it also has utility as a source of information for education about pharmacogenomics.

    View details for DOI 10.1038/sj.clpt.6100332

    View details for Web of Science ID 000249636500024

    View details for PubMedID 17713470

  • Current progress in bioinformatics 2007 BRIEFINGS IN BIOINFORMATICS Altman, R. B. 2007; 8 (5): 277-278

    View details for DOI 10.1093/bib/bbm041

    View details for Web of Science ID 000251034700001

    View details for PubMedID 17724063

  • Using surface envelopes to constrain molecular modeling PROTEIN SCIENCE Dugan, J. M., Altman, R. B. 2007; 16 (7): 1266-1273


    Molecular density information (as measured by electron microscopic reconstructions or crystallographic density maps) can be a powerful source of information for molecular modeling. Molecular density constrains models by specifying where atoms should and should not be. Low-resolution density information can often be obtained relatively quickly, and there is a need for methods that use it effectively. We have previously described a method for scoring molecular models with surface envelopes to discriminate between plausible and implausible fits. We showed that we could successfully filter out models with the wrong shape based on this discrimination power. Ideally, however, surface information should be used during the modeling process to constrain the conformations that are sampled. In this paper, we describe an extension of our method for using shape information during computational modeling. We use the envelope scoring metric as part of an objective function in a global optimization that also optimizes distances and angles while avoiding collisions. We systematically tested surface representations of proteins (using all nonhydrogen heavy atoms) with different abundance of distance information and showed that the root mean square deviation (RMSD) of models built with envelope information is consistently improved, particularly in data sets with relatively small sets of short-range distances.

    View details for DOI 10.1110/ps.062733407

    View details for Web of Science ID 000247465400004

    View details for PubMedID 17586766

  • Genetic nondiscrimination legislation: a critical prerequisite for pharmacogenomics data sharing PHARMACOGENOMICS Altman, R. B., Benowitz, N., Gurwitz, D., Lunshof, J., Relling, M., Lamba, J., Wieben, E., Mooney, S., Giacomini, K., Weiss, S., Johnson, J. A., McLeod, H., Flockhart, D., Weinshilboum, R., Shuldiner, A. R., Roden, D., Krauss, R. M., Ratain, M. 2007; 8 (5): 519-519

    View details for DOI 10.2217/14622416.8.5.519

    View details for Web of Science ID 000246464800017

    View details for PubMedID 17465717

  • Coplanar and coaxial orientations of RNA bases and helices RNA-A PUBLICATION OF THE RNA SOCIETY Laederach, A., Chan, J. M., Schwartzman, A., Willgohs, E., Altman, R. B. 2007; 13 (5): 643-650


    Electrostatic interactions, base-pairing, and especially base-stacking dominate RNA three-dimensional structures. In an A-form RNA helix, base-stacking results in nearly perfect parallel orientations of all bases in the helix. Interestingly, when an RNA structure containing multiple helices is visualized at the atomic level, it is often possible to find an orientation such that only the edges of most bases are visible. This suggests that a general aspect of higher level RNA structure is a coplanar arrangement of base-normal vectors. We have analyzed all solved RNA crystal structures to determine the degree to which RNA base-normal vectors are globally coplanar. Using a statistical test based on the Watson-Girdle distribution, we determined that 330 out of 331 known RNA structures show statistically significant (p < 0.05; false discovery rate [FDR] = 0.05) coplanar normal vector orientations. Not surprisingly, 94% of the helices in RNA show bipolar arrangements of their base-normal vectors (p < 0.05). This allows us to compute a mean axis for each helix and compare their orientations within an RNA structure. This analysis revealed that 62% (208/331) of the RNA structures exhibit statistically significant coaxial packing of helices (p < 0.05, FDR = 0.08). Further analysis reveals that the bases in hairpin loops and junctions are also generally planar. This work demonstrates coplanar base orientation and coaxial helix packing as an emergent behavior of RNA structure and may be useful as a structural modeling constraint.

    View details for DOI 10.1261/rna.381407

    View details for Web of Science ID 000245882400002

    View details for PubMedID 17339576

  • Distinct contribution of electrostatics, initial conformational ensemble, and macromolecular stability in RNA folding PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Laederach, A., Shcherbakova, I., Jonikas, M. A., Altman, R. B., Brenowitz, M. 2007; 104 (17): 7045-7050


    We distinguish the contribution of the electrostatic environment, initial conformational ensemble, and macromolecular stability on the folding mechanism of a large RNA using a combination of time-resolved "Fast Fenton" hydroxyl radical footprinting and exhaustive kinetic modeling. This integrated approach allows us to define the folding landscape of the L-21 Tetrahymena thermophila group I intron structurally and kinetically from its earliest steps with unprecedented accuracy. Distinct parallel pathways leading the RNA to its native form upon its Mg(2+)-induced folding are observed. The structures of the intermediates populating the pathways are not affected by variation of the concentration and type of background monovalent ions (electrostatic environment) but are altered by a mutation that destabilizes one domain of the ribozyme. Experiments starting from different conformational ensembles but folding under identical conditions show that whereas the electrostatic environment modulates molecular flux through different pathways, the initial conformational ensemble determines the partitioning of the flux. This study showcases a robust approach for the development of kinetic models from collections of local structural probes.

    View details for DOI 10.1073/pnas.0608765104

    View details for Web of Science ID 000246024700031

    View details for PubMedID 17438287

  • PharmGKB: a logical home for knowledge relating genotype to drug response phenotype NATURE GENETICS Altman, R. B. 2007; 39 (4): 426-426

    View details for Web of Science ID 000245271200003

    View details for PubMedID 17392795

  • The Pharmacogenetics Research Network: From SNP discovery to clinical drug response CLINICAL PHARMACOLOGY & THERAPEUTICS Giacomini, K. M., Brett, C. M., Altman, R. B., Benowitz, N. L., Dolan, M. E., Flockhart, D. A., Johnson, J. A., Hayes, D. F., Klein, T., Krauss, R. M., Kroetz, D. L., McLeod, H. L., Nguyen, A. T., Ratain, M. J., Relling, M. V., Reus, V., Roden, D. M., Schaefer, C. A., Shuldiner, A. R., Skaar, T., Tantisira, K., Tyndale, R. F., Wang, L., Weinshilboum, R. M., Weiss, S. T., Zineh, I. 2007; 81 (3): 328-345


    The NIH Pharmacogenetics Research Network (PGRN) is a collaborative group of investigators with a wide range of research interests, all attempting to correlate drug response with genetic variation. Several research groups concentrate on drugs used to treat specific medical disorders (asthma, depression, cardiovascular disease, nicotine addiction, and cancer), whereas others are focused on specific groups of proteins that interact with drugs (membrane transporters and phase II drug-metabolizing enzymes). The diverse scientific information is stored and annotated in a publicly accessible knowledge base, the Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB). This report highlights selected achievements and scientific approaches as well as hypotheses about future directions of each of the groups within the PGRN. Seven major topics are included: informatics (PharmGKB), cardiovascular, pulmonary, addiction, cancer, transport, and metabolism.

    View details for DOI 10.1038/sj.clpt.6100087

    View details for Web of Science ID 000244850300011

    View details for PubMedID 17339863

  • Biomedical informatics training at Stanford in the 21st century JOURNAL OF BIOMEDICAL INFORMATICS Altman, R. B., Klein, T. E. 2007; 40 (1): 55-58


    The Stanford Biomedical Informatics training program began with a focus on clinical informatics, and has now evolved into a general program of biomedical informatics training, including clinical informatics, bioinformatics and imaging informatics. The program offers PhD, MS, distance MS, and certificate programs, and is now affiliated with an undergraduate major in biomedical computation. Current dynamics include (1) increased activity in informatics within other training programs in biology and the information sciences, (2) increased desire among informatics students to gain laboratory experience, (3) increased demand for computational collaboration among biomedical researchers, and (4) interaction with the newly formed Department of Bioengineering at Stanford University. The core focus on research training (the development and application of novel informatics methods for biomedical research) keeps the program centered in the midst of this period of growth and diversification.

    View details for DOI 10.1016/j.jbi.2006.02.005

    View details for Web of Science ID 000243216000007

    View details for PubMedID 16564233

  • The PharmGKB: integration, aggregation, and annotation of pharmacogenomic data and knowledge CLINICAL PHARMACOLOGY & THERAPEUTICS Hodge, A. E., Altman, R. B., Klein, T. E. 2007; 81 (1): 21-24


    The Pharmacogenetics and Pharmacogenomics Knowledge Base, PharmGKB, curates pharmacogenetic and pharmacogenomic information to generate knowledge concerning the relationships among genes, drugs, and diseases, and the effects of gene variation on these relationships. PharmGKB curators collect information on genotype-phenotype relationships both from the literature and from the deposition of primary research data into our database. Their goal is to catalyze pharmacogenetic and pharmacogenomic research.

    View details for DOI 10.1038/sj.clpt.6100048

    View details for Web of Science ID 000242874200010

    View details for PubMedID 17185992

  • The FEATURE framework for protein function annotation: modelling new functions, improving performance, and extending to novel applications BMC GENOMICS Halperin, I., Glazer, D. S., Wu, S., Altman, R. B. 2007; 9
  • In Current Pharmacogenomics Thorn, C., Whirl-Carrillo, M., Klein, T., Altman, R. Bentham Science Publishers.. 2007
  • Proceedings of Pacific Symposium on Biocomputing 2007. edited by Altman, R., Dunker, K., Hunter, L. 2007
  • Clustering protein environments for function prediction: finding PROSITE motifs in 3D BMC BIOINFORMATICS Yoon, S., Ebert, J. C., Chung, E., De Micheli, G., Altman, R. B. 2007; 8


    Structural genomics initiatives are producing increasing numbers of three-dimensional (3D) structures for which there is little functional information. Structure-based annotation of molecular function is therefore becoming critical. We previously presented FEATURE, a method for describing microenvironments around functional sites in proteins. However, FEATURE uses supervised machine learning and so is limited to building models for sites of known importance and location. We hypothesized that there are a large number of sites in proteins that are associated with function that have not yet been recognized. Toward that end, we have developed a method for clustering protein microenvironments in order to evaluate the potential for discovering novel sites that have not been previously identified. We have prototyped a computational method for rapid clustering of millions of microenvironments in order to discover residues whose surrounding environments are similar and which may therefore share a functional or structural role. We clustered nearly 2,000,000 environments from 9,600 protein chains and defined 4,550 clusters. As a preliminary validation, we asked whether known 3D environments associated with PROSITE motifs were "rediscovered". We found examples of clusters highly enriched for residues that share PROSITE sequence motifs. Our results demonstrate that we can cluster protein environments successfully using a simplified representation and K-means clustering algorithm. The rediscovery of known 3D motifs allows us to calibrate the size and intercluster distances that characterize useful clusters. This information will then allow us to find new clusters with similar characteristics that represent novel structural or functional sites.

    View details for DOI 10.1186/1471-2105-8-S4-S10

    View details for Web of Science ID 000247557800010

    View details for PubMedID 17570144
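
    The K-means clustering named in the abstract above can be illustrated with a minimal sketch. This is a generic textbook K-means in plain Python, not the authors' implementation: the tuple representation of a microenvironment, the function names, the iteration count and the seeding scheme are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Generic K-means over a list of numeric tuples (hypothetical helper).

    Returns (centroids, labels), where labels[i] is the cluster index
    assigned to points[i].
    """
    rng = random.Random(seed)
    # Assumed seeding: pick k distinct input points as starting centroids.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group every point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest_index(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    labels = [nearest_index(point, centroids) for point in points]
    return centroids, labels


# Toy "environments": two well-separated groups of 2D feature vectors.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centroids, toy_labels = kmeans(toy_points, k=2)
```

    In this toy input the first three vectors end up in one cluster and the last three in the other. Clustering millions of environments, as the abstract describes, would need a far more efficient implementation than this sketch.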

  • Extracting Subject Demographic Information From Abstracts of Randomized Clinical Trial Reports MEDINFO 2007: PROCEEDINGS OF THE 12TH WORLD CONGRESS ON HEALTH (MEDICAL) INFORMATICS, PTS 1 AND 2 Xu, R., Garten, Y., Supekar, K. S., Das, A. K., Altman, R. B., Garber, A. M. 2007; 129: 550-554


    In order to make more informed healthcare decisions, consumers need information systems that deliver accurate and reliable information about their illnesses and potential treatments. Reports of randomized clinical trials (RCTs) provide reliable medical evidence about the efficacy of treatments. Current methods to access, search for, and retrieve RCTs are keyword-based, time-consuming, and suffer from poor precision. Personalized semantic search and medical evidence summarization aim to solve this problem. The performance of these approaches may improve if they have access to study subject descriptors (e.g. age, gender, and ethnicity), trial sizes, and diseases/symptoms studied. We have developed a novel method to automatically extract such subject demographic information from RCT abstracts. We used text classification augmented with a Hidden Markov Model to identify sentences containing subject demographics, and subsequently these sentences were parsed using Natural Language Processing techniques to extract relevant information. Our results show accuracy levels of 82.5%, 92.5%, and 92.0% for extraction of subject descriptors, trial sizes, and diseases/symptoms descriptors respectively.

    View details for Web of Science ID 000272064000111

    View details for PubMedID 17911777

  • Integrating large-scale genotype and phenotype data OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY Hernandez-Boussard, T., Woon, M., Klein, T. E., Altman, R. B. 2006; 10 (4): 545-554


    With the completion of the Human Genome Project, emphasis has shifted to sequence variation and the resulting phenotype. The amount of data available from genomic studies addressing this relationship is growing rapidly. In order to analyze these data as a whole, they need to be integrated, aggregated and annotated in a timely manner. The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) assembles and disseminates these data and the associated metadata that are needed for unambiguous identification and replication. Assembling these data in a timely manner is challenging, and the scalability of these data produces major challenges for a knowledge base such as PharmGKB. However, it is only through rapid global meta-annotation of these data that we will understand the relationship between specific genotype(s) and the related phenotype. PharmGKB has confronted these challenges, and these experiences and solutions can benefit all genome communities.

    View details for Web of Science ID 000243893500009

    View details for PubMedID 17233563

  • Pharmacogenomics: Challenges and opportunities ANNALS OF INTERNAL MEDICINE Roden, D. M., Altman, R. B., Benowitz, N. L., Flockhart, D. A., Giacomini, K. M., Johnson, J. A., Krauss, R. M., McLeod, H. L., Ratain, M. J., Relling, M. V., Ring, H. Z., Shuldiner, A. R., Weinshilboum, R. M., Weiss, S. T. 2006; 145 (10): 749-757


    The outcome of drug therapy is often unpredictable, ranging from beneficial effects to lack of efficacy to serious adverse effects. Variations in single genes are 1 well-recognized cause of such unpredictability, defining the field of pharmacogenetics (see Glossary). Such variations may involve genes controlling drug metabolism, drug transport, disease susceptibility, or drug targets. The sequencing of the human genome and the cataloguing of variants across human genomes are the enabling resources for the nascent field of pharmacogenomics (see Glossary), which tests the idea that genomic variability underlies variability in drug responses. However, there are many challenges that must be overcome to apply rapidly accumulating genomic information to understand variable drug responses, including defining candidate genes and pathways; relating disease genes to drug response genes; precisely defining drug response phenotypes; and addressing analytic, ethical, and technological issues involved in generation and management of large drug response data sets. Overcoming these challenges holds the promise of improving new drug development and ultimately individualizing the selection of appropriate drugs and dosages for individual patients.

    View details for Web of Science ID 000242387100004

    View details for PubMedID 17116919

  • Annual progress in bioinformatics 2006 BRIEFINGS IN BIOINFORMATICS Altman, R. B. 2006; 7 (3): 209-210

    View details for DOI 10.1093/bib/bbl029

    View details for Web of Science ID 000240964500001

  • The incidentalome - A threat to genomic medicine JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION Kohane, I. S., Masys, D. R., Altman, R. B. 2006; 296 (2): 212-215

    View details for Web of Science ID 000238946500027

    View details for PubMedID 16835427

  • Local kinetic measures of macromolecular structure reveal partitioning among multiple parallel pathways from the earliest steps in the folding of a large RNA molecule JOURNAL OF MOLECULAR BIOLOGY Laederach, A., Shcherbakova, I., Liang, M. P., Brenowitz, M., Altman, R. B. 2006; 358 (4): 1179-1190


    At the heart of the RNA folding problem is the number, structures, and relationships among the intermediates that populate the folding pathways of most large RNA molecules. Unique insight into the structural dynamics of these intermediates can be gleaned from the time-dependent changes in local probes of macromolecular conformation (e.g. reports on individual nucleotide solvent accessibility offered by hydroxyl radical (•OH) footprinting). Local measures distributed around a macromolecule individually illuminate the ensemble of separate changes that constitute a folding reaction. Folding pathway reconstruction from a multitude of these individual measures is daunting due to the combinatorial explosion of possible kinetic models as the number of independent local measures increases. Fortunately, clustering of time progress curves sufficiently reduces the dimensionality of the data so as to make reconstruction computationally tractable. The most likely folding topology and intermediates can then be identified by exhaustively enumerating all possible kinetic models on a super-computer grid. The folding pathways and measures of the relative flux through them were determined for Mg(2+) and Na(+)-mediated folding of the Tetrahymena thermophila group I intron using this combined experimental and computational approach. The flux during Mg(2+)-mediated folding is divided among numerous parallel pathways. In contrast, the flux during the Na(+)-mediated reaction is predominantly restricted through three pathways, one of which is without detectable passage through intermediates. Under both conditions, the folding reaction is highly parallel with no single pathway accounting for more than 50% of the molecular flux. This suggests that RNA folding is non-sequential under a variety of different experimental conditions even at the earliest stages of folding. This study provides a template for the systematic analysis of the time-evolution of RNA structure from ensembles of local measures that will illuminate the chemical and physical characteristics of each step in the process. The applicability of this analysis approach to other macromolecules is discussed.

    View details for DOI 10.1016/j.jmb.2006.02.075

    View details for Web of Science ID 000237567000021

    View details for PubMedID 16574145

  • Delivering diverse data to multiple audiences: the PharmGKB model SCIENTIST Altman, R. B. 2006; 20 (4): 49-50
  • Choosing SNPs using feature selection. Journal of bioinformatics and computational biology Phuong, T. M., Lin, Z., Altman, R. B. 2006; 4 (2): 241-257


    A major challenge for genomewide disease association studies is the high cost of genotyping large numbers of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, able to capture most variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than by using block-based methods. Supplementary website:

    View details for PubMedID 16819782
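
    The idea of discarding redundant SNPs on the basis of their correlations can be illustrated with a small sketch. This is not the authors' feature selection algorithm; it is a simple greedy filter assumed here for illustration, in which a SNP is kept only if its squared correlation with every previously kept SNP stays below a threshold. The function names, the 0/1/2 genotype coding, the threshold of 0.8, and the SNP identifiers are all hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a SNP with no variation carries no correlation signal
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def select_tag_snps(genotypes, threshold=0.8):
    """Greedy redundancy filter (illustrative assumption, not the published method).

    genotypes maps a SNP id to a list of allele counts (0, 1 or 2), one per
    individual. A SNP is kept only if no already-kept SNP is correlated with
    it at or above the threshold, wherever that SNP lies in the genome.
    """
    kept = []
    for snp, values in genotypes.items():
        if all(r_squared(values, genotypes[k]) < threshold for k in kept):
            kept.append(snp)
    return kept


# Hypothetical toy data: snpB duplicates snpA, snpC is uncorrelated with snpA.
toy = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],
    "snpC": [2, 0, 1, 0, 2, 1],
}
print(select_tag_snps(toy))  # ['snpA', 'snpC']
```

    Because the filter compares every candidate with all kept SNPs rather than only with neighbors in a block, it reflects the abstract's point that redundancy can be removed globally, although the greedy order used here is an arbitrary choice.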

  • The RNA Ontology Consortium: An open invitation to the RNA community RNA-A PUBLICATION OF THE RNA SOCIETY Leontis, N. B., Altman, R. B., Berman, H. M., Brenner, S. E., Brown, J. W., Engelke, D. R., Harvey, S. C., Holbrook, S. R., Jossinet, F., Lewis, S. E., Major, F., Mathews, D. H., Richardson, J. S., Williamson, J. R., Westhof, E. 2006; 12 (4): 533-541


    The aim of the RNA Ontology Consortium (ROC) is to create an integrated conceptual framework, an RNA Ontology (RO), with a common, dynamic, controlled, and structured vocabulary to describe and characterize RNA sequences, secondary structures, three-dimensional structures, and dynamics pertaining to RNA function. The RO should produce tools for clear communication about RNA structure and function for multiple uses, including the integration of RNA electronic resources into the Semantic Web. These tools should allow the accurate description in computer-interpretable form of the coupling between RNA architecture, function, and evolution. The purposes for creating the RO are, therefore, (1) to integrate sequence and structural databases; (2) to allow different computational tools to interoperate; (3) to create powerful software tools that bring advanced computational methods to the bench scientist; and (4) to facilitate precise searches for all relevant information pertaining to RNA. For example, one initial objective of the ROC is to define, identify, and classify RNA structural motifs described in the literature or appearing in databases and to agree on a computer-interpretable definition for each of these motifs. To achieve these aims, the ROC will foster communication and promote collaboration among RNA scientists by coordinating frequent face-to-face workshops to discuss, debate, and resolve difficult conceptual issues. These meeting opportunities will create new directions at various levels of RNA research. The ROC will work closely with the PDB/NDB structural databases and the Gene, Sequence, and Open Biomedical Ontology Consortia to integrate the RO with existing biological ontologies to extend existing content while maintaining interoperability.

    View details for DOI 10.1261/rna.2343206

    View details for Web of Science ID 000236700200001

    View details for PubMedID 16484377

  • Pharmacogenomics: The relevance of emerging genotyping technologies. MLO: medical laboratory observer Hernandez-Boussard, T., Klein, T. E., Altman, R. B. 2006; 38 (3): 24-?

    View details for PubMedID 16610446

  • Drug targets for Plasmodium falciparum: A post-genomic review/survey MINI-REVIEWS IN MEDICINAL CHEMISTRY Yeh, I., Altman, R. B. 2006; 6 (2): 177-202


    Over 300 million cases of malaria each year cause significant morbidity and mortality. Growing drug-resistance among the Plasmodia that cause malaria motivates the development of additional anti-malarial drugs. This review summarizes the current state of knowledge about potential drug targets for malaria. The recently sequenced malaria genome data clarifies parasite metabolic pathways, and more metabolic targets have been identified.

    View details for Web of Science ID 000235327300007

    View details for PubMedID 16472186

  • A call for the creation of personalized medicine databases NATURE REVIEWS DRUG DISCOVERY Gurwitz, D., Lunshof, J. E., Altman, R. B. 2006; 5 (1): 23-26


    The success of the Human Genome Project raised expectations that the knowledge gained would lead to improved insight into human health and disease, identification of new drug targets and, eventually, a breakthrough in healthcare management. However, the realization of these expectations has been hampered by the lack of essential data on genotype–drug-response phenotype associations. We therefore propose a follow-up to the Human Genome Project: forming global consortia devoted to archiving and analysing group and individual patient data on associations between genotypes and drug-response phenotypes. Here, we discuss the rationale for such personalized medicine databases, and the key practical and ethical issues that need to be addressed in their establishment.

    View details for DOI 10.1038/nrd1931

    View details for Web of Science ID 000234555300014

    View details for PubMedID 16374513

  • Physics-based simulation of biological structures 2006 3RD IEEE INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING: MACRO TO NANO, VOLS 1-3 Delp, S. L., Anderson, F. C., Altman, R. B. 2006: 802-803
  • Proceedings of Pacific Symposium on Biocomputing 2006. edited by Altman, R., Dunker, K., Hunter, L. 2006
  • Structural characterization of proteins using residue environments PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS Mooney, S. D., Liang, M. H., DeConde, R., Altman, R. B. 2005; 61 (4): 741-747


    A primary challenge for structural genomics is the automated functional characterization of protein structures. We have developed a sequence-independent method called S-BLEST (Structure-Based Local Environment Search Tool) for the annotation of previously uncharacterized protein structures. S-BLEST encodes the local environment of an amino acid as a vector of structural property values. It has been applied to all amino acids in a nonredundant database of protein structures to generate a searchable structural resource. Given a query amino acid from an experimentally determined or modeled structure, S-BLEST quickly identifies similar amino acid environments using a K-nearest neighbor search. In addition, the method gives an estimation of the statistical significance of each result. We validated S-BLEST on X-ray crystal structures from the ASTRAL 40 nonredundant dataset. We then applied it to 86 crystallographically determined proteins in the protein data bank (PDB) with unknown function and with no significant sequence neighbors in the PDB. S-BLEST was able to associate 20 proteins with at least one local structural neighbor and identify the amino acid environments that are most similar between those neighbors.

    View details for DOI 10.1002/prot.20661

    View details for Web of Science ID 000233691100005

    View details for PubMedID 16245324
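
    The K-nearest neighbor lookup over environment vectors mentioned in the abstract can be sketched in a few lines. The abstract does not specify the structural properties, the distance measure, or the indexing used by S-BLEST, so the brute-force search, the Euclidean distance, and the two-dimensional toy vectors below are assumptions made purely for illustration; the identifiers are hypothetical.

```python
import heapq
import math


def k_nearest(query, database, k=3):
    """Return the k (id, vector) pairs closest to the query vector.

    database maps an environment id to a tuple of property values. The
    distance is Euclidean, which is an assumption of this sketch.
    """
    return heapq.nsmallest(
        k, database.items(), key=lambda item: math.dist(query, item[1])
    )


# Hypothetical environment vectors keyed by made-up residue labels.
environments = {
    "envA": (0.0, 0.0),
    "envB": (1.0, 1.0),
    "envC": (5.0, 5.0),
}
neighbors = k_nearest((0.9, 0.9), environments, k=2)
print([name for name, _ in neighbors])  # ['envB', 'envA']
```

    A linear scan like this costs one distance computation per stored environment; a production search over a full structure database would presumably need an index, which is outside the scope of this sketch.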

  • Time to organize the bioinformatics resourceome PLOS COMPUTATIONAL BIOLOGY Cannata, N., Merelli, E., Altman, R. B. 2005; 1 (7): 531-533

    View details for DOI 10.1371/journal.pcbi.0010076

    View details for Web of Science ID 000239480500002

    View details for PubMedID 16738704

  • Health-information altruists - A potentially critical resource NEW ENGLAND JOURNAL OF MEDICINE Kohane, I. S., Altman, R. B. 2005; 353 (19): 2074-2077

    View details for Web of Science ID 000233119600015

    View details for PubMedID 16282184

  • Using Petri net tools to study properties and dynamics of biological systems JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Peleg, M., Rubin, D., Altman, R. B. 2005; 12 (2): 181-199


    Petri Nets (PNs) and their extensions are promising methods for modeling and simulating biological systems. We surveyed PN formalisms and tools and compared them based on their mathematical capabilities as well as by their appropriateness to represent typical biological processes. We measured the ability of these tools to model specific features of biological systems and answer a set of biological questions that we defined. We found that different tools are required to provide all capabilities that we assessed. We created software to translate a generic PN model into most of the formalisms and tools discussed. We have also made available three models and suggest that a library of such models would catalyze progress in qualitative modeling via PNs. Development and wide adoption of common formats would enable researchers to share models and use different tools to analyze them without the need to convert to proprietary formats.

    View details for DOI 10.1197/jamia.M1637

    View details for Web of Science ID 000227842000009

    View details for PubMedID 15561791
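
    For readers unfamiliar with the formalism, the basic Petri net firing rule can be shown with a minimal sketch. It covers only ordinary place/transition nets, not the extensions or tools surveyed in the paper, and the data layout (a marking as a dictionary of token counts, a transition as a pair of input and output weight dictionaries) and the enzyme example are assumptions chosen for illustration.

```python
def is_enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    inputs, _outputs = transition
    return all(marking.get(place, 0) >= n for place, n in inputs.items())


def fire(marking, transition):
    """Return the new marking after firing; the original marking is unchanged."""
    if not is_enabled(marking, transition):
        raise ValueError("transition is not enabled")
    inputs, outputs = transition
    new_marking = dict(marking)
    for place, n in inputs.items():
        new_marking[place] = new_marking.get(place, 0) - n  # consume tokens
    for place, n in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + n  # produce tokens
    return new_marking


# Hypothetical model of an enzyme E converting substrate S into product P.
catalysis = ({"E": 1, "S": 1}, {"E": 1, "P": 1})
state = {"E": 1, "S": 2, "P": 0}
state = fire(state, catalysis)
print(state)  # {'E': 1, 'S': 1, 'P': 1}
```

    The enzyme token is consumed and returned in the same firing, which is one common way to represent a catalyst in a qualitative model.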

  • SAFA: Semi-automated footprinting analysis software for high-throughput quantification of nucleic acid footprinting experiments RNA-A PUBLICATION OF THE RNA SOCIETY Das, R., Laederach, A., Pearlman, S. M., Herschlag, D., Altman, R. B. 2005; 11 (3): 344-354


    Footprinting is a powerful and widely used tool for characterizing the structure, thermodynamics, and kinetics of nucleic acid folding and ligand binding reactions. However, quantitative analysis of the gel images produced by footprinting experiments is tedious and time-consuming, due to the absence of informatics tools specifically designed for footprinting analysis. We have developed SAFA, a semi-automated footprinting analysis software package that achieves accurate gel quantification while reducing the time to analyze a gel from several hours to 15 min or less. The increase in analysis speed is achieved through a graphical user interface that implements a novel methodology for lane and band assignment, called "gel rectification," and an optimized band deconvolution algorithm. The SAFA software yields results that are consistent with published methodologies and reduces the investigator-dependent variability compared to less automated methods. These software developments simplify the analysis procedure for a footprinting gel and can therefore facilitate the use of quantitative footprinting techniques in nucleic acid laboratories that otherwise might not have considered their use. Further, the increased throughput provided by SAFA may allow a more comprehensive understanding of molecular interactions. The software and documentation are freely available for download at

    View details for DOI 10.1261/rna.7214405

    View details for Web of Science ID 000227190000011

    View details for PubMedID 15701734

  • A statistical approach to scanning the biomedical literature for pharmacogenetics knowledge JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Rubin, D. L., Thorn, C. F., Klein, T. E., Altman, R. B. 2005; 12 (2): 121-129


    Biomedical databases summarize current scientific knowledge, but they generally require years of laborious curation effort to build, focusing on identifying pertinent literature and data in the voluminous biomedical literature. It is difficult to manually extract useful information embedded in the large volumes of literature, and automated intelligent text analysis tools are becoming increasingly essential to assist in these curation activities. The goal of the authors was to develop an automated method to identify articles in Medline citations that contain pharmacogenetics data pertaining to gene-drug relationships. The authors built and evaluated several candidate statistical models that characterize pharmacogenetics articles in terms of word usage and the profile of Medical Subject Headings (MeSH) used in those articles. The best-performing model was used to scan the entire Medline article database (11 million articles) to identify candidate pharmacogenetics articles. A sampling of the articles identified from scanning Medline was reviewed by a pharmacologist to assess the precision of the method. The authors' approach identified 4,892 pharmacogenetics articles in the literature with 92% precision. Their automated method took a fraction of the time to acquire these articles compared with the time expected to be taken to accumulate them manually. The authors have built a Web resource to provide access to their results. A statistical classification approach can screen the primary literature for pharmacogenetics articles with high precision. Such methods may assist curators in acquiring pertinent literature in building biomedical databases.

    View details for DOI 10.1197/jamia.M1640

    View details for Web of Science ID 000227842000003

    View details for PubMedID 15561790

  • Biomedical term mapping databases NUCLEIC ACIDS RESEARCH Wren, J. D., Chang, J. T., Pustejovsky, J., Adar, E., Garner, H. R., Altman, R. B. 2005; 33: D289-D293


    Longer words and phrases are frequently mapped onto a shorter form such as abbreviations or acronyms for efficiency of communication. These abbreviations are pervasive in all aspects of biology and medicine and as the amount of biomedical literature grows, so does the number of abbreviations and the average number of definitions per abbreviation. Even more confusing, different authors will often abbreviate the same word/phrase differently. This ambiguity impedes our ability to retrieve information, integrate databases and mine textual databases for content. Efforts to standardize nomenclature, especially those doing so retrospectively, need to be aware of different abbreviatory mappings and spelling variations. To address this problem, there have been several efforts to develop computer algorithms to identify the mapping of terms between short and long form within a large body of literature. To date, four such algorithms have been applied to create online databases that comprehensively map biomedical terms and abbreviations within MEDLINE: ARGH, the Stanford Biomedical Abbreviation Server, AcroMed and SaRAD. In addition to serving as useful computational tools, these databases serve as valuable references that help biologists keep up with an ever-expanding vocabulary of terms.

    View details for DOI 10.1093/nar/gki137

    View details for Web of Science ID 000226524300059

    View details for PubMedID 15608198

  • Challenges in creating an infrastructure for physics-based simulation of biological structures Altman, R. B. IEEE COMPUTER SOC. 2005: 3-3
  • Proceedings of Pacific Symposium on Biocomputing 2005. edited by Altman, R., Dunker, K., Hunter, L. 2005
  • Choosing SNPs Using Feature Selection. Phuong, T., Lin, Z., Altman, R. 2005
  • Introduction to ontologies in biomedicine: from powertools to assistants. In Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics. Altman, R. Wiley Online Library. 2005: 1
  • PharmGKB: The Pharmacogenetics and Pharmacogenomics Knowledge Base. Pharmacogenomics: Methods and Applications Thorn, C., Klein, T., Altman, R. edited by Innocenti, F. Totowa: Humana Press. 2005: 177-192
  • PharmGKB: The Pharmacogenetics and Pharmacogenomics Knowledge Base. Thorn, C., Klein, T., Altman, R. edited by Innocenti, F. 2005
  • Choosing SNPs using feature selection 2005 IEEE COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE, PROCEEDINGS Phuong, T. M., Lin, Z., Altman, R. B. 2005: 301-309


    A major challenge for genomewide disease association studies is the high cost of genotyping large numbers of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, able to capture most variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than by using block-based methods.

    View details for Web of Science ID 000231800100034

    View details for PubMedID 16447987

  • PharmGKB: the pharmacogenetics and pharmacogenomics knowledge base. Methods in molecular biology (Clifton, N.J.) Thorn, C. F., Klein, T. E., Altman, R. B. 2005; 311: 179-191


    The Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) is an interactive tool for researchers investigating how genetic variation affects drug response. The PharmGKB web site displays genotype, molecular, and clinical primary data integrated with literature, pathway representations, protocol information, and links to additional external resources. Users can search and browse the knowledge base by genes, drugs, diseases, and pathways. Registration is free to the entire research community but subject to an agreement to respect the rights and privacy of the individuals whose information is contained within the database. Registered users can access and download primary data to aid in the design of future pharmacogenetics and pharmacogenomics studies.

    View details for PubMedID 16100408

  • Finding haplotype tagging SNPs by use of principal components analysis AMERICAN JOURNAL OF HUMAN GENETICS Lin, Z., Altman, R. B. 2004; 75 (5): 850-861


    The immense volume and rapid growth of human genomic data, especially single nucleotide polymorphisms (SNPs), present special challenges for both biomedical researchers and automatic algorithms. One such challenge is to select an optimal subset of SNPs, commonly referred to as "haplotype tagging SNPs" (htSNPs), to capture most of the haplotype diversity of each haplotype block or gene-specific region. This information-reduction process facilitates cost-effective genotyping and, subsequently, genotype-phenotype association studies. It also has implications for assessing the risk of identifying research subjects on the basis of SNP information deposited in public domain databases. We have investigated methods for selecting htSNPs by use of principal components analysis (PCA). These methods first identify eigenSNPs and then map them to actual SNPs. We evaluated two mapping strategies, greedy discard and varimax rotation, by assessing the ability of the selected htSNPs to reconstruct genotypes of non-htSNPs. We also compared these methods with two other htSNP finders, one of which is PCA based. We applied these methods to three experimental data sets and found that the PCA-based methods tend to select the smallest set of htSNPs to achieve a 90% reconstruction precision.

    View details for Web of Science ID 000224303500010

    View details for PubMedID 15389393

  • Computational functional genomics IEEE SIGNAL PROCESSING MAGAZINE Liang, M. P., Troyanskaya, O. G., Laederach, A., Brutlag, D. L., Altman, R. B. 2004; 21 (6): 62-69
  • Tools for loading MEDLINE into a local relational database BMC BIOINFORMATICS Oliver, D. E., Bhalotia, G., Schwartz, A. S., Altman, R. B., Hearst, M. A. 2004; 5


    Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy of MEDLINE that they can manage locally. The National Library of Medicine (NLM) distributes MEDLINE in eXtensible Markup Language (XML)-formatted text files, but it is difficult to query MEDLINE in that format. We have developed software tools to parse the MEDLINE data files and load their contents into a relational database. Although the task is conceptually straightforward, the size and scope of MEDLINE make the task nontrivial. Given the increasing importance of text analysis in biology and medicine, we believe a local installation of MEDLINE will provide helpful computing infrastructure for researchers. We developed three software packages that parse and load MEDLINE, and ran each package to install separate instances of the MEDLINE database. For each installation, we collected data on loading time and disk-space utilization to provide examples of the process in different settings. Settings differed in terms of commercial database-management system (IBM DB2 or Oracle 9i), processor (Intel or Sun), programming language of installation software (Java or Perl), and methods employed in different versions of the software. The loading times for the three installations were 76 hours, 196 hours, and 132 hours, and disk-space utilization was 46.3 GB, 37.7 GB, and 31.6 GB, respectively. Loading times varied due to a variety of differences among the systems. Loading time also depended on whether data were written to intermediate files or not, and on whether input files were processed in sequence or in parallel. Disk-space utilization depended on the number of MEDLINE files processed, amount of indexing, and whether abstracts were stored as character large objects or truncated. Relational database (RDBMS) technology supports indexing and querying of very large datasets, and can accommodate a locally stored version of MEDLINE. RDBMS systems support a wide range of queries and facilitate certain tasks that are not directly supported by the application programming interface to PubMed. Because there is variation in hardware, software, and network infrastructures across sites, we cannot predict the exact time required for a user to load MEDLINE, but our results suggest that performance of the software is reasonable. Our database schemas and conversion software are publicly available at

    View details for DOI 10.1186/1471-2105-5-146

    View details for Web of Science ID 000225769500002

    View details for PubMedID 15471541

  • Approaches for protecting privacy in the genomic era GENETIC ENGINEERING NEWS Lin, Z., Owen, A. B., Altman, R. B. 2004; 24 (17): 6-?
  • Extracting and characterizing gene-drug relationships from the literature PHARMACOGENETICS Chang, J. T., Altman, R. B. 2004; 14 (9): 577-586


    A fundamental task of pharmacogenetics is to collect and classify relationships between genes and drugs. Currently, this useful information has not been comprehensively aggregated in any database and remains scattered throughout the published literature. Although there are efforts to collect this information manually, they are limited by the size of the published literature on gene-drug relationships. Therefore, we investigated computational methods to extract and characterize pharmacogenetic relationships between genes and drugs from the literature. We first evaluated the effectiveness of the co-occurrence method in identifying related genes and drugs. We then used supervised machine learning algorithms to classify the relationships between genes and drugs from the Pharmacogenetics and Pharmacogenomics Knowledge Base (PharmGKB) into five categories that have been defined by active pharmacogenetic researchers as relevant to their work. The final co-occurrence algorithm was able to extract 78% of the related genes and drugs that were published in a review article from the literature. Our algorithm subsequently classified the relationships between genes and drugs from the PharmGKB into five categories with 74% accuracy. We have made the data available on a supplementary website. Gene-drug relationships can be accurately extracted from text and classified into categories. Although the relationships that we have identified do not capture the details and fine distinctions often made in the literature, these methods will help scientists to track the ever-growing literature and create information resources to support future discoveries.

    View details for Web of Science ID 000224107300002

    View details for PubMedID 15475731
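
    A co-occurrence method of the general kind described in the abstract can be sketched as a simple count of how often a gene name and a drug name appear in the same unit of text. The abstract does not say which text unit, name lists, or scoring the authors used, so the sentence-level window, the exact-token matching, and the function name below are assumptions for illustration only.

```python
import re
from collections import Counter


def cooccurrence_counts(sentences, genes, drugs):
    """Count sentences in which a gene name and a drug name both appear.

    Matching is case-insensitive on whole alphanumeric tokens. Using the
    sentence as the co-occurrence window is an assumption of this sketch.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        present_genes = [g for g in genes if g.lower() in tokens]
        present_drugs = [d for d in drugs if d.lower() in tokens]
        for gene in present_genes:
            for drug in present_drugs:
                counts[(gene, drug)] += 1
    return counts


# Hypothetical mini-corpus.
corpus = [
    "CYP2D6 variants alter codeine metabolism.",
    "Warfarin dosing depends on CYP2C9.",
    "Codeine is an opioid.",
]
pairs = cooccurrence_counts(corpus, ["CYP2D6", "CYP2C9"], ["codeine", "warfarin"])
print(sorted(pairs.items()))
# [(('CYP2C9', 'warfarin'), 1), (('CYP2D6', 'codeine'), 1)]
```

    Raw counts such as these say only that a gene and a drug are mentioned together; classifying the type of relationship is the separate supervised learning step that the abstract describes.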

  • Genomic research and human subject privacy SCIENCE Lin, Z., Owen, A. B., Altman, R. B. 2004; 305 (5681): 183-183

    View details for Web of Science ID 000222501000030

    View details for PubMedID 15247459

  • An "omics" view of drug development DRUG DEVELOPMENT RESEARCH Altman, R. B., Rubin, D. L., Klein, T. E. 2004; 62 (2): 81-85

    View details for DOI 10.1002/ddr.10370

    View details for Web of Science ID 000225497400003

  • Training the next generation of informaticians: The impact of "BISTI" and bioinformatics - A report from the American College of Medical Informatics JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Friedman, C. P., Altman, R. B., Kohane, I. S., McCormick, K. A., Miller, P. L., Ozbolt, J. G., Shortliffe, E. H., Stormo, G. D., Szczepaniak, M. C., Tuck, D., Williamson, J. 2004; 11 (3): 167-172


    In 2002-2003, the American College of Medical Informatics (ACMI) undertook a study of the future of informatics training. This project capitalized on the rapidly expanding interest in the role of computation in basic biological research, well characterized in the National Institutes of Health (NIH) Biomedical Information Science and Technology Initiative (BISTI) report. The defining activity of the project was the three-day 2002 Annual Symposium of the College. A committee, comprised of the authors of this report, subsequently carried out activities, including interviews with a broader informatics and biological sciences constituency, collation and categorization of observations, and generation of recommendations. The committee viewed biomedical informatics as an interdisciplinary field, combining basic informational and computational sciences with application domains, including health care, biological research, and education. Consequently, effective training in informatics, viewed from a national perspective, should encompass four key elements: (1) curricula that integrate experiences in the computational sciences and application domains rather than just concatenating them; (2) diversity among trainees, with individualized, interdisciplinary cross-training allowing each trainee to develop key competencies that he or she does not initially possess; (3) direct immersion in research and development activities; and (4) exposure across the wide range of basic informational and computational sciences. Informatics training programs that implement these features, irrespective of their funding sources, will meet and exceed the challenges raised by the BISTI report, and optimally prepare their trainees for careers in a field that continues to evolve.

    View details for Web of Science ID 000221546700001

    View details for PubMedID 14764617

  • Computational analysis of Plasmodium falciparum metabolism: Organizing genomic information to facilitate drug discovery GENOME RESEARCH Yeh, W., Hanekamp, T., Tsoka, S., Karp, P. D., Altman, R. B. 2004; 14 (5): 917-924


    Identification of novel targets for the development of more effective antimalarial drugs and vaccines is a primary goal of the Plasmodium genome project. However, deciding which gene products are ideal drug/vaccine targets remains a difficult task. Currently, a systematic disruption of every single gene in Plasmodium is technically challenging. Hence, we have developed a computational approach to prioritize potential targets. A pathway/genome database (PGDB) integrates pathway information with information about the complete genome of an organism. We have constructed PlasmoCyc, a PGDB for Plasmodium falciparum 3D7, using its annotated genomic sequence. In addition to the annotations provided in the genome database, we add 956 additional annotations to proteins annotated as "hypothetical" using the GeneQuiz annotation system. We apply a novel computational algorithm to PlasmoCyc to identify 216 "chokepoint enzymes." All three clinically validated drug targets are chokepoint enzymes. A total of 87.5% of proposed drug targets with biological evidence in the literature are chokepoint reactions. Therefore, identifying chokepoint enzymes represents one systematic way to identify potential metabolic drug targets.

    View details for DOI 10.1101/gr.2050304

    View details for Web of Science ID 000221171700016

    View details for PubMedID 15078855
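
    The abstract does not define a chokepoint, so the sketch below assumes the commonly cited definition: a reaction that is the only consumer of some substrate or the only producer of some product in the network. The function name, the reaction representation, and the toy network are hypothetical, and this is an illustration of the concept rather than the algorithm applied to PlasmoCyc.

```python
from collections import defaultdict


def find_chokepoints(reactions):
    """Return sorted ids of reactions that uniquely consume or produce a metabolite.

    reactions maps a reaction id to a (substrates, products) pair of
    iterables of metabolite names. The definition of a chokepoint used
    here is an assumption stated in the accompanying text.
    """
    consumers = defaultdict(set)  # metabolite -> reactions that consume it
    producers = defaultdict(set)  # metabolite -> reactions that produce it
    for rid, (substrates, products) in reactions.items():
        for metabolite in substrates:
            consumers[metabolite].add(rid)
        for metabolite in products:
            producers[metabolite].add(rid)

    chokepoints = set()
    for table in (consumers, producers):
        for rids in table.values():
            if len(rids) == 1:
                chokepoints |= rids
    return sorted(chokepoints)


# Hypothetical network: R1 and R2 are redundant routes from A to B,
# while R3 is the only reaction that consumes B and the only one that makes C.
network = {
    "R1": (["A"], ["B"]),
    "R2": (["A"], ["B"]),
    "R3": (["B"], ["C"]),
}
print(find_chokepoints(network))  # ['R3']
```

    Under this definition, blocking R3 would leave B with no outlet and C with no source, which is the intuition behind treating such enzymes as candidate drug targets.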

  • Eukaryotic regulatory element conservation analysis and identification using comparative genomics GENOME RESEARCH Liu, Y. Y., Liu, X. S., Wei, L. P., Altman, R. B., Batzoglou, S. 2004; 14 (3): 451-458


    Comparative genomics is a promising approach to the challenging problem of eukaryotic regulatory element identification, because functional noncoding sequences may be conserved across species from evolutionary constraints. We systematically analyzed known human and Saccharomyces cerevisiae regulatory elements and discovered that human regulatory elements are more conserved between human and mouse than are background sequences. Although S. cerevisiae regulatory elements do not appear to be more conserved by comparison of S. cerevisiae to Schizosaccharomyces pombe, they are more conserved when compared with multiple other yeast genomes (Saccharomyces paradoxus, Saccharomyces mikatae, and Saccharomyces bayanus). Based on these analyses, we developed a sequence-motif-finding algorithm called CompareProspector, which extends Gibbs sampling by biasing the search in regions conserved across species. Using human-mouse comparison, CompareProspector identified known motifs for transcription factors Mef2, Myf, Srf, and Sp1 from a set of human-muscle-specific genes. It also discovered the NFAT motif from genes up-regulated by CD28 stimulation in T-cells, which implies the direct involvement of NFAT in mediating the CD28 stimulatory signal. Using Caenorhabditis elegans-Caenorhabditis briggsae comparison, CompareProspector found the PHA-4 motif and the UNC-86 motif. CompareProspector outperformed many other computational motif-finding programs, demonstrating the power of comparative genomics-based biased sampling in eukaryotic regulatory element identification.

    View details for Web of Science ID 000189389100013

    View details for PubMedID 14993210

  • Editorial: Building successful biological databases BRIEFINGS IN BIOINFORMATICS Altman, R. B. 2004; 5 (1): 4-5

    View details for Web of Science ID 000222244300001

    View details for PubMedID 15153301

  • GAPSCORE: finding gene and protein names one word at a time BIOINFORMATICS Chang, J. T., Schutze, H., Altman, R. B. 2004; 20 (2): 216-225


    New high-throughput technologies have accelerated the accumulation of knowledge about genes and proteins. However, much knowledge is still stored as written natural language text. Therefore, we have developed a new method, GAPSCORE, to identify gene and protein names in text. GAPSCORE scores words based on a statistical model of gene names that quantifies their appearance, morphology and context. We evaluated GAPSCORE against the Yapex data set and achieved an F-score of 82.5% (83.3% recall, 81.5% precision) for partial matches and 57.6% (58.5% recall, 56.7% precision) for exact matches. Since the method is statistical, users can choose score cutoffs that adjust the performance according to their needs. GAPSCORE is available at

    View details for DOI 10.1093/bioinformatics/btg393

    View details for Web of Science ID 000188389700012

    View details for PubMedID 14734313
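The F-scores quoted in the GAPSCORE abstract can be related to the reported precision and recall with a short calculation. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function name `f_score` is only illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the abstract (already rounded there).
exact = f_score(precision=0.567, recall=0.585)
partial = f_score(precision=0.815, recall=0.833)

print(f"exact matches:   {100 * exact:.1f}%")
print(f"partial matches: {100 * partial:.1f}%")
```

Because the inputs are the rounded percentages from the abstract, the partial-match value comes out about 0.1 percentage points below the published 82.5%; the exact-match value agrees with the published 57.6%.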

  • Using surface envelopes for discrimination of molecular models PROTEIN SCIENCE Dugan, J. M., Altman, R. B. 2004; 13 (1): 15-24


    Shape information about macromolecules is increasingly available but is difficult to use in modeling efforts. We demonstrate that shape information alone can often distinguish structural models of biological macromolecules. By using a data structure called a surface envelope (SE) to represent the shape of the molecule, we propose a method that generates a fitness score for the shape of a particular molecular model. This score correlates well with root mean squared deviation (RMSD) of the model to the known test structures and can be used to filter models in decoy sets. The scoring method requires both alignment of the model to the SE in three-dimensional space and assessment of the degree to which atoms in the model fill the SE. Alignment combines a hybrid algorithm using principal components and a previously published iterated closest point algorithm. We test our method against models generated from random atom perturbation from crystal structures, published decoy sets used in structure prediction, and models created from the trajectories of atoms in molecular modeling runs. We also test our alignment algorithm against experimental electron microscopic data from rice dwarf virus. The alignment performance is reliable, and we show a high correlation between model RMSD and score function. This correlation is stronger for molecular models with greater oblong character (as measured by the ratio of largest to smallest principal component).

    View details for DOI 10.1110/ps.03385504

    View details for Web of Science ID 000187587700002

    View details for PubMedID 14691217

  • Modeling and analyzing biomedical processes using workflow/Petri Net models and tools MEDINFO 2004: PROCEEDINGS OF THE 11TH WORLD CONGRESS ON MEDICAL INFORMATICS, PT 1 AND 2 Peleg, M., Tu, S., Manindroo, A., Altman, R. B. 2004; 107: 74-78


    Computer simulation enables system developers to execute a model of an actual or theoretical system on a computer and analyze the execution output. We have been exploring the use of Petri Net (PN) tools to study the behavior of systems that are represented using three kinds of biomedical models: a biological workflow model used to represent biological processes, and two different computer-interpretable models of health care processes that are derived from clinical guidelines. We developed and implemented software that maps the three models into a single underlying process model (workflow), which is then converted into PNs in formats that are readable by several PN simulation and analysis tools. These analysis tools enabled us to simulate and study the behavior of two biomedical systems: a Malaria parasite invading a host cell, and patients undergoing management of chronic cough.

    View details for Web of Science ID 000226723300016

    View details for PubMedID 15360778
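To make the Petri Net idea in the abstract above concrete, here is a minimal token-firing sketch. It is not the authors' software or their workflow model: the place and transition names are hypothetical, every arc is assumed to carry one token, and the two-step "invasion" net is a toy illustration only.

```python
from typing import Dict, List, Tuple

Marking = Dict[str, int]
# transition name -> (input places, output places); one token per listed arc
Transitions = Dict[str, Tuple[List[str], List[str]]]


def enabled(marking: Marking, transitions: Transitions) -> List[str]:
    """Return the transitions whose input places all hold enough tokens."""
    return [
        name
        for name, (inputs, _) in transitions.items()
        if all(marking.get(p, 0) >= inputs.count(p) for p in set(inputs))
    ]


def fire(marking: Marking, transitions: Transitions, name: str) -> Marking:
    """Fire one enabled transition and return the new marking."""
    if name not in enabled(marking, transitions):
        raise ValueError(f"transition {name!r} is not enabled")
    inputs, outputs = transitions[name]
    new_marking = dict(marking)
    for place in inputs:
        new_marking[place] -= 1          # consume a token from each input
    for place in outputs:
        new_marking[place] = new_marking.get(place, 0) + 1  # produce outputs
    return new_marking


# Hypothetical toy net: a parasite attaches to a host cell, then invades it.
transitions: Transitions = {
    "attach": (["parasite_free", "host_cell"], ["parasite_attached"]),
    "invade": (["parasite_attached"], ["cell_invaded"]),
}
marking: Marking = {"parasite_free": 1, "host_cell": 1}

trace: List[str] = []
while True:
    ready = enabled(marking, transitions)
    if not ready:
        break
    marking = fire(marking, transitions, ready[0])
    trace.append(ready[0])

print(trace, marking)
```

Simulation tools of the kind mentioned in the abstract add much more (timing, colored tokens, reachability analysis); the sketch only shows the basic enable-and-fire rule that such analyses build on.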

  • Building successful biological databases. Brief Bioinform. Altman, R. 2004; 1 (5): 4-5
  • Proceedings of Pacific Symposium on Biocomputing 2004. edited by Altman, R., Dunker, K., Hunter, L. 2004
  • A resource to acquire and summarize pharmacogenetics knowledge in the literature MEDINFO 2004: PROCEEDINGS OF THE 11TH WORLD CONGRESS ON MEDICAL INFORMATICS, PT 1 AND 2 Rubin, D. L., Carrillo, M., Woon, M., Conroy, J., Klein, T. E., Altman, R. B. 2004; 107: 793-797


    To determine how genetic variations contribute to the variations in drug response, we need to know the genes that are related to drugs of interest. But there are no publicly available databases of known gene-drug relationships, and it is time-consuming to search the literature for this information. We have developed a resource to support the storage, summarization, and dissemination of key gene-drug interactions of relevance to pharmacogenetics. Extracting all gene-drug relationships from the literature is a daunting task, so we distributed a tool to acquire this knowledge from the scientific community. We also developed a categorization scheme to classify gene-drug relationships according to the type of pharmacogenetic evidence that supports them. Our resource can be queried by gene or drug, and it summarizes gene-drug relationships, categories of evidence, and supporting literature. This resource is growing, containing entries for 138 genes and 215 drugs of pharmacogenetic significance, and is a core component of PharmGKB, a pharmacogenetics knowledge base.

    View details for Web of Science ID 000226723300159

    View details for PubMedID 15360921

  • PharmGKB: the pharmacogenetics and pharmacogenomics knowledge base PHARMACOGENOMICS JOURNAL Klein, T. E., Altman, R. B. 2004; 4 (1): 1-1

    View details for DOI 10.1038/sj.tpj.6500230

    View details for Web of Science ID 000220143500001

    View details for PubMedID 14735107

  • Ribosomal dynamics inferred from variations in experimental measurements RNA-A PUBLICATION OF THE RNA SOCIETY Gabashvili, I. S., Whirl-Carrillo, M., Bada, M., Banatao, D. R., Altman, R. B. 2003; 9 (11): 1301-1307


    The crystal structures of the ribosome reveal remarkable complexity and provide a starting set of snapshots with which to understand the dynamics of translation. To augment the static crystallographic models with dynamic information present in crosslink, footprint, and cleavage data, we examined 2691 proximity measurements and focused on the subset that was apparently incompatible with >40 published crystal structures. The measurements from this subset generally involve regions of the structure that are functionally conserved and structurally flexible. Local movements in the crystallographic states of the ribosome that would satisfy biochemical proximity measurements show coherent patterns suggesting alternative conformations of the ribosome. Three different types of data obtained for the two subunits display similar "mismatching" patterns, suggesting that the signals are robust and real. In particular, there is an indication of coherent motion in the decoding region within the 30S subunit and central protuberance and surrounding areas of the 50S subunit. Directions of rearrangements fluctuate around the proposed path of tRNA translocation and the plane parallel to the interface of the two subunits. Our results demonstrate that systematic combination and analysis of noisy, apparently incompatible data sources can provide biologically useful signals about structural dynamics.

    View details for Web of Science ID 000186175900001

    View details for PubMedID 14561879

  • MutDB: annotating human variation with functionally relevant data BIOINFORMATICS Mooney, S. D., Altman, R. B. 2003; 19 (14): 1858-1860


    We have developed a resource, MutDB, to aid in determining which single nucleotide polymorphisms (SNPs) are likely to alter the function of their associated protein product. MutDB contains protein structure annotations and comparative genomic annotations for 8000 disease-associated mutations and SNPs found in the UCSC Annotated Genome and the human RefSeq gene set. MutDB provides interactive mutation maps at the gene and protein levels, and allows for ranking of their predicted functional consequences based on conservation in multiple sequence alignments.

    View details for DOI 10.1093/bioinformatics/btg241

    View details for Web of Science ID 000185701100022

    View details for PubMedID 14512363
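The MutDB abstract mentions ranking variants by conservation in multiple sequence alignments but does not give the scoring formula. As a rough illustration only, the sketch below uses a stand-in measure (the fraction of aligned sequences that share the most common residue in a column); the function names and the toy alignment are hypothetical and are not taken from MutDB.

```python
from collections import Counter
from typing import List


def column_conservation(alignment: List[str], column: int) -> float:
    """Fraction of non-gap residues in a column that match the most common one."""
    residues = [seq[column] for seq in alignment if seq[column] != "-"]
    if not residues:
        return 0.0
    most_common_count = Counter(residues).most_common(1)[0][1]
    return most_common_count / len(residues)


def rank_positions(alignment: List[str], positions: List[int]) -> List[int]:
    """Order positions from most to least conserved (ties keep input order)."""
    return sorted(
        positions,
        key=lambda col: column_conservation(alignment, col),
        reverse=True,
    )


# Toy alignment of four equal-length sequences (hypothetical data).
alignment = ["MKTA", "MKSA", "MRTG", "MKLC"]

ranking = rank_positions(alignment, [2, 0, 1])
print(ranking)
```

Under this stand-in measure, a variant at a highly conserved column would be ranked as more likely to matter functionally; real tools typically use more refined scores, for example entropy-based or substitution-matrix-weighted measures.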

  • Investigating hypoxic tumor physiology through gene expression patterns ONCOGENE Denko, N. C., Fontana, L. A., Hudson, K. M., Sutphin, P. D., Raychaudhuri, S., Altman, R., Giaccia, A. J. 2003; 22 (37): 5907-5914


    Clinical evidence shows that tumor hypoxia is an independent prognostic indicator of poor patient outcome. Hypoxic tumors have altered physiologic processes, including increased regions of angiogenesis, increased local invasion, increased distant metastasis and altered apoptotic programs. Since hypoxia is a potent controller of gene expression, identifying hypoxia-regulated genes is a means to investigate the molecular response to hypoxic stress. Traditional experimental approaches have identified physiologic changes in hypoxic cells. Recent studies have identified hypoxia-responsive genes that may define the mechanism(s) underlying these physiologic changes. For example, the regulation of glycolytic genes by hypoxia can explain some characteristics of the Warburg effect. The converse of this logic is also true. By identifying new classes of hypoxia-regulated gene(s), we can infer the physiologic pressures that require the induction of these genes and their protein products. Furthermore, these physiologically driven hypoxic gene expression changes give us insight as to the poor outcome of patients with hypoxic tumors. Approximately 1-1.5% of the genome is transcriptionally responsive to hypoxia. However, there is significant heterogeneity in the transcriptional response to hypoxia between different cell types. Moreover, the coordinated change in the expression of families of genes supports the model of physiologic pressure leading to expression changes. Understanding the evolutionary pressure to develop a 'hypoxic response' provides a framework to investigate the biology of the hypoxic tumor microenvironment.

    View details for DOI 10.1038/sj.onc.1206703

    View details for Web of Science ID 000185086100017

    View details for PubMedID 12947397

  • Large scale study of protein domain distribution in the context of alternative splicing NUCLEIC ACIDS RESEARCH Liu, S., Altman, R. B. 2003; 31 (16): 4828-4835


    Alternative splicing plays an important role in processes such as development, differentiation and cancer. With the recent increase in the estimates of the number of human genes that undergo alternative splicing from 5 to 35-59%, it is becoming critical to develop a better understanding of its functional consequences and regulatory mechanisms. We conducted a large scale study of the distribution of protein domains in a curated data set of several thousand genes and identified protein domains disproportionately distributed among alternatively spliced genes. We also identified a number of protein domains that tend to be spliced out. Both the proteins having the disproportionately distributed domains as well as those with spliced-out domains are predominantly involved in the processes of cell communication, signaling, development and apoptosis. These proteins function mostly as enzymes, signal transducers and receptors. Somewhat surprisingly, 28% of all occurrences of spliced-out domains are not effected by straightforward exclusion of exons coding for the domains but by inclusion or exclusion of other exons to shift the reading frame while retaining the exons coding for the domains in the final transcripts.

    View details for DOI 10.1093/nar/gkg668

    View details for Web of Science ID 000184783000020

    View details for PubMedID 12907725

  • The computational analysis of scientific literature to define and recognize gene expression clusters NUCLEIC ACIDS RESEARCH Raychaudhuri, S., Chang, J. T., Imam, F., Altman, R. B. 2003; 31 (15): 4553-4560


    A limitation of many gene expression analytic approaches is that they do not incorporate comprehensive background knowledge about the genes into the analysis. We present a computational method that leverages the peer-reviewed literature in the automatic analysis of gene expression data sets. Including the literature in the analysis of gene expression data offers an opportunity to incorporate functional information about the genes when defining expression clusters. We have created a method that associates gene expression profiles with known biological functions. Our method has two steps. First, we apply hierarchical clustering to the given gene expression data set. Secondly, we use text from abstracts about genes to (i) resolve hierarchical cluster boundaries to optimize the functional coherence of the clusters and (ii) recognize those clusters that are most functionally coherent. In the case where a gene has not been investigated and therefore lacks primary literature, articles about well-studied homologous genes are added as references. We apply our method to two large gene expression data sets with different properties. The first contains measurements for a subset of well-studied Saccharomyces cerevisiae genes with multiple literature references, and the second contains newly discovered genes in Drosophila melanogaster; many have no literature references at all. In both cases, we are able to rapidly define and identify the biologically relevant gene expression profiles without manual intervention. In both cases, we identified novel clusters that were not noted by the original investigators.

    View details for DOI 10.1093/nar/gkg636

    View details for Web of Science ID 000184532900040

    View details for PubMedID 12888516

  • Microenvironment analysis and identification of magnesium binding sites in RNA NUCLEIC ACIDS RESEARCH Banatao, D. R., Altman, R. B., Klein, T. E. 2003; 31 (15): 4450-4460


    Interactions with magnesium (Mg2+) ions are essential for RNA folding and function. The locations and function of bound Mg2+ ions are difficult to characterize both experimentally and computationally. In particular, the P456 domain of the Tetrahymena thermophila group I intron, and a 58 nt 23S rRNA from Escherichia coli have been important systems for studying the role of Mg2+ binding in RNA, but characteristics of all the binding sites remain unclear. We therefore investigated the Mg2+ binding capabilities of these RNA systems using a computational approach to identify and further characterize their Mg2+ binding sites. The approach is based on the FEATURE algorithm, reported previously for microenvironment analysis of protein functional sites. We have determined novel physicochemical descriptions of site-bound and diffusely bound Mg2+ ions in RNA that are useful for prediction. Electrostatic calculations using the Non-Linear Poisson Boltzmann (NLPB) equation provided further evidence for the locations of site-bound ions. We confirmed the locations of experimentally determined sites and further differentiated between classes of ion binding. We also identified potentially important, high scoring sites in the group I intron that are not currently annotated as Mg2+ binding sites. We note their potential function and believe they deserve experimental follow-up.

    View details for DOI 10.1093/nar/gkg471

    View details for Web of Science ID 000184532900029

    View details for PubMedID 12888505

  • A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae) PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Troyanskaya, O. G., Dolinski, K., Owen, A. B., Altman, R. B., Botstein, D. 2003; 100 (14): 8348-8353


    Genomic sequencing is no longer a novelty, but gene function annotation remains a key challenge in modern biology. A variety of functional genomics experimental techniques are available, from classic methods such as affinity precipitation to advanced high-throughput techniques such as gene expression microarrays. In the future, more disparate methods will be developed, further increasing the need for integrated computational analysis of data generated by these studies. We address this problem with MAGIC (Multisource Association of Genes by Integration of Clusters), a general framework that uses formal Bayesian reasoning to integrate heterogeneous types of high-throughput biological data (such as large-scale two-hybrid screens and multiple microarray analyses) for accurate gene function prediction. The system formally incorporates expert knowledge about relative accuracies of data sources to combine them within a normative framework. MAGIC provides a belief level with its output that allows the user to vary the stringency of predictions. We applied MAGIC to Saccharomyces cerevisiae genetic and physical interactions, microarray, and transcription factor binding sites data and assessed the biological relevance of gene groupings using Gene Ontology annotations produced by the Saccharomyces Genome Database. We found that by creating functional groupings based on heterogeneous data types, MAGIC improved accuracy of the groupings compared with microarray analysis alone. We describe several of the biological gene groupings identified.

    View details for DOI 10.1073/pnas.0832373100

    View details for Web of Science ID 000184222500057

    View details for PubMedID 12826619

  • Inclusion of textual documentation in the analysis of multidimensional data sets: Application to gene expression data MACHINE LEARNING Raychaudhuri, S., Schutze, H., Altman, R. B. 2003; 52 (1-2): 119-145
  • WebFEATURE: an interactive web tool for identifying and visualizing functional sites on macromolecular structures NUCLEIC ACIDS RESEARCH Liang, M. P., Banatao, D. R., Klein, T. E., Brutlag, D. L., Altman, R. B. 2003; 31 (13): 3324-3327


    WebFEATURE is a web-accessible structural analysis tool that allows users to scan query structures for functional sites in both proteins and nucleic acids. WebFEATURE is the public interface to the scanning algorithm of the FEATURE package, a supervised learning algorithm for creating and identifying 3D, physicochemical motifs in molecular structures. Given an input structure or Protein Data Bank identifier (PDB ID), and a statistical model of a functional site, WebFEATURE will return rank-scored 'hits' in 3D space that identify regions in the structure where similar distributions of physicochemical properties occur relative to the site model. Users can visualize and interactively manipulate scored hits and the query structure in web browsers that support the Chime plug-in. Alternatively, results can be downloaded and visualized through other freely available molecular modeling tools, like RasMol, PyMOL and Chimera. A major application of WebFEATURE is in rapid annotation of function to structures in the context of structural genomics.

    View details for DOI 10.1093/nar/gkg553

    View details for Web of Science ID 000183832900010

    View details for PubMedID 12824318

  • Identification of promoter regions in the human genome by using a retroviral plasmid library-based functional reporter gene assay GENOME RESEARCH Khambata-Ford, S., Liu, Y. Y., Gleason, C., Dickson, M., Altman, R. B., Batzoglou, S., Myers, R. M. 2003; 13 (7): 1765-1774


    Attempts to identify regulatory sequences in the human genome have involved experimental and computational methods such as cross-species sequence comparisons and the detection of transcription factor binding-site motifs in coexpressed genes. Although these strategies provide information on which genomic regions are likely to be involved in gene regulation, they do not give information on their functions. We have developed a functional selection for promoter regions in the human genome that uses a retroviral plasmid library-based system. This approach enriches for and detects promoter function of isolated DNA fragments in an in vitro cell culture assay. By using this method, we have discovered likely promoters of known and predicted genes, as well as many other putative promoter regions based on the presence of features such as CpG islands. Comparison of sequences of 858 plasmid clones selected by this assay with the human genome draft sequence indicates that a significantly higher percentage of sequences align to the 500-bp segment upstream of the transcription start sites of known genes than would be expected from random genomic sequences. We also observed enrichment for putative promoter regions of genes predicted in at least two annotation databases and for clones overlapping with CpG islands. Functional validation of randomly selected clones enriched by this method showed that a large fraction of these putative promoters can drive the expression of a reporter gene in transient transfection experiments. This method promises to be a useful genome-wide function-based approach that can complement existing methods to look for promoters.

    View details for DOI 10.1101/gr.529803

    View details for Web of Science ID 000183970000023

    View details for PubMedID 12805274

  • Genetic sequence data for pharmacogenomics CURRENT OPINION IN DRUG DISCOVERY & DEVELOPMENT Altman, R. B. 2003; 6 (3): 297-303


    Pharmacogenetics is the study of how variation in human genes leads to variation in response to drugs. Pharmacogenomics is the term applied to large-scale genomic approaches to pharmacogenetics, and it is currently characterized chiefly by the use of high-throughput DNA sequencing to identify sequence variations in pharmacologically important genes. Genes of interest for pharmacogenomics include genes involved in drug metabolism and transport, as well as genes that are drug targets. The past year has seen an increasing number of systematic surveys of genetic variation that establish reliable baseline measurements of sequence variation--at least in coding and promoter regions. These surveys form the basis for determination of population frequencies, genetic linkage studies and association studies relating genotype with drug response phenotypes of interest.

    View details for Web of Science ID 000183571800002

    View details for PubMedID 12833660

  • A functional analysis of disease-associated mutations in the androgen receptor gene NUCLEIC ACIDS RESEARCH Mooney, S. D., Klein, T. E., Altman, R. B., Trifiro, M. A., Gottlieb, B. 2003; 31 (8)


    Mutations in the androgen receptor (AR) are associated with a variety of diseases including androgen insensitivity syndrome and prostate cancer, but the way in which these mutations cause disease is poorly understood. We present a method for distinguishing likely disease-causing mutations from mutations that are merely associated with disease but have no causal role. Our method uses a measure of nucleotide conservation, and we find that conservation often correlates with severity of the clinical phenotype. Further, by only including mutations whose pathogenicity has been proven experimentally, this correlation is enhanced in the case of prostate cancer-associated mutations. Our method provides a means for assessing the significance of single nucleotide polymorphisms (SNPs) and cancer-associated mutations.

    View details for DOI 10.1093/nar/gng042

    View details for Web of Science ID 000182161400002

    View details for PubMedID 12682377

  • Recognizing complex, asymmetric functional sites in protein structures using a Bayesian scoring function. Journal of bioinformatics and computational biology Wei, L., Altman, R. B. 2003; 1 (1): 119-138


    The increase in known three-dimensional protein structures enables us to build statistical profiles of important functional sites in protein molecules. These profiles can then be used to recognize sites in large-scale automated annotations of new protein structures. We report an improved FEATURE system which recognizes functional sites in protein structures. FEATURE defines multi-level physico-chemical properties and recognizes sites based on the spatial distribution of these properties in the sites' microenvironments. It uses a Bayesian scoring function to compare a query region with the statistical profile built from known examples of sites and control nonsites. We have previously shown that FEATURE can accurately recognize calcium-binding sites and have reported interesting results scanning for calcium-binding sites in the entire Protein Data Bank. Here we report the ability of the improved FEATURE to characterize and recognize geometrically complex and asymmetric sites such as ATP-binding sites and disulfide bond-forming sites. FEATURE does not rely on conserved residues or conserved residue geometry of the sites. We also demonstrate that, in the absence of a statistical profile of the sites, FEATURE can use an artificially constructed profile based on a priori knowledge to recognize the sites in new structures, using redoxin active sites as an example.

    View details for PubMedID 15290784

  • Complexities of managing biomedical information. Omics : a journal of integrative biology Altman, R. B. 2003; 7 (1): 127-129

    View details for PubMedID 12831574

  • A literature-based method for assessing the functional coherence of a gene group BIOINFORMATICS Raychaudhuri, S., Altman, R. B. 2003; 19 (3): 396-401


    Many experimental and algorithmic approaches in biology generate groups of genes that need to be examined for related functional properties. For example, gene expression profiles are frequently organized into clusters of genes that may share functional properties. We evaluate a method, neighbor divergence per gene (NDPG), that uses scientific literature to assess whether a group of genes are functionally related. The method requires only a corpus of documents and an index connecting the documents to genes. We evaluate NDPG on 2796 functional groups generated by the Gene Ontology consortium in four organisms: mouse, fly, worm and yeast. NDPG finds functional coherence in 96, 92, 82 and 45% of the groups (at 99.9% specificity) in yeast, mouse, fly and worm respectively.

    View details for DOI 10.1093/bioinformatics/btg002

    View details for Web of Science ID 000181303000011

    View details for PubMedID 12584126

  • Mining heterogeneous ribosomal structure data Gabashvili, I. S., Whirl-Carrillo, M., Bada, M., Banatao, D. R., Altman, R. B. CELL PRESS. 2003: 463A-463A
  • Knowledge acquisition, consistency checking and concurrency control for Gene Ontology (GO) BIOINFORMATICS Yeh, I., Karp, P. D., Noy, N. F., Altman, R. B. 2003; 19 (2): 241-248


    A critical element of the computational infrastructure required for functional genomics is a shared language for communicating biological data and knowledge. The Gene Ontology (GO) provides a taxonomy of concepts and their attributes for annotating gene products. As GO increases in size, its ongoing construction and maintenance becomes more challenging. In this paper, we assess the applicability of a Knowledge Base Management System (KBMS), Protégé-2000, to the maintenance and development of GO. We transferred GO to Protégé-2000 in order to evaluate its suitability for GO. The graphical user interface supported browsing and editing of GO. Tools for consistency checking identified minor inconsistencies in GO and opportunities to reduce redundancy in its representation. The Protégé Axiom Language proved useful for checking ontological consistency. The PROMPT tool allowed us to track changes to GO. Using Protégé-2000, we tested our ability to make changes and extensions to GO to refine the semantics of attributes and classify more concepts. Gene Ontology in Protégé-2000 and the associated code are available online, as is Protégé-2000 itself.

    View details for Web of Science ID 000180913600011

    View details for PubMedID 12538245

  • Defining bioinformatics and structural bioinformatics. Methods of biochemical analysis Altman, R. B., Dugan, J. M. 2003; 44: 3-14

    View details for PubMedID 12647379

  • Automatic construction of 3D structural motifs for protein function prediction Liang, M. P., Brutlag, D. L., Altman, R. B. IEEE COMPUTER SOC. 2003: 613-614
  • Proceedings of Pacific Symposium on Biocomputing 2003. edited by Altman, R., Dunker, K., Hunter, L. 2003
  • Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Troyanskaya, O., Dolinski, K., Owen, A., Altman, R., Botstein, D. 2003
  • Preface. Bioinformatics and Functional Genomics. Altman, R., In Pevsner, J. 2003
  • A Personalized and Automated dbSNP Surveillance System. Liu, S., Lin, S., Woon, M., Klein, T., Altman, R. 2003
  • The expanding scope of bioinformatics: sequence analysis and beyond. Heredity Altman, R. 2003; 5 (90): 345
  • Recognizing Complex, Asymmetric Functional Sites in Protein Structures Using a Bayesian Scoring Function. Journal of Bioinformatics and Computational Biology Wei, L., Altman, R. 2003; 1 (1): 119-138
  • A personalized and automated dbSNP surveillance system PROCEEDINGS OF THE 2003 IEEE BIOINFORMATICS CONFERENCE Liu, S., Lin, S., Woon, M., Klein, T. E., Altman, R. B. 2003: 132-136


    The development of high throughput techniques and large-scale studies in the biological sciences has given rise to an explosive growth in both the volume and types of data available to researchers. A surveillance system that monitors data repositories and reports changes helps manage the data overload. We developed a dbSNP surveillance system that performs surveillance on the dbSNP database and alerts users to new information. The system is notable because it is personalized and fully automated. Each registered user has a list of genes to follow and receives notification of new entries concerning these genes. The system integrates data from dbSNP, LocusLink, PharmGKB, and Genbank to position SNPs on reference sequences and classify SNPs into categories such as synonymous and non-synonymous SNPs. The system uses data warehousing, object model-based data integration, object-oriented programming, and a platform-neutral data access mechanism.

    View details for Web of Science ID 000188997700026

    View details for PubMedID 16452787

  • Automated construction of structural motifs for predicting functional sites on protein structures. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Liang, M. P., Brutlag, D. L., Altman, R. B. 2003: 204-215


    Structural genomics initiatives are beginning to rapidly generate vast numbers of protein structures. For many of the structures, functions are not yet determined and high-throughput methods for determining function are necessary. Although there has been extensive work in function prediction at the sequence level, predicting function at the structure level may provide better sensitivity and predictive value. We describe a method to predict functional sites by automatically creating three dimensional structural motifs from amino acid sequence motifs. These structural motifs perform comparably well with manually generated structural motifs and perform better than sequence motifs. Automatically generated structural motifs can be used for structural-genomic scale function prediction on protein structures.

    View details for PubMedID 12603029

  • Indexing pharmacogenetic knowledge on the World Wide Web PHARMACOGENETICS Altman, R. B., Flockhart, D. A., Sherry, S. T., Oliver, D. E., Rubin, D. L., Klein, T. E. 2003; 13 (1): 3-5

    View details for Web of Science ID 000180584000002

    View details for PubMedID 12544507

  • Qualitative models of molecular function: Linking genetic polymorphisms of tRNA to their functional sequelae PROCEEDINGS OF THE IEEE Peleg, M., Gabashvili, I. S., Altman, R. B. 2002; 90 (12): 1875-1886
  • Creating an online dictionary of abbreviations from MEDLINE JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Chang, J. T., Schutze, H., Altman, R. B. 2002; 9 (6): 612-620


    The growth of the biomedical literature presents special challenges for both human readers and automatic algorithms. One such challenge derives from the common and uncontrolled use of abbreviations in the literature. Each additional abbreviation increases the effective size of the vocabulary for a field. Therefore, to create an automatically generated and maintained lexicon of abbreviations, we have developed an algorithm to match abbreviations in text with their expansions. Our method uses a statistical learning algorithm, logistic regression, to score abbreviation expansions based on their resemblance to a training set of human-annotated abbreviations. We applied it to Medstract, a corpus of MEDLINE abstracts in which abbreviations and their expansions have been manually annotated. We then ran the algorithm on all abstracts in MEDLINE, creating a dictionary of biomedical abbreviations. To test the coverage of the database, we used an independently created list of abbreviations from the China Medical Tribune. We measured the recall and precision of the algorithm in identifying abbreviations from the Medstract corpus. We also measured the recall when searching for abbreviations from the China Medical Tribune against the database. On the Medstract corpus, our algorithm achieves up to 83% recall at 80% precision. Applying the algorithm to all of MEDLINE yielded a database of 781,632 high-scoring abbreviations. Of all the abbreviations in the list from the China Medical Tribune, 88% were in the database. We have developed an algorithm to identify abbreviations from text. We are making this available as a public abbreviation server.

    View details for DOI 10.1197/jamia.M1139

    View details for Web of Science ID 000178914400005

    View details for PubMedID 12386112

  • Nonparametric methods for identifying differentially expressed genes in microarray data BIOINFORMATICS Troyanskaya, O. G., Garber, M. E., Brown, P. O., Botstein, D., Altman, R. B. 2002; 18 (11): 1454-1461


    Gene expression experiments provide a fast and systematic way to identify disease markers relevant to clinical care. In this study, we address the problem of robust identification of differentially expressed genes from microarray data. Differentially expressed genes, or discriminator genes, are genes with significantly different expression in two user-defined groups of microarray experiments. We compare three model-free approaches: (1) nonparametric t-test, (2) Wilcoxon (or Mann-Whitney) rank sum test, and (3) a heuristic method based on high Pearson correlation to a perfectly differentiating gene ('ideal discriminator method'). We systematically assess the performance of each method based on simulated and biological data under varying noise levels and p-value cutoffs. All methods exhibit very low false positive rates and identify a large fraction of the differentially expressed genes in simulated data sets with noise level similar to that of actual data. Overall, the rank sum test appears most conservative, which may be advantageous when the computationally identified genes need to be tested biologically. However, if a more inclusive list of markers is desired, a higher p-value cutoff or the nonparametric t-test may be appropriate. When applied to data from lung tumor and lymphoma data sets, the methods identify biologically relevant differentially expressed genes that allow clear separation of groups in question. Thus the methods described and evaluated here provide a convenient and robust way to identify differentially expressed genes for further biological and clinical analysis.

    View details for Web of Science ID 000179249800008

    View details for PubMedID 12424116
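
    As an illustration of the rank sum test named in the abstract above, the following is a minimal sketch and not the authors' implementation. It assumes a normal approximation with no tie correction of the variance, which is rough for very small groups. The function names and sample values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided rank sum test for one gene measured in two groups.

    Returns (rank sum of group_a, z score, approximate p-value).
    """
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    w = sum(ranks[:n_a])
    mean_w = n_a * (n_a + n_b + 1) / 2
    sd_w = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return w, z, p


# Hypothetical expression values for one gene in two groups of experiments.
w, z, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
is_discriminator = p < 0.05  # the cutoff is an illustrative choice
```

    Raising the p-value cutoff in the last line yields a more inclusive gene list, which is the trade-off the abstract describes.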

  • Using text analysis to identify functionally coherent gene groups GENOME RESEARCH Raychaudhuri, S., Schutze, H., Altman, R. B. 2002; 12 (10): 1582-1590


    The analysis of large-scale genomic information (such as sequence data or expression patterns) frequently involves grouping genes on the basis of common experimental features. Often, as with gene expression clustering, there are too many groups to easily identify the functionally relevant ones. One valuable source of information about gene function is the published literature. We present a method, neighbor divergence, for assessing whether the genes within a group share a common biological function based on their associated scientific literature. The method uses statistical natural language processing techniques to interpret biological text. It requires only a corpus of documents relevant to the genes being studied (e.g., all genes in an organism) and an index connecting the documents to appropriate genes. Given a group of genes, neighbor divergence assigns a numerical score indicating how "functionally coherent" the gene group is from the perspective of the published literature. We evaluate our method by testing its ability to distinguish 19 known functional gene groups from 1900 randomly assembled groups. Neighbor divergence achieves 79% sensitivity at 100% specificity, comparing favorably to other tested methods. We also apply neighbor divergence to previously published gene expression clusters to assess its ability to recognize gene groups that had been manually identified as representative of a common function.

    View details for DOI 10.1101/gr.116402

    View details for Web of Science ID 000178396400014

    View details for PubMedID 12368251

  • Promises of text processing: natural language processing meets AI DRUG DISCOVERY TODAY Chang, J. T., Altman, R. B. 2002; 7 (19): 992-993

    View details for Web of Science ID 000178338600006

    View details for PubMedID 12546913

  • Emerging scientific applications in data mining COMMUNICATIONS OF THE ACM Han, J. W., Altman, R. B., Kumar, V., Mannila, H., Pregibon, D. 2002; 45 (8): 54-58
  • Determining the genomic locations of repetitive DNA sequences with a whole-genome microarray: IS6110 in Mycobacterium tuberculosis JOURNAL OF CLINICAL MICROBIOLOGY Kivi, M., Liu, X. M., Raychaudhuri, S., Altman, R. B., Small, P. M. 2002; 40 (6): 2192-2198


    The mycobacterial insertion sequence IS6110 has been exploited extensively as a clonal marker in molecular epidemiologic studies of tuberculosis. In addition, it has been hypothesized that this element is an important driving force behind genotypic variability that may have phenotypic consequences. We present here a novel, DNA microarray-based methodology, designated SiteMapping, that simultaneously maps the locations and orientations of multiple copies of IS6110 within the genome. To investigate the sensitivity, accuracy, and limitations of the technique, it was applied to eight Mycobacterium tuberculosis strains for which complete or partial IS6110 insertion site information had been determined previously. SiteMapping correctly located 64% (38 of 59) of the IS6110 copies predicted by restriction fragment length polymorphism analysis. The technique is highly specific; 97% of the predicted insertion sites were true insertions. Eight previously unknown insertions were identified and confirmed by PCR or sequencing. The performance could be improved by modifications in the experimental protocol and in the approach to data analysis. SiteMapping has general applicability and demonstrates an expansion in the applications of microarrays that complements conventional approaches in the study of genome architecture.

    View details for DOI 10.1128/JCM.40.6.2192-2198.2002

    View details for Web of Science ID 000176159200048

    View details for PubMedID 12037086

  • Modelling biological processes using workflow and Petri Net models BIOINFORMATICS Peleg, M., Yeh, I., Altman, R. B. 2002; 18 (6): 825-837


    Biological processes can be considered at many levels of detail, ranging from atomic mechanism to general processes such as cell division, cell adhesion or cell invasion. The experimental study of protein function and gene regulation typically provides information at many levels. The representation of hierarchical process knowledge in biology is therefore a major challenge for bioinformatics. To represent high-level processes in the context of their component functions, we have developed a graphical knowledge model for biological processes that supports methods for qualitative reasoning. We assessed eleven diverse models that were developed in the fields of software engineering, business, and biology, to evaluate their suitability for representing and simulating biological processes. Based on this assessment, we combined the best aspects of two models: Workflow/Petri Net and a biological concept model. The Workflow model can represent nesting and ordering of processes, the structural components that participate in the processes, and the roles that they play. It also maps to Petri Nets, which allow verification of formal properties and qualitative simulation. The biological concept model, TAMBIS, provides a framework for describing biological entities that can be mapped to the workflow model. We tested our model by representing malaria parasites invading host erythrocytes, and composed queries, in five general classes, to discover relationships among processes and structural components. We used reachability analysis to answer queries about the dynamic aspects of the model. The model is available at

    View details for Web of Science ID 000176553400006

    View details for PubMedID 12075018

  • RNAML: A standard syntax for exchanging RNA information RNA-A PUBLICATION OF THE RNA SOCIETY Waugh, A., Gendron, P., Altman, R., Brown, J. W., Case, D., Gautheret, D., Harvey, S. C., Leontis, N., Westbrook, J., Westhof, E., Zuker, M., Major, F. 2002; 8 (6): 707-717


    Analyzing a single data set using multiple RNA informatics programs often requires a file format conversion between each pair of programs, significantly hampering productivity. To facilitate the interoperation of these programs, we propose a syntax to exchange basic RNA molecular information. This RNAML syntax allows for the storage and the exchange of information about RNA sequence and secondary and tertiary structures. The syntax permits the description of higher level information about the data including, but not restricted to, base pairs, base triples, and pseudoknots. A class-oriented approach allows us to represent data common to a given set of RNA molecules, such as a sequence alignment and a consensus secondary structure. Documentation about experiments and computations, as well as references to journals and external databases, are included in the syntax. The chief challenge in creating such a syntax was to determine the appropriate scope of usage and to ensure extensibility as new needs will arise. The syntax complies with the eXtensible Markup Language (XML) recommendations, a widely accepted standard for syntax specifications. In addition to the various generic packages that exist to read and interpret XML formats, an XML processor was developed and put in the open-source MC-Core library for nucleic acid and protein structure computer manipulation.

    View details for DOI 10.1017/S1355838202028017

    View details for Web of Science ID 000176277100001

    View details for PubMedID 12088144

  • Mining biochemical information: Lessons taught by the ribosome RNA-A PUBLICATION OF THE RNA SOCIETY Whirl-Carrillo, M., Gabashvili, I. S., Bada, M., Banatao, D. R., Altman, R. B. 2002; 8 (3): 279-289


    The publication of the crystal structures of the ribosome offers an opportunity to retrospectively evaluate the information content of hundreds of qualitative biochemical and biophysical studies of these structures. We assessed the correspondence between more than 2,500 experimental proximity measurements and the distances observed in the ribosomal crystals. Although detailed experimental procedures and protocols are unique in almost each analyzed paper, the data can be grouped into subsets with similar patterns and analyzed in an integrative fashion. We found that, for crosslinking, footprinting, and cleavage data, the corresponding distances observed in crystal structures generally did not exceed the maximum values expected (from the estimated length of the agent and maximal anticipated deviations from the conformations found in crystals). However, the distribution of distances had heavier tails than those typically assumed when building three-dimensional models, and the fraction of incompatible distances was greater than expected. Some of these incompatibilities can be attributed to the experimental methods used. In addition, the accuracy of these procedures appears to be sensitive to the different reactivities, flexibilities, and interactions among the components. These findings demonstrate the necessity of a very careful analysis of data used for structural modeling and consideration of all possible parameters that could potentially influence the quality of measurements. We conclude that experimental proximity measurements can provide useful distance information for structural modeling, but with a broad distribution of inferred distance ranges. We also conclude that development of automated modeling approaches would benefit from better annotations of experimental data for detection and interpretation of their significance.

    View details for DOI 10.1017/S135583820202407X

    View details for Web of Science ID 000175155500002

    View details for PubMedID 12003488

  • Challenges for biomedical informatics and pharmacogenomics ANNUAL REVIEW OF PHARMACOLOGY AND TOXICOLOGY Altman, R. B., Klein, T. E. 2002; 42: 113-133


    Pharmacogenomics requires the integration and analysis of genomic, molecular, cellular, and clinical data, and it thus offers a remarkable set of challenges to biomedical informatics. These include infrastructural challenges such as the creation of data models and databases for storing these data, the integration of these data with external databases, the extraction of information from natural language text, and the protection of databases with sensitive information. There are also scientific challenges in creating tools to support gene expression analysis, three-dimensional structural analysis, and comparative genomic analysis. In this review, we summarize the current uses of informatics within pharmacogenomics and show how the technical challenges that remain for biomedical informatics are typical of those that will be confronted in the postgenomic era.

    View details for Web of Science ID 000174038800007

    View details for PubMedID 11807167

  • Scoring functions sensitive to alignment error have a more difficult search: A paradox for threading Chang, J., Carrillo, M. W., Waugh, A., Wei, L. P., Altman, R. B. AMER CHEMICAL SOC. 2002: 309-320
  • Modeling molecular function and failure: Misreading of genetic code by the ribosome Gabashvili, I. S., Peleg, M., Altman, R. B. CELL PRESS. 2002: 167A-168A
  • Qualitative models of molecular function: linking genetic polymorphisms of tRNA to their functional sequelae. Peleg, M., Gabashvili, I. S., Altman, R. edited by Akay, M. 2002
  • Representing genetic sequence data for pharmacogenomics: an evolutionary approach using ontological and relational models. Bioinformatics, 18 Suppl 1 Rubin, D., Shafa, F., Oliver, D., Hewett, M., Altman, R. 2002: S207-S215
  • Using Binning to Maintain Confidentiality of Medical Data. Lin, Z., Hewett, M., Altman, R. 2002
  • Scoring Functions Sensitive to Alignment Error Have a More Difficult Search: A Paradox for Threading. In Structures and Mechanisms Chang, J., Carrillo, M., Waugh, A., Wei, L., Altman, R. ACS Publications. 2002: 309-320
  • Proceedings of Pacific Symposium on Biocomputing 2002. edited by Altman, R., Dunker, K., Hunter, L. 2002
  • Preface. Microarrays For An Integrative Genomics. Altman, R., In Kohane, I., Kho, A., Butte, A. 2002: xii-xv
  • Emerging Scientific Applications in Data Mining. Communications of the ACM Han, J., Altman, R., Kumar, V., Mannila, H., Pregibon, D. 2002; 45 (8): 54-58
  • Using binning to maintain confidentiality of medical data AMIA 2002 SYMPOSIUM, PROCEEDINGS Lin, Z., Hewett, M., Altman, R. B. 2002: 454-458


    Biomedical informatics in general and pharmacogenomics in particular require a research platform that simultaneously enables discovery while protecting research subjects' privacy and information confidentiality. The development of inexpensive DNA sequencing and analysis technologies promises unprecedented database access to very specific information about individuals. To allow analysis of this data without compromising the research subjects' privacy, we must develop methods for removing identifying information from medical and genomic data. In this paper, we build upon the idea that binned database records are more difficult to trace back to individuals. We represent symbolic and numeric data hierarchically, and bin them by generalizing the records. We measure the information loss due to binning using an information theoretic measure called mutual information. The results show that we can bin the data to different levels of precision and use the bin size to control the tradeoff between privacy and data resolution.

    View details for Web of Science ID 000189418100092

    View details for PubMedID 12463865
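
    The trade-off described in the abstract above can be sketched as follows. This is an illustrative example, not the authors' code: it assumes fixed-width bins over a single numeric field and does not reproduce the paper's hierarchical generalization of symbolic data. The function names and the sample ages are hypothetical.

```python
import math
from collections import Counter


def bin_value(value, width):
    """Generalize an integer to the (low, high) range of its fixed-width bin."""
    low = (value // width) * width
    return (low, low + width - 1)


def mutual_information(xs, ys):
    """Mutual information in bits between two paired sequences of labels."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x, y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Hypothetical patient ages; wider bins hide more detail about each record.
ages = [21, 22, 35, 36, 50, 51, 64, 65]
retained = {
    width: mutual_information(ages, [bin_value(a, width) for a in ages])
    for width in (1, 10, 100)
}
```

    In this sketch `retained` maps each bin width to the information the binned field still carries about the original values, so the bin width acts as the control knob between privacy and data resolution.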

  • Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature GENOME RESEARCH Raychaudhuri, S., Chang, J. T., Sutphin, P. D., Altman, R. B. 2002; 12 (1): 203-214


    Functional characterizations of thousands of gene products from many species are described in the published literature. These discussions are extremely valuable for characterizing the functions not only of these gene products, but also of their homologs in other organisms. The Gene Ontology (GO) is an effort to create a controlled terminology for labeling gene functions in a more precise, reliable, computer-readable manner. Currently, the best annotations of gene function with the GO are performed by highly trained biologists who read the literature and select appropriate codes. In this study, we explored the possibility that statistical natural language processing techniques can be used to assign GO codes. We compared three document classification methods (maximum entropy modeling, naïve Bayes classification, and nearest-neighbor classification) to the problem of associating a set of GO codes (for biological process) to literature abstracts and thus to the genes associated with the abstracts. We showed that maximum entropy modeling outperforms the other methods and achieves an accuracy of 72% when ascertaining the function discussed within an abstract. The maximum entropy method provides confidence measures that correlate well with performance. We conclude that statistical methods may be used to assign GO codes and may be useful for the difficult task of reassignment as terminology standards evolve over time.

    View details for Web of Science ID 000173064900022

    View details for PubMedID 11779846

  • Automating data acquisition into ontologies from pharmacogenetics relational data sources using declarative object definitions and XML. Pacific Symposium on Biocomputing Rubin, D. L., Hewett, M., Oliver, D. E., Klein, T. E., Altman, R. B. 2002: 88-99


    Ontologies are useful for organizing large numbers of concepts having complex relationships, such as the breadth of genetic and clinical knowledge in pharmacogenomics. But because ontologies change and knowledge evolves, it is time consuming to maintain stable mappings to external data sources that are in relational format. We propose a method for interfacing ontology models with data acquisition from external relational data sources. This method uses a declarative interface between the ontology and the data source, and this interface is modeled in the ontology and implemented using XML schema. Data is imported from the relational source into the ontology using XML, and data integrity is checked by validating the XML submission with an XML schema. We have implemented this approach in PharmGKB, a pharmacogenetics knowledge base. Our goals were to (1) import genetic sequence data, collected in relational format, into the pharmacogenetics ontology, and (2) automate the process of updating the links between the ontology and data acquisition when the ontology changes. We tested our approach by linking PharmGKB with data acquisition from a relational model of genetic sequence information. The ontology subsequently evolved, and we were able to rapidly update our interface with the external data and continue acquiring the data. Similar approaches may be helpful for integrating other heterogeneous information sources in order to make the diversity of pharmacogenetics data amenable to computational analysis.

    View details for PubMedID 11928521

  • Representing genetic sequence data for pharmacogenomics: an evolutionary approach using ontological and relational models. Bioinformatics Rubin, D. L., Shafa, F., Oliver, D. E., Hewett, M., Altman, R. B. 2002; 18: S207-15


    The information model chosen to store biological data affects the types of queries possible, database performance, and difficulty in updating that information model. Genetic sequence data for pharmacogenetics studies can be complex, and the best information model to use may change over time. As experimental and analytical methods change, and as biological knowledge advances, the data storage requirements and types of queries needed may also change. We developed a model for genetic sequence and polymorphism data, and used XML Schema to specify the elements and attributes required for this model. We implemented this model as an ontology in a frame-based representation and as a relational model in a database system. We collected genetic data from two pharmacogenetics resequencing studies, and formulated queries useful for analysing these data. We compared the ontology and relational models in terms of query complexity, performance, and difficulty in changing the information model. Our results demonstrate benefits of evolving the schema for storing pharmacogenetics data: ontologies perform well in early design stages as the information model changes rapidly and simplify query formulation, while relational models offer improved query speed once the information model and types of queries needed stabilize.

    View details for PubMedID 12169549

  • Ontology development for a pharmacogenetics knowledge base. Pacific Symposium on Biocomputing Oliver, D. E., Rubin, D. L., Stuart, J. M., Hewett, M., Klein, T. E., Altman, R. B. 2002: 65-76


    Research directed toward discovering how genetic factors influence a patient's response to drugs requires coordination of data produced from laboratory experiments, computational methods, and clinical studies. A public repository of pharmacogenetic data should accelerate progress in the field of pharmacogenetics by organizing and disseminating public datasets. We are developing a pharmacogenetics knowledge base (PharmGKB) to support the storage and retrieval of both experimental data and conceptual knowledge. PharmGKB is an Internet-based resource that integrates complex biological, pharmacological, and clinical data in such a way that researchers can submit their data and users can retrieve information to investigate genotype-phenotype correlations. Successful management of the names, meaning, and organization of concepts used within the system is crucial. We have selected a frame-based knowledge-representation system for development of an ontology of concepts and relationships that represent the domain and that permit storage of experimental data. Preliminary experience shows that the ontology we have developed for gene-sequence data allows us to accept, store, and query data submissions.

    View details for PubMedID 11928517

  • PharmGKB: The Pharmacogenetics Knowledge Base NUCLEIC ACIDS RESEARCH Hewett, M., Oliver, D. E., Rubin, D. L., Easton, K. L., Stuart, J. M., Altman, R. B., Klein, T. E. 2002; 30 (1): 163-165


    The Pharmacogenetics Knowledge Base (PharmGKB) contains genomic, phenotype and clinical information collected from ongoing pharmacogenetic studies. Tools to browse, query, download, submit, edit and process the information are available to registered research network members. A subset of the tools is publicly available. PharmGKB currently contains over 150 genes under study, 14 Coriell populations and a large ontology of pharmacogenetics concepts. The pharmacogenetic concepts and the experimental data are interconnected by a set of relations to form a knowledge base of information for pharmacogenetic researchers. The information in PharmGKB, and its associated tools for processing that information, are tailored for leading-edge pharmacogenetics research. The PharmGKB project was initiated in April 2000 and the first version of the knowledge base went online in February 2001.

    View details for Web of Science ID 000173077100041

    View details for PubMedID 11752281

  • Diversity of gene expression in adenocarcinoma of the lung PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Garber, M. E., Troyanskaya, O. G., Schluens, K., Petersen, S., Thaesler, Z., Pacyna-Gengelbach, M., van de Rijn, M., Rosen, G. D., Perou, C. M., Whyte, R. I., Altman, R. B., Brown, P. O., Botstein, D., Petersen, I. 2001; 98 (24): 13784-13789


    The global gene expression profiles for 67 human lung tumors representing 56 patients were examined by using 24,000-element cDNA microarrays. Subdivision of the tumors based on gene expression patterns faithfully recapitulated morphological classification of the tumors into squamous, large cell, small cell, and adenocarcinoma. The gene expression patterns made possible the subclassification of adenocarcinoma into subgroups that correlated with the degree of tumor differentiation as well as patient survival. Gene expression analysis thus promises to extend and refine standard pathologic analysis.

    View details for Web of Science ID 000172328100058

    View details for PubMedID 11707590

  • Missing value estimation methods for DNA microarrays BIOINFORMATICS Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R. B. 2001; 17 (6): 520-525


    Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.

    View details for Web of Science ID 000169404700005

    View details for PubMedID 11395428
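
    The weighted K-nearest-neighbors idea in the abstract above can be sketched as follows. This is not the published KNNimpute tool. The abstract does not state the distance metric or the weighting, so the sketch assumes root-mean-square Euclidean distance over co-observed columns and inverse-distance weights. The function name, the use of `None` for missing values, and the sample matrix are hypothetical.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries from the k most similar rows (genes).

    matrix is a list of rows (genes) by columns (experiments).
    A new matrix is returned; the input is left unchanged.
    """
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_index, other in enumerate(matrix):
                # A neighbor must be a different gene with column j observed.
                if other_index == i or other[j] is None:
                    continue
                shared = [
                    (a, b)
                    for a, b in zip(row, other)
                    if a is not None and b is not None
                ]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # leave the gap when no usable neighbor exists
            # Closer genes get larger weights; the epsilon avoids division by zero.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = total / sum(weights)
    return filled


# Hypothetical expression matrix: gene 0 is missing its third measurement.
expression = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
completed = knn_impute(expression, k=2)
```

    The row average baseline mentioned in the abstract would instead fill the gap with the mean of the gene's own observed values, ignoring similar genes entirely.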

  • Whole-genome expression analysis: challenges beyond clustering CURRENT OPINION IN STRUCTURAL BIOLOGY Altman, R. B., Raychaudhuri, S. 2001; 11 (3): 340-347


    Measuring the expression of most or all of the genes in a biological system raises major analytic challenges. A wealth of recent reports uses microarray expression data to examine diverse biological phenomena - from basic processes in model organisms to complex aspects of human disease. After an initial flurry of methods for clustering the data on the basis of similarity, the field has recognized some longer-term challenges. Firstly, there are efforts to understand the sources of noise and variation in microarray experiments in order to increase the biological signal. Secondly, there are efforts to combine expression data with other sources of information to improve the range and quality of conclusions that can be drawn. Finally, techniques are now emerging to reconstruct networks of genetic interactions in order to create integrated and systematic models of biological systems.

    View details for Web of Science ID 000169375000013

    View details for PubMedID 11406385

  • Basic microarray analysis: grouping and feature reduction TRENDS IN BIOTECHNOLOGY Raychaudhuri, S., Sutphin, P. D., Chang, J. T., Altman, R. B. 2001; 19 (5): 189-193


    DNA microarray technologies are useful for addressing a broad range of biological problems - including the measurement of mRNA expression levels in target cells. These studies typically produce large data sets that contain measurements on thousands of genes under hundreds of conditions. There is a critical need to summarize this data and to pick out the important details. The most common activities, therefore, are to group together microarray data and to reduce the number of features. Both of these activities can be done using only the raw microarray data (unsupervised methods) or using external information that provides labels for the microarray data (supervised methods). We briefly review supervised and unsupervised methods for grouping and reducing data in the context of a publicly available suite of tools called CLEAVER, and illustrate their application on a representative data set collected to study lymphoma.

    View details for Web of Science ID 000168716800008

    View details for PubMedID 11301132

  • Including biological literature improves homology search. Pacific Symposium on Biocomputing Chang, J. T., Raychaudhuri, S., Altman, R. B. 2001: 374-383


    Annotating the tremendous amount of sequence information being generated requires accurate automated methods for recognizing homology. Although sequence similarity is only one of many indicators of evolutionary homology, it is often the only one used. Here we find that supplementing sequence similarity with information from biomedical literature is successful in increasing the accuracy of homology search results. We modified the PSI-BLAST algorithm to use literature similarity in each iteration of its database search. The modified algorithm is evaluated and compared to standard PSI-BLAST in searching for homologous proteins. The performance of the modified algorithm achieved 32% recall with 95% precision, while the original one achieved 33% recall with 84% precision; the literature similarity requirement preserved the sensitive characteristic of the PSI-BLAST algorithm while improving the precision.

    View details for PubMedID 11262956

  • Using metacomputing tools to facilitate large scale analyses of biological databases. Waugh, A., Williams, G., Wei, L., Altman, R. edited by Altman, R., Dunker, K., Hunter, L. 2001
  • Challenges for intelligent systems in biology. IEEE Intelligent Systems Altman, R. 2001; 16 (6): 14-18
  • Proceedings of Pacific Symposium on Biocomputing 2001. edited by Altman, R., Dunker, K., Hunter, L. 2001
  • ViewFeature: integrated feature analysis and visualization. Pacific Symposium on Biocomputing Banatao, D. R., Huang, C. C., Babbitt, P. C., Altman, R. B., Klein, T. E. 2001: 240-250


    Visualization interfaces for high performance computing systems pose special problems due to the complexity and volume of data these systems manipulate. In the post-genomic era, scientists must be able to quickly gain insight into structure-function problems, and require flexible computing environments to quickly create interfaces that link the relevant tools. Feature, a program for analyzing protein sites, takes a set of 3-dimensional structures and creates statistical models of sites of structural or functional significance. Until now, Feature has provided no support for visualization, which can make understanding its results difficult. We have developed an extension to the molecular visualization program Chimera that integrates Feature's statistical models and site predictions with 3-dimensional structures viewed in Chimera. We call this extension ViewFeature, and it is designed to help users understand the structural features that define a site of interest. We applied ViewFeature in an analysis of the enolase superfamily, a functionally distinct class of proteins that share a common fold, the alpha/beta barrel, in order to gain a more complete understanding of the conserved physical properties of this superfamily. In particular, we wanted to define the structural determinants that distinguish the enolase superfamily active site scaffold from other alpha/beta barrel superfamilies and particularly from other metal-binding alpha/beta barrel proteins. Through the use of ViewFeature, we have found that the C-terminal domain of the enolase superfamily does not differ at the scaffold level from metal-binding alpha/beta barrels. We are, however, able to differentiate between the metal-binding sites of alpha/beta barrels and those of other metal-binding proteins. We describe the overall architectural features of enolases in a radius of 10 Angstroms around the active site.

    View details for PubMedID 11262944

  • Using meta computing tools to facilitate large-scale analyses of biological databases. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Waugh, A., Williams, G. A., Wei, L., Altman, R. B. 2001: 360-371


    Given the high rate at which biological data are being collected and made public, it is essential that computational tools be developed that are capable of efficiently accessing and analyzing these data. High-performance distributed computing resources can play a key role in enabling large-scale analyses of biological databases. We use a distributed computing environment, Legion, to enable large-scale computations on the Protein Data Bank (PDB). In particular, we employ the Feature program to scan all protein structures in the PDB in search of unrecognized potential cation binding sites. We evaluate the efficiency of Legion's parallel execution capabilities and analyze the initial biological implications that result from having a site annotation scan of the entire PDB. We discuss four interesting proteins with unannotated, high-scoring candidate cation binding sites.

    View details for PubMedID 11262955

  • Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenetics Research Network and Knowledge Base. Pharmacogenomics Journal Klein, T. E., Chang, J. T., Cho, M. K., Easton, K. L., Fergerson, R., Hewett, M., Lin, Z., Liu, Y., Liu, S., Oliver, D. E., Rubin, D. L., Shafa, F., Stuart, J. M., Altman, R. B. 2001; 1 (3): 167-170

    View details for PubMedID 11908751

  • Constrained global optimization for estimating molecular structure from atomic distances JOURNAL OF COMPUTATIONAL BIOLOGY Williams, G. A., Dugan, J. M., Altman, R. B. 2001; 8 (5): 523-547


    Finding optimal three-dimensional molecular configurations based on a limited amount of experimental and/or theoretical data requires efficient nonlinear optimization algorithms. Optimization methods must be able to find atomic configurations that are close to the absolute, or global, minimum error and also satisfy known physical constraints such as minimum separation distances between atoms (based on van der Waals interactions). The most difficult obstacles in these types of problems are that 1) using a limited amount of input data leads to many possible local optima and 2) introducing physical constraints, such as minimum separation distances, helps to limit the search space but often makes convergence to a global minimum more difficult. We introduce a constrained global optimization algorithm that is robust and efficient in yielding near-optimal three-dimensional configurations that are guaranteed to satisfy known separation constraints. The algorithm uses an atom-based approach that reduces the dimensionality and allows for tractable enforcement of constraints while maintaining good global convergence properties. We evaluate the new optimization algorithm using synthetic data from the yeast phenylalanine tRNA and several proteins, all with known crystal structure taken from the Protein Data Bank. We compare the results to commonly applied optimization methods, such as distance geometry, simulated annealing, continuation, and smoothing. We show that compared to other optimization approaches, our algorithm is able to combine sparse input data with physical constraints in an efficient manner to yield structures with lower root mean squared deviation.

    View details for Web of Science ID 000171950200005

    View details for PubMedID 11694181

  • Biomedical computation at Stanford University: a larger umbrella for the future M D COMPUTING Altman, R. B. 2000; 17 (6): 35-37

    View details for Web of Science ID 000165970200020

    View details for PubMedID 11189759

  • The interactions between clinical informatics and bioinformatics: A case study JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Altman, R. B. 2000; 7 (5): 439-443


    For the past decade, Stanford Medical Informatics has combined clinical informatics and bioinformatics research and training in an explicit way. The interest in applying informatics techniques to both clinical problems and problems in basic science can be traced to the Dendral project in the 1960s. Having bioinformatics and clinical informatics in the same academic unit is still somewhat unusual and can lead to clashes of clinical and basic science cultures. Nevertheless, the benefits of this organization have recently become clear, as the landscape of academic medicine in the next decades has begun to emerge. The author provides examples of technology transfer between clinical informatics and bioinformatics that illustrate how they complement each other.

    View details for Web of Science ID 000089431900002

    View details for PubMedID 10984462

  • Calculation of the relative geometry of tRNAs in the ribosome from directed hydroxyl-radical probing data RNA-A PUBLICATION OF THE RNA SOCIETY Joseph, S., Whirl, M. L., Kondo, D., Noller, H. F., Altman, R. B. 2000; 6 (2): 220-232


    The many interactions of tRNA with the ribosome are fundamental to protein synthesis. During the peptidyl transferase reaction, the acceptor ends of the aminoacyl and peptidyl tRNAs must be in close proximity to allow peptide bond formation, and their respective anticodons must base pair simultaneously with adjacent trinucleotide codons on the mRNA. The two tRNAs in this state can be arranged in two nonequivalent general configurations called the R and S orientations, many versions of which have been proposed for the geometry of tRNAs in the ribosome. Here, we report the combined use of computational analysis and tethered hydroxyl-radical probing to constrain their arrangement. We used Fe(II) tethered to the 5' end of anticodon stem-loop analogs (ASLs) of tRNA and to the 5' end of deacylated tRNA(Phe) to generate hydroxyl radicals that probe proximal positions in the backbone of adjacent tRNAs in the 70S ribosome. We inferred probe-target distances from the resulting RNA strand cleavage intensities and used these to calculate the mutual arrangement of A-site and P-site tRNAs in the ribosome, using three different structure estimation algorithms. The two tRNAs are constrained to the S configuration with an angle of about 45 degrees between the respective planes of the molecules. The terminal phosphates of 3'CCA are separated by 23 A when using the tRNA crystal conformations, and the anticodon arms of the two tRNAs are sufficiently close to interact with adjacent codons in mRNA.

    View details for Web of Science ID 000085267900007

    View details for PubMedID 10688361

  • Generating interactive molecular documentaries using a library of graphical actions. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Pulavarthi, P., Chiang, R., Altman, R. B. 2000: 266-277


    Paper-based publishing of scientific articles limits the types of presentations that can be used. The emergence of electronic publishing has created opportunities to increase the range of formats available for conveying scientific content. We introduce the Graphical Explanation Markup Language, GEML, implemented as an XML format for defining molecular documentaries which exploit the interactive capabilities of electronic publishing. GEML builds upon existing molecular structure definitions such as the Protein Data Bank (PDB) standard file format. GEML provides a library of gestures (or actions) commonly used for structural explanations, and is extensible. XML allows us to separate explicit statements about how to highlight a molecular structure from the implementation of these instructions. We also present GEIS (Generator of Explanatory Interactive Systems), a program that takes as input a GEML documentary definition file and produces all the files necessary for an interactive, web-based molecular documentary. To demonstrate GEML and GEIS, we constructed a documentary capturing the difficult 3D notions expressed in two selected published reports about human topoisomerase I. We have created a prototype Java application, GEMLBuilder, as an editor of GEML files.

    View details for PubMedID 10902175

  • The new peer review JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Kohane, I. S., Altman, R. B. 2000: 433-437


    It is widely recognized that the Internet has fundamentally changed the dynamics of publication, and in particular, it is clear that there is no effective way to control the release of any web-based publication. The scientific and lay literature is now accessible to the public with unprecedented ease. Recent proposals to start a life sciences online repository of preprints highlight the trend towards "publish first, review later" that seems to be emerging. Does this mean that the peer review process is dead? It certainly suggests that there is a need for a change in how the process works. We discuss currently available technologies to enable the implementation of a new, distributed peer review process benefiting multiple user communities.

    View details for Web of Science ID 000170207500089

    View details for PubMedID 11079920

  • Proceedings of Pacific Symposium on Biocomputing 2000. edited by Altman, R., Dunker, K., Hunter, L. 2000
  • Calculation of the relative geometry of tRNAs in the ribosome from directed hydroxyl-radical probing data. RNA, PMCID: PMC1369908. Joseph, S., Carrillo, M., Kondo, H., Noller, H., Altman, R. 2000; 6: 220-232
  • Bioinformatics. Medical Informatics: Computer Applications in Health Care Altman, R. edited by Shortliffe, T., Wiederhold, G., Fagan, L. Heidelberg: Springer-Verlag. 2000: 638-660
  • National Research Council Panel. Networking Health: Prescriptions for the Internet. Altman, R. B. Washington, DC: National Academy Press. 2000: 1
  • Pattern recognition of genomic features with microarrays: site typing of Mycobacterium tuberculosis strains. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology Raychaudhuri, S., Stuart, J. M., Liu, X., Small, P. M., Altman, R. B. 2000; 8: 286-295


    Mycobacterium tuberculosis (M. tb.) strains differ in the number and locations of a transposon-like insertion sequence known as IS6110. Accurate detection of this sequence can be used as a fingerprint for individual strains, but can be difficult because of noisy data. In this paper, we propose a non-parametric discriminant analysis method for predicting the locations of the IS6110 sequence from microarray data. Polymerase chain reaction extension products generated from primers specific for the insertion sequence are hybridized to a microarray containing targets corresponding to each open reading frame in M. tb. To test for insertion sites, we use microarray intensity values extracted from small windows of contiguous open reading frames. Rank-transformation of spot intensities and first-order differences in local windows provide enough information to reliably determine the presence of an insertion sequence. The nonparametric approach outperforms all other methods tested in this study.

    View details for PubMedID 10977090

  • Computational modeling of structural experimental data RNA-LIGAND INTERACTIONS PT A Bada, M. A., Altman, R. B. 2000; 317: 470-491

    View details for Web of Science ID 000087898000028

    View details for PubMedID 10829296

  • Principal components analysis to summarize microarray experiments: application to sporulation time series. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Raychaudhuri, S., Stuart, J. M., Altman, R. B. 2000: 455-466


    A series of microarray experiments produces observations of differential expression for thousands of genes across multiple conditions. It is often not clear whether a set of experiments are measuring fundamentally different gene expression states or are measuring similar states created through different mechanisms. It is useful, therefore, to define a core set of independent features for the expression states that allow them to be compared directly. Principal components analysis (PCA) is a statistical technique for determining the key variables in a multidimensional data set that explain the differences in the observations, and can be used to simplify the analysis and visualization of multidimensional data sets. We show that application of PCA to expression data (where the experimental conditions are the variables, and the gene expression measurements are the observations) allows us to summarize the ways in which gene responses vary under different conditions. Examination of the components also provides insight into the underlying factors that are measured in the experiments. We applied PCA to the publicly released yeast sporulation data set (Chu et al. 1998). In that work, 7 different measurements of gene expression were made over time. PCA on the time-points suggests that much of the observed variability in the experiment can be summarized in just 2 components--i.e. 2 variables capture most of the information. These components appear to represent (1) overall induction level and (2) change in induction level over time. We also examined the clusters proposed in the original paper, and show how they are manifested in principal component space. Our results are available on the internet at http:¿ .

    View details for PubMedID 10902193
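
    The PCA setup described in this abstract (genes as observations, experimental conditions as variables) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy expression matrix, and the use of power iteration to find the leading component are all assumptions made for illustration.

```python
import math

def center_columns(rows):
    """Subtract each column (condition) mean so every variable has mean zero."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in rows]

def covariance(centered):
    """Sample covariance matrix between conditions (columns)."""
    n, m = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(m)] for i in range(m)]

def matvec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def top_component(cov, iterations=500):
    """Power iteration: return (eigenvalue, unit eigenvector) of the leading component."""
    m = len(cov)
    v = [float(i + 1) for i in range(m)]  # arbitrary non-zero start vector (assumption)
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = matvec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
    return eigenvalue, v

def explained_fraction(cov, eigenvalue):
    """Share of total variance (trace of the covariance) captured by one component."""
    return eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Hypothetical toy data: 4 genes (rows) measured at 3 time points (columns).
expression = [
    [0.1, 0.5, 1.2],
    [0.0, 0.9, 2.1],
    [-0.2, -0.4, -1.0],
    [0.3, 1.1, 2.4],
]
cov = covariance(center_columns(expression))
value, vector = top_component(cov)
print("leading component:", vector)
print("fraction of variance explained:", explained_fraction(cov, value))
```

    A real analysis would extract several components (the abstract reports that two suffice for the sporulation data) and would normally rely on a numerical library rather than hand-written power iteration.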

  • AI in medicine - The spectrum of challenges from managed care to molecular medicine AI MAGAZINE Altman, R. B. 1999; 20 (3): 67-77
  • Automated diagnosis of data-model conflicts using metadata JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Chen, R. O., Altman, R. B. 1999; 6 (5): 374-392


    The authors describe a methodology for helping computational biologists diagnose discrepancies they encounter between experimental data and the predictions of scientific models. The authors call these discrepancies data-model conflicts. They have built a prototype system to help scientists resolve these conflicts in a more systematic, evidence-based manner. In computational biology, data-model conflicts are the result of complex computations in which data and models are transformed and evaluated. Increasingly, the data, models, and tools employed in these computations come from diverse and distributed resources, contributing to a widening gap between the scientist and the original context in which these resources were produced. This contextual rift can contribute to the misuse of scientific data or tools and amplifies the problem of diagnosing data-model conflicts. The authors' hypothesis is that systematic collection of metadata about a computational process can help bridge the contextual rift and provide information for supporting automated diagnosis of these conflicts. The methodology involves three major steps. First, the authors decompose the data-model evaluation process into abstract functional components. Next, they use this process decomposition to enumerate the possible causes of the data-model conflict and direct the acquisition of diagnostically relevant metadata. Finally, they use evidence statically and dynamically generated from the metadata collected to identify the most likely causes of the given conflict. They describe how these methods are implemented in a knowledge-based system called GRENDEL and show how GRENDEL can be used to help diagnose conflicts between experimental data and computationally built structural models of the 30S ribosomal subunit.

    View details for Web of Science ID 000082447300006

    View details for PubMedID 10495098

  • RiboWeb: An ontology-based system for collaborative molecular biology IEEE INTELLIGENT SYSTEMS & THEIR APPLICATIONS Altman, R. B., Bada, M., Chai, X. Q., Carillo, M. W., Chen, R. O., Abernethy, N. F. 1999; 14 (5): 68-76
  • Sophia: A flexible, Web-based knowledge server IEEE INTELLIGENT SYSTEMS & THEIR APPLICATIONS Abernethy, N. F., Wu, J. J., Hewett, M., Altman, R. B. 1999; 14 (4): 79-85
  • Are predicted structures good enough to preserve functional sites? STRUCTURE Wei, L. P., Huang, E. S., Altman, R. B. 1999; 7 (6): 643-650


    A principal goal of structure prediction is the elucidation of function. We have studied the ability of computed models to preserve the microenvironments of functional sites. In particular, 653 model structures of a calcium-binding protein (generated using an ab initio folding protocol) were analyzed, and the degree to which calcium-binding sites were recognizable was assessed. While some model structures preserve the calcium-binding microenvironments, many others, including some with low root mean square deviations (rmsds) from the crystal structure of the native protein, do not. There is a very weak correlation between the overall rmsd of a structure and the preservation of calcium-binding sites. Only when the quality of the model structure is high (rmsd less than 2 A for atoms in the 7 A local neighborhood around calcium) does the modeling of the binding sites become reliable. Protein structure prediction methods need to be assessed in terms of their preservation of functional sites. High-resolution structures are necessary for identifying binding sites such as calcium-binding sites.

    View details for Web of Science ID 000080967100007

    View details for PubMedID 10404593

  • Using imperfect secondary structure predictions to improve molecular structure computations BIOINFORMATICS Chen, C. C., Singh, J. P., Altman, R. B. 1999; 15 (1): 53-65


    Until ab initio structure prediction methods are perfected, the estimation of structure for protein molecules will depend on combining multiple sources of experimental and theoretical data. Secondary structure predictions are a particularly useful source of structural information, but are currently only approximately 70% correct, on average. Structure computation algorithms which incorporate secondary structure information must therefore have methods for dealing with predictions that are imperfect. EXPERIMENTS PERFORMED: We have modified our algorithm for probabilistic least squares structural computations to accept 'disjunctive' constraints, in which a constraint is provided as a set of possible values, each weighted with a probability. Thus, when a helix is predicted, the distances associated with a helix are given most of the weight, but some weights can be allocated to the other possibilities (strand and coil). We have tested a variety of strategies for this weighting scheme in conjunction with a baseline synthetic set of sparse distance data, and compared it with strategies which do not use disjunctive constraints. Naive interpretations in which predictions were taken as 100% correct led to poor-quality structures. Interpretations that allow disjunctive constraints are quite robust, and even relatively poor predictions (58% correct) can significantly increase the quality of computed structures (almost halving the RMS error from the known structure). Secondary structure predictions can be used to improve the quality of three-dimensional structural computations. In fact, when interpreted appropriately, imperfect predictions can provide almost as much improvement as perfect predictions in three-dimensional structure calculations.

    View details for Web of Science ID 000079090200006

    View details for PubMedID 10068692

  • Proceedings of Pacific Symposium on Biocomputing 1999. edited by Altman, R., Dunker, K., Hunter, L. 1999
  • RiboWeb: An Ontology-Based System for Collaborative Molecular Biology. IEEE Intelligent Systems and Their Applications Altman, R., Chen, R., Abernethy, N., Bada, M. 1999; 14 (5): 68-76
  • AI in medicine: The spectrum of challenges from managed care to molecular medicine. AI Magazine Altman, R. 1999; 20 (3): 67-77
  • Hierarchical organization of molecular structure computations JOURNAL OF COMPUTATIONAL BIOLOGY Chen, C. C., Singh, J. P., Altman, R. B. 1998; 5 (3): 409-422


    The task of computing molecular structure from combinations of experimental and theoretical constraints is expensive because of the large number of estimated parameters (the 3D coordinates of each atom) and the rugged landscape of many objective functions. For large molecular ensembles with multiple protein and nucleic acid components, the problem of maintaining tractability in structural computations becomes critical. A well-known strategy for solving difficult problems is divide-and-conquer. For molecular computations, there are two ways in which problems can be divided: (1) using the natural hierarchy within biological macromolecules (taking advantage of primary sequence, secondary structural subunits and tertiary structural motifs, when they are known); and (2) using the hierarchy that results from analyzing the distribution of structural constraints (providing information about which substructures are constrained to one another). In this paper, we show that these two hierarchies can be complementary and can provide information for efficient decomposition of structural computations. We demonstrate five methods for building such hierarchies--two automated heuristics that use both natural and empirical hierarchies, one knowledge-based process using both hierarchies, one method based on the natural hierarchy alone, and for completeness one random hierarchy oblivious to auxiliary information--and apply them to a data set for the procaryotic 30S ribosomal subunit using our probabilistic least squares structure estimation algorithm. We show that the three methods that combine natural hierarchies with empirical hierarchies create decompositions which increase the efficiency of computations by as much as 50-fold. There is only half this gain when using the natural decomposition alone, while the random hierarchy suggests that a speedup of about five can be expected just by virtue of having a decomposition. Although the knowledge-based method performs marginally better, the automatic heuristics are easier to use, scale more reliably to larger problems, and can match the performance of knowledge-based methods if provided with basic structural information.

    View details for Web of Science ID 000075921100005

    View details for PubMedID 9773341

  • Reuse, CORBA, and knowledge-based systems INTERNATIONAL JOURNAL OF HUMAN-COMPUTER STUDIES Gennari, J. H., Cheng, H. N., Altman, R. B., Musen, M. A. 1998; 49 (4): 523-546
  • Bioinformatics in support of molecular medicine JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Altman, R. B. 1998: 53-61


    Bioinformatics studies two important information flows in modern biology. The first is the flow of genetic information from the DNA of an individual organism up to the characteristics of a population of such organisms (with an eventual passage of information back to the genetic pool, as encoded within DNA). The second is the flow of experimental information from observed biological phenomena to models that explain them, and then to new experiments in order to test these models. The discipline of bioinformatics has its roots in a number of activities, including the organization of DNA sequence and protein three-dimensional structural data collections in the 1960's and 1970's. It has become a booming academic and industrial enterprise with the introduction of biological experiments that rapidly produce massive amounts of data (such as the multiple genome sequencing projects, the large scale analysis of gene expression, and the large scale analysis of protein-protein interactions). Basic biological science has always had an impact on clinical medicine (and clinical medical information systems), and is creating a new generation of epidemiologic, diagnostic, prognostic, and treatment modalities. Bioinformatics efforts that appear to be wholly geared towards basic science are likely to become relevant to clinical informatics in the coming decade. For example, DNA sequence information and sequence annotations will appear in the medical chart with increasing frequency. The algorithms developed for research in bioinformatics will soon become part of clinical information systems.

    View details for Web of Science ID 000171768600009

    View details for PubMedID 9929182

  • Bioinformatics in Support of Molecular Medicine. Altman, R. 1998
  • MHCWeb: Converting a WWW Database into a Knowledge-based Collaborative Environment. Hon, L., Abernethy, N., Brusic, V., Chai, J., Altman, R. 1998
  • A Curriculum for Bioinformatics: The Time is Ripe. Bioinformatics Altman, R. 1998; 14 (7): 549-550
  • Graphical Style Sheets: Towards Reusable Representations of Biomedical Graphics. Felciano, R., Altman, R. 1998
  • Determination of the Spatial Distribution of Protein Structure Using Solution Data. Altman, R., Duncan, B., Brinkley, J., Buchanan, B., Jardetzky, O. edited by Jaroszewski, J., Schaumburg, K., Kofod, H. 1998
  • SOPHIA: Providing Basic Knowledge Services with a Common DBMS. Abernethy, N., Altman, R. edited by Borgida, A., Chaudhri, V., Staudt, M. 1998
  • Updated Bibliography Using the RELATED ARTICLES Function within PubMed. Liu, X., Altman, R. 1998
  • Probabilistic and Statistical Descriptions of Protein Structure. Computational Biology: Pattern Analysis and Machine Learning Methods Wei, L., Chang, J., Altman, R. edited by Salzberg, S., Searls, D., Kasif, S. London, UK: Elsevier Science. 1998: 207-225
  • The Hierarchical Organization of Molecular Structure Computation. In: RECOMB-98 Chen, C., Singh, J., Altman, R. New York: ACM Press. 1998: 51-59
  • Proceedings of Pacific Symposium on Biocomputing 1998. edited by Altman, R., Dunker, K., Hunter, L. 1998
  • PROTEAN: Deriving Protein Structure from Constraints. Blackboard Systems Hayes-Roth, B., Buchanan, B., Lichtarge, O., Hewett, M., Altman, R., Brinkley, J. edited by Engelmore, R., Morgan, A. Wokingham: Addison-Wesley. 1998: 417-431
  • The Hierarchical Organization of Molecular Structure Computations. Journal of Computational Biology Chen, C., Singh, J., Altman, R. 1998; 5 (3): 409-422
  • Updating a bibliography using the RELATED ARTICLES function within PubMed JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Liu, X. L., Altman, R. B. 1998: 750-754


    Comprehensive bibliographies are useful for conducting reviews of the literature, and for assessing the progress within a field. These bibliographies may be broad and inclusive, or focused and precise in their inclusion criteria. In either case, the task of maintaining a complete bibliography within a particular area of research is made difficult by the diversity, complexity and huge volume of newly published literature. In an effort to effectively and automatically retrieve relevant literature, different search strategies and indexing tools have been developed, including the RELATED ARTICLES function provided with the PubMed system. In this paper, we report a program for incremental updates of a bibliography using the PubMed RELATED ARTICLES function. Given a highly specialized starting bibliography of experimental measurements of the structure of the 30S bacterial ribosomal subunit, the system was applied to find additional relevant references. For this particular task, the system has a recall of 75%, a strict precision of 32% and a partial precision of 42%. Our results are notable because although the RELATED ARTICLES function is purely statistical, it is nonetheless able to select a very narrowly defined set of articles from the literature. We discuss the tradeoffs between having a user evaluate many articles of possible interest in a single session, versus asking a user to evaluate a small set of articles on a periodic basis.

    View details for Web of Science ID 000171768600146

    View details for PubMedID 9929319

  • A surface measure for probabilistic structural computations. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology Schmidt, J. P., Chen, C. C., Cooper, J. L., Altman, R. B. 1998; 6: 148-156


    Computing three-dimensional structures from sparse experimental constraints requires methods for combining heterogeneous sources of information, such as distances, angles, and measures of total volume, shape, and surface. For some types of information, such as distances between atoms, numerous methods are available for computing structures that satisfy the provided constraints. It is more difficult, however, to use information about the degree to which an atom is on the surface or buried as a useful constraint during structure computations. Surface measures have been used as accept/reject criteria for previously computed structures, but this is not an efficient strategy. In this paper, we investigate the efficacy of applying a surface measure in the computation of molecular structure, using a method of probabilistic least squares computations which facilitates the introduction of multiple, noisy, heterogeneous data sources. For this purpose, we introduce a simple purely geometrical measure of surface proximity called maximal conic view (MCV). MCV is efficiently computable and differentiable, and is hence well suited to driving a structural optimization method based, in part, on surface data. As an initial validation, we show that MCV correlates well with known measures for total exposed surface area. We use this measure in our experiments to show that information about surface proximity (derived from theory or experiment, for example) can be added to a set of distance measurements to increase significantly the quality of the computed structure. In particular, when 30 to 50 percent of all possible short-range distances are provided, the addition of surface information improves the quality of the computed structure (as measured by RMS fit) by as much as 80 percent. Our results demonstrate that knowledge of which atoms are on the surface and which are buried can be used as a powerful constraint in estimating molecular structure.

    View details for PubMedID 9783220

  • MHCWeb: Converting a WWW database into a knowledge-based collaborative environment JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Hon, L., Abernethy, N. F., Brusic, V., Chai, J., Altman, R. B. 1998: 947-951


    The World Wide Web (WWW) is useful for distributing scientific data. Most existing web data resources organize their information either in structured flat files or relational databases with basic retrieval capabilities. For databases with one or a few simple relations, these approaches are successful, but they can be cumbersome when there is a data model involving multiple relations between complex data. We believe that knowledge-based resources offer a solution in these cases. Knowledge bases have explicit declarations of the concepts in the domain, along with the relations between them. They are usually organized hierarchically, and provide a global data model with a controlled vocabulary. We have created the OWEB architecture for building online scientific data resources using knowledge bases. OWEB provides a shell for structuring data, providing secure and shared access, and creating computational modules for processing and displaying data. In this paper, we describe the translation of the online immunological database MHCPEP into an OWEB system called MHCWeb. This effort involved building a conceptual model for the data, creating a controlled terminology for the legal values for different types of data, and then translating the original data into the new structure. The OWEB environment allows for flexible access to the data by both users and computer programs.

    View details for Web of Science ID 000171768600185

    View details for PubMedID 9929358

  • Recognizing protein binding sites using statistical descriptions of their 3D environments. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Wei, L., Altman, R. B. 1998: 497-508


    We have developed a new method for recognizing sites in three-dimensional protein structures. Our method is based on our previously reported algorithm for creating descriptions of protein microenvironments using physical and chemical properties at multiple levels of detail (including features at the atomic, chemical group, residue, and secondary structural levels). The recognition method takes three inputs: a set of sites that share some structural or functional role, a set of control nonsites that lack this role, and a single query site. The values of properties for the query site are compared to the distributions of values for both sites and nonsites to determine the group to which it is most similar. A log-odds scoring function, based on Bayes' Rule, computes a score that indicates the likelihood that the query region is a site of interest. In this paper, we apply the method to the task of identifying calcium binding sites in proteins. Cross-validation analysis shows that this recognition approach has high sensitivity and specificity. We also describe the results of scanning four calcium binding proteins (with the calcium removed) using a three-dimensional grid of probe points at 2 Å spacing. The probe points that have high scores cluster around the true calcium binding sites, with the highest scoring points at or near the binding sites. The method fails in only one case where a calcium binding site is created by four proteins in the crystal lattice, and is thus not recognizable within the crystallographic asymmetric unit. Our results show that property-based descriptions can be used for recognizing protein sites in unannotated structures.

    View details for PubMedID 9697207
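
    The log-odds scoring described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the histogram binning, the pseudocount, the assumption that properties are independent, and every name used (`bin_index`, `bin_probabilities`, `log_odds_score`, the `oxygen_count` property) are hypothetical choices made only for illustration.

```python
import math

def bin_index(value, edges):
    """Return the index of the histogram bin that contains value."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def bin_probabilities(values, edges, pseudocount=1.0):
    """Estimate P(bin) from training values, smoothed with a pseudocount."""
    counts = [pseudocount] * (len(edges) + 1)
    for v in values:
        counts[bin_index(v, edges)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def log_odds_score(query, site_values, nonsite_values, edges, prior_site=0.5):
    """Sum per-property log ratios P(bin | site) / P(bin | nonsite).

    Assumes (for this sketch only) that properties are independent,
    so Bayes' rule reduces to a sum of log-likelihood ratios plus the
    prior log odds.
    """
    score = math.log(prior_site / (1.0 - prior_site))
    for prop, value in query.items():
        p_site = bin_probabilities(site_values[prop], edges[prop])
        p_non = bin_probabilities(nonsite_values[prop], edges[prop])
        b = bin_index(value, edges[prop])
        score += math.log(p_site[b] / p_non[b])
    return score

# Toy data: a hypothetical count of nearby oxygen atoms.
sites = {"oxygen_count": [5, 6, 7, 6]}
nonsites = {"oxygen_count": [0, 1, 2, 1]}
edges = {"oxygen_count": [3.0]}
print(log_odds_score({"oxygen_count": 6}, sites, nonsites, edges))
```

    A positive score means the query resembles the sites more than the nonsites; a negative score means the opposite.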

  • Informatics in the care of patients: Ten notable challenges WESTERN JOURNAL OF MEDICINE Altman, R. B. 1997; 166 (2): 118-122


    What is medical informatics, and why should practicing physicians care about it? Medical informatics is the study of the concepts and conceptual relationships within biomedical information and how they can be harnessed for practical applications. In the past decade, the field has exploded as health professionals recognize the importance of strategic information management and the inadequacies of traditional tools for information storage, retrieval, and analysis. At the same time that medical informatics has established a presence within many academic and industrial research facilities, its goals and methods have become less clear to practicing physicians. In this article, I outline 10 challenges in medical informatics that provide a framework for understanding developments in the field. These challenges have been divided into those relating to infrastructure, specific performance, and evaluation. The primary goals of medical informatics, as for any other branch of biomedical research, are to improve the overall health of patients by combining basic scientific and engineering insights with the useful application of these insights to important problems.

    View details for Web of Science ID A1997WR20700003

    View details for PubMedID 9109328

  • LPFC: An Internet library of protein family core structures PROTEIN SCIENCE Schmidt, R., Gerstein, M., Altman, R. B. 1997; 6 (1): 246-248


    As the number of protein molecules with known, high-resolution structures increases, it becomes necessary to organize these structures for rapid retrieval, comparison, and analysis. The Protein Data Bank (PDB) currently contains nearly 5,000 entries and is growing exponentially. Most new structures are similar structurally to ones reported previously and can be grouped into families. As the number of members in each family increases, it becomes possible to summarize, statistically, the commonalities and differences within each family. We reported previously a method for finding the atoms in a family alignment that have low spatial variance and those that have higher spatial variance (i.e., the "core" atoms that have the same relative position in all family members and the "non-core" atoms that do not). The core structures we compute have biological significance and provide an excellent quantitative and visual summary of a multiple structural alignment. In order to extend their utility, we have constructed a library of protein family cores, accessible over the World Wide Web at http:/ / This library is generated automatically with publicly available computer programs requiring only a set of multiple alignments as input. It contains quantitative analysis of the spatial variation of atoms within each protein family, the coordinates of the average core structures derived from the families, and display files (in bitmap and VRML formats). Here, we describe the resource and illustrate its applicability by comparing three multiple alignments of the globin family. These three alignments are found to be similar, but with some significant differences related to the diversity of family members and the specific method used for alignment.

    View details for Web of Science ID A1997WD20100027

    View details for PubMedID 9007997

  • RiboWeb: Linking Structural Computations to a Knowledge Base of Published Experimental Data. Chen, R., Felciano, R., Altman, R. 1997
  • Standardized Representations of the Literature: Combining Diverse Sources of Ribosomal Data. Altman, R., Abernethy, N., Chen, R. 1997
  • Using the Radial Distribution of Physical Features to Compare Amino Acid Environments. Wei, L., Altman, R., Chang, J. edited by Altman, R., Dunker, K., Hunter, L. 1997
  • Proceedings of Pacific Symposium on Biocomputing 1997. edited by Altman, R., Dunker, K., Hunter, L. 1997
  • RNA secondary structure as a reusable interface to biological information resources GENE-COMBIS Felciano, R. M., Chen, R. O., Altman, R. B. 1997; 190: GC59-GC70


    The dissemination of biological information has become critically dependent on the Internet and World Wide Web (WWW), which enable distributed access to information in a platform independent manner. The mode of interaction between biologists and on-line information resources, however, has been mostly limited to simple interface technologies such as hypertext links, tables and forms. The introduction of platform-independent runtime environments facilitates the development of more sophisticated WWW-based user interfaces. Until recently, most such interfaces have been tightly coupled to the underlying computation engines, and not separated as reusable components. We believe that many subdisciplines of biology have intuitive and familiar graphical representations of knowledge that can serve as multipurpose user interface elements. We call such graphical idioms "domain graphics". In order to illustrate the power of such graphics, we have built a reusable interface based on the standard two-dimensional (2D) layout of RNA secondary structure. The interface can be used to represent any pre-computed layout of RNA, and takes as parameters the sets of actions to be performed as a user interacts with the interface. It can provide to any associated application program information about the base, helix, or subsequence selected by the user. We show the versatility of this interface by using it as a special purpose interface to BLAST, Medline and the RNA MFOLD search/compute engines. These demonstrations are available at: gene-combis-96/

    View details for Web of Science ID A1997WP51800001

    View details for PubMedID 9197551

  • Standardized representations of the literature: Combining diverse sources of ribosomal data ISMB-97 - FIFTH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY, PROCEEDINGS Altman, R. B., Abernethy, N. F., Chen, R. O. 1997: 15-24


    We are building a knowledge base (KB) of published structural data on the 30S ribosomal subunit in prokaryotes. Our KB is distinguished by a standardized representation of biological experiments and their results, in a reusable format. It can be accessed by computer programs that exploit the rich interconnections within the data. The KB is designed to support the construction of 3D models of the 30S subunit, as well as the analysis and extension of relevant functional and phylogenetic information. Most published information about the structure of the ubiquitous ribosome focuses on E. coli as a model system. At the same time, thousands of RNA sequences for the ribosome have been gathered and cataloged. The volume and complexity of these data can complicate attempts to separate structural data peculiar to E. coli from data of universal relevance. We have written an application that dynamically queries the KB and the Ribosome Database Project, a repository of ribosomal RNA sequences from other organisms, in order to assess the relevance of structural data to particular organisms. The application uses the RDP alignment to determine whether a set of data refer primarily to conserved, mismatched, or gapped positions. For a set of 16 representative articles evaluated over 211 sequences, 73% of observations have unambiguous translations from E. coli to the other organisms, 21% have somewhat ambiguous translations, and 6% have no translations. There is a wide variation in these numbers over different articles and organisms, confirming that some articles report structural information specific to E. coli while others report information that is quite general.

    View details for Web of Science ID 000072320000002

    View details for PubMedID 9322010
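
    The conserved / mismatched / gapped classification mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the alignment rows are plain strings with `-` for gaps, the reference row stands in for E. coli, and the function names and toy sequences are hypothetical. How the published application maps these categories to "unambiguous" or "ambiguous" translations is not stated in the abstract and is not modeled here.

```python
from collections import Counter

def classify_position(reference_base, target_base):
    """Label one alignment column relative to the reference organism."""
    if target_base == "-":
        return "gapped"
    if target_base.upper() == reference_base.upper():
        return "conserved"
    return "mismatched"

def summarize_positions(reference_row, target_row, columns):
    """Count the labels over the alignment columns that an experiment refers to."""
    return Counter(
        classify_position(reference_row[c], target_row[c]) for c in columns
    )

# Toy aligned rows (hypothetical sequences, not real rRNA data).
reference = "GAUC-A"
target = "GCU-GA"
print(summarize_positions(reference, target, [0, 1, 2, 3]))
```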

  • RIBOWEB: Linking structural computations to a knowledge base of published experimental data ISMB-97 - FIFTH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY, PROCEEDINGS Chen, R. O., Felciano, R., Altman, R. B. 1997: 84-87


    The world wide web (WWW) has become critical for storing and disseminating biological data. It offers an additional opportunity, however, to support distributed computation and sharing of results. Currently, computational analysis tools are often separated from the data in a manner that makes iterative hypothesis testing cumbersome. We hypothesize that the cycle of scientific reasoning (using data to build models, and evaluating models in light of data) can be facilitated with resources that link computations with semantic models of the data. Riboweb is an on-line knowledge-based resource that supports the creation of three-dimensional models of the 30S ribosomal subunit. It has three components: (I) a knowledge base containing representations of the essential physical components and published structural data, (II) computational modules that use the knowledge base to build or analyze structural models, and (III) a web-based user interface that supports multiple users, sessions and computations. We have built a prototype of Riboweb, and have used it to refine a rough model of the central domain of the 30S subunit from E. coli. Our results suggest that sophisticated and integrated computational capabilities can be delivered to biologists using this simple three-component architecture.

    View details for Web of Science ID 000072320000011

    View details for PubMedID 9322019

  • Using the radial distributions of physical features to compare amino acid environments and align amino acid sequences. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Wei, L., Altman, R. B., Chang, J. T. 1997: 465-476


    We have performed a comprehensive analysis of the microenvironments surrounding the twenty amino acids. Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments. We describe the amino acid environments with a set of 21 features summarizing atomic, chemical group, residue, and secondary structural features. The environments are divided into radial shells of 1 Å thickness to represent the distance of the features from the amino acid C beta atoms. We make the results of our analysis available graphically over the world wide web. To illustrate the validity and utility of our analysis, we used the amino acid comparative profiles to construct a substitution matrix, the WAC matrix, based on a simple summary of the computed environmental differences. We compared our matrix to BLOSUM62 and PAM250 in BLAST searches with query sequences selected from 39 protein families found in the PROSITE database. Although BLOSUM62 was the most sensitive matrix overall, our matrix was more sensitive for some families, and exhibited overall performance similar to PAM250. Our results suggest that the radial distribution of biochemical and biophysical features is useful for comparing amino acid environments, and that similarity matrices based on the geometric distribution of features around amino acids may produce improved search sensitivity.

    View details for PubMedID 9390315
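
    The radial-shell description in the abstract above amounts to counting features by distance from a center point. A minimal sketch follows, assuming atoms are given as `(x, y, z, feature_name)` tuples in ångströms; the function name, the tuple layout, and the toy features are assumptions for illustration and do not reproduce the paper's 21 features.

```python
import math

def radial_shell_counts(center, atoms, shell_width=1.0, n_shells=10):
    """Count occurrences of each feature in concentric shells around center.

    atoms is a list of (x, y, z, feature_name) tuples. Shell k covers
    distances in [k * shell_width, (k + 1) * shell_width).
    """
    counts = {}
    for x, y, z, feature in atoms:
        distance = math.dist(center, (x, y, z))
        shell = int(distance // shell_width)
        if shell < n_shells:
            counts.setdefault(feature, [0] * n_shells)[shell] += 1
    return counts

# Toy environment around a hypothetical C-beta atom at the origin.
toy_atoms = [
    (0.5, 0.0, 0.0, "carbon"),
    (1.5, 0.0, 0.0, "carbon"),
    (0.0, 2.2, 0.0, "oxygen"),
    (0.0, 0.0, 20.0, "oxygen"),  # beyond the outermost shell, ignored
]
print(radial_shell_counts((0.0, 0.0, 0.0), toy_atoms, n_shells=3))
```

    Two environments described this way can then be compared shell by shell and feature by feature.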

  • Computational methods for defining the allowed conformational space of 16S rRNA based on chemical footprinting data RNA-A PUBLICATION OF THE RNA SOCIETY Fink, D. L., Chen, R. O., Noller, H. F., Altman, R. B. 1996; 2 (9): 851-866


    Structural models for 16S ribosomal RNA have been proposed based on combinations of crosslinking, chemical protection, shape, and phylogenetic evidence. These models have been based for the most part on independent data sets and different sets of modeling assumptions. In order to evaluate such models meaningfully, methods are required to explicitly model the spatial certainty with which individual structural components are positioned by specific data sets. In this report, we use a constraint satisfaction algorithm to explicitly assess the location of the secondary structural elements of the 16S RNA, as well as the certainty with which these elements can be positioned. The algorithm initially assumes that these helical elements can occupy any position and orientation and then systematically eliminates those positions and orientations that do not satisfy formally parameterized interpretations of structural constraints. Using a conservative interpretation of the hydroxyl radical footprinting data, the positions of the ribosomal proteins as defined by neutron diffraction studies, and the secondary structure of 16S rRNA, the location of the RNA secondary structural elements can be defined with an average precision of 25 Å (ranging from 12.8 to 56.3 Å). The uncertainty in individual helix positions is both heterogeneous and dependent upon the number of constraints imposed on the helix. The topology of the resulting model is consistent with previous models based on independent approaches. The result of our computation is a conservative upper bound on the possible positions of the RNA secondary structural elements allowed by this data set, and provides a suitable starting point for refinement with other sources of data or different sets of modeling assumptions.

    View details for Web of Science ID A1996VH69500001

    View details for PubMedID 8809013
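
    The elimination strategy described in the abstract above (start from all candidate placements, discard those that violate a constraint) can be sketched in a much simplified form. The sketch below is an assumption-laden illustration, not the published algorithm: it treats each element as a point on a coarse grid, ignores orientation, and models each constraint as a maximum distance to a fixed anchor. All names and numbers are hypothetical.

```python
import math
from itertools import product

def grid(extent, step):
    """All grid points in a cube from -extent to +extent along each axis."""
    ticks = range(-extent, extent + 1, step)
    return [tuple(float(v) for v in p) for p in product(ticks, repeat=3)]

def prune_positions(candidates, constraints):
    """Keep only candidates within max_dist of every (anchor, max_dist) pair."""
    return [
        p for p in candidates
        if all(math.dist(p, anchor) <= max_dist for anchor, max_dist in constraints)
    ]

def spread(positions):
    """Largest distance between two surviving positions: a crude precision measure."""
    return max(math.dist(a, b) for a in positions for b in positions)

candidates = grid(20, 10)
one = prune_positions(candidates, [((0.0, 0.0, 0.0), 10.0)])
two = prune_positions(
    candidates, [((0.0, 0.0, 0.0), 10.0), ((10.0, 0.0, 0.0), 10.0)]
)
print(len(candidates), len(one), len(two), spread(two))
```

    Adding the second constraint shrinks the surviving set, which mirrors the abstract's observation that positional uncertainty depends on the number of constraints imposed on an element.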

  • Constraining volume by matching the moments of a distance distribution COMPUTER APPLICATIONS IN THE BIOSCIENCES Chen, C. C., Chen, R. O., Altman, R. B. 1996; 12 (4): 319-326


    The problem of computing a molecular structure from a set of distances arises in the interpretation of NMR data as well as other experimental methods that yield distance information. Techniques for computing structures must find conformations consistent with the distance data. There are often other constraints on the structure that must be satisfied as well. One of the most problematic constraints is the constraint on the total volume occupied by the atoms. In this paper, we use the first two moments (mean and variance) of an estimated distance distribution to constrain the volume of a computed structure. We show that a probabilistic algorithm for matching the first two moments of the estimated distance distribution significantly improves the quality of the solution, especially when the distance information alone is not sufficient to define the structure precisely. We also show that our method is not sensitive to small errors in the estimates of mean and variance of the distance distribution. Finally, we demonstrate the use of this constraint in computing a low-resolution structure of the 30S prokaryotic ribosomal subunit. Quantitative analysis of our results allows us to assess the information content contained in constraints on volume, and to show that in some cases addition of a volume constraint adds information roughly equivalent to doubling the number of input distances. Our results also demonstrate the flexibility of probabilistic representations of structural constraints, and the importance of including volume information to constrain structural computations-especially in the case of sparse data.

    View details for Web of Science ID A1996VM02500008

    View details for PubMedID 8902359
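
    The idea in the abstract above of constraining volume through the first two moments of the distance distribution can be illustrated with a small sketch. It only computes the moments and a squared-error penalty against target values; the probabilistic update used in the paper is not reproduced, and the function names and the unit-square example are assumptions for illustration.

```python
import math
from itertools import combinations
from statistics import fmean, pvariance

def distance_moments(coords):
    """Mean and variance of all pairwise interatomic distances."""
    distances = [math.dist(a, b) for a, b in combinations(coords, 2)]
    return fmean(distances), pvariance(distances)

def moment_penalty(coords, target_mean, target_variance):
    """Squared mismatch between observed and target distance moments."""
    mean, variance = distance_moments(coords)
    return (mean - target_mean) ** 2 + (variance - target_variance) ** 2

# A unit square and the same square scaled by two (hypothetical coordinates).
square = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
expanded = [(2 * x, 2 * y, 2 * z) for x, y, z in square]

target_mean, target_variance = distance_moments(square)
print(moment_penalty(square, target_mean, target_variance))
print(moment_penalty(expanded, target_mean, target_variance))
```

    An over-expanded structure has a larger mean distance and is penalized, which is how the moments act as a proxy for total volume.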

  • Images in clinical medicine. Knotted umbilical cord. New England journal of medicine Altman, R. B., Merino, J. E. 1996; 334 (9): 573

    View details for PubMedID 8569825

  • Knotted umbilical cord NEW ENGLAND JOURNAL OF MEDICINE Altman, R. B., Merino, J. E. 1996; 334 (9): 573-573
  • Conserved features in the active site of nonhomologous serine proteases FOLDING & DESIGN Bagley, S. C., Altman, R. B. 1996; 1 (5): 371-379


    Serine protease activity is critical for many biological processes and has arisen independently in a few different protein families. It is not clear, though, the degree to which these protease families share common biochemical and biophysical properties. We have used a computer program to study the properties that are shared by four serine protease active sites with no overall structural or sequence homology. The program systematically compares the region around the catalytic histidines from the four proteins with a set of noncatalytic histidines, used as controls. It reports the three-dimensional locations and level of statistical significance for those properties that distinguish the catalytic histidines from the noncatalytic ones. The method of analysis is general and can be applied easily to other active sites of interest. As expected, some of the reported properties correspond to previously known features of the serine protease active site, including the catalytic triad and the oxyanion hole. Novel properties are also found, including the spatial distribution of charged, polar, and hydrophobic groups arranged to stabilize the catalytic residues, and a relative abundance of some residues (Val, Tyr, Leu, and Gly) around the active site. Our findings show that in addition to some properties common to all the proteases examined, there is a set of preferred, but not required, properties that can be reliably observed only by aligning the sites and comparing them with carefully selected statistical controls.

    View details for Web of Science ID A1996WC40600007

    View details for PubMedID 9080183

  • Using the radial distributions of physical features to compare amino acid environments and align amino acid sequences PACIFIC SYMPOSIUM ON BIOCOMPUTING '97 Wei, L. P., Altman, R. B., Chang, J. T. 1996: 465-476
  • Conserved Features in the Active Site of Nonhomologous Serine Proteases. Folding & Design Bagley, S., Altman, R. 1996; 1 (5): 371-379
  • A programming course in bioinformatics for computer and information science students. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Altman, R. B., Koza, J. 1996: 73-84


    We have created a course entitled "Representations and Algorithms for Computational Molecular Biology" with three specific goals in mind. First, we want to provide a technical introduction for computer science and medical information science students to the challenges of computing with molecular biology data, particularly the advantages of having easy access to real-world data sets. Second, we want to equip the students with the skills required of productive research assistants in molecular biology computing research projects. Finally, we want to provide a showcase for local investigators to describe their work in the context of a course that provides adequate background information. In order to achieve these goals, we have created a programming course, in which three major projects and six smaller assignments are assigned during the quarter. We stress fundamental representations and algorithms during the first part of the course in lectures given by the core faculty, and then have more focused lectures in which faculty research interests are highlighted. The course stressed issues of structural molecular biology, in order to better motivate the critical issues in sequence analysis. The culmination of the course was a challenge to the students to use a version of protein threading to predict which members of a set of unknown sequences were globins. The course was well received, and has been made a core requirement in the Medical Information Sciences program.

    View details for PubMedID 9390224

  • Lamprey: tracking users on the World Wide Web. Proceedings : a conference of the American Medical Informatics Association / ... AMIA Annual Fall Symposium. AMIA Fall Symposium Felciano, R. M., Altman, R. B. 1996: 757-761


    Tracking individual web sessions provides valuable information about user behavior. This information can be used for general purpose evaluation of web-based user interfaces to biomedical information systems. To this end, we have developed Lamprey, a tool for doing quantitative and qualitative analysis of Web-based user interfaces. Lamprey can be used from any conforming browser, and does not require modification of server or client software. By rerouting WWW navigation through a centralized filter, Lamprey collects the sequence and timing of hyperlinks used by individual users to move through the web. Instead of providing marginal statistics, it retains the full information required to recreate a user session. We have built Lamprey as a standard Common Gateway Interface (CGI) that works with all standard WWW browsers and servers. In this paper, we describe Lamprey and provide a short demonstration of this approach for evaluating web usage patterns.

    View details for PubMedID 8947767
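
    The rerouting approach described in the abstract above can be sketched as link rewriting plus an event log. This is an illustration only, not Lamprey's code: the tracker address, the query parameter names, the regular expression for `href` attributes, and all function names are assumptions.

```python
import re
import time
from urllib.parse import quote, urljoin

TRACKER_URL = "http://tracker.example/cgi-bin/track"  # hypothetical filter address

def reroute_links(html, base_url, session_id):
    """Rewrite every double-quoted href so that following it passes through the tracker."""
    def rewrite(match):
        target = urljoin(base_url, match.group(2))
        tracked = (
            TRACKER_URL + "?session=" + session_id + "&url=" + quote(target, safe="")
        )
        return match.group(1) + tracked + match.group(3)
    return re.sub(r'(href=")([^"]*)(")', rewrite, html)

def record_click(log, session_id, url, timestamp=None):
    """Append one (session, url, time) event; the full sequence is kept, not a summary."""
    log.append((session_id, url, time.time() if timestamp is None else timestamp))

def replay_session(log, session_id):
    """Return ordered (url, seconds_until_next_click) pairs for one session."""
    events = sorted((t, u) for s, u, t in log if s == session_id)
    dwell = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]
    return [(u, d) for (t, u), d in zip(events, dwell + [None])]

page = '<a href="page2.html">next</a>'
print(reroute_links(page, "http://host.example/dir/page1.html", "s1"))
```

    Because every event is stored with its timestamp, a session can be replayed in order rather than reduced to marginal statistics such as hit counts.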

  • An evaluation of the TransFER model for sharing clinical decision-support applications. Proceedings : a conference of the American Medical Informatics Association / ... AMIA Annual Fall Symposium. AMIA Fall Symposium Sujansky, W., Altman, R. 1996: 468-472


    TransFER is a formal model designed to facilitate the sharing of decision-support applications across institutions with heterogeneous clinical databases. The TransFER model provides a mechanism to automatically customize database queries based on a reference schema of clinical data and an encoded set of database mappings. In this paper, we describe the elements of the TransFER model and we present the results of a formal evaluation we conducted to assess the utility and generality of the model. The results suggest that the TransFER model has significant potential for automating query translation and facilitating application sharing, but that further work on the representation of temporal semantics, on the modeling of missing data, and on the optimization of complex queries is required.

    View details for PubMedID 8947710
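
    The general idea in the abstract above of translating a query written against a reference schema into a site-specific query can be sketched as a lookup through a mapping table. This is a hypothetical illustration and does not reflect the TransFER formalism itself: the mapping format, the field names, the table and column names, and the single-table restriction are all assumptions.

```python
def translate_query(select_fields, where_field, operator, value, mapping):
    """Rewrite a reference-schema query as parameterized SQL for one site.

    mapping maps each reference field name to a (table, column) pair in
    the local database. Only single-table queries are handled here.
    """
    allowed = {"=", "<", ">", "<=", ">=", "<>"}
    if operator not in allowed:
        raise ValueError("unsupported operator: " + operator)
    tables = {mapping[f][0] for f in select_fields + [where_field]}
    if len(tables) != 1:
        raise ValueError("this sketch only handles fields stored in one table")
    columns = ", ".join(mapping[f][1] for f in select_fields)
    sql = "SELECT {} FROM {} WHERE {} {} ?".format(
        columns, tables.pop(), mapping[where_field][1], operator
    )
    return sql, (value,)

# Two hypothetical institutions storing the same concepts differently.
site_a = {"patient_id": ("labs", "pt_no"), "serum_potassium": ("labs", "k_val")}
site_b = {"patient_id": ("chem", "mrn"), "serum_potassium": ("chem", "potassium")}

for site in (site_a, site_b):
    print(translate_query(["patient_id"], "serum_potassium", ">", 5.0, site))
```

    The same decision-support rule is written once against the reference field names, and only the mapping changes between institutions.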

  • Using a measure of structural variation to define a core for the globins COMPUTER APPLICATIONS IN THE BIOSCIENCES Gerstein, M., Altman, R. B. 1995; 11 (6): 633-644


    As the database of three-dimensional protein structures expands, it becomes possible to classify related structures into families. Some of these families, such as the globins, have enough members to allow statistical analysis of conserved features. Previously, we have shown that a probabilistic representation based on means and variances can be useful for defining structural cores for large families. These cores contain the subset of atoms that are in essentially the same relative positions in all members of the family. In addition to defining a core, our method creates an ordered list of atoms, ranked by their structural variation. In applying our core-finding procedure to the globins, we find that helices A, B, G and H form a structural core with low variance. These helices fold early in the folding pathway, and superimpose well with helices in the helix-turn-helix repressor protein family. The non-core helices (F and the parts of other helices that interact with it) are associated with the functional differences among the globins, and are encoded within a separate exon. We have also compared the variability measure implicit in our core structures with measures of sequence variability, using a procedure for measuring sequence variability that helps correct for the biased sampling in the databanks. We find, somewhat surprisingly, that sequence variation does not appear to correlate with structural variation.

    View details for Web of Science ID A1995TR87100009

    View details for PubMedID 8808580
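
    The ranking of atoms by structural variation described in the abstract above can be sketched as follows. The sketch assumes the structures are already superimposed and aligned atom for atom, and it uses a single fixed variance threshold; the published procedure is more involved and is not reproduced here. Function names, the threshold, and the toy coordinates are hypothetical.

```python
from statistics import fmean

def positional_variance(positions):
    """Mean squared distance of one atom's positions from their centroid."""
    centroid = [fmean(axis) for axis in zip(*positions)]
    return fmean(
        [sum((c - m) ** 2 for c, m in zip(p, centroid)) for p in positions]
    )

def rank_atoms_by_variation(structures):
    """Return (atom indices sorted from least to most variable, variances).

    structures is a list of structures, each a list of (x, y, z)
    coordinates for the same aligned atoms in the same order.
    """
    per_atom = zip(*structures)  # group coordinates atom by atom
    variances = [positional_variance(list(positions)) for positions in per_atom]
    order = sorted(range(len(variances)), key=lambda i: variances[i])
    return order, variances

def core_atoms(structures, max_variance):
    """Atoms whose positional variance does not exceed max_variance."""
    order, variances = rank_atoms_by_variation(structures)
    return [i for i in order if variances[i] <= max_variance]

# Three toy structures with three aligned atoms each.
family = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (7.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.9, 0.0, 0.0), (3.0, 0.0, 0.0)],
]
print(core_atoms(family, max_variance=0.5))
```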



    A variety of methods are currently available for creating multiple alignments, and these can be used to define and characterize families of related proteins, such as the globins or the immunoglobulins. We have developed a method for using a multiple alignment to identify an average structural "core", a subset of atoms with low structural variation. We show how the means and variances of core-atom positions summarize the commonalities and differences within a family, making them particularly useful in compiling libraries of protein folds. We show further how it is possible to describe the rotation and translation relating two core structures, as in two domains of a multi-domain protein, in a consistent fashion in terms of a "mean" transformation and a deviation about this mean. Once determined, our average core structures (with their implicit measure of structural variation) allow us to define a measure of structural similarity more informative than the usual root-mean-square (RMS) deviation in atomic position, i.e. a "better RMS." Our average structures also permit straightforward comparisons between variation in structure and sequence at each position in a family. We have applied our core-finding methodology in detail to the immunoglobulin family. We find that the structural variability we observe just within the VL and VH domains anticipates the variability that others have observed throughout the whole immunoglobulin superfamily; that a core definition based on sequence conservation, somewhat surprisingly, does not agree with one based on structural similarity; and that the cores of the VL and VH domains vary about 5 degrees in relative orientation across the known structures.

    View details for Web of Science ID A1995RN00200014

    View details for PubMedID 7643385



    Most molecular graphics programs ignore any uncertainty in the atomic coordinates being displayed. Structures are displayed in terms of perfect points, spheres, and lines with no uncertainty. However, all experimental methods for defining structures, and many methods for predicting and comparing structures, associate uncertainties with each atomic coordinate. We have developed graphical representations that highlight these uncertainties. These representations are encapsulated in a new interactive display program, PROTEAND. PROTEAND represents structural uncertainty in three ways: (1) The traditional way: The program shows a collection of structures as superposed and overlapped stick-figure models. (2) Ellipsoids: At each atom position, the program shows an ellipsoid derived from a three-dimensional Gaussian model of uncertainty. This probabilistic model provides additional information about the relationship between atoms that can be displayed as a correlation matrix. (3) Rigid-body volumes: Using clouds of dots, the program can show the range of rigid-body motion of selected substructures, such as individual alpha helices. We illustrate the utility of these display modalities by applying PROTEAND to the globin family of proteins, and show that certain types of structural variation are best illustrated with different methods of display.

    View details for Web of Science ID A1995RL45300002

    View details for PubMedID 7577841



    Sites are microenvironments within a biomolecular structure, distinguished by their structural or functional role. A site can be defined by a three-dimensional location and a local neighborhood around this location in which the structure or function exists. We have developed a computer system to facilitate structural analysis (both qualitative and quantitative) of biomolecular sites. Our system automatically examines the spatial distributions of biophysical and biochemical properties, and reports those regions within a site where the distribution of these properties differs significantly from control nonsites. The properties range from simple atom-based characteristics such as charge to polypeptide-based characteristics such as type of secondary structure. Our analysis of sites uses non-sites as controls, providing a baseline for the quantitative assessment of the significance of the features that are uncovered. In this paper, we use radial distributions of properties to study three well-known sites (the binding sites for calcium, the milieu of disulfide bridges, and the serine protease active site). We demonstrate that the system automatically finds many of the previously described features of these sites and augments these features with some new details. In some cases, we cannot confirm the statistical significance of previously reported features. Our results demonstrate that analysis of protein structure is sensitive to assumptions about background distributions, and that these distributions should be considered explicitly during structural analyses.

    View details for Web of Science ID A1995QU44000004

    View details for PubMedID 7613462

  • Characterizing oriented protein structural sites using biochemical properties. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology Bagley, S. C., Wei, L., Cheng, C., Altman, R. B. 1995; 3: 12-20


    A protein site is a region of a three-dimensional protein structure with a distinguishing functional or structural role. Certain sites recur in different protein structures (for example catalytic sites, calcium binding sites, and some types of turns), but maintain critical shared features. To facilitate the analysis of such protein sites, we have developed a computer system for analyzing the spatial distributions of biochemical properties around a site. The system takes a set of similar sites and a set of control nonsites, and finds differences between them. Specifically, it compares distributions of the properties surrounding the sites with those surrounding the nonsites, and reports statistically significant differences. In this paper, we use our method to analyze the features in the active site of the serine protease enzymes. We compare the use of radial distributions (shells) with 3-D grids (blocks) in the analysis of the active site. We demonstrate three different strategies for focusing attention on significant findings, based on properties of interest, spatial volumes of interest, and on the level of statistical significance. Finally, we show that the program automatically identifies conserved sequential, secondary structural and biophysical features of the serine protease active site, using noncatalytic histidine residues as a control environment.

    View details for PubMedID 7584427

  • Computing the Structure of Large Complexes: Applying Constraint Satisfaction Techniques to Modeling the 16S Ribosomal RNA. Biomolecular NMR Spectroscopy Chen, R., Fink, D., Altman, R. edited by Markley, J., Opella, S. London: Oxford University Press. 1995: 279-299
  • Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology (Cambridge, England). edited by Rawlings, C., Clark, D., Altman, R. 1995
  • A Probabilistic Approach to Determining Biological Structure: Integrating Uncertain Data Sources. International Journal of Human Computer Studies Altman, R. 1995; 42: 593-616
  • Finding an average core structure: application to the globins. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology Altman, R. B., Gerstein, M. 1994; 2: 19-27


    We present a procedure for automatically identifying from a set of aligned protein structures a subset of atoms with only a small amount of structural variation, i.e., a core. We apply this procedure to the globin family of proteins. Based purely on the results of the procedure, we show that the globin fold can be divided into two parts. The part with greater structural variation consists of the residues near the heme (the F helix and parts of the G and H helices), and the part with lesser structural variation (the core) forms a structural framework similar to that of the repressor protein (A, B, and E helices and remainder of the G and H helices). Such a division is consistent with many other structural and biochemical findings. In addition, we find further partitions within the core that may have biological significance. Finally, using the structural core of the globin family as a reference point, we have compared structural variation to sequence variation and shown that a core definition based on sequence conservation does not necessarily agree with one based on structural similarity.

    View details for PubMedID 7584390

  • Compositional Characteristics of Disordered Regions in Proteins. Protein and Peptide Letters Altman, R., Hughes, C., Jardetzky, O. 1994; 2 (1): 120-127
  • Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (Stanford, CA). edited by Altman, R., Brutlag, D., Karp, P. 1994


    Although quite successful in a variety of settings, standard optimization approaches can have drawbacks within medical applications. For example, they often provide a single solution which is difficult to explain, or which cannot be incrementally modified using secondary "soft" constraints that are difficult to encode within the optimization. In order to address these issues, we have developed a probabilistic optimization technique that allows the user to enter prior probability distributions (Gaussian) for the parameters to be optimized as well as for the constraints on the parameters. Our technique combines the prior distributions with the constraints using Bayes' rule. The algorithm produces not only a set of parameter values, but variances on these values and covariances showing the correlations between parameters. We have applied this method to the problem of planning a radiosurgical ablation of brain tumors. The radiation plan should maximize dose to tumor, minimize dose to surrounding areas, and provide an even distribution of dosage across the tumor. It also should be explainable to and modifiable by the expert physicians based on external considerations. We have compared the results of our method with the standard linear programming approach.

    View details for Web of Science ID A1994QF21600137

    View details for PubMedID 7950031
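
    As a rough illustration of combining a Gaussian prior with a Gaussian constraint by Bayes' rule, the sketch below uses the closed-form update for a single linear constraint. The linear form, the function name `bayes_update`, and the two-parameter example are assumptions made for illustration; the abstract does not specify how the constraints in the radiosurgery planner are expressed.

```python
def bayes_update(mean, cov, h, z, r):
    """Condition a Gaussian prior N(mean, cov) on a noisy linear constraint.

    The constraint is h . x = z with Gaussian variance r. Returns the
    posterior mean and covariance (standard linear-Gaussian update).
    """
    n = len(mean)
    # P h: prior covariance applied to the constraint direction
    ph = [sum(cov[i][j] * h[j] for j in range(n)) for i in range(n)]
    # Variance of the predicted constraint value plus constraint noise
    s = sum(h[i] * ph[i] for i in range(n)) + r
    gain = [v / s for v in ph]
    residual = z - sum(h[i] * mean[i] for i in range(n))
    new_mean = [mean[i] + gain[i] * residual for i in range(n)]
    new_cov = [[cov[i][j] - gain[i] * ph[j] for j in range(n)]
               for i in range(n)]
    return new_mean, new_cov

# Hypothetical example: two beam weights with independent unit-variance
# priors, softly constrained so that their sum is close to 2.
posterior_mean, posterior_cov = bayes_update(
    mean=[0.0, 0.0],
    cov=[[1.0, 0.0], [0.0, 1.0]],
    h=[1.0, 1.0],
    z=2.0,
    r=2.0,
)
```

    The negative off-diagonal entries of `posterior_cov` show the kind of correlation between parameters that the abstract mentions: once the sum is constrained, raising one weight implies lowering the other.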



    Clinicians have traditionally documented patient data using natural language text. With the increasing prevalence of computer systems in health care, an increasing amount of medical record text will be stored electronically. However, for such textual documents to be indexed, shared, and processed adequately by computers, it will be important to be able to identify concepts in the documents using a common medical terminology. Automated methods for extracting concepts in a standard terminology would enhance retrieval and analysis of medical record data. This paper discusses a method for extracting concepts from medical record documents using the medical terminology SNOMED-III (Systematized Nomenclature of Human and Veterinary Medicine, Version III). The technique employs a linear least squares fit that maps training set phrases to SNOMED concepts. This mapping can be used for unknown text inputs in the same domain as the training set to predict SNOMED concepts that are contained in the document. We have implemented the method in the domain of congestive heart failure for history and physical exam texts. Our system has a reasonable response time. We tested the system over a range of thresholds. The system performed with 90% sensitivity and 83% specificity at the lowest threshold, and 42% sensitivity and 99.9% specificity at the highest threshold.

    View details for Web of Science ID A1994QF21600033

    View details for PubMedID 7949915

  • Constraint satisfaction techniques for modeling large complexes: application to the central domain of 16S ribosomal RNA. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology Altman, R. B., Weiser, B., Noller, H. F. 1994; 2: 10-18


    Standard experimental techniques for determining the structure of small to moderately-sized molecules are difficult to apply to large macromolecular complexes. These complexes, consisting of multiple protein and/or nucleic acid components, can contain many thousands of atoms and the experimental techniques used to study them provide relatively sparse structural information with significant measurement uncertainty. Computational technologies are required to reduce the conformational search space and synthesize the data in order to produce the structures or (more usually) sets of structures compatible with the data. In this paper, we show that a method based on the constraint satisfaction paradigm produces a three-dimensional topology for the central domain of the 16S ribosomal RNA that is generally consistent with interactively built models, although differing in significant ways. The modeling incorporates information about secondary structure of the nucleic acid, neutron diffraction data about the relative positions and uncertainties of the proteins, and protection experiments indicating proximities of segments of RNA to specific protein subunits. Unlike previously proposed models, our model contains explicit information about the range of positions for each subunit that are compatible with the data. The system uses a grid search, checks distances in a direction-dependent manner, uses disjunctive distance constraints, and checks for volume overlap violations.

    View details for PubMedID 7584378
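
    The grid search with disjunctive distance constraints mentioned at the end of this abstract can be sketched as follows. This is a simplified illustration only: it omits the direction-dependent distance checks and the volume-overlap test, and the anchor names, grid bounds, and function name `feasible_positions` are hypothetical.

```python
import math
from itertools import product

def feasible_positions(anchors, constraints, grid_min=-10, grid_max=10, step=5):
    """Return the grid points that satisfy every distance constraint.

    anchors maps a name to a fixed (x, y, z) position. Each constraint is
    a list of alternatives (anchor_name, lo, hi); it is satisfied when the
    distance to at least one listed anchor falls within [lo, hi].
    """
    axis = range(grid_min, grid_max + 1, step)
    keep = []
    for point in product(axis, repeat=3):
        if all(
            any(lo <= math.dist(point, anchors[name]) <= hi
                for name, lo, hi in alternatives)
            for alternatives in constraints
        ):
            keep.append(point)
    return keep

# Hypothetical example: an RNA segment must lie within 5 units of protein
# P1, and either 4 to 6 units from protein P2 or exactly on P1.
anchors = {"P1": (0, 0, 0), "P2": (10, 0, 0)}
constraints = [
    [("P1", 0, 5)],
    [("P2", 4, 6), ("P1", 0, 0)],
]
positions = feasible_positions(anchors, constraints)
```

    The surviving grid points play the role of the "range of positions for each subunit that are compatible with the data" described in the abstract.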



    Many clinical decision-support applications are created in a centralized manner, but distributed widely for local use. When such applications include queries to electronic patient databases, the queries must be translated to conform to local database specifications. Because no well-defined standard model of clinical data exists, the translation process is ad hoc, costly, and error-prone. In this paper, we propose an abstract formalism, called the Standard Query Model Framework, for specifying a standard clinical data model and for supporting the automated and reliable translation of queries that appear in shared decision-support applications. We present the components of this framework, discuss their desirable features, and describe a prototype that we have developed for relational patient databases. We also highlight the outstanding research issues relevant to our approach.

    View details for Web of Science ID A1994QF21600059

    View details for PubMedID 7949944



    The use of electronic mail (e-mail) is increasing among both physicians and patients, although there is limited information in the literature about how patients might use e-mail to communicate with their physician. In our university-based internal medicine clinic, we have studied attitudes toward and access to e-mail among patients. A survey of 444 patients in our clinic showed that 46% of patients in the clinic use e-mail, and 89% of those with e-mail use it at work. Fifty-one percent would use e-mail all or most of the time to communicate with the clinic if it were available, and many of the communications that currently take place by phone could be replaced by e-mail. Barriers to e-mail use include privacy concerns among patients who use e-mail in the workplace, choosing the appropriate tasks for e-mail, and methods for efficiently triaging electronic messages in the clinic.

    View details for Web of Science ID A1994QF21600004

    View details for PubMedID 7949909

  • Probabilistic constraint satisfaction with structural models: application to organ modeling by radial contours. Proceedings / the ... Annual Symposium on Computer Application [sic] in Medical Care. Symposium on Computer Applications in Medical Care Altman, R. B., Brinkley, J. F. 1993: 492-496


    One of the key challenges within medical information sciences is the development of useful models for biological structure and its variability. Many biomedical problems involve the elucidation of structure (for example, from experimental data or from imaging studies), and structural models can often drive the process of inferring precise structure from data. Ideally, model-driven data interpretation combines knowledge about the generic features of a class of biological structures (as contained within a model) with data that provide specific information (often noisy) about a particular instance of the class. In this paper we briefly discuss model-driven determination of biological structure as an example of a structural constraint satisfaction problem. We describe a probabilistic implementation of structural constraint satisfaction, and show that our formulation of a particular organ modeling technology (Radial Contour Models) exhibits promising performance. Our results demonstrate the utility of probabilistic models for the solution of structural constraint satisfaction problems.

    View details for PubMedID 8130522

  • Probabilistic structure calculations: a three-dimensional tRNA structure from sequence correlation data. Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology Altman, R. B. 1993; 1: 12-20


    Algorithms based on probability theory can address issues of uncertainty directly through their representational framework and their theory for data combination. In this paper, we discuss the advantages of probabilistic formulations for molecular-structure calculations, describe one implementation of such a formulation, and show its performance on a data set derived from analysis of the statistical correlations within a set of aligned transfer RNA sequences. By assigning reasonable physical interpretations to certain statistical correlations, we are able to calculate three-dimensional structures for tRNA from a random starting structure. The constraints that we use are associated with different variances, and so their effects are not uniform, and must be reconciled by a probabilistic algorithm to yield the most likely structure. As might be predicted, the uncertainty in the position for each base is a function of both the number and strength of the constraints, and is reflected in the variances in atomic position calculated by the algorithm. For example, the hinge region in the tRNA is shown to be the most uncertain. In addition, the algorithm retains information about positional covariation that is useful for understanding the relationships between different parts of the structure. These experiments also demonstrate that we can define a single-sphere representation for each base that is useful for nucleic acid structural calculations in the same way that alpha-carbon representations are useful for protein structural calculations.

    View details for PubMedID 7584327



    We have systematically examined how the quality of NMR protein structures depends on (1) the number of NOE distance constraints, (2) their assumed precision, (3) the method of structure calculation and (4) the size of the protein. The test sets of distance constraints have been derived from the crystal structures of crambin (5 kDa) and staphylococcal nuclease (17 kDa). Three methods of structure calculation have been compared: Distance Geometry (DGEOM), Restrained Molecular Dynamics (XPLOR) and the Double Iterated Kalman Filter (DIKF). All three methods can reproduce the general features of the starting structure under all conditions tested. In many instances the apparent precision of the calculated structure (as measured by the RMS dispersion from the average) is greater than its accuracy (as measured by the RMS deviation of the average structure from the starting crystal structure). The global RMS deviations from the reference structures decrease exponentially as the number of constraints is increased, and after using about 30% of all potential constraints, the errors asymptotically approach a limiting value. Increasing the assumed precision of the constraints has the same qualitative effect as increasing the number of constraints. For comparable numbers of constraints/residue, the precision of the calculated structure is less for the larger than for the smaller protein, regardless of the method of calculation. The accuracy of the average structure calculated by Restrained Molecular Dynamics is greater than that of structures obtained by purely geometric methods (DGEOM and DIKF).

    View details for Web of Science ID A1992JF96900006

    View details for PubMedID 1511237
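
    The distinction this abstract draws between apparent precision and accuracy can be made concrete with a short sketch. It assumes that all structures are already superimposed in a common frame and defines the dispersion as the root-mean-square, over models, of each model's RMSD from the average structure. That definition and the function names are assumptions, not details taken from the paper.

```python
import math

def average_structure(ensemble):
    """Coordinate-wise mean of a list of models (each a list of xyz tuples)."""
    n_models = len(ensemble)
    n_atoms = len(ensemble[0])
    return [
        tuple(sum(model[i][k] for model in ensemble) / n_models
              for k in range(3))
        for i in range(n_atoms)
    ]

def rmsd(a, b):
    """Root-mean-square deviation between two equally sized coordinate sets."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def precision_and_accuracy(ensemble, reference):
    """Return (RMS dispersion about the average, RMSD of average vs reference)."""
    avg = average_structure(ensemble)
    dispersion = math.sqrt(
        sum(rmsd(model, avg) ** 2 for model in ensemble) / len(ensemble)
    )
    return dispersion, rmsd(avg, reference)
```

    A small dispersion combined with a larger deviation from the reference corresponds to the case the abstract describes, in which the calculated structures agree with one another more closely than their average agrees with the crystal structure.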



    We have determined the solution structures and examined the dynamics of the Escherichia coli trp repressor (a 25-kDa dimer), with and without the co-repressor L-tryptophan, from NMR data. This is the largest protein structure thus far determined by NMR. To obtain a set of data sufficient for a structure determination it was essential to resort to isotopic spectral editing. Line broadening observed in this molecular mass range precludes for the most part the measurement of coupling constants and stereospecific assignments, with the inevitable result that the attainable resolution of the final structure will be somewhat lower than the resolution reported for smaller proteins and peptides. Nevertheless the general topology of the protein can be deduced from the subsets of NOEs defining the secondary and tertiary structure, providing a basis for further refinement using the full set of NOEs and energy minimization. We report here (a) an intermediate resolution structure that can be deduced from NMR data, covalent, angular and van-der-Waals constraints only, without resort to detailed energy calculations, and (b) the limits of uncertainty within which this structure is valid. An examination of these structures combined with backbone amide exchange data shows that even at this resolution three important conclusions can be drawn: (a) the protein structure changes upon binding tryptophan; (b) the putative DNA binding region is much more flexible than the core of the molecule, with backbone amide proton exchange rates 1000 times faster than in the core; (c) the binding of tryptophan stabilizes the repressor molecule, which is reflected in both the appearance of additional NOEs, and in the slowing of backbone proton exchange rates by factors of 3-10. Sequence-specific 1H-NMR assignments and the secondary structure of the holorepressor (L-tryptophan-bound form) have been reported previously [C. H. Arrowsmith, R. Pachter, R. B. Altman, S. B. Iyer & O. Jardetzky (1990) Biochemistry 29, 6332-6341]. Those for the trp aporepressor (L-tryptophan-free form), made using the same methods and conditions as described in the cited paper, are reported here. The secondary structure of the aporepressor was calculated from sequential and medium-range NOEs and is the same as reported for the holorepressor except that helix E is shorter. The tertiary solution structures for both forms of the repressor were calculated from long-range NOE data. (ABSTRACT TRUNCATED AT 400 WORDS)

    View details for Web of Science ID A1991GP84100006

    View details for PubMedID 1935980

  • Determination of Large Protein Structures from NMR Data: Definition of the Solution Structure of the TRP Repressor. Computational Aspects of the Study of Biological Macromolecules by NMR Spectroscopy Altman, R., Arrowsmith, C., Pachter, R., Jardetzky, O. edited by Hoch, J., Poulsen, F., Redfield, C. New York: Plenum Publishing Corp. 1991: 363-374


    Sequence-specific 1H NMR assignments are reported for the active L-tryptophan-bound form of Escherichia coli trp repressor. The repressor is a symmetric dimer of 107 residues per monomer; thus at 25 kDa, this is the largest protein for which such detailed sequence-specific assignments have been made. At this molecular mass the broad line widths of the NMR resonances preclude the use of assignment methods based on 1H-1H scalar coupling. Our assignment strategy centers on two-dimensional nuclear Overhauser spectroscopy (NOESY) of a series of selectively deuterated repressor analogues. A new methodology was developed for analysis of the spectra on the basis of the effects of selective deuteration on cross-peak intensities in the NOESY spectra. A total of 90% of the backbone amide protons have been assigned, and 70% of the alpha and side-chain proton resonances are assigned. The local secondary structure was calculated from sequential and medium-range backbone NOEs with the double-iterated Kalman filter method [Altman, R. B., & Jardetzky, O. (1989) Methods Enzymol. 177, 218-246]. The secondary structure agrees with that of the crystal structure [Schevitz, R., Otwinowski, Z., Joachimiak, A., Lawson, C. L., & Sigler, P. B. (1985) Nature 317, 782], except that the solution state is somewhat more disordered in the DNA binding region and in the N-terminal region of the first alpha-helix. Since the repressor is a symmetric dimer, long-range intersubunit NOEs were distinguished from intrasubunit interactions by formation of heterodimers between two appropriate selectively deuterated proteins and comparison of the resulting NOESY spectrum with that of each selectively deuterated homodimer. Thus, from spectra of three heterodimers, long-range NOEs between eight pairs of residues were identified as intersubunit NOEs, and two additional long-range intrasubunit NOEs were assigned.

    View details for Web of Science ID A1990DN23200002

    View details for PubMedID 2207078

  • PROTEAN - Part II: Molecular Structure Determination from Uncertain Data. Quantitative Computer Program Exchange Bulletin Altman, R., Pachter, R., Carrara, E., Jardetzky, O. 1990; 4 (10): 596
  • PROTEAN - Part I: Generating Ensembles of Stylized Molecular Fragments using Uncertain Constraints. Quantitative Computer Program Exchange Bulletin Carrara, E., Brinkley, J., Cornelius, C., Altman, R., Brugge, J., Pachter, R. 1990; 4 (10): 596
  • NMR AND PROTEIN-STRUCTURE. BIOFIZIKA Jardetzky, O., Altman, R., Madrid, M. 1989; 34 (5): 763-771

    View details for Web of Science ID A1989CW82100011

    View details for PubMedID 2691845

  • NMR and Protein Structure. Biofizika Jardetzky, O., Altman, R., Madrid, M. 1989; 34 (5): 763-771
  • The Determination of Structural Uncertainty from NMR and Other Data: The Lac Repressor Headpiece. Protein Structure and Engineering. Altman, R., Pachter, R., Jardetzky, O. edited by Jardetzky, O. New York: Plenum Publishing Corp. 1989: 1
  • The Heuristic Refinement Method for the Determination of the Solution Structure of Proteins from NMR Data. Nuclear Magnetic Resonance, Part B: Structure and Mechanisms (Methods in Enzymology) Altman, R., Jardetzky, O. edited by Oppenheimer, N., James, T. New York: Academic Press. 1989: 218-247
  • Artificial Intelligence Techniques and NMR Spectroscopy: Application to the Structure of Proteins in Solution. Nuclear Magnetic Resonance: The Principles and Applications of NMR Spectroscopy and Imaging to Biomedical Research Duncan, B., Brinkley, J., Altman, R., Buchanan, B., Jardetzky, O. edited by Pettegrew, J. New York: Springer-Verlag. 1989: 99-123


    A method is described for determining the family of protein structures compatible with solution data obtained primarily from nuclear magnetic resonance (NMR) spectroscopy. Starting with all possible conformations, the method systematically excludes conformations until the remaining structures are only those compatible with the data. The apparent computational intractability of this approach is reduced by assembling the protein in pieces, by considering the protein at several levels of abstraction, by utilizing constraint satisfaction methods to consider only a few atoms at a time, and by utilizing artificial intelligence methods of heuristic control to decide which actions will exclude the most conformations. Example results are presented for simulated NMR data from the known crystal structure of cytochrome b562 (103 residues). For 10 sample backbones an average root-mean-square deviation from the crystal of 4.1 Å was found for all alpha-carbon atoms and 2.8 Å for helix alpha-carbons alone. The 10 backbones define the family of all structures compatible with the data and provide nearly correct starting structures for adjustment by any of the current structure determination methods.

    View details for Web of Science ID A1988R230100006

    View details for PubMedID 3235473

  • The Heuristic Refinement Method for the Derivation of Protein Solution Structures: Validation on Cytochrome-b562. Journal of Chemical Info. & Computer Sciences Brinkley, J., Altman, R., Duncan, B., Buchanan, B., Jardetzky, O. 1988; 28 (4): 194-210
  • Positive Strand RNA Viruses Harrison, S. C., Sorger, P. K., Stockley, P. G., Hogle, J., Altman, R., Strong, R. K. 1987


    Non-crystallographic approaches to the determination of protein structure must solve the problem of insufficient and low information content experimental data. Most successful methods augment experimentation with theoretical constraints (for example, potential energy functions or optimization error metrics). We believe it is important to separate the contributions of experimentation and theory in the construction of protein structure. The PROTEAN system defines protein topology on the basis of experimental data alone. Its performance on three data sets, derived from the lac-repressor headpiece of E. coli, sperm whale myoglobin, and domain 1 of bacteriophage T4 lysozyme, indicates that there may be families of related conformations that are consistent with the experimental data. These conformations provide insight into the strengths and weaknesses in the data sets. They also provide a set of structures with which to begin theoretical refinements. We outline here a strategy which maintains a clear distinction between refinements based on theory and those based on experiment, and thus allows a careful analysis of the properties of such refinement methods.

    View details for Web of Science ID A1986F079500001

    View details for PubMedID 3553167

  • PROTEAN: A New Method of Deriving Solution Structures of Proteins. Bulletin of Magnetic Resonance Duncan, B., Buchanan, B., Hayes-Roth, B., Lichtarge, O., Altman, R., Brinkley, J. 1986; 8: 111-119

    View details for Web of Science ID A1982PJ75900022

    View details for PubMedID 6756403