Mark Musen receives NIH Big Data to Knowledge (BD2K) Center of Excellence grant
The Center for Expanded Data Annotation and Retrieval <http://metadatacenter.org> (CEDAR) is a newly funded Center of Excellence supported by the NIH Big Data to Knowledge (BD2K) initiative. CEDAR is based organizationally at Stanford University, with key collaborators at the University of Oxford, Yale University, and Northrop Grumman Corporation. CEDAR has the explicit mission to work with the biomedical community to use, extend, and develop community-driven standards for representation of biomedical metadata; to create a Web-based, machine-processable repository for these community-based standards; and to develop technology to promote the use of these standards in the authoring and management of biomedical metadata. CEDAR will work with the other new NIH BD2K Centers of Excellence to facilitate the use of data-related standards for the computer-interpretable description of biomedical metadata.
The scientific enterprise is at risk because it is often impossible for investigators to reproduce one another's experiments. Often it is impossible even to find online the experimental data that an investigator collected or, when the data can be found, to gain clear insight into what the experimenter actually did. Imagine a library with no central catalog or, at best, a catalog with so many missing entries that it is hard or impossible to know where a resource might be located, what language it is written in, or even what the resource is about. The current situation in biomedicine leaves the scientific community with myriad experimental data sets in different online locations, with no uniform mechanisms to find the data that are needed, to understand what experiments were done to collect the data, or to determine how the individual data items might be interpreted. Just as metadata are needed to index books in a library, metadata are needed to index archived experimental data sets. A huge problem in Big Data is that scientists routinely find that their colleagues create poor or inadequate metadata to describe their experiments, if they create any real metadata at all.
The Center for Expanded Data Annotation and Retrieval <http://metadatacenter.org> (CEDAR) is developing information technologies to make the authoring of complete and comprehensive metadata much more manageable, and to facilitate the use of metadata in the analysis of Big Data sets. The Center is working with several major community-based groups to develop and test its ideas. These groups include (1) the BioSharing initiative <http://www.biosharing.org>, led by the University of Oxford, which draws on a wide range of international contributors to register and make discoverable metadata standards for diverse types of biomedical experiments (see also their response to this RFI), (2) ImmPort <https://immport.niaid.nih.gov/immportWeb/home/home.do?loginType=full>, NIAID's data warehouse of immunology-related datasets, led by investigators at Stanford and at Northrop Grumman Corporation, (3) the Human Immunology Project Consortium <http://www.immuneprofiling.org/hipc/page/show> Standards Working Group centered at Yale University, which designs new metadata templates and channels experimental datasets to the ImmPort repository, and (4) the Stanford Digital Repository <http://library.stanford.edu/research/stanford-digital-repository>, operated by Stanford University Libraries, which helps all Stanford investigators to archive and disseminate data of all kinds, including data from many projects funded by the NSF. Each of these groups is clamoring for better metadata and for better metadata-authoring tools of the kind that CEDAR plans to develop.