Principal Investigator:

Contributors: Mark Bender Gerstein, David Galas


DESCRIPTION (provided by applicant): Extracellular RNAs (exRNAs) are released into the human bloodstream and other body fluids by different types of cells in the human body and may be taken up by other cells. The exRNAs may also originate from edible plants and from microbes that inhabit the human body. The Extracellular RNA Communication Program (ERCP) will explore this newly discovered mechanism of communication in healthy individuals and in pathological conditions such as cancer. The Data Management Resource and Repository for the exRNA Atlas (DMRR) will integrate the efforts of the ERCP and serve as a community-wide resource for the development of the exRNA Atlas database. DMRR will consist of three components and an Administrative Core. The Data Coordination Component (DCC) will develop data and metadata standards; establish data flow into the exRNA Atlas database; develop tools for download, visualization and analysis of exRNA data; and integrate the exRNA Atlas database with other relevant resources. The Scientific Outreach Component (SOC) will develop the exRNA Atlas Web Portal to disseminate and support visualization of the exRNA Atlas data; ensure accessibility of ERCP-generated resources; and initiate community engagement in exRNA biology using leading biological Wiki sites. In close coordination with the DCC, the SOC will engage the community through knowledge curation jamborees, scientific workshops and symposia. The Data Integration and Analysis Component (DIAC) will provide large-scale integrative and analytic support; evaluate tools and build pipelines to be hosted by the DCC and used to populate the exRNA Atlas; build exRNA data tools to be deployed and distributed by the DCC for use by other consortium participants and the wider scientific community; and lead consortium-wide advanced integrative analyses.
Through these coordinated efforts of its DCC, SOC, and DIAC components, the DMRR will help organize the ERCP consortium and open opportunities for rapid progress in the nascent field of exRNA biology.

The DMRR DCC will also indicate how it would accommodate proteomic, lipidomic, metabolomic, or other non-DNA/RNA-based datasets. Anticipated activities of the DMRR DCC include:

  • establishing common ERCP consortium protocols for the characterization of body fluids from normal and diseased individuals and the generation of exRNA datasets derived from these fluids.
  • working with ERCP members to establish the exact types and formats of data that will be transferred to the DMRR and develop data verification, validation and quality metrics pipelines.
  • working with ERCP members to define standard experimental metadata required to be submitted with each dataset, including common data elements such as clinical phenotypes and exRNA-specific categories, using well-defined formats and associated controlled vocabularies.
  • making all data and metadata submitted to the DMRR DCC rapidly available to the data producers and ultimately to the public through the DMRR website.
  • establishing a process for accepting or incorporating datasets produced outside the consortium to help maximize the value of all exRNA datasets generated.
  • developing a separate submission pipeline for ancillary data and information as needed, including information on reagents, standard protocols, and data generated as a result of technology development, platform characterization, studies to examine biological relevance, and ERCP publications.
  • creating an overall data coordination plan that maximizes data transportability and data interoperability, given the rapidly changing landscapes for data access and data integration.
  • working with ERCP PD(s)/PI(s) to establish a public data release policy congruent with that of a community resource project while simultaneously protecting human subjects. The DMRR will need to be flexible and responsive in following the evolving NIH policies on data sharing.
  • working with ERCP consortium members to establish regular data freezes to quantify productivity and facilitate analysis activities.
  • establishing an export pipeline to permit timely transfer of ERCP data from consortium data freezes or publications to appropriate public repositories and community databases. These repositories may include the Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI), dbGaP (in the case of controlled-access datasets), ArrayExpress at the European Bioinformatics Institute (EBI), the UCSC Genome Browser at the University of California at Santa Cruz, and Ensembl at the EBI and the Wellcome Trust Sanger Institute. Regardless of where datasets are deposited, the DMRR DCC will provide access to the ERCP-generated data and metadata for the duration of the ERCP. Data and metadata must be available in standard formats to facilitate additional analysis by members of the ERCP consortium and the broader scientific community.
  • providing links between the data in the public repositories and the data as they reside in the DMRR. Submission of the data to public repositories and community databases must occur no later than submission of the associated manuscript.
  • working with the DMRR DIAC and DMRR SOC to provide a public website containing links to the data and metadata, including workflows used to generate each analysis, for all figures in the associated manuscript.
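
As an illustration of the metadata activities listed above (required fields submitted with each dataset, checked against controlled vocabularies), a validation step might look roughly like the following Python sketch. Every name and term in it is an assumption made for illustration: the field names, the vocabulary entries, and the `validate_metadata` function are hypothetical and are not the standards the DMRR DCC would actually define with ERCP members.

```python
# Hypothetical controlled vocabularies; the real ERCP terms are not
# specified in this description and would be defined by the consortium.
CONTROLLED_VOCABULARIES = {
    "biofluid": {"plasma", "serum", "saliva", "urine"},
    "condition": {"healthy", "cancer"},
}

# Hypothetical set of fields every submitted dataset must carry.
REQUIRED_FIELDS = ("sample_id", "biofluid", "condition")


def validate_metadata(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    # Check that every required field is present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    # Check that vocabulary-controlled fields use an allowed term.
    for field, allowed in CONTROLLED_VOCABULARIES.items():
        value = record.get(field)
        if value and value not in allowed:
            problems.append(
                f"{field}: '{value}' is not in the controlled vocabulary"
            )
    return problems


if __name__ == "__main__":
    submission = {"sample_id": "S1", "biofluid": "blood", "condition": "healthy"}
    for problem in validate_metadata(submission):
        print(problem)
```

Returning a list of problems rather than raising on the first error lets a submitter see and correct every issue in a record at once, which suits a pipeline that receives data from many producers.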