July Report – Appendix B – Overview of all DCCs

Table of Contents:
Kids First
GTEx
HMP/iHMP
LINCS
4D Nucleome
Metabolomics
MoTrPAC
SPARC
HuBMAP

Kids First

General description Integrates genomic and clinical data from different disease types to accelerate discovery using cloud-based analysis systems.

Cohorts / supported data sources Approximately 20 data contributors generating whole genome sequence data and, in some cases, whole exome and transcriptome data for tumors or affected tissue to discover genetic variants that contribute to pediatric conditions. A variety of approaches may be taken for sequencing of tumors or affected tissue, such as 30X whole genome sequencing combined with 100X whole exome and 100X RNA sequencing.

Degree of operationalization 18 months. However, a significant volume of data and infrastructure has been generated during this time, thanks in part to additional funding from independent sources.

Bulk file content 11 studies, each with 250-2000+ subjects, 927.1TB

Dataset types BAM, CRAM, fastq, VCF, clinical measurements

Dataset search facets Study, diagnosis, clinical phenotypes, patient events, pedigree, gender, race, tissue type, data type

Protected data hosted at dbGaP? Yes

Cloud expertise / usage Fully deployed on cloud-based system

GTEx

General description The Genotype-Tissue Expression (GTEx) project aims to provide to the scientific community a resource with which to study human gene expression and regulation and its relationship to genetic variation.

Cohorts / supported data sources V7 includes: ~1000 post mortem donors, 53 tissue sites, 11 distinct brain regions, and 2 cell lines.

Degree of operationalization Established in 2010. Globally used resource with roughly 50% of its users from the US or the UK and the rest spread throughout the world. The portal receives roughly fifteen thousand users per month.

Bulk file content 53 tissues, 960 donors, 30,000 samples; dataset size expected to increase to a total of 600GB by the next release

Dataset types De-identified annotations, RNA-Seq Data, Single-Tissue cis-eQTL Data, Multi-Tissue eQTL Data, Reference Files, and Single-Cell Data.

Dataset search facets Tissue type, individual gene expression level, expression comparison across tissues, comparisons based on eQTLs, histological images

Protected data hosted at dbGaP? Yes

Cloud expertise / usage Fully deployed on cloud-based system

HMP/iHMP

General description Characterization of the human microbiota to further our understanding of how the microbiome impacts human health and disease.

Cohorts / supported data sources Healthy Cohort: 300 healthy individuals, each sampled at 5 major body sites (oral, airways, skin, gut, vagina) at up to three timepoints. Each body site consisted of a number of body subsites, for a total of 15 to 18 samples per individual per timepoint. Disease Cohorts: 18 projects with one or more cohorts aimed at studying specific health conditions.

Degree of operationalization Completing 10 years of operation, funding discontinued.

Bulk file content 21 studies, 48 primary body sites, >32,000 samples, >118,000 files, 9.75TB (iHMP), 7.22TB (HMP)

Dataset types Reference microbial genomes, Whole metagenomic sequence, 16S metagenomic sequence

Dataset search facets Unique subject ID, Body site, Sex, Studies, Visit number

Protected data hosted at dbGaP? Yes

Cloud expertise / usage Local servers for data served from the website. Have performed demonstrations/workshops with cloud-based data.

LINCS

General description The Library of Integrated Network-Based Cellular Signatures (LINCS) Program aims to create a network-based understanding of biology by cataloging changes in gene expression and other cellular processes that occur when cells are exposed to a variety of perturbing agents.

Cohorts / supported data sources Studying effects of perturbations on gene expression in ~1,175 cell lines, 60 primary cells, 32 iPSCs, 31 differentiated cells, and 2 embryonic stem cells.

Degree of operationalization The initial phase ended in FY 2013. Phase II has been funded since 2014. Funding will end on June 30, 2020.

Bulk file content 398 datasets, 100TB genomic data, >1PB imaging data

Dataset types Binding, imaging, transcriptomics, proteomics, epigenomics

Dataset search facets Method, Subject area, Center, Assay, Process, Project

Protected data hosted at dbGaP? Most data is not protected. Some information may be in dbGaP.

Cloud expertise / usage Not yet; an effort towards using cloud-based systems is underway.

4D Nucleome

General description The 4D Nucleome (4DN) project aims to understand principles underlying nuclear organization in space (three dimensions) and time (the fourth dimension), the role nuclear organization plays in gene expression and cellular function, and how changes in nuclear organization affect normal development as well as various diseases.

Cohorts / supported data sources 24 different labs are helping to generate quantitative models of nuclear organization in human and mouse genomes in diverse cell types and conditions, including in single cells. Five cell lines have been designated as Tier 1, which will be a primary focus of 4DN research and integrated analysis. 12 other lines that are expected to be used by multiple labs and have approved SOPs for maintaining them have been designated Tier 2.

Degree of operationalization Stage 1 has been operating since 2015. Stage 2 is expected to launch in 2020.

Bulk file content 730 experiment sets, 2107 experiments, 6823 files, 28.99 TB, 166 external datasets

Dataset types DNA FISH, RNA FISH, SPT, 2-stage Repli-seq, in situ Hi-C, single cell Hi-C, RNA-seq, ChIP-seq, DamID-seq, ATAC-seq, NAD-seq, ChIA-PET, DNA SPRITE, PLAC-seq, MARGI, RNA-DNA SPRITE, Micro-C, TSA-Seq

Dataset search facets Organism, Experiment type, Biosource type, Lab, Center, Treatments, Assay details, Commendations

Protected data hosted at dbGaP? No

Cloud expertise / usage Fully deployed on cloud-based system

Metabolomics

General description The Common Fund’s Metabolomics program serves as a long-standing national public repository for metabolomic data. Users are enabled to analyze and interpret metabolomics data, including the ability to determine metabolite identities. This project has also developed best practices and guidelines to promote accuracy, reproducibility, and re-analysis of metabolomics data.

Cohorts / supported data sources The Data Repository and Coordinating Center (DRCC) accepts metabolomics data for small and large studies on cells, tissues and organisms via the Metabolomics Workbench. It accommodates a variety of metabolite analyses, including, but not limited to, MS and NMR. Processed data (measurements) may be in the form of quantitated metabolite concentrations, MS peak height/area values, LC retention times, NMR binned areas, etc. Raw data in the form of MS and NMR binary files and associated parameter files may also be uploaded. Data from both targeted and untargeted studies are accepted.

Degree of operationalization Launched in 2012. Approved for a second stage of support from FY18-21.

Bulk file content 920 publicly available studies, 6.4TB (compressed zip files), 233 studies with restricted access

Dataset types Raw/unprocessed NMR data, MS data, Processed data (general)

Dataset search facets Project (study groupings), Study, Sample source (site on body), Species, Disease (from study), Human Pathways (metabolic process), Metabolite class, PUBCHEM_CID, Name (Common, Systematic), Formula, Exact mass, Tolerance (daltons), LIPID MAPS ID, KEGG ID, InChiKey, Gene Name, Gene Symbol, Synonyms, Alternate names, HMDB Pathway, Reactome Pathway

Protected data hosted at dbGaP? No

Cloud expertise / usage Cloud-based data hosted at the San Diego Supercomputer Center.

MoTrPAC

General description The Molecular Transducers of Physical Activity in Humans program aims to extensively catalogue the biological molecules affected by physical activity in people, identify some of the key molecules that underlie the systemic effects of physical activity, and characterize the function of these key molecules.

Cohorts / supported data sources Animal studies involved acute exercise and exercise training of 6-month-old and 18-month-old rats. Eighteen tissues were collected at 7 post-acute-exercise time points for the acute cohort. For rats that underwent a training regimen ranging from one to eight weeks, tissues were harvested 48 hours after the last exercise bout. Human studies will involve a multi-center clinical cohort of approximately 2,700 healthy human volunteers (males and females), 10-80 years of age, of all fitness levels. Blood, muscle, and fat samples will be collected from active and sedentary volunteers who will perform resistance or aerobic exercises.

Degree of operationalization Awards issued in Sept. 2016. Six-year program, through 2022. Pre-clinical animal studies finished in May 2019. Recruitment of human clinical participants is expected to begin in Fall 2019.

Bulk file content No data available yet.

Dataset types Genomic, transcriptomic, epigenomic, metabolomics and proteomics data are expected.

SPARC

General description The Stimulating Peripheral Activity to Relieve Conditions (SPARC) program aims to transform our understanding of nerve-organ interactions and ultimately advance the neuromodulation field toward precise treatment of diseases and conditions for which conventional therapies fall short.

Cohorts / supported data sources SPARC projects are expected to use a multi-expertise approach to comprehensively understand the neuroanatomy and neurobiology of both afferent and efferent innervation of a major organ, as well as characterize nervous system regulation of the function of that organ. The targets of the projects encompass 10 major organs and 7 other tissues/organs, such as adipose tissue, bone marrow, esophagus, etc. Data generated by these projects will be provided to the SPARC data coordination center, which in turn will generate detailed functional and anatomical neural circuit maps for major organs and their functionally associated structures.

Degree of operationalization Program was launched in FY 2015. Funding for the Data Coordination, Mapping, and Modeling Center started in September 2017 and will end in August 2022.

Bulk file content No data available yet. Data portal expected to be launched in Summer 2019.

Dataset types Imaging and omics data, such as proteomics and transcriptomics, are expected. Functional and anatomical neural circuit maps will also be generated.

Cloud expertise / usage Blackfynn, Inc. has been awarded funding for five years to develop a cloud-based scientific data management platform tailored to the needs of SPARC investigators.

HuBMAP

General description The Human BioMolecular Atlas Program (HuBMAP) aims to catalyze development of an open, global framework for comprehensively mapping the human body at the level of individual cells.

Cohorts / supported data sources The HuBMAP Tissue Mapping Centers (TMCs) will collect and analyze a broad range of largely normal tissues, representing both sexes, different ethnicities and a variety of ages across the adult lifespan. These tissues include: 1) discrete, complex organs (kidney, ureter, bladder, lung, breast, colon); 2) distributed organ systems (vasculature); and 3) systems comprised of dynamic or motile cell types with distinct microenvironments (lymphatic organs: spleen, thymus, and lymph nodes).

Degree of operationalization Launched in November 2018 and entered a scale-up phase in Fiscal Year (FY) 2019 that will continue through FY 2021. A production phase will run during FY 2022–FY 2024, and a transition phase will occur in FY 2025.

Bulk file content Data portal is currently being developed. No public data until 2020.

Dataset types Imaging data, single cell omics datasets (scRNAseq, scATACseq, snDropseq, SNAREseq, scTHSseq), and MS-based proteomic, lipidomic, and metabolomic datasets are expected.

Cloud expertise / usage Cloud-based data hosted at the Pittsburgh Supercomputing Center (PSC) and the University of Pittsburgh.