July Report – Our Assessment – Common Fund Data Ecosystem

Table of Contents:
General description of the DCCs; commonalities and differences
Deep dives
Opportunities and Challenges for Individual DCCs, summarized
Opportunities and Challenges across the DCCs, summarized

General description of the DCCs; commonalities and differences

The goal of this section is to compare and contrast the content, status and maturity of Common Fund DCCs. Information here was collected either by passive review of the websites and resources of the DCCs, or through personal contact with DCC staff (see Appendix A, Methodology).

First impressions. Collectively, the Common Fund DCCs:

Range from just getting started to fully implemented;
Possess a broad range of datatypes, many in common across all DCCs;
Have many datatypes that are complementary with other coordination centers (e.g., variants and expression data);
Are strongly motivated by FAIRness, but vary in their implementations;
Each use a different data model to host their assets, and that data model is vital to their operations;
Largely do not use common metadata vocabularies;
Mostly host data (four out of six sites), and also have data at dbGaP;
Cannot perform queries across the data hosted at dbGaP;
Use a variety of sophisticated and capable analytical tools specialized to their project;
Provide training materials to their user community.

Other general information about the DCCs follows:

Common Fund DCCs have vast data assets. The Common Fund DCCs possess data derived from hundreds of studies and samples collected from thousands of human subjects. A summary of the datatypes and studies hosted by each DCC is shown in Table 1: an incredible diversity of datatypes has been generated at the genomic, expression, proteomic, metagenomic, and imaging levels. At each DCC website, users can search through these data using a wide variety of facets, such as assay, project, tissue type, disease, and patient variables. Information for individual genes is available for expression, epigenomics, variants, and chromatin organization.

	4D Nucleome	GTEx	HMP / iHMP	Kids First	LINCS	Metabolomics
Studies and Bulk File Content	730 experiment sets, 2107 experiments, 6823 files, 28.99 TB, 166 external datasets	53 Tissues, 960 donors, 30,000 samples, data set size expected to increase to a total of 600GB by next release	21 studies, 48 primary body sites, >32,000 samples, >118,000 files, 9.75TB (iHMP), 7.22TB (HMP)	11 studies, each with 250-2,000+ subjects, 927.1TB	398 datasets, 100TB genomic data, >1PB imaging data	920 studies, 6.4TB (compressed zip files), 233 studies with restricted access
Dataset types	DNA FISH, RNA FISH, SPT, 2-stage Repli-seq, in situ Hi-C, single cell Hi-C, RNA-seq, ChIP-seq, DamID-seq, ATAC-seq, NAD-seq, ChIA-PET, DNA SPRITE, PLAC-seq, MARGI, RNA-DNA SPRITE, Micro-C, TSA-Seq	De-identified annotations, RNA-seq, single-tissue cis-eQTLs, multi-tissue eQTLs, single-cell data	Reference microbial genomes, whole metagenomic sequence, 16S metagenomic sequence	BAM, CRAM, fastq, VCF, clinical measurements	Binding, imaging, transcriptomics, proteomics, epigenomics	Raw/unprocessed NMR data, MS data, Processed data (general)

Table 1: Summary of bulk assets hosted by each DCC that hosts data publicly, and their relevant studies.

Common Fund DCCs exist on a continuum. Figure 1 shows approximate start and end dates of Common Fund funding for each of the DCCs; it is recognized that funding for some programs is continued by another NIH IC. One implication of the range in start dates is that DCCs vary widely in their data assets, depending on their stage of maturity. For example, HuBMAP was launched in November 2018 and is not expected to be in production phase until 2022. At the other extreme, the HMP/iHMP DCC has completed 10 years of operation and is generating no new data; its funding has been discontinued. The DCCs also vary in terms of their readiness to be operationalized on cloud-based systems. This is reflected in Table 2, which provides a short summary of data hosted at each site, their number of users, and whether each DCC is using a cloud-based system.

Figure 1: Approximate start and end dates of Common Fund funding for the Common Fund DCCs. Some dates may be inaccurate due to differences between funding approval and program start.

	4D Nucleome	GTEx	HMP / iHMP	Kids First	LINCS	Metabolomics	HuBMAP	MoTrPAC	SPARC
API available	Yes	Yes	Yes	Yes	Yes	Yes	N/A	N/A	N/A
Data model documented	Yes	Yes	Yes	Yes	Yes	Yes	N/A	N/A	N/A
Training materials online	Yes	Yes	Yes	Yes	Yes	Yes	N/A	N/A	N/A
Website user visits / month		~15,000	~20,000	~2,000	~7,000	~1,500	N/A	N/A	N/A
Total volume of data	28TB	600GB	10TB(iHMP) 7TB(HMP)	927TB	100TB genomic, >1PB imaging	6.4TB	N/A	N/A	N/A
Linked to cloud workspace	Yes	Yes	No	Yes	Yes	No	N/A	N/A	N/A
Cloud or local storage	Cloud	Cloud	Local	Cloud	Local	Cloud	N/A	N/A	N/A
Protected data hosted at dbGaP	No	Yes	Yes	Yes	Yes	No	N/A	N/A	N/A

Table 2: Approximate dataset size, numbers of users, controlled access usage, and additional resources hosted by Common Fund DCCs

Commonalities and complementarity of DCC assets. A comparison of data types across all DCCs is presented in Table 3; these data indicate that the same types of data are hosted at multiple sites, and that data from different sites could be useful in combination. Whole genome sequence, exome sequence, and transcriptional data were among the datatypes most frequently hosted by the DCCs. Several sites host data associated with human genes, which means that, if properly linked, users could obtain expression, epigenetic, and variant information associated with specific gene regions. Metadata categories are also frequently similar across multiple sites. For example, several sites host data associated with a body site, suggesting that queries such as "retrieve all datasets associated with skin samples" would return multiple datatypes from multiple DCCs. At least four DCCs host clinical information, which suggests that CFDE users could obtain sets of different datatypes associated with disease and patient variables (e.g., body mass index, blood pressure) from across the Common Fund DCCs.

Table 3: Summary of DCC datatypes and subject matter. (A) Datatypes found across all sites. Assets that are currently available are represented by ‘X’. Assets that are planned are indicated by ‘P’. (B) Level of resolution, in terms of anatomy, cellular or molecular level, that is represented at each site.

FAIRness of the DCC assets is reasonable, but has room for improvement. We evaluated each Common Fund DCC that publicly hosts data for FAIRness. We employed a set of metrics developed by Wilkinson et al. (Scientific Data, 2016); the results are shown in Table 4. The FAIR Principles described by Wilkinson place specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. Our analysis of the DCCs demonstrates that even though they do not follow a common set of FAIRness practices, each individual group has been very effective at improving the FAIRness of their data. All groups did very well on findability (hosting data in searchable resources), supplied free and open electronic protocols for accessibility, and supported reusability by describing their data with accurate and relevant attributes. Interoperability measures scored reasonably well, but there was more room for improvement in this category than in the other metrics.

Additional overview information for all of the CF DCCs can be found in Appendix B.

Table 4: FAIRness assessment performed by manual, subjective review of each DCC. The CFDE tech team is currently developing objective and fully automated measures to be applied to each site going forward.

Deep dives

To date, the CFDE engagement team has interviewed four DCCs, which are at varying points in their funding lifecycle. Two of these engagements, Kids First and GTEx, were two-day, in-person interviews. Our other two interviews, with HMP and LINCS, took place via teleconference. (Additional information about our engagement methodology is in Appendix A.) One important result of these engagements was to reinforce how personalized visits play an essential role in building an effective working relationship with the DCC staff. Our visits were a very effective mechanism for establishing trust, understanding the goals of each DCC, creating incentives for the DCCs to participate in the CFDE, and discovering important resources developed by the DCCs that could be utilized by other groups.

DCC site visits are a crucial part of creating an effective working relationship with the DCC staff. During our visits, we operated in a mode of listening, avoiding any discussion of specific CFDE implementation details; staff were then much more receptive to sharing how they imagined this project would unfold. They were also more emboldened to be honest about where the CFDE might fail. Because we are researchers who are independent from the NIH, staff were more open to communicating about challenges with NIH policies. Each visit helped us recalibrate our understanding of the goals and technical expertise of each DCC. Synthesizing this information after each visit pushed us to align our concept of the CFDE with the true needs of the DCCs. Overall, the site visits increased trust between the DCC members and our group, which will be vital to successful operationalization of the CFDE.

While we were able to get baseline data from our interviews regardless of format, we found our in-person visits to be much more productive. Our face-to-face conversations, and the level of trust they established, allowed us to discover important blockers and challenges faced by the DCCs that we could not have learned by any other means. Further, visiting over two days allowed us plenty of time to cover topics in depth, while still allowing for breaks, downtime and review. DCC staff did much of the talking on day one of our meetings, and would be quite fatigued by early afternoon. We took this as a natural breakpoint, and moved our engagement team off-site to discuss what we had learned, and what CFDE might do to address the challenges the DCC had presented to us. On the second day, we were able to offer well-thought-out suggestions for CFDE involvement, and to get feedback in real time. These interactions were vital to creating our vision for CFDE, and many of the recommendations in this document were shaped in those meetings. We believe it is essential to continue meeting on-site with DCC staff.

One clear lesson from our site visits is that new DCCs will face a wide variety of challenges. Each center we talked to had deep knowledge of challenges that are likely shared by other DCCs, and had solved many with innovative solutions. Unfortunately, DCCs frequently operate in isolation from each other due to the initial burden of getting started and the continued pressure of serving teams of data generators. Staff at these sites are experts in the niche domain of running a DCC, and between them there is a vast wealth of institutional knowledge. The CFDE could leverage that knowledge, and increase cross-DCC interactions, by continuing outreach and engagement to the DCCs, serving as a centralized resource of information that is generated over time, and building a DCC community.

GTEx. Over 15,000 users come to their site each month and GTEx works hard to ensure that those researchers can easily accomplish their goals. GTEx is heavily focused on user experience. GTEx has done a tremendous amount of work to ensure that all of their data is uniformly processed, and has undergone rigorous quality assessment before being displayed in the portal. The GTEx portal allows a user to do a wide array of analyses, such as comparing expression levels and plotting PCAs, in a rich, interactive point-and-click interface. The richness and complexity of the GTEx dataset combined with their massive number of users results in an incredible support burden. Although some inquiries are about GTEx data specifically, most of their support time is spent answering questions such as ‘How do I compare my RNA-Seq with GTEx data?’, which are really questions of basic bioinformatics skills rather than about GTEx in particular. A planned update to their portal later this year will allow even more complex analysis of eQTLs, and GTEx is committed to increasing user accessibility, even if it increases their support burden. They told us that enabling user queries across datasets, for example between GTEx and Kids First, and improving user access to data hosted at dbGaP would be among their highest priorities when participating in CFDE. GTEx is also working to enable their users to perform analysis on Terra, the Broad’s data analysis platform. The complete GTEx report can be found in Appendix C.

Kids First. The overarching goal of the Kids First DCC is to accelerate the pace of translational research, in order to impact the lives of children right now. Kids First takes a very pragmatic approach to every aspect of their DCC, and is interested in essentially any program, standard or collaboration that will increase the pace of research. Although they do not currently have a user training program, Kids First is eager to create one. The users for Kids First range from clinicians with no computational knowledge to bioinformaticians with no medical knowledge, and so require an equally broad training program. Getting such a training program up and running will allow more researchers to use the Kids First data, and to use it to ask more sophisticated questions, faster, which puts it near the top of Kids First's priority list. Kids First also expressed interest in making all of their data FAIR. They are very interested in a cross-DCC search capability and making their data interoperable with outside datasets. Again, broadening access and reducing barriers to finding and using their data is a top priority. Assuming CFDE can operationalize FAIR principles, Kids First is excited to improve their FAIRness and participate in assessments. Finally, Kids First would like to participate in a DCC community. Running a DCC is a specialized skill shared by only a handful of people. However, because the demands of daily operations far outweigh other priorities, a DCC is rarely incentivized to work with another. The Kids First team pointed out that while our goal of de-siloing the data had merit, people at DCCs need to be de-siloed as well. The complete Kids First report can be found in Appendix D.

HMP. Funding for the HMP expired earlier this year, so while the portal and data are still accessible, there is no active work on the DCC. As such, much of our discussion was centered on how fundamental concepts in data sharing apply to retired data centers. For instance, in the first phase of HMP, they devoted a great deal of resources to ensuring that their pipelines were well documented and could be replicated by users, and they worked hard to build a consistent and complete metadata model. While the term FAIR and its attendant definitions didn’t exist until well after phase 1 of HMP was complete, the HMP was using many of the same principles to build their data center. However, the data at HMP is due to be moved to archival storage in the next few months as they no longer have funding to host it, and although the pipelines are well documented, many of their processing pipelines are based on software that should be updated. The HMP group has expended significant resources on metadata curation and on documenting how their data was processed for the research community, but it is unclear what will happen to that information once their data is moved to an archive. The HMP was also interested in the idea of an avenue for cross-DCC collaborations and discussions. In fact, the HMP told us that they have been offering informal advice and mentoring to younger data centers, though not ones on CF’s priority list. Given their lack of funding, the HMP indicated that moving their data to a stable, professionally managed file system (such as Google or Amazon cloud) was a top priority. They also noted that updating their pipelines to use more modern technology will be key to providing quality datasets to users going forward. The complete HMP report can be found in Appendix E.

LINCS. The primary concern of the LINCS DCC is that their funding ends on June 30th, 2020, and in conjunction with their Program Officers, they are working hard to mitigate the effects of this impending hard stop. Common Fund DCCs are limited to ten years of funding, and receive their awards in equal sums across all years. LINCS told us that this made their first year difficult, as they couldn’t onboard staff or ramp up projects quickly enough to use the funding, and were not allowed to carry forward an unobligated balance. Currently, LINCS supports over 50 portals and analysis apps, tens of terabytes of data, and a robust training program, all of which are popular with the community. No one wants all of those resources to just disappear, but the path forward is uncertain. LINCS’ top priority is trying to get approved for a no-cost extension, or to use their old unobligated balance to extend their funding past June 2020. They have also begun work on the Signature Commons project, a philosophical continuation of the LINCS portal that emerged from their technology, and that could accept new funding under the new name; however, none of these solutions were guaranteed at the time of our meeting. All of this highlights that different stages of the DCC lifecycle have varying challenges, and that DCCs could benefit from targeted support in their early and late years. The complete LINCS report can be found in Appendix F.

Opportunities and Challenges for Individual DCCs, summarized

The challenges described in this subsection, as well as the opportunities we have identified, are those that will require targeted funding to individual DCCs in addition to broader CFDE efforts. These challenges may be shared across DCCs; however, the specific details of the challenge at each DCC are different enough that they require individualized implementations. This is in contrast to challenges that can be addressed by a generalized Common Fund-wide solution, which are covered in the next subsection. In some instances, such as user support, there are challenges at both the individual DCC and Common Fund level.

Data maintenance and access. All of the DCCs reported having various funding issues regarding hosting, analyzing or providing access to their data. Although the DCCs have used different, non-cross-compatible implementations, each has solved user access to unprotected data. Access to protected data is less settled: for example, GTEx reported that their current protected data access solution is non-functional. Targeted support is needed to make GTEx data accessible. Similarly, all of the DCCs reported being unfunded or underfunded for fees associated with cloud computing; however, their specific needs vary. This challenge is addressed in Recommendations 1, 3, and 4.

User training needs. All four interviewed DCCs see user-focused training as a straightforward way to onboard more users to their datasets, speed discovery for clinical researchers, and improve the skill level of bioinformaticians. Users at each DCC require specific training tailored to that DCC's datasets. KF and GTEx have few ongoing training efforts and see it as a substantial need, while both LINCS and HMP have invested in training and see it as a path to enhanced use of their data. KF was particularly enthusiastic about the opportunities for enhanced clinical impact through training clinicians to use their portal. This challenge is addressed in Recommendations 1, 3, and 5.

Ageing infrastructure. HMP is no longer funded, and LINCS funding ends on June 30, 2020. Both DCCs use primarily local storage, and both told us that they are unsure about ongoing data hosting. In particular, the local servers that hold their data are being retired in early 2020. Unless they receive immediate funding to migrate that data to the cloud, and ongoing funding for cloud storage, their data will become inaccessible. This challenge is addressed in Recommendations 1, 3, 4, and 5.

Opportunities and Challenges across the DCCs, summarized

These challenges represent issues that are both widely faced by DCCs and that require a single Common Fund-wide solution or buy-in from multiple DCCs.

Asset specification incompatibilities. Each of the DCCs hosts many files (e.g., genomic sequence, metagenomic, RNA-Seq, physiological and metabolic data) and it is hard to discover these files across DCCs. Moreover, information describing the contents of the files is not available in a standard format. This prevents DCCs from making use of each other’s data, makes the data less discoverable by others, and hinders interoperability. If the DCCs adopted a standardized Common Fund asset specification format, these problems could be solved. This challenge is addressed in Recommendations 1, 3 and 4.
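To make the idea of a shared asset specification concrete, the following is a minimal sketch in Python. The field names (dcc, asset_id, data_type, body_site, access), the example records, and the find_assets helper are all hypothetical and chosen only for illustration; no Common Fund asset specification format has been defined here. The sketch only shows that, once every DCC describes its files with the same fields, a single query such as "all assets associated with skin samples" could run across DCCs.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class AssetRecord:
    """Hypothetical common description of one hosted file."""
    dcc: str
    asset_id: str
    data_type: str
    body_site: str
    access: str  # "open" or "controlled"


# Made-up example records for illustration; not actual DCC holdings.
CATALOG = [
    AssetRecord("GTEx", "example-0001", "RNA-seq", "skin", "open"),
    AssetRecord("HMP", "example-0002", "16S metagenomic sequence", "skin", "open"),
    AssetRecord("Kids First", "example-0003", "VCF", "blood", "controlled"),
]


def find_assets(catalog, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record
        for record in catalog
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


# One query, answered from records contributed by several DCCs.
skin_assets = find_assets(CATALOG, body_site="skin")
print(json.dumps([asdict(record) for record in skin_assets], indent=2))
```

Without an agreed set of fields, the same question would have to be reformulated separately against each DCC's own data model.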

Data access barriers. Both Kids First and GTEx reported that a critical need is to reduce the barriers their users face in accessing dbGaP data. Many types of relatively simple data retrieval are made impractical by the structure of dbGaP, FISMA compliance is a major challenge, and the administrative burden of obtaining access to multiple dbGaP studies is prohibitive. Another significant concern expressed was that there is no sign-on and authorization method for DCC users that could be used by all of the DCCs. This challenge is addressed in Recommendations 3 and 4.

Asset transport. DCCs saw a need for an interoperable mechanism for transporting datasets from DCC portals to analysis resources, such as Terra or Cavatica. This would support combining dataset cohorts across DCCs, and would also facilitate the use of a range of analysis pipelines. An example workflow from Kids First is: a clinician builds a synthetic cohort on the Kids First portal, and the portal creates a list of the relevant data files. The list is then transmitted to an analysis platform where a biomedical data scientist can analyze the data in response to the clinician’s needs. Asset transport mechanisms already exist, but a significant issue is that no single standard is used by the DCCs, which greatly impedes data sharing, movement of assets to common analysis systems, and sharing lists of data assets between users. This challenge is addressed in Recommendation 3.
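The Kids First workflow described above can be sketched as a small, portable file manifest. This is an illustrative sketch only: the manifest layout, the build_manifest and load_manifest helpers, and the placeholder identifiers and example.org URLs are assumptions made for this example, not an existing DCC, Terra, or Cavatica format.

```python
import json

# Hypothetical keys that every file entry in the manifest must carry.
REQUIRED_KEYS = {"file_id", "url", "access"}


def build_manifest(cohort_name, files):
    """Portal side: serialize the files selected for a cohort as JSON."""
    for entry in files:
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"file entry missing keys: {sorted(missing)}")
    return json.dumps({"cohort": cohort_name, "files": files}, indent=2)


def load_manifest(text):
    """Analysis-platform side: parse a manifest and split files by access level."""
    manifest = json.loads(text)
    open_files = [f for f in manifest["files"] if f["access"] == "open"]
    controlled = [f for f in manifest["files"] if f["access"] != "open"]
    return manifest["cohort"], open_files, controlled


# Placeholder entries; identifiers and URLs are invented for illustration.
selected = [
    {"file_id": "example-file-1", "url": "https://example.org/files/1", "access": "open"},
    {"file_id": "example-file-2", "url": "https://example.org/files/2", "access": "controlled"},
]

manifest_text = build_manifest("synthetic-cohort-demo", selected)
cohort, open_files, controlled_files = load_manifest(manifest_text)
print(cohort, len(open_files), len(controlled_files))
```

The point of the sketch is that the portal and the analysis platform only need to agree on the manifest layout, not on each other's internals; separating controlled-access entries reflects that those files would still need authorization before retrieval.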

Life stage challenges. DCCs noted various challenges as their centers were initially ramping up, interacting with data producers during the active period of the project, and sustaining data and tools after the primary funding is over. These challenges include recruiting engineers, making infrastructure technology choices, creating a data submission, validation, and processing pipeline, providing data in standard formats, developing robust user-facing software, and transitioning data to long-term storage. This results in slow ramp-up, suboptimal infrastructure choices, delays in getting robust data pipelines in place, fragile software, and lost opportunities for data reuse. This challenge is addressed in Recommendations 1, 4 and 5.

Expertise silos. Each DCC has developed tools and strategies to support their mission, and these tools could be used by other DCCs. For example, GTEx has built a mature visualization portal for RNA-Seq data, with tremendous data exploration capabilities that could be used in other DCCs; Kids First has a pragmatic strategy for protected data access that could be readily adopted by new DCCs; and LINCS has a number of powerful analysis tools and approaches that could be reused. There is strong interest in enabling DCCs to reuse existing resources and tools and to learn from each other, in addition to interoperating. A great opportunity for CFDE is to promote cross-DCC interactions, to maintain institutional memory across the DCCs and increase re-use of infrastructure developed by each site. This challenge is addressed in Recommendations 2, 3, and 4.

Increased data analysis and problem complexity. Operationalizing FAIR principles will result in less-expert users being able to complete complex, cross-DCC analyses. This increase in researcher access and ability will also dramatically increase both the number and complexity of user support requests at all participating DCCs. In particular, both KF and GTEx are already challenged by the support needs for their datasets. While both DCCs are eager to see more reuse of their data, and KF in particular sees tremendous opportunity for training clinicians to use their portal to find data relevant to patient care, they are wary of bringing on more users without a better on-ramp system and a way to increase the skill level of biomedical scientists using their data. Solutions include obtaining access to inexpensive, large-scale computing, as well as improving interoperability between computational platforms so that users can use familiar tools, and developing helpbot technology to direct users to appropriate documentation resources. This challenge is addressed in Recommendations 3, 4, and 5.

Support burden. All four interviewed DCCs saw a strong need to reduce their support burden. More users and more data reuse will overwhelm their already significant activities in user support, both by creating more support requests and by diversifying the requests they do get as user expertise grows. This challenge is addressed in Recommendations 3, 4 and 5.

Obstacles to cooperation. DCCs have limited time to participate in group exercises, and are wary of spending resources to participate in creating a common set of best practices unless they know all of the DCCs will adopt them. Without facilitation by the CFDE, cross-DCC participation is difficult because interactions between the groups are limited, each group has few incentives to adopt the practices of another, for some data and metadata assets no standard exists, and for other assets competing standards may be used by two different DCCs. This challenge is addressed in Recommendations 2, 3, 4, and 5.

Lack of common practices. The acronym "FAIR" is now a popular term frequently appearing in conference presentations, whitepapers, peer-reviewed literature and NIH RFAs. However, there are no specific criteria for DCCs to operationalize FAIRness, or criteria for determining whether data has become more FAIR. The absence of rigorous FAIRness criteria also greatly reduces each DCC's motivation to participate in the use of common practices. This challenge is addressed in Recommendations 2, 4 and 5.

Insufficient DCC engagement. Technical cooperation between the DCCs and identification of reusable technical solutions is challenging because DCCs are largely unaware of what technical approaches are being used by other DCCs. More generally, as noted by Kids First, running DCCs is a specialized skill shared by only a handful of people. Regular sharing of challenges, approaches, and solutions between DCCs, with the CFDE, and with Common Fund program officers, could have a significant operational impact. This challenge is addressed in Recommendations 2, 3, 4 and 5.