July Report – Executive Summary – Common Fund Data Ecosystem

View/Download the full July report on figshare

Preamble

In early 2019, we were charged with assessing the opportunities and challenges that Common Fund DCCs face with respect to making their data more accessible and usable within and between CF programs. Identifying and solving issues that inhibit data access and reuse will lead to enhanced utility of CF data, both during the CF support period and after a CF program has ended. Moreover, we wish to lay the foundation for interoperability that will enable work across multiple CF programs. Our effort identifies many of the vital elements needed to build a comprehensive Common Fund digital ecosystem.

To achieve this goal, we initiated a comprehensive assessment of all of the CF DCCs, specifically targeting their current level of FAIRness (findability, accessibility, interoperability, and reusability), infrastructural elements such as data storage, support for clinical users, and their ability to provide access to human subjects data. This intermediate report summarizes the results of our initial in-depth assessment of four DCCs: Kids First, GTEx, HMP and LINCS. We make recommendations on how to support the DCCs, establish a Common Fund Data Ecosystem (CFDE) within and across the Common Fund DCCs, and lay the groundwork for integration with other NIH datasets.

This assessment was generated from a combination of systematic review of online materials, in-person site visits to the Genotype-Tissue Expression (GTEx) DCC and the Kids First DCC, and online interviews with the Library of Integrated Network-Based Cellular Signatures (LINCS) and Human Microbiome Project (HMP) DCCs. Comprehensive reports of the site visits and online interviews are available in the appendices. We summarize the results within the body of the report.

The Common Fund DCCs

The Common Fund DCCs store and provide data derived from hundreds of studies and samples collected from thousands of human subjects. An incredible diversity of datatypes has been generated at the genomic, expression, proteomic, metagenomic, and imaging levels, and the DCCs support a tremendous range of scientific discovery efforts.

However, the present ability of a clinical or biomedical researcher to use the resources generated by the Common Fund is poor. The resources are hard to search collectively and not readily usable in combination. Rapidly making use of these resources is particularly challenging because they are hosted across multiple Data Coordination Centers (DCCs). At our visit with the Kids First DCC, we heard the following story:

Members of Kids First were contacted by a doctor who had 24 hours to enroll a patient in a clinical trial. Using the Kids First platform, they compared the patient's genome to variants hosted on their platform, and reviewed these results with information hosted at GTEx. As a result, the clinician was able to identify additional therapeutic avenues.

This was only possible because of unique circumstances. Kids First had already gathered the data from GTEx and reprocessed it in a way that made it possible to compare the results to their own data. This process required months of work, advance planning and significant bioinformatics expertise on the part of Kids First, and would have been an impossible task for the clinician they helped. This highlights the main result of our assessment: the datasets hosted by the DCCs are not inherently interoperable, and placing their assets in the cloud does not intrinsically solve the problems of findability, accessibility, interoperability, and reusability. Further, even if this combined dataset were made available, most researchers don’t have the bioinformatics skills to use it: researcher training is needed to support data use.

It was apparent from our assessment that the DCCs have many needs, some of which are specific to one DCC and others that are shared among DCCs. These include needs for enhanced protected data access (GTEx), long-term data storage support (GTEx, Kids First, LINCS, HMP), training (GTEx and Kids First), the ability to export collections of data from their portals (GTEx, Kids First), and support for their data and data portals past the end of the Common Fund Program lifecycle (HMP, GTEx and LINCS). We also found advanced capabilities at DCCs that could be reused by other DCCs (e.g. RNA-Seq and eQTL visualization at GTEx, and a strategy for accessing protected data at Kids First). We anticipate discovering additional needs and opportunities in our next set of interviews.

We also found that a transformative opportunity exists to operationalize FAIRness across the DCCs, to permit translational impact by improving data discovery, access, and reuse. While the individual Common Fund projects excel at making their data FAIR within each DCC, there is little cross-DCC dataset FAIRness. For example,

There is no systematic way to identify what data is hosted at the CF DCCs, which makes individual datasets hard to Find.
In the absence of a standard way to access protected datasets at dbGaP, much of the underlying phenotype data cannot be Accessed.
There is no standard way to transfer collections of data from multiple DCC portals to a single analysis system like Broad’s Terra or Kids First’s Cavatica, which inhibits Reuse of data.
Interoperability of data across DCCs cannot be evaluated in depth without actually performing analyses between DCCs, which relies on resourcing pilot studies.

Operationalizing FAIR principles across the DCCs would be transformative because it would permit researchers and clinicians to rapidly and routinely leverage multiple Common Fund datasets in their work, just as Kids First has done with GTEx. Operationalizing FAIR will require cross-DCC solutions. For example, a cross-Common Fund portal would enable the Findability of related datasets in different DCCs. This portal would ideally rely on an asset inventory distributed by the DCCs in a common format, so that the portal could automatically update as new data is released. The portal, and its underlying standards, will accelerate discovery and decrease time to impact of Common Fund-supported research. Moreover, long-term sustainability of the datasets would also be enhanced by common descriptors and common access methods.
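
As a rough illustration of how a common-format asset inventory could feed such a portal, the following is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names (dcc, asset_id, data_type, access), the function names, and the sample records are hypothetical and do not represent an agreed CFDE specification or real dataset identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """One entry in a DCC's asset inventory (hypothetical fields)."""
    dcc: str        # name of the hosting DCC, e.g. "GTEx"
    asset_id: str   # identifier local to that DCC
    data_type: str  # e.g. "RNA-Seq"
    access: str     # "open" or "protected"


def build_index(inventories):
    """Merge per-DCC inventories into one index keyed by (dcc, asset_id).

    Rebuilding the index from newer inventories replaces stale entries,
    which is one way a portal could refresh itself as data is released.
    """
    index = {}
    for inventory in inventories:
        for record in inventory:
            asset = Asset(**record)
            index[(asset.dcc, asset.asset_id)] = asset
    return index


def find_by_type(index, data_type):
    """Return assets of one data type across all DCCs, sorted by DCC."""
    matches = [a for a in index.values() if a.data_type == data_type]
    return sorted(matches, key=lambda a: (a.dcc, a.asset_id))


# Illustrative records only; these are not real dataset identifiers.
gtex_inventory = [
    {"dcc": "GTEx", "asset_id": "example-1",
     "data_type": "RNA-Seq", "access": "open"},
]
kf_inventory = [
    {"dcc": "Kids First", "asset_id": "example-2",
     "data_type": "RNA-Seq", "access": "protected"},
    {"dcc": "Kids First", "asset_id": "example-3",
     "data_type": "WGS", "access": "protected"},
]

index = build_index([gtex_inventory, kf_inventory])
rna_seq = find_by_type(index, "RNA-Seq")
```

In this sketch, a single query over the merged index returns related RNA-Seq assets from both DCCs. A real inventory format would need far richer descriptors and would have to be agreed upon with the DCCs.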

This tremendous opportunity comes with challenges. In order for cross-DCC data findability, accessibility, reuse and interoperability to work in practice, all of the DCCs must participate. This will involve a significant investment of time and energy that needs to be supported and incentivized by the Common Fund leadership. This transformation could be driven by using FAIRness evaluation as an organizational tool, and making investments in both incremental and transformative change at each DCC. There is also an important role for a continued trans-DCC effort that will engage with the DCCs, drive iteration of standards through FAIRness metrics, and implement missing technical solutions.

To address these challenges, our recommendations for resource allocation are as follows:

Support individual DCC needs with targeted investments, e.g. fund data storage.
Invest in common DCC needs, including a standard method for authentication/access to protected data, training, lifecycle support, and FAIRness metrics.
Continue the Common Fund Data Ecosystem’s work to support transformation at the ecosystem level, including building a common portal, developing an asset specification, and supporting a pilot data reuse postdoc.
Invest in additional work by the Common Fund Data Ecosystem to drive transformational change with a common manifest format, FAIRness evaluation across multiple DCCs, and pilot data reuse projects between DCCs.
Consider RFAs for longer-term investment in supporting the CFDE ecosystem.

We have coalesced a consortium that is prepared to meet the challenges required to implement the CFDE. The flexibility of resource allocation and administrative oversight enabled by the OTA affords an unprecedented degree of effectiveness. This, in combination with an interdisciplinary group composed of NIH program representatives and technology experts -- and an engagement team that actively consults with the Common Fund DCCs -- has allowed us to adapt our original set of deliverables to rapidly meet the needs of the DCC community. Based on these experiences, and on what we have learned from generating this assessment, we are confident we can operationalize the CFDE in the near future.