December Report – The Value Proposition for a Common Fund Data Ecosystem

A data ecosystem is a collection of data silos or commons joined together by a set of standards and services that facilitate findability, accessibility, reuse, and interoperability of datasets across those silos/commons. A data ecosystem focuses on enabling multi-way, horizontal connectivity between datasets rather than deeper “vertical” analysis within each dataset. The goal of an ecosystem is to enable use cases between data silos, not within them.

A key feature of an ecosystem is that improvements to one member provide benefits to the other members, which implies a kind of interdependence. This interdependence is something that we can measure. It also implies some kind of governance that provides for standards evolution and formal adoption of standards among members of the ecosystem; many of these standards are concerned with FAIR.

Levels of maturity for an ecosystem progress through findability and accessibility of data, data reuse, and data interoperability. Findability requires a search or catalog mechanism, i.e. an inventory of datasets. Accessibility requires convergence on authentication and authorization so that datasets can be accessed across the ecosystem. Reuse requires workflows and compute. And interoperability requires mechanisms for syntactic and semantic exchange. Nothing about an ecosystem requires that all datasets interoperate, and indeed most won’t be able to. However, an ecosystem should contain enough information about individual datasets that their compatibility can be determined, ideally in an automated way.
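
To make the idea of automated compatibility determination concrete, the sketch below compares two hypothetical dataset metadata records using one syntactic check (a shared file format) and one semantic check (a shared reference genome build). The DatasetRecord structure, its field names, and the checks are illustrative assumptions only, not an existing Common Fund specification.

```python
from dataclasses import dataclass

@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record for one dataset in the ecosystem."""
    identifier: str    # persistent identifier, e.g. a DOI
    assay_type: str    # controlled-vocabulary term for the assay
    genome_build: str  # reference genome build, e.g. "GRCh38"
    file_formats: set  # file formats the dataset is distributed in

def compatible(a: DatasetRecord, b: DatasetRecord) -> bool:
    """Return True if two datasets could plausibly be analyzed together.

    Syntactic check: they share at least one file format.
    Semantic check: they use the same reference genome build.
    """
    shares_format = bool(a.file_formats & b.file_formats)
    same_build = a.genome_build == b.genome_build
    return shares_format and same_build

# Hypothetical records: an RNA-seq dataset and a WGS dataset on the same build.
rna = DatasetRecord("doi:10.0000/example-rna", "RNA-seq", "GRCh38", {"BAM", "FASTQ"})
wgs = DatasetRecord("doi:10.0000/example-wgs", "WGS", "GRCh38", {"BAM", "CRAM"})
print(compatible(rna, wgs))  # True: shared BAM format, same genome build
```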

Common Fund programs are creating and curating transformative, cross-cutting datasets, and are staffed by domain experts. Operating independently allows them to quickly adopt new standards, create new tools and protocols, and develop new workflows to meet the needs of their users. Adding interdependence and interoperability requirements to CF programs risks slowing innovation and burdening already over-taxed data resource centers with additional work. However, interdependence does not require uniformity, and we see a number of ways in which participation in an ecosystem would benefit all Common Fund Data Coordinating Centers at minimal cost:

Improving Cross-Common Fund Findability and Accessibility of Datasets

There is substantial opportunity to increase the findability of Common Fund datasets by standardizing on a minimal asset specification, implementing practical FAIR metrics across the Common Fund, and engaging with NIH-wide access and authentication protocol development efforts such as the Researcher Auth Service (RAS).
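
As one illustration of what practical, automatable FAIR metrics might look like, the sketch below scores a hypothetical metadata record against a handful of simple checks. The field names, the specific checks, and the controlled vocabulary are assumptions made for illustration, not an adopted Common Fund schema or an established FAIR metrics suite.

```python
# Illustrative, automatable FAIR checks against a hypothetical metadata record.
# Field names, checks, and vocabularies are assumptions, not an adopted CF schema.

def fair_score(record: dict) -> dict:
    """Score a metadata record against a few practical FAIR checks."""
    checks = {
        # Findable: persistent identifier and human-readable title
        "has_persistent_id": record.get("identifier", "").startswith("doi:"),
        "has_title": bool(record.get("title")),
        # Accessible: resolvable access URL and explicit license
        "has_access_url": record.get("access_url", "").startswith("https://"),
        "has_license": bool(record.get("license")),
        # Interoperable: assay type drawn from a controlled vocabulary
        "controlled_assay_term": record.get("assay_type") in {"RNA-seq", "WGS", "ATAC-seq"},
    }
    return {"passed": sum(checks.values()), "total": len(checks), "detail": checks}

example = {
    "identifier": "doi:10.0000/example",
    "title": "Example Common Fund dataset",
    "access_url": "https://example.org/data",
    "license": "CC-BY-4.0",
    "assay_type": "RNA-seq",
}
print(fair_score(example))  # passes all 5 illustrative checks
```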

Managing Data Storage and Compute Costs

At the technical level, a growing number of open source platforms support both NIH and non-NIH payment for analysis of cloud-hosted datasets without incurring egress charges, e.g. the Broad Institute’s Terra (in use by GTEx and AnVIL) and Seven Bridges’ Cavatica (in use by Kids First). Evaluating and/or adopting these platforms is a specific technical opportunity to solve problems across current Common Fund projects.

Opportunities for Cross-Pollination Between Common Fund DCCs

There are many opportunities to connect scientific questions and datasets across programs, as well as to share infrastructure and coordination solutions. Many of the datasets may be complementary if their metadata and analysis pipelines can be harmonized. Most of the DCCs use similar open source data analysis software (R and Python/Jupyter) and use GitHub to distribute workflows. Almost every DCC uses Google or Amazon cloud hosting for data and compute.

Regular connections between DCCs could result in the sharing of solutions and the enhancement of expertise in all of these areas. These connections could be facilitated by a centralized team that can field specific requests for information, as well as by informal cross-pollination events to discuss shared technical and scientific goals. Increased interaction between all groups would lead to more DCC-to-DCC projects involving data integration or harmonization, and would create a network for sharing resources that other DCCs can use.

Opportunities for User Support and Training

There is a significant opportunity to increase usage of CF resources by supporting training across the CF as well as within individual DCCs. Training options could include developing materials for workshops, running in-person and online workshops, offering and recording webinars, providing MOOCs, and running hackathons for advanced users. Data reuse fellowships and summer programs for undergraduates could further increase the usage and accessibility of CF data. Paired assessment programs could be developed to evaluate, iterate on, and improve training offerings.

Separately, if resources become easier to find across the CF, a centralized help resource and tier 1 helpdesk could provide a first line of engagement for biomedical scientists seeking to find and integrate CF data assets.

Preparation for the Future

Several projects are underway to establish cloud-based data platforms across NIH (e.g., NHGRI’s AnVIL, NHLBI’s BioData Catalyst, and NCI’s TCGA), and the Common Fund recognizes the need to prepare for a future of federation and interoperation. NIH-wide standards for federation have yet to emerge, but we can be certain that integration with data hosted by other ICs is of critical importance. Sequence-based technologies for variant detection, whole genome/exome analysis, single-cell analysis (e.g., the Human Cell Atlas), and epigenomic analysis will increasingly be used by future Common Fund programs, as well as by programs at other ICs. If properly linked to CFDE assets, the data from these many projects will create a synergistic data network for the research community. To realize this network, it will be important to collaborate on national and international interoperability efforts, and to encourage the use of these emerging standards throughout the Common Fund.