December Report – What Is The CFDE – CFDE-CC Implementation

Table of Contents:
Federation
Common Fund Data Asset Specification
Common Fund Data Asset Manifest
Common Fund Data Asset API
The CFDE Portal
FAIR Metrics
Develop CFDE Best Practices
Policy Development And Technical Implementation
Training Program Development
DCC Cross-Pollination Events
Increased Software Reuse
Platform For Innovation

Federation

Each of the DCCs hosts many data assets (files) (e.g., genomic sequence, metagenomic, RNA-Seq, physiological, and metabolic data), and it is hard to discover these assets across DCCs. Moreover, information describing the contents of the files is not available in a standard format. This prevents DCCs from making use of each other's data, makes the data less discoverable by others, and hinders interoperability. To improve federation, the Common Fund Data Ecosystem (CFDE) will be based on a collection of inventories derived from data hosted on cloud-based systems by a number of DCCs. The inventories will describe all the assets at each Program. As part of the ecosystem, the Coordinating Center will make these inventories available via a central catalog to enable discovery of assets across the DCCs. The advantage of this approach is that forming the ecosystem does not require moving the data assets themselves to a central repository, only the inventories describing those assets. Cataloging all of the Common Fund assets is a simple and effective means of liberating data from what would otherwise be several siloed repositories, and therefore greatly increases the Findability, Accessibility, Interoperability, and Reusability of all Common Fund data. This form of data federation can also be extended to programs funded by other institutes and easily linked to other NIH ecosystems: once an inventory system is available, it can be used by anyone.

Common Fund Data Asset Specification

We will simplify discovery of the assets hosted at the DCCs by creating a specification of a minimal set of descriptors for each of these files and electronically encoding this information in a common format. While many implementations for electronically encoded data assets of biomedical resources have been proposed in the literature, no single standard has been adopted by the Common Fund DCC community. However, there is a high likelihood of achieving adoption across several groups if a consensus-building process is carefully managed by the NIH and the CFDE-CC team. The types of files (e.g., genomic sequence, metagenomic, RNA-Seq, physiological and metabolic data) that are referenced with the Common Fund Data Asset Specification will be flexible, and our current specification contains a small number of essential elements such as: a Globally Unique Identifier (GUID); originating institution (e.g., "Broad Institute"); assay type (e.g., "whole genome/exome", "transcriptome", "epigenome"); file type (e.g., "fastq", "alignment", "vcf", "counts"); and tissue source and species name for the sample. The data asset specification also enables us to use readily available internet technologies to retrieve additional information for each asset, such as metadata (e.g., patient variables, project name), and to resolve access issues such as files being hosted on the cloud or on local servers.
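As an illustration, a minimal asset record conforming to such a specification might be represented and checked as in the sketch below. The field names, identifier scheme, and example values are hypothetical placeholders chosen for this sketch, not the adopted CFDE specification.

    # Hypothetical sketch only: field names and example values are illustrative,
    # not the adopted CFDE data asset specification.
    REQUIRED_FIELDS = {"id", "originating_institution", "assay_type",
                       "file_type", "tissue_source", "species"}

    example_asset = {
        "id": "drs://example.org/9f2c1a7e",        # GUID (hypothetical resolver)
        "originating_institution": "Broad Institute",
        "assay_type": "transcriptome",
        "file_type": "fastq",
        "tissue_source": "liver",
        "species": "Homo sapiens",
    }

    def missing_descriptors(asset):
        """Return the required descriptors absent from an asset record."""
        return sorted(REQUIRED_FIELDS - asset.keys())

    print(missing_descriptors(example_asset))  # [] means the record is complete

A small, explicit set of required descriptors like this is what makes automated validation and cross-DCC indexing straightforward.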

Common Fund Data Asset Manifest

The ecosystem will support the concept of a "Manifest" that describes a collection of files. Manifests enable bundling lists of CFDE data assets into a machine-readable file using a common format. Manifests will also be used to publish the complete inventories of data from each DCC, enabling uniform collection of asset metadata and supporting indexing of the assets in the CFDE portal. Manifests are similar in function to a shopping list collected by a user on a commercial web site, and manifests for subsets of data located at multiple Common Fund DCCs will be used to transport files to analysis resources, such as analysis pipelines hosted at Terra or Cavatica. While a standard for manifests may not be adopted by the broader data resource community, the CFDE project represents an excellent opportunity to drive creation of a standard for all of the Common Fund DCCs, and we expect this approach to be compatible with other federated systems (e.g., GA4GH) as they emerge.
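In its simplest form, a manifest could be a named, dated list of asset records serialized to a common machine-readable format. The sketch below uses JSON purely for illustration and reuses the hypothetical asset fields from the earlier example; it is not the agreed manifest standard.

    import json
    from datetime import date

    # Hypothetical manifest layout: a named, dated list of asset records.
    manifest = {
        "manifest_name": "example-rnaseq-subset",
        "created": date.today().isoformat(),
        "assets": [
            {"id": "drs://example.org/9f2c1a7e", "file_type": "fastq"},
            {"id": "drs://example.org/4b81d903", "file_type": "counts"},
        ],
    }

    # Serialize to a machine-readable file that an analysis platform could ingest.
    with open("manifest.json", "w") as fh:
        json.dump(manifest, fh, indent=2)

The same structure could describe either a user's hand-picked subset or a DCC's complete published inventory.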

Common Fund Data Asset API

Another important element of the ecosystem will be the standardization and publication of an application programming interface (API) that can be used by data consumers to retrieve the inventories, the data asset specification, and additional metadata associated with the assets. This will allow consumers of these inventories to programmatically interrogate the federated system for information relevant to a consuming service.
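A consuming service might query such an API as sketched below. The endpoint URL, path, and query parameters are assumptions made for illustration only, since the CFDE API has not yet been finalized.

    import requests

    # Hypothetical endpoint and parameters; the real CFDE API is still being specified.
    BASE_URL = "https://api.cfde.example.org/v1"

    def list_assets(assay_type, file_type):
        """Retrieve inventory entries matching an assay type and file type."""
        response = requests.get(
            BASE_URL + "/assets",
            params={"assay_type": assay_type, "file_type": file_type},
        )
        response.raise_for_status()
        return response.json()

    # e.g., ask the federated inventories for RNA-Seq count files across DCCs:
    # assets = list_assets("transcriptome", "counts")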

The CFDE Portal

We will provide a portal enabling users as well as administrators to search all of the federated data assets at each Common Fund Program. The CFDE portal will increase a user's ability to find these important resources, as well as to mix and match sets of data from each site for use in subsequent analysis. We refer to lists of assets as manifests, which are similar in function to a shopping cart on a commercial web site. Generation of user-specified manifests will enable users to move information off the portal for use in the analysis tool of their choice. Other important functions will be developed in the portal over the next year. End users will be able to answer questions such as: "Where are all the RNA-Seq datasets associated with all Common Fund programs?" Similarly, administrators and Program Officers at the Common Fund will be able to go to a single website and view the growth of data from their program over time, review objective FAIR metrics for these assets, understand download statistics and geographic distribution, and view the degree of harmonization of these data in comparison to other sites.
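The sketch below illustrates the kind of workflow the portal is meant to support: filter a combined inventory for RNA-Seq assets and bundle the selection into a manifest for export. The inventory entries and program names shown are invented for illustration.

    # Hypothetical portal-style workflow: filter a combined inventory for
    # RNA-Seq assets and export the selection as a manifest.
    inventory = [
        {"id": "drs://example.org/9f2c1a7e", "assay_type": "transcriptome", "program": "Program A"},
        {"id": "drs://example.org/4b81d903", "assay_type": "metagenomic", "program": "Program B"},
    ]

    rna_seq = [asset for asset in inventory if asset["assay_type"] == "transcriptome"]
    cart = {"manifest_name": "all-rnaseq-across-programs", "assets": rna_seq}
    print(len(cart["assets"]), "RNA-Seq asset(s) selected")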

Functionality of the portal will be expanded to include additional usage information from each of the programs. For example, we plan to request and display portal metrics such as the number of users that register at each site and how often their data is downloaded or analyzed. Once this capability is established, an important outcome of the CFDE will be to give Common Fund leadership the new ability to objectively review the overall use of resources at each data center, and to easily compare that use across all other Common Fund data centers. We anticipate this type of information will assist in making better informed decisions about maintaining and prioritizing Common Fund datasets over time.

FAIR Metrics

Under the CFDE, each data center's inventory will be evaluated consistently using FAIRshake, and the Coordinating Center will work with the individual Common Fund programs to adjust FAIR measures to meet the needs of the Common Fund. This approach overcomes a major obstacle for the Programs, because the Programs cannot easily work with other groups on their own to align around a common set of metrics.

In a pilot study of data from seven CF DCCs, we employed FAIRshake (https://fairshake.cloud) to evaluate the FAIRness of digital objects including datasets, tools, and repositories. This required mapping metadata elements from each DCC to FAIR metrics according to a customized rubric created from the list of case studies developed by CFDE members. The conversion was essentially a manual process requiring customized scripts and a subjective selection of metadata elements that were mapped to pre-existing ontologies and controlled vocabularies. Data transformation was performed using the Frictionless Tabular Data Package, a simple electronic format used to describe a collection of data (https://frictionlessdata.io/data-packages).
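As a toy illustration of rubric-based scoring, the sketch below checks an asset's metadata against a few made-up metrics and reports the fraction satisfied. It mimics the spirit of a FAIRshake rubric but is not the rubric or scoring algorithm used in the pilot.

    # Toy rubric: each metric is a check applied to an asset's metadata.
    rubric = {
        "has_persistent_id": lambda asset: bool(asset.get("id")),
        "uses_controlled_assay_vocabulary": lambda asset: asset.get("assay_type") in {"transcriptome", "epigenome"},
        "declares_license": lambda asset: "license" in asset,
    }

    def fair_score(asset):
        """Fraction of rubric metrics satisfied by one asset's metadata (0.0 to 1.0)."""
        passed = sum(check(asset) for check in rubric.values())
        return passed / len(rubric)

    asset = {"id": "drs://example.org/9f2c1a7e", "assay_type": "transcriptome"}
    print(round(fair_score(asset), 2))  # 0.67 under this toy rubric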

Retrospective evaluation of FAIRness for Common Fund data is of limited value. What will be far more effective is for the CFDE community to create an agreed-upon set of metrics across all of the DCCs, and to use these metrics to drive up the overall level of FAIR data. The Coordinating Center's goal is to create consensus-based, machine-readable standards that will provide quantitative and verifiable FAIRness measures for all Common Fund data. We will achieve this by working with the DCCs to create inventories of DCC data, prioritizing which metadata elements should be evaluated, and applying openly accessible rubrics to data hosted at each site. Each participating site will be able to review its inventories prior to public release on the CFDE portal, and portal web pages that summarize the CFDE FAIR metrics will be published on an ongoing basis.

Develop CFDE Best Practices

To achieve its goals, the CFDE-CC will encourage the DCCs to adopt a series of best practices that operationalize FAIRness and promote interoperation between datasets. In the first year, the best practices will require implementation of the Common Fund data asset specification and the Common Fund data asset manifests at each DCC. Other best practices will be developed over time in close collaboration with the DCCs and disseminated to all groups. Future best practices will include recommendations for single sign-on, authorization methods, FISMA compliance, and other important implementation elements of the CFDE.

Policy Development And Technical Implementation

There are multiple examples of technical solutions that could favorably impact all members of the CFDE ecosystem but that also require administrative coordination. For example, Common Fund leadership has partnered with the STRIDES initiative, which provides lower-cost cloud services to NIH projects. The CFDE-CC will help ensure that the technical implementation of Common Fund Data Ecosystem resources is tailored so that each DCC site can take advantage of the STRIDES cost-reduction program. The Researcher Auth Service (RAS) is a service under development by the NIH's Center for Information Technology that will facilitate access to controlled data assets and repositories. Members of the CFDE have also partnered with Globus to enable researchers with an eRA Commons account to use their credentials to simplify access to controlled data assets. The CFDE-CC will continue to advance this initiative in the coming year and will provide guidance to the CF DCCs on making use of the RAS system.

Training Program Development

The CFDE Coordinating Center will also host a Training Coordination Center (TCC), staffed by experts in bioinformatics curriculum development, teaching, and community building. This center will provide support and resources for the development of DCC-specific training programs, as well as end-user training on CFDE products and general topics of interest to the Common Fund research community. DCCs will be able to request personalized assistance with all aspects of designing, piloting, and refining bioinformatics workshops or webinars. The TCC can also help with logistical support for hosting workshops, as well as provide guidance on how to grow and build a sustainable training program. As part of this effort, the TCC will provide instructor training for the DCCs and assist with creating useful qualitative and quantitative feedback and assessment tools. In addition to site-specific training, the TCC will also offer training on CFDE products as they become available and will pilot a ‘general bioinformatics’ workshop curriculum on topics of broad interest within the Common Fund.

DCC Cross-Pollination Events

Individual DCCs have significant expertise in complementary areas, and the CFDE-CC will facilitate conferences to bring DCC personnel together in person to discuss their technological challenges, approaches, and solutions. Annual conferences would serve as an avenue for building cross-DCC collaborations and discussions, and identifying complementary expertise and technologies across the DCCs.

Increased Software Reuse

There are community-based, open-source software and data projects where engagement could increase the software and data analysis capacity provided by Common Fund projects. Software projects include JupyterHub and BinderHub for data analysis, and the R and Python data science ecosystems; all of these are used by current CF programs. Working with these projects to facilitate broader use of the software within the CF, and broader distribution of the data outside the CF, is a straightforward opportunity where minimal investment could reap many benefits.

Platform For Innovation

The DCCs are brimming with ideas that could revolutionize the Common Fund. For instance, Metabolomics told us that we underestimate the breadth and diversity of CF data, and warned that we have become complacent about how exciting it would be to link datasets together to better understand biology. They have a collection of cross-program use cases that span the Common Fund and that could open new avenues of research. Similarly, SPARC suggested a more universal way to share across a social network that could be transformative. If implemented, the CFDE could serve as an incentive to increase cooperation by making it easier to share data with a colleague, making it easier to obtain analysis services, adding value to the data, and lowering barriers to collaboration. The CFDE-CC hopes to provide a forum for discussing, vetting, and securing resources for these game-changing ideas.