C2M2 Guidelines – Common Fund Data Ecosystem

Table of Contents:
Implementation Details
Metadata
Common Fund Data Asset Manifest
Staff Expectations
Staff Expertise

The following is information to consider in the construction of your Detailed Engagement Plan for the other transaction announcement OTA-20-005. Lora Kutkat is the best point of contact for budgetary questions, but we include here some recommendations for you to consider.

Travel. The CFDE-Coordinating Center will organize two CFDE cross-pollination meetings per year. Please reserve appropriate travel funds to participate in both meetings (typically $1,500-2,000 per person for a 2-day meeting). We recommend that a total of 5-8 personnel (including you) attend each meeting. This year's events will be held in Bethesda, Maryland on May 13-14, 2020 and September 9-10, 2020. For Detailed Plans involving collaborations between multiple DCCs, you may wish to include additional, justified funds for travel to visit the collaborating sites.

Cost-sharing. NIH has asked that Detailed Plans consider cost-sharing wherever possible in your budget. Examples could be direct cost-sharing of staffing or labor, or could be in-kind, such as hosting a training or other event at your site without cost to the award.

Computing and storage. Do NOT request equipment such as computers, laptops, or storage in your budget. Please DO include a written description of your data storage, compute, and data egress costs. These costs will be handled through the STRIDES program and should not be included as costs in your budget. Only cloud-based costs through the STRIDES program will be considered. Additional information regarding STRIDES will be supplied to you in the near future. Simon Twigger will be the point of contact for CFDE-STRIDES interactions.

Sub-awards. Awards for DCCs are being made separately. Do not include sub-awards to other DCCs or to the CFDE-Coordinating Center for collaborative work. It is okay to include sub-awards to contractors and other external parties for your project.

Project management. Be sure to describe your approach to project management in your proposal. Basic elements of your project management approach should include support for a project manager/point of contact, a project plan composed of deliverables with timeline, mechanisms for requirements management and other documentation, description of use cases, and some form of working group structure for interactions with other DCCs, training groups, and the CFDE-Coordinating Center (CFDE-CC).

Deliverables. Your project plans should be based on deliverables, which represent a sequence of activities that are tied to budgetary allocations over time. From our experience with the CFDE-CC award, NIH OT staff are flexible and understand that deliverables can change over the course of the year. Your proposal should define high-level objectives for the full 3 years, with an approximate timeline based on the order in which they need to be completed. Define the deliverables required to achieve each objective, with a timeline for those deliverables in 6-month increments. There is plenty of flexibility to refine your deliverables as your project progresses. NIH’s decisions about flexibility are based on factors such as what else depends on the deliverable and whether it loses value if it is late.

Use cases. Be sure to include paragraph-length descriptions of the use cases that will be achieved by your project. Use cases may include a description of the analyses, interpretations, and user engagements that will be enabled by this activity. The CFDE-CC will collate all use cases and use the information that you provide to coordinate your use cases with those developed by other teams.

Collaborative modes. We want to understand how the DCCs would like to work collaboratively and nimbly within the CFDE and with the CFDE-CC. For example, you should describe any working groups or other opportunities that you would like to engage in to support collaboration and information sharing with other DCCs and the CFDE-CC. Propose your preferred modes of information sharing and collaboration (e.g., conference calls, Slack, GitHub, conference affinity groups). We want to avoid burdensome organizational infrastructure. Please describe your preferred collaborative style; the CFDE-CC will collate these preferences into common approaches that can be shared across teams.

Common Fund Data Asset Specification. The CFDE will simplify discovery of data file assets hosted at the DCCs by creating a specification of a minimal set of descriptors for these assets (the Common Fund Data Asset Specification), and by electronically encoding asset inventories supplied by different DCCs into a common format. Several approaches have been taken over time to developing electronically encoded data asset inventories for biomedical resources; however, no single standard has been adopted by the Common Fund DCC community. The CFDE-CC has started to address this problem by developing a prototype data asset specification for review and adoption by all members of the CFDE consortium.

The types of files (e.g., genomic sequence, metagenomic, RNA-Seq, physiological and metabolic data) that can be referenced with the Common Fund Data Asset Specification will be flexible. Our current specification contains a small number of essential elements, such as a Global Unique IDentifier (GUID), Originating Institution (e.g., "Broad Institute"), Assay Type (e.g., "whole genome/exome", "transcriptome", "epigenome"), File Type (e.g., "fastq", "alignment", "vcf", "counts"), and Tissue Source (e.g., "liver") for the sample.

Implementation Details

To initiate generation of asset inventories from your site, your data managers will need to serialize data into a table in tab-separated value (TSV) format. After review of your assets, your data managers will use the Common Fund Data Asset Specification, which is based on a data model we refer to as the CrossCut Metadata Model (C2M2). The full C2M2 is fairly rich, and includes concepts such as subject and event. The Asset Specification is a subset of the full C2M2 designed to address key use cases around measuring both the quantity of data stored within the DCC and the scientific parameter space covered. An Asset Specification describes individual, discrete entities, namely files or objects in cloud storage. Lists of assets are then aggregated into Manifests. A Manifest could be a summary of all key assets within a DCC, a single dataset, or a data release. DCCs can provide information about all of their data in either a single Manifest or across several Manifests.
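
As a rough illustration of the serialization step, the following sketch writes asset records to a tab-separated table using Python's standard `csv` module. The column names are assumptions derived from the essential elements named earlier (GUID, Originating Institution, Assay Type, File Type, Tissue Source); the actual specification may name or order them differently, and the identifier here is a random placeholder rather than a registered GUID.

```python
import csv
import io
import uuid

# Hypothetical column names based on the essential elements described earlier;
# the real Asset Specification may use different names or ordering.
COLUMNS = [
    "guid",
    "originating_institution",
    "assay_type",
    "file_type",
    "tissue_source",
]


def write_asset_table(assets, handle):
    """Write a list of asset dicts to `handle` as a tab-separated table."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for asset in assets:
        writer.writerow(asset)


assets = [
    {
        "guid": str(uuid.uuid4()),  # placeholder, not a registered GUID
        "originating_institution": "Broad Institute",
        "assay_type": "transcriptome",
        "file_type": "fastq",
        "tissue_source": "liver",
    },
]

# Write to an in-memory buffer; pass an open file handle to write to disk.
buffer = io.StringIO()
write_asset_table(assets, buffer)
tsv_text = buffer.getvalue()
```

To write to disk instead, open the target with `open(path, "w", newline="")` and pass that handle in place of the buffer.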

The C2M2 is encoded using the frictionless.io JSON Schema specification. FrictionlessIO is a third-party organization that develops data schema standards. The toolchain for generating instances of C2M2 inventories will leverage the Big Data Bag (BDBag) exchange format, which provides a robust mechanism for exchanging large and complex data collections. A BDBag defines a dataset and its contents by enumerating the dataset’s elements, regardless of their location, and provides the checksum and metadata information required to verify and electronically interpret the data. All schema specifications, associated toolchains, schedules for loading data, planning documents, and RFCs are managed through GitHub repositories.
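
To give a sense of what a JSON-encoded table schema looks like, here is a minimal sketch of a descriptor written in the general style of a Frictionless table schema, together with a small standard-library check of rows against it. The field names and constraints are illustrative assumptions, not the published C2M2 schema, and the checker is a simplified stand-in for real schema-validation tooling.

```python
# Illustrative descriptor in the style of a Frictionless table schema.
# Field names and constraints are assumptions, not the published C2M2 schema.
ASSET_SCHEMA = {
    "fields": [
        {"name": "guid", "type": "string", "constraints": {"required": True}},
        {"name": "originating_institution", "type": "string"},
        {"name": "assay_type", "type": "string"},
        {"name": "file_type", "type": "string"},
        {"name": "tissue_source", "type": "string"},
    ],
    "primaryKey": "guid",
}


def check_rows(rows, schema):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    known = {field["name"] for field in schema["fields"]}
    required = {
        field["name"]
        for field in schema["fields"]
        if field.get("constraints", {}).get("required")
    }
    key = schema["primaryKey"]
    seen_keys = set()
    for number, row in enumerate(rows, start=1):
        # Columns that the schema does not declare.
        for name in sorted(set(row) - known):
            problems.append(f"row {number}: unknown column {name!r}")
        # Required fields that are absent or empty.
        for name in sorted(required):
            if not row.get(name):
                problems.append(f"row {number}: missing required value for {name!r}")
        # The primary key must be unique across rows.
        value = row.get(key)
        if value:
            if value in seen_keys:
                problems.append(f"row {number}: duplicate {key} {value!r}")
            seen_keys.add(value)
    return problems


rows = [
    {"guid": "id-1", "file_type": "fastq"},
    {"guid": "id-1", "file_type": "vcf"},
    {"guid": "", "file_type": "counts", "colour": "blue"},
]
problems = check_rows(rows, ASSET_SCHEMA)
```

A descriptor like this can be stored as a JSON file alongside the TSV tables so that the same check can be run by both the submitting DCC and the receiving party.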

Metadata

The C2M2 data asset specification will also enable retrieval of metadata associated with DCC assets (e.g., clinical study, project name). The kinds of metadata supplied by each DCC have yet to be determined and will remain largely a function of pragmatic, use-case-driven priorities to be established by the DCCs and CFDE-CC. Some biological metadata, such as species and anatomy, or clinical data, such as patient variables, are likely to be selected by the ecosystem consortium for prioritization using a documented Request for Comment (RFC) process. (We are working to document the CFDE RFC process; see draft here.)

Common Fund Data Asset Manifest

The ecosystem will support the concept of a “Data Asset Manifest” that describes a collection of files. Manifests bundle lists of CFDE data assets into a machine-readable file using a common format. Manifests will also be used to publish the complete inventories of data from each DCC, to enable uniform collection of asset metadata, and to support indexing of the assets in the CFDE portal. Manifests for subsets of data located at multiple Common Fund DCCs will be used to transport files to analysis resources, such as analysis pipelines hosted at Terra or Cavatica. While a standard for manifests may not be adopted by the broader data resource community, the CFDE project represents an excellent opportunity to drive creation of a standard for all of the Common Fund DCCs, and we expect this approach to be compatible with other federated systems (e.g., GA4GH) as they emerge.
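
Because a manifest is a machine-readable list of assets, an inventory summary can be tallied directly from it. The sketch below assumes a tab-separated manifest with hypothetical `assay_type` and `file_type` columns and counts assets along each dimension; it illustrates the idea only and is not the CFDE portal's indexing logic.

```python
import csv
import io
from collections import Counter


def summarize_manifest(handle):
    """Count assets in a tab-separated manifest, overall and per category."""
    reader = csv.DictReader(handle, delimiter="\t")
    by_assay_type = Counter()
    by_file_type = Counter()
    total = 0
    for row in reader:
        total += 1
        # Column names are assumptions made for this sketch.
        by_assay_type[row["assay_type"]] += 1
        by_file_type[row["file_type"]] += 1
    return {
        "total_assets": total,
        "by_assay_type": dict(by_assay_type),
        "by_file_type": dict(by_file_type),
    }


# A tiny in-memory manifest with made-up identifiers.
example_manifest = io.StringIO(
    "guid\tassay_type\tfile_type\n"
    "id-1\ttranscriptome\tfastq\n"
    "id-2\ttranscriptome\tcounts\n"
    "id-3\tepigenome\tfastq\n"
)
summary = summarize_manifest(example_manifest)
```

The same function can be applied to several manifests in turn, which matches the option of describing a DCC's data in either a single manifest or several.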

Staff Expectations

A DCC will need to provide staff to assist with the generation and refinement of the Common Fund Data Asset Specification and Common Fund Data Asset Manifests for their datasets. This work may require up to a total of 2 full-time Software or Bioinformatics Engineers or Bioinformatics Analysts. The staff resources required at a particular DCC will depend on the complexity of performing Extraction, Transformation, and Loading (ETL) on the DCC’s data, the degree to which the DCC team participates in data model development and assists in generating training materials, whether GUIDs need to be generated, and other factors. Analysts may be required for metadata refinement or review. The staff required is also likely to vary over time. DCC staff will be expected to participate in working groups led by the CFDE-CC tech team and to review or contribute to RFCs (see above) associated with the C2M2 data model and metadata harmonization approaches. All participants will be expected to attend CFDE face-to-face meetings.

In addition to technical personnel, supervision of those staff will be required, particularly in the first year for inventory submissions. Inventory submissions will need to be reviewed to ensure that data appearing at the CFDE portal are acceptable to the DCC program officer as well as to the DCC's research consortium. Modeled data and inventories will require review by an expert on your collections to ensure that concepts such as "project", "clinical study", and "data type" are correctly applied and to determine which metadata are appropriate for searching at the CFDE portal. These functions may require 5-20% of the effort of a PI or other senior personnel with deep knowledge of your datasets.

Staff Expertise

At a minimum, staff working hands-on with the creation of Asset Manifests should:

Understand their DCC data model (metadata schema)
Be familiar with tabular data (spreadsheets) and with creating text-based tables, including tab-separated value (TSV) files
Be able to query the DCC data asset inventories, whether through DCC databases or the existing DCC data storage
Understand checksums (e.g., MD5, SHA-256) and how to generate them
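
For reference, checksums of the kinds mentioned above can be generated with Python's standard `hashlib` module. This is a minimal sketch; the function name and chunk size are illustrative choices, and reading in chunks simply avoids loading large data files into memory at once.

```python
import hashlib


def file_checksums(path, chunk_size=1024 * 1024):
    """Return the MD5 and SHA-256 hex digests of the file at `path`."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}
```

Calling `file_checksums("sample.fastq")` (a hypothetical file name) returns a dictionary with one hex digest per algorithm, which can then be recorded alongside the asset in a manifest.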

Staff that are more actively participating in development of the CFDE recommendations and producing richer asset metadata will need:

Familiarity with the Extraction, Transformation, and Loading (ETL) of data elements into JSON-LD
Experience with GitHub
Proficiency in executing and understanding usage of programs written in Python
Familiarity with data model development and/or usage
Understanding of the relationship of DCC GUIDs to other persistent digital identifiers, and of criteria for data citation
Familiarity with controlled vocabularies, such as UBERON
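
To illustrate how several of these skills fit together, the sketch below transforms one tabular asset record into a JSON-LD document and replaces a free-text tissue label with an UBERON term. The `@vocab` URL, property names, and lookup table are hypothetical placeholders chosen for this example and are not part of the C2M2; only the UBERON identifier for liver and the OBO PURL prefix are taken from the UBERON ontology itself.

```python
import json

# Hypothetical lookup from free-text tissue labels to UBERON identifiers.
# UBERON:0002107 is the UBERON term for liver; extend the table as needed.
TISSUE_TO_UBERON = {"liver": "UBERON:0002107"}

# Hypothetical JSON-LD context. The @vocab URL is a placeholder, and the
# property names are assumptions made for this sketch.
CONTEXT = {
    "@vocab": "https://example.org/asset-sketch/",
    "UBERON": {"@id": "http://purl.obolibrary.org/obo/UBERON_", "@prefix": True},
    "tissue_source": {"@type": "@id"},
}


def row_to_jsonld(row):
    """Convert one asset record (a dict) into a JSON-LD document (a dict)."""
    document = {
        "@context": CONTEXT,
        "guid": row["guid"],
        "assay_type": row["assay_type"],
        "file_type": row["file_type"],
    }
    label = row["tissue_source"].strip().lower()
    term = TISSUE_TO_UBERON.get(label)
    if term is not None:
        # Known label: store the controlled-vocabulary term.
        document["tissue_source"] = term
    else:
        # Unknown label: keep the original text for later curation.
        document["tissue_source_label"] = row["tissue_source"]
    return document


record = {
    "guid": "id-1",  # made-up identifier
    "assay_type": "transcriptome",
    "file_type": "fastq",
    "tissue_source": "Liver",
}
document = row_to_jsonld(record)
serialized = json.dumps(document, indent=2)
```

Keeping unmapped labels in a separate field, rather than discarding them, makes it easier for an analyst to review which terms still need to be matched to a controlled vocabulary.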