Asset Manifest Specification
Technical Specification for CFDE Data Assets and Instructions for Preparing Data Asset Manifests
This document reviews the technical details of the Common Fund Data Asset Specification, explains how to build formal Data Asset Manifests describing collections of experimental data files, and describes how to prepare and submit these manifests to the CFDE database. To explain this process, we first review the Crosscut Metadata Model (C2M2), which is used to describe experimental resources. C2M2 is divided into numbered "levels" that reflect increasing complexity of data and metadata descriptions. C2M2 Level 0 is the subject of this document and is the level at which data assets and the data asset manifest are defined. It is the minimum information needed to describe a basic inventory of all of a DCC's digital files; higher C2M2 levels will support queries based on more complex experimental metadata via the CFDE portal.
The CFDE Crosscut Metadata Model (C2M2)
The Common Fund Data Ecosystem (CFDE) group is creating a new software system centered around the Crosscut Metadata Model (C2M2), a flexible technical standard for modeling biomedical experimental resources and data at any of several predefined levels of model complexity. This system is designed to support cross-dataset and cross-institute searches, custom aggregation of experimental data, and statistical analysis methods that depend on large-scale data for the biomedical research community.
Using the C2M2 system, Common Fund Data Coordinating Centers (DCCs) will be able to share structured information (metadata) about their experimental resources with the research community, widening and deepening access to usable observational data and accelerating discovery.
DCC Metadata Submissions
DCCs will collect and provide metadata to the CFDE describing experimental resources within their purview. Each metadata submission will take the form of a collection of tab-separated value (TSV) files. Precise formatting requirements for these TSV collections will be specified by JSON schema documents implementing the Data Package meta-specification published by the Frictionless Data group. These schemas will be used by the CFDE software infrastructure to automatically validate submission format compliance and metadata integrity during the metadata ingestion process.
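As a rough illustration of schema-driven validation, the sketch below checks a TSV against a trimmed descriptor laid out in the Data Package / Table Schema shape (`resources`, `schema`, `fields`). It uses only the Python standard library instead of the Frictionless tooling, and the descriptor contents, including the field names `example_id` and `example_size`, are hypothetical placeholders and not the CFDE schema.

```python
import csv
import io
import json

# Hypothetical, heavily trimmed descriptor in the Data Package shape.
# Resource and field names are illustrative assumptions only.
DESCRIPTOR = json.loads("""
{
  "name": "example-submission",
  "resources": [
    {
      "name": "file",
      "path": "file.tsv",
      "schema": {
        "fields": [
          {"name": "example_id", "type": "string",
           "constraints": {"required": true}},
          {"name": "example_size", "type": "integer"}
        ]
      }
    }
  ]
}
""")


def check_tsv(descriptor, resource_name, tsv_text):
    """Return a list of human-readable problems found in tsv_text."""
    resource = next(r for r in descriptor["resources"]
                    if r["name"] == resource_name)
    fields = resource["schema"]["fields"]
    expected = [f["name"] for f in fields]

    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    if header != expected:
        return [f"header {header} does not match schema fields {expected}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(expected):
            problems.append(
                f"line {line_no}: expected {len(expected)} columns, got {len(row)}")
            continue
        for field, value in zip(fields, row):
            required = field.get("constraints", {}).get("required", False)
            if required and value == "":
                problems.append(f"line {line_no}: {field['name']} is required")
            elif field["type"] == "integer" and value != "":
                try:
                    int(value)
                except ValueError:
                    problems.append(
                        f"line {line_no}: {field['name']} is not an integer")
    return problems
```

An empty list from `check_tsv(DESCRIPTOR, "file", text)` means the header and every row passed these checks. The actual CFDE ingestion pipeline may enforce more constraints than this sketch covers.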
The CFDE will offer the DCCs several alternative metadata submission formats, all of which will be automatically interoperable with the C2M2 system. These formats are arranged in levels of increasing complexity, reflecting anticipated differences in the richness of metadata that different DCCs can produce at any given time. The expectation is that metadata submitted and managed by a DCC will transition, over time, through increasingly rich C2M2 modeling levels as the DCC/CFDE technical interaction matures, enabling increasingly powerful downstream applications.
C2M2 Level 0: A Basic Metadata Manifest of Digital File Assets
This C2M2 Level 0 specification defines a minimal valid C2M2 instance. DCC metadata submissions at this level of model complexity will be the easiest to produce and will support the simplest available functionality implemented by downstream C2M2-driven applications.
Level 0 Submission Process Overview
Metadata submissions by the DCCs to the CFDE that are compliant with C2M2 Level 0 will consist of two TSV files:
file.tsv will be a manifest of digital file assets that a DCC wants to introduce into the C2M2 metadata ecosystem. The properties of the file entity in the C2M2 Level 0 model (see below for the model diagram and a list of property definitions) will serve as column headers for file.tsv, and each TSV row will represent a single file. DCCs will prepare file.tsv using data describing digital files within their management purview.
namespace.tsv will serve as a formal structural placeholder for a namespace identifier, which will be assigned to each DCC by the CFDE. The CFDE will create and furnish a namespace.tsv file for each DCC to include with Level 0 submissions.
C2M2 Level 0 encodes the most basic file metadata; its use by downstream applications will be limited to informing the least specific level of data accounting, querying, and reporting.
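A Level 0 submission is described above as exactly two TSV files, so a quick pre-submission check can confirm both are present. The following is a minimal sketch; the function name `missing_level0_files` and the idea of staging the submission in a single directory are assumptions made for illustration.

```python
from pathlib import Path

# The two TSV files named in the Level 0 submission overview.
REQUIRED = ("file.tsv", "namespace.tsv")


def missing_level0_files(submission_dir):
    """Return the required TSV names that are absent from submission_dir."""
    root = Path(submission_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]
```

An empty return value means both files were found; any other result lists what still needs to be added before packaging.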
|Level 0 Model Diagram|
Level 0 Technical Specification: Properties of the file entity
||String identifier assigned by the CFDE to the DCC managing this file.|
||Unrestricted-format string identifying this file.|
||The size of this file.|
||CFDE-preferred file checksum string: the output of the SHA-256 cryptographic hash function after being run on this file.|
||Permitted file checksum string: the output of the MD5 message-digest algorithm after being run as a cryptographic hash function on this file.|
||A persistent, resolvable URI generated by a DCC (e.g., using the CFDE minid server) and permanently attached to this file.|
||A filename with no prepended PATH information.|
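The properties described above can be derived mechanically for most files. The sketch below computes the size, both checksums, and the bare filename with the Python standard library, then writes the rows as a TSV. The column names in `COLUMNS` are placeholders assumed for illustration; the authoritative headers are whatever the Level 0 JSON schema specifies. The namespace value, local identifier, and persistent URI are supplied by the caller, since they come from the CFDE and the DCC and cannot be computed from the file.

```python
import csv
import hashlib
import os

# Placeholder column names (assumed for this sketch); take the real
# headers from the Level 0 JSON schema document.
COLUMNS = ["id_namespace", "local_id", "size_in_bytes",
           "sha256", "md5", "persistent_id", "filename"]


def describe_file(path, namespace, local_id, persistent_id=""):
    """Build one manifest row (as a dict) for the file at path."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    # Read in 1 MiB chunks so large data files are not loaded into memory.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return {
        "id_namespace": namespace,
        "local_id": local_id,
        "size_in_bytes": os.path.getsize(path),
        "sha256": sha256.hexdigest(),
        "md5": md5.hexdigest(),
        "persistent_id": persistent_id,
        # Strip any directory components: filename only, no path.
        "filename": os.path.basename(path),
    }


def write_manifest(rows, out_path):
    """Write rows to out_path as a tab-separated file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=COLUMNS,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
```

A typical use would be to call `describe_file` once per managed file and pass the collected rows to `write_manifest(rows, "file.tsv")`. Computing both digests in a single pass avoids reading each file twice.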
Level 0 Metadata Submission: frictionless.io datapackage.json Schema Specification
The JSON schema document formally specifying all data constraints on Level 0 TSVs is located here; an example Level-0-compliant TSV submission can be found here (just the file.tsv portion) and here (as a full BDBag archive).