Open Standards, Open Source Software, Open Collaboration

The Common Fund is developing the Common Fund Data Ecosystem (CFDE), in which Common Fund datasets will be stored in a cloud environment. This is necessary because the way that biomedical researchers interact with digital data is changing; large biomedical datasets can no longer be stored or analyzed using local computers and servers. The CFDE will allow larger amounts of data to be stored and will provide the framework for researchers to analyze data from different and diverse datasets simultaneously.

A key component of the CFDE is making sure that data are onboarded to the cloud environment in a consistent manner. Working with the STRIDES Initiative from the NIH Office of Data Science Strategy (ODSS), the CFDE will develop guidelines to ensure data are stored and organized optimally for proper data versioning and upkeep. Working with the STRIDES Initiative will also provide favorable pricing for cloud data storage and use of Common Fund datasets.

The CFDE will leverage deliverables and lessons learned from the New Models of Data Stewardship program to enhance the utility of Common Fund datasets. This includes a continued effort to improve the FAIRness (Findability, Accessibility, Interoperability, and Reusability) of Common Fund datasets. The Common Fund recognizes that each dataset is unique, and as such, a data management plan covering details such as metadata standards, data authentication and authorization, and data harmonization is being developed for each dataset incorporated into the CFDE.

During its initial development, the CFDE will work with four Common Fund datasets from the Gabriella Miller Kids First Pediatric Research (Kids First), Genotype Tissue Expression (GTEx), Library of Integrated Network-based Cellular Signatures (LINCS), and Human Microbiome Project (HMP) programs. Starting this effort with four unique and complex datasets will allow for a deeper understanding of the issues around using and integrating diverse data types, help identify the specific needs of individual programs, and support collaboration across programs to enhance data searching. As best practices emerge and new lessons are learned, they will be applied to additional datasets from Common Fund programs to make the migration of data to the cloud environment more efficient and to enhance the usability of the datasets.

The ultimate goal of the CFDE is to make the data more usable and useful, both within a single program’s datasets and across datasets from multiple programs. By connecting the data and making them more accessible, the CFDE is intended to enable novel scientific research that was not possible before, including hypothesis generation, discovery, and validation.