October Report – Appendix C – FAIR Assessment Plan for 2020

Written by Avi Ma’ayan, Daniel J.B. Clarke, and Sherry Jenkins

What are FAIR assessments, and why are they needed?

It is widely accepted that we need to do a better job of making Common Fund (CF) datasets more findable, accessible, interoperable, and reusable (FAIR)1. However, exactly how to achieve this goal is challenging. The FAIR principles provide a framework that covers most of the general considerations involved in FAIRifying CF digital products; in that sense, they serve as a checklist that ensures we “don’t forget anything”. One way to track compliance with FAIR is to perform FAIR assessments. A FAIR assessment measures the compliance of a digital resource with specific FAIR requirements, for example, whether a dataset can be accessed via a well-documented API, or whether the website hosting a dataset provides a license that states the terms under which the dataset can be used2. The process of FAIRification can then be coupled to an evaluation of FAIRness that measures whether the activities, services, and products generated by the CFDE project cover all of the FAIR requirements. In addition, FAIR assessments can inform the CF programs’ data coordination centers (DCCs), and the NIH, about existing gaps that need to be filled: gaps between the current state of the data, and other digital objects, on DCC portals, and the upgrades required to make these digital resources adhere to community standards that would render them FAIRer.
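To make the idea of a rubric-driven check concrete, below is a minimal, hypothetical sketch of two automated FAIR metric checks in Python. The metadata record and the metric functions are illustrative assumptions, not the actual CFDE rubric or FAIRshake implementation.

```python
# Hypothetical sketch of automated FAIR metric checks against a rubric.
# The metadata fields and metric definitions are illustrative only.

def has_license(metadata: dict) -> float:
    """Return 1.0 if the dataset metadata declares a license, else 0.0."""
    return 1.0 if metadata.get("license") else 0.0

def has_documented_api(metadata: dict) -> float:
    """Return 1.0 if the metadata points to API documentation, else 0.0."""
    return 1.0 if metadata.get("api_documentation_url") else 0.0

rubric = {
    "license declared": has_license,
    "documented API": has_documented_api,
}

example_dataset = {  # hypothetical metadata for one digital object
    "title": "Example CF dataset",
    "license": "CC-BY-4.0",
    "api_documentation_url": None,
}

scores = {name: metric(example_dataset) for name, metric in rubric.items()}
print(scores)  # {'license declared': 1.0, 'documented API': 0.0}
```

Scores of this kind, aggregated across many digital objects and metrics, are what the insignia visualizations summarize.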

What was achieved so far by the CFDE in regards to FAIRification and FAIR assessment?

  • We developed and published FAIRshake (https://fairshake.cloud)2, a system to manually and automatically assess the FAIRness of digital objects, including datasets, tools, and repositories.
  • FAIRshake provides FAIR assessments of datasets listed on 7 CF DCCs. These FAIR assessments are visualized as an insignia. FAIR analytics are automatically calculated for each CF DCC as well as for the collective of all CF programs bundled together.
  • The publication describing FAIRshake has been accepted at Cell Systems and is currently in press. An earlier version of the article is available on bioRxiv at: https://www.biorxiv.org/content/10.1101/657676v1
  • We developed scripts to convert metadata that describe CF datasets from 7 CF portals into two community-accepted schemas: DATS and Frictionless.
  • The conversion of the CF datasets into Frictionless enables the upload of these datasets into DERIVA3, the database engine behind the CFDE portal.
  • We assessed the FAIRness of datasets from 7 CF programs with a customized rubric that was created from the list of case studies developed by CFDE members.
  • We developed scripts to automatically assess the FAIRness of the 7 CF programs. This required mapping metadata elements from each DCC to FAIR metrics that belong to the customized rubric. During this process we manually linked some metadata elements to existing, agreed-upon ontologies and dictionaries.

What do we plan to achieve in year 2020?

  • Improve the FAIRness of CF digital resources by manually adding links to ontologies and dictionaries.
  • Streamline and harden data ingestion pipelines by documenting versions, creating a portal that enables non-experts to execute these scripts, and associating each step with a FAIR assessment.
  • Assess the FAIRness of tools and workflows by establishing a tools and workflows registry.
  • Display FAIR assessment insignias on the CFDE portal.
  • Harmonize datasets at the data level by developing and cataloging data processing pipelines. Convert datasets into dataframes that can be loaded into Python and R for further analyses including application of machine learning.
  • Develop data visualization components that are independent and can be used as plug-ins to enhance the user experience at the CFDE portal. Develop protocols to enable the community to develop and contribute such data visualization components to the CFDE.

Below we provide more details about each of these planned activities:

Plans to continue to improve the FAIRness of CF digital resources

Our experience converting DCC datasets into the DATS and Frictionless formats, and assessing these datasets for their FAIRness with specific use cases in mind, showed that much manual work remains to improve the FAIRness of these datasets. In particular, considerable effort is needed to map fields to ontologies and dictionaries and to harmonize metadata elements across programs. While this activity can be done by the DCCs after some training, we have the expertise to do much of it ourselves; in fact, during the conversion process we have already performed some initial manual mappings and harmonization. In 2020, we plan to continue to guide, as well as perform, additional FAIRification of CF resources.
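As an illustration of what this mapping work looks like, below is a minimal, hypothetical sketch of harmonizing a free-text metadata field against an ontology. The lookup table is an illustrative assumption; in practice the mappings are curated manually and/or resolved against ontology lookup services.

```python
# Hypothetical sketch of mapping DCC-specific free-text tissue labels to
# ontology identifiers (here, UBERON). The table below is illustrative only.
from typing import Optional

TISSUE_TO_UBERON = {
    "blood": "UBERON:0000178",
    "liver": "UBERON:0002107",
}

def harmonize_tissue(raw_value: str) -> Optional[str]:
    """Return the mapped ontology identifier, or None if unmapped (a FAIR gap)."""
    return TISSUE_TO_UBERON.get(raw_value.strip().lower())

records = [
    {"sample_id": "S1", "tissue": "Blood"},
    {"sample_id": "S2", "tissue": "Liver"},
    {"sample_id": "S3", "tissue": "plasma"},  # unmapped -> flagged for manual curation
]
for record in records:
    record["tissue_ontology_id"] = harmonize_tissue(record["tissue"])
print(records)
```

Records that cannot be mapped automatically are exactly the gaps that the manual FAIRification effort is meant to close.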

Plans to streamline and harden data ingestion pipelines

In 2019, we developed prototype scripts to convert DCC datasets into common, catalog-consumable schemas, i.e., DATS and Frictionless. This process is critical for the harmonization and presentation of the CF data and metadata on the CFDE portal, as well as for performing FAIR assessments. In 2020, we plan to automate and harden these data processing scripts. Specifically, we plan to automate the pipelines that convert metadata from DCCs->DATS->Frictionless->DERIVA (including provenance of all steps). These scripts will be tightly linked to dynamic FAIR assessments throughout the process; we expect the FAIRness score to change (increase or decrease) after each processing step. We will also document these data processing scripts and make them executable from a dashboard with button clicks. This will allow us to track the FAIRness of the DCC resources over time, in addition to having the capability to propagate changes to the CFDE portal.
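The sketch below illustrates the planned structure of such a pipeline, with a FAIR assessment recorded after each conversion step. All function names and bodies are placeholders for illustration; they are not the actual CFDE conversion scripts or the FAIRshake assessment code.

```python
# Hypothetical sketch of the ingestion pipeline DCC -> DATS -> Frictionless -> DERIVA,
# with a FAIR assessment and provenance record captured at every stage.

def extract_from_dcc(dcc_id: str) -> dict:
    """Placeholder: pull raw metadata from a DCC portal or dump."""
    return {"dcc": dcc_id, "records": []}

def to_dats(raw: dict) -> dict:
    """Placeholder: convert raw DCC metadata into the DATS schema."""
    return {"schema": "DATS", **raw}

def to_frictionless(dats: dict) -> dict:
    """Placeholder: convert DATS metadata into a Frictionless data package."""
    return {"schema": "frictionless", **dats}

def load_into_deriva(package: dict) -> None:
    """Placeholder: upload the Frictionless package into the DERIVA catalog."""
    pass

def assess(obj: dict, stage: str) -> dict:
    """Placeholder: score the object against the rubric and record provenance."""
    return {"stage": stage, "score": None}

def ingest(dcc_id: str) -> list:
    provenance = []
    raw = extract_from_dcc(dcc_id)
    provenance.append(assess(raw, "extracted"))
    dats = to_dats(raw)
    provenance.append(assess(dats, "DATS"))
    package = to_frictionless(dats)
    provenance.append(assess(package, "frictionless"))
    load_into_deriva(package)
    provenance.append(assess(package, "DERIVA"))
    return provenance  # FAIRness tracked at every step

print(ingest("ExampleDCC"))
```

Wiring each stage to an assessment call is what makes it possible to plot FAIRness over time and to see which processing step improved or degraded it.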

Plans to add FAIR assessment of tools and workflows

The initial assessment of the FAIRness of the CF resources in 2019 focused entirely on data and datasets. The FAIR assessment plan for 2020 includes the indexing of tools and workflows produced by the CF DCCs as well as by other related community efforts. This effort will produce a catalog of CF tools and workflows with FAIR assessments. Since the metadata for tools and workflows will be organized in a similar way to the metadata for the CF datasets, this catalog of tools and workflows will be made available for searching and browsing on the CFDE portal.
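Below is a minimal, hypothetical sketch of what a registry entry for a tool might contain. The field names are illustrative assumptions; the actual schema would follow the same community standards used for the dataset metadata.

```python
# Hypothetical registry entry for one tool; field names are illustrative only.
example_tool_entry = {
    "name": "ExampleAligner",              # hypothetical tool
    "type": "tool",                        # "tool" or "workflow"
    "producing_dcc": "Example CF program",
    "repository_url": "https://example.org/example-aligner",
    "documentation_url": "https://example.org/example-aligner/docs",
    "license": "MIT",
    "input_formats": ["FASTQ"],
    "output_formats": ["BAM"],
}
# Each entry would be assessed with the same rubric-driven approach used for
# datasets, and the resulting insignia displayed next to the entry in the catalog.
```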

Plans to integrate FAIR assessment visualization into the CFDE query portal

Once the DERIVA portal is working and available, we will enable the display of the FAIR insignia next to each digital object hosted on the site. We have already created all of the hooks needed to enable such visualizations, and we have performed initial FAIRness assessments of digital resources from 7 CF programs. Hence, this effort will primarily require coordination and a handshake between FAIRshake and the CFDE instance of DERIVA.
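A minimal sketch of that handshake, from the portal's side, is shown below. The endpoint path and response fields are assumptions for illustration only; the real integration would use the documented FAIRshake API.

```python
# Hypothetical sketch: the CFDE portal retrieves FAIR assessment scores from
# FAIRshake in order to render an insignia next to a digital object.
import requests

FAIRSHAKE_BASE = "https://fairshake.cloud"

def get_assessment_scores(digital_object_id: int) -> dict:
    """Fetch rubric scores for one digital object (assumed endpoint and fields)."""
    response = requests.get(
        f"{FAIRSHAKE_BASE}/v2/assessments",   # assumed endpoint path
        params={"target": digital_object_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# The returned scores would then be passed to the insignia renderer embedded in
# the DERIVA-based portal page for that digital object.
```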

Plans to harmonize datasets at the data level and prepare these datasets for machine learning and other complex data analysis and integration tasks

In 2019, effort went into processing and harmonizing the metadata that describes the CF datasets, but the data contained in those datasets was untouched. In 2020, we plan to begin the systematic processing and cataloging of the actual data by identifying data levels and processing scripts, and by developing harmonization strategies for abstracting the data to a level where it can be integrated. For example, a GWAS study that called variants can be integrated with RNA-seq data by converting each data type into gene sets. Metabolomics profiling can be compared to RNA-seq data by applying a model that converts metabolomics data to RNA-seq data and vice versa. In most cases, tools and workflows to perform such abstractions already exist, but they need to be better organized and tested. High-level processed data, provided with rich metadata, will be delivered in dataframe formats that can be consumed by data analytics platforms such as RStudio or Jupyter Notebooks (Python). These are commonly used data science platforms that would benefit from having easy access to CF datasets. Hence, we will develop R and Python libraries specifically for easy access to CF data and metadata; these libraries will access the data and metadata via a well-documented API. Once the data is ready for integrative analyses, we will prepare examples that show how machine learning algorithms can be applied to such data. For example, we plan to test whether we can predict metadata elements directly from the data. This particular example will also inform and accelerate the manual FAIRification efforts of CF datasets.
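The sketch below illustrates the kind of analysis that harmonized dataframes would enable: predicting a metadata element (here, a tissue label) directly from the data. The data is randomly generated for illustration; a real example would load a harmonized CF dataframe via the planned R/Python access libraries.

```python
# Illustrative sketch: can a metadata element be predicted from the data itself?
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic "expression" matrix: 200 samples x 50 genes, with two tissue labels
n_samples, n_genes = 200, 50
X = pd.DataFrame(rng.normal(size=(n_samples, n_genes)),
                 columns=[f"gene_{i}" for i in range(n_genes)])
tissue = rng.choice(["liver", "blood"], size=n_samples)
X.loc[tissue == "liver", "gene_0"] += 2.0  # inject a signal tied to the label

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, tissue, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

Where such a classifier performs well, its predictions can also be used to flag or pre-fill missing metadata values, which is how this example feeds back into FAIRification.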

Plans to develop data visualization components

Once the CFDE metadata and datasets are loaded into a catalog such as DERIVA, the data and metadata will be made available for query and download via a user interface. Since these data and metadata will be well structured, they can be systematically visualized. Such visualizations can provide summary statistics of what is in the catalog, as well as support exploration of the high-dimensional structure within the data, and they are expected to significantly enrich the user experience. Our group has extensive experience in developing such UI components. In addition, existing UI components developed by the DCCs, and by others, could be integrated into the user-facing catalog portal. Hence, we plan to develop new, and integrate existing, data visualization components for the CFDE portal. We will focus on data visualization components that are concerned with comparing gene sets and signatures, and with querying such sets and signatures against public databases. For example, we developed Clustergrammer4 (application: https://amp.pharm.mssm.edu/clustergrammer/; source code: https://github.com/MaayanLab/clustergrammer) to visualize heatmaps of any data matrix. We are currently developing ScatterBoard to visualize any dataset as scatter plots that place data objects in 2D or 3D based on their similarity. These components are developed with the React framework so they can be embedded in any website, Jupyter Notebook, or other web-based system that hosts biomedical datasets.
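To illustrate the similarity-based 2D placement that a component like ScatterBoard would render interactively, below is a minimal sketch that computes such a layout with PCA on a synthetic matrix and draws it with matplotlib. This is purely illustrative and is not the ScatterBoard implementation.

```python
# Illustrative sketch of a similarity-based 2D layout for a data matrix.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: two groups of samples described by 50 features each
group_a = rng.normal(loc=0.0, size=(100, 50))
group_b = rng.normal(loc=1.5, size=(100, 50))
X = np.vstack([group_a, group_b])
labels = np.array(["group A"] * 100 + ["group B"] * 100)

# Project samples into 2D so that similar samples land near each other
coords = PCA(n_components=2).fit_transform(X)

for group, color in [("group A", "tab:blue"), ("group B", "tab:orange")]:
    mask = labels == group
    plt.scatter(coords[mask, 0], coords[mask, 1], s=10, c=color, label=group)
plt.legend()
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.show()
```

The portal components would wrap this kind of layout in an interactive, embeddable React widget rather than a static plot.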

References

  1. Wilkinson et al. Sci Data. 2016 Mar 15;3:160018.
  2. Clarke et al. Cell Systems. 2019, in press.
  3. Bugacov et al. Proc IEEE Int Conf Escience. 2017 Oct;2017:79-88.
  4. Fernandez et al. Sci Data. 2017 Oct 10;4:170151.