October Report – Approaches to Operationalize the CFDE

Table of Contents:
Approach 1: Data Federation
Approach 2: CFDE Portal Implementation
Approach 3: Training
Approach 4: Addressing Data Incompatibility Concerns
Approach 5: Federating with Data Resources External to the Common Fund
Approach 6: Assessing the Optimal Balance of Cloud Versus On-Premises Computing and Storage
Approach 7: Changing Role of Site Visits

Approach 1: Data Federation

The Common Fund Data Ecosystem (CFDE) will be based on a collection of inventories derived from data that are being hosted on cloud-based systems by a number of DCCs. The inventories will describe all the assets at each Program, with this information available via a central catalog to enable discovery of the assets. The advantage of this approach is that formation of the ecosystem does not require the data assets themselves to be moved to a central repository; only the inventories describing those assets are centralized. Cataloging all of the Common Fund assets is a simple and effective means of liberating data from what would otherwise be several siloed repositories, and therefore greatly increases Findability, Accessibility, Interoperability and Reusability of all Common Fund data. This form of data federation can also be extended to programs funded by other institutes, and easily linked to other NIH ecosystems: once an inventory system is available, it can be used by anyone.
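As an illustration only, the federation model described above can be sketched in a few lines of Python. This is a minimal sketch, not the CFDE implementation: the record fields, class names, program names, and storage URIs are all hypothetical. It shows the central point that the catalog stores only descriptions of assets, while the assets themselves remain wherever each program hosts them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetRecord:
    """One inventory entry: metadata about an asset, not the asset itself."""
    program: str     # hypothetical field: the program hosting the asset
    asset_id: str    # hypothetical field: identifier local to that program
    data_type: str   # hypothetical field: e.g. "RNAseq"
    location: str    # hypothetical field: URI where the program keeps the data


class CentralCatalog:
    """Holds inventories only; the data stay at each program."""

    def __init__(self):
        self._records = []

    def register_inventory(self, records):
        # Each program submits its own inventory; nothing is copied or moved.
        self._records.extend(records)

    def find(self, data_type):
        # Case-insensitive match on data type across every registered program.
        wanted = data_type.lower()
        return [r for r in self._records if r.data_type.lower() == wanted]


catalog = CentralCatalog()
catalog.register_inventory([
    AssetRecord("ProgramA", "a-001", "RNAseq", "s3://example-program-a/a-001"),
    AssetRecord("ProgramA", "a-002", "WGS", "s3://example-program-a/a-002"),
])
catalog.register_inventory([
    AssetRecord("ProgramB", "b-17", "RNAseq", "gs://example-program-b/b-17"),
])

hits = catalog.find("rnaseq")
for record in hits:
    print(record.program, record.asset_id, record.location)
```

A single query against the catalog returns matching assets from both hypothetical programs, together with the locations where each program continues to host them.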

There are several very important outcomes for the CFDE federated approach.

The CFDE is future-proofing the Common Fund for interoperability. Interoperability between data silos continues to be of significant interest at the NIH, which recently held an "NIH Workshop on Cloud-Based Platforms Interoperability". The meeting had representation from four NIH ICs (Common Fund, NCI, NHGRI, NHLBI). The report from this meeting proposed four major thrust areas to improve interoperability between several groups, including use of the federated asset catalog approach developed by the CFDE tech team.

Common Fund's data portfolio is diverse, and these resources have significant value as individual assets. However, integration of these data with assets at DCCs from other ICs is of critical importance. For example, sequence-based technologies for variant detection, whole genome/exome analysis, single cells and human cell atlas, as well as epigenomic analysis will increasingly be used by future Common Fund programs, as well as programs at other ICs. HuBMAP also reported a need to integrate their data with sites such as the Human Cell Atlas, LungMAP, and Allen Brain Atlas. The CFDE work on interoperability will ensure Common Fund is able to significantly add value to data assets at each of its programs, and increase their ability to make use of data generated by other Institutes.

The CFDE is defining and measuring FAIR, to guide systematic improvement of Common Fund asset FAIRness. One of the CFDE’s missions is to guide improvement of Common Fund asset FAIRness by providing consistent definitions, metrics, and reports across the Common Fund. Under the CFDE, each data center’s inventory will be evaluated consistently based on FAIRshake, and we will work with the individual Common Fund programs to improve their FAIR measures as well as adjust FAIRshake to meet the needs of the Common Fund (see Appendix C).

By applying the same objective measurements to each Program, we will establish an even playing field across all of the sites. This will incentivise sites to improve individually and learn from each other, and at the same time will lead to a more specific, consistent, and sophisticated set of FAIRness metrics for the CFDE. More importantly, the improvements to each site and across the ecosystem will enhance user abilities to find and make use of Common Fund data.
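The idea of applying one set of measurements to every Program's inventory can be sketched as follows. This is an illustrative sketch under stated assumptions, not FAIRshake's actual rubric or API: the three checks, the record fields, and the program names are hypothetical placeholders chosen only to show that the same rubric is applied uniformly to each inventory.

```python
# Hypothetical rubric: each check returns True when a record passes.
# These are NOT FAIRshake's metrics; they stand in for any shared rubric.
RUBRIC = {
    "has_identifier": lambda rec: bool(rec.get("id")),
    "has_license": lambda rec: bool(rec.get("license")),
    "has_access_url": lambda rec: bool(rec.get("url")),
}


def score_inventory(records):
    """Return, per rubric check, the fraction of records that pass."""
    if not records:
        return {name: 0.0 for name in RUBRIC}
    return {
        name: sum(check(rec) for rec in records) / len(records)
        for name, check in RUBRIC.items()
    }


# Hypothetical inventories from two programs.
inventories = {
    "ProgramA": [
        {"id": "a-001", "license": "CC-BY-4.0", "url": "https://example.org/a-001"},
        {"id": "a-002", "license": "", "url": "https://example.org/a-002"},
    ],
    "ProgramB": [
        {"id": "b-17", "license": "CC0-1.0", "url": ""},
    ],
}

# The same rubric is applied to every program, so scores are comparable.
report = {program: score_inventory(recs) for program, recs in inventories.items()}
for program, scores in report.items():
    print(program, scores)
```

Because every inventory passes through the same function, differences in the resulting scores reflect differences in the inventories rather than differences in how each site chose to measure itself.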

This approach overcomes a major obstacle for the Programs, because the Programs cannot easily work with other groups to align around a common set of metrics. In this scenario the asset inventories generated by each DCC are created by adhering to a common standard coordinated by an external group (i.e., the CFDE tech team).

Approach 2: CFDE Portal Implementation

The CFDE tech team will provide a portal that will enable users to search all of the federated data assets at each Common Fund Program. The CFDE portal will increase a user's ability to find these important resources, as well as mix and match sets of data from each site to use in subsequent analysis. We refer to lists of assets as manifests, which are similar in function to a shopping cart on a commercial web site. Generation of user-specified manifests will enable users to move information off the portal for use in the analysis tool of their choice. For example: 1) users will be able to "send" search results to cloud-based workspace environments such as Terra, avoiding data egress charges; 2) manifests can be used in "notebook apps" such as Jupyter Notebooks. Notebook apps are documents that contain a combination of human-readable text and computer code. These systems are very powerful, in that they let users describe how analyses are performed, and make it easy for users to perform their own analyses; 3) assets can be combined into other downloadable objects that are easily incorporated into popular analysis tools running on the users’ own cloud instance or local compute.
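To make the shopping-cart analogy concrete, the following is a minimal sketch of what a manifest might look like in code. The class, its fields, the JSON layout, and the example URIs are all assumptions made for illustration; the actual CFDE manifest format is not specified here.

```python
import json


class Manifest:
    """A user-assembled list of asset references, like a shopping cart."""

    def __init__(self):
        self._items = {}

    def add(self, program, asset_id, location):
        # Key on (program, asset_id) so adding the same asset twice is a no-op.
        self._items[(program, asset_id)] = location

    def to_json(self):
        # Export references only; the data themselves are never embedded.
        entries = [
            {"program": p, "asset_id": a, "location": loc}
            for (p, a), loc in sorted(self._items.items())
        ]
        return json.dumps(entries, indent=2)


manifest = Manifest()
manifest.add("ProgramB", "b-17", "gs://example-program-b/b-17")
manifest.add("ProgramA", "a-001", "s3://example-program-a/a-001")
manifest.add("ProgramA", "a-001", "s3://example-program-a/a-001")  # duplicate

exported = manifest.to_json()

# A notebook or workspace would read the exported manifest back like this.
loaded = json.loads(exported)
print(len(loaded), "assets in manifest")
```

Because the manifest carries only references, it stays small and portable, and the receiving tool can decide whether to analyze the data in place or download them.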

The portal will provide several important and unprecedented functions. For example, for the first time end users will be able to answer the question: "where are all the RNAseq datasets associated with all Common Fund programs?". Similarly, Program Officers at Common Fund will be able to go to a single website and view the growth of data from their program over time, to review objective FAIR metrics for these assets, and view the degree of harmonization of these data in comparison to other sites. In the coming years, we intend to include additional usage information from each of the programs. For example, we plan to request and display portal metrics such as the number of users that register at each of their sites, and how often their data are downloaded or analyzed. Once this capability is established, an important outcome of the CFDE will be to give Common Fund leadership the new ability to objectively review the overall use of resources at each data center, and to easily perform that review in comparison to all other Common Fund data centers. We anticipate this type of information could assist with making better informed decisions with respect to maintaining and prioritizing which Common Fund datasets should be expanded over time.

Approach 3: Training

Our training program for the next year has three key efforts.

First, we will teach people how to use the portal. This will be important both for bringing users to the portal, and for observing what new functionality in the portal is needed. Over time, we also expect the training materials to serve as an entry point to the portal for external researchers seeking interesting and relevant data sets.

Second, we will run training to enable biomedical data scientists to find and analyze large amounts of data close to the data. Our initial training will focus on (1) using the Terra platform to access GTEx data and analyze RNAseq data using the GTEx pipelines, and (2) using the Cavatica platform to access Kids First data and analyze genomic and transcriptomic data. The curriculum and training programs will involve GTEx and Kids First staff. Both the Terra and Cavatica platforms run in the cloud and address the training needs around bringing analyses to the data.

Third, we will run training for clinicians on using the Kids First DRC portal to discover data sets and analyses. This training will focus on exploring the Kids First portal functionality and browsing pre-analyzed data for variants and expression information. We expect this training to both enable more clinicians to make use of the Kids First portal and also to help expand and refine the Kids First portal functionality.

Collectively, these training efforts will allow the CFDE to develop pilot materials that can be expanded further, create assessment instruments to evaluate current efforts and guide future efforts, and expand training functionality at the individual DCCs.

Approach 4: Addressing Data Incompatibility Concerns

The first challenge the Common Fund faces is in making Common Fund data from individual programs findable and accessible in practice. True interoperability within the ecosystem would mean that compatible data sets would not only be findable, accessible, and reusable, but that metadata and provenance would follow the data between platforms. In an idealized situation, compatible data sets would be presented contextually, and incompatible data sets would be flagged as incompatible.

Thus we would say that findability and accessibility of data are prerequisites for data reuse, while interoperability is contingent on the data sets being combined. A fully mature data ecosystem will include a large number of interoperable data sets to allow reuse across the ecosystem, while a nascent ecosystem like the CFDE may focus on findability and accessibility. This perspective provides us with progressive stepping stones to guide CFDE development.

We are building a portal to improve findability and access of Common Fund data, as preconditions for improving data reuse and interoperability across the Common Fund. This is because our deep dives suggest that while a specific, familiar data set is easy to search, it is effectively impossible to discover new data across the entire Common Fund. This approach enables talented and motivated users to find relevant data and directs them to the original data sources.

We will combat the inevitable increase in data misuse and inappropriate combinations of data sets with user training. The majority of our proposed training efforts with GTEx and Kids First focus on enabling sophisticated biomedical data scientists to use flexible cloud-based platforms to analyze data. This approach allows expert biologists and clinicians with hypothesis-driven questions to lead the scientific inquiry rather than having to delegate to a bioinformatician who is more technically savvy but may lack the same biological expertise. We will lower technical barriers such as difficult data access and lack of defined workflows, while allowing experts to bring their own biological expertise and questions to their data analysis. Our other training efforts for 2020 (Kids First clinician-centric training) focus on training highly motivated users in how to effectively search existing data analyses and results to answer specific questions: here the Kids First portal and data analyses have been provided by subject-matter experts at the Kids First DRC.

We will also pilot hackathons and data reuse postdocs. These activities engage talented and motivated users who are already close to the data, or who can take the time to develop deep expertise in specific data resources. We expect these activities to both expand and refine our understanding of which data sets are interoperable.

In sum, our 2020 activities acknowledge concerns around interoperability and expertise, and will expand responsible data reuse while increasing our set of available use cases.

Approach 5: Federating with Data Resources External to the Common Fund

Several other efforts are underway to establish cloud-based data platforms across NIH (e.g., NHGRI-AnVIL, NHLBI-STAGE, and NCI-TCGA), and attendees at the recent NIH Workshop on Cloud-Based Platforms Interoperability agreed that adopting approaches similar to what we are using for the CFDE could greatly assist in expanding capability for all users. Final resolution of standards for federation across all of NIH will take some time, but several steps can be taken to ensure the internal standard adopted by the CFDE is either compatible with (or serves as a prototype for) a system that could be used by many other external resources. The following steps will be taken to ensure federation compatibility with ecosystems external to the CFDE.

First, we will keep in touch with national and international interoperability efforts. At the national level, our primary effort will be to connect with the NIH Interoperability working group, which represents four ICs (NHGRI, NHLBI, NCI, and Common Fund) and in particular includes the NIH ODSS. At the international level, we will connect with GA4GH, which is the main standards body for genomics data.

Second, we will work with Common Fund programs to help them adopt and operationalize standards, and help channel feedback from individual programs about drawbacks and incompatibilities of emerging standards. This will help ensure that future standards are not incompatible with Common Fund program needs.

Third, we will work with users to identify challenges and opportunities with emerging standards. For example, if a standard provides users with opportunities to improve data discovery and reuse, we will provide training materials showcasing this. Conversely, if an emerging technical consensus blocks a specific use case, we will bring this use case back to the standards developers.

And fourth, we will work to expand the scope of existing interoperability efforts to automatically include Common Fund assets. For example, while our 2020 plans are focused on building a portal, the Common Fund asset inventory underlying the portal will be directly usable by other efforts and other portals.

Approach 6: Assessing the Optimal Balance of Cloud Versus On-Premises Computing and Storage

One role for the CFDE will be to facilitate a discussion of on-premises, cloud, and hybrid models to weigh their relative strengths and weaknesses. For example, the Human Microbiome Project was all local, LINCS is currently local but considering a transition to the cloud, and Kids First has been entirely cloud-based from the beginning. Although the current trend is towards cloud-based resources, there may be many trade-offs in moving everything to the cloud.

The cloud is now more of a business model than a different technical configuration. All the technical advantages once provided by cloud computing (e.g., VPNs, containers, workflows, serverless solutions) are now easy to implement in an on-premises solution. In general, smaller projects are likely to save money by hosting everything in the cloud. As projects grow, however, the cheaper option becomes an on-premises solution with no permanent cloud storage, using cloud computing only for bursts when additional compute capacity is needed. However, there are many other considerations, both for the day-to-day workings within a data center, and the implications for long-term sustainability. The CFDE will work to get perspectives from within the Programs as well as Common Fund leadership to provide guidance for new Programs deciding on the appropriate infrastructure for their project.
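The size-dependent trade-off described above can be illustrated with a deliberately simple cost model. This is a sketch under strong assumptions: every price below is a hypothetical placeholder rather than a real vendor or institutional figure, and the model ignores staffing, egress, burst compute, and sustainability, all of which matter in practice. It assumes cloud cost scales with data volume alone, while on-premises cost has a fixed monthly overhead plus a lower per-terabyte rate.

```python
# All figures are hypothetical placeholders, in arbitrary currency units.
CLOUD_RATE_PER_TB = 20.0      # assumed monthly cloud cost per TB stored
ONPREM_FIXED_MONTHLY = 5000.0  # assumed fixed monthly on-premises overhead
ONPREM_RATE_PER_TB = 10.0     # assumed monthly on-premises cost per TB


def cloud_cost(terabytes):
    """Monthly cost when everything is hosted in the cloud."""
    return CLOUD_RATE_PER_TB * terabytes


def onprem_cost(terabytes):
    """Monthly cost on premises: fixed overhead plus a lower per-TB rate."""
    return ONPREM_FIXED_MONTHLY + ONPREM_RATE_PER_TB * terabytes


def break_even_tb():
    """Data volume at which the two monthly costs are equal."""
    return ONPREM_FIXED_MONTHLY / (CLOUD_RATE_PER_TB - ONPREM_RATE_PER_TB)


def cheaper_option(terabytes):
    return "cloud" if cloud_cost(terabytes) < onprem_cost(terabytes) else "on-premises"


for size in (100, 1000):
    print(size, "TB:", cheaper_option(size))
print("break-even at", break_even_tb(), "TB")
```

Under these placeholder numbers the small project is cheaper in the cloud and the large one is cheaper on premises; with different assumed prices the break-even point moves, which is why the CFDE discussion with the Programs is needed before giving guidance.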

Approach 7: Changing Role of Site Visits

Connecting the Common Fund Programs into a thriving Data Ecosystem will require much more than merely solving technical challenges. The larger challenge of this project is social: the technical solutions rely on consensus building among the stakeholders within the CF, and such a consensus can only be reached by fostering a community where cross-program discussion and collaboration are incentivized and recognized by NIH leadership as important goals in and of themselves. The site visits and the establishment of long-term DCC engagement are both critical to consensus building.

As more Common Fund Programs engage with our reports and gain familiarity with the CFDE’s proposals, we expect that there will be both more excitement and more skepticism about our work to increase interoperability across the Common Fund. Our interactions with Common Fund Program PIs continue to reinforce the utility of our in-person meetings both to learn about the incredible work of each program and to build trusted relationships across programs. These relationships, which allow for both enthusiastic and dissenting opinions, are fundamental for creating a community where everyone knows their input is valued and is incentivised to work towards creating a thriving ecosystem. As our focus shifts from initial introductions to sustained engagement, we will work to create more spaces within the CFDE for Common Fund Program PIs to provide input about the direction and work of the CFDE.