Table of Contents:
Stanford Molecular Transducers of Physical Activity Consortium (MoTrPAC) Bioinformatics Center (BIC) Site Visit
Harmonization and Metadata
SSO (Single Sign-on)
Stanford Molecular Transducers of Physical Activity Consortium (MoTrPAC) Bioinformatics Center (BIC) Site Visit
Location: Stanford University, Falk Cardiovascular Research Center
Date: Tuesday, October 8 and Wednesday, October 9, 2019
Attendees: Representatives in attendance from the CFDE were Amanda Charbonneau (UCD), Nathan Gaddis (RTI), Titus Brown (UCD), Owen White (UMB) and Anup Mahurkar (UMB). The representatives from MoTrPAC BIC were Euan Ashley (Co-PI), Matt Wheeler (Co-PI), Ashley Xia (NIH Project Scientist), Steve Hershman (Director of mHealth), Malene Lindholm (Postdoc), Karen Dalton (Software Developer), Jimmy Zhen (Software Developer), Young Kim (Software Developer), Shruti Marwaha (Research Engineer), David Jimenez-Morales (Computational Biologist), David Amar (Biostatistician), Archana Raja (Computational Biologist), and Elizabeth Chen (Graduate Student, Biostatistics).
The CFDE engagement team met with representatives of the Molecular Transducers of Physical Activity Consortium (MoTrPAC) Bioinformatics Center (BIC) on Tuesday, October 8, 2019 and Wednesday, October 9, 2019 at the Falk Cardiovascular Research Center at Stanford University to discuss their work in support of the NIH Common Fund’s MoTrPAC program. During the meeting, we used the agenda at the end of this document as an informal guide for structuring the day.
The engagement team began by reviewing their goals for the meeting, which included learning about the structure and goals of the MoTrPAC BIC, including specifics about the data they host, as well as information about training, organization, and the overall set of priorities for their group. In turn, MoTrPAC BIC representatives provided us with a comprehensive overview of their work on the MoTrPAC program to date, including descriptions of the types of data generated by the consortium, the BIC pipelines for data QC, processing, and analysis, and a demonstration of the web portal for data distribution.
MoTrPAC is an NIH Common Fund program tasked with creating a “map” of the molecular changes that occur during and after exercise with the goal of understanding the mechanisms by which physical activity improves health and prevents disease. The $200 million project funds 3 preclinical animal study sites, 7 clinical centers, 7 chemical analysis sites, a consortium coordinating center, and the BIC, our hosts for this site visit. The consortium coordinating center provides overall management of consortium activities, including protocol development, establishment of standards for data collection, intra-consortium communications, and general administrative tasks. The preclinical animal study sites and clinical centers carry out acute exercise testing, exercise training programs, and collection of biospecimens and other physiological measurements in rats and humans, respectively. The chemical analysis sites then process the biospecimens and generate a variety of omics data, which are passed to the BIC for processing, QC, analysis, and distribution to the scientific community.
The human studies arm of MoTrPAC is a mechanistic randomized controlled trial consisting of 2280 adult participants (1980 sedentary, 300 highly active) and 300 pediatric participants. All adult participants undergo baseline acute exercise testing and biospecimen collection. For the acute exercise test, muscle and adipose biopsies and blood samples are taken at various time points surrounding a bout of acute exercise. A variety of physiological measurements are also taken, including basic history and physical exam, graded exercise test with 12-lead ECG, fasted blood screening, anthropometric measurements, bone density scans, and behavioral questionnaires. The 1980 sedentary participants are randomized to one of three intervention groups: endurance (n=840), resistance (n=840), or control (n=300). They undergo a 12-week supervised exercise program (endurance and resistance) or no intervention and are then subjected to a second round of acute exercise testing and biospecimen collection. A similar study design is being applied to the pediatric participants, but with a smaller number of participants and no resistance arm. In total, the trials will collect ~53,100 blood samples, ~19,500 muscle biopsies, and ~9,900 adipose biopsies. In addition to the core studies described above, a variety of ancillary studies will be carried out.
The rat arm of MoTrPAC consists of two phases, an acute exercise time course study and a training time course study. For the acute exercise time course, tissue samples are collected at seven time points following an acute bout of exercise. Including controls, tissues will be collected from 216 rats (12 male, 12 female per time point; 24 male, 24 female controls). For the training time course, tissue samples are taken after four different lengths of training (120 total rats: 12 male, 12 female per time point; 12 male, 12 female controls). For both phases of the rat studies, 21 tissues are collected from each rat, including blood, heart, lung, kidney, and brain.
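The rat cohort totals quoted above follow directly from the per-time-point group sizes. The short check below is only an illustrative sketch; the function name is ours and is not part of any MoTrPAC tooling.

```python
def cohort_size(time_points, males_per_point, females_per_point,
                control_males, control_females):
    """Total animals: exercised groups at every time point plus controls."""
    exercised = time_points * (males_per_point + females_per_point)
    controls = control_males + control_females
    return exercised + controls


# Acute exercise time course: seven time points, 24 male + 24 female controls.
acute_total = cohort_size(7, 12, 12, 24, 24)

# Training time course: four training lengths, 12 male + 12 female controls.
training_total = cohort_size(4, 12, 12, 12, 12)

print(acute_total, training_total)  # prints: 216 120
```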
A diversity of omics data are being generated from the collected human and rat tissue samples, including genomics (RNA-Seq, ATAC-seq, Methyl-cap, RRBS, and WGS), proteomics (global, phospho, acyl/acetyl, and redox), and metabolomics (targeted and untargeted reversed phase chromatography, positive and negative HILIC, and lipidomics) data. There is great variation among the many MoTrPAC sites in the technologies being used to generate the omics data, including heterogeneity within data types and some proprietary technologies.
The BIC is responsible for the daunting task of harmonizing the large quantity and diversity of data and metadata being generated by the consortium and performing meaningful integrative analyses across these omics data types. Although they are fairly early in the process, they have made significant progress towards these goals.
The MoTrPAC BIC is still in the startup phase of its lifecycle, approximately one year into its projected six-year funding period. The full BIC team has only been in place for a few months, and most team members have been in place for less than a year. They have developed data processing, QC, and analysis pipelines for some of the data types they will be receiving. In addition, they have established a public data portal website (https://motrpac-data.org/) and in October had their first external data release, which included raw data and counts from 6-month-old rats that had performed an acute bout of endurance exercise. Much of the BIC effort in their first year has focused on establishing relationships and interacting with researchers at the MoTrPAC research sites. They noted that this “social” aspect of the project has taken up more time than anticipated. The interactions with MoTrPAC researchers have largely been aimed at gaining a thorough understanding of the data that is being generated. There has also been a significant amount of back-and-forth communication regarding pipeline and analysis decisions, as well as data submission formats, standards, and timelines. Another area requiring an unexpectedly large amount of effort in the first year has been dealing with the heterogeneity of sites, technologies, and techniques within the consortium. In many cases, there are multiple ways a certain data type is captured across sites, resulting in different formats for the data files the BIC receives. Adapting pipelines to work with the varying inputs has added an extra layer of complexity to an already complex endeavor. Given the early stage of the BIC, their priority has understandably been ensuring that they deliver high quality data to the research community and that the pipelines and procedures they are developing are robust and appropriate for accomplishing the MoTrPAC goals.
The immensity of those tasks has so far prevented focus on issues such as internal and external user support, ensuring that all FAIR principles are addressed, inter-DCC collaborations, and long-term sustainability, particularly with respect to data egress costs. However, there is enthusiasm for tackling these other issues if/when resources are available to do so.
The flow of data within the BIC is illustrated in Figure 1. When datasets are received, they are first processed using standard pipelines specific to the data type. The data then undergo quality control checks and are integrated with existing datasets via multisite intra- and inter-omic analyses. The next step is to analyze the data both within the dataset, e.g., differential expression and enrichment analyses, and in the context of other datasets in integrative multi-omics analyses. Finally, both raw and analyzed data are shared with internal and external users.
The data submission process involves MoTrPAC researchers uploading datasets and associated metadata to a Google Cloud bucket. When a dataset is uploaded, BIC analysts receive a notification and the data undergo basic automated QC and completeness checks. BIC analysts then work with the data submitter to resolve any issues, fill in missing information, and get clarification where necessary. The data then enter the process described in Figure 1.
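The report does not describe how the automated QC and completeness checks are implemented. As a rough sketch of what such a check might look like, the example below compares a submission's metadata rows against a set of required fields and against the files actually uploaded. The field names, the `file_name` key, and the function itself are hypothetical and are not taken from the BIC's submission guidelines.

```python
# Hypothetical required metadata fields; the real list is set by the BIC.
REQUIRED_FIELDS = {"sample_id", "tissue_code", "site", "assay"}


def check_submission(manifest_rows, uploaded_files):
    """Return a list of human-readable problems found in one submission.

    manifest_rows:  list of dicts, one per sample; each names its data file
                    under the (hypothetical) key "file_name".
    uploaded_files: set of object names actually present in the bucket.
    """
    problems = []
    for i, row in enumerate(manifest_rows, start=1):
        # Empty strings count as missing, not just absent keys.
        missing = sorted(f for f in REQUIRED_FIELDS if not row.get(f))
        if missing:
            problems.append(f"row {i}: missing metadata {', '.join(missing)}")
        file_name = row.get("file_name")
        if not file_name:
            problems.append(f"row {i}: no data file listed")
        elif file_name not in uploaded_files:
            problems.append(f"row {i}: file not uploaded: {file_name}")
    return problems


rows = [
    {"sample_id": "S1", "tissue_code": "T01", "site": "A",
     "assay": "rna-seq", "file_name": "S1.fastq.gz"},
    {"sample_id": "S2", "tissue_code": "", "site": "A",
     "assay": "rna-seq", "file_name": "S2.fastq.gz"},
]
issues = check_submission(rows, {"S1.fastq.gz"})
for issue in issues:
    print(issue)
```

A report of this kind gives analysts a concrete list to take back to the data submitter when resolving issues.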
The primary endpoint for accessing the MoTrPAC data is the MoTrPAC Data Hub (https://motrpac-data.org/). The first data release occurred in October and consisted of raw data and counts from the rat arm of the study. Future releases will contain human data as well as analyzed data from both species. The data release buckets are static and contain data, metadata, and documents describing any processing done on the data. Advanced users can load datasets directly into their data analysis environments from the data release buckets.
A variety of tools for searching, analyzing, and visualizing the MoTrPAC data are under development or planned for the future. The first tool that will be released is a faceted search tool to assist users in identifying datasets of interest. A shopping cart feature for assembling data downloads is also in development. In addition to accessing data directly, users will be able to create customizable web-based data visualizations with tools provided on the site in the upcoming year.
The MoTrPAC BIC uses the Google Cloud Platform (GCP) to host their systems and data. The GCP services the BIC uses include Cloud Storage buckets for data storage, Google Compute Engine (GCE) for website VMs and services (including pipelines), Cloud DNS, and Google Container Registry. The choice of GCP was largely driven by the fact that Stanford has an existing Business Associate Agreement (BAA) with Google for PHI data, but not with some of the other major cloud providers (e.g., Amazon Web Services). The BIC has not worked with the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative in setting up their cloud services and has no immediate plans to do so.
The technology stack the BIC utilizes for their data pipelines, QC, and analysis includes the following components:
- Workflow Definition Language (WDL). Workflow language used to define data-processing pipelines.
- Cromwell Server. Workflow management system that can run WDL scripts and supports GCP and other cloud platforms. Exposes REST endpoints for managing workflows. The plan is to move to Caper (Cromwell Assisted Pipeline ExecutoR), a Python wrapper package for Cromwell.
- Docker. Provides containers configured with software and dependencies needed for running pipelines.
- GitHub. Version control and hosting for pipelines and notebooks.
- Jupyter Notebook (Python/R). Full documentation of methods in an executable notebook format.
- GCP. Hosting of data, infrastructure, services, and Cromwell server.
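To illustrate the REST endpoints mentioned for the Cromwell server, the sketch below polls Cromwell's workflow status endpoint until a run finishes. It is not the BIC's code: the server address, workflow ID, and polling intervals are placeholders, and only the `/api/workflows/v1/{id}/status` route and the terminal status names come from Cromwell itself.

```python
import json
import time
import urllib.request

# Cromwell workflow states after which no further change is expected.
TERMINAL_STATES = {"Succeeded", "Failed", "Aborted"}


def status_url(base_url, workflow_id):
    """URL of Cromwell's workflow status endpoint (API version v1)."""
    return f"{base_url.rstrip('/')}/api/workflows/v1/{workflow_id}/status"


def fetch_status(url):
    """GET the status endpoint and return the 'status' string."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)["status"]


def wait_for_workflow(base_url, workflow_id, fetch=fetch_status,
                      poll_seconds=30, max_polls=120):
    """Poll until the workflow reaches a terminal state or polls run out.

    `fetch` is injectable so the loop can be exercised without a server.
    Returns the last status seen.
    """
    url = status_url(base_url, workflow_id)
    status = None
    for attempt in range(max_polls):
        status = fetch(url)
        if status in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    return status
```

With a running server, a call might look like `wait_for_workflow("http://localhost:8000", workflow_id)`, where the address and ID are whatever the local deployment uses.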
For the MoTrPAC Data Hub, some of the technologies being used are as follows:
- Terraform. Tool for building, configuring and managing infrastructure required for the web portal.
- Auth0. Single sign-on (SSO) identity-as-a-service (IDaaS) solution that can be configured to allow login from multiple providers.
- Services. Various services are used for functionality of the web site. For example, a service was created by the MoTrPAC developers to generate temporary signed URLs for data downloads.
- APIs. Used to retrieve data from other data resources with public-facing APIs. For instance, the MyGeneInfo API provides gene details that are displayed on the site.
- GitHub. Version control and hosting for code.
- GCP. Web and data hosting.
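The idea behind the temporary signed download URLs mentioned above can be sketched with a generic HMAC scheme. This is deliberately not Google Cloud Storage's own signed URL mechanism, and it is not the service the MoTrPAC developers wrote; the secret key, host name, and parameter names are placeholders chosen for illustration.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET_KEY = b"replace-with-a-real-secret"  # placeholder, never hard-code


def _signature(path, expires):
    """HMAC-SHA256 over the path and expiry time, as a hex string."""
    message = f"{path}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(base_url, path, ttl_seconds, now=None):
    """Return a URL for `path` that is valid for `ttl_seconds`."""
    now = int(time.time()) if now is None else int(now)
    expires = now + ttl_seconds
    query = urlencode({"expires": expires,
                       "signature": _signature(path, expires)})
    return f"{base_url}{path}?{query}"


def verify_url(url, now=None):
    """True only if the signature matches and the URL has not expired."""
    now = int(time.time()) if now is None else int(now)
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    expected = _signature(parts.path, expires)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature.encode(), expected.encode())


url = sign_url("https://downloads.example.org", "/release/rat.tsv",
               ttl_seconds=600, now=1000)
print(verify_url(url, now=1200))   # within the 600-second window
print(verify_url(url, now=2000))   # after expiry
```

Because the expiry time is part of the signed message, a user cannot extend a link's lifetime by editing the query string.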
The BIC plans to use dbGaP to distribute human sequencing data, but has not yet initiated this process. Deposition to a variety of relevant external resources, such as Metabolomics Workbench, is planned.
The MoTrPAC BIC supports all levels of data analysis for the MoTrPAC consortium, including processing of raw data, first- and second-level QA/QC and normalization, multivariate analysis, differential expression analysis, and integration of multiple omics.
The BIC has devoted a great deal of effort in their first year towards developing standard data processing pipelines that are portable and ensure reproducibility. They noted that their goal is not necessarily to create pipelines that are compatible with those used by other resources, but rather to create pipelines that do what is required for the purposes of MoTrPAC. That being said, they have generally tried to utilize and adapt existing pipelines where possible.
The most mature pipelines at this point are those for RNA-Seq, reduced representation bisulfite sequencing (RRBS), and ATAC-seq. The pipelines start with raw FASTQ files and process all the way through QC and metrics reports. Originally, the BIC planned to use the ENCODE pipelines for all of these data types, but found that they did not fit the MoTrPAC needs for RNA-Seq and RRBS. For these two data types, they instead adapted Snakemake pipelines created by Mt. Sinai to the WDL/Cromwell framework. At the time of our visit, they had already processed 1280 samples with their RNA-Seq pipeline. For ATAC-seq, the ENCODE pipeline proved to be sufficient for their needs.
The proteomics pipelines pose a greater challenge than the sequencing pipelines due to proteomics being a less mature field in general. The BIC has decided to adapt Windows pipelines created by PNNL, a leader in the field of proteomics and MoTrPAC consortium member, to the WDL/Cromwell framework. These pipelines are under development.
The most challenging pipelines to develop are those for metabolomics. There is currently little consensus within the MoTrPAC consortium about what should be done with these data. Making things more complicated, there are 6 different sites generating the metabolomics data, all with different instruments and software. The BIC is actively working with the Common Fund Metabolomics Program as they develop their approach to the metabolomics data.
The BIC will eventually make all of their data processing and analysis pipelines public through GitHub. However, they are concerned that doing so might increase their support burden as more people use, and have questions about, the workflows. They do not have the personnel/budget to provide that kind of support. They cited ENCODE as an example of a provider of open source pipelines who gets pinged with a lot of questions. Providing this type of support is something the BIC would definitely like to see the CFDE tackle.
For data analysis, the BIC will perform a variety of analyses within and across datasets and make results available to the research community, but also eventually plans to provide tools with which users can perform their own analyses of the MoTrPAC data. The types of activities that the BIC intends to undertake include the following: QC, normalization, merging of datasets, comparison of data from different sites, differential expression analyses, enrichment analyses, interomic factorization analyses, and meta-analyses. Their activities will be aimed at answering the research questions of interest to MoTrPAC. As of now, no decisions have been made regarding what analysis platform and tools to provide users for their own investigations.
While the primary focus of the BIC is to cover the specific analysis needs of the MoTrPAC consortium, they are open to engaging in cross-DCC analyses provided that additional funding is provided to cover these efforts. In addition, they would like to see the CFDE identify cross-DCC collaboration opportunities and take the lead in organizing such efforts.
As discussed earlier, the BIC provides access to MoTrPAC data via the web-based MoTrPAC Data Hub. Their goal is to rapidly release raw and analyzed data to the greater research community. To download data, non-consortium users must create an account and sign a Data Use Agreement, which includes the following conditions:
- Users may not use the data for publications of any sort or publicly host or disseminate the data prior to the embargo deadline. For the first data release, the length of the embargo is 15 months from the date of external release.
- Users are permitted to use the data for analyses supporting grant submissions prior to the embargo deadline.
- Users must cite MoTrPAC when using the data in a publication.
- Users must notify MoTrPAC of publications that use the MoTrPAC data.
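The 15-month embargo in the first condition is simple calendar arithmetic. The sketch below shows one way to compute the deadline. The report gives only the month of the first release, so the day used here is a placeholder, and treating the deadline date itself as the first permitted day is our assumption rather than a term of the Data Use Agreement.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    If the target month is shorter, clamp to its last day
    (for example, 30 November plus 15 months gives the end of February).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def embargo_deadline(release_date, embargo_months=15):
    """Date on which the embargo for a release ends."""
    return add_months(release_date, embargo_months)


def may_publish(today, release_date, embargo_months=15):
    """Assumption: publication is allowed on or after the deadline date."""
    return today >= embargo_deadline(release_date, embargo_months)


# Placeholder release date; only "October" is stated in the report.
release = date(2019, 10, 15)
print(embargo_deadline(release))              # prints: 2021-01-15
print(may_publish(date(2020, 6, 1), release))  # prints: False
```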
Currently, the BIC is paying egress costs associated with data downloads out of their budget. Due to the potential high costs associated with downloads of the raw data, the links to download the raw data are “hidden” to discourage unnecessary downloads, i.e., users have to click an extra link to get to them. The BIC recognizes that this approach may not be sustainable long term and that some other mechanism may be required to cover the egress costs.
Given that the BIC is still in the ramp-up stage of their lifecycle, they have not made concrete plans for other mechanisms of data access, e.g., providing a workbench such as Cavatica or Terra where users can bring their own data and perform cross-study analyses with the MoTrPAC data.
At this point in their development, the BIC and the MoTrPAC consortium in general are focused on establishing standards and harmonizing data within the consortium and are not ready/do not have sufficient resources to address incorporating external standards or ontologies or mapping to other resources. Within MoTrPAC, IDs are assigned to all participants and biospecimens, and the provenance of all data is carefully tracked. They have developed standard protocols and forms for data collection across sites, e.g., ~80 forms for measurements in the human participants. They also use standard codes for tissues and other metadata. For data submissions, there are guidelines for what metadata to provide and a format for how the metadata should be supplied, but not metadata forms per se. The BIC is still in the process of surveying the data landscape and determining the specific metadata they will collect for their various data types and technologies. As mentioned earlier, the BIC is also developing standard pipelines for each data type to ensure uniform processing and enable integration across datasets. Overall, within MoTrPAC, the level of standardization is quite high. Their guiding principle in their approach to metadata and standardization is enabling meaningful analyses of their data.
One topic of conversation around harmonization was whether it is a reasonable goal to harmonize data processing pipelines across different DCCs to enable easy integration of data. The BIC put forward a number of arguments against such an approach (or at least major impediments):
- Different DCCs have different needs and data analysis goals, and pipelines must be tailored for the specific goals of each DCC.
- Pipelines are not static, and changes to “universal” pipelines would require reprocessing of all datasets at great cost.
- Pipelines don’t fully solve the problem of integrating across resources. Other obstacles, e.g., batch effects, are not accounted for by standard pipelines.
As an alternative, the BIC proposed having some tool or resource that compares analysis pipelines and provides an assessment of the compatibility of data emerging from different pipelines. They noted that knowing that pipelines are, e.g., 95% compatible is sufficient in many cases and a more attainable and cheaper goal than trying to apply a standard pipeline to every dataset across the Common Fund projects.
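The discussion did not define how pipeline compatibility would be measured. One possible reading, offered purely as an illustration, is the fraction of shared features (for example, genes) for which two pipelines produce values within a chosen relative tolerance when run on the same sample. The function name, the tolerance, and the example values below are all hypothetical.

```python
def concordance(results_a, results_b, rel_tol=0.05):
    """Fraction of shared features whose values agree within `rel_tol`.

    results_a, results_b: dicts mapping feature name -> numeric value,
    produced by two different pipelines from the same input sample.
    """
    shared = results_a.keys() & results_b.keys()
    if not shared:
        raise ValueError("no features in common")
    agree = 0
    for feature in shared:
        a, b = results_a[feature], results_b[feature]
        scale = max(abs(a), abs(b))
        # Two zeros agree; otherwise compare the difference to the larger value.
        if scale == 0 or abs(a - b) <= rel_tol * scale:
            agree += 1
    return agree / len(shared)


pipeline_a = {"g1": 100, "g2": 50, "g3": 0, "g4": 10}
pipeline_b = {"g1": 102, "g2": 80, "g3": 0, "g4": 10, "g5": 7}
score = concordance(pipeline_a, pipeline_b)
print(score)  # prints: 0.75
```

A real assessment would also need to account for effects that no per-feature comparison captures, such as the batch effects raised in the list above.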
The BIC is funded under the U24 Cooperative Agreement mechanism for a period of six years. Given that the MoTrPAC BIC is in a very early stage in its lifecycle, the long-term sustainability of its resources has not been a major focus. That being said, the use of GCP and primarily open source solutions means that the resource is not intrinsically tied to Stanford and could be transferred to and maintained by the NIH or a third party following the end of the performance period. Obviously, given the necessary funding, Stanford could also continue to maintain the resource at the end of the performance period.
In terms of short-term sustainability, the BIC noted that several aspects of the project have taken a greater amount of effort than anticipated, which has limited the amount of their budget they have available for other key areas. Specifically, the extensive heterogeneity of data types, platforms, and software used by the research centers has added significantly to the burden on the BIC. In addition, the “social” aspect of the project has been far greater than expected, i.e., establishing relationships and working with the data generators around pipeline development, metadata, data submissions, and other issues. They also forecast that other, unstarted, activities mandated by their award are also likely to take more resources than their award was written for. Directives such as ‘put your data in dbGaP’ seem simple on paper, but require a lot of hidden work. Aside from the work of gathering the data and preparing it to meet dbGaP requirements, it can require weeks or months of time setting up dbGaP contacts, learning the database, and other ‘soft’ work to get ready to begin preparing the data. As a result, less explicitly mandated areas such as internal and external user support and training have by necessity had lower priority. Devoted attention to complying with FAIR principles has also not been possible. In addition, as discussed earlier, the BIC has concerns about the sustainability of their model for dealing with egress costs. These are areas where the BIC would welcome help from the CFDE.
The BIC indicated that support/training within the MoTrPAC consortium is one of the areas where they could definitely use assistance. As noted above, interactions with the data generators have taken up much more time than they anticipated in the first year. There is a wide range of technical abilities among the data generators, including those who only use Microsoft Excel, and many need assistance with tasks such as format conversion and data submission. The extra time spent providing support to consortium data generators has interfered with the ability of the BIC to develop the technical solutions necessary for such a complex project and has made it extraordinarily strenuous to meet data release deadlines while maintaining high data standards.
Support/training for external users of the MoTrPAC data is another area where the BIC would like to have support from the CFDE, even voicing that they would be happy to have the CFDE fully take over this task. They expressed concern that they do not have the personnel or budget to handle what could potentially be a large volume of support requests from consumers of the data. One consequence of this concern is that the BIC is not planning to immediately make their data processing pipelines public because they know that they are not able to adequately support users outside the MoTrPAC consortium at this time.
The BIC supports the idea of establishing a set of best practices guidelines that outline common pitfalls and misinterpretations for different data types and sharing these expectations with users who wish to publish with the MoTrPAC data. However, they were not overly concerned about data consumers misusing the data, i.e., performing analyses or drawing conclusions based on the MoTrPAC data that aren’t valid or statistically sound. The BIC staff indicated that they view this issue as being largely out of their control and not worth devoting much effort to. They are much more interested in creating and providing quality datasets than policing their users’ science.
Complying with Findable-Accessible-Interoperable-Reusable (FAIR) principles is another task for which the MoTrPAC BIC would be interested in receiving support from the CFDE. They expressed uncertainty about what rubric(s) are used to judge if a DCC is “FAIR” in practice and were concerned that they will be judged on FAIRness without fully understanding the criteria they will be judged on. In addition, they were worried about the fairness of applying uniform FAIR expectations across DCCs given that individual DCCs may be limited by restrictions specific to their project. As with training and support, the BIC felt that they do not have sufficient resources to address the FAIR principles given the high burden associated with carrying out their core responsibilities. They would be happy to offload FAIR compliance to the CFDE.
The MoTrPAC BIC is interested in the concept of DCC cross-pollination, but does not feel that they are currently in a position to shepherd or drive such efforts. In addition, they do not think that their budget would accommodate such efforts and that supplemental funding would be needed. They also feel that they lack sufficient knowledge of the other DCCs to allow them to easily come up with collaborative cross-cutting projects. The BIC would like to see the CFDE take the lead in this area and identify potential areas of collaboration, as well as provide supplemental funding for such endeavors.
Of note, the BIC PIs, Euan Ashley and Matt Wheeler, are also PIs for one of the clinical sites of the Common Fund Undiagnosed Diseases Network (UDN), which could facilitate collaboration between the BIC and UDN DCCs.
The MoTrPAC BIC uses Auth0 as their SSO solution for accessing the data portal. Initially, they enabled login with Auth0 using Google credentials, but not eRA Commons or other credentials. However, they noted that they may have to switch to using Red Hat SSO (an enterprise version of Keycloak) due to a possible internal mandate from the Stanford Medicine Technology and Digital Solutions group. The BIC would prefer to stay with Auth0. At this point, login with Google credentials via Auth0 has been temporarily disabled until this issue is resolved.
Like many of the other Common Fund DCCs, the MoTrPAC BIC has opted for a cloud-based infrastructure and primarily open-source technology solutions. Consequently, their system is inherently fairly portable and reusable. Some aspects are specific to their cloud provider, GCP, but equivalents of the GCP services they use are provided by most major cloud providers.
A good illustrative example of the reusability of the infrastructure being developed by the BIC is their data processing pipelines. The pipeline scripts are written in WDL and run using the Cromwell workflow management system, both of which are open-source workflow solutions offered by the Broad Institute. Unlike many workflow management systems, Cromwell can be used with many of the most popular cloud providers. Environments (containers) for executing the pipelines are created using Docker, which means that they can be run on any local or cloud system that is running Docker. Given these components, it would be fairly simple for others to set up their own environment for running the data processing pipelines.
The MoTrPAC BIC identified a number of areas that have been challenging in the early stages or that they anticipate being challenging as they move forward. They also indicated several key areas where they think the CFDE could substantially contribute to their success. We engaged in a productive conversation about possible solutions for these challenges. Below is a summary of some of the challenges that were discussed. Many of these were described in more detail in earlier sections.
Startup. The BIC encountered a variety of unexpected challenges associated with the startup of their center that were time-consuming and prevented them from producing as polished a product as they desired for the first data release. These challenges included getting a handle on the heterogeneity of data, technologies and software across the MoTrPAC consortium and providing support to the data generators.
FAIR Compliance. As discussed earlier, the BIC identified concerns about how to ensure that their resource is FAIR compliant. On the most basic level, they are unsure about what FAIR means in practice and what criteria will be used to assess the FAIRness of their data. They also expressed worry about applying the same FAIR standards across all DCCs without taking into account the restrictions that individual DCCs might be limited by.
Interoperability. The BIC is very supportive of the idea of making their resources interoperable with other databases and resources. One of the main challenges identified around this issue is that there is currently not even a standard mechanism to find out what assets are maintained by each CF program.
Internal User Support & Training. The MoTrPAC data generators have a wide range of technological abilities, and some need help with the most basic of data-related tasks. Supporting these internal users has occupied more of the BIC’s time than anticipated.
External User Support & Training. The BIC is concerned that they lack sufficient personnel, resources and budget to adequately support and train external users of their data. As a result, they are hesitant to add to the support burden by, for example, making their data processing pipelines public.
Testing. There are several challenges the BIC faces with respect to testing. Given the complexity of the data, the support burden within the consortium, and difficulty getting data generators to comply with deadlines, they felt that their deadline for the first data release did not leave sufficient time for comprehensive testing. In addition, there are few researchers in the consortium who are qualified to assess the full catalog of MoTrPAC data, and they often do not have the time or incentive to assist in the testing process.
Harmonization. The BIC has some misgivings about the feasibility of standardizing data processing pipelines across resources. They are making use of existing standard pipelines where possible, but are finding that some modifications are necessary to suit the specific needs of MoTrPAC. Also, they have concerns about the implications of having a standard pipeline, for instance the cost of reprocessing all data when switching to a standard pipeline or when there are changes in the pipeline.
Infrastructure. The BIC indicated that they are hesitant to tie themselves to an analysis platform such as Terra or Cavatica because the world of cloud analysis tools is constantly shifting. As MoTrPAC is young and has few end users asking for analysis tools, the BIC is concerned about the potential risks of choosing a platform now. They do not want to tie themselves to an analysis platform that may not exist in a few years due to loss of support, for example, nor do they want to put a great deal of effort into engaging a platform now only to move in a couple of years to a new system that better suits their end users.
DCC Cross-pollination. The BIC is interested in cross-DCC projects, but they do not know much about the other Common Fund DCCs, and consequently identifying areas of synergy would be difficult and time-consuming. Nor do they have the budget or personnel to tackle this task.
Financial. In general, the BIC feels that there are a number of implied responsibilities and hidden costs associated with their role in MoTrPAC that are not accounted for in the budget because they were not made explicit. They gave several specific examples: the hidden time and effort required to get data into dbGaP; the harmonization of pipelines with other sites, which could entail the expensive task of reprocessing all the MoTrPAC data; and egress costs for data downloads, which the BIC is currently covering but which could become unmanageable depending on the level of interest in the MoTrPAC data.
We had a productive conversation about possible solutions to the challenges facing the BIC, and they were very forthcoming about the areas where they think the CFDE could most benefit their efforts. Below are some of the solutions discussed for the challenges described above.
Startup. There is a vast amount of experience among different institutions in setting up and managing DCCs, but there is not an easy way for new DCCs to access this base of knowledge. Even the process of establishing contact with the appropriate personnel at existing DCCs to ask questions can be daunting. We discussed the possibility of the CFDE establishing a knowledge base of accumulated DCC wisdom. Possible knowledge base content includes the following: expertise and contact information of personnel at existing or former DCCs; standard metadata fields for different data types; standards and ontologies; pipelines for data processing, QC and analysis; FAIR guidelines; and technological solutions for building biological databases.
FAIR Compliance. The BIC expressed interest in offloading the task of ensuring that their data is FAIR compliant. Part of the issue is a lack of understanding of what exactly being FAIR involves in practice. We discussed the idea of the CFDE establishing detailed guidance that defines more clearly what DCCs must do to satisfy the FAIR guidelines. Providing clear criteria will allow the DCCs to better plan and budget for this responsibility.
Interoperability. To address the lack of a standard mechanism for discovering the assets present at Common Fund DCCs, we discussed creating an easily queryable API that provides asset inventories in a standard format.
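As a rough illustration of what an asset inventory in a standard format could look like, the following minimal Python sketch defines one record layout and a simple query over it. No schema was agreed at the meeting: the field names, the DCC names, and the example records are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Asset:
    # All field names are illustrative assumptions, not an agreed CFDE schema.
    dcc: str
    asset_id: str
    data_type: str
    file_format: str
    access: str  # e.g. "open" or "controlled"


def inventory_json(assets):
    """Serialize an inventory to JSON in one shared, predictable layout."""
    return json.dumps({"assets": [asdict(a) for a in assets]}, sort_keys=True)


def query(assets, **criteria):
    """Return the assets whose fields equal every given criterion."""
    return [a for a in assets
            if all(getattr(a, k) == v for k, v in criteria.items())]


# Made-up example records, for illustration only.
INVENTORY = [
    Asset("DCC-A", "a-001", "RNA-Seq", "fastq", "controlled"),
    Asset("DCC-A", "a-002", "proteomics", "mzML", "open"),
    Asset("DCC-B", "b-001", "RNA-Seq", "bam", "open"),
]
```

For example, `query(INVENTORY, data_type="RNA-Seq", access="open")` returns only the matching record from the hypothetical DCC-B. A real service would presumably expose the same kind of filter over HTTP.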
Internal/External User Support & Training. The BIC would definitely like to see the CFDE take on a large part of the support and training burden for their internal and external users. Specific ideas included the following:
- Seminar providing an overview of Common Fund data resources
- Tutorials/webinars for a variety of tasks users might want to perform
- Centralized help desk
Testing. The CFDE could provide dedicated staff with the required expertise for testing releases. The BIC was very enthusiastic about this possibility.
Harmonization. The BIC expressed interest in a tool or resource that would provide an assessment of the compatibility of data produced by different processing pipelines. Alternatively/additionally, the BIC would like to see the creation of a set of “blessed” pipelines that would be used across resources. However, they worry about the cost of having to reprocess data using these “blessed” pipelines. The BIC also suggested that the NIH provide incentives to encourage compatibility of data from different resources, e.g., RFAs that promote integration of data from different sources.
DCC Cross-pollination. The BIC indicated that they would like to see the CFDE coordinate pilot projects between DCCs. The CFDE would identify opportunities for collaboration and cross-resource analyses and provide supplemental funding to the DCCs to carry out these projects.
Financial. There was general support for the CFDE providing supplemental funding to cover the costs of creating an integrated Common Fund data network. The budget for the BIC is largely committed to fulfilling their core responsibilities for the MoTrPAC consortium.
The BIC would be interested in participating in cross-DCC analyses and collaborations, provided that the CFDE identifies areas of synergy and specific ideas for projects and provides supplemental funding to cover the costs. One specific idea that was mentioned was performing a meta-analysis of RNA-Seq data from different resources.
We discussed several different ideas that could be game changers for the CFDE:
Common Fund Data Ecosystem Data Ingestion/Egress System (CoFundIES). Karen Dalton proposed that the CFDE develop a system through which all Common Fund data passes. Her description of the system is as follows:
Centralized Ingress/Egress location... similar to an Identity as a Service provider like Auth0 (which offloads the integration of over 40 different social and enterprise identity services on the developers’ behalf), CoFundIES provides a way for Common Fund Data providers to write to a standardized format and that data is distributed to appropriate third-party/Common Fund/NIH resources. Each DCC can select the output destination(s) as mandated by their purpose.
Each Common Fund Program no longer has to find and maintain customized connections to each data resource. Data comes to them. This allows tracking and location and findability of data (both its home DCC origin and where a subset of the information is sent in transit and will live [in perpetuity or simply for now, while the tertiary resource has funding]).
CoFundIES does not need to hold the data, except during the transfer/transformation process. This may be represented in a systems diagram, similar to the Pub/Sub system used in Apache Kafka.
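The publish/subscribe idea behind this proposal can be sketched in a few lines of Python. This is an in-memory stand-in written for illustration only: it does not use Apache Kafka or its API, and the `Hub` class, the topic names, and the record fields are all hypothetical. It shows a record being forwarded to the destinations a DCC selects, while the hub itself keeps only a log of where each record went.

```python
from collections import defaultdict


class Hub:
    """In-memory stand-in for a publish/subscribe broker (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> delivery callbacks
        self.delivery_log = []                 # (record_id, topic) pairs only

    def subscribe(self, topic, deliver):
        """Register a destination's callback for one topic."""
        self._subscribers[topic].append(deliver)

    def publish(self, record, topics):
        """Forward a record to every subscriber of the chosen topics.

        The hub keeps only a log of where each record went, not the record.
        """
        for topic in topics:
            for deliver in self._subscribers[topic]:
                deliver(record)
                self.delivery_log.append((record["id"], topic))


# Two hypothetical destinations, modelled as plain lists.
archive, portal = [], []
hub = Hub()
hub.subscribe("archive", archive.append)
hub.subscribe("portal", portal.append)

# A DCC selects its output destinations per record.
hub.publish({"id": "rec-1", "origin": "DCC-A"}, topics=["archive", "portal"])
hub.publish({"id": "rec-2", "origin": "DCC-A"}, topics=["portal"])
```

The delivery log is what would support the tracking and findability described above: it records the origin-to-destination path of each record without the hub storing the data.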
Tool/Technology Library. The CFDE could develop standard implementations of tools and technologies (e.g., using Docker) that new and existing DCCs could use to rapidly add new functionality without reinventing the wheel. Karen Dalton proposed the concept as “Common Tools for Common (Fund) Tasks”.
Harmonization Tool. The CFDE could develop a harmonization tool that performs some meaningful comparison of data processed using different pipelines and produces an assessment of their compatibility that includes a measure of statistical uncertainty.
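One possible form of such an assessment, sketched here under stated assumptions, is a correlation between the values two pipelines produce for the same features, reported with a bootstrap interval as the measure of uncertainty. The choice of Pearson correlation and a percentile bootstrap is ours, not something proposed at the meeting, and the input values are made up.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # no variation in one input, so correlation is undefined
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Clamp to [-1, 1] to absorb floating-point rounding.
    return max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))


def compatibility(a, b, n_boot=2000, seed=0):
    """Correlation between two pipelines' outputs with a 95% bootstrap interval."""
    rng = random.Random(seed)
    n = len(a)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        r = pearson([a[i] for i in idx], [b[i] for i in idx])
        if r is not None:
            estimates.append(r)
    estimates.sort()
    low = estimates[int(0.025 * len(estimates))]
    high = estimates[int(0.975 * len(estimates)) - 1]
    return pearson(a, b), (low, high)


# Made-up values for the same eight features from two hypothetical pipelines.
pipeline_a = [5.1, 3.2, 8.7, 1.0, 6.4, 2.2, 7.9, 4.5]
pipeline_b = [5.0, 3.5, 8.1, 1.4, 6.9, 2.0, 7.5, 4.9]

r, (low, high) = compatibility(pipeline_a, pipeline_b)
```

A real tool would need a comparison suited to each data type; the point of the sketch is only that the output pairs a compatibility score with an interval rather than reporting a single number.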
Short introductions from engagement team members and attending DCC members. The overarching goal for the engagement team is to collect values and process data about the DCC. Values data will include things like mission, vision, goals, stakeholders, and challenges. Process data includes data types and formats maintained, tools and resources owned by the DCC that they would like to see used more broadly, points of contact for follow-up on technical resources, etc.
9:30-10am DCC Overview
Short overview of DCC. Can be formal or informal. Suggested topics to cover: What is your vision for your organization? What big problems are you trying to solve? What are your big goals for the next year? Who do you see as your most important users/stakeholders? What project(s) is currently taking up the bulk of your effort/time? What areas of your organization are you putting the most resources into? What is the rough composition of your user base in terms of discipline? Do you have any challenges that are blocking implementation of your current goals?
10am-Noon Goals Assessment
An exercise to get an idea of what types of things are important, what types of things are challenges, what you dedicate your time/resources towards, and what types of things are not current priorities. Given a list of common goals provided by the engagement team, plus any additional goals the DCC would like to add, DCC members will rank goals both by timescale (“Solved/Finished”, “Current-Input wanted”, “Current-Handled”, “Future-planned”, “Future-unplanned”, “NA to our org”) and by desirability (“Critical”, “Nice to have”, “Neutral”, “Unnecessary”, “NA to our org”). The engagement team will work to understand the reasons for prioritization, but will not actively participate in making or guiding decisions.
- Increase end user engagement X% over Y years
- Move data to cloud
- Metadata harmonized within DCC
- Metadata harmonized with _________
- Metadata harmonized across Common Fund
- Implement new service/pipeline ____________
- Increase number of eyeballs at your site
- CF Data Portal
- Single Sign On
- Pre-filtered/harmonized data conglomerations
- A dashboard for monitoring data in cloud
- User-led training for end users (i.e. written tutorials)
- Webinars, MOOCs, or similar outreach/trainings for end users
- In-person, instructor led trainings for end users
- An NIH cloud playbook
- Full Stacks access
- Developing a data management plan
- Increased FAIRness
- Governance role in CFDE
Lunch: as a group or separately, whatever is convenient for MoTrPAC staff
1-2pm Open discussion (with breaks)
Using the results of the morning’s exercise and a collaborative format, iteratively discuss goals, blockers, etc., such that the DCC agrees that the engagement team can accurately describe their answers, motivations and goals.
- Do you intend to host data on a cloud service?
- Have you already started using cloud hosting? If yes:
  - Approximately how much of your data have you uploaded? How long did that take? How are you tracking progress?
  - What challenges have you faced?
  - How have you dealt with those challenges?
- What potential future problems with cloud hosting are you watching for?
- Does your org use eRA Commons IDs? Do the IDs meet your sign on needs?
  - If yes, did you have/are you having challenges implementing them?
  - If no, what do you use? What advantages does your system provide your org?
- What is the rough composition of your user base in terms of discipline?
- What, if any, use cases do you have documented? Undocumented?
- What things do people currently love to do with your data?
- What things would people love to do with your data, but currently can’t (or can’t easily)?
- What pipelines are best suited to your data types?
- What are the challenges associated with those desired uses?
- What other kinds of users would you want to attract to your data?
Review of metadata:
- What metadata is important for your org? For your users?
- Do all of your datasets have approximately the same metadata? Or do you have many levels of completeness?
- Do you have any data already linked to outside resources?
- Did you find the linking process easy? Challenging? Why?
- What kinds of datasets would you like to link into your collection?
- What implementation and schemas do you already have (or want)?
- What standards do you have (or want)?
- What automated systems do you currently have for obtaining metadata and raw data?
- What training resources do you already have?
- What training resources would you like to offer? On what timescale?
- What challenges keep you from offering the training you’d like?
- How do users currently obtain access to your data?
- What are your concerns about human data protection?
- What potential challenges do you see in bringing in new datasets?
NIH Cloud Guidebook:
- What would you like to see included?
- What would be better left to individual DCCs to decide?
- Would you be interested in contributing to it?
- Has your org done any self-assessments or outside assessments for FAIRness?
- Are there any aspects of FAIR that are particularly important for your org?
- Are there any aspects of FAIR that your org is not interested in?
- What potential challenges do you see in making your data more FAIR?
- What search terms would make your data stand out in a shared DC search engine?
- Does your org have any dream initiatives that could be realized with extra resources? What resources would you need?
- If you had free access to a Google Engineer for a month, what project would you give them?
- Any other topics/questions the DCC would like to cover
9-10am Review of goals and CFC involvement
A quick review of what topics are priorities for the DCC with suggestions from engagement team on how we can help.
10-noon Open Discussion, Thoroughness checking
DCC reflection on suggestions, open discussion to find shared solutions. Touch on any questions not covered previously, and ensure we have information on:
- data types they maintain
- formats, etc., of same
- tools / resources they think might be useful for the project
- points of contact: “Who is the best point of contact for your metadata schemas, your use cases, the survey of all your data types?”
- Who would like to be added to our governance mailing list?
Or contact info/instructions on how to get that information offline.