July Report – Recommendations for Resource Allocation and Investment by the Common Fund

Table of Contents:
Recommendation 1: Support the current DCCs with targeted investments
Recommendation 2: Support the current DCCs with cross-DCC investments
Recommendation 3: Support a shift in current CFDE activities to support transformative activities
Recommendation 4: Invest in new transformative activities by the current CFDE Team
Recommendation 5: Invest in long-term ecosystem support with targeted RFAs

We have identified potential for investment in five categories. Projected costs in these sections are based on reported costs from DCCs where possible. All other costs are estimated using figures from Appendix H. Salary estimates do not include fringe benefit costs, nor do any estimates include costs associated with institutional facilities and administrative (F&A) rates.

Recommendation 1: Support the current DCCs with targeted investments.

These recommendations focus on investments into individual DCCs that would leverage and expand their existing capabilities.

Data storage and computing. Costs associated with storage and computing will vary significantly for each project. The proportions of storage versus computing will also differ for each group because some centers require more or less computational analysis or quality control improvements to their data. For example, KF, GTEx and HMP currently have no NIH support for storage or computing, and LINCS will have no funding for storage or computing as of June 2020. It is impossible to know the exact costs; however, we can bound the range of costs for data centers that are similar in size to those we have interviewed. DCCs with data obligations similar to that of GTEx (~600 GB of data) could incur up to $250,000 per year in cloud-based storage and computing costs. Larger projects use considerably more resources and cost significantly more: Kids First reports that their current AWS storage bill is about $70,000 per month ($840,000 per year). Note that while GTEx, HMP and LINCS are end-of-lifecycle DCCs with relatively static yearly costs, younger DCCs will require increasing investment each year. Kids First projects their data growth at 200 TB per year, which suggests that their cloud storage costs will double in four to five years.
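To illustrate how storage costs compound under data growth, the following sketch projects yearly cloud storage spend assuming simple linear growth; the starting volume, growth rate, and per-TB price are illustrative placeholders rather than reported DCC figures.

```python
# Illustrative projection of yearly cloud storage costs under linear data growth.
# All numbers below (starting volume, growth rate, price per TB per month) are
# placeholder assumptions for illustration, not reported DCC figures.

def project_storage_costs(start_tb, growth_tb_per_year, usd_per_tb_month, years):
    """Return the estimated storage cost (USD/year) for each projected year."""
    costs = []
    volume_tb = start_tb
    for _ in range(years):
        volume_tb += growth_tb_per_year                  # data added during the year
        costs.append(volume_tb * usd_per_tb_month * 12)  # cost at year-end volume
    return costs

# Example: a DCC holding ~800 TB, growing 200 TB/year, at an assumed ~$85/TB/month.
for year, cost in enumerate(project_storage_costs(800, 200, 85, 5), start=1):
    print(f"Year {year}: ~${cost:,.0f}")
```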

Given their lack of funding, the HMP indicated that moving their data to a stable, professionally managed file system (such as Google or Amazon cloud) was a top priority. They also noted that updating their pipelines to use more modern technology will be key to providing quality datasets to users going forward. Addresses: Data maintenance and access.

Targeted training programs. Several DCCs would like to develop training materials to upskill their user community to make more and better use of their data. There are two types of costs for targeted training: creating and maintaining the materials, and running workshops.

Creating workshop materials is best done in pairs or small groups, so as to benefit from multiple experts. We estimate that creating an entirely new, two-day workshop would require the equivalent of about three months of full-time work for two people, or $30,000 - $57,500 for a Bioinformatics Analyst and Bioinformatics Engineer pair.

Running workshops requires supporting event hosting, paying instructor travel and accommodation costs, and covering the administrative costs associated with registration and event organization. If staff instructors or volunteer instructors are not available, costs could also include temporary instructor pay. Optionally, workshops can include sponsoring some or all of the travel and accommodations for learners. We estimate that a minimal workshop for ~30 learners, with volunteer instructors or onsite staff and no learner sponsorship, would cost $10,100 - $19,100 per workshop. We expect a fully sponsored workshop for 30 learners to range from $57,100 - $99,300 per workshop.

Webinars typically have approximately the same administrative overhead as a two-day workshop, but require less material generation and no travel. They primarily serve already-expert users, however, and in our experience are not effective at stimulating data reuse. The number and type of people required will vary greatly depending on the topic of the webinar; however, we estimate that three experts would each spend about 40 hours on preparation, with an additional 40 hours of administrative work, for a range of $5,000 - $9,000 per webinar.
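As a sanity check on the webinar figure above, the sketch below multiplies the stated person-hours by assumed low and high hourly rates; the rates are placeholders chosen to be consistent with the quoted ranges, not values taken directly from Appendix H.

```python
# Reconstructing the webinar cost range from person-hours and assumed hourly
# rates. The rates are illustrative placeholders, not Appendix H figures.
LOW_RATE_USD, HIGH_RATE_USD = 31, 56   # assumed cost per staff hour

def personnel_cost_range(person_hours):
    """Return the (low, high) personnel cost in USD for a given effort."""
    return person_hours * LOW_RATE_USD, person_hours * HIGH_RATE_USD

# Webinar: three experts at ~40 hours each, plus ~40 hours of administration.
webinar_hours = 3 * 40 + 40
low, high = personnel_cost_range(webinar_hours)
print(f"Webinar personnel cost: ${low:,} - ${high:,}")   # ~$4,960 - $8,960
```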

Kids First is interested in developing training for several different user communities, including training for clinicians in using their portal and training for biomedical data scientists who are new to working with clinical data. Additional funding for a training-focused hire (1 FTE) would help them develop this capacity and run events.

Likewise, GTEx would like to develop several sets of training materials, including:

  • A short intensive workshop on using the Broad Institute’s Terra compute analysis platform to analyze RNA-Seq data with the GTEx pipeline. This would enable users to compare their RNA-Seq data with GTEx results.
  • A webinar series demonstrating the correct way to use their upcoming eQTL analysis data release.

GTEx training could be addressed with a combination of funding a training-focused hire (1 FTE) and burst funding to support the creation of individual workshop materials. Addresses: User training needs.

End of lifecycle support. The HMP group has expended significant resources on metadata curation and on adding value to their data (such as generating assemblies and gene catalogs, and annotating reference bacterial genomes); however, the HMP DCC no longer receives support from the Common Fund. It is unclear what will happen to the HMP resources. This situation is an example of how the CFDE can serve as a steward for data and tool sustainability. The processing pipelines used by the HMP group are well documented, but they are based on software that should be updated. We propose to work with the HMP group to modernize their data processing systems, ensure they are documented, and reprocess the data prior to making it available through the CFDE. An estimated 2 FTEs at the HMP DCC would be required to migrate their data to a cloud-based platform using the Common Fund best practices, collect their documented standard operating procedures, transition their analysis pipelines to re-usable Docker containers, and transfer the capability of processing their data to an external team. Costs are estimated to be no more than $260,000 for this effort. Addresses: Ageing infrastructure, Asset specification incompatibilities, and Life stage challenges.

Recommendation 2: Support the current DCCs with cross-DCC investments.

These recommendations focus on investments that would benefit multiple DCCs. Costs for personnel, travel, and workshops in this section are based on the estimates outlined in Appendix H. Salary estimates do not include fringe benefit costs, nor do any estimates include costs associated with institutional facilities and administrative (F&A) rates.

Participation in CFDE best practices. We anticipate that each DCC hosting public data will require personnel to participate in CFDE Best Practices in the first year; these FTEs will assist with generation and refinement of the Common Fund Asset Specification, and participate in building the Common Fund Asset Manifests described below. These systems serve as the basis for inventorying the data assets at each DCC so they will be findable and interoperable. Best practices will also require that DCC staff aid in developing GUIDs, work on training materials, and attend CFDE face-to-face meetings. Oversight for these personnel will be performed in part by the CFDE tech team. This work will likely require two Software Engineers, Bioinformatics Engineers, or Bioinformatics Analysts, averaging no more than $260,000 per site per year. Addresses: Lack of common practices, Insufficient DCC engagement, and Obstacles to Cooperation.

DCC cross-pollination events. Individual DCCs have significant expertise in complementary areas, and we recommend that the Common Fund support a two-day conference to bring DCC personnel together in person to discuss their technological challenges, approaches, and solutions. Kids First and HMP were particularly interested in sharing technical solutions and connecting their teams with other DCCs, with the goal of de-siloing not only their data but their people. In fact, the HMP told us that they have been offering informal advice and mentoring to younger data centers, though not ones on the Common Fund's priority list. Annual conferences would serve as an avenue for building cross-DCC collaborations and discussions, and for identifying complementary expertise and technologies across the DCCs. The anticipated costs for a two-day conference for 30 people (20-25 attendees and 5-10 DCC staff) are between $49,000 and $84,000. Addresses: Expertise silos, Obstacles to cooperation, and Insufficient DCC engagement.

DCC-to-DCC joint exercise. We suggest that the Common Fund support joint exercises between DCCs to engage in development, analysis, or training that involves using datasets from each site. We recommend starting with GTEx and Kids First, because their data assets are complementary and both groups expressed enthusiasm for closer alignment during our interviews. A proposal for the GTEx/Kids First joint exercise can be found in Appendix I. Estimated costs for this exercise cover two FTEs at each site to work on data management and harmonization and to deploy analyses to the computational platforms. The exercise would also serve as a pilot project to demonstrate the power of data reuse and illuminate the challenges in increasing data interoperation. Additional time to support involvement from PIs will also be needed to assist with coordination and project development. Costs are estimated to total $897,600. Addresses: Expertise silos.

Infrastructure re-use. The DCCs have a significant amount of infrastructure capability and have implemented many analytical tools that could be leveraged for re-use by other DCCs. Unfortunately, the primary barrier to sharing such resources is resource limitations: the DCCs were not originally funded to make tools or infrastructure into exportable products. We recommend that additional funding in the form of small project grants be offered to incentivize the DCCs to share these tools and resources beyond their project. Awards are optional and should be allocated based on requests from each site. Activities are likely to require 1-2 Software Engineers, Bioinformatics Engineers, or Bioinformatics Analysts, averaging no more than $260,000 per site per year. Addresses: Expertise silos.

Recommendation 3: Support a shift in current CFDE activities to support transformative activities.

These recommendations focus on activities performed by the CFDE tech team and can be completed by leveraging our current set of deliverables, outlined in Appendix G. These efforts require no new funding and involve only relatively minor course corrections. These activities will need to continue after December 2019, and funds should continue to be reserved to support them at or above our current level.

Communication within the Common Fund program. A number of steps can be taken to accelerate operationalization of the CFDE. First, we strongly recommend distributing this report to the DCC principal investigators (PIs), and that Common Fund leadership meet with the PIs to review and discuss its contents. We recommend this be conducted in combination with the CFDE engagement team (to review the details of the project), the Common Fund Program Officers, and other members of Common Fund leadership. Addresses: Insufficient DCC engagement.

Engagement with the Common Fund. The CFDE group recommends meeting with senior leadership of the Common Fund for an extended (~3 hour) discussion in order to better understand their priorities, their goals for the coming year, and how the CFDE can assist with those goals. This will better enable us to create strategies for Data access barriers, Data maintenance and access, Ageing infrastructure, Expertise silos, and Obstacles to cooperation.

Common Fund Data Asset Specification. Each of the DCCs hosts files (e.g., genomic sequence, metagenomic, RNA-Seq, physiological, and metabolic data), and we can greatly simplify discovery of these assets by creating a specification for a set of descriptors for each of these files. The specification will contain a small number of elements, such as:

  • Global Unique IDentifier (GUID)
  • Originating institution (e.g., "Broad Institute")
  • File type (e.g. "RNA-Seq", "GWAS")
  • Tissue source and species name for the sample

The data asset specification will be encoded in a computer-readable format, and will enable us to use readily available internet technologies to get additional information for each asset, such as its metadata (e.g., patient variables, project name), and to resolve access details such as whether files are hosted in the cloud or on local servers. While many implementations for electronically encoded data assets of biomedical resources have been proposed in the literature, no single standard has been adopted by the Common Fund DCC community. However, there is a high likelihood of achieving adoption across several groups if a consensus-building process is carefully managed by the NIH and the CFDE team. Addresses: Asset specification incompatibilities.
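As a minimal sketch of what a machine-readable asset record conforming to such a specification might look like (field names and values here are hypothetical placeholders, not the finalized Common Fund Asset Specification):

```python
import json

# Hypothetical Common Fund data asset record; field names and values are
# illustrative placeholders, not the finalized Asset Specification.
asset = {
    "guid": "cfde:00000000-0000-0000-0000-000000000000",  # Global Unique IDentifier
    "originating_institution": "Broad Institute",
    "file_type": "RNA-Seq",
    "tissue_source": "liver",
    "species": "Homo sapiens",
}

# Serialize to a computer-readable format that web services can exchange.
print(json.dumps(asset, indent=2))
```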

Common Fund Data Asset Manifest. The ability to bundle a list of CFDE data assets into a machine-readable file will greatly facilitate finding datasets among DCCs and effectively transporting those datasets to resources such as cloud-based analysis tools. We refer to these lists of Common Fund assets as manifests. The DCCs will generate manifests for all of the data assets at their center, enabling both a comprehensive inventory of their files and the use of that information to find and access their data through a CFDE portal. Manifests are similar in function to a shopping list collected on a commercial web site, and manifests covering subsets of data located at multiple Common Fund DCCs will be used to transport files to analysis resources, such as analysis pipelines hosted at Terra or Cavatica. While a standard for manifests has not been developed or adopted by the broader community, the CFDE project represents an excellent opportunity to drive creation of a standard for all of the Common Fund DCCs. Addresses: Asset transport.
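The sketch below shows what a manifest bundling asset records from two DCCs might look like; the structure is hypothetical, since no manifest standard has yet been adopted.

```python
import json
from datetime import date

# Hypothetical Common Fund data asset manifest: a machine-readable bundle of
# asset records drawn from multiple DCCs. The structure is illustrative only.
manifest = {
    "manifest_id": "cfde-manifest-example-001",   # placeholder identifier
    "created": date.today().isoformat(),
    "assets": [
        {"guid": "cfde:aaaa-1111", "dcc": "GTEx", "file_type": "RNA-Seq"},
        {"guid": "cfde:bbbb-2222", "dcc": "Kids First", "file_type": "GWAS"},
    ],
}

# Write the manifest so it can be handed to an analysis platform or portal.
with open("manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```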

Use of FAIRness as an organizational tool. One incentive to motivate the DCCs to converge on inter-DCC compatible standards for representing their data will be to define how DCC assets are hosted on the internet (i.e., the CFDE Best Practices), and to verify that each DCC is participating in these best practices by measuring their compliance with specific, concrete FAIRness metrics. These activities will serve to break through the dilemma facing each DCC that wants to participate in the CFDE but will only do so if all of the other DCCs also participate in a common set of best practices. Addresses: Obstacles to Cooperation.
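As one possible sketch of how compliance could be measured, a simple script could score each DCC's asset records against a checklist of required specification fields; both the checklist and the scoring rule below are hypothetical, not agreed CFDE FAIRness metrics.

```python
# Hypothetical FAIRness check: score asset records against a checklist of
# required fields. The checklist and scoring rule are placeholders only.
REQUIRED_FIELDS = ("guid", "originating_institution", "file_type", "tissue_source")

def fairness_score(records):
    """Fraction of required fields populated across a DCC's asset records."""
    if not records:
        return 0.0
    filled = sum(
        sum(1 for field in REQUIRED_FIELDS if record.get(field))
        for record in records
    )
    return filled / (len(REQUIRED_FIELDS) * len(records))

records = [
    {"guid": "cfde:aaaa-1111", "originating_institution": "Broad Institute",
     "file_type": "RNA-Seq", "tissue_source": "liver"},
    {"guid": "cfde:bbbb-2222", "file_type": "GWAS"},   # two required fields missing
]
print(f"FAIRness score: {fairness_score(records):.2f}")   # 0.75
```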

CFDE Portal. Construction of our portal is underway. Demonstrations of the site will begin around October. Data collection is still preliminary, but depending on the success of the DCCs in generating manifests of their data assets, it is likely we will demonstrate query capability across most of the DCCs hosting data. Using assets combined between DCCs, as well as passing those assets to computational platforms, will be possible before the end of the year. Addresses the issues associated with finding Common Fund data assets that were described in Asset specification incompatibilities.

Computational analysis platforms. Our team will investigate potential computational platforms for performing analyses of CFDE data over the next few months. We will review the full list of platforms with the Common Fund for approval; sites we recommend for consideration include Terra, Cavatica, and other stack providers from the Data Commons Pilot Project (e.g., Seven Bridges and others). Coordination with these providers will involve testing utilization of the Common Fund Data Asset Manifest system and developing pipelines that can be deployed on their platforms. The analysis platforms will be used for the DCC joint exercises described in the Cost Estimation section, as well as for training. Addresses: Increased data analysis and problem complexity.

Continued engagement. One result of this engagement was to reinforce how essential a role site visits play in building an effective working relationship with DCC staff. These visits were very effective for establishing trust, understanding the goals of each DCC, exploring incentives for the DCCs to participate in the CFDE, and discovering important resources developed by the DCCs that could be utilized by other groups. We will continue these activities to gather additional information from the remaining DCCs. Sustained, high-level engagement of all DCCs may require an increase in post-December CFDE funding. Addresses: Insufficient DCC engagement.

User training. The realization that multiple DCCs were tremendously interested in training programs was an unexpected result of our close engagement with them. Since training is closely linked to actual data reuse, and the best (and perhaps only) way to evaluate actual interoperability is to demonstrate it, we see training as one key to operationalizing FAIRness. Training can also reduce the substantial user support burden common to all of the DCCs we interviewed.

There are a variety of training modalities already being used by the DCCs we interviewed. These include webinars on data analysis and portal use (LINCS and GTEx), MOOCs (LINCS), and in-person workshops (HMP). Despite intense interest in continuing these activities, relatively little training is currently happening, and most training materials are at least somewhat out of date. CFDE can play a catalytic role in data reuse by bringing expertise in designing user engagement and training to the DCCs, helping create training materials for new users, and supporting evaluation and assessment activities. CFDE can provide this support by sustaining existing training activities such as webinars, helping develop new materials to support training within and across DCCs, and working with DCCs to pilot new workshops for their user communities. CFDE also has significant expertise in evaluation and assessment of training activities, as well as in pedagogical techniques and strategies for user-focused training and engagement. Depending on the level of support DCCs require for development of training programs, this effort may require an increase in post-December CFDE funding.

Based on the results from our DCC engagement, CFDE plans to engage in the following pilot training efforts through the remainder of this year:

  • Kids First is intensely interested in developing a hands-on workshop series to train clinicians to use their portal. We will work with KF to outline, write, and pilot workshops focused on clinicians.
  • CFDE will work with GTEx to pilot a webinar series on eQTL use and interpretation. These webinars will be recorded and made available through the GTEx site for users.
  • We will also work with GTEx and the Terra platform to build and pilot a curriculum for a two-day hands-on data analysis workshop that teaches users how to use the GTEx analysis pipelines on the Terra platform for their own data, so that they can reuse GTEx data for their own analyses.
  • We will work with KF to build and pilot a similar two-day curriculum for the Cavatica data analysis platform.

The DCCs are strongly motivated to participate in these training activities because they see it as serving their user community and reducing their support burden. Incentives for user communities to attend these trainings will be integrated into the materials as we develop them. Addresses: User Training Needs, Support burden, Increased data analysis and problem complexity.

Recommendation 4: Invest in new transformative activities by the current CFDE Team

We encourage the Common Fund to consider reserving funds for the activities described in this section. These activities could be achieved through a combination of repurposing existing deliverables of the CFDE technical team and/or the addition of new funds. Cost estimates are not provided here but will be generated upon request from the NIH Common Fund.

User Helpbot: a first point of contact help desk for all CFDE users. Staff at the DCCs have generated a considerable amount of documentation for their users, but unfortunately users rarely read these materials prior to contacting the DCC for additional help. This observation led to the realization that creation of the CFDE is likely to increase the help desk burden of all DCCs. In addition to creating new, potentially confusing, ways for users to combine data, the CFDE will draw additional users to all sites, many of whom lack bioinformatics expertise. It would therefore have a significant positive impact to create a common front-end "helpbot" service available at all DCC web sites created for the CFDE. This service could potentially use AI approaches to first filter user questions, determine whether they can be answered from FAQ content, and handle general bioinformatics questions that are not related to any particular DCC. This "user helpbot" could significantly lower the support burden for each DCC and unify the bioinformatics educational information supplied to CFDE users. Addresses: Increased data analysis and problem complexity and Support burden.
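A minimal sketch of the first-pass FAQ filtering idea follows; the FAQ entries and keyword-overlap matching are placeholders, and a production helpbot would likely use more sophisticated natural-language matching before routing unanswered questions to DCC staff.

```python
# Minimal sketch of a "helpbot" that tries to answer a question from FAQ
# content before routing it to DCC support staff. FAQ entries and the
# keyword-overlap matching rule are illustrative placeholders.
FAQ = {
    ("access", "protected", "dbgap"): "See your DCC's documentation on protected data access.",
    ("rna-seq", "pipeline", "terra"): "GTEx RNA-Seq pipelines can be run on the Terra platform.",
}

def answer(question):
    words = set(question.lower().replace("?", "").split())
    best_response, best_hits = None, 0
    for keywords, response in FAQ.items():
        hits = len(words & set(keywords))
        if hits > best_hits:
            best_response, best_hits = response, hits
    return best_response or "Routing your question to DCC support staff."

print(answer("How do I access protected dbGaP data?"))
```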

A CFDE search plugin. It will be straightforward for the CFDE to create a web-based plugin, used by all of the DCCs, to assist with accessing data across sites. The advantage of creating a common plugin for all of the sites is that it would reduce the costs associated with multiple DCCs building interfaces that perform the same function, and would simplify the user's experience by providing a single search tool. Having a single development team build the plugin also means we could rapidly respond to changes in the underlying implementation of the minimal metadata standards and the CFDE manifest system. An interface component used on the websites of all the DCCs would provide a common "look and feel" across the CFDE. The search plugin addresses issues associated with finding Common Fund data assets described in Asset specification incompatibilities, and reduces the unnecessary re-development of similar technologies by the DCCs described in Expertise silos.

Improved sign-on and authorization. Among the most impactful and disruptive capabilities the CFDE could deliver would be to provide DCC users with a single sign-on step to access protected data across the Common Fund, and to enable researchers to query those data using a user-friendly interface. Appendix J reviews many important considerations relevant to protected data access; some are briefly discussed here. Authentication refers to an electronic means of verifying a user's identity; it is similar in function to using a passport when entering a country. Single sign-on (SSO) is similar to using the same passport to enter different countries. In the case of the CFDE, it would be desirable to have an authentication system whereby a user logs in with a single username and password and is granted access to data from multiple DCCs. Many authentication and SSO systems exist, but a single system that works across all of the CF DCCs has yet to be adopted, and it is not clear which system should be adapted for the CFDE portal. Whichever system is selected, it will also be valuable to use this common SSO and authorization system to allow users to pass data assets to computational analysis platforms such as Terra and Cavatica. The CFDE technical team can review the Kids First SSO and authorization strategy to see if it can be applied more broadly across the CFDE. Alternatively, members of the CFDE technical team have developed an effective production system that is used by a number of hospitals, research institutions, government agencies, and NIH centers; this system could be readily applied to the CFDE. Addresses: Data access barriers.

Reduce costs for FISMA compliance. The Federal Information Security Management Act (FISMA) defines a set of controls to protect computer information and operations from security breaches. Requirements include maintaining an inventory of IT systems, categorizing data systems by risk level, maintaining a security plan, utilizing security controls, conducting risk assessments, obtaining certification, and conducting continuous monitoring. These activities often cost institutions seeking FISMA compliance more than $100,000. However, the benefits of obtaining FISMA compliance are that each DCC would increase its ability to host protected human data, obtain trusted partner status for managing data (even just metadata) currently hosted at dbGaP, and share data with other FISMA-compliant DCCs. This would enable each DCC to provide a much wider range of important services to its users. The CFDE tech team is engaged in evaluation of FISMA compliance under our current set of deliverables; however, we recommend that the Common Fund examine more ambitious mechanisms to enable the CFDE tech team to help reduce the cost burden on DCCs seeking to become FISMA compliant. Addresses: Data access barriers.

Develop CFDE Best Practices. To achieve its goals, the CFDE will encourage the DCCs to adopt a series of best practices that operationalize FAIRness and promote interoperation between datasets. In the first year, the best practices will require implementation of the Common Fund data asset specification and the Common Fund data asset manifests at each DCC. Other best practices will be developed over time in close collaboration with the DCCs and disseminated to all groups. Future best practices will include recommendations for single sign-on, authorization methods, FISMA compliance, and other important implementation elements of the CFDE. Addresses: Lack of common practices and Obstacles to cooperation.

Lifecycle support. The CFDE has the potential to provide comprehensive sustainability for Common Fund data by addressing two needs: helping newly formed DCCs join the CFDE and participate in the FAIRness best practices the CFDE needs to thrive, and stewarding data as DCCs reach the end of their lifecycle. These best practices (e.g., the Common Fund data asset specification) will also preserve findability, accessibility, and reusability long after DCCs are decommissioned, provided that the information continues to be hosted on cloud systems. We will also produce software (such as the CFDE search plugin or helpbot) that will reduce the costs of individual DCCs implementing similar systems. In addition, the CFDE will develop a robust end-of-lifecycle program to manage continued stewardship of data as DCCs are deactivated. One example project for end-of-lifecycle support was listed in the 'Recommendation 1: Support the current DCCs with targeted investments' section. Similar projects should be considered in future years. Addresses: Life stage challenges.

Incentivize participation. Once the rules for engagement are clearly operationalized through CFDE Best Practices and enforced through open FAIRness evaluation, the Common Fund has several options to incentivize the DCCs to abide by CFDE Best Practices. This can be achieved by supplying resources to each center to assist in the development of improved data asset specifications, granting access to services (such as cloud-based workspaces) that rely on the use of CFDE Best Practices, and reducing storage costs associated with CFDE-compliant data. Compliance with CFDE Best Practices can also be linked to cost reductions associated with the STRIDES program. Offering RFAs to external groups that use CFDE Best Practices should also be considered, as well as providing short- to medium-term grants for DCC-to-DCC projects such as increasing the harmonization level of data between sites and demonstrating cross-dataset usability. Addresses: Obstacles to Cooperation.

Engagement to spread best practices. Ideally, development, maintenance, and dissemination of CFDE Best Practices will be done in close consultation with the DCCs, while still being managed by an independently operating engagement team to ensure that no single DCC dominates development of CFDE Best Practices. The CFDE engagement team currently serves this role, and can continue to regularly engage the DCCs, promote DCC-to-DCC interactions, develop documentation, participate in training, and perform joint exercises across the DCCs to test the effectiveness of interoperation. The engagement team will also disseminate all documented best practices to the DCCs over time. Addresses: Obstacles to Cooperation, Expertise silos, and Insufficient DCC engagement.

Recommendation 5: Invest in long-term ecosystem support with targeted RFAs.

CFDE will rapidly grow into a Common Fund resource that will accelerate new biomedical discoveries by providing a cloud-based platform where investigators can store, share, access and compute on biomedical resources. Several key funding opportunities that leverage CFDE should be considered.

Training opportunities. The training needs for the CFDE will grow in concert with the CFDE resources and usage, and training can also spur adoption of CFDE resources by new users. Training needs include basic training for cloud computing, technical execution of workflows on the Terra and Cavatica platforms, building bioinformatics competency among users of the Common Fund resources, and clinician-focused training programs that integrate with specific CF resources. We recommend that the Common Fund consider issuing RFAs to develop open educational resources and curricula and deliver workshops to target user populations. We also suggest that the Common Fund invest in a “data use” postdoc program that would support deep cross-DCC data use by biomedical researchers funded specifically for this purpose. Depending on the size and scope of the RFAs, a training coordination center may also be important for connecting and facilitating training activities. Addresses: Increased data analysis and problem complexity, User Training needs, and Support Burden.

CFDE coordination and sharing of best practices. The CFDE will need a coordination center to maintain institutional memory across multiple DCC lifecycles. This coordination center would maintain engagement activities, coordinate and iterate on standards and implementations, maintain beginning- and end-of-lifecycle documentation, coordinate events, and track common practice as it emerges. Addresses: Obstacles to Cooperation, Insufficient DCC engagement, Life stage challenges, Lack of common practices.

Analytical development. The CFDE will adopt common systems to find biomedical resources (such as datasets from multiple cohorts) and then easily transfer lists of those datasets to other resources. We recommend the Common Fund consider issuing RFAs that would incentivize analytical tool developers to create and adapt analytical tools to take advantage of the CFDE environment, and to link these tools to cloud-based computing environments. Addresses: Increased data analysis and problem complexity.

Cloud workspace support. Several groups are well-positioned to offer computational workspaces that enable CFDE users to perform analysis. These infrastructures could be rapidly adapted to CFDE data standards and specially tailored to support the types of data hosted by the Common Fund DCCs. The cloud workspace providers could develop novel query tools and computational pipelines, and create social-network-based sharing systems to support consortia. Addresses: Increased data analysis and problem complexity and Data access barriers.