December Report – Challenges Faced by DCCs

Table of Contents:
Lifecycle Stage
FAIR Standards
Sustainability
User Training Needs
Findability And Interoperability
Human Subject Data
Engagement / Mobilization

Lifecycle Stage

DCCs face a common set of lifecycle challenges: as they initially ramp up, as they expand to meet the needs of their research consortia, and as they approach the end of their funding.

Early lifecycle issues revolve around the number of important decisions that must be made in a short time frame, often with little guidance or institutional knowledge. Beginning a project typically requires hiring qualified experts as quickly as possible. Choosing infrastructure technology and establishing protected data policies within the consortium and with NIH requires a great deal of subject matter expertise and forward planning. However, at most institutions, awardees must wait until their awards are processed to begin advertising open positions, taking on infrastructure contracts, or devoting their own time and effort to the project. This means that most DCCs have a 6-12 month lag between their award start and having a full, onboarded workforce ready to begin the ‘real’ work. Since award money is typically split evenly over the grant period, most DCCs reported having far more money than they could spend in the first year. Unfortunately, this money is often lost, as DCCs are not typically allowed to carry over funds.

Middle-aged DCCs face both social and technical challenges. On the technical side, DCCs must create, test, and iterate on infrastructure such as data submission, validation, and processing pipelines. Socially, CF Programs often operate without any strong community: data generators, DCCs, tool creators, and others in the program all have separate grants, and little or no incentive to work together. DCCs often find that although their mandate is to collate and curate data, they instead spend the vast majority of their time trying to persuade data generators to regularly submit data and metadata, which alters both the type of data that the DCCs end up with and the rate at which they receive it. Many DCCs also reported that they were unexpectedly mandated to collaborate with other programs, and thus had to meet new technical challenges in their middle years. By this stage, the personnel and infrastructure actually needed to run the DCC may bear only a passing resemblance to what was originally proposed and planned for. These lifecycle challenges can result in slow ramp-up, suboptimal infrastructure choices, delays in getting robust data pipelines in place, fragile software, and lost opportunities for data reuse.

Mature DCCs have considerable expertise, infrastructure, and other important assets that could potentially be used by other sites, but sharing their experience and assets is limited by both time and financial concerns. Although their budget has remained static, mature DCCs are now at their peak funding requirements: they have a full workforce, more data than ever, and a large number of users. There is also a great deal of uncertainty about what happens to a DCC's data once CF funding ends. For instance, detailed knowledge about data processing procedures is vitally important for maintaining the value and utility of the datasets. Unfortunately, thorough internal documentation is time consuming and difficult to produce and thus may not be a priority, especially if the middle years involved a great deal of pipeline fluidity and infrastructure change. As there are currently no standardized mechanisms for stable data identifiers, or custodians to track those assets, there is no easy way to merge a DCC's data into another repository, and end users are far less able to find those resources after the DCC is deactivated. The problem of storage costs for maintaining the data of deactivated DCCs has also not been solved by the CF programs, nor has the need for user help, software maintenance, web hosting, or many other long-term costs associated with keeping a community resource alive.

FAIR Standards

While each Common Fund program has different mandates, communities, data types, and pipelines, their data coordinating centers are all working towards the same basic goal: to coalesce that program's data into something broadly useful. In short, making data Findable, Accessible, Interoperable, and Reusable is the express purpose of the DCCs. So, it is unsurprising that every DCC staff person we spoke to was dedicated to meeting the FAIR principles for their data. But while everyone agrees with FAIRness in theory, there is no unified set of practices that DCCs can use to make their data more FAIR. Our analysis of the DCCs' assets confirmed that each individual group has reached a high level of FAIRness for its own data; yet there are no specific criteria that DCCs can use to converge on a common approach to FAIR across the entire CF portfolio, and how each DCC actually implements FAIR principles varies widely. This makes creating a single objective metric to measure or compare FAIR across programs nearly impossible. There are also few incentives for any given DCC to take on the monumental task of creating a unified set of standards and socializing them across the Common Fund; nor do any DCCs have the time or resources to do so.
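
To illustrate why a single objective FAIR metric is so hard to define, the following minimal sketch shows what a naive scoring rubric might look like. The criteria and metadata field names are purely illustrative assumptions, not an agreed Common Fund standard; each check hides a program-specific judgment call, which is why scores computed this way would not be comparable across DCCs.

# Hypothetical FAIR rubric sketch -- criteria and field names are illustrative
# assumptions, not an agreed Common Fund standard.
FAIR_CHECKS = {
    "findable":      lambda d: bool(d.get("persistent_id")),          # e.g., a DOI or accession
    "accessible":    lambda d: bool(d.get("access_url")),             # a resolvable retrieval endpoint
    "interoperable": lambda d: d.get("metadata_schema") is not None,  # a declared community schema
    "reusable":      lambda d: bool(d.get("license")),                # an explicit usage license
}

def fair_score(dataset_record: dict) -> float:
    """Fraction of checks passed by one dataset's metadata record.

    Every check embeds a judgment call (what counts as a persistent
    identifier? which schemas count as 'community' schemas?), so a score
    computed this way is not directly comparable between DCCs.
    """
    passed = sum(1 for check in FAIR_CHECKS.values() if check(dataset_record))
    return passed / len(FAIR_CHECKS)

# A record that is highly FAIR under one DCC's conventions can fail these
# naive checks simply because its metadata uses different field names.
print(fair_score({"persistent_id": "doi:10.0000/example", "license": "CC-BY-4.0"}))  # 0.5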

Sustainability

A recurring theme from the CFDE's site visits in 2019 (see the Appendices of this report, as well as the July and October reports) is the challenge of paying for data storage and compute. Issues raised include basic data storage, serving raw and processed data to internal and external users, and supporting the cost of small-scale and large-scale analyses with unpredictable compute volumes. These costs are fundamental to the basic utility of the datasets being generated by Common Fund programs, and the inability of programs like the HMP and LINCS DCCs to pay for continued data hosting represents a major threat to the continued utility of these datasets.

Cloud-based technologies do not eliminate costs associated with data download, storage, transfer, and analysis: as one DCC PI noted, “Open Data is not Free Data”. The biggest threat to the CF goal of increasing the number of users accessing data and analysis capability in the cloud is the absence of a unified, streamlined, and cost-effective mechanism to host and manage data across all the Common Fund centers. The nascent STRIDES program may help reduce storage and analysis costs for the NIH, but that does not necessarily lend itself to increased use in the community. End users are accustomed to downloading data from free resources such as dbGaP and using it locally. Moving these resources to the cloud requires end users either to pay egress costs, which they seldom have funding for, or to work in a cloud environment, which few have the computational background to use, and which can be quite costly.
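
As a rough illustration of why egress costs matter to end users, the sketch below estimates the monthly charge for repeatedly downloading a cloud-hosted dataset. The dataset size, number of downloads, and per-gigabyte rate are assumptions made for the example; actual pricing varies by provider, region, and any negotiated discounts (for instance through STRIDES).

# Rough illustration of why "Open Data is not Free Data": estimated egress
# charges for downloading a cloud-hosted dataset. The rate below is an
# assumed, order-of-magnitude figure, not a quote for any specific provider.
ASSUMED_EGRESS_USD_PER_GB = 0.09

def monthly_egress_cost(dataset_size_tb: float, downloads_per_month: int) -> float:
    """Estimated monthly egress charge for full downloads of one dataset."""
    size_gb = dataset_size_tb * 1024
    return size_gb * ASSUMED_EGRESS_USD_PER_GB * downloads_per_month

# A hypothetical 10 TB processed dataset pulled in full by 50 labs in a month:
print(f"${monthly_egress_cost(10, 50):,.0f}")  # roughly $46,000, a cost end users rarely budget for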

The exact model for how the STRIDES program applies to the CF DCCs also remains unclear, and the challenge of managing these costs in the long term is compounded by the lack of clear policies and metrics at the Common Fund for supporting data resources beyond the lifetime of the project. While the STRIDES program provides one avenue for consolidated billing that could be independent of an award, it is only being used by one Common Fund program (Metabolomics) of the nine we have interviewed. Regardless of the funding source, or the reductions made available through STRIDES, costs for data hosting should be managed across all sites using a comprehensive approach that includes considerations for data reuse by their end users.

Some DCCs noted that on-premises infrastructure remains an attractive model in comparison to cloud-based systems. As the data ecosystem continues to expand, the balance between cloud-based capability and local (on-premises) computing will continue to be a complex decision that relates not only to the needs of each individual program, but also to the long-term sustainability of the data resources and to NIH plans.

User Training Needs

Most DCCs reported that user support takes up a substantial fraction of their time and effort. For some, like GTEx and Kids First, their support burden primarily comes from end users. Both programs told us that they have constant questions and help requests from their community. For other programs, the support burden is internal to the consortium. 4D Nucleome and Metabolomics, for example, devote most of their resources to helping their data providers format and submit data to the DCC. Nearly every DCC told us they would welcome both extra resources, and help with building more effective training.

Many DCCs were interested in expanding biomedical user support and training for users inside and outside the project. A range of specific motivations emerged from our site visits; these included increasing access to data, educating users in appropriate analyses, supporting more sophisticated data analysis, driving more integrative analysis, and identifying opportunities for enhancing DCC infrastructure and functionality. Despite this interest, relatively few DCCs have a significant training mandate or an active training program. The support burden that additional users would add to projects was also a significant concern. A related concern, raised by particularly widely used resources such as GTEx, was that many biomedical scientists need introductory bioinformatics training in order to use the DCC resources, but this kind of training is too broad in scope for any one project to tackle. Finally, many DCCs are interested in offering more training, but even with funding they would lack the expertise to develop materials, offer workshops, and perform assessments.

Findability And Interoperability

The CF DCCs host a rich set of file types (e.g., genomic sequence, metagenomic, RNA-Seq, physiological, and metabolic data), but it is not currently possible to discover these files across DCCs. Metadata, which is crucial to making use of these assets, is not readily accessible to users in a uniform manner. At present there are several interoperation efforts, including projects within the Common Fund (e.g., GTEx/Kids First, MoTrPAC/Metabolomics) and outside the Common Fund (e.g., GTEx/Kids First/AnVIL, HuBMAP/HCA). Several DCCs, as well as some end users, have told us that the ability to combine dataset cohorts across DCCs is highly desirable, as it would directly address important biomedical and clinical questions. Unfortunately, there are no unified practices for electronic formatting and transport of Common Fund data, and no standardized interoperable mechanism for transporting datasets between DCC portals, or to platforms such as Terra or Cavatica for data analysis.

Combining just two datasets is a huge challenge: a researcher needs to be able to first Find both of the datasets through a query system, and then gain access to both. She then needs to ensure that the two datasets have a compatible study design and metadata. Once she gains access to the raw data, she needs to manually check each metadata field and figure out how to make the studies match. This might include transformations such as converting pounds to kilograms in one study, and re-coding cancer/healthy to case/control. If done at the level of the DCCs themselves, this process would need to happen for every dataset pair across the Programs. A given DCC knows its own assets, but like any end user, it would have to search through all the other individual portals, all of which use different terms, to even begin finding data to interoperate with. Currently there is no common portal for CF data, so it is virtually impossible to easily search across all Common Fund assets and determine whether any given dataset exists.
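
The following is a minimal sketch of the manual harmonization step described above, assuming two hypothetical study tables that record weight and disease status under different conventions; all field names, codings, and values are illustrative, not actual DCC schemas.

# Minimal sketch of pairwise metadata harmonization, assuming two hypothetical
# study tables: study A records weight in pounds and disease status as
# cancer/healthy, while study B uses kilograms and case/control. All field
# names and codings here are illustrative, not actual DCC schemas.
LB_PER_KG = 2.20462

study_a = [{"subject": "A-01", "weight_lb": 154, "status": "cancer"},
           {"subject": "A-02", "weight_lb": 200, "status": "healthy"}]
study_b = [{"subject": "B-01", "weight_kg": 81.0, "group": "control"}]

def harmonize_a(record: dict) -> dict:
    """Map one study A record onto study B's conventions (kg, case/control)."""
    return {
        "subject": record["subject"],
        "weight_kg": round(record["weight_lb"] / LB_PER_KG, 1),
        "group": "case" if record["status"] == "cancer" else "control",
    }

# Every field mapping above encodes a scientific decision (are the cohorts,
# units, and case definitions truly comparable?) that no script can make alone.
combined = [harmonize_a(r) for r in study_a] + study_b
print(combined)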

It is important to note that most of these challenges are scientific and social rather than technical. Technical solutions like transitioning to cloud storage can reduce the time and costs associated with moving and analyzing data, and building a centralized search portal would greatly improve Findability, but these address only a tiny fraction of the interoperability challenge. The true difficulties lie in making scientifically sound decisions about topics such as whether two datasets should be combined, which metadata are compatible, and which metadata terms are ‘important’. Staff at DCCs frequently emphasized that the substantial cost of increasing interoperability should be justified with well-defined use cases, and conversely some data custodians were concerned that facile integration of datasets could lead to misinterpretation of that information.

Human Subject Data

DCCs have reported that their users face several barriers to accessing dbGaP data. The practices and infrastructure of dbGaP are fairly confusing. For instance, dbGaP renames files when they are deposited, but does not inform the DCC of the new names. Similarly, obtaining access across several clinical studies is cumbersome; for example, accessing the data for phase two of the Human Microbiome Project would require approval for twelve different consent groups. Another significant concern is that each DCC has its own sign-on and authorization system, so end users need to sign up separately for every portal.

Engagement/Mobilization

During our listening tour nearly all of the PIs expressed enthusiasm for an additional level of coordination and interaction across the Common Fund DCCs. Several sites also expressed an interest in greater opportunities to collaborate with other groups in the areas of data integration, harmonization, resource and expertise sharing, and development of common training activities. These examples reflect not only a need for some type of common technical development across these groups, but also a need for increased engagement, social interaction, and communication across all the DCCs. While there is some official interaction between DCCs - for example, Dr. Subramaniam, the PI of the Metabolomics DCC, serves on the SAB for LINCS and MoTrPAC - by and large the DCCs operate independently and do not communicate regularly at a technical or scientific level. Interestingly, both MoTrPAC and 4DN have established informal relationships with ENCODE in order to adopt and adapt their workflows for transcriptome and epigenome studies. At present, technical cooperation between the DCCs and identification of reusable technical solutions is challenging because DCCs are largely unaware of what technical approaches are being used by others. The DCCs have mostly developed tools and strategies to support their mission in isolation, and do not have the time or resources to build new collaborations.