October Report – Opportunities and Challenges, Summarized

The opportunities and challenges reflected in this section are a synthesis of what we have learned from the deep dive sessions described in the July report, as well as from the two recent visits with SPARC and HuBMAP.

FAIRness of data is not inherent in hosting data on the cloud. The main outcome of our July assessment was the clear realization that the datasets hosted by the DCCs are not inherently interoperable, and that placing their assets in the cloud does not intrinsically solve the problems of findability, accessibility, interoperability, and reusability. What is clear now is that a progressive series of challenges must be addressed in order to make Common Fund data more FAIR. The first challenge is that there are no clear guidelines for how data can be made FAIR. "We believe in FAIR," said one member of a DCC when asked what they thought about the term, but it was apparent from the response that, while their daily work revolves around increasing all aspects of the FAIRness of their data, they had no measures of FAIRness other than their own subjective ones. This challenge is addressed in CFDE Operationalization Approach 1, where we will provide clear, objective metrics for all Programs to follow in order to increase the FAIRness of Common Fund data.
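
As a concrete illustration of what such objective metrics could look like, the following is a minimal, hypothetical sketch of automatable FAIRness checks against a dataset's catalog record. The record fields, the per-facet rules, and the example values are our own illustrative assumptions, not the actual CFDE metrics.

    # Hypothetical sketch: simple, objective pass/fail checks, one per FAIR facet.
    # The fields and rules are illustrative assumptions, not the CFDE metrics themselves.
    from dataclasses import dataclass, field

    @dataclass
    class DatasetRecord:
        persistent_id: str = ""      # e.g., a DOI or other resolvable identifier
        metadata_url: str = ""       # publicly retrievable metadata landing page
        license: str = ""            # explicit reuse license
        controlled_vocab_terms: list = field(default_factory=list)  # e.g., ontology terms

    def fair_checks(rec: DatasetRecord) -> dict:
        """Return a pass/fail result for one simple check per FAIR facet."""
        return {
            "findable":      bool(rec.persistent_id),
            "accessible":    rec.metadata_url.startswith("https://"),
            "interoperable": bool(rec.controlled_vocab_terms),
            "reusable":      bool(rec.license),
        }

    example = DatasetRecord(persistent_id="doi:10.0000/example",
                            metadata_url="https://example.org/dataset/123",
                            license="CC-BY-4.0",
                            controlled_vocab_terms=["UBERON:0002107"])
    print(fair_checks(example))   # all four checks pass for this example record

Even a checklist this simple would give Programs a shared, objective baseline to report against, in contrast to the subjective measures described above.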

Another challenge is to make the data derived from the portfolio of Common Fund programs more findable and accessible. Each of the Common Fund programs we visited has (or will have) large quantities of high-value data assets that can be found via its website, and those assets can be viewed, analyzed, and downloaded at each program's individual portal. For example, GTEx has many tissue-specific RNA-seq datasets that can be used to compare gene and isoform expression between normal tissues. However, no end user, or even NIH program manager, can locate all Common Fund assets in a single system, nor is there an easy way to determine whether any given dataset exists. For example, for a Kids First user to examine a gene's expression in tumor tissue relative to normal tissue, they would first need to know that GTEx has relevant normal tissue samples before they could look for them. Knowing what data exist, and where to find them, would be a huge breakthrough for many Common Fund researchers. This challenge is addressed in Approach 2: CFDE Portal Implementation.
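
To make the Kids First/GTEx scenario concrete, here is a minimal, hypothetical sketch of the kind of cross-program query a unified catalog would support. The catalog layout, field names, and records are illustrative assumptions and do not describe the actual CFDE portal or its API.

    # Hypothetical sketch of a cross-program metadata search; the catalog
    # structure and records below are illustrative, not real CFDE portal data.
    catalog = [
        {"program": "GTEx",       "assay": "RNA-seq", "tissue": "liver",  "condition": "normal"},
        {"program": "Kids First", "assay": "RNA-seq", "tissue": "liver",  "condition": "tumor"},
        {"program": "HuBMAP",     "assay": "imaging", "tissue": "kidney", "condition": "normal"},
    ]

    def find_datasets(assay: str, tissue: str) -> list:
        """Return matching records from every program, not just the one a user already knows."""
        return [r for r in catalog if r["assay"] == assay and r["tissue"] == tissue]

    # A Kids First user looking for normal-tissue comparators would discover the
    # GTEx record without needing to know in advance that it exists.
    for record in find_datasets("RNA-seq", "liver"):
        print(record["program"], record["condition"])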

A third challenge is that, even once findability and accessibility are addressed, interoperability depends on datasets from at least two Programs being combined, and on users being able to move those datasets to analysis tools. This challenge is addressed in Approach 2: CFDE Portal Implementation, which will allow users to combine datasets from multiple Programs, and in Approach 3: Training, which will instruct users in the use of several analysis systems.

A final challenge is that interoperability is not always desirable. Many data curators are wary of efforts that might make incompatible datasets interoperate, and have raised major concerns (see especially the GTEx and HuBMAP deep dives). The two major concerns are that:

  1. Not all data sets may be usefully interoperable, and the cost of making them interoperable may be prohibitive, especially in the absence of well-defined use cases.
  2. Successful data integration requires a talented and motivated user with a deep understanding of the data, which in practice means working with the original data sources. Put another way, the further analysts are from the origin of the data, the more likely they are to misuse it.

These concerns will be addressed in Approach 4: Addressing Data Incompatibility Concerns.

More researchers must be enabled to reuse Common Fund data. One person can make anything work if they try hard enough and/or have enough tech support. The value of operationalizing FAIR is that it will enable many people to do analyses that were previously available only to expert bioinformaticians, and will capture the economies of scale that come from investing in solutions to problems that affect many people. Our FAIRness metrics should reflect this: for example, a small number of individual high-impact papers is less valuable to the community than many papers that make opportunistic (and perhaps small) use of Common Fund data sets. The challenge of enabling many researchers will be addressed in Approach 3: Training.

Common Fund must prepare for a future of federation and interoperation. There are a number of initiatives and opportunities for interoperation, both within the Common Fund (GTEx/Kids First) and outside it (GTEx/Kids First/AnVIL, HuBMAP/HCA). These are areas where the Common Fund can prepare for the future by ensuring that efforts and standards emerging from the Common Fund are not incompatible with likely NIH (ODSS), national, and global (GA4GH) standards. Moreover, the Common Fund should work towards implementing these same standards in current efforts where possible, so that over time Common Fund programs gain interoperability with each other as well as with resources beyond the Common Fund. This challenge will be addressed in Approach 1: Data Federation and Approach 5: Federating with Data Resources External to the Common Fund.

The Common Fund must plan for “catastrophic success”. A consistent message from the Programs is that increasing data reuse will lead to an increased support burden as well as increased compute costs. For this reason, we expect to need to elaborate our recommendations in this area in the future, including recommendations for broad training, tier 1 help desk support, and flexible compute options that do not burden the data centers with increased costs as their user bases grow. We will need in-depth usage information from the data coordinating centers and a full release of the CFDE portal to determine how best to build out these recommendations; these activities will therefore occur beyond FY2020. We will work to understand the scope of this challenge through Approach 3: Training, and Approach 6: Assessing the Optimal Balance of Cloud Versus On-Premises Computing.

The balance of cloud-based capability versus local (on-premises) computing is unclear. New Common Fund Programs are increasingly faced with the decision of whether to build an on-premises solution or to develop in the cloud. For example, HuBMAP and SPARC provided quite different perspectives on how to host infrastructure: HuBMAP is using on-premises infrastructure as a cost-effective hosting solution, while the SPARC Data Core is completely committed to cloud hosting. This is a complex decision that relates not only to the needs of each individual program, but also to the long-term sustainability of the data resources and to NIH plans. A particular challenge is that although the trend in the Common Fund is towards cloud-based solutions, it is unclear whether cloud hosting is mandated, and the cost of moving a project like HuBMAP to the cloud mid-project would be immense.

The SPARC platform produced by Blackfynn is hosted entirely on the Amazon cloud (AWS). The SPARC Portal is hosted through Heroku (https://heroku.com), which in turn leverages AWS. They chose a 100% cloud-based model for a number of reasons, mostly having to do with the flexibility of the cloud. Disk drives inevitably fail, and the amount of extra processing power, download or upload speed, and disk space that the project needs fluctuates depending on what users are doing. Determining up front how to account for those needs is difficult and error prone; with cloud services, however, malfunctioning disk drives automatically fail over to working ones, and processes can scale almost without limit. They also don’t have to worry about whether their internet connection is stable or has enough bandwidth: a large proportion of Amazon’s servers around the world would have to be affected before the SPARC Portal was impacted.

HuBMAP’s data center, based at the Pittsburgh Supercomputing Center (long the home of XSEDE), mostly uses computers hosted at their own local facility (as opposed to computers hosted at Google or Amazon). The center has many years of expertise in designing and maintaining high-performance computing clusters, and is able to dramatically lower its computing costs by leveraging that expertise. They will still have some cloud computing capability; in particular, they plan to ‘burst’ into the cloud for very large compute jobs, and they hope the service will be seamless. That is, a user will find working on HuBMAP’s servers functionally indistinguishable from working in something like AWS, and if their job expands or shifts into the cloud, their experience will remain the same.

Although HuBMAP’s hybrid infrastructure is likely to work well for them, it is not likely to work for all Common Fund programs. While contracts for inexpensive storage are relatively easy to obtain, keeping those systems secure and running smoothly requires specialized knowledge of the underlying site infrastructure (expertise that is not needed if DCCs use Google or Amazon), and the equipment (e.g., servers, backup systems, and disks) depreciates over time and must be replaced. In general, for smaller systems at organizations without an existing infrastructure, cloud solutions provide a cost-effective way to implement a robust system without a large upfront hardware and IT investment.

There is also a question of long-term data maintenance. As we have seen with the Human Microbiome Project and LINCS, data stored on local servers are subject to the infrastructure demands of the facility that stores them. If the data are to remain local, new servers will need to be purchased as old infrastructure is retired, and it’s not clear who pays for this once a Program has reached the end of its funding, or whether the hosting organization would even allow the NIH to effectively rent space there for a Program that is no longer active. Of course, the challenge of long-term support also applies to cloud-hosted data, which require ongoing payments to keep running; the difference lies in the logistics of data access. Data that are already in the cloud are generally more expensive to maintain over their lifetime, but they are relatively easy for the NIH to take custodianship of at the end of a given program, even if the program is unexpectedly cut short. Local data will be less expensive, but difficult or impossible to maintain on-site at a de-funded program facility, and will take time, likely weeks or months, to migrate to the cloud for NIH custodianship.
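
To illustrate the shape of this trade-off, the following is a minimal sketch comparing cumulative costs under purely hypothetical rates; the hardware price, staffing cost, storage volume, and per-terabyte fee are placeholder assumptions, not figures from any Program or vendor.

    # Hypothetical cost comparison; all dollar figures are placeholder assumptions.
    def local_cost(years: int, hardware: float = 50_000, lifetime_years: int = 5,
                   annual_staff: float = 20_000) -> float:
        """On-premises: periodic hardware replacement plus ongoing staff/facility costs."""
        replacements = -(-years // lifetime_years)   # ceil(years / lifetime_years)
        return replacements * hardware + years * annual_staff

    def cloud_cost(years: int, tb_stored: float = 100,
                   rate_per_tb_month: float = 30.0) -> float:
        """Cloud: a recurring per-TB monthly fee for as long as the data stay hosted."""
        return years * 12 * tb_stored * rate_per_tb_month

    for y in (3, 5, 10):
        print(f"{y} years: local ~${local_cost(y):,.0f}, cloud ~${cloud_cost(y):,.0f}")

Under these made-up numbers the on-premises option pulls ahead over time, but only while someone is funded to buy and run the replacement hardware, which is exactly the gap described above for a de-funded Program.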

Approaches to addressing these concerns are described in Approach 6: Assessing the Optimal Balance of Cloud Versus On-Premises Computing and Storage.

Outreach and Ecosystem Building require careful social interactions. Our July report outlined the critical role that our site visits play in understanding the opportunities and challenges for each Program, as well as in establishing a relationship with each DCC/DRC. The site visits continue to be important! However, because our July report is publicly available and many of the PIs had taken the time to read it, our hosts often had many more questions for us than in previous engagements, and were more likely to tailor their presentations to reflect how their programs compared to what we had written.

Unexpectedly, this pre-knowledge of our recommendations served as both an expedient and a hindrance in each meeting. In cases where our hosts’ challenges were already reflected in our July report, their questions tended towards those issues and how they could get involved. In cases where our report did not resonate with our hosts, the meetings were largely focused on exploring and clarifying our recommendations. Moreover, there is a tendency to regard our July recommendations as relatively fixed. This challenge will be addressed in Approach 7: Changing Role of Site Visits.

The cost of hosting and managing data must be addressed. The SPARC Data Core told us that the biggest threat to sustainability is the misconception that “Open Data is Free Data.” There will always be costs associated with data download, storage, transfer, and analysis, and while it is possible to make data available to users for free, doing so increases the cost borne by the data host, which is of particular concern after the Common Fund Program that generated the data has ended. Sustainability therefore depends on finding the best way to distribute these costs. Given the scale of the current and planned Common Fund data sets (tens of terabytes to petabytes), data storage, downloads, and computing will be very expensive. Multiple Common Fund programs are struggling with cloud egress charges, with how much compute to provide for free (and to whom), and with how to enable inexpensive computing close to the data for users they cannot support.

One specific example is GTEx, which provides 100 TB of raw RNA-seq data in its v8 release. Users can interact with these data in three ways: 1) using the visualization tools on the GTEx website; 2) downloading the data to perform local analysis; and 3) transferring the data to another cloud-based system for analysis. The GTEx visualization tools are very sophisticated, but they answer specific questions, and many users still must resort to option 2: downloading GTEx data to their local computers. The egress charges for such a download are approximately $8,500, which is prohibitive for many researchers. Option 3, performing analysis in the cloud, avoids egress charges for most users, and compute costs could be offset by providing free or hosted compute to internal program users, while external users would pay for their own compute. However, these solutions come with their own challenges. One significant challenge is that relatively few biomedical data scientists have experience doing their analyses in the cloud; training is vital for these researchers, as trivial scripting errors can result in huge compute costs. A related challenge is that easier-to-use platforms for analyzing data in the cloud, such as Terra or Cavatica, may not be robust or flexible enough to meet users’ needs. Finally, even when egress charges are avoided, researchers must still pay for cloud computing to perform their analyses, and universities generally do not yet have good mechanisms for paying these costs.
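
As a back-of-envelope check on the figure above, a flat per-gigabyte egress rate of roughly $0.085/GB applied to a 100 TB download reproduces a charge of about $8,500. The sketch below assumes a single flat rate purely for illustration; real cloud egress pricing is tiered and varies by provider, region, and monthly volume.

    # Back-of-envelope egress estimate; the flat per-GB rate is an illustrative
    # assumption, since real cloud egress pricing is tiered and provider-specific.
    def egress_cost(dataset_tb: float, rate_per_gb: float) -> float:
        """Estimate the cost of downloading `dataset_tb` terabytes at a flat $/GB rate."""
        return dataset_tb * 1_000 * rate_per_gb   # decimal TB -> GB, as providers typically bill

    # ~100 TB of GTEx v8 raw RNA-seq at an assumed ~$0.085/GB gives roughly the
    # $8,500 figure cited above.
    print(f"${egress_cost(100, 0.085):,.0f}")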

Members of the SPARC Data Core noted, “Accessible to us means the user should be able to get many Terabytes of data available to them in an easy and scalable way.” This concern must be addressed by arriving at a robust mechanism either to assist users with the cost of data egress or to provide users with some form of inexpensive computing capability. At least three Common Fund DCCs are already addressing this by enabling users to access large-scale data assets and by providing cloud-based analysis services to their users: the Kids First DRC is making use of Cavatica, GTEx provides its pipelines via Terra, and the SPARC Data Core has initiated an effort to link its data with Amazon. HuBMAP is also planning flexible and robust compute via its data center as well as export to cloud services. These solutions will help, but they will always require an investment of funds from NIH.

A final important cost issue is demonstrated by GTEx, which hosts protected data that can only be served from a FISMA-compliant repository with appropriate authentication and authorization. While those costs have been shifted to NHGRI/AnVIL, the burden of maintaining FISMA compliance for other Common Fund data sets will be an ongoing concern.

The CFDE tech team can assist the Programs by assessing costs for different sources of computing, encouraging the data coordinating centers to pool resources to lower overall costs, training users on cloud-based systems so they avoid egress charges, and engaging closely with larger cost-saving initiatives such as STRIDES. We will also consult closely with Common Fund leadership to develop clear guidelines and budgetary policies that enable the Common Fund Programs and end users to take better advantage of cloud resources. However, there is little the CFDE tech team can do to directly reduce computing costs, especially since those costs are unpredictable, varying with compute needs and download volumes. In any scenario going forward, NIH will always need to shoulder the cost of computing and storage, and to provide resources to train users to adopt cheaper cloud-based analysis systems.