October Report – Assessment: Recent Deep Dives

With the addition of HuBMAP and SPARC, the CFDE Engagement team has now met with six Common Fund Programs. These six programs span the entire breadth of the Common Fund funding lifecycle, ranging from HMP (funding ended) to HuBMAP (just now releasing their first data sets).

We have continued to use the same “deep dive” format for our site visits, which continue to be both highly informative and very productive. Although we use the same general agenda to organize each meeting, the content and tenor of each engagement are unique, and the style of our meetings has evolved over time; we discuss the implications of this below, in Operationalizing the CFDE.

HuBMAP. Of all the Programs we have visited, HuBMAP is both the earliest in their lifecycle and the most organizationally complex. The HuBMAP Integration, Visualization, and Engagement (HIVE) Collaboratory, the main organizational unit for integration and dissemination of HuBMAP data, is itself a coalition of five organizations. Together, they oversee the work of the Tissue Mapping Centers and Innovative Technologies Groups, all working to create a variety of cell maps and analysis approaches. The HIVE has been working diligently for about a year to set up working groups, determine governance, and build the infrastructure that is vital to the organization. While HuBMAP is still in the early stages, they expect to transition to hosting data in the coming year, and their infrastructure design is unique among the sites we have visited so far. Because members of the HIVE have a great deal of expertise in running High Performance Computing Centers, they have chosen a hybrid infrastructure in which much of the data and compute is local to the Pittsburgh Supercomputing Center, with the ability to ‘burst’ into the cloud for larger jobs. HuBMAP expects that their hybrid system, with mostly on-premises compute, will save hundreds of thousands of dollars over the life of their program. The complete HuBMAP report can be found in Appendix A.

SPARC. The SPARC data portal, designed and operated by Blackfynn, is conceptually very different from the portals of other Common Fund Programs, and tries to incentivize data creators to deposit data as it is generated. While the portal can be used to discover datasets, it is primarily designed as a data management system. Users can store not only the raw data files, but also analyses, notes, presentations, and almost anything else. By positioning themselves as a place where data generators can organize data, metadata, and supporting documentation for their own day-to-day use, SPARC hopes to make all of their data more FAIR and encourage good data stewardship. All of the portal infrastructure and approximately 10 TB of data are already hosted on the Amazon cloud (AWS), and they expect to have over 100 TB of data by the end of 2020. SPARC expects that the reliability and flexibility of cloud services will let them scale their program in a sustainable way, and allow them to focus their time and energy on creating innovative user experiences rather than maintaining servers. The complete SPARC report can be found in Appendix B.