December Report – Appendix A – 4D Nucleome Deep Dive

Table of Contents:
4D Nucleome Site Visit
4D Nucleome Overview
Program Lifestage
Data Platform
Harmonization and Metadata
Sustainability
Training
FAIR
Cross-pollination
SSO (Single Sign-on)
Outcomes
Agenda

4D Nucleome Site Visit

Location: Harvard Medical School, 302 Countway Library

Date: Thursday, October 17, 2019

Attendees: Representatives in attendance from the CFDE were Amanda Charbonneau (UCD), Alex Waldrop (RTI), Titus Brown (UCD), Owen White (UMB), Brian Osbourne, and Anup Mahurkar (UMB). The representatives from 4D Nucleome included Peter Park (PI), Burak Alver (Scientific Project Manager), Andy Schroeder (Senior Data Curator), Koray Kirli (Data Curator), Sarah Reiff (Data Curator), and Luisa Mercado (Data Curator).

Meeting Logistics

We held a meeting with the 4D Nucleome (4DN) Network Data Coordination and Integration Center infrastructure team at the Countway Medical Library at Harvard Medical School on Thursday October 17, 2019 to discuss their ongoing program. During the meeting, we used the agenda at the end of this document as an informal guide to structure our conversation and address key issues.

The engagement team began by introducing themselves and their goals for the meeting. These goals included learning more about:

  • Structure, vision, and goals of 4DN
  • Platform stakeholders, important users, and common data types
  • Information about training and organization
  • Ongoing and upcoming organizational challenges
  • Overall set of priorities

After reviewing the day’s agenda and objectives, the 4DN team provided an extensive overview of the overall 4DN program goals as well as the 4DN DCIC goals and products. The overview touched briefly on the scientific motivation behind 4DN and how the DCIC’s current data platform has evolved to serve the needs of program stakeholders. Follow-up conversations led by the engagement team focused on understanding the organizational structure, future goals, ongoing challenges, and potential productivity bottlenecks experienced by the 4DN team.

The engagement team finished by presenting their high-level preliminary vision for how the CFDE might support member DCICs like 4DN in the future. The day concluded with a more specifically tailored brainstorming session of concrete ways the CFDE might support and add value to the 4DN team’s ongoing and future work.

4D Nucleome Overview

The broad goal of the 4D Nucleome program is to study the three-dimensional organization of the nucleus through space and time (the fourth dimension) in order to better understand how changes to this architecture impact biological function. The program grew out of an increasing realization that structural variation in the nucleus (e.g. chromatin structure) plays a critical, albeit poorly understood and likely misunderstood, role in overall biological function. The high-level goals of the 4D Nucleome program are to:

  1. Investigate the functional role of various structural features and nuclear processes,
  2. Develop, benchmark, validate, and standardize next-generation technologies for investigating nuclear organization in 4 dimensions, and
  3. Develop an open data platform and a set of data and data analysis standards for housing, harmonizing, processing, and distributing the diverse array of imaging data (e.g. FISH, ChromEMT) and molecular data (e.g. Hi-C, ChIP-seq, ATAC-seq) pioneered by and generated through 4DN program centers.

Operationally, 4DN is a technology and research development network comprising 29 partner centers organized into 7 larger initiatives addressing one or more of these core goals (shown below).

Center | Member Organization(s) | Responsibilities
Nuclear Organization and Function Interdisciplinary Consortium (NOFIC) | 6 centers | Develop, benchmark, standardize, and validate high-throughput technologies that can produce three-dimensional physical and functional maps of mammalian genomes
4D Nucleome Imaging Tools | 9 centers | Develop, benchmark, standardize, and validate imaging technologies for visualizing structural and functional organization of the mammalian genome
Network Data Coordination and Integration Center | Harvard Medical School; Washington University (visualization tools) | Collect, store, curate, and display all data, metadata, and analysis tools generated by the 4DN Network; develop data, metadata, and analysis standards
Network Organizational Hub | University of California San Diego | Coordinates activities across 4DN centers and teams
Study of Nuclear Bodies and Compartments | 6 centers | Investigate 1) topography of nuclear bodies and transcriptional machineries, 2) structure and function of poorly characterized nuclear structures, 3) role of specialized proteins and RNAs in the assembly, organization, and function of nuclear bodies
Nucleomics Tools | The Babraham Institute; California Institute of Technology; Baylor College of Medicine; Cornell University; University of Pennsylvania | Develop and validate physical, chemical and biochemical approaches for measuring properties and dynamics of the 3-D organization of the genome

As described in the table above, the majority of 4DN network centers, including those within the NOFIC, Nucleomics Tools, Study of Nuclear Bodies, and 4D Nucleome Imaging Tools consortia, are responsible for generating high-quality datasets with standardized metadata annotations using state-of-the-art technologies to characterize, quantify, and visualize various spatial features of genomic architecture. As part of this work, 4DN centers are also responsible for developing, benchmarking, validating, and standardizing the next-generation technologies they employ as part of their data generation efforts. Example technologies encompassed by 4DN include emerging molecular methods for quantifying chromatin spatial structure and genomic interactions, like Chromosome Conformation Capture (3C) and Hi-C, as well as cutting-edge imaging techniques, like ChromEMT and electron tomography, for providing high-resolution 3-D visualizations of chromatin structure and sub-nuclear cellular components. In total, 4DN network centers collectively generate nearly 30 distinct data modalities characterizing various aspects of the 4-D genomic architecture across a range of experimental conditions (e.g. heat shock, various CRISPR modifications) in human, mouse, and Drosophila cell lines.

Data generated by 4DN partner institutions are integrated, curated, analyzed, and disseminated by the 4DN Data Coordination and Integration Center (DCIC). As part of this work, the 4DN DCIC develops and maintains a web-based portal (https://data.4dnucleome.org/) to support data submission from 4DN network centers, and provides tools and support for data access, visualization, and analysis to the broader community. The 4DN DCIC is also tasked with leading the ongoing data integration and harmonization efforts across submissions from the program’s 29 partner centers. These efforts include working directly with data generators to define and develop standards for 1) routine data analysis/processing pipelines, 2) experimental metadata and definitions, as well as 3) file format standards for raw/processed genomic and image data files.

Program Lifestage

The 4DN program started in 2015 and is currently in its final year of Phase 1 funding, which is set to end in 2020. After receiving initial funding in September 2015, the 4DN DCIC spent much of its first year conducting a somewhat difficult hiring process. They had hired a core team of 3 developers and 2 data curators by June 2016, and the remainder of 4DN's first year was spent engaging the broader 4DN network through working groups tasked with establishing the program's data sharing policies and standards for cell line work and metadata terms. The DCIC released an alpha version of the 4DN portal in October 2016. The 4DN DCIC team also began an ongoing collaboration with the more established ENCODE DCC to learn how that group had addressed many of the same issues 4DN was facing. As a result, much of the current 4DN platform is built on tools first developed by or in tandem with ENCODE teams. Over the next year, 4DN DCIC team members led working groups tasked with developing omics data analysis and imaging standards that continued to guide platform design and development. The DCIC released a beta version of the platform in 2017, followed by the official production release of the 4DN web portal in 2018.

In that time, 4DN network centers have generated a tremendous amount of data, resulting in nearly 200 peer-reviewed publications. Since the platform's official release, the 4DN web portal has grown to house data from over 700 studies, encompassing 27 data modalities across more than 2,300 separate experiments. The 4DN DCIC infrastructure team attributes this growth to the active and often interactive role they play in soliciting and facilitating data submissions from across the 4DN network.

Most of the day-to-day work at the 4DN DCIC is in direct service of the 4DN network. This includes processing user submissions through standardized pipelines, expanding the data platform to integrate evolving protocols and metadata terms from collaborators, assisting 4DN network partners through the data submission process, and engaging the 4DN network to develop imaging and omics data analysis standards. Interestingly, the 4DN DCIC team says much of this work focuses on simply trying to convince 4DN network partners to submit their data to the platform. Continued work on the 4DN platform focuses on adding features and services to incentivize 4DN network partners to host data on the platform, such as data processing services through the 4DN analysis platform and data visualization tools for emerging data types like Hi-C.

Ongoing challenges faced by the 4DN DCIC stem largely from the program's size and the diversity and complexity of the techniques under development. Data generated by 4DN network centers are hypothesis driven, which has led to the exponential growth of an increasingly sparse matrix of experimental conditions that need to be defined by 4DN data curators and incorporated into the data platform. Complicating this matter, there is an increasing lack of consensus among the 29 partner institutions, which continue to develop their own internal protocols for shared technologies. The submission process is becoming increasingly complicated by the growth and complexity of the underlying metadata model, to the point that contributors almost always require direct assistance from the data curation team; to date, exactly one user has been able to submit data entirely on their own.

Despite these challenges, the data portal continues to grow, and 4DN is beginning to navigate the transition from Phase 1 to Phase 2 funding. The 4DN DCIC team is currently preparing its RFA submission for Phase 2 funding, which would last until 2025. The team expressed concerns over navigating the impending funding uncertainty for Phase 2 and a potential funding discontinuity between Phases 1 and 2. Even if the current 4DN DCIC team receives Phase 2 funding, they don't know when those funds would become available, or how they would cover cloud storage and personnel costs in the interim. In spite of this uncertainty, the 4DN DCIC team continues to develop their state-of-the-art platform and provide a high level of support to their user base as they navigate this period.

Data Platform

Infrastructure

The 4DN DCIC web portal and underlying data platform are hosted entirely on the Amazon Web Services (AWS) cloud platform. The 4DN infrastructure primarily utilizes cloud storage (e.g. S3) for datasets hosted through the web portal, and on-demand computing resources for executing data processing pipelines.

The 4DN web portal and larger data platform are undergirded by several modularized services integrated through the platform's data API. At the core of this architecture, 4DN uses SnoVault--an object-storage system developed by the ENCODE DCC that combines Elasticsearch with a PostgreSQL backend--to manage metadata and data file objects on cloud storage, display metadata statistics on the web portal, and power the web portal's search functionality. 4DN leverages SnoVault's RESTful API across its platform to provide a single, standard interface for both authenticated external users and internal software components (e.g. web portal, analysis platform, data ingestion pipelines) to access 4DN data and metadata.
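
To make this single interface concrete, the sketch below shows a minimal, hypothetical metadata query against a SnoVault-style REST endpoint. The endpoint path, item type, and query parameters are assumptions based on the ENCODE/SnoVault convention described above, not confirmed details of the 4DN API; the portal's own API documentation describes the actual interface.

```python
# Minimal sketch of querying a SnoVault-style REST API for metadata.
# Endpoint, item type, and parameter names are illustrative assumptions.
import requests

BASE_URL = "https://data.4dnucleome.org"

response = requests.get(
    f"{BASE_URL}/search/",
    params={"type": "ExperimentSet", "format": "json", "limit": 10},  # assumed parameters
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

# SnoVault-style portals return search hits under a JSON-LD "@graph" list.
for item in response.json().get("@graph", []):
    print(item.get("accession"), item.get("@id"))
```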

As an added benefit of the collaboration between 4DN and ENCODE and their shared use of SnoVault, many services are automatically cross-compatible between the two platforms. For example, visualization tools on 4DN's web portal can be augmented with metadata tracks directly from ENCODE. It's unclear whether this interoperability would easily extend to any Data Coordinating Center (DCC) using SnoVault, or whether this is merely a by-product of the similar underlying data types and metadata terms shared across 4DN and ENCODE. At the very least, this successful cross-pollination should merit closer inspection as the CFDE looks to facilitate greater interoperability across Common Fund Programs.

To ensure the stability of its platform, 4DN developed a tool called FourSight to manage, monitor, and maintain the network of persistent AWS resources that support the 4DN web portal and data platform. FourSight provides the 4DN DCIC with automated monitoring of the web, database, API, and data ingestion servers on AWS that run the 4DN web portal.

4DN has also developed a robust infrastructure for reproducible data analysis on AWS. The core of this infrastructure is built on Tibanna, a stand-alone open source tool developed and released by the 4DN DCIC for automated workflow execution on AWS. Though Tibanna supports a number of workflow languages, a motivating factor behind the tool’s development was the lack of open-source tools supporting CWL execution on AWS. Tibanna automates workflows by orchestrating the provisioning of cloud computing resources, transferring inputs from cloud storage, executing workflow steps in dockerized environments, and saving processed output on cloud storage. With Tibanna and AWS, the 4DN DCIC can reproducibly automate complex data processing pipelines on-demand at virtually any scale.
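
As a rough illustration of this pattern, the sketch below dispatches a single CWL run through Tibanna's Python interface. The workflow name, bucket names, and input-JSON fields are hypothetical placeholders, and the exact call signature and input schema should be taken from the Tibanna documentation rather than from this sketch; it also assumes Tibanna has already been deployed to the AWS account.

```python
# Sketch of dispatching one workflow run with Tibanna.
# All paths, bucket names, and field values below are illustrative placeholders.
from tibanna.core import API

job = {
    "args": {
        "cwl_main_filename": "hi-c-processing.cwl",  # hypothetical CWL entry point
        "cwl_directory_url": "https://raw.githubusercontent.com/example-org/pipelines/master/",
        "input_files": {
            "fastq_R1": {"bucket_name": "example-input-bucket", "object_key": "sample_R1.fastq.gz"},
            "fastq_R2": {"bucket_name": "example-input-bucket", "object_key": "sample_R2.fastq.gz"},
        },
        "output_S3_bucket": "example-output-bucket",
    },
    "config": {"instance_type": "t3.large", "ebs_size": 100, "log_bucket": "example-log-bucket"},
}

# Tibanna provisions the EC2 instance, stages the inputs from S3, runs the CWL
# steps inside Docker, and writes outputs and logs back to S3.
API().run_workflow(input_json=job)
```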

Analysis

The 4DN DCIC team develops and maintains a range of vetted data analysis pipelines for use by 4DN network members. Currently these include pipelines for Hi-C, ChIP-seq, ATAC-seq, and Repli-seq data processing. The 4DN web portal does not provide any tools for pipeline execution, but 4DN network members can have their data processed by the DCIC team upon request after data submission. 4DN's data analysis pipelines are implemented in CWL and fully Dockerized to ensure analysis reproducibility and portability. They opted for CWL over similar workflow specification languages (e.g. WDL, Snakemake) based on their preference for the stronger, more explicit I/O typing it provides. They use Tibanna to execute CWL workflows on AWS, and CWL workflows are version controlled through GitHub and integrated with SnoVault. Upon successful execution, processed data files become available through the web portal, where users can also view the data provenance of processed outputs.

4DN provides a few smaller tools for data analysis and visualization through the web portal. Among these, the 4DN DCIC created an open-source tool called HiGlass for visualizing very large contact matrices from Hi-C experiments. HiGlass is fully integrated into the web portal, and is also available as a stand-alone web page (https://higlass.io). 4DN also provides access to a beta version of its 4DN JupyterHub service through the web portal. The tool provides users with a workspace integrated with 4DN data that currently supports only very small analyses. A future goal for this service is to provide a fully functional analysis environment that allows users to work closely with 4DN data without having to download anything.

Access

4DN users can download data manually through the web portal or programmatically via the platform's API. Users with data submission privileges can submit data through both the web portal and a stand-alone Python application provided by the DCIC. Using their credentials, users can also access data from their own 4DN centers that has not yet been publicly released, via the web portal, the API, or JupyterHub.
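
For illustration, the sketch below fetches the metadata record for one made-up item using a portal-issued key/secret pair. The use of HTTP basic auth with an access key, the collection path, and the accession are all assumptions in the style of SnoVault-based portals, not verified 4DN specifics.

```python
# Sketch of authenticated, programmatic access to a single metadata record.
# The key/secret basic-auth scheme, collection path, and accession are assumed.
import os
import requests

BASE_URL = "https://data.4dnucleome.org"
ACCESS_KEY = os.environ["PORTAL_ACCESS_KEY"]   # hypothetical environment variables
SECRET_KEY = os.environ["PORTAL_SECRET_KEY"]

resp = requests.get(
    f"{BASE_URL}/experiment-set-replicates/4DNESEXAMPLE/",  # placeholder accession
    auth=(ACCESS_KEY, SECRET_KEY),
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
record = resp.json()

# With valid credentials, unreleased items from the user's own center should be
# visible here, while anonymous requests would only see released data.
print(record.get("accession"), record.get("status"))
```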

Users can register and log in with GitHub or Google accounts for additional services like the 4DN JupyterHub. The platform uses an OAuth-based user-permission system to define access to resources, but most services are available to everyone. This is mainly because the 4DN web portal currently does not provide tools for more extensive analysis, which would incur larger compute costs and require more controlled access to limit spending.

Harmonization and Metadata

In tandem with 4DN network working groups, the DCIC has defined its metadata structure to describe biological samples, experimental methods, data files, analysis steps, and other pertinent data. The 4DN data model is based largely on the framework developed by the ENCODE DCC. They use established ontologies to define metadata terms where possible, including Uberon for anatomy and tissue types, and EFO for cell lines and experimental methods. 4DN also uses NCBI taxon IDs and Entrez Gene IDs to support further interoperability. The DCIC curates an internal 4DN controlled vocabulary to provide definitions for emerging technologies like Hi-C and some cell lines used by 4DN network partners. Where applicable, the DCIC submits controlled vocabulary terms to EFO for future inclusion.
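
To make the role of these ontologies concrete, the sketch below shows what an ontology-annotated sample record might look like. The field names are invented and the identifiers are shown only for illustration; they do not reproduce the actual 4DN schema.

```python
# Hypothetical, simplified biosample record illustrating ontology-backed metadata.
# Field names are invented; the real 4DN schema differs.
biosample = {
    "accession": "4DNBSEXAMPLE",                                   # placeholder accession
    "organism": {"name": "human", "taxon_id": "NCBITaxon:9606"},   # NCBI Taxonomy ID for human
    "cell_line": {"name": "example hESC line", "efo_id": "EFO:XXXXXXX"},  # placeholder EFO term
    "tissue": {"name": "embryo", "uberon_id": "UBERON:0000922"},   # Uberon term for embryo
    "target_gene": {"symbol": "CTCF", "entrez_gene_id": "10664"},  # Entrez Gene ID for human CTCF
    "experiment_type": "in situ Hi-C",  # emerging assays may need a 4DN controlled-vocabulary term
}

# Because each field carries a shared identifier rather than free text, records
# annotated this way can be cross-referenced against other resources (e.g.
# ENCODE metadata) that use the same ontologies.
print(biosample["cell_line"]["efo_id"])
```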

The DCIC maintains a team of 4 data curation specialists that work closely with network partners to guide them through the submission process. This typically includes helping users select appropriate metadata terms, and increasingly has included working with partners to develop new controlled vocabulary for emerging techniques.

4DN uses version-controlled CWL workflows to define standardized data processing pipelines for common analyses. Most of these are developed through 4DN network working groups or in collaboration with the ENCODE DCC. Though the DCIC has the infrastructure to process user submissions through these pipelines, many labs opt to use their own internal pipelines instead. There does not appear to be an obvious solution to this problem, but the DCIC does publish its pipelines through GitHub and the 4DN web portal to indirectly facilitate standardized data processing across the network.

The DCIC continues to support working groups developing imaging standards and metadata definitions, but these areas have proved more challenging. While the imaging working group has been led by community members who support detailed metadata collection, the data curation team has received pushback from network members over conforming to such standards. The DCIC says most labs have different microscopes, and lab techs typically do not record the kinds of details that might help develop metadata standards for uploaded images. In some cases, 4DN partners refuse to even submit image data to the web portal because they don't want their results to be misinterpreted. The DCIC has started focusing on working with a subset of the more collaborative labs to gain some consensus on microscopy standards and metadata terms, but reproducibility issues even among this small group raise the question of whether standards are even appropriate this early in the development of these techniques.

Sustainability

Major concerns over the sustainability of the 4DN DCIC stem from cloud computing costs. As it nears the end of Phase 1 funding, the DCIC has been increasingly impacted by rising cloud storage costs resulting from exponential data growth as their platform continues to mature. The team was excited to learn about discounts available through STRIDES--which they had not yet heard about--and thinks this more targeted support would help relieve the strain of mounting infrastructure costs to some degree. They also think these issues could have been solved even without additional funding or discounts if they had been allowed to carry over unused infrastructure funds from earlier years, when storage costs were low, to cover the higher storage costs of more recent years.

Long-term cloud computing costs present a more existential threat to the sustainability of the 4DN DCIC. The most important question that remains to be addressed by the Common Fund is what will happen to data hosted on cloud storage at the end of the program’s funding, and how these costs will be covered in the interim between Phases 1 and 2. Without substantive changes to the way Common Fund Programs are supported after the 10-year limit on Common Fund grants, there’s a chance the 4DN portal will simply cease to exist 6 years from now.

4DN also faces long-term questions over data stewardship. As with other DCCs, it's unclear what personnel support will be available to maintain the DCIC's infrastructure after the 5- or 10-year funding period is up. Without continued support for the AWS infrastructure that hosts the web portal and metadata database, 4DN data will become effectively inaccessible regardless of whether there is a long-term solution for cloud storage.

Training

Internal

The 4DN DCIC team engages the broader 4DN network primarily through working groups addressing key program areas. Since the program's inception, the DCIC team has led 5 monthly working groups (below) to discuss program standards and disseminate information across 4DN network partners.

Working Group | Task
Policy | Develop data sharing policies
Samples | Pick cell lines, develop protocol standards, define metadata terms
Omic and Data Analysis | Define standard analysis pipelines
Joint Analysis | Continuation of Omic and Data Analysis work
Imaging | Develop imaging protocols and standards

The 4DN DCIC stressed several times during our meeting the difficulty of coordinating information and building consensus across the program’s 29 centers. To supplement the efforts of working groups, the DCIC data curation team provides one-on-one support to network collaborators through the data submission process. The DCIC also provides additional channels for 4DN network partners through a helpdesk email service, protocols.io, and a feedback service on the 4DN data portal. Internal training within the DCIC is centered on weekly meetings for the data curation and development teams.

External

The 4DN DCIC primarily supports external users by providing high-quality documentation and self-guided resources on its web portal. The DCIC currently provides extensive tutorials to familiarize users with the platform’s API, metadata model, and data submission pipeline. The DCIC also makes 4DN pipelines and metadata terms available to external users through the web portal.

The 4DN DCIC would like to provide more interactive training resources for its external users, but this has largely taken a backseat to the increasing demands of the 4DN network. Burak Alver (program manager) acknowledged this dynamic during our conversation:

“We should be doing more bootcamps, videos, webinars, but do not have bandwidth to do this.”

The team has been able to lead occasional training bootcamps for Hi-C data analysis at conferences despite these resource limitations. Given their ongoing obligations to the 4DN network, the DCIC team doesn't see an easy solution for increased training beyond additional funding for dedicated personnel. Not surprisingly, additional support for user training was the first thing they mentioned when asked what they could do with additional support through the CFDE.

FAIR

The 4DN DCIC is dedicated to FAIR principles and continues to develop its platform in support of these standards. 4DN currently supports FAIR data access by providing a robust search interface enabling discovery through metadata, an API for accessing portal data, a largely ontology-driven metadata database, and data provenance through CWL.

Although the 4DN platform scored highly on the Common Fund’s FAIR assessment (below), they were surprised they didn’t have a perfect score. The DCIC expressed some confusion/concerns over the assessment itself, citing a disconnect between their interpretation of FAIR and the criteria on which they were being assessed. In particular, they said “Findability” was the most ambiguous/confusing component and has been difficult to interpret and operationalize. Interestingly, 4DN scored perfectly on Findability, which the team says has been the primary focus of ongoing FAIR-related platform development. The DCIC team also expressed similar confusion over the criteria used to assess platform interoperability. The assessment docked 4DN for potential interoperability issues stemming from the internal controlled vocabulary they maintain to define new metadata terms that don’t yet appear in existing ontologies. Contrary to the assessment’s findings, the 4DN DCIC team feel they do in fact use a formal knowledge representation, and say it still isn’t clear to them why their current setup was deemed insufficient.

FAIR assessment

The DCIC team was aware of some of these issues:

“We do a good job with the F and the A and the R.”

The team has experienced occasional interoperability issues due to changing metadata definitions. Issues with the 4DN internal vocabulary highlight the difficulty of both developing metadata standards for emerging methods and developing useful measures of data FAIRness. After all, it’s unclear how metadata being defined for the first time could be made more interoperable.

Overall, the 4DN team is in favor of greater interoperability among DCCs, but isn't sure how useful this will be in practice. They had mixed feelings about the idea of data re-use postdocs and were generally wary of larger initiatives that they feel superficially combine data for the sole purpose of combining data:

“the best science seems to come from researchers who have a specific question.”

Cross-pollination

The 4DN DCIC already collaborates heavily with the ENCODE DCC, but is open to future pairwise collaborations. The ENCODE DCC has effectively served in a "DCC mentor" capacity to the 4DN group, and 4DN continues to hold monthly calls with Ben Hitz, who runs the ENCODE DCC. The relationship between 4DN and ENCODE could serve as a model for how to better support early-stage DCCs and foster greater interoperability. In addition to ENCODE, 4DN thinks it might be interesting for their platform to interface with GTEx to link changes in chromatin structure to gene expression. Beyond these pairwise interactions, the 4DN DCIC team doesn't see other immediate opportunities to integrate data from other DCCs on their platform.

The DCIC team was also generally enthusiastic about attending annual Common Fund cross-pollination events. They said they would be more interested if there was a specific focus or purpose being addressed. As an example, they suggested hosting CFDE mini-conferences where Program representatives could get together and present their solutions to commonly faced problems.

SSO (Single Sign-on)

4DN uses the OAuth authentication system for all security, including identity management, data movement, sign-on for their portal, query API permissions, and the 4DN JupyterHub. OAuth uses Google and GitHub as authentication sources, and this setup would not be considered to meet NIH requirements for SSO, as it does not support eRA Commons or ORCID.

Outcomes

Infrastructure and Resource Reuse

4DN’s infrastructure provides a good model for Common Fund DCCs opting for a cloud-based approach. Because 4DN is built largely on modular, open-source tools developed for stand-alone use (e.g. SnoVault, Tibanna, and FourSight), the platform’s core services can already be reused by other platforms without any modification. For example, DCCs that want to execute CWL workflows on AWS can simply download and use Tibanna independent of the 4DN platform. The same can be said for SnoVault and FourSight. As a caveat, 4DN’s software components are closely tied to AWS and in most cases not reusable across cloud platforms. In any case, 4DN’s well-designed, reusable software infrastructure demonstrates the long-term value Programs can provide each other and the broader community through high-quality tool development. In particular, Tibanna is already being heavily used by the ubiquitous Snakemake open source workflow project, which is completely independent from 4DN.

Although 4DN’s cloud-based infrastructure has worked well for them, it may not be a sustainable option for larger Programs with higher storage needs. 4DN’s increasing difficulties with rising storage costs also highlight the funding issues cloud-based infrastructures can create. Unlike locally hosted DCCs where infrastructure spending is typically heavily concentrated in the early years of the program, cloud-based infrastructures incur most of their costs once the platform matures. This new dynamic may create more issues with the Common Fund’s flat annual funding structure, as data growth and the resulting storage costs can make budget forecasts and future spending decisions difficult. Despite these challenges, 4DN largely views their infrastructure as a strength and think the benefits of cloud computing (e.g. on-demand scaling of computing and storage, streamlined development, no administrative overhead or start-up costs) have so far outweighed cost concerns. As the 4DN DCIC continues to mature, its platform will provide a test case for concerns over the long-term benefits of an entirely cloud-based infrastructure that should be used to inform future decisions by new Programs, the CFDE, and the Common Fund more broadly.

Challenges

The DCIC team highlighted a number of challenges it has faced over its lifetime in support of the large 4DN network, and we had a productive discussion regarding how the CFDE could provide future assistance.

Hiring and Retention. Since its inception, the 4DN DCIC has experienced ongoing issues with personnel recruitment and retention. Many of the challenges the DCIC faced during its ramp-up stemmed from a protracted hiring process that took more than 9 months to recruit enough quality developers and data curation specialists to begin platform development. 4DN says this problem was exacerbated by competition for highly skilled technical personnel from places like Google and Facebook. Given the numerous other demands the DCIC faced as it began coordinating efforts across 29 centers, the limited administrative overhead and support available through the Common Fund for hiring placed 4DN at a distinct disadvantage to industry competitors. Their experience highlights the need for more concentrated hiring support during the first year and beyond. It also highlights an ongoing need to make DCC positions more attractive to the talented personnel they require.

Training and Support. The amount of time and effort invested by the 4DN DCIC in support of its internal network leaves little time to develop external training resources and outreach programs. Because of this dynamic, the 4DN DCIC feels it operates less as a general public resource and more as direct support for a large research consortium generating a tremendous volume of data. The DCIC team feels the only solution at this point would be hiring more dedicated personnel to assist in these efforts. Because of this, 4DN highlights a potential role for the CFDE in providing additional funding and/or personnel to support external training and outreach at Programs like 4DN that simply don't have the additional bandwidth to support these efforts.

Hosting. The DCIC team has had issues getting network collaborators to host their data on the 4DN platform. They attribute this to some degree to the lack of incentives for PIs to release data before publication. In lieu of top-down enforcement mechanisms that likely wouldn’t be received well, additional outreach and assistance through CFDE for partner engagement might help PIs and their labs better understand the submission process and the larger value of the resource they’re helping to build. There is also the possibility that further incentives, like automated pipeline support through their web portal, might make hosting data on the platform more attractive for PIs. The CFDE could potentially play a role in helping cover increased computing costs to support data analysis.

FAIR Definitions. 4DN wants to use the FAIR principles but feels some of the definitions are confusing. In particular, they think the Common Fund's definition of "Findability" could be more concrete, as it has been difficult to interpret and operationalize. The DCIC team was actually surprised they didn't score 100% on the Common Fund FAIR assessment, which highlights the need for community input on FAIR assessment criteria, as well as greater access to FAIRness audits and check-ins throughout the development life-cycle of DCCs.

Flexibility of FAIR Principles. Also on the subject of FAIR principles, there is concern that if the CFDE attempts to standardize FAIR implementation across different programs, it will create an unneeded hardening of the requirements, reducing the flexibility individual programs have to implement FAIRness. 4DN's controlled vocabulary provides a good use case where FAIR principles may need to be relaxed or expanded in order to allow DCCs the flexibility to develop new metadata standards they think best describe their data. The CFDE could potentially support these efforts by facilitating more collaborative development of metadata standards across similar Programs, which could lead to both richer data annotations and greater interoperability. The ongoing collaboration between 4DN and ENCODE serves as a good example, and highlights the potential value the CFDE could provide through a more formalized "DCC mentorship" program to promote these collaborations.

STRIDES. While 4DN thinks discounts available through STRIDES will help ease the strain of rising cloud storage costs, they feel the current funding structure through the Common Fund is at odds with the exponential data and storage growth they’ve experienced over Phase 1. Until the web platform went live in 2018, 4DN storage costs were insignificant, and much of their infrastructure budget went unspent. As the portal has continued to grow, cloud storage costs have recently become a significant issue for 4DN, but as funds don’t roll over, they can’t now use those unspent funds from earlier years to cover current costs. 4DN thinks a more flexible/dynamic funding structure that acknowledges the reality of exponentially increasing cloud infrastructure costs over ramp-up may have prevented this issue even without additional funding. The CFDE could be instrumental in helping develop programs for more dynamic assistance to help cloud-based Programs deal with this problem.

Cloud Storage Costs. 4DN also faces more existential challenges stemming from long-term cloud-storage costs that the Common Fund is perhaps unprepared to address. In the immediate term, 4DN has no idea how cloud storage costs will be paid in the interim between Phases 1 and 2. In the long term, it's unclear what will happen to the entire 4DN infrastructure when the program sunsets. This hasn't been as big an issue for locally-hosted DCCs, where the majority of infrastructure costs are paid up-front and the worst-case scenario of walking away and never thinking about the data again is that they become inaccessible on some forgotten server. With cloud-hosted DCCs, storage costs continue at peak levels indefinitely whether or not the data are being used or the web portal is being maintained. In short, 4DN highlights the need for the Common Fund to substantively address a question it has largely been able to ignore until now: What happens to a DCC when its funding ends?

Financial. 4DN is also heading toward similar challenges related to personnel funding gaps and stoppages. As with the problems presented by cloud storage, 4DN is unsure how it will cover personnel costs in the interim between Phases 1 and 2. There is also the longer-term question of whether there will be funding to retain some of the current staff to maintain the system in the long run. Much like cloud-storage costs, personnel costs continue whether or not funding exists to cover them. These issues ultimately exacerbate the challenges 4DN has faced with hiring and retention, as funding uncertainty makes these positions less desirable. 4DN also mentioned the possibility of "talent flight" during these funding lapses or at the end of the program that could set progress back months or years. The CFDE could potentially play a role in helping provide more definite structure around the mid-life and end-of-life challenges Common Fund Programs experience.

Potential Solutions

Hiring and Retention. The CFDE could provide administrative overhead and/or temporary personnel to assist with hiring during ramp-up; fund postdocs, fellowships, or more prestigious opportunities through the CFDE that would help DCCs recruit and retain personnel; and provide further career opportunities to help cover funding gaps and/or provide assurances that would make DCC work more stable and attractive.

Training and Support. Create a playbook to help Programs navigate challenges faced over various life-stages: start-up, end of Phase 1, end of Phase 2 and beyond. Provide CFDE resources, information, and possibly personnel to help early programs deal with start-up, and mature programs deal with middle- and end-of-life issues. Provide personnel to work closely with developers to help them create engaging documentation and training resources. Help organize and facilitate external training opportunities like Hackathons, webinars, MOOCs, etc. Provide consultation services through CFDE for FAIRness questions and audits. The CFDE could also provide guidance on how DCCs can operationalize FAIRness through better infrastructure and design principles. Host forums or periodic meetings to showcase exemplar Common Fund programs, allowing developers to share their platform design with other teams.

Interoperability. The CFDE could create and maintain a metadata asset store for programs to share their metadata definitions that don’t yet fit into existing ontologies. The CFDE could also provide dedicated personnel to work with these groups to help them get their metadata definitions added to existing ontologies.

Promote Pairwise Interactions. Provide a "matchmaking service" to help Common Fund Programs find partner groups with shared challenges who could benefit from collaboration. Provide a "DCC Mentor" service to pair new DCCs with more established groups who work with similar data or who have faced similar challenges. The partnership between ENCODE and 4DN could provide a model.

Promote Sustainable Solutions for Cloud Infrastructure. To address issues with back-loaded cloud storage costs, the CFDE could work with the larger Common Fund to create more flexible/dynamic funding structures and provide temporary funds through the CFDE for cloud costs between Phases 1 and 2. Common Fund programs need real solutions for long-term data stewardship and financial support after their 10-year funding limit is reached. In particular, the CFDE could work with the larger NIH and Common Fund programs to answer the outstanding question of what happens to cloud-hosted data and the supporting infrastructure after programs end. The CFDE could also create a data stewardship program to address the sustainability concerns of both cloud-based and locally-hosted DCCs. This could include dedicated CFDE personnel for ongoing support and maintenance after program sunset, or additional funding to allow DCCs to retain some of their personnel after the program ends.

Potential Projects

The CFDE could develop a metadata search framework/service that uses natural language processing to find semantically similar metadata terms in addition to textually similar ones. For example, a text-based search for “heart attack” wouldn’t find datasets annotated as “myocardial infarction,” but a semantic search would. This model could be provided as a low-level API service or modular tool that could be easily adapted to a variety of purposes. Potential applications of a semantic-search service include powering DCC data portal searches, as well as automated metadata mapping across DCCs. This functionality could ultimately make the CFDE search portal significantly more useful considering the variety of ontologies employed for different purposes across Common Fund programs that may not always use textually similar terms for the same things. Beyond a CFDE search portal, a semantic search tool could help harmonize metadata and promote interoperability among programs without the need to enforce overly-restrictive metadata standards that place an undue burden on DCCs and their programs.
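
A minimal sketch of the idea appears below. It assumes some sentence-embedding model is available behind an embed() helper (the model choice is left open); the helper name, the example vocabulary, and the ranking scheme are all illustrative, not part of any existing CFDE or 4DN service.

```python
# Minimal sketch of semantic metadata search: rank vocabulary terms by
# embedding similarity to a free-text query instead of by exact text match.
# The embed() callable stands in for any sentence-embedding model.
from typing import Callable, List, Tuple

import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_search(
    query: str,
    vocabulary: List[str],
    embed: Callable[[List[str]], np.ndarray],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k vocabulary terms most semantically similar to the query."""
    vectors = embed(vocabulary + [query])          # one vector per term, plus the query
    term_vecs, query_vec = vectors[:-1], vectors[-1]
    scored = [(term, cosine(vec, query_vec)) for term, vec in zip(vocabulary, term_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# A text-based search for "heart attack" would miss "myocardial infarction",
# but an embedding-based ranking should place it near the top, e.g.:
# terms = ["myocardial infarction", "chromatin conformation", "gene expression"]
# print(semantic_search("heart attack", terms, embed=my_embedding_model))
```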

Game Changers

Creating a Common Fund Metadata Asset Store. Existing ontologies provide powerful tools for data interoperability, but Common Fund programs like 4DN often need to define new metadata terms as they work to describe emerging data and experimental types. The controlled vocabulary used by 4DN captures important data features not defined by previous ontologies, but also creates interoperability issues with other DCCs. The CFDE could foster greater interoperability and more collaborative development of new metadata through a central repository allowing programs to define new terms in a common format and share them with other Common Fund programs. As part of this, the CFDE could also work with groups to help them get new metadata terms incorporated into existing ontologies.

Creating a Common Fund Mentorship Program. The ongoing collaboration between 4DN and ENCODE could serve as a model going forward that could help upstart DCCs navigate their first years and foster greater interoperability across Common Fund programs. From its inception, ENCODE has played a vital role in helping 4DN develop their infrastructure, define metadata terms, and learn strategies for coordinating across the 4DN network. Beyond making 4DN's first years significantly easier, the relationship has also led to increased levels of interoperability between the two platforms. By formalizing future collaboration through a Common Fund Mentorship Program, the CFDE could both help upstart DCCs gain their footing with guidance from a more experienced team and foster greater interoperability through fruitful pairwise collaborations.

Enabling Data Analysis on the Cloud. While 4DN has the infrastructure to provide robust cloud-based analysis through its web portal, the DCIC does not receive extensive funding to support this capability. Allowing users to compute alongside their data would incentivize program members to host data on their DCC's platform and increase overall reproducibility by enforcing data analysis standards through shared pipelines and workspaces. If the CFDE could provide additional funding or computing resources to help pay for, support, develop, and refine analysis tools hosted on program portals, this could increase the scale, quality, and reusability of data hosted across Common Fund Programs.

Agenda

Day 1
9-9:30am Introductions
Short introductions from engagement team members and attending DCC members. The overarching goal for the engagement team is to collect values and process data about the DCC. Values data will include things like: mission, vision, goals, stakeholders, and challenges. Process data includes: data types and formats maintained, tools and resources owned by the DCC that they would like to see in broader use, points of contact for follow-up on technical resources, etc.

9:30-10am DCC Overview
Short overview of DCC. Can be formal or informal. Suggested topics to cover: What is your vision for your organization? What big problems are you trying to solve? What are your big goals for the next year? Who do you see as your most important users/stakeholders? What project(s) is currently taking up the bulk of your effort/time? What areas of your organization are you putting the most resources into? What is the rough composition of your user base in terms of discipline? Do you have any challenges that are blocking implementation of your current goals?

10am-Noon Goals Assessment
An exercise to get an idea of what types of things are important, what types of things are challenges, what you dedicate your time/resources towards, and what types of things are not current priorities. Given a list of common goals provided by the engagement team, plus any additional goals the DCC would like to add, DCC members will prioritize goals along two dimensions: timescale ("Solved/Finished", "Current-Input wanted", "Current-Handled", "Future-planned", "Future-unplanned", "NA to our org") and desirability ("Critical", "Nice to have", "Neutral", "Unnecessary", "NA to our org"). The engagement team will work to understand the reasons for prioritization, but will not actively participate in making or guiding decisions.

Goal List

  • Increase end user engagement X% over Y years
  • Move data to cloud
  • Metadata harmonized within DCC
  • Metadata harmonized with _________
  • Metadata harmonized across Common Fund
  • Implement new service/pipeline ____________
  • Increase number of eyeballs at your site
  • CF Data Portal
  • Single Sign On
  • Pre-filtered/harmonized data conglomerations
  • A dashboard for monitoring data in cloud
  • User-led training for end users (i.e. written tutorials)
  • Webinars, MOOCs, or similar outreach/trainings for end users
  • In-person, instructor led trainings for end users
  • An NIH cloud playbook
  • Full Stacks access
  • Developing a data management plan
  • Increased FAIRness
  • Governance role in CFDE

Lunch: as a group, or separate, whatever is convenient for 4DN staff

1-2pm Open discussion (with breaks)
Using the results of the morning's exercise and a collaborative format, iteratively discuss goals, blockers, etc., such that the DCC agrees that the engagement team can accurately describe their answers, motivations, and goals.

Topics:
Infrastructure:

  • Do you intend to host data on a cloud service?
  • Have you already started using cloud hosting? If yes:
    • Approximately how much of your data have you uploaded? How long did that take? How are you tracking progress?
    • What challenges have you faced?
    • How have you dealt with those challenges?
  • What potential future problems with cloud hosting are you watching for?
  • Does your org use eRA Commons IDs? Do the IDs meet your sign on needs?
    • If yes, did you have/are you having challenges implementing them?
    • If no, what do you use? What advantages does your system provide your org?
Use cases:
  • What is the rough composition of your user base in terms of discipline?
  • What if any, use cases do you have documented? Undocumented?
  • What things do people currently love to do with your data?
  • What things would people love to do with your data, but currently can’t (or can’t easily)?
  • What pipelines are best suited to your data types?
  • What are the challenges associated with those desired uses?
  • What other kinds of users would you want to attract to your data?
Review of metadata:
  • What metadata is important for your org? For your users?
  • Do all of your datasets have approximately the same metadata? Or do you have many levels of completeness?
  • Do you have any data already linked to outside resources?
    • Did you find the linking process easy? Challenging? Why?
  • What kinds of datasets would you like to link into your collection?
  • What implementation and schemas do you already have (or want)?
  • What standards do you have (or want)?
  • What automated systems do you currently have for obtaining metadata and raw data?
Training:
  • What training resources do you already have?
  • What training resources would you like to offer? On what timescale?
  • What challenges keep you from offering the training you’d like?
Policies:
  • How do users currently obtain access to your data?
  • What are your concerns about human data protection?
  • What potential challenges do you see in bringing in new datasets?
NIH Cloud Guidebook:
  • What would you like to see included?
  • What would be better left to individual DCCs to decide?
  • Would you be interested in contributing to it?
FAIR:
  • Has your org done any self assessments or outside assessments for FAIRness?
  • Are there any aspects of FAIR that are particularly important for your org?
  • Are there any aspects of FAIR that your org is not interested in?
  • What potential challenges do you see in making your data more FAIR?
Other:
  • What search terms would make your data stand out in a shared DC search engine?
  • Does your org have any dream initiatives that could be realized with extra resources? What resources would you need?
  • If you had free access to a Google Engineer for a month, what project would you give them?
  • Any other topics/questions the DCC would like to cover

Day 2
9-10am Review of goals and CFC involvement
A quick review of what topics are priorities for the DCC with suggestions from engagement team on how we can help.

10-noon Open Discussion, Thoroughness checking
DCC reflection on suggestions, open discussion to find shared solutions. Touch on any questions not covered previously and ensure we have information on:

  • datatypes they maintain
  • formats etc of same
  • tools / resources they think might be useful for the project
  • points of contact “Who is the best point of contact for your metadata schemas, your use cases, the survey of all your data types?”
  • Who would like to be added to our governance mailing list?
    Or contact info/instructions on how to get that information offline.