Table of Contents:
Kids First Overview
Cross Cutting Metadata Models
Self-Governed Metadata Standards
Findability and Interoperability
CF Data Portal
Major Use Cases
We met with the Kids First group at the Children's Hospital of Philadelphia for a day and a half, beginning Tuesday, June 25, 2019. During the meeting, we used the agenda at the end of this document as an informal guide for structuring the day. Representatives in attendance from the CFDE were: Amanda Charbonneau (UCD), Brian O'Connor (Bionimbus), Brian Osbourne (BioTeam), Titus Brown (UCD), and Owen White (UMB). The primary representatives from Kids First were Adam Resnick (PI), Allison Heath (Director of Data Technology and Innovation), Jena Lilly (Director of Operations & Strategic Planning), Bailey Farrow (Technical Project Manager), Yuankun Zhu (Bioinformatics Engineer Supervisor), and Tatiana Patton (Clinical Research Program Manager). We also met briefly with Amanda Haddock, a patient advocate who is the president and co-founder of the Dragon Master Foundation (DMF). After the death of her young son from brain cancer in 2010, Amanda started the DMF to help speed biomedical discovery and empower cancer researchers, and she works closely with the Kids First staff.
The engagement team began by reviewing their goals for the meeting. These goals include learning about the structure and goals of Kids First, including technical specifications about the data they host, as well as information about training, organization, and the overall set of priorities for their group. In turn, Kids First provided us with a wide-ranging overview of their organization, and gave us a great deal of insight about the intersection of big data and patient care.
Kids First Overview
The Kids First Data Resource Center (KF DRC) was created to improve collaborations across disease communities and to help translate research into personalized medicine, with a focus on childhood cancers and structural defects. Its primary means of doing so is a portal that connects and coordinates the huge amounts of data being generated under the Gabriella Miller Kids First Research Act. This act, signed into law in 2014, was a response to advocacy groups pressuring Congress and, most notably, to the eponymous Gabriella Miller, a ten-year-old girl dying from an inoperable brain tumor. During her illness, Gabriella was an outspoken activist for childhood diseases, and shortly before her death, in a widely circulated video, she challenged Congress to stop talking and start doing something.
Gabriella Miller’s call for action is clearly evident in all of the activities at the KF DRC. Their raison d'être revolves around the speed at which they make more data more accessible to more researchers, because that speed is vital to improving outcomes for their patients. They were encouraged by the concept of the CFDE, because “...a commons accelerates progress and increases accessibility, and is a more efficient way to drive from data to knowledge to impact.” Until meeting with KF staff, we had thought about reducing the time needed to gather data or run analyses in terms of money saved and convenience for the researcher. However, for the people at KF, time is the obstacle to translational research. As the KF DRC staff made clear multiple times during our visit, they measure the passage of time in children's lives.
The Gabriella Miller Kids First Research Act reappropriated the Presidential Election Campaign Fund to instead fund a 10 year, $12.6 million research program into childhood diseases. Because the research program was rapidly created by Congress, there was almost no organizational framework or direction for the project. The NIH put out RFAs for a sequencing center and data coordinating center in 2016, but researcher-led sequencing began almost immediately. Structural birth defect researchers started to sequence trios, while the cancer researchers focused on tumor vs. normal comparisons. Data was collected across all pediatric cancers and structural defects, by a wide variety of people with very different goals, funding, and sophistication levels. For example, structural birth defect researchers often want to identify phenotypes that are underpinned by genetics and rely on standards, ontologies, etc., while this strategy is rarely used in pediatrics. Many of the data generators are small groups, funded by philanthropy or other sources, including the NIH and investigator driven grants. The differing research norms between disciplines, as well as the overall lack of cohesion, created tension between these various research communities and produced wide-ranging but poorly overlapping sets of data.
The Kids First Data Resource Center, based at the Children’s Hospital of Philadelphia (CHOP), officially received funding in mid-2017 and has made an enormous amount of progress, especially given the resources available and the complexity of the situation they were charged with managing. In less than a year, Kids First was able to deploy the alpha version of their portal and could support access to data. At the time of our meeting, the DRC was not quite two years old, but already had a robust data access infrastructure, a sophisticated query system, and a responsive help team that supports about 2000 users per month interacting with almost three petabytes of data.
Harmonized data is vital to the operations of Kids First because the clinical researchers using their data must be able to access patient phenotypes and clinical variables associated with information derived from a diverse set of data generation sites. This represents a significant challenge for them, because when the KF DRC started, data collection had already been in progress for about three years by researchers on 23 different X01 grants as well as four sequencing centers, with no clear standards for metadata, analysis or storage. Melissa Haendel’s group was brought in to handle metadata curation, and there are five to ten people on the clinical and phenotypic side of the operation who work on harmonization. KF have re-curated the metadata multiple times; for instance, for diagnosis phenotype they have added new terms to the Human Phenotype Ontology and Mondo (the Monarch Disease Ontology) three times over the last year.
They face several challenges in the type of metadata that they collect, which are partly a result of harmonization and partly a result of the complexity of their clinical domain. For example, datasets at KF frequently incorporate time. Clinical phenotypic data has “events”, such as doctor's visits and updated diagnoses, and clinicians may add in data after samples are included in the database, so there is longitudinal information and diagnoses can change, or expand, over time. The difficulty is further compounded by the broad, rather unfocused mandate of the Gabriella Miller Kids First Research Act. The research scope is “anybody and everybody” with a pediatric cancer or structural birth defect, and so encompasses hundreds of distinct modalities. Due to both a given patient’s needs and a researcher’s interests, clinical records for a given patient may have incredibly rich and deep metadata about the specifics of their focus area, and only passing reference to others. For example, in epilepsy studies, coordinators who are collecting brain tumor data may just say “seizure”, whereas epilepsy researchers will detail the type of seizures. So specialists from different communities generally do not speak in the same level of detail about the same symptom.
Datatype compatibility is also a concern. For example, although the KF data includes RNA-Seq data from only two cancer research groups, which use the same strategies, combining their data is not trivial. To account for batch effects and allow for differing downstream analyses, the KF DRC carries over all metadata from these studies. There is a constant need to revisit pre-analytic variables in order to remove batch effects, and to iteratively refine what metadata to collect and to pass on to users.
The KF DRC handles these metadata idiosyncrasies by taking a practical, minimalistic approach, and has used it as an opportunity to build an ecosystem that is vertically and horizontally integrated. The KF metadata model uses data definitions to make data findable, and allows the user to see as much metadata as is available. The KF DRC data curation tool imports/loads from user spreadsheets, and serves as a first-level data ingestion tool for the team. KF has taken the time to discuss metadata terms with their users, and so in most instances, the appropriate terminology is used in the portal. In these cases raw spreadsheet data from a data generator can be computationally imported, but all data still requires some manual curation. When metadata is harmonized, KF tends to err on the side of preserving the original metadata terms, e.g. the text from the source Excel file, as well as the harmonized data.
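The practice of keeping the source text next to the harmonized value can be illustrated with a minimal sketch. This is not the KF DRC curation tool; the lookup table, the class name `CuratedValue`, and the function `curate` are hypothetical, and the single ontology mapping is included only as an illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table for illustration only. A real curation effort
# would use mappings reviewed by curators, not a hard-coded dictionary.
HARMONIZED_TERMS = {
    "seizure": "HP:0001250",
    "seizures": "HP:0001250",
}


@dataclass(frozen=True)
class CuratedValue:
    source_text: str              # verbatim text from the contributor's spreadsheet
    harmonized_id: Optional[str]  # ontology ID, or None if manual curation is needed


def curate(source_text: str) -> CuratedValue:
    """Attach a harmonized ID when one is known, never discarding the source text."""
    key = source_text.strip().lower()
    return CuratedValue(source_text=source_text, harmonized_id=HARMONIZED_TERMS.get(key))


rows = ["Seizure", "focal impaired awareness seizure"]
curated = [curate(r) for r in rows]

# Values without a known mapping are queued for a human curator.
needs_review = [c.source_text for c in curated if c.harmonized_id is None]
```

Because the original string is always retained, a later re-curation pass can start again from the contributor's own wording rather than from an earlier harmonization decision.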
They noted that legacy data will also be a significant problem; since data collection pre-dated the curation effort, the early KF datasets tend to have idiosyncratic pipelines and metadata terms. Even data generated since mid-2017 has these issues, as the KF DRC has no authority to (or interest in) dictating a metadata standard. Pediatric oncology is full of rare diseases, and research is often driven by single point labs. These labs often have their own specialized metadata requirements and pipelines dictated by patient care needs, which are not amenable to standardized metadata models. Further, for some data generators, there is still a limit to how much work people are willing to do to supply KF with metadata. A data generator will give KF a spreadsheet, or raw forms, but is often unwilling to do any harmonization or supply clarification of terms. Generally, data providers are not funded to curate that data for KF, and so they have no incentive to increase their effort.
KF minimizes technical harmonization variables by ensuring that everything uses hg38 as a reference, and they suggest the Broad Institute’s Genome Analysis Toolkit as a best practice. Any data they receive that is processed using other tools is reprocessed by KF before being added to the collection. They run the genomic or RNA-Seq workflows using CWL, have Dockerized tools for each step in those pipelines, and use the NCI Genomic Data Commons data dictionary to capture other technical variables like read group and read length info. For genomic data, which accounts for the vast majority of their portfolio, these practices make harmonizing technical variables relatively straightforward.
Cross cutting metadata models
Harmonization is difficult and multifaceted. Just as the CFDE technical team would agree, the KF staff stated that a cross cutting model will not serve as a magic bullet to solve all harmonization problems. The KF team agrees that a common metadata model will be useful for searching across CF DCCs, but it would not inherently solve the types of harmonization issues they face. Simply stated, the C2M2 is likely to just bring a lot of incongruous metadata together into a single collection.
For clinical variables, the KF DRC relies heavily on Fast Healthcare Interoperability Resources (FHIR), a standard for health care data exchange developed by HL7. FHIR can handle 60-70% of the KF DRC use cases right now and is an open source, flexible framework that they can add data resources to as needed; the most difficult part of implementing it was creating a FHIR server. The core data model of FHIR is Argonaut, which Google, Apple, and Microsoft have bought into, along with Epic, the system that manages more than two-thirds of all medical records in the United States. For hospitals and communities that want to participate in research, FHIR is a bare-minimum point of contact, and it is even patient accessible. For instance, you can use FHIR to integrate your personal medical records with the Health App on your iPhone.
The KF DRC staff were very excited about FHIR and its long term implications for managing patient metadata. Currently, FHIR doesn’t really work for non-human data, and the specimen and genomic sequencing data in the FHIR system is shallow, and based on what a clinician would see in an electronic health record. Typically, hospitals will report only a digest of sequence analysis. FHIR is very focused on clinical reporting of known data, but KF is sure that it could be extended to deal with the research space. In fact, there is already some work in that area. FHIR genomics can deal with whole genome sequences, and GA4GH seems to be adopting it as a standard. KF staff also pointed out that within two years, they expect every patient at CHOP, nearly 30,000 children a year, to be routinely sequenced, and stated that “the minute clinical data will routinely include whole genome data, the FHIR community will immediately adopt it.” Using FHIR also aligns well with the overall goal of the KF DRC, which is to improve patient outcomes, and that relies on keeping -omic data integrated with its phenotypic data.
Self-governed metadata standards
Due in a large part to the overwhelming variety of metadata in their datasets and the fast pace of their field, KF DRC staff did not see much value in adopting overall metadata standards, regardless of who governs them. Instead, they stressed the need to be forward thinking and build resources for the way the world actually works rather than the way we might like it to be. “We cannot build rules and infrastructure that prevent you from doing all the things users think they want to do.”
They also pointed out that even simple, seemingly easy-sounding standards can be complicated in practice. For example, we suggested that one minimal standard might be that metadata has to include a Global Unique IDentifier (GUID), so that a user could be sure they had a specific dataset. Allison Heath rightly pointed out “when you talk about files, this all works...but when you talk about rows in a database… how does that work?” Datapoints in the KF dataset aren’t a single, static genome sequence; they’re children who are often under aggressive treatments for deadly diseases, and their phenotypes are constantly being updated. Given these kinds of scenarios, GUIDs would have to be handled carefully, and GUIDs might need to be assigned to individuals rather than datasets. At the scale of the overall dataset, like records from FHIR, changes might happen daily, and even each individual person might ‘version’ with every appointment.
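One way to picture the "GUID per individual, versioned at each appointment" idea is the sketch below. It is an assumption-laden illustration, not anything KF described implementing: the class `ParticipantRecord`, its methods, and the `<guid>.v<n>` identifier format are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ParticipantRecord:
    """A stable identifier for a person, plus an append-only list of snapshots."""
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))
    versions: list = field(default_factory=list)

    def update(self, phenotypes: dict) -> str:
        """Store a new snapshot (e.g. after an appointment) and return its versioned ID."""
        self.versions.append(dict(phenotypes))  # copy so later edits do not alter history
        return f"{self.guid}.v{len(self.versions)}"

    def at(self, version: int) -> dict:
        """Return the snapshot a past analysis used (versions are numbered from 1)."""
        return self.versions[version - 1]


record = ParticipantRecord()
first_id = record.update({"diagnosis": "seizure"})
second_id = record.update({"diagnosis": "focal seizure", "weight_kg": 21.5})
```

Here the GUID answers "which child?" while the version suffix answers "as of which visit?", so a citation to `first_id` stays resolvable even after the phenotype record has moved on.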
While uninterested in top-down standards, KF is extremely interested in the idea of building a self sustaining community, both for researchers and for DCC staffers themselves. Right now there is a communication gulf between staff at different DCCs on many important topics such as harmonization, data models, and dealing with protected data. Creating spaces for DCCs to engage with each other would go a long way towards simplifying both the current effort to improve interoperability and future efforts to integrate between datasets. In the short term, more socially integrated DCCs will be more likely to find compromises and shared ground on topics like authentication. In the long term, they will be more likely to choose compatible solutions to future shared technological problems, because they will have been able to share ideas. Creating a DCC network will also improve the overall sustainability of Common Fund projects, as older DCCs can share institutional knowledge with up and coming DCCs, and perhaps even serve as talent reservoirs. For closely aligned projects, seeding new DCCs with experienced, onboarded personnel from sunsetting DCCs would allow them to get up to speed more quickly, and to benefit from prior Common Fund investments. One observation from the meeting was that building a true community of DCCs will be most successful when all the Common Fund DCCs participate; it will be far less successful if an isolated set of DCCs (or just two) work to cross-reference their data or build common standards.
The KF DRC is very interested in FAIR principles, primarily due to their belief that making the data more accessible to more researchers, faster, is the key to improving outcomes for their patients. The KF staff are dedicated to every element of making data more accessible. As an independent entity, their DRC dedicates an enormous amount of time and energy to all elements of making their data assets findable, accessible, interoperable, and reusable. What was evident from talking to them, however, was that beyond their own expertise there are no clear guidelines for operationalizing FAIRness or its improvement.
Findability and Interoperability
Findability and Interoperability are inextricably linked in most of KF’s use cases, and both the KF DRC and CHOP are interested in supporting environment search and interconnectivity. This is reflected by the KF portal, which potentially has the most advanced query capability among DCCs, enabling users to search across a wide array of variables and see the data in real time. Still, in keeping with their ‘we have to accept the world for what it is’ philosophy, they would like to replace it with an even more sophisticated system. “We’ve got to break out of facets, they just don’t work!” They think a free text, Google-like search is necessary, especially for metadata terms that may be recorded in multiple ways: RNAseq vs RNA-seq vs rna seq. Search at KF mostly starts with diagnosis and phenotype, but depending on the intended cohort, they may also start with “genomic” terms or biospecimen information, e.g. tumor descriptions. Researchers want to drill down into the data, and even very complicated boolean searches can’t always build a cohort that corresponds to their study question in the way a free text search might.
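The spelling-variant problem mentioned above (RNAseq vs RNA-seq vs rna seq) can be sketched with a simple normalization step. This is only an illustration of the idea, not the KF portal's search logic; the function names `normalize_term` and `search` are hypothetical.

```python
import re


def normalize_term(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def search(query: str, records: list) -> list:
    """Return records whose normalized form contains the normalized query."""
    needle = normalize_term(query)
    return [r for r in records if needle in normalize_term(r)]


records = ["RNAseq", "RNA-seq", "rna seq", "WGS"]
matches = search("RNA-Seq", records)
```

Stripping spaces and punctuation makes the three spellings collide on purpose, at the cost of also matching across word boundaries; a production free-text search would more likely rely on a search engine's analyzers and synonym lists.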
Cohort building is also an iterative process. Researchers want to be able to share their current query state and send it to collaborators, who should also be able to edit those queries. KF wants to support saving queries and sharing them with other researchers, e.g., operations like save cohort, share cohort, and manipulate cohort. There are also longitudinal aspects of cohort building, family structure, and complex pedigrees, all of which is information KF would like to conserve. They also want to allow users to search for information about events over time. True to KF’s primary focus on quality of care, they are aware that the end user must be supplied with all information about the patient as it changes over time.
KF also discussed several forward thinking ideas about genomics search, again in the spirit of not hamstringing the creativity of future researchers. These included ‘layering’, the idea of using entire datasets as a way to think about findability. Here, a clinician might start with genomic data to look for a specific feature and then, based on their interpretation of that data, realize that they need another specific layer of data to overlay onto it: “to check this SNP, now I need single cell data”. Similarly, they suggested that one way to deal with interoperability would be to set some sort of interoperability thresholds that are computed as a user sets up an analysis. For example, when a user saves an analysis pipeline file, Cavatica, Kids First’s data analysis platform, might issue a warning that use of RNA-Seq dataset A with pipeline B might result in overestimation of expression differences. This way a naive user isn’t left to their own devices to perform a potentially ill-advised experiment, but expert users still have the freedom to explore. They also suggested other user alerts that might push ahead stalled research, such as alerting users when the files they have downloaded have new metadata, or when files that look like ones they have previously used become available. KF is also very interested in fostering social communities that may be formed by users of their portal. Essentially, KF is thinking about ways to allow researchers to search for exactly the data they need, for the specific problem at hand, rather than trying to build systems based on constrained metadata terms.
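The "warn, but do not block" idea for dataset and pipeline combinations might look something like the following sketch. It is purely illustrative and does not describe Cavatica; the function name and the metadata keys (`reference_genome`, `library_prep`) are assumptions chosen for the example.

```python
def interoperability_warnings(dataset: dict, pipeline_expects: dict) -> list:
    """Compare a dataset's pre-analytic metadata with what a pipeline assumes.

    Returns human-readable warnings; an empty list means no mismatch was found.
    The analysis is never blocked, so expert users remain free to proceed.
    """
    warnings = []
    for key, expected in pipeline_expects.items():
        actual = dataset.get(key)
        if actual is None:
            warnings.append(f"{key}: not recorded for this dataset")
        elif actual != expected:
            warnings.append(f"{key}: pipeline expects {expected!r}, dataset has {actual!r}")
    return warnings


# Hypothetical metadata for "RNA-Seq dataset A" and "pipeline B".
dataset_a = {"reference_genome": "hg38", "library_prep": "ribo-depleted"}
pipeline_b = {"reference_genome": "hg38", "library_prep": "polyA"}

notes = interoperability_warnings(dataset_a, pipeline_b)
```

Returning a list of messages rather than raising an error keeps the check advisory, which matches the stated goal of guiding naive users without constraining experienced ones.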
Finally, KF talked about FAIRness of citations. The KF DRC, as well as their patients and data contributors would like to be able to answer questions like:
- If I find an interesting cohort, how many of the individuals have been used in publications?
- If I own this data as KF, how do I know it was used?
- How many people are using my child's data?
Kids First has essentially solved accessibility from the standpoint of their users being able to get KF data. Their portal allows anyone to query the unprotected data and requires only a simple sign up, and the portal itself has sophisticated logic for advanced queries, based on real use cases. A user with access to protected data can quickly add their credentials (during our visit their demonstration took about thirty seconds), and users can push cohorts selected on the portal directly to Cavatica, their data processing platform provided by Seven Bridges.
However, Kids First stressed that accessibility in terms of understandable metadata, as typically described by FAIR, is only part of the true problem of accessibility. Researchers have few problems finding and accessing data at KF; however, they have little or no ability to effectively engage with the resources. So, while the data is technologically accessible, it is not intellectually accessible. Small datasets (like those for rare diseases) require the cloud so they can be compared and analyzed. Clinicians have specific, sophisticated questions they want to ask with the data, as well as an intimate knowledge of the datasets and their metadata, but they don’t have the bioinformatics knowledge to do an analysis on the cloud. KF points out that the difficulty with ‘Big Data’ is not just that we have too much data; it’s that the cloud is providing new opportunities that are currently accessible only to a handful of specially trained people.
The KF DRC is also interested in accessibility in the other direction. That is, how accessible is their system to people who want to input data. They discussed this in two contexts. First, they want to ensure their system is accessible to data creators, such as sequencing centers and researchers, who currently have little incentive to upload complete metadata (as discussed in Cross cutting metadata models). The easier the process of inputting data, presumably, the more likely a given researcher will comply.
Second, KF talked about making their infrastructure accessible to patients. Patients and their families should have the ability to upload their own data to KF, and “get the data out of the hospital”. Whereas the NIH is often focused on keeping human metadata private, KF and the patient advocate we spoke with stress the importance of liberating metadata. Patients at Kids First often have terminal diseases, and interventions like experimental treatments might be their only hope. Families are therefore desperate to get their child's DNA sequence into the hands of as many researchers as they can, as quickly as possible. Currently, many families organize their own DNA file sharing via Facebook, a practice that KF is very concerned might become predatory, e.g. pitches like ‘for only $10,000 I’ll analyze your kid’s genome’. Similarly, as noted in the introduction, embargo periods for X01 awardees, while short for research, seem like an eternity to families. Staff at KF related having to face parents asking questions like: “Why did you sequence my son’s tumor if it did not impact his treatment?”. Giving patients the ability to drive the accessibility of their own data empowers patients who have no hope or comparators.
KF has already done a great deal of work in liberating metadata. Weight, for instance, is not identifiable, but is hidden by dbGaP. The KF DRC was able to show that other NIH data repositories, such as the Genomic Data Commons, showed weight and a number of other metadata terms without requiring a log-in. By examining the differential between what metadata KF was initially exposing and what was available elsewhere, KF was able to work with their PO to update the KF system to align with the GDC. Similarly, KF noted that institutional certificates often don’t match the consent forms signed by patients. There are four sequencing centers where the many KF contributors send samples. The genomic data flows through the sequencing centers, while phenotypic and clinical data comes from the original contributing clinicians, and in many cases, KF had to reach back to the clinicians to get access to the raw data. Some of the early consent agreements had prohibitions on multi-cohort use, however when KF went back and revisited data use agreements they were able to remove the prohibitions. By systematically reviewing documents, and working with their program officer, KF was able to retroactively change several institutional certificates to allow more metadata to be made public, in better accordance with the agreements the patients actually signed. As of our meeting, the default agreement at KF with their X01 grantees is that all of their phenotypes are public.
Kids First staff discussed the idea of reusability in several ways. First, they addressed the idea of licensing. As of our meeting, there are no NIH-wide standards or guidelines for licensing. While they did not think that the NIH should choose a single license everyone must use, they suggested that it would be useful to have a framework. For instance, the Common Fund might issue guidelines that some small set of pre-existing licensing options are compatible with CF principles, so that researchers can choose among them.
Generally speaking, and in keeping with their goals, Kids First data is very reusable; however, their infrastructure is less so. Data at KF is processed in a reproducible way, using documented workflows. Between GitHub hosted documentation and CWL workflows, an interested and sophisticated user can find all the internal metadata such as program parameters and versions of software. However, these parameters are not exposed by default to the end user. This decision is still an open question at KF, as they try to strike a balance between transparency and confusion. As previously discussed, most of their users are already overwhelmed with just their own data analysis, so adding all of these parameters to the default view would likely make the data less accessible overall. KF also noted that exposing their pipelines by default may imply a level of authority they are uncomfortable with: “My foremost concern is that somehow we are seen as defining what is best practice.” They want to be sure they have a light touch and encourage community building around what pipelines they are running.
In terms of the reusability of their infrastructure, various components ranged from ‘not at all’ to ‘completely reusable’. Their metadata model as a whole is not reusable. As discussed previously, they took a very practical approach to metadata as necessitated by operational needs, so their model is tailored to their data. However, FHIR, the standard they’ve adopted for patient variables, is very reusable, as is the KF data portal site. They also pointed out that their authentication system is universalizable already; it’s just burdensome to set up. However, the biggest challenge is contracts, not technology. If NIH built the authentication system, there would be no need for an ATO, but right now everyone has to build their own system, and that requires FISMA moderate and an ATO, which are expensive and time consuming sign-offs.
Covered in Brian O’Connor’s report.
Kids First already has all of their data in the cloud, and has a system in place for ingest, so they were not initially interested in the Data Dashboard as proposed by the CFDE. However, they imagine that a dashboard with a few added features would be very useful. In particular, one that manages patient uploaded data, or one that tracks usage or processing type statistics such as:
- What processing stage data is in
  - e.g. “harmonized”, “in the process of being harmonized”
- How the data is being used by others
  - e.g. “downloaded”, “used in virtual study”
- How many people are using my child's data?
CF Data Portal
In keeping with their mission to get more data, to more researchers, faster, Kids First is very interested in a CFDE portal to connect users to new datasets. The illnesses faced by the children that come to CHOP and KF are mostly rare diseases, and a single hospital might only see one case every few years. This means that research is limited by numbers of samples, and researchers want to interconnect with more and bigger datasets and cohorts. There are also ‘failure of development’ diseases, where researchers want to interact with normal comparators as well as make novel connections to data that might be relevant, but that exists in another disease space. No single institution or single dataset has enough information, so there’s a prebuilt requirement to collaborate in the pediatric space to both search across platforms and to be able to integrate the results in order to compete with more common diseases (prostate, breast, etc.). However, as previously noted, KF doesn’t think investing in one overarching metadata model will entirely solve this problem. Instead, they picture something more like portable queries. Rather than each DCC changing to a shared overall model, DCCs could provide a query guide to their internal model, and engage with each other to map queries to other DCCs. This might be done through a single plugin shared across DCCs that is connected to each underlying database in such a way that the user can query any site on a given set of search terms and get back the sensible response for that dataset. Or it could be done by creating a utility that takes a given user query at one DCC and translates it into the corresponding queries for other DCCs.
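The second option, a utility that translates one query into each DCC's local terms, can be sketched as follows. Everything here is hypothetical: the DCC names, field names, and vocabularies in `QUERY_GUIDES` are invented for illustration and do not describe any real DCC's data model.

```python
# Hypothetical "query guides": each DCC publishes how shared search terms map
# onto its own field names and, where needed, its own value vocabulary.
QUERY_GUIDES = {
    "dcc_a": {
        "fields": {"diagnosis": "diagnosis_text", "assay": "experiment_strategy"},
        "values": {"assay": {"rna-seq": "RNA-Seq"}},
    },
    "dcc_b": {
        "fields": {"diagnosis": "disease", "assay": "data_type"},
        "values": {"assay": {"rna-seq": "Transcriptome Profiling"}},
    },
}


def translate(query: dict, dcc: str) -> dict:
    """Rewrite a shared-term query into one DCC's local fields and values."""
    guide = QUERY_GUIDES[dcc]
    local = {}
    for term, value in query.items():
        if term not in guide["fields"]:
            continue  # this DCC has no equivalent field, so the term is skipped
        local_value = guide["values"].get(term, {}).get(value, value)
        local[guide["fields"][term]] = local_value
    return local


shared_query = {"diagnosis": "ependymoma", "assay": "rna-seq"}
per_dcc = {name: translate(shared_query, name) for name in QUERY_GUIDES}
```

In this arrangement no DCC changes its internal model; each only maintains its own guide, which is the property the portable-query proposal is after.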
The Kids First DRC already has a simple to use, yet very sophisticated, portal that allows users to interactively explore a wide range of publicly available metadata. The portal requires a login; however, joining is simple. Users have the option to create a stand-alone login, or to simply connect an existing Google or Facebook account. Upon logging in, a user can:
- Access their previous queries or previously viewed files
- Apply controlled access credentials to their account
- Explore community data plots such as ‘member research interests’
- View pathology and histology images
- Push cohorts to Cavatica for analysis
- Explore the KF datasets using a number of metadata terms by:
  - Entering queries into a boolean search
  - Filtering by a wide range of clinical and technical variables
  - Clicking on data points in cBioPortal plots that dynamically respond to selections
As was discussed in FAIR: Findability and Interoperability, Kids First also has a number of plans to make their portal more specific to their own datasets. They hope to eventually add functionality to search by events and other clinical variables, as well as add free-text search. They also plan to add a notification system that notifies users if a saved query result changes over time.
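A saved-query notification of the kind described could, in its simplest form, compare a fingerprint of the current result set with one stored when the query was saved. The sketch below assumes that a query result can be reduced to a list of file IDs; the function names are hypothetical and this is not a description of the planned KF feature.

```python
import hashlib
import json


def result_fingerprint(file_ids: list) -> str:
    """Hash the result set in an order-independent way."""
    canonical = json.dumps(sorted(file_ids))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(saved_fingerprint: str, current_file_ids: list) -> bool:
    """True when re-running the saved query yields a different set of files."""
    return result_fingerprint(current_file_ids) != saved_fingerprint


# Fingerprint stored when the user saved the query (IDs are made up).
saved = result_fingerprint(["file-2", "file-1"])

same_results = has_changed(saved, ["file-1", "file-2"])
new_results = has_changed(saved, ["file-1", "file-2", "file-3"])
```

Storing only a hash keeps the saved query small, though a real system would probably also keep the ID list so that it can tell the user which files were added or removed.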
As previously discussed, the KF DRC was funded several years after data collection began, and the primary goal of the KF DRC was to summarize, harmonize and collect partially processed data sent to them by the sequencing centers. However, the Sequence Read Archive (SRA) stopped taking on high volume datasets, and KF had to take on hosting, storing, and distributing the raw data for the program. Essentially they began to act as the SRA for Kids First data. Initially this was difficult, as the DRC was not funded for data storage or cloud computing; however, CHOP contributed in-kind funding as part of the grant mechanism, which helped support this. As of our meeting, the KF DRC still does not have NIH funding for cloud resources.
At the time of our meeting, Kids First was hosting about three petabytes (PB) of data, about one petabyte of which was harmonized. With four active sequencing centers as well as other sources of data, they are growing at a pace of about 6,000-10,000 whole genomes (about 1,800 terabytes) per year. They project that this figure will soon increase exponentially, as they expect that within two years, every patient at CHOP will have whole genome sequencing (WGS) as a matter of routine. “Right now, it’s not clear on how we’ll keep up on data.” KF told us that in the short term, liability will require that they keep all data, so it can be analyzed against new variants and tested in new ways. They warned that at some not-too-distant point, getting sequenced might become more like getting a blood draw, at which point it may be cheaper to keep only the sequencing summary and re-sequence as needed. However, this will likely not be true for specific samples, like lesions or microbiomes, which will be much more precious.
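The figures quoted in this paragraph imply a rough per-genome storage footprint and a simple growth projection. The following is a minimal back-of-the-envelope sketch, not a Kids First estimate. It assumes decimal units (1 PB = 1,000 TB), that the quoted 1,800 TB per year corresponds to the quoted range of 6,000-10,000 genomes, and that growth stays linear, which KF explicitly expects will not hold. The function names are hypothetical.

```python
# Back-of-the-envelope storage arithmetic using figures quoted in the meeting.
# Assumptions (ours, not Kids First's): decimal units and linear growth.

CURRENT_PB = 3.0                   # data hosted at the time of the meeting
GROWTH_TB_PER_YEAR = 1800.0        # quoted yearly growth
GENOMES_PER_YEAR = (6_000, 10_000) # quoted range of new whole genomes per year


def tb_per_genome(growth_tb: float, genomes: int) -> float:
    """Average storage per genome implied by a yearly growth figure."""
    return growth_tb / genomes


def projected_pb(current_pb: float, growth_tb_per_year: float, years: float) -> float:
    """Total holdings after `years`, assuming constant (linear) growth."""
    return current_pb + growth_tb_per_year * years / 1000.0


if __name__ == "__main__":
    low = tb_per_genome(GROWTH_TB_PER_YEAR, GENOMES_PER_YEAR[1])
    high = tb_per_genome(GROWTH_TB_PER_YEAR, GENOMES_PER_YEAR[0])
    print(f"Implied footprint: {low:.2f}-{high:.2f} TB per genome")
    print(f"Holdings in 2 years: {projected_pb(CURRENT_PB, GROWTH_TB_PER_YEAR, 2):.1f} PB")
```

Under these assumptions, the implied footprint is a few hundred gigabytes per genome, and holdings would roughly double in two years even before any move to routine sequencing.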
Cavatica, developed by Seven Bridges, is the Kids First cloud computing environment. Users choose cohorts via the Kids First portal and push those UIDs to Cavatica, which then pulls the correct data. Since this operation only pushes data IDs rather than moving or copying data, users can begin their analysis immediately. New users get $100 worth of credits when they register their Cavatica account, and Kids First subsidizes continued compute and storage for their researchers. Users can also directly link their own AWS buckets to Cavatica and do combined analyses. Kids First would be interested in APIs or other solutions that would allow their users to link to other data sources as well. The Cavatica workspace ensures that a user’s analysis is reproducible by recording details on tools, as well as input and output parameters, and advanced users can use the API or ‘Docker pull’ to access extra details about the pipeline.
The Kids First Data Resource Center serves a wide variety of users. Their portal gets about 2,000 users per month, and their Cavatica space has somewhat fewer than 500 total accounts, about 200 of which are active, repeat users. The majority of their users have the role of researcher, but there are also families, patients, patient advocates and clinical users. A number of these researchers are Kids First awardees, either from X01 or R03 grants.
At the time of our meeting, Kids First had not run any training sessions; however, they were considering starting monthly office hours on site. They frequently go on ‘listening tours’ to all their awardees to survey their needs, and have a standard presentation for their X01 grantees that explains the overall process of managing data. KF staff have also attended several conferences for cancer, as well as more specialized disciplines, and recently one of the KF staff gave a Gordon Conference talk. KF also has a Twitter account, Facebook page, and mailing list; however, these are aimed at the general public, not at researchers.
KF discussed two user groups with complementary training needs. They have clinicians and clinical researchers who have little or no bioinformatics training and struggle with computational issues, but there are also a number of genomic researchers who are accomplished bioinformaticians, but don’t know how to deal with the vast amount of clinical data tied to each genome. Kids First reported that their support burden is mostly from the first group. KF is frequently asked by local people to perform analyses for them, and staff noted that there are many constraints on resources and insufficient bioinformatics support. These circumstances cause Kids First to wonder what role they should be playing: are they just a data provider, or are they also a bioinformatics core? And just how many of their resources should be devoted to basic bioinformatics training? The vast majority of their users require basic computational training before they can begin to interact with the datasets, so providing it would technically support their mission to increase data use. However, supporting basic bioinformatics training takes away from support for more experienced users, and from many other activities that are much more obviously Kids First’s responsibility.
KF is most interested in training the clinical researchers. The NIH funds approximately ten new X01 KF awards each year, which are handled by the KF DRC. These awards all differ in timeline, readiness to begin the project, and bioinformatic aptitude. Once these researchers receive their files from the sequencing facility, they have a six-month embargo period before the data is released to the public. However, in most cases, these awardees don’t know how to get started, and need a lot of help. Very few of them finish, or even in some cases start, their analysis before the embargo period is over. This is frustrating for the researchers, but also for the families of the patients involved with the study, who are expecting fast results and hoping for miracles.
There are three main reasons that KF has not begun work on a training program: time, money and expertise. First, they haven’t had time. Although Kids First is a very mature DRC in terms of their portal and volume of data, they had only been in operation for a little under two years when we visited. Getting all of their infrastructure to its present level of maturity took precedence over developing training. Building a training program also takes money. While KF has some backing from CHOP as well as several Foundation partners, most of this money has gone into supporting infrastructure needs that are still unsupported by the NIH. Finally, training is simply not in their expertise. In particular, they worried that they don’t know how to capture the messy underside of data analysis, which is essential when their audience is going to use their new skills on patients. “How do you take data and new abilities to do science and give it to more people?”
The Kids First DRC staff were very excited about several of our training suggestions, and suggested that with the right proposal, they may be able to secure additional funding from non-NIH sources to supplement their training goals. In particular, patient advocates are usually present at KF events, and have heard firsthand from researchers about the training they lack. Since patients and their families are interested in getting their data used by more researchers, they would likely be amenable to funding those training programs.
There were a number of outcomes, both technological and social, that Kids First would like to see from the Common Fund Data Ecosystem:
- Help with building a KF specific training portfolio:
- A webinar with curated questions about how a type of data is best analyzed, access questions, and other higher level issues. This would allow users to get help, serve as a resource for new users, and help to find new use cases to pursue
- A two-day clinician-focused, hands-on intro to the Kids First portal, and how to explore already-done analyses. CFDE would provide initial work to create materials, and eventually hand off the training program to KF when it is sustainable
- A series of clinician-focused workshops + remote touch-ins to help newly awarded X01s go from data to analysis. This would help X01 awardees get their data analyzed inside their 6 month embargo period.
- A two-day to 1-week hands-on portal+workflow platform for bioinformatics, and how to work in the cloud. This might be better suited as a CFDE product than an individual DCC project
- A ‘Proposal in a Box’ for training that they could use to attract additional funding from foundations and sponsors: outlines the purpose and costs to train X clinicians
- Unified authorization: A CF wide authorization solution
- Reduce risk of sharing data
- Risk can arise when clinical data is selected for research projects
- And has protected patient health information
- Inform KF on what other datasets “their” users are accessing
- “If you like this dataset then you will also like …”
- For instance, it would be interesting to know if GTEx is more or less useful than TOPMed
- Related: analysis on data use as categorized by users
- e.g. data use trends within dbGaP
- Research on how to “do a commons”:
- Fund discovery of what a data ecosystem is: “The social science of data science”. A data commons is analogous to the concept of a commons in the economic sense, and it would be useful to have a model of the benefits and requirements of a science commons, e.g. the total cost of operations, how to build one, or an exploration of narratives of success of a science commons.
- A DCC is the “perfect” subject for an experiment on the value of a science commons
- Accelerate discovery and improved care for children.
- More data, into the hands of more researchers, faster
- Steer more children to appropriate cancer treatment
- Involve children in clinical trials when necessary
- Basic bioinformatics training: Workshops for clinicians who have access to tissue and can get sequence, but lack bioinformatics exposure.
- Facilitate collaborative or “crowd sourced” publications
- Crowd sourced analysis
- Create local champions who have the mandate of chasing down answers to questions
Major Use Cases:
- Automated data ingest:
- A user takes their FASTQ or BAM file and pushes a button to put it through the KF pipeline. It’s automatically harmonized and comparable with their KF cohort.
- Shared workspaces:
- A user can build a cohort and start an analysis on Cavatica and share that workspace with another user. This has to support dbGaP and other authorization and access controls; e.g. if you want to share data from dbGaP with others, you need to make sure everyone has dbGaP access. However, there are also PI-owners who want to have the direct ability to share.
- Patient driven access
- A patient can upload their own data to KF, have it automatically harmonized, and be able to track how and whether their data is used
- Anti use case: allow naive data recombination between DCCs
- Allowing users to combine any data creates the possibility for data misinterpretation. There are perils of joining data where the details are concealed from the users, or the users are not savvy enough to recognize the problems.
During our discussions with Kids First we brainstormed several ways the CFDE might address the problems posed by Kids First in a more holistic way across the entire Common Fund. This is a targeted set of goals that were not originally included in our proposal. However, if the concerns from Kids First are shared by most DCCs, we could positively impact the entire ecosystem and make important new advances in data access and analysis for the NIH research community. These targets are labeled as "game-changers" because by accomplishing these innovations, the CFDE has the potential to dramatically improve the access and usability of an organized set of resources hosted at the Common Fund DCCs.
Building a DCC community. Kids First pointed out that they have few opportunities to collaborate with other DCCs or to share ideas. In other words, there is currently no simple way for the people who are the most expert in how to successfully operate a Common Fund Data Coordinating Center to talk to one another. Unsiloing the data is not enough. We need to work to unsilo the staffs of DCCs, and provide avenues for discussion and collaboration. The CFDE could help by:
- Defining bottom up efforts that can be connected
- Creating opportunities for collaborative proposals between DCCs
- “Bubbling up” the knowledge and the solutions present in all the DCCs
- Providing opportunities for cross training such as ‘how to commons for Program Officers’ so other DCCs can benefit from KF’s work in liberating metadata
- Hosting DCC conferences
- Creating a DCC mentoring space so solved problems can be shared
- Based on maturity of the DCC and the specific needs
Pipeline tracking. Even with CWL and parameter tracking in Cavatica, Kids First still struggles to track pipelines by metadata and to decide what makes a pipeline sufficiently ‘different’. When the near-infinite number of tool combinations for processing data is combined with the ever-evolving nature of Kids First metadata, it is difficult to apply GUIDs or other standard approaches. This could potentially be solved with an algorithm that examines Cavatica and Terra execution metadata and determines which pipelines were run. Rather than reporting whether two pipelines are ‘the same’, which they almost never will be, it would report whether they are likely to be incompatible, based on how similar they are.
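One way to picture the similarity-based compatibility check described above is a set-overlap score over pipeline steps. The sketch below is purely illustrative: the `Step` record, the feature encoding, the Jaccard measure, and the 0.5 threshold are all our assumptions. It does not reflect any actual Cavatica or Terra metadata format or API, and the tool versions shown are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical record of one pipeline step from execution metadata."""
    tool: str
    version: str
    params: frozenset = frozenset()  # set of (name, value) pairs


def step_features(steps):
    """Flatten a pipeline into a set of comparable features."""
    feats = set()
    for s in steps:
        feats.add(("tool", s.tool))
        feats.add(("version", s.tool, s.version))
        for name, value in s.params:
            feats.add(("param", s.tool, name, value))
    return feats


def jaccard(a, b):
    """Jaccard similarity: shared features divided by all features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def likely_incompatible(p1, p2, threshold=0.5):
    """Flag a pair of pipelines whose similarity falls below the threshold."""
    return jaccard(step_features(p1), step_features(p2)) < threshold


# Illustrative pipelines; tool versions are placeholders.
pipeline_a = [Step("bwa-mem", "0.7.17"), Step("gatk-haplotypecaller", "4.1")]
pipeline_b = [Step("bwa-mem", "0.7.17"), Step("gatk-haplotypecaller", "4.0")]
pipeline_c = [Step("bowtie2", "2.3"), Step("freebayes", "1.3")]

if __name__ == "__main__":
    print(likely_incompatible(pipeline_a, pipeline_b))  # differ only in one version
    print(likely_incompatible(pipeline_a, pipeline_c))  # share no tools
```

A real implementation would need domain-aware weighting, since a change of reference genome matters far more than a change in thread count, but the output shape matches the idea: a warning about likely incompatibility rather than a claim of identity.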
A cross-DCC query system: This idea is a variation on the CFDE shared portal proposal. Kids First imagines that most DCCs will have metadata models like their own: idiosyncratic and highly specialized. However, they still want to be able to query data across sites. Rather than each DCC changing to a shared overall model, DCCs could provide a query guide to their internal model, and engage with each other to map queries across DCCs. This might be done through a single plugin shared across DCCs and connected to each underlying database, so that a user can query any site with a given set of search terms and get back a sensible response for that dataset. Alternatively, it might be done by creating a utility that takes a user query at one DCC and translates it into the corresponding queries for other DCCs.
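The query-translation idea can be sketched as a lookup through per-DCC "query guides". Everything in the example below is hypothetical: the DCC names, the shared search terms, and the internal field names are invented for illustration and do not describe any real DCC's metadata model.

```python
# Hypothetical query guides: each DCC publishes a mapping from shared
# search terms to the field names used in its own internal model.
QUERY_GUIDES = {
    "dcc_a": {"diagnosis": "diagnosis_text", "age": "age_at_event_days"},
    "dcc_b": {"diagnosis": "disease_name"},
}


def translate_query(query, guides):
    """Translate a {shared_term: value} query into one query per DCC.

    Terms a DCC has no mapping for are reported back as 'unmapped'
    instead of being silently dropped.
    """
    translated = {}
    for dcc, guide in guides.items():
        local = {guide[term]: value for term, value in query.items() if term in guide}
        unmapped = sorted(term for term in query if term not in guide)
        translated[dcc] = {"query": local, "unmapped": unmapped}
    return translated


if __name__ == "__main__":
    result = translate_query({"diagnosis": "glioma", "age": 10}, QUERY_GUIDES)
    for dcc, entry in result.items():
        print(dcc, entry)
```

Reporting unmapped terms explicitly is a deliberate choice in this sketch: it keeps the user aware of where a cross-site query was only partially applied, which speaks to the concern about naive data recombination raised in the use cases.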
TUESDAY June 25, 2019 - Center for Data Driven Discovery in Biomedicine, The Children's Hospital of Philadelphia – 2716 South Street, 12th Floor, Philadelphia, PA 19146
Room Location: 12105
Short introductions from engagement team members and attending DCC members. The overarching goal for the engagement team is to collect value and process data about the DCC. Values data will include things like: mission, vision, goals, stakeholders, and challenges. Process data includes: data-types and formats maintained, tools and resources owned by the DCC that they would like to have broader use, points of contact for follow up on technical resources, etc.
9:30am-10am DCC overview
Short overview of DCC. Can be formal or informal, choose 1-5 topics to cover. Suggested topics: What is your vision for your organization? What big problems are you trying to solve? What are your big goals for the next year? Who do you see as your most important users/stakeholders? What project(s) is currently taking up the bulk of your effort/time? What areas of your organization are you putting the most resources into? What is the rough composition of your user base in terms of discipline? Do you have any challenges that are blocking implementation of your current goals? What skill set would you like to add to your project? How do you engage with your users? What kind of sustainability issues are you confronting? Can you currently do combined analyses with external datasets?
10am-Noon Goals Assessment
An exercise to get an idea of what types of things are important, what types of things are challenges, what do you dedicate your time/resources towards, and what types of things are not current priorities. Given a list of common goals provided by the engagement team, plus any additional goals the DCC would like to add, DCC members will prioritize goals into both timescale: “Solved/Finished”, “Current-Input wanted”, “Current-Handled”, “Future-planned”, “Future-unplanned”, “NA to our org” and for desirability: “Critical”, “Nice to have”, “Neutral”, “Unnecessary”, and “NA to our org”. The engagement team will work to understand the reasons for prioritization, but will not actively participate in making or guiding decisions.
- Increase end user engagement X% over Y years
- Move data to cloud
- Metadata harmonized within DCC
- Metadata harmonized with _________
- Metadata harmonized across Common Fund
- Implement new service/pipeline ____________
- Increase number of visitors to your site
- CF Data Portal
- Single Sign On
- Pre-filtered/harmonized data conglomerations
- A dashboard for monitoring data in cloud
- User-led training for end users (i.e. written tutorials)
- Webinars, MOOCs, or similar outreach/trainings for end users
- In-person, instructor led trainings for end users
- A NIH cloud playbook
- Full Stacks access
- Developing a data management plan
- Increased FAIRness
- Governance role in CFDE
1 – 3:30pm Open discussion (with breaks)
Using the results of the morning exercise and a collaborative format, iteratively discuss goals, blockers, etc., such that the DCC agrees that the engagement team can accurately describe their answers, motivations and goals. Topics don’t need to be covered in order; we’d just like to touch on these types of questions.
- Do you intend to host data on a cloud service?
- Have you already started using cloud hosting? If yes:
- Approximately how much of your data have you uploaded? How long did that take? How are you tracking progress?
- What challenges have you faced?
- How have you dealt with those challenges?
- What potential future problems with cloud hosting are you watching for?
- Does your org use eRA Commons IDs? Do the IDs meet your sign on needs?
- If yes, did you have/are you having challenges implementing them?
- If no, what do you use? What advantages does your system provide your org?
- What is the rough composition of your user base in terms of discipline?
- What, if any, use cases do you have documented? Undocumented?
- What things do people currently love to do with your data?
- What things would people love to do with your data, but currently can’t (or can’t easily)?
- What pipelines are best suited to your data types?
- What are the challenges associated with those desired uses?
- What other kinds of users would you want to attract to your data?
Review of metadata:
- What metadata is important for your org? For your users?
- Do all of your datasets have approximately the same metadata? Or do you have many levels of completeness?
- Do you have any data already linked to outside resources?
- Did you find the linking process easy? Challenging? Why?
- What kinds of datasets would you like to link into your collection?
- What implementation and schemas do you already have (or want)?
- What standards do you have (or want)?
- What automated systems do you currently have for obtaining metadata and raw data?
- What training resources do you already have?
- What training resources would you like to offer? On what timescale?
- What challenges keep you from offering the training you’d like?
- How do users currently obtain access to your data?
- What are your concerns about human data protection?
- What potential challenges do you see in bringing in new datasets?
- Has your org done any self assessments or outside assessments for FAIRness?
- Are there any aspects of FAIR that are particularly important for your org?
- Are there any aspects of FAIR that your org is not interested in?
- What potential challenges do you see in making your data more FAIR?
- What search terms would make your data stand out in a shared DC search engine?
- Does your org have any dream initiatives that could be realized with extra resources? What resources would you need?
- If you had free access to a Google Engineer for a month, what project would you give them?
- Any other topics/questions the DCC would like to cover
WEDNESDAY June 26, 2019
9-10am Review of goals and CFDE involvement
A quick review of what topics are priorities for the DCC with suggestions from engagement team on how we can help.
10-Noon Open Discussion
DCC reflection on suggestions, open discussion to find shared solutions.
1-2pm Thoroughness checking
Touch on any questions not covered previously, ensure we have:
- Action Items for us, and rough timelines for getting back to DCC on them
- Tools / resources the DCC thinks might be useful for the overall project
- Points of contact “Who is the best point of contact for your metadata schemas, your use cases, the survey of all your data types?”
- Who would like to be added to our governance mailing list?
- Or contact info/instructions on how to get that information offline.