October Report – Concerns, Risks and Threats

Sustainability and cloud costs

Accessibility raises cost concerns. Data hosting and egress charges have traditionally been covered by the institution that is funded to share the data, but repositories increasingly lack the funding to do this at scale. The Sequence Read Archive (SRA), for instance, stopped accepting data from large sequencing projects years ago, mostly because of budgetary constraints. For data in the cloud, the largest cost is typically data download: “To mitigate these costs, many repositories either limit the size of datasets or limit the throughput on downloads. However, this goes squarely against the FAIR principles and results in repositories that have the notion of data-sharing while in fact the data is not truly available.”

The SPARC Data Core suggested that the ultimate fix for this problem is user education and training. Researchers need to know how much time and money downloading data requires, and that working in the cloud is in many cases much faster and cheaper. However, working in the cloud requires training that is not readily available right now.

SPARC told us that, even with a better-educated user base, there needs to be a long-term sustainability plan for data storage and use costs, and cautioned against adopting a system in which the NIH pays data egress fees directly, without oversight:

“If you do provide a platform that enables scalable, high-throughput data access to very large amounts of public data, one needs to take into consideration the cost that could be incurred by users. For example, what if a graduate student writes a script to download the entire public repository each night. What if this is a student in a different country, what if this is mal intended? Given the high availability of the resource, it would be very easy to have hundreds of thousands of dollars in unexpected costs within a couple of days.”
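The scale of this risk can be illustrated with simple arithmetic. The sketch below is not drawn from SPARC's figures: the repository size, the per-gigabyte egress price, and the function name `egress_cost_usd` are all hypothetical values chosen for illustration, and actual cloud egress pricing varies by provider, region, and volume tier.

```python
def egress_cost_usd(size_tb: float, price_per_gb: float, downloads: int) -> float:
    """Estimate egress charges for repeatedly downloading a whole repository.

    size_tb      -- repository size in terabytes (hypothetical)
    price_per_gb -- egress price in USD per gigabyte (hypothetical)
    downloads    -- number of full downloads (e.g. one per night)
    """
    gb_per_tb = 1000  # decimal terabytes, an assumption of this sketch
    return size_tb * gb_per_tb * price_per_gb * downloads


if __name__ == "__main__":
    # Assumed scenario: a 100 TB public repository, an assumed price of
    # USD 0.05 per GB, and a script that downloads everything on three
    # consecutive nights.
    cost = egress_cost_usd(size_tb=100, price_per_gb=0.05, downloads=3)
    print(f"Estimated egress cost: ${cost:,.0f}")
```

Under these assumed numbers a single unattended script already produces a five-figure bill in three nights; a larger repository or several such users would reach the range SPARC describes.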

Finally, there are additional concerns that go beyond FAIR to making data publicly available at all. In a discussion about data ownership, the SPARC Data Core told us that one difficulty with making data public is that several entities often claim rights to the same dataset and hold different views on where and how it should be stored and accessed:

“We strongly believe that data on our platform is owned by the users of our platform and not Blackfynn. However, our experience with academic institutions is that they also claim rights to the data, even though the NIH mandates data sharing. It would be great to have a discussion on mechanisms to break through this impasse.”