What if Data were not Forever?

Mr Rob Cook¹, Dr Rhys Francis²

¹ Pangalax, Bardon, Australia

² eRF, Diamond Creek, Australia

 

INTRODUCTION

Digital data is an increasingly important ingredient of research [2][3][4][7]. As part of the evolving culture of research, researchers are being asked to plan, manage, publish and retain the data used in their research projects [6]. Methods of doing so are being required as part of institutional data management policies. Valuable data is being curated, maintained and used as inputs to ongoing research, sometimes as components of substantial data collections with broad interest and use. This development in research methodology is exceptionally valuable and acts to increase the available stock of knowledge and to raise the level of reproducibility of research.

Storing all research data in the manner approved in data management plans means that the demand for repository storage space is increasing rapidly, and the time and effort required to curate and manage that data is growing with it.

Ultimately, continuing growth in preserved data is unsustainable: data preservation costs are not falling as fast as data volumes and complexity are rising [9], and it is not obvious who will be willing to pay the escalating bills.

This paper challenges the reader to think about how custodians can rationally and safely reduce total data costs, and how research processes can minimise the effort involved in handling research data. It proposes a simple time scale approach.

 

DATA TIME SCALES

Our contention is that the purpose of data retention can be classified depending on the maturity of the research use of the data and the group responsible for the stewardship of the data. The data in each of these states has an associated time scale during which the data has value for its research users. These time scales are observable across the creation and use of research data. Recognising them leads to a practical decomposition of the Australian research data system.

 

Time scale  | State                          | Locus of Stewardship
3-5 years   | Active Research Data           | Researchers and research projects
10-15 years | Openly Exchanged Research Data | Research performing institutions
Decades     | Research Community Data        | Research communities and their supporting institutions
Indefinite  | Stock of Knowledge             | Society at large through governments

 

Budget restraint depends on agreeing that not all research data is of interest over all of these time scales. In addition, meaningful data management at each time scale has different purposes, different access and durability requirements, different cost drivers and different custodial interests. These time scales therefore help set out the components of a design for an Australian research data system by identifying separable purposes within it.
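The time scale table above can be read as a small data model. The following is a minimal sketch only: the state names and stewards come from the table, while the `DataState` type, the `review_due` helper, and the choice of the upper bound of each range (and 50 years for "decades") as a review threshold are illustrative assumptions rather than part of the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataState:
    name: str
    steward: str
    max_years: Optional[int]  # None means the data is retained indefinitely


# Names and stewards follow the table; the numeric thresholds are assumptions.
STATES = [
    DataState("Active Research Data",
              "Researchers and research projects", 5),
    DataState("Openly Exchanged Research Data",
              "Research performing institutions", 15),
    DataState("Research Community Data",
              "Research communities and their supporting institutions", 50),
    DataState("Stock of Knowledge",
              "Society at large through governments", None),
]


def review_due(state: DataState, age_years: float) -> bool:
    """Return True when data has outlived the time scale of its state,
    so that a decision to promote or retire it is due."""
    return state.max_years is not None and age_years > state.max_years
```

Under these assumptions, `review_due(STATES[0], 6)` flags six-year-old active research data for a decision, while data in the stock of knowledge is never flagged.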

PROGRESSION AND SELECTIVITY

The challenge of storing digital data can be related to the volume and complexity of the data itself; the quality of its curation; the scale and complexity of the infrastructure required to support its retention, organisation, accessibility and use; the longevity and durability of its preservation; and the degree to which automation can be applied.

Publicly funded research, by virtue of budget realities, has limited scope to cope with an exploding cost of data. Consequently, downward pressure needs to be applied to each of the cost factors in order to maximise the amount of valued data that can be retained for future use. Indeed, it is imperative that the cost of managing data to the publicly funded research sector be significantly sub-linear in the volume, variety and velocity [5] of that data.
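One way to make the sub-linear requirement concrete, considering volume alone, is sketched below. The symbols V, C(V), c and α, and the power-law form itself, are illustrative assumptions and are not taken from the cited sources.

```latex
% V: volume of retained data; C(V): total cost of managing it.
% Sub-linear cost means the cost per unit of data falls as volume grows:
\lim_{V \to \infty} \frac{C(V)}{V} = 0
% Illustrative (assumed) power-law form with 0 < \alpha < 1:
C(V) = c\,V^{\alpha}, \qquad \frac{C(2V)}{C(V)} = 2^{\alpha} < 2
```

Under this assumed form with α = 0.5, doubling the volume of retained data raises the total cost by a factor of about 1.41 rather than 2.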

The difference in scope and participation can be understood as follows:

  • Active research data is the ‘working data’ of research projects and relates to any data created or used by researchers in the research projects they undertake.
  • Openly exchanged research data is the ‘working data’ of the research system itself. It is data that is curated in support of the exchange of knowledge, largely amongst the associated research communities, for the purposes of improved research integrity, quality and reproducibility. It is data that is managed according to best-practice management principles, in order to underpin the global research culture.
  • Reference research data is the ‘working data’ of broadly based research communities operating over sustained periods of time and cutting across research programs and interest areas. It is often data that would also meet the needs of the social and economic beneficiaries related to those research communities. It would most likely be managed in concert with stakeholders.

These three purposes have different participants as stakeholders, and may justify significantly different costs per data element. The beneficial outcome, the selectivity of data inclusion, the curation of the data sustained, and the durability of the infrastructure and organisational arrangements in support of them, will all be different and drive different costs. The recent call for core data to be identified and treated differently in life sciences [1] is a contemporary example.

The intention is that these components should be conceived of as supporting different states for data in a data system, thereby revealing the transitions between those states. Identifying the components of the system allows these critical but missing transitions, and their associated policies, to be developed. Those policies will determine total data system costs.
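The idea of states joined by policy-governed transitions can be sketched as follows. This is an illustration under stated assumptions, not a proposed implementation: the state names follow the time scale table, while the dictionary-based dataset records, the `transition` function and the `is_published` policy are hypothetical.

```python
from typing import Callable, Dict, List

# Each state and the state its selected data may progress to.
NEXT_STATE: Dict[str, str] = {
    "Active Research Data": "Openly Exchanged Research Data",
    "Openly Exchanged Research Data": "Research Community Data",
    "Research Community Data": "Stock of Knowledge",
}


def transition(datasets: List[dict], state: str,
               policy: Callable[[dict], bool]) -> List[dict]:
    """Return copies of the datasets in `state` that satisfy `policy`,
    moved to the next state. Datasets that fail the policy are not promoted."""
    target = NEXT_STATE[state]
    promoted = []
    for dataset in datasets:
        if dataset["state"] == state and policy(dataset):
            promoted.append({**dataset, "state": target})
    return promoted


def is_published(dataset: dict) -> bool:
    """Hypothetical policy: only published data moves to open exchange."""
    return bool(dataset.get("published"))
```

Because each transition takes its own policy, selectivity can differ at every step, which is where the total cost of the data system is decided.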

CHALLENGES

The transition from active research data to openly exchanged data involves publication and the selective sharing of published and other completed research data across research teams. How are this selection and sharing accomplished? Does the researcher's institution accept the cost of making openly exchanged data available?

Accepting data as valuable across a research community involves a community-agreed process and possibly upgrading data quality to meet the FAIR principles [10] required by that community. How do communities conduct such a process? To what extent can the application of FAIR principles be automated to enable more research community data to be accumulated? How do research communities fund the necessary storage? The National Institutes of Health in the US is funding a data commons for molecular biology data [8]. Can this be replicated in other domains?

Some research data qualifies for retention as part of the global stock of knowledge. Is this funded by governments and other central funding agencies because of its value to society?  How should data in this state be identified?

REFERENCES

  1. Anderson et al. (2017), Data management: A global coalition to sustain core data, Nature 543
  2. European Open Science Cloud High Level Expert Group (2016), Realising the European Open Science Cloud
  3. European Union (2010), Riding the wave, How Europe can gain from the rising tide of scientific data
  4. Finkel, A. (2017), 2016 National Research Infrastructure Roadmap, Department of Education and Training
  5. Laney, D. (2001), ‘3D Data Management: Controlling Data Volume, Velocity, and Variety’, Technical report, META Group
  6. NH&MRC (2007), Australian Code for the Responsible Conduct of Research
  7. NITRD, NSF (2016), Federal Big Data Research And Development Strategic Plan
  8. NIH, Data Commons, https://commonfund.nih.gov/bd2k/commons
  9. Rizzani, L. (2016), Digital Data Storage is undergoing mind-boggling growth, EE Times
  10. Wilkinson, M. D. et al. (2016), The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3

Biographies

Rob Cook provides a consultancy service in eResearch from his own company, Pangalax.  Until recently he was the CEO of QCIF, the Queensland-based eResearch service provider that operates part of the Nectar and RDS research cloud and data storage.  Prior to QCIF, Pangalax worked on a number of large eResearch projects, and before that Rob was CEO of Astracon Inc, a Denver CO company that offered telecommunications service management software.  In the past Rob has led efforts to found and operate Cooperative Research Centres including the Distributed Systems Technology Centre and Smart Services.

Rhys spent the first decade of his career as an academic researcher in parallel and distributed computing. The next decade and a half included roles as a senior principal researcher, research programme manager and strategic leader in information and communication technologies in the Commonwealth Scientific and Industrial Research Organisation (CSIRO). His experience includes being the High Performance Scientific Computing Director for CSIRO and the National Grid Programme Manager for the Australian Partnership for Advanced Computing. From 2006 Rhys worked within the Australian Government’s National Collaborative Research Infrastructure Strategy as the facilitator for its investment plan in eResearch and subsequently as the Executive Director of the Australian eResearch Infrastructure Council. Since then through a series of engagements he has continued to work to harness advancing information and communication technologies to the benefit of Australian research.

About the conference

eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.
