Long term curation of DMRs; the metadata within, and the data they reference

Dr Andrew Janke1, Ms Ingrid Mason2, Ms Helen Morgan3, Ms Siobhann McCafferty4, Mr Andrew White5, Dr Peter Sefton6

1Research Data Services, Brisbane, Australia, andrew.janke@uq.edu.au

2AARNet, Canberra, Australia, ingrid.mason@aarnet.net.au

3The University of Queensland, Brisbane, Australia, helen.morgan@uq.edu.au

4The Australian Access Federation, Brisbane, Australia, siobhann.mccafferty@aaf.edu.au

5Queensland Cyber Infrastructure Foundation, Brisbane, Australia, a.white@qcif.edu.au

6University of Technology Sydney, Sydney, Australia, Peter.Sefton@uts.edu.au

INTRODUCTION

The University of Queensland (UQ) has invested in data management systems and is building a reputation as a trusted provider of research data. There are a number of reasons for this, but the main driver is the need to ensure that all research at UQ abides by the Australian Code for the Responsible Conduct of Research [2] and that UQ research data meets the FAIR data principles [3].

The Research Data Manager (RDM) system has been developed with researchers and aims to be useful, enabling best-practice research data management with minimal disruption to normal workflows. To meet this need, a small project team designed and implemented a minimum viable, metadata-based system built around the Data Management Record (DMR) [1].

The system that has been built is centred on research projects rather than individuals, and looks to solve the working research data problem at a national level for research that involves UQ collaborators. This is achieved by defining access to data based upon AAF [5] credentials, thus easing the path for collaboration.

To further promote and ease collaboration, the UQ RDM system is integrating with the Research Data Services (RDS) led Research Activity Identifier (RAiD) project [6]. In the system, each DMR has a unique, persistent identifier, a RAiD, associated with it. This will allow the UQ system to integrate with other institutions and service providers in Australia and, in time, internationally. The combination of a DMR with a RAiD allows published research outputs to be trusted, because they can be traced back from the publication to the source research data and to the project that originally generated the data.

THE PROBLEM

It is inevitable, however, that at some point a project will finish. At this point it is critical that both the data referenced by the DMR and the metadata in the DMR are archived in a way that remains accessible, in order to meet the FAIR principles [3]. The data also needs to be continually updated as file formats change and become obsolete. This can be handled either by continually monitoring the contents of archives and converting to new formats when required, or by curating the data converters themselves. In either case, automated monitoring and curation of the archives is critical.

A SOLUTION

To meet the ideals of FAIR, a number of existing and proposed technologies might be used. BagIt [4] is a well-established format, designed for digital preservation, that allows metadata and checksums to be included in an archive alongside the data. It has an established user base, is human readable and has broad support across the eResearch space. Tools exist to convert working data into archives; examples include Cr8it and CloudStor Collections. These plugins to cloud platforms allow users to create long-term archives of working data with provenance information included.
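As an illustration of why a BagIt-style layout is human readable and easy to check, the following is a minimal sketch using only the Python standard library. It is not the method used by the RDM system or by any of the tools named above. The function names (make_bag, verify_bag), the choice of SHA-256, the version string and the use of the External-Identifier field to carry a RAiD are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict, info: dict) -> None:
    """Write a minimal BagIt-style bag: data/ payload, manifest and bag-info.txt.

    payload maps file names to bytes; info maps metadata labels to values.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration; the version string here is an assumption for the sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    # Plain "Label: value" metadata that travels with the data.
    (bag_dir / "bag-info.txt").write_text(
        "".join(f"{label}: {value}\n" for label, value in info.items()),
        encoding="utf-8",
    )


def verify_bag(bag_dir: Path) -> list:
    """Return the payload paths whose checksum no longer matches the manifest."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures
```

A curator could call verify_bag on a schedule; an empty list means every payload file still matches the checksum recorded when the bag was created.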

Once data is in an archive, only part of the problem has been solved. The data then needs to be curated over the long term, in particular to ensure that the file formats used remain accessible. There are two approaches to this. The first, and more difficult, option is to preserve the code and methods needed to access the data. The second is to continually monitor and update file formats. There are some evolving tools in this space [5].
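The second approach, monitoring file formats, can be sketched in a few lines. This is only an illustration under stated assumptions: it identifies formats by file extension, which is far cruder than real format identification, and the watch list, the function names (survey_formats, migration_candidates) and the example extensions are hypothetical rather than taken from any existing tool.

```python
from collections import Counter
from pathlib import Path

# Hypothetical watch list: extensions a curator has flagged for migration,
# mapped to the format they should be converted to.
EXAMPLE_WATCH_LIST = {".wpd": ".odt", ".sxw": ".odt"}


def survey_formats(archive_root: Path) -> Counter:
    """Count the files under archive_root by lower-cased extension."""
    return Counter(
        path.suffix.lower() for path in archive_root.rglob("*") if path.is_file()
    )


def migration_candidates(archive_root: Path, watch_list: dict) -> list:
    """List (path, target_extension) pairs for files on the watch list."""
    hits = []
    for path in sorted(archive_root.rglob("*")):
        target = watch_list.get(path.suffix.lower())
        if path.is_file() and target is not None:
            hits.append((path, target))
    return hits
```

Run periodically over a repository, a survey like this gives curators an inventory of what is held, and the candidate list tells them which files need conversion when a format is added to the watch list.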

To increase the likelihood of success, archived datasets need to be placed into a national repository that has both the resources and the power available to continue monitoring and preserving archives. Such a national system has been proposed as part of the Research Data Services refresh.

Details of how these systems might all work together to solve this national problem will be proposed in this presentation, along with emerging ideals and tools, with time for input and discussion by others at the end.

REFERENCES

  1. ANDS blog post: http://andscentral.blogspot.com.au/2017/05/dmrs-making-dmps-relevant-again.html
  2. Australian Code for the Responsible Conduct of Research: https://www.nhmrc.gov.au/guidelines-publications/r39
  3. FAIR data principles: https://www.force11.org/group/fairgroup/fairprinciples
  4. BagIt specification: https://tools.ietf.org/html/draft-kunze-bagit-14
  5. DataCrate: https://github.com/UTS-eResearch/datacrate

Biography

Andrew Janke works on the RDS-led Data LifeCycle Framework (DLCF). This project seeks to address how data should be managed and curated during the entire research lifecycle.

About the conference

eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.
