Mr Guido Aben1, Ingrid Mason1, Adam Bell1
1AARNet, Kensington, Australia
AARNet has for several years offered the CloudStor platform to the Australian research and education (R&E) community. CloudStor was designed to accept content directly from researchers and other users, including both active research data and research data that is no longer in active use but has been cited in publications. Although CloudStor provides vast storage capacity, long-term preservation of the stored data has not been actively addressed, either in its architecture or in its customer-facing functionality. For the purposes of this abstract, we define “preservation” as the management of activities that allow data to be discovered, accessed, rendered, deemed reliable and re-used over many years and even decades.
Digital preservation is a known challenge that is being addressed by numerous software tools, services and projects. A Research Data Shared Services (RDSS) platform being piloted in the UK by JISC is designed to provide a suite of services for researchers to deposit, publish, share and preserve research data. Similar national and supra-national projects are underway in Canada and the EU. One of the tools being piloted as part of the RDSS is Archivematica, an open-source digital preservation system designed to ingest content, perform preservation actions, generate comprehensive technical and preservation metadata, and produce system-independent Archival Information Packages (AIPs) for long-term storage. AARNet has engaged Artefactual, the principal consulting firm behind the Archivematica codebase, to run a research data preservation pilot project on a select sample of content in CloudStor. The purpose of the pilot is to assess whether and how Archivematica’s preservation functionality can best be integrated with CloudStor and made available to data custodians at connected institutions.
Because CloudStor hosts live data before it is due for archival, it gives us the interesting opportunity to inspect files and collections ahead of the do-or-die moment of archival, to proactively identify content with particular preservation risks, and perhaps to signal these risks to the librarians associated with the collection. For example, Archivematica includes a format identification microservice that attempts to determine the exact format and version of each file in a dataset, based on the PRONOM registry maintained by The National Archives in the UK. The project will investigate whether datasets in which a large proportion of files cannot be identified through the PRONOM registry are indeed the ones that participating librarians judge to be at risk of becoming unusable in the future. Other preservation actions available in Archivematica include assignment of persistent identifiers, checksum generation, file format validation, metadata extraction, fixity checking, transcription, normalization to preservation formats, and generation of standardized technical, preservation and audit metadata. We expect that adding these capabilities to CloudStor would enable the service to provide a truly sustainable long-term solution for research data preservation, storage and re-use.
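As a rough illustration of the risk signal described above, the following Python sketch computes the share of files in a dataset that have no PRONOM identifier and flags the dataset when that share exceeds a threshold. It is only a sketch under stated assumptions: the function names, the sample paths and identifier values, and the threshold are all hypothetical, and the code does not call Archivematica or any format identification tool. It assumes the identification results are already available as a mapping from file path to identifier, with `None` for unidentified files.

```python
from typing import Mapping, Optional


def unidentified_fraction(puids: Mapping[str, Optional[str]]) -> float:
    """Return the share of files that have no PRONOM identifier.

    `puids` maps each file path to its identifier, or to None when
    identification failed (hypothetical input shape for this sketch).
    """
    if not puids:
        return 0.0
    missing = sum(1 for puid in puids.values() if puid is None)
    return missing / len(puids)


def at_risk(puids: Mapping[str, Optional[str]], threshold: float = 0.4) -> bool:
    """Flag a dataset whose unidentified share exceeds `threshold`.

    The default threshold is an arbitrary placeholder, not a value
    taken from the pilot project.
    """
    return unidentified_fraction(puids) > threshold


# Illustrative data only: paths and identifier strings are made up
# for the example and are not results from any real collection.
sample = {
    "survey/results.csv": "x-fmt/18",
    "survey/readme.txt": "x-fmt/111",
    "survey/instrument.dat": None,
    "survey/model.bin": None,
}

print(unidentified_fraction(sample))
print(at_risk(sample))
```

Whether a simple proportion like this actually predicts future unusability is exactly the question the pilot sets out to test against librarians' judgement, so any threshold would need to be calibrated from those results.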
Guido Aben is AARNet’s director of eResearch. He holds an MSc in physics from Utrecht University.
In his current role at AARNet, Guido is responsible for building services in response to researchers’ demand and for generating demand for those services; the CloudStor / FileSender family is perhaps the most widely known of them.