Launching DataCrate v1.0: a general purpose data packaging format for research data distribution and web-display

Peter Sefton¹, Michael Lynch², Liz Stokes³, Gerry Devine⁴

¹University of Technology Sydney, peter.sefton@uts.edu.au

²University of Technology Sydney, michael.lynch@uts.edu.au

³University of Technology Sydney, elizabeth.stokes@uts.edu.au

⁴Western Sydney University, g.devine@westernsydney.edu.au

At eResearch Australasia 2017 we presented an early version of DataCrate [1], a standard for packaging research data for distribution as data sets and for hosting on web sites. DataCrate builds on existing standards for packaging data (BagIt [2]) and metadata (linked data via JSON-LD, similar to a method described by Wang et al. [3], with discovery metadata from schema.org and the SPAR ontologies [4]), with a further innovation: each package contains an HTML catalog that can be used to describe content down to the file level (and, in future, beyond it to file contents, such as data dictionaries for tabular data files), building on earlier work at Western Sydney University [5].

The system is discipline-agnostic, providing core metadata that can be used for data-set discovery and for generating DataCite citations, while being easily extensible to deal with discipline-specific metadata.
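The layering described above (a BagIt bag, a JSON-LD description using schema.org terms, and an HTML catalog) can be illustrated with a minimal sketch. This is not the DataCrate specification or any DataCrate tool: the function name `make_bag`, the file names `CATALOG.json` and `CATALOG.html`, and the choice of schema.org properties are assumptions made for illustration, and a real crate would carry far richer metadata.

```python
import hashlib
import html
import json
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes], name: str, description: str) -> dict:
    """Write a minimal BagIt-style bag plus JSON-LD and HTML catalogs.

    Hypothetical helper for illustration only; returns the JSON-LD catalog.
    """
    data_dir = bag_dir / "data"  # BagIt keeps payload files under data/
    data_dir.mkdir(parents=True)

    manifest_lines = []
    file_entities = []
    for rel_path, content in sorted(payload.items()):
        target = data_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line is "<checksum>  <path relative to the bag root>".
        manifest_lines.append(f"{digest}  data/{rel_path}")
        file_entities.append(
            {
                "@id": f"data/{rel_path}",
                "@type": "MediaObject",
                "contentSize": str(len(content)),
            }
        )

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )

    # Discovery metadata for the data set as a whole, then one entity per file.
    catalog = {
        "@context": "https://schema.org/",
        "@graph": [
            {
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "hasPart": [{"@id": e["@id"]} for e in file_entities],
            },
            *file_entities,
        ],
    }
    # Assumed file names; check the DataCrate specification for the real ones.
    (bag_dir / "CATALOG.json").write_text(json.dumps(catalog, indent=2), encoding="utf-8")

    # A very small human-readable view of the same information.
    items = "".join(f"<li>{html.escape(e['@id'])}</li>" for e in file_entities)
    page = (
        f"<html><body><h1>{html.escape(name)}</h1>"
        f"<p>{html.escape(description)}</p><ul>{items}</ul></body></html>"
    )
    (bag_dir / "CATALOG.html").write_text(page, encoding="utf-8")
    return catalog
```

Calling `make_bag(Path("example_bag"), {"readings.csv": b"t,temp\n0,21.5\n"}, "Example", "Demo data")` would write the bag to a new `example_bag` directory. Keeping the machine-readable and human-readable catalogs side by side in the bag is the point of the sketch: the same description serves both automated processing and a reader browsing the package.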

In this presentation we will launch version 1.0 of the DataCrate standard. The presentation will cover:

The motivation for this work, and prior art – why we needed to bring together the standards we did in the way that we did.
A walk-through of example data crates from a variety of sources: speleology, clinical trials, simulation, social history, environmental science and microbiology.
An introduction to tools for making data crates, with an appeal to attendees to join us in making more tools for more kinds of data.
A demonstration of how DataCrates are being used at UTS to move data through the research lifecycle – archiving and publishing data.

MOTIVATION

DataCrate is a standard that enables researchers to apply the FAIR data principles [6] to how they manage research data (given tools to do so, which we discuss below). As such, it demonstrates an affordance that is currently lacking, or yet to be developed, in institutional data management infrastructure. Going beyond abstract exhortations for research data management, DataCrate enables good data management practice by providing data documentation that is human-readable and comprehensible to the end user, while also enabling automated processes that support key steps in archiving and publishing research data.

TOOLS

HIEv DataCrate – At the Hawkesbury Institute for the Environment at Western Sydney University, a bespoke data capture application (HIEv) harvests a wide range of environmental data (and associated file-level metadata) from both automated sensor networks and analysed datasets generated by researchers. Using HIEv's built-in APIs, a new packaging function has been developed that allows selected datasets to be identified and packaged in the DataCrate standard, complete with metadata automatically exported from the HIEv metadata holdings into JSON-LD. Going forward, this will allow datasets within HIEv to be published regularly and automatically, in a format that will increase their potential for reuse.

Calcytejs is a command-line tool for packaging data into DataCrates, developed at the University of Technology Sydney. It allows researchers to describe any data set using spreadsheets that the tool auto-creates in a directory tree.
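The idea of describing files in a spreadsheet and turning each row into linked-data metadata can be sketched as follows. This is not Calcytejs code and does not reflect its actual spreadsheet layout: the function `rows_to_entities` and the column names (`path`, `name`, `description`, `encodingFormat`) are hypothetical, and CSV text stands in for a spreadsheet.

```python
import csv
import io


def rows_to_entities(csv_text: str) -> list[dict]:
    """Turn rows of file-level metadata into JSON-LD style entities.

    Hypothetical helper: the 'path' column becomes the entity's @id and
    every other non-empty column is copied across as a property.
    """
    entities = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        entity = {"@id": row["path"], "@type": "MediaObject"}
        for key, value in row.items():
            # Skip the id column, overflow cells (key None) and empty cells.
            if key is not None and key != "path" and value:
                entity[key] = value
        entities.append(entity)
    return entities


# Example input: one row per file, as a researcher might fill it in.
SHEET = """path,name,description,encodingFormat
data/cave_survey.csv,Cave survey,Station-to-station measurements,text/csv
data/notes.txt,Field notes,,text/plain
"""

entities = rows_to_entities(SHEET)
```

Here the empty description cell for `data/notes.txt` is simply omitted from the resulting entity, so researchers can fill in as much or as little as they know.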

Omeka DataCrate Tools is a collection of proof-of-concept tools, written in Python, for exporting data from Omeka Classic repositories into the DataCrate format.

A tool in development for exporting DataCrates from the Omero microscopy repository will also be presented.

DATACRATES IN THE RESEARCH LIFECYCLE

At the University of Technology Sydney, the Provisioner is an open framework for integrating good research data management practices into everyday research workflows. It uses DataCrates as a flexible interchange format to move datasets between diverse research apps such as lab notebooks, code repositories (where data is included by-reference), survey tools, collection management tools, and into archival and publication workflows. Examples of DataCrates moving through the research lifecycle will be provided.

REFERENCES

[1] Sefton, P. DataCrate: Formalising ways of packaging research data for re-use and dissemination. Presentation, eResearch Australasia 2017. https://conference.eresearch.edu.au/2017/08/datacrate-formalising-ways-of-packaging-research-data-for-re-use-and-dissemination/, accessed 22 June 2018.

[2] Kunze, John, Andy Boyko, Brian Vargas, Liz Madden, and Justin Littman. “The BagIt File Packaging Format (V0.97).” Accessed March 1, 2013. http://tools.ietf.org/html/draft-kunze-bagit-06.

[3] Wang, Jingbo, Amir Aryani, Lesley Wyborn, and Ben Evans. “Providing Research Graph Data in JSON-LD Using Schema.Org.” In Proceedings of the 26th International Conference on World Wide Web Companion, 1213–1218. WWW ’17 Companion. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, 2017. https://doi.org/10.1145/3041021.3053052.

[4] Peroni, S., Shotton, D. (2018). The SPAR Ontologies. To appear in Proceedings of the 17th International Semantic Web Conference. https://w3id.org/spar/article/spar-iswc2018/

[5] Sefton, Peter, Peter Bugeia, and Vicki Picasso. “Pick, Package and Publish Research Data: Cr8it and Of The Web.” In EResearch Australasia 2014. Melbourne, 2014. http://eresearchau.files.wordpress.com/2014/07/eresau2014_submission_30.pdf.

[6] Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (March 15, 2016): 160018.

Biography:

Peter Sefton is the Manager, eResearch Support at the University of Technology, Sydney (UTS).

At UTS Peter is leading a team which is working with key stakeholders to implement university-wide eResearch infrastructure, including an institutional data repository, as well as collaborating widely with research communities at the institution on specific research challenges. His research interests include repositories, digital libraries, and the use of The Web in scholarly communication.

Mike Lynch is an eResearch Analyst in the eResearch Support Group at UTS. His work involves solution design, information architecture and software development supporting research data management. His other interests include data visualisation and functional programming languages.
