Data Versioning: From Principles to Practical Recommendations

Dr Jens Klump1,2, Prof Dr Heinz Pampel2, Dr Mingfang Wu3, Ms Laura Rothfritz2, Ms Dorothea Strecker2, Prof Dr Lesley Wyborn4,5

1CSIRO, Perth, Australia, 2Humboldt Universität zu Berlin, Berlin, Germany, 3ARDC, Melbourne, Australia, 4Australian National University, Canberra, Australia, 5ARDC, Canberra, Australia

Biography:

Jens Klump is a geochemist by training and Group Leader, Exploration Through Cover, in CSIRO Mineral Resources, based in Perth, Western Australia. Jens' work focuses on how information technology can be used to solve geoscience challenges. This includes data in minerals exploration, automated data and metadata capture, sensor data integration in the field and in the laboratory, data processing workflows, and data provenance, as well as data analysis by statistical methods, machine learning and artificial intelligence.

Abstract:

The data lifecycle—from acquisition to release—is increasingly complex, involving multiple processing stages, research groups, and funding sources. There are also growing concerns about data sovereignty and data governance. This complexity demands a robust conceptual framework to ensure reproducible versioning and linking of any original datasets to their many derivatives.

To address this, we initially developed six data versioning principles for digital data artifacts by analysing use cases and adapting the Functional Requirements for Bibliographic Records (FRBR) framework developed around 1995 by IFLA (International Federation of Library Associations and Institutions) for analogue Information Resources. Our six principles established a common language for key concepts and terms: Revision, Release, Granularity, Manifestation, Provenance, and Citation (Klump et al., 2021, https://doi.org/10.5334/dsj-2021-012).

As part of a project supported by the Berlin University Alliance, a workshop was held in Berlin in June 2024 to translate the principles into actionable practices. It was attended by 40 experts with diverse scientific backgrounds from information infrastructure institutions (https://zenodo.org/records/13743876). Through this workshop and subsequent RDA community feedback, we refined our work into key recommendations.

In this presentation, we will introduce these recommendations to the Australian data community. To be effective, they need to be embedded consistently in research practices. The recommendations cover:

– adopting a consistent versioning strategy;

– considering standardisation initiatives;

– using persistent identifiers for unique identification of versions;

– implementing clear and descriptive version labels;

– ensuring user-friendly version control systems;

– documenting changes and metadata;

– communicating versioning practices clearly to stakeholders.
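
As an illustration only, several of the recommendations above (descriptive version labels, a persistent identifier per version, documented changes, and a link back to the previous version) could be captured in a small version record. The sketch below is not taken from the recommendations themselves: the class name DatasetVersion, its fields, the three-part numeric label scheme, and the identifier strings are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetVersion:
    """One released version of a dataset (hypothetical structure)."""
    label: str                          # descriptive version label, e.g. "1.1.0"
    pid: str                            # persistent identifier for this exact version
    change_note: str                    # human-readable summary of what changed
    previous_pid: Optional[str] = None  # link to the version this one supersedes


def parse_label(label: str) -> tuple[int, int, int]:
    """Split a 'major.minor.patch' label into integers so labels compare numerically."""
    major, minor, patch = (int(part) for part in label.split("."))
    return major, minor, patch


def latest(versions: list[DatasetVersion]) -> DatasetVersion:
    """Return the version with the highest label."""
    return max(versions, key=lambda v: parse_label(v.label))


# Placeholder identifiers; these are not real persistent identifiers.
v1 = DatasetVersion("1.0.0", "example-pid:dataset/v1", "Initial release")
v2 = DatasetVersion(
    "1.1.0",
    "example-pid:dataset/v2",
    "Added new samples; corrected units in one column",
    previous_pid=v1.pid,
)
```

Comparing labels as integer tuples rather than as strings avoids ordering "1.10.0" before "1.9.0"; other labelling schemes (for example date-based labels) would need a different comparison.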

 

 
