Capturing the Full-path of Any Data Asset: Engineering Vertical Integration From Initial Capture, Through Intermediate Processing Stages to Multiple Derivative Products.

Dr Lesley Wyborn1, Dr Jens Klump2, Dr Ben Evans1, Nigel Rees1, Sheng Wang1, Dr Mingfang Wu3, Mr Ryan Fraser4

1National Computational Infrastructure, ANU, Canberra, Australia
2CSIRO, Kensington, Australia
3ARDC, Melbourne, Australia
4AARNet, Perth, Australia

Science has always required that all claims can be evaluated against testable hypotheses and that research be reproducible. This necessitates transparency in data observations, methods of analysis and descriptions. Modern research data processing pipelines involve complex systems of physical and digital infrastructure and publishable artefacts: starting from initial samples and observations collected at the source (e.g., field measurements, raw data from laboratory instruments), through intermediate datasets (which can be derived from multiple processing steps), to data ‘products’ that are referenced in scholarly publications. The artefacts can be released at any stage along the ‘Full-path of data’: each component can be made accessible in multiple formats, often at numerous unrelated digital locations.

The technical complexity involved in exposing the Full-path of data, from initial capture to final released product, has been a major challenge, particularly in high-performance computing (HPC). Too often, the various stages and processes along a data pipeline are either not documented or buried within unpublished scientific workflows; source or intermediate data is misplaced or not appropriately persisted; and provenance information is routinely recorded in text-rich files that are neither machine-readable nor accessible.

For transparent science, each release from any stage of the Full-path of data should be uniquely identified to ensure it can be ‘vertically’ integrated with its predecessor(s) and subsequent derivative(s). Each release should include identifiers that enable attribution for any person or organisation making contributions (including funding). Portraying the Full-path of data as a ‘Knowledge graph’ offers more than provenance alone: in particular, it enables pattern analysis of research processes.
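As an illustration only, the idea of vertically integrating identified releases can be sketched as a small in-memory graph. This is a minimal sketch, not the authors' implementation: the `Release` and `FullPathGraph` classes, the stage names, and every `example:` identifier are hypothetical, and a real system would use registered persistent identifiers and a proper graph store.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Release:
    """One uniquely identified release at some stage of the Full-path of data."""
    pid: str                                                 # hypothetical persistent identifier
    stage: str                                               # e.g. "raw", "intermediate", "product"
    derived_from: List[str] = field(default_factory=list)    # predecessor identifiers
    contributors: List[str] = field(default_factory=list)    # person/organisation/funder identifiers


class FullPathGraph:
    """Toy knowledge graph linking releases to predecessors and derivatives."""

    def __init__(self) -> None:
        self.releases: Dict[str, Release] = {}

    def add(self, release: Release) -> None:
        self.releases[release.pid] = release

    def ancestors(self, pid: str) -> List[str]:
        """All predecessor identifiers reachable from `pid`, nearest first."""
        seen: List[str] = []
        queue = list(self.releases[pid].derived_from)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self.releases:
                queue.extend(self.releases[current].derived_from)
        return seen

    def derivatives(self, pid: str) -> List[str]:
        """Identifiers of releases that name `pid` as a direct predecessor."""
        return [r.pid for r in self.releases.values() if pid in r.derived_from]

    def contributors_along_path(self, pid: str) -> Set[str]:
        """Everyone to attribute for `pid`, including upstream contributors."""
        ids = set(self.releases[pid].contributors)
        for ancestor in self.ancestors(pid):
            if ancestor in self.releases:
                ids.update(self.releases[ancestor].contributors)
        return ids


# Hypothetical three-stage path: raw capture -> intermediate dataset -> product.
graph = FullPathGraph()
graph.add(Release("example:raw-001", "raw",
                  contributors=["example:org-field-team"]))
graph.add(Release("example:intermediate-001", "intermediate",
                  derived_from=["example:raw-001"],
                  contributors=["example:person-a"]))
graph.add(Release("example:product-001", "product",
                  derived_from=["example:intermediate-001"],
                  contributors=["example:funder-x"]))

lineage = graph.ancestors("example:product-001")
credit = graph.contributors_along_path("example:product-001")
```

Here `lineage` walks from the product back to the initial capture, and `credit` collects attribution across every stage. Because each release carries its own identifier, the same structure can be queried in either direction, which is what would make pattern analysis across many such paths possible.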


Lesley Wyborn is an Adjunct Fellow at the National Computational Infrastructure at ANU and works part-time for the Australian Research Data Commons. She had 42 years’ experience at Geoscience Australia in scientific research and in geoscientific data management. She is currently Chair of the Australian Academy of Science ‘National Data in Science Committee’ and is on the American Geophysical Union Data Management Advisory Board and the Earth Science Information Partners Executive Board. She was awarded the Public Service Medal in 2014, the 2015 Geological Society of America Career Achievement Award in Geoinformatics and the 2019 US ESIP Martha Maiden Award.
