Vertical Integration of National Geophysical Data Assets to Support Next-generation Reproducible Research at Exascale

Jo Croucher¹, Hannes Hollmann¹, Yiling Liu¹, Andrew Robinson¹, Nigel Rees¹, Rebecca Farrington², Lesley Wyborn³, Ben Evans¹

¹National Computational Infrastructure (NCI), Canberra, ACT, Australia
²AuScope Ltd, Melbourne, VIC, Australia
³The Australian National University, Canberra, ACT, Australia

Abstract

The 2030 Geophysics Collections Project, co-funded by the Australian Research Data Commons (ARDC), AuScope, the National Computational Infrastructure (NCI) and the Terrestrial Ecosystem Research Network (TERN), is focussed on positioning geophysical data collections to take advantage of High-Performance Computing (HPC) infrastructure, current computational and data science methods, and best-practice software engineering. To be usable in next-generation exascale environments, datasets must be fully FAIR-compliant and ‘vertically integrated’ to support greater transparency and reproducibility of research.

Using HPC, processing terabyte-scale field survey data now takes minutes rather than days or weeks. Researchers are increasingly demanding access to rawer, minimally processed datasets in usable data formats so that they can develop products for their specific use case, rather than relying on generic analysis-ready products.

A key challenge has involved representing the ‘vertical’ relationship between different processed data products and the relevant source datasets. Datasets were organised according to the level of processing (e.g., distinguishing the raw packed time series from various level 1 or level 2 processed datasets). Detailed lineage statements enable derivative datasets to be associated with rawer forms, whilst higher-level data products reference the original raw time series data. Digital Object Identifiers (DOIs) facilitated interlinking and referencing of different data products and related software, ensuring citation for researchers and impact-tracking for data repositories, organisations and funders.
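
The ‘vertical’ relationship described above can be illustrated with a minimal sketch, which should not be read as the project's actual catalogue schema. It assumes each dataset record carries a processing level and the identifier of its immediate source, so that a higher-level product can be traced back to the raw time series. The class name, field names, the use of level 0 for raw data, and the identifiers are all hypothetical placeholders; no real DOIs are shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalogue entry; field names are illustrative only."""
    doi: str                            # placeholder identifier, not a real DOI
    title: str
    processing_level: int               # assumption: 0 = raw, 1 and 2 = processed
    derived_from: Optional[str] = None  # identifier of the immediate source dataset


def trace_lineage(doi: str, catalogue: Dict[str, DatasetRecord]) -> List[DatasetRecord]:
    """Follow derived_from links from a product back to its raw source."""
    chain: List[DatasetRecord] = []
    seen = set()
    current: Optional[str] = doi
    while current is not None:
        if current in seen:
            # Guard against a malformed catalogue that loops back on itself.
            raise ValueError(f"cyclic lineage at {current}")
        seen.add(current)
        record = catalogue[current]
        chain.append(record)
        current = record.derived_from
    return chain


# Three placeholder records forming one vertical chain.
records = [
    DatasetRecord("10.0000/example-raw", "Raw packed time series", 0),
    DatasetRecord("10.0000/example-level1", "Level 1 product", 1,
                  derived_from="10.0000/example-raw"),
    DatasetRecord("10.0000/example-level2", "Level 2 product", 2,
                  derived_from="10.0000/example-level1"),
]
catalogue = {record.doi: record for record in records}

lineage = trace_lineage("10.0000/example-level2", catalogue)
for record in lineage:
    print(record.processing_level, record.doi, record.title)
```

In this sketch each record points only to its immediate source, and the full chain back to the raw data is recovered by traversal. A catalogue could equally store a direct reference to the raw time series on every higher-level product, as the text above suggests.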

This presentation will provide recommendations from the 2030 project on data organisation and cataloguing to enhance FAIR data access, reuse and sharing across vertically integrated datasets for next-generation computational infrastructures.

Biography

Jo Croucher is a Research Data Specialist at the National Computational Infrastructure (NCI). Her scientific background includes experience in health research. As a data librarian, she previously worked within the Library Repository Services team at UNSW Sydney, supporting data curation projects across a wide range of disciplines.
