Vertical Integration of National Geophysical Data Assets to Support Next-generation Reproducible Research at Exascale
Jo Croucher1, Hannes Hollmann1, Yiling Liu1, Andrew Robinson1, Nigel Rees1, Rebecca Farrington2, Lesley Wyborn3, Ben Evans1
1 National Computational Infrastructure (NCI), Canberra, ACT, Australia
2 AuScope Ltd, Melbourne, VIC, Australia
3 The Australian National University, Canberra, ACT, Australia
Abstract
The 2030 Geophysics Collections Project, co-funded by the Australian Research Data Commons (ARDC), AuScope, the National Computational Infrastructure (NCI) and the Terrestrial Ecosystem Research Network (TERN), is focussed on positioning geophysical data collections to take advantage of High-Performance Computing (HPC) infrastructure, current computational and data science methods, and best-practice software engineering. To be usable in next-generation exascale environments, datasets must be fully FAIR-compliant and ‘vertically integrated’ to support greater transparency and reproducibility of research.
Using HPC, processing terabyte-scale field survey data now takes minutes rather than days or weeks. Researchers are increasingly demanding access to rawer, minimally processed datasets in usable formats so that they can develop products for their specific use case, rather than relying on generic analysis-ready products.
A key challenge has been representing the ‘vertical’ relationship between processed data products and their source datasets. Datasets were organised according to level of processing (e.g., distinguishing the raw packed time series from the various level 1 or level 2 processed datasets). Detailed lineage statements enable derivative datasets to be associated with their rawer forms, whilst higher-level data products reference the original raw time series data. Digital Object Identifiers (DOIs) facilitated the interlinking and referencing of different data products and related software, and ensured citation for researchers and impact-tracking for data repositories, organisations and funders.
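The ‘vertical’ relationship described above can be pictured as a chain of catalogue records, each pointing to the dataset it was derived from. The following is a minimal illustrative sketch only, not the project's actual catalogue schema: the DatasetRecord class, the trace_lineage function, the field names and the DOIs are all hypothetical placeholders, and level 0 is assumed here to denote the raw time series.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical catalogue entry for one data product."""
    doi: str                            # placeholder identifier, not a real DOI
    title: str
    level: int                          # assumed convention: 0 = raw, 1 and 2 = processed
    derived_from: Optional[str] = None  # DOI of the immediate source dataset


def trace_lineage(doi: str, catalogue: Dict[str, DatasetRecord]) -> List[DatasetRecord]:
    """Follow derived_from links from a product back to its raw source."""
    chain: List[DatasetRecord] = []
    seen = set()
    current: Optional[str] = doi
    while current is not None:
        if current in seen:
            raise ValueError(f"circular lineage detected at {current}")
        seen.add(current)
        record = catalogue[current]     # raises KeyError if a link is unresolved
        chain.append(record)
        current = record.derived_from
    return chain


# Hypothetical records for a single survey at three processing levels.
CATALOGUE = {
    r.doi: r
    for r in [
        DatasetRecord("10.0000/example-raw", "Raw packed time series", 0),
        DatasetRecord("10.0000/example-level1", "Level 1 product", 1,
                      derived_from="10.0000/example-raw"),
        DatasetRecord("10.0000/example-level2", "Level 2 product", 2,
                      derived_from="10.0000/example-level1"),
    ]
}

if __name__ == "__main__":
    for record in trace_lineage("10.0000/example-level2", CATALOGUE):
        print(f"level {record.level}: {record.title} ({record.doi})")
```

Storing only the immediate parent on each record keeps the lineage statement local to the dataset, while the traversal recovers the full path back to the raw time series when a higher-level product needs to cite its source.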
This presentation will provide recommendations from the 2030 Geophysics Collections Project on data organisation and cataloguing to enhance FAIR data access, reuse and sharing across vertically integrated datasets for next-generation computational infrastructures.
Biography
Jo Croucher is a Research Data Specialist at the National Computational Infrastructure (NCI). Her scientific background includes experience in health research. As a data librarian, she previously worked within the Library Repository Services team at UNSW Sydney, supporting data curation projects across a wide range of disciplines.