Knitting jumpers from steel wool and spaghetti: implementing a modified Darwin Core Event model for the Australian Reference Genome Atlas (ARGA) to increase trust through provenance.
Kathryn Hall (1), Matt Andrews (1), Keeva Connolly (2), Yasima Kankanamge (1), Christopher Mangion (2), Winnie Mok (2), Lars Nauheimer (1), Goran Sterjov (1), Nigel Ward (2), Peter Brenton (1)
(1) Atlas of Living Australia; (2) Australian BioCommons; Australia
Abstract
Trust is a key concern for researchers who work in data-intensive fields; if they cannot trust the data they find, they are unlikely to reuse them. Addressing data trustability is a core imperative for the Australian Reference Genome Atlas (ARGA). At the digital heart of ARGA lies a new architecture that integrates genomics data from domain-specific repositories with Darwin Core formatted occurrence records, and other biodiversity data, from collectories.
Problematically, each data type indexed by ARGA is generated from source biologicals using its own method; some of these data are raw, some are derivatives, some are aggregates. While data reuse within the biosciences is not restricted to primary data (indeed, secondary or tertiary data are often preferred), it is critically important that clear provenance chains are visible to researchers, so that they can make informed judgements about the data. Repositories for genomics data, such as NCBI GenBank, have historically been built with an emphasis on the molecular datasets themselves, providing complex metadata around annotations, but not explicitly recording methodology and source materials in detail.
Foundational work by ARGA integrates data from these two different structures (genomics repository records and Darwin Core formatted occurrence records), via a series of ingestion scripts implementing customised field mappings, to harmonise both data sources within an extended Event framework. Here we demonstrate our mapping process and detail how we have woven the two data structures together to form a coherent whole for genomics data. In creating a hybrid framework, ARGA transforms genomics data and their metadata, and enriches them with trusted organismal provenance, to enable discoverability within biological and ecological contexts.
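The general idea of a field mapping that feeds an Event chain can be illustrated with a minimal Python sketch. It is not ARGA's ingestion code: the mapping table, the `Event` class, the helper functions, the event identifier scheme and the sample record are all hypothetical and chosen only for illustration. The source keys follow INSDC-style source qualifiers and the targets are standard Darwin Core terms, but the pairing shown here is an assumption rather than ARGA's actual mapping.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical mapping: INSDC-style source qualifier -> Darwin Core term.
# The real ARGA mappings are customised per data source and are not shown here.
FIELD_MAP: Dict[str, str] = {
    "organism": "scientificName",
    "collection_date": "eventDate",
    "collected_by": "recordedBy",
    "specimen_voucher": "catalogNumber",
}


@dataclass
class Event:
    """One step in a provenance chain, linked to its parent by ID."""
    event_id: str
    event_type: str
    parent_event_id: Optional[str] = None
    terms: Dict[str, str] = field(default_factory=dict)


def map_fields(record: Dict[str, str], field_map: Dict[str, str]) -> Dict[str, str]:
    """Rename source fields to Darwin Core terms, skipping absent ones."""
    return {dwc: record[src] for src, dwc in field_map.items() if src in record}


def build_events(record: Dict[str, str]) -> List[Event]:
    """Split one repository record into a collection event and a
    sequencing event whose parent is the collection event."""
    accession = record["accession"]
    collection = Event(
        event_id=f"{accession}:collection",
        event_type="collection",
        terms=map_fields(record, FIELD_MAP),
    )
    sequencing = Event(
        event_id=f"{accession}:sequencing",
        event_type="sequencing",
        parent_event_id=collection.event_id,
        terms={"associatedSequences": accession},
    )
    return [collection, sequencing]


def provenance_chain(events: List[Event], event_id: str) -> List[str]:
    """Walk parent links from an event back to the source material."""
    by_id = {e.event_id: e for e in events}
    chain: List[str] = []
    current = by_id.get(event_id)
    while current is not None:
        chain.append(current.event_id)
        current = by_id.get(current.parent_event_id)
    return chain


# Placeholder record; the accession and values are invented for illustration.
sample = {
    "accession": "XX000000",
    "organism": "Examplea exemplaris",
    "collection_date": "2020-01-15",
    "specimen_voucher": "VOUCHER-0001",
}
events = build_events(sample)
print(provenance_chain(events, "XX000000:sequencing"))
```

In this sketch the chain for the sequencing event leads back to the collection event, which carries the organismal terms. Linking events by parent identifier is one simple way to keep derived data traceable to their source material.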
Biography
Kathryn Hall is the Product Champion for the Australian Reference Genome Atlas (ARGA) Project, and is part of the Atlas of Living Australia (CSIRO).
She has a background in animal taxonomy, with a special focus on marine invertebrates. Kathryn has always worked to integrate genetic data with other biological data to inform her taxonomic process, and is very proud now to be overseeing the ARGA Project team as they build the ARGA platform. ARGA aims to help researchers discover data within biological and ecological contexts for all Australian biota.