Automating ARGA: Standardisation and Ingestion for the Australian Reference Genome Atlas

Mr Christopher Mangion1, Matt Andrews1, Peter Brenton1, Jack Brinkman1, Keeva Connolly2, Winnie Mok2, Sarah Richmond3, Goran Sterjov1, Nigel Ward2, Kathryn Hall1

1Atlas of Living Australia, Australia, 2Australian BioCommons, Australia, 3Bioplatforms Australia, Australia

Biography:

My name is Christopher Mangion, and I am the primary data engineer on the ARGA project, a collaboration between the ALA, Bioplatforms Australia, and the Australian BioCommons. I have worked on the ARGA project for 3 years, primarily on collecting and standardising data from the many sources ARGA indexes. My goal is to streamline our entire process so that I can focus on indexing even more data sets, enabling our users to achieve everything on ARGA.

Abstract:

The Australian Reference Genome Atlas (ARGA) references data of many types from many sources and collates it into one digestible data set. Much of this data is updated regularly, so ARGA needs to continuously collect it and standardise it into a unified format. At this volume, the most practical way to do so is automatically, and a workflow has been set up to achieve this.

The data is first run through the ARGA data pipeline, which standardises the variety of data types and sources into a normalised format. At the tail end of the pipeline, output field names are remapped to an ARGA standard based on a modified version of Darwin Core (https://dwc.tdwg.org/terms/). The fields are divided into 10 event categories, producing a compressed output folder that contains one file per event, each holding that event's remapped fields.
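As an illustration of the remapping and packaging step described above, the following is a minimal sketch and not the ARGA implementation. The mapping table, the event names ("collection", "sequencing"), the source field names, and the CSV-in-zip layout are all assumptions made for the example; only the target names scientificName, eventDate, and datasetID are taken from Darwin Core.

```python
import csv
import io
import zipfile

# Hypothetical mapping: source field name -> (event category, remapped field name).
# The real ARGA mapping and its 10 event categories are not reproduced here.
FIELD_MAP = {
    "organism_name": ("collection", "scientificName"),
    "collected_on": ("collection", "eventDate"),
    "accession": ("sequencing", "datasetID"),
}


def remap_record(record):
    """Group one source record's fields by event, renaming each mapped field."""
    events = {}
    for source_name, value in record.items():
        if source_name not in FIELD_MAP:
            continue  # unmapped fields are simply dropped in this sketch
        event, target_name = FIELD_MAP[source_name]
        events.setdefault(event, {})[target_name] = value
    return events


def package(records, archive):
    """Write one CSV per event into a zip archive (a path or file-like object)."""
    rows_by_event = {}
    for record in records:
        for event, fields in remap_record(record).items():
            rows_by_event.setdefault(event, []).append(fields)

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for event, rows in rows_by_event.items():
            columns = sorted({name for row in rows for name in row})
            buffer = io.StringIO()
            writer = csv.DictWriter(buffer, fieldnames=columns)
            writer.writeheader()
            writer.writerows(rows)  # missing fields are written as empty cells
            zf.writestr(f"{event}.csv", buffer.getvalue())
    return sorted(rows_by_event)
```

Calling package(records, "output.zip") on a list of dictionaries would produce an archive with one file per event that actually occurs in the data. Keeping the mapping in a single table means a change to the target standard touches one place only.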

To trigger this chain of events, the ARGA backend requests an update from ARGA data, which determines which data sources require an update based on each source's custom update configuration and creates the packaged outputs. Each package is placed in a location known to both ARGA data and the ARGA backend, where the backend picks it up and loads it into a set of relational databases. The result is a resource in which the data is searchable and easily retrievable for use on the ARGA portal (app.arga.org.au).
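The decision about which sources require an update could look roughly like the sketch below. It assumes, purely for illustration, that each source's update configuration boils down to a fixed interval and a timestamp of the last successful update; the SourceConfig class, its fields, and the source names are hypothetical and do not describe ARGA's actual configuration format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SourceConfig:
    """Hypothetical per-source update configuration."""
    name: str
    update_interval: timedelta
    last_updated: Optional[datetime] = None  # None means never collected


def sources_due(configs, now):
    """Return the names of sources whose update interval has elapsed."""
    return [
        config.name
        for config in configs
        if config.last_updated is None
        or now - config.last_updated >= config.update_interval
    ]


configs = [
    SourceConfig("source_a", timedelta(days=7), datetime(2024, 1, 1)),
    SourceConfig("source_b", timedelta(days=30), datetime(2024, 1, 1)),
    SourceConfig("source_c", timedelta(days=1)),
]
due = sources_due(configs, now=datetime(2024, 1, 10))
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.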

 

Categories