ARGA Data Retrieval

Mr. Christopher Mangion1, Simon Checksfield, Nigel Ward, Kathryn Hall, Sarah Richmond, Keeva Connolly, Winnie Mok, Goran Sterjov, Caitlin Ramsay, Matt Andrews

1Australian Biocommons, Australia, 2CSIRO / Atlas of Living Australia, Australia, 3Bioplatforms Australia, Australia

Biography:

I'm a data engineer working under the Australian Biocommons on the Australian Reference Genome Atlas project in conjunction with the Atlas of Living Australia. I work on the data ingestion process for metadata to be displayed on the ARGA portal. I have been simplifying our process so that all ingestion is represented by configuration files, which I have done successfully for most of the data sources we currently reference.

Abstract:

Data are stored across the internet in many different shapes and sizes, often without consistency. The goal of the Australian Reference Genome Atlas (ARGA) project is to show users where to obtain data they may be interested in, and how it compares to data in other locations. To do this, we first need to collect the metadata and standardise it so that all records can be interpreted in the same way. ARGA has built a bespoke in-house data retrieval pipeline for this purpose.

The data retrieval pipeline consists of three phases: download, processing, and conversion. The pipeline is driven by configuration files written for each database of interest. Downloading is currently split into three distinct categories: URL, crawling, and script retrieval. Script retrieval is the most diverse, being used for both API access and scraping static data from web pages. Processing is then applied at three levels: steps specific to individual files, steps applied to all files, and a final stage that combines the files into a single output file representative of the entire target database. Some datasets require little to no processing, but every dataset is always represented by a single output file. In the conversion phase, a mapping is applied to this final file to rename its columns to standardised names within the ARGA schema.
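As a rough illustration of the three phases described above, the following Python sketch drives download, processing, and conversion from a single configuration. It is not the ARGA implementation: the configuration keys, step names, file names, column names, and the in-memory `FAKE_REMOTE` stand-in for real downloads are all assumptions made for this example.

```python
import csv
import io

# Hypothetical configuration for one source database; every key is illustrative.
CONFIG = {
    "download": {"method": "url", "files": ["specimens.csv", "sequences.csv"]},
    "processing": {
        "per_file": {"sequences.csv": "drop_empty_rows"},  # file-specific step
        "all_files": ["strip_whitespace"],                 # applied to every file
        "final": "concatenate",                            # combine into one output
    },
    "conversion": {"mapping": {"Species": "scientific_name", "Acc": "accession"}},
}

# Stand-in for remote content so the sketch runs offline (assumption).
FAKE_REMOTE = {
    "specimens.csv": "Species,Acc\n Homo sapiens ,A1\n",
    "sequences.csv": "Species,Acc\nMus musculus,B2\n,\n",
}


def download(cfg):
    """Download phase: only the 'url' category is sketched here."""
    if cfg["method"] != "url":
        raise NotImplementedError(cfg["method"])
    return {name: FAKE_REMOTE[name] for name in cfg["files"]}


def strip_whitespace(rows):
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def drop_empty_rows(rows):
    return [row for row in rows if any(value.strip() for value in row.values())]


# Registry that lets the configuration refer to processing steps by name.
STEPS = {"strip_whitespace": strip_whitespace, "drop_empty_rows": drop_empty_rows}


def process(files, cfg):
    """Processing phase: per-file steps, then all-file steps, then a final merge."""
    tables = {}
    for name, text in files.items():
        rows = list(csv.DictReader(io.StringIO(text)))
        step = cfg["per_file"].get(name)
        if step is not None:
            rows = STEPS[step](rows)
        for step_name in cfg["all_files"]:
            rows = STEPS[step_name](rows)
        tables[name] = rows
    if cfg["final"] != "concatenate":
        raise NotImplementedError(cfg["final"])
    # Single output table representing the whole source database.
    return [row for name in files for row in tables[name]]


def convert(rows, cfg):
    """Conversion phase: rename columns to standardised names."""
    mapping = cfg["mapping"]
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def run_pipeline(config):
    files = download(config["download"])
    combined = process(files, config["processing"])
    return convert(combined, config["conversion"])


if __name__ == "__main__":
    for record in run_pipeline(CONFIG):
        print(record)
```

In a sketch like this, supporting a new source would mostly mean writing a new configuration, with new code needed only when a source requires a step that the registry does not yet provide.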

Here we detail the data ingestion pipeline and demonstrate how it will be used to index more datasets and adapted to handle the complex data challenges ahead.

 
