Research data management pipelines with Apache Airflow

Dr Siddeswara Guru1, Mr Andrew Cleland1, Mr Javier Sanchez Gonzalez1, Mr Gerhard Weis1, Mr Wilma Karsdorp1, Mr Edmond Chuc1

1University of Queensland, St Lucia, Australia

Managing research data is a complex task that involves processing, organising and sharing data collected from research projects. Data processing and organisation are integral parts of data management pipelines and often consist of sets of repeated tasks. Automating these tasks can substantially improve data management practices. Recently, a new generation of Directed Acyclic Graph (DAG) workflow management systems, such as Apache Airflow, has been developed in the IT industry to run data processing and organisation tasks in batches.

We will discuss two case studies on the use of Apache Airflow to develop data processing and organisation workflows for the management of ecosystem science data. The first case study covers batch processing of ecology images from multiple sites and making them accessible from the Ecoimage data platform. The batch processing involves: reading ecology site images for 12 different sites from CloudStor, running a quality control check on the images, extracting image EXIF information and storing it in a Postgres database, creating thumbnails of the images for faster display on the web, moving the images to the image storage for archival, indexing all images and related metadata in Elasticsearch to drive a web dashboard, and notifying data providers once all the processes have executed successfully. The workflow runs on Kubernetes to improve scalability. The second case study is the implementation of an extract, transform and load (ETL) process to migrate plot-based ecology data from different institutional data sources into integrated data platforms.
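The image workflow in the first case study can be pictured as a small DAG. The sketch below is illustrative only and is not the authors' Airflow code: it uses Python's standard-library `graphlib` rather than Airflow, the task names are hypothetical, and the dependency edges (for example, that EXIF extraction and thumbnail creation can both follow quality control independently) are assumptions, since the abstract lists the steps only in sequence.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the image workflow. Each task maps to the set
# of tasks that must finish before it can start (assumed dependencies).
IMAGE_PIPELINE = {
    "read_images": set(),
    "quality_control": {"read_images"},
    "store_exif": {"quality_control"},
    "create_thumbnails": {"quality_control"},
    "archive_images": {"store_exif", "create_thumbnails"},
    "index_metadata": {"archive_images"},
    "notify_providers": {"index_metadata"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


def run(dag, handlers):
    """Call each task's handler in dependency order and collect the results."""
    results = {}
    for task in execution_order(dag):
        results[task] = handlers[task]()
    return results


if __name__ == "__main__":
    for step in execution_order(IMAGE_PIPELINE):
        print(step)
```

In a workflow manager such as Airflow, each entry would typically become a task and each dependency an edge between tasks; the scheduler, rather than a simple loop, would then decide when each task runs.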


Biography:

Siddeswara Guru is a program lead for the TERN Data Services and Analytics platform.

ORCID ID: https://orcid.org/0000-0002-3903-254X
