Research data management pipelines with Apache Airflow

Dr Siddeswara Guru1, Mr Andrew Cleland1, Mr Javier Sanchez Gonzalez1, Mr Gerhard Weis1, Mr Wilma Karsdorp1, Mr Edmond Chuc1

1University of Queensland, St Lucia, Australia

Managing research data is a complex task of processing, organising and sharing data collected from research projects. Data processing and organisation are an integral part of data management pipelines and often consist of a set of repeated tasks. Automating these tasks can substantially improve data management practices. Recently, a new generation of Directed Acyclic Graph (DAG) workflow management systems, such as Apache Airflow, has been developed in the IT industry to batch process such tasks.

We will discuss two case studies on the use of Apache Airflow in the development of data processing and organisation workflows for the management of ecosystem science data. The first case study is about batch processing of ecology images from multiple sites and making them accessible from the Ecoimage data platform. The batch processing involves reading ecology site images from 12 different sites from CloudStor, running a quality control check on the images, extracting image EXIF information and storing it in a Postgres database, creating thumbnails of the images for faster display on the web, moving the images to image storage for archival, indexing all images and related metadata in Elasticsearch to drive a web dashboard, and notifying data providers once all the processes have executed successfully. The workflow runs on Kubernetes to improve scalability. The second case study is the implementation of an extract, transform and load (ETL) process to migrate plot-based ecology data from different institutional data sources into integrated data platforms.
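To illustrate how a DAG workflow system orders the image-processing steps described above, the following is a minimal sketch using only the Python standard library. It does not use Airflow's API and is not the authors' implementation. The task names and the dependencies between them (for example, that EXIF storage and thumbnail creation can both follow quality control independently) are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
# The dependency structure is an assumption, not taken from the abstract.
PIPELINE = {
    "read_images": set(),
    "quality_control": {"read_images"},
    "store_exif": {"quality_control"},
    "create_thumbnails": {"quality_control"},
    "archive_images": {"store_exif", "create_thumbnails"},
    "index_metadata": {"archive_images"},
    "notify_providers": {"index_metadata"},
}


def execution_order(graph):
    """Return one valid run order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


def run(graph, handlers):
    """Call each task's handler in dependency order and return the names that ran."""
    completed = []
    for name in execution_order(graph):
        handlers[name]()  # a real pipeline would do the actual work here
        completed.append(name)
    return completed


if __name__ == "__main__":
    for step in execution_order(PIPELINE):
        print(step)
```

In a workflow manager such as Airflow, each node would typically be a separately scheduled and retried task; the sketch only shows the ordering guarantee that a directed acyclic graph provides, namely that no task runs before the tasks it depends on.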


Biography:

Siddeswara Guru is a program lead for the TERN Data Services and Analytics platform.

ORCID ID: https://orcid.org/0000-0002-3903-254X

Curating, Discovering, and Disseminating Research Elements Using iRODS

Mr David Fellinger1

1iRODS Consortium, Chapel Hill, United States

It is indisputable that we are in the age of “Big Data”. Data with the attributes of variety, velocity, volume, and veracity must be managed. The attribute of velocity is growing as global bandwidth increases.

The Integrated Rule-Oriented Data System (iRODS) is an open source technology initiative that has been developed to manage data from raw instrument readings through to publication, solving the challenges of curation and addressing secure collaboration.

Data management solutions of worldwide research institutions were presented at the recent iRODS User Group Meeting.

In the US, CyVerse provides a Discovery Environment (DE) for a research data repository with over 80,000 users from 5,690 participating academic institutions and 2,438 non-academic organizations.

In Europe, the EUDAT Collaborative Data Infrastructure (CDI) was formed to host the data of over 50 universities and research institutions.

In the Netherlands, SURF has built a Research Data Management (RDM) framework entirely based on iRODS to manage its consolidated data.

In Sweden, the Swedish National Infrastructure for Computing (SNIC) provides storage capacity and compute resources for the nation.

In the state of Victoria, Australia, the Department of Agriculture is capturing, managing, and analyzing data for “smart farms” in order to define new and more efficient farming methods.

These are just a few examples of the use of iRODS by institutions that take advantage of the features of data virtualization, data discovery, workflow automation, and secure collaboration that iRODS technology provides. These and more use cases will be presented.


Biography:

Dave Fellinger is a Data Management Technologist and Storage Scientist with the iRODS Consortium. He has over three decades of engineering experience including film systems, video processing devices, ASIC design and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems.

In his role at the iRODS Consortium, Dave is working with users in research sites and high performance computer centers to confirm that a broad range of use cases can be fully addressed by the iRODS feature set.

He attended Carnegie Mellon University and holds patents in diverse areas of technology.

    About the conference

    eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.
