A cloud-based system to enable streamlined access to and analysis of continental-scale environmental metagenomics data by non-genomics researchers

Jeff Christiansen1, Derek Benson2, Grahame Bowland3, Samuel Chang4, Simon Gladman5, Gareth Price6, Anna Syme7, Tamas Szabo8, Mike W C Thang9, Andrew Bissett10

1QCIF and RCC-University of Queensland, Brisbane, Australia, jeff.christiansen@qcif.edu.au

2RCC-University of Queensland, Brisbane, Australia, d.benson@imb.uq.edu.au

3Centre for Comparative Genomics, Murdoch University, Perth, Australia, gbowland@ccg.murdoch.edu.au

4Centre for Comparative Genomics, Murdoch University, Perth, Australia, schang@ccg.murdoch.edu.au

5Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia, simon.gladman@unimelb.edu.au

6QFAB@QCIF, Brisbane, Australia, g.price@qfab.org

7Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia, anna.syme@unimelb.edu.au

8Centre for Comparative Genomics, Murdoch University, Perth, Australia, tszabo@ccg.murdoch.edu.au  

9QFAB@QCIF, Brisbane, Australia, m.thang@qfab.org

10CSIRO, Hobart, Australia, Andrew.Bissett@csiro.au



‘Metagenomics’ refers to the study of genetic material from environmental samples (e.g. soil or water), where nucleic acids are sequenced using high-throughput technology and then analysed using informatics methods to identify and quantify the complex mixture of microorganisms present in the sample. As an approach, metagenomics has been revolutionary, demonstrating that microbial abundance and diversity in the environment are many times greater than expected [1]. For example, a gram of soil typically contains around 10,000 species of bacteria and 1,000,000,000 individuals. The application of metagenomics to environmental studies also suggests that microorganisms are fundamental to ecosystem health, mediating biogeochemical and nutrient cycling and thereby influencing crop and livestock production and mitigating waste/pollution.

To begin developing a continental-scale map of Australian environmental microbial communities (i.e., to document which microbes are present in the environment across the country), Bioplatforms Australia (BPA) and partners have formed the Australian Microbiome consortium [2] and jointly invested over $10M in the collection of thousands of soil, inland water and marine water samples across Australia and its territories, the extraction of DNA from these samples, and the production of primary sequence data. Additionally, robust and standardised data analysis pipelines have been developed which produce primary-level derived data in the form of large gene abundance data tables: one table format lists counts for each of the ~2,000,000 specific sequence tags in each of the ~5,000 samples, and another acts as a key identifying the closest related taxonomic grouping (e.g., species, genus etc.) for each sequence tag. The consortium has also developed a data repository [2] to house the raw sequence data, derived data, and contextual metadata for each collection site and event (i.e., geolocation, time, depth, environment type, chemistry etc.).
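As a rough illustration of how the two table formats relate, the following minimal Python sketch joins a toy abundance table to a toy taxonomy key and sums counts per taxon. The tag names, sample names, taxa and the `counts_by_taxon` helper are all hypothetical and chosen only for illustration; the consortium's actual table layouts may differ.

```python
from collections import defaultdict

# Hypothetical toy abundance table: abundance[tag][sample] = read count.
abundance = {
    "tag_0001": {"soil_A": 120, "soil_B": 0},
    "tag_0002": {"soil_A": 30, "soil_B": 45},
    "tag_0003": {"soil_A": 0, "soil_B": 5},
}

# Hypothetical taxonomy key: closest taxonomic grouping for each tag.
taxonomy = {
    "tag_0001": "Bacillus",
    "tag_0002": "Bacillus",
    "tag_0003": "Pseudomonas",
}


def counts_by_taxon(abundance, taxonomy):
    """Sum per-sample counts over all sequence tags assigned to each taxon."""
    totals = defaultdict(lambda: defaultdict(int))
    for tag, per_sample in abundance.items():
        # Tags missing from the key are grouped under a placeholder label.
        taxon = taxonomy.get(tag, "unclassified")
        for sample, count in per_sample.items():
            totals[taxon][sample] += count
    return {taxon: dict(samples) for taxon, samples in totals.items()}


genus_counts = counts_by_taxon(abundance, taxonomy)
```

At the real scale (~2,000,000 tags by ~5,000 samples) a sparse representation would be needed rather than nested dictionaries; the sketch only shows the relationship between the two tables.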

While great progress has been made in both collection of the data and production of the primary-level derived tabular data, multiple challenges remain in making these data accessible to the many environmental researchers who need to perform ‘secondary’ level analysis (for example, statistical analyses such as normalisation, alpha- and beta-diversity, taxonomic binning, serial group comparisons and correlations) and who need access to extensive visualisation outputs in order to interpret the results.
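To make a few of these secondary-level terms concrete, here is a minimal Python sketch of total-sum normalisation, one common alpha-diversity measure (the Shannon index) and one common beta-diversity measure (Bray-Curtis dissimilarity). The choice of these particular measures and the function names are assumptions made for illustration; they are not necessarily the ones used by the analysis packages described in this abstract.

```python
import math


def relative_abundance(counts):
    """Total-sum normalisation: convert raw counts to proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def shannon_index(counts):
    """Alpha diversity within one sample: H = -sum(p * ln p) over p > 0."""
    return -sum(p * math.log(p) for p in relative_abundance(counts) if p > 0)


def bray_curtis(counts_a, counts_b):
    """Beta diversity between two samples: 1 - 2 * sum(min) / (sum_a + sum_b)."""
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must cover the same taxa in the same order")
    total = sum(counts_a) + sum(counts_b)
    if total == 0:
        raise ValueError("both samples are empty")
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1 - 2 * shared / total
```

Two samples with identical counts give a Bray-Curtis dissimilarity of 0, and two samples with no taxa in common give 1. In practice, counts are often normalised or rarefied before samples are compared, since differences in sequencing depth otherwise inflate the dissimilarity.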

In late 2017, BPA acted on behalf of the Australian Microbiome community to attract funding from Nectar/ANDS/RDS under the Research Data Cloud (RDC) program [3] to establish a cloud-based system to address these challenges, especially for researchers without dedicated informatics resources at their disposal. This presentation will outline the cloud-based analysis system established.


We have developed a web-accessible system that allows all Australian environmental metagenomics researchers (whether within or outside the Australian Microbiome consortium) to undertake a wide range of bioinformatics-based metagenomics analyses through their web browser, ranging from the initial primary-level molecular aspects for taxonomic identification through to secondary-level microbial community analysis.

The framework has been implemented by extending and connecting two well-established NCRIS-funded national computational infrastructure components, the BPA Data Portal [4] and the Galaxy-Australia service [5] (which is part of the Genomics Virtual Laboratory [6]):

  • Extensions to the BPA Data Portal
    • Implementation of support for the discipline standard BIOM (BIological Observation Matrix) format [7],
    • Improvements to increase the Findability, Accessibility, Interoperability and Reusability of datasets in the portal (e.g. adding data licences, data persistence policy, citation requirements),
    • Contributing to the extension of international/national ontologies and publishing these in vocabulary repositories where appropriate (this activity will be ongoing at the time of presentation).
  • Extensions to Galaxy-Australia
    • Installation of the QIIME [8] and Mothur [9] molecular metagenomics analysis suites on the Galaxy-Australia service for primary-level analysis,
    • Wrapping of the Rhea [10] and Phyloseq [11] R-based microbial community analysis packages (for secondary-level analyses) for use in Galaxy; deposition into the global Galaxy-Toolshed [12] for subsequent installation on any Galaxy instance; and installation on the Galaxy-Australia service.
  • Methods to move data between the BPA Data Portal and Galaxy-Australia
    • Through implementation of a Galaxy API [13] on Galaxy-Australia, a CKAN API [14] on the BPA Data Portal, and a mechanism for individual users to call each API from within the other system when required.
  • Training on the above – due for delivery end-November 2018
    • Development of self-paced online training material – to be available via Galaxy-Australia and the EcoED Ecoscience training portal [15],
    • Delivery of one 3-hour hands-on workshop across Australia utilising the EMBL-ABR ‘Hybrid’ method of delivery [16].
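As a sketch of the portal side of the data movement described in the list above, the Python below builds a request URL for CKAN's `package_search` action and extracts resource download URLs from a response in CKAN's standard action-API shape. The portal address comes from reference [4]; the search term, the sample response, the `example.org` URLs and both helper functions are hypothetical, and no network request is made. How the resulting URLs are handed to Galaxy-Australia is not shown here.

```python
import json
from urllib.parse import urlencode, urljoin

PORTAL = "https://data.bioplatforms.com/"  # BPA Data Portal, reference [4]


def package_search_url(base_url, query, rows=10):
    """Build the URL for CKAN's package_search action (not fetched here)."""
    params = urlencode({"q": query, "rows": rows})
    return urljoin(base_url, "api/3/action/package_search") + "?" + params


def resource_urls(response_text, wanted_format=None):
    """Collect resource URLs from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported a failed request")
    urls = []
    for package in payload["result"]["results"]:
        for resource in package.get("resources", []):
            fmt = (resource.get("format") or "").lower()
            if wanted_format is None or fmt == wanted_format.lower():
                urls.append(resource["url"])
    return urls


# Hypothetical response, trimmed to the fields used above.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{
            "name": "example-dataset",
            "resources": [
                {"url": "https://example.org/otu_table.biom", "format": "BIOM"},
                {"url": "https://example.org/reads.fastq.gz", "format": "FASTQ"},
            ],
        }],
    },
})

search_url = package_search_url(PORTAL, "soil", rows=5)
biom_urls = resource_urls(SAMPLE_RESPONSE, wanted_format="biom")
```

Access to restricted datasets would normally also require an API key sent with the request, which is omitted from this sketch.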

The project has maintained extensive, ongoing and transparent engagement with a wide range of stakeholders with varying interests and challenges in metagenomics production, distribution and use. This has been undertaken via a series of face-to-face stakeholder engagement events at locations across Australia (which have significant numbers of groups associated with the Australian Microbiome consortium), and through the use of a project blog [17], and a public Trello board which lists user requirements and tracks development sprints [18].


The cloud-based system we have developed by leveraging previous NCRIS-supported research data infrastructure represents an Australian first for end-to-end analysis and interpretation of environmental metagenomics data. A wide range of users is supported, including, critically, users who are not molecularly aware but need to interpret molecular-based metagenomics data in the context of species occurrence records in the environment.


  1. Tringe, S. G., and E. M. Rubin, Metagenomics: DNA sequencing of environmental samples. Nature Reviews Genetics, 2005. 6: 805-814.
  2. Bioplatforms Australia (BPA) Australian Microbiome Project – https://data.bioplatforms.com/organization/about/australian-microbiome
  3. ANDS-Nectar-RDS Research Data Cloud (RDC) Program – https://www.ands-nectar-rds.org.au/researchdomainprogram
  4. BPA Data Portal – https://data.bioplatforms.com/
  5. Galaxy-Australia – https://usegalaxy.org.au
  6. Genomics Virtual Laboratory (GVL) – https://www.gvl.org.au
  7. BIOM format – http://biom-format.org
  8. QIIME (Quantitative Insights Into Microbial Ecology) software – http://qiime.org
  9. Mothur (software for describing and comparing microbial communities) – https://mothur.org/
  10. Rhea (a set of R scripts for the analysis of microbial profiles) – https://lagkouvardos.github.io/Rhea/
  11. Phyloseq (a set of R scripts for the analysis of microbiome census data) – https://joey711.github.io/phyloseq/
  12. Galaxy Toolshed – https://toolshed.g2.bx.psu.edu
  13. Galaxy API – https://galaxyproject.org/develop/api/
  14. CKAN API – http://docs.ckan.org/en/ckan-2.7.3/api/
  15. EcoED – http://ecoed.org.au
  16. EMBL-ABR Hybrid Training Delivery Method – https://www.embl-abr.org.au/wp-content/uploads/2017/12/Monica2017.pdf
  17. Project blog – https://bioscience-rdc.blogspot.com.au
  18. Project Trello Board – https://trello.com/b/qsmrSuPC/rdc-development


Jeff has a PhD in Biochemistry from the University of Queensland, and started his career conducting research in the fields of cancer, molecular genetics and embryo development in both Australia and the UK, prior to moving into the management of large biological data assets (gene sequence, images, etc.) through the establishment of EMAGE, a UK-based international database of gene expression and anatomy.

Prior to joining QCIF and RCC, Jeff was based at Intersect Australia in Sydney where he was the National Manager of the RDS-funded med.data.edu.au project and also responsible for a number of biology-focused data and IT-related projects across NSW (biobanking, omics, etc.). Prior to this, he was based in Melbourne at the Australian National Data Service (ANDS), where he was involved in commissioning and monitoring a number of biology/medicine-focused national data management projects.
