Jeff Christiansen1, Derek Benson2, Grahame Bowland3, Samuel Chang4, Simon Gladman5, Gareth Price6, Anna Syme7, Tamas Szabo8, Mike W C Thang9, Andrew Bissett10
1QCIF and RCC-University of Queensland, Brisbane, Australia, email@example.com
2RCC-University of Queensland, Brisbane, Australia, firstname.lastname@example.org
3Centre for Comparative Genomics, Murdoch University, Perth, Australia, email@example.com
4Centre for Comparative Genomics, Murdoch University, Perth, Australia, firstname.lastname@example.org
5Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia, email@example.com
6QFAB@QCIF, Brisbane, Australia, firstname.lastname@example.org
7Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia, email@example.com
8Centre for Comparative Genomics, Murdoch University, Perth, Australia, firstname.lastname@example.org
9QFAB@QCIF, Brisbane, Australia, email@example.com
10CSIRO, Hobart, Australia, Andrew.Bissett@csiro.au
‘Metagenomics’ refers to the study of genetic material from environmental samples (e.g. soil or water), where nucleic acids are sequenced using high-throughput technology and then analysed using informatics methods to identify and quantify the complex mixture of microorganisms present in the sample. Metagenomics as an approach has been revolutionary, demonstrating that microbial abundance and diversity in the environment are many times greater than expected. For example, a gram of soil typically contains around 10,000 species of bacteria and 1,000,000,000 individuals. The application of metagenomics to environmental studies also suggests that microorganisms are fundamental to ecosystem health by mediating biogeochemical and nutrient cycling, thereby influencing crop and livestock production and mitigating waste/pollution.
To start developing a continental-scale map of Australian environmental microbial communities (i.e., to document which microbes are present in the environment across the country), Bioplatforms Australia (BPA) and partners have formed the Australian Microbiome consortium and jointly invested over $10M in the collection of thousands of soil, inland water and marine water samples across Australia and its territories, the extraction of DNA from these samples, and the production of primary sequence data. Additionally, robust and standardised data analysis pipelines have been developed that produce primary-level derived data in the form of large gene abundance data tables: one table format lists counts for each of the ~2,000,000 specific sequence tags in each of the ~5,000 samples, and another table acts as a key identifying the closest related taxonomic grouping (e.g., species or genus) for each sequence tag. The consortium has also developed a data repository to house the raw sequence data, derived data, and contextual metadata for each collection site and event (e.g., geolocation, time, depth, environment type and chemistry).
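The relationship between the two table formats described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the consortium's pipeline: the tag identifiers, sample identifiers, counts and genus names are invented, and the helper `counts_by_taxon` is a hypothetical name.

```python
# Illustrative data only: tag IDs, sample IDs, counts and taxa are invented.
# Abundance table: counts of each sequence tag in each sample.
abundance = {
    "tag_001": {"sample_A": 120, "sample_B": 0},
    "tag_002": {"sample_A": 30, "sample_B": 45},
    "tag_003": {"sample_A": 0, "sample_B": 15},
}

# Key table: closest related taxonomic grouping for each sequence tag.
taxonomy = {
    "tag_001": "Bacillus",
    "tag_002": "Pseudomonas",
    "tag_003": "Bacillus",
}


def counts_by_taxon(abundance, taxonomy):
    """Sum the per-sample counts of all sequence tags assigned to the same taxon."""
    totals = {}
    for tag, per_sample in abundance.items():
        # Tags missing from the key table are grouped as "unclassified".
        taxon = taxonomy.get(tag, "unclassified")
        taxon_totals = totals.setdefault(taxon, {})
        for sample, count in per_sample.items():
            taxon_totals[sample] = taxon_totals.get(sample, 0) + count
    return totals


genus_counts = counts_by_taxon(abundance, taxonomy)
```

At the scale quoted above (~2,000,000 tags by ~5,000 samples) most cells are zero, so a real implementation would more likely use a sparse representation than nested dictionaries.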
While great progress has been made in both collecting the data and producing the primary-level derived tabular data, multiple challenges remain in making these data accessible to many environmental researchers who need to perform ‘secondary’-level analysis, for example statistical analyses over the data (e.g., normalisation, alpha- and beta-diversity, taxonomic binning, serial group comparisons, correlations), and who need extensive visualisation outputs in order to interpret the results.
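To make three of the secondary-level analyses named above concrete, the following minimal sketch computes relative-abundance normalisation, the Shannon index (one common alpha-diversity measure) and Bray-Curtis dissimilarity (one common beta-diversity measure). The choice of these particular measures and the sample counts are assumptions made for illustration; the packages mentioned in this abstract offer many more options.

```python
import math


def relative_abundance(counts):
    """Normalise raw counts so that they sum to 1 within a sample."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return [c / total for c in counts]


def shannon_index(counts):
    """Alpha diversity: Shannon index H = -sum(p * ln p) over non-zero proportions."""
    return -sum(p * math.log(p) for p in relative_abundance(counts) if p > 0)


def bray_curtis(counts_a, counts_b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples.

    0 means identical composition, 1 means no shared taxa.
    Both lists must give counts for the same taxa in the same order.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must cover the same taxa")
    denominator = sum(counts_a) + sum(counts_b)
    if denominator == 0:
        raise ValueError("both samples are empty")
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    return numerator / denominator


# Invented counts for three taxa in two samples.
sample_a = [120, 30, 0]
sample_b = [0, 45, 15]

alpha_a = shannon_index(sample_a)
alpha_b = shannon_index(sample_b)
beta_ab = bray_curtis(sample_a, sample_b)
```

Bray-Curtis is sensitive to differences in sequencing depth between samples, which is one reason normalisation is usually applied before comparing samples.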
In late 2017, BPA acted on behalf of the Australian Microbiome community to attract funding from Nectar/ANDS/RDS under the Research Data Cloud (RDC) program to establish a cloud-based system to address these challenges, especially for researchers without dedicated informatics resources at their disposal. This presentation will outline the resulting cloud-based analysis system.
We have developed a web-accessible system that supports all Australian environmental metagenomics researchers (whether within or outside the Australian Microbiome consortium) in undertaking a wide range of bioinformatics-based metagenomics analyses through their web browser, ranging from the initial primary-level molecular aspects of taxonomic identification through to secondary-level microbial community analysis.
The framework has been implemented by extending and connecting two well-established NCRIS-funded national computational infrastructure components: the BPA Data Portal and the Galaxy-Australia service (which is part of the Genomics Virtual Laboratory):
- Extensions to the BPA Data Portal
  - Implementation of support for the discipline-standard BIOM (BIological Observation Matrix) format,
  - Improvements to increase the Findability, Accessibility, Interoperability and Reusability of datasets in the portal (e.g., adding data licences, a data persistence policy and citation requirements),
  - Contributing to the extension of international/national ontologies and publishing these in vocabulary repositories where appropriate (this activity will be ongoing at the time of presentation).
- Extensions to Galaxy-Australia
  - Installation of the QIIME and Mothur molecular metagenomics analysis suites on the Galaxy-Australia service for primary-level analysis,
  - Wrapping of the Rhea and Phyloseq R-based microbial community analysis packages (for secondary-level analyses) for use in Galaxy; deposition into the global Galaxy Toolshed for subsequent installation on any Galaxy instance; and installation on the Galaxy-Australia service.
- Methods to move data between the BPA Data Portal and Galaxy-Australia
  - Through implementation of a Galaxy API on Galaxy-Australia, a CKAN API on the BPA Data Portal, and a mechanism for individual users to call each API from within the other system when required.
- Training on the above – due for delivery end-November 2018
  - Development of self-paced online training material, to be available via Galaxy-Australia and the EcoED Ecoscience training portal,
  - Delivery of one 3-hour hands-on workshop across Australia utilising the EMBL-ABR ‘Hybrid’ method of delivery.
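As a rough illustration of how data held in a CKAN-based portal might be located for transfer to an analysis service, the sketch below builds a request for CKAN's standard `package_search` action and picks resource URLs out of a response of the usual CKAN shape. It is not the project's actual transfer mechanism. Assumptions: that the portal exposes the action API under `/api/3/action/`, and that BIOM files carry the format label "BIOM"; the helper names and the sample response are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Portal address as given in the reference list of this abstract.
PORTAL_URL = "https://data.bioplatforms.com"


def build_package_search_url(portal_url, query, rows=10):
    """Build a URL for CKAN's package_search action (assumed to be enabled)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_url.rstrip('/')}/api/3/action/package_search?{params}"


def resource_urls_by_format(response, wanted_format):
    """Collect resource URLs with the given format label from a search response."""
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    urls = []
    for package in response["result"]["results"]:
        for resource in package.get("resources", []):
            # The format field may be missing or null in CKAN metadata.
            label = resource.get("format") or ""
            if label.upper() == wanted_format.upper():
                urls.append(resource["url"])
    return urls


# Hypothetical response, shaped like a CKAN package_search result.
sample_response = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {
                "name": "example-dataset",
                "resources": [
                    {"format": "BIOM", "url": "https://example.org/table.biom"},
                    {"format": "CSV", "url": "https://example.org/context.csv"},
                    {"format": None, "url": "https://example.org/readme"},
                ],
            }
        ],
    },
}

search_url = build_package_search_url(PORTAL_URL, "soil amplicon", rows=5)
biom_urls = resource_urls_by_format(sample_response, "biom")
```

The resulting URLs could then be handed to a Galaxy instance through its API, for example with a Galaxy API client library, so that the file is fetched server-side rather than downloaded to the user's machine first.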
The project has maintained extensive, ongoing and transparent engagement with a wide range of stakeholders with varying interests and challenges in metagenomics production, distribution and use. This has been undertaken via a series of face-to-face stakeholder engagement events at locations across Australia that have significant numbers of groups associated with the Australian Microbiome consortium, and through the use of a project blog and a public Trello board which lists user requirements and tracks development sprints.
The cloud-based system we have developed by leveraging previous NCRIS-supported research data infrastructure represents an Australian first for end-to-end analysis and interpretation of environmental metagenomics data. A wide range of users is supported, including, critically, users who are not molecularly aware but need to interpret molecular-based metagenomics data in the context of species occurrence records in the environment.
- Green-Tringe, S., and E. M. Rubin, Metagenomics: DNA sequencing of environmental samples. Nature Reviews Genetics, 2005. 6: 805-814.
- Bioplatforms Australia (BPA) Australian Microbiome Project – https://data.bioplatforms.com/organization/about/australian-microbiome
- ANDS-Nectar-RDS Research Data Cloud (RDC) Program – https://www.ands-nectar-rds.org.au/researchdomainprogram
- BPA Data Portal – https://data.bioplatforms.com/
- Galaxy-Australia – https://usegalaxy.org.au
- Genomics Virtual Laboratory (GVL) – https://www.gvl.org.au
- BIOM format – http://biom-format.org
- QIIME (Quantitative Insights Into Microbial Ecology) software – http://qiime.org
- Mothur (software for describing and comparing microbial communities) – https://mothur.org/
- Rhea (a set of R scripts for the analysis of microbial profiles) – https://lagkouvardos.github.io/Rhea/
- Phyloseq (a set of R scripts for the analysis of microbiome census data) – https://joey711.github.io/phyloseq/
- Galaxy Toolshed – https://toolshed.g2.bx.psu.edu
- Galaxy API – https://galaxyproject.org/develop/api/
- CKAN API – http://docs.ckan.org/en/ckan-2.7.3/api/
- EcoED – http://ecoed.org.au
- EMBL-ABR Hybrid Training Delivery Method – https://www.embl-abr.org.au/wp-content/uploads/2017/12/Monica2017.pdf
- Project blog – https://bioscience-rdc.blogspot.com.au
- Project Trello Board – https://trello.com/b/qsmrSuPC/rdc-development
Jeff has a PhD in Biochemistry from the University of Queensland, and started his career conducting research in the fields of cancer, molecular genetics and embryo development in both Australia and the UK, prior to moving into the management of large biological data assets (gene sequence, images, etc.) through the establishment of EMAGE, a UK-based international database of gene expression and anatomy.
Prior to joining QCIF and RCC, Jeff was based at Intersect Australia in Sydney where he was the National Manager of the RDS-funded med.data.edu.au project and also responsible for a number of biology-focused data and IT-related projects across NSW (biobanking, omics, etc.). Prior to this, he was based in Melbourne at the Australian National Data Service (ANDS), where he was involved in commissioning and monitoring a number of biology/medicine-focused national data management projects.