Adam Hunter1, Grahame Bowland2, Samuel Chang3, Tamas Szabo4, Kathryn Napier5, Mabel Lum6, Anna MacDonald7, Jason Koval8, Anna Fitzgerald9, Matthew Bellgard10
1Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, firstname.lastname@example.org
2Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, email@example.com
3Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, firstname.lastname@example.org
4Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, email@example.com
5Curtin Institute for Computation, Curtin University, Bentley, Australia, firstname.lastname@example.org
6Bioplatforms Australia, Sydney, Australia, email@example.com
7Centre for Biodiversity Analysis, Australian National University, Canberra, Australia, firstname.lastname@example.org
8Ramaciotti Centre for Genomics, University of New South Wales, Sydney, Australia, email@example.com
9Bioplatforms Australia, Sydney, Australia, firstname.lastname@example.org
10Office of eResearch, Queensland University of Technology, Brisbane, Australia, email@example.com
Innovative life science research requires access to state-of-the-art infrastructure, ideally developed through a strategic investment plan that promotes technology development and builds expertise for the benefit of all Australian researchers. Bioplatforms Australia enables innovation and collaboration in life science research by investing in world-class infrastructure and associated expertise in molecular platforms and informatics, such as genomics, proteomics, metabolomics, and bioinformatics. Through collaborative research projects, Bioplatforms Australia creates open-data initiatives that build critical ‘omics datasets that support scientific challenges of national significance [1].
Investment funding for Bioplatforms Australia has been provided through the Commonwealth Government National Collaborative Research Infrastructure Strategy (NCRIS) with co-investments made by State Governments, research institutions and commercial entities. Infrastructure investments are hosted across Australia by a network of leading universities and research institutions, which ensures broad access through contracted services and research collaborations.
To date, Bioplatforms Australia has invested in nine collaborative open-data projects to generate biological datasets of national significance, such as the Australian Microbiome Database [2, 3], the Oz Mammals Genomics Initiative [4], and Antibiotic Resistant Sepsis Pathogens [5]. The collective aims of these open-data projects are to: i) integrate scientific infrastructure with researchers; ii) build new data resources that capture essential metadata and integrate generated ‘omics data with other scientific data; iii) encourage, promote and facilitate multi-institutional, cross-discipline collaboration; iv) leverage co-investment from scientific, government, philanthropic and commercial partners; and v) enable participation in, and proactive engagement with, international research consortia.
While the collaborative open-data projects are aligned with national research priorities that seek to improve Australia’s health and well-being, the datasets are contributing to increasing knowledge on issues of global significance. For example, the Antibiotic Resistant Sepsis Pathogens project brings together multidisciplinary teams to identify common pathogenic pathways in order to ultimately develop new approaches to disease management. The appropriate management of such research data is therefore of critical importance to ensure the data remains a valuable asset.
A sustainable and scalable digital platform was needed to manage researcher and public access to raw and analysed data and associated contextual metadata from numerous collaborative open-data projects, and to bring research communities together. The Bioplatforms Australia Data Portal (‘Data Portal’) was therefore created through a collaboration between Bioplatforms Australia and the Centre for Comparative Genomics at Murdoch University. The Data Portal provides online access to datasets and associated metadata, empowers research communities to curate and manage the data, and is built upon open-source, best-of-breed technology using international standards.
Development of the Bioplatforms Australia Data Portal
The Data Portal is a data archive repository that houses raw sequence data, analysed data, and associated contextual metadata for each of the nine collaborative open-data projects. In developing the Data Portal, we identified two key criteria for deploying a sustainable and scalable digital platform that is applicable to a broad community of users: i) open-source software adopting leading technology; and ii) purposeful application of data and adoption of the FAIR data principles [6].
Open-source software adopting CKAN
The Data Portal was originally developed as bespoke software and deployed on traditional, on-premises data storage systems. However, developing a bespoke system is generally time-consuming and expensive, and is not sustainable in the long term. To leverage other national investments and ensure long-term sustainability, the Data Portal adopted the Comprehensive Knowledge Archive Network (CKAN) as the core technology to replace the bespoke software code [7]. CKAN is the world’s leading open-source data portal platform, and is used by numerous governments and public institutions to share data with the general public, including the Australian federal government [8] and Western Australian government [9] data portals, and the United Kingdom’s Natural History Museum data portal [10]. As the on-premises data storage reached its end of life, the Data Portal was migrated to Amazon Web Services with the support of Bioplatforms Australia. The code for the Data Portal and associated tools, such as extensions to the CKAN project, is open-source [11].
Purposeful application of data and adoption of FAIR data principles
The Data Portal provides researchers with access to data and associated contextual metadata. Metadata is necessarily in a state of flux, from collection by field scientists through to PCR and sequencing, and needs to be updated in a reproducible manner from authorised data services. Standards have been developed and adopted for both data and processes in order to automatically ingest large amounts of sequencing data and associated metadata. The Data Portal has also established robust functionality with regard to the FAIR data principles of Findable, Accessible, Interoperable, and Reusable [6]. For example, all data in the Data Portal can be accessed via its identifier, using a standardised, open and documented API, subject to authentication. Bulk data access, which allows researchers to download data en masse according to user-defined search terms, is also available. Established international and national ontologies are also used in the databases.
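As an illustration of identifier-based and search-based access, the following is a minimal sketch that assumes the Data Portal exposes the standard CKAN Action API (`package_show` and `package_search`) under the portal address given in the reference list. The helper names, the dataset identifier and the API key shown here are hypothetical, and the portal's actual endpoints and authentication scheme may differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL, taken from the portal links in the reference list.
PORTAL = "https://data.bioplatforms.com"


def action_request(action, params, api_key=None, base=PORTAL):
    """Build (but do not send) a GET request for a CKAN Action API call."""
    url = f"{base}/api/3/action/{action}?{urlencode(params)}"
    # CKAN accepts an API key in the Authorization header.
    headers = {"Authorization": api_key} if api_key else {}
    return Request(url, headers=headers)


def unwrap(body):
    """Return the 'result' field of a CKAN JSON response, or raise on failure."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError(payload.get("error", "CKAN call failed"))
    return payload["result"]


def resource_urls(package):
    """List the download URLs of the resources attached to one dataset."""
    return [r["url"] for r in package.get("resources", []) if r.get("url")]


def search_pages(query, total, rows=100):
    """Yield package_search parameter sets that page through `total` hits."""
    for start in range(0, total, rows):
        yield {"q": query, "rows": rows, "start": start}
```

A single dataset could then be requested with `action_request("package_show", {"id": "example-dataset-id"}, api_key="MY-KEY")`, sent with `urllib.request.urlopen`, and the response body passed through `unwrap` and `resource_urls`. For bulk access, each parameter set from `search_pages` would be sent as a `package_search` call. Keeping the request construction separate from the network call makes the sketch easy to check without contacting the portal.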
To date, the Data Portal has directly managed the ingestion of tens of thousands of samples constituting over 60 terabytes of data. Bioplatforms Australia enables a broad scope of research endeavours through investment in nationally collaborative programs that fund the building of new datasets and ultimately offer them as a public resource. By employing large-scale consortia to build and analyse datasets, existing academic and end-user knowledge is combined with leading ‘omics capabilities to create distinctive sample collections of national and international importance. The Bioplatforms Australia Data Portal, built upon open-source, best-of-breed technology that is also deployed by numerous governments and organisations worldwide [12], enables effective management of and access to these valuable data resources to ensure their perpetual value.
1. Bioplatforms Australia. Available from: http://www.bioplatforms.com/what-we-do/, accessed 4 June 2018.
2. Australian Microbiome. Available from: https://data.bioplatforms.com/organization/about/australian-microbiome, accessed 4 June 2018.
3. Bissett, A., et al. Introducing BASE: the Biomes of Australian Soil Environments soil microbial diversity database. GigaScience, 2016. 5(1): p. 21.
4. Oz Mammals Genomics Initiative. Available from: https://data.bioplatforms.com/organization/about/bpa-omg, accessed 3 June 2018.
5. Antibiotic Resistant Sepsis Pathogens. Available from: https://data.bioplatforms.com/organization/about/bpa-sepsis, accessed 3 June 2018.
6. Wilkinson, M.D., et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 2016. 3: p. 160018.
7. CKAN. Available from: https://ckan.org/, accessed 4 June 2018.
8. Australian Government Data Portal. Available from: https://data.gov.au/, accessed 3 June 2018.
9. Western Australian Government Data Portal. Available from: https://data.wa.gov.au/, accessed 3 June 2018.
10. Natural History Museum Data Portal. Available from: http://data.nhm.ac.uk/, accessed 3 June 2018.
11. Bioplatforms Australia. Available from: https://github.com/BioplatformsAustralia, accessed 3 June 2018.
12. CKAN. Available from: https://ckan.org/about/instances/, accessed 4 June 2018.
Grahame Bowland is a Software Developer at the Centre for Comparative Genomics at Murdoch University. Grahame is a senior member of the software development team that develops, deploys, and maintains eResearch software solutions such as the Bioplatforms Australia Data Portal, electronic biobank solutions, and disease registries.
Dr Kathryn Napier recently joined the Data Science team at the Curtin Institute for Computation at Curtin University. Kathryn previously worked at the Centre for Comparative Genomics (CCG) at Murdoch University as a Research Associate in the areas of Bioinformatics and Health Informatics. Kathryn worked with the CCG’s software development team, which develops and deploys eResearch software solutions such as disease and patient registries and the Bioplatforms Australia Data Portal.