Research Data Stories: Struggles and Successes in Molecular Bioscience

Mr Steven Androulakis1, Ms Anna Fitzgerald2, Dr Andrew Treloar3, Dr Michelle Barker4, Lien Le5

1ANDS, NeCTAR, RDS, Parkville, Australia, steve.androulakis@nectar.org.au
2Bioplatforms Australia, Sydney, Australia, afitzgerald@bioplatforms.com
3ANDS, Caulfield, Australia, andrew.treloar@ands.org.au
4NeCTAR, Parkville, Australia, michelle.barker@nectar.org.au
5RDS, St Lucia, Australia

DESCRIPTION

What do researchers want, anyway?

The experience of many researchers is too often one of time-consuming struggle: installing tools, fixing code, integrating data, uncovering inconsistencies and understanding complex formats. How can eResearch practitioners help?

While eInfrastructure projects such as ANDS, Nectar, and RDS are helping to ease collaboration and analysis, there’s so much more to be done.

This BOF presents reflections of molecular bioscience researchers to the eResearch Australasia audience to:

  • Provide the eResearch audience with a direct perspective from researchers of working with research data and software tools
  • Present and discuss patterns of success
  • Seed deeper engagement between researchers and eResearch practitioners going forward
  • Help the eResearch audience conceptually link data and tool activity in the molecular bioscience domain with other domains

Researchers will share stories of opportunity, success, and frustration in working with and understanding data and associated tools. They will offer suggestions of where eInfrastructure could help them be more efficient in their science, and invite the audience to provide their perspectives on addressing their most pressing issues.

Researchers giving presentations will be from a diverse set of scientific and organisational backgrounds, and at varying stages of their careers.

Duration: 60 minutes

Format: Short presentations, followed by a group discussion


 

Biography

Steve Androulakis is responsible for facilitating the strategic development and operational implementation of research community engagement at the Nectar, ANDS and RDS projects. In particular, Steve will support the ANDS, Nectar and RDS Directorates in providing management across a broad portfolio of projects focused on the development and delivery of community building and engagement programs for domain, technical and method stakeholders in research communities.

Materials Data Facility: Enabling Data-Driven Materials Discovery

Dr Ian Foster1, Ben Blaiszik1, Logan Ward1

1Argonne National Laboratory & University Of Chicago, Chicago, United States

 

OVERVIEW

The Materials Data Facility (MDF: materialsdatafacility.org) [1] is a set of data services built to support materials science researchers. MDF consists of two synergistic services, data publication and data discovery. Its data publication service offers a scalable repository where materials scientists can publish, preserve, and share research data. Its repository provides a focal point for the materials community, enabling publication and discovery of materials data of all sizes. Its discovery service indexes data from many different public sources, enabling rapid discovery of data regardless of location and integrated analysis of data from multiple sources. Our goal in presenting MDF in this context is to solicit feedback and to seek collaborations within the Australian materials research community.

CONTEXT AND GOALS

Scientific researchers are increasingly often not data constrained, but rather limited by their ability to integrate and act on data: i.e., to analyze, comprehend, synthesize and combine, track, share, model, and mine myriad data sources to derive new insights and technical knowledge. This shift is now particularly apparent in materials science, where scientists are generating vast amounts of computational and experimental data from a wide set of user facilities (e.g., light sources), from simulations at supercomputing centers, from individual research labs, and from high-throughput experiments. We are developing MDF to both (a) provide ready access to these large quantities of often untapped data and (b) enable the ready application of new analysis methods, such as deep learning, to guide and, indeed, lead discovery.

Figure 1: MDF schematic, as described in text.

 

ARCHITECTURE

Figure 1 provides a summary of the MDF service ecosystem, showing how the MDF data publication (DPS) and data discovery (DDS) services are connected, and the actions users can perform in these services. Using MDF services, researchers can publish data to the DPS as bundles of data and metadata, leveraging distributed storage across endpoints. When a data publication is added, the metadata is automatically synced with the discovery service and deep indexing of materials-specific file contents also occurs. Users may query, browse, and aggregate data and metadata from the DDS through a web UI or through the API. Importantly, we are also investigating the harvesting and deep indexing of datasets external to the MDF ecosystem to bootstrap the index with scientifically relevant data.
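The publish-then-index flow described above can be sketched as a small in-memory model. This is an illustrative sketch only, not the MDF implementation or its API: the class names (`Publication`, `PublicationService`, `DiscoveryIndex`), their fields, and the notion that "deep indexing" records just a file's name and extension are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Publication:
    """A published bundle: data files plus descriptive metadata (hypothetical)."""
    title: str
    endpoint: str                       # storage endpoint holding the files
    files: list[str]
    metadata: dict[str, str] = field(default_factory=dict)


class DiscoveryIndex:
    """Toy stand-in for the data discovery service (DDS)."""

    def __init__(self) -> None:
        self.records: list[dict[str, str]] = []

    def add(self, record: dict[str, str]) -> None:
        self.records.append(record)

    def query(self, **criteria: str) -> list[dict[str, str]]:
        # Return the records whose fields match every given criterion.
        return [r for r in self.records
                if all(r.get(k) == v for k, v in criteria.items())]


class PublicationService:
    """Toy stand-in for the data publication service (DPS)."""

    def __init__(self, index: DiscoveryIndex) -> None:
        self.index = index
        self.publications: list[Publication] = []

    def publish(self, pub: Publication) -> None:
        self.publications.append(pub)
        # Sync the bundle's metadata to the discovery index on publication.
        self.index.add({"title": pub.title, "endpoint": pub.endpoint,
                        **pub.metadata})
        # Assumed "deep indexing": one record per file, holding only its name
        # and extension. A real indexer would parse the file contents.
        for name in pub.files:
            self.index.add({"title": pub.title, "file": name,
                            "format": name.rsplit(".", 1)[-1]})
```

Keeping the index behind its own class mirrors the separation between the two services: records could equally be added by a harvester of external datasets without going through publication.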

STATUS

As of June 2017 MDF has been used to publish around 11 TB of materials science data from a variety of experimental and simulation studies. The data discovery service has indexed data from more than 50 other repositories and datasets comprising 200 TB, for a total of more than 1.8M records. Leveraging indexing and search capabilities provided by the Globus cloud service, MDF supports powerful faceted search. API access makes it easy to develop applications that query and analyze MDF content, for example to combine data from multiple sources to train machine learning models, and to implement “bots” that query, analyze, and dynamically update MDF content.

Figure 2 shows an example of MDF in action. A researcher is looking for data about nearly stable compounds, as determined by computational results in the Open Quantum Materials Database (OQMD) [2]. As these data are indexed in MDF, it is straightforward to write a few lines of Python to first query for, and then download, the desired data.
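The query-then-download pattern can be illustrated offline with a short sketch. It deliberately does not use the MDF client or API, whose calls are not given in this abstract. The record fields (`source`, `composition`, `above_hull_ev`, `url`), the placeholder records, and the reading of "nearly stable" as "above the convex hull by no more than a small energy threshold" are all assumptions made for the example.

```python
from typing import Any

Record = dict[str, Any]

# Placeholder records: compositions, energies and URLs are invented for
# illustration and do not come from OQMD or MDF.
RECORDS: list[Record] = [
    {"source": "oqmd", "composition": "A2B", "above_hull_ev": 0.00,
     "url": "https://example.org/a2b"},
    {"source": "oqmd", "composition": "AB3", "above_hull_ev": 0.02,
     "url": "https://example.org/ab3"},
    {"source": "oqmd", "composition": "AB", "above_hull_ev": 0.40,
     "url": "https://example.org/ab"},
    {"source": "other", "composition": "A3B", "above_hull_ev": 0.01,
     "url": "https://example.org/a3b"},
]


def nearly_stable(records: list[Record], source: str,
                  max_above_hull_ev: float) -> list[Record]:
    """Step 1 (query): records from `source` that are not on the hull
    but lie within `max_above_hull_ev` of it (assumed definition)."""
    return [r for r in records
            if r["source"] == source
            and 0.0 < r["above_hull_ev"] <= max_above_hull_ev]


def download_plan(records: list[Record]) -> list[str]:
    """Step 2 (download): collect the URLs a downloader would fetch."""
    return [r["url"] for r in records]


hits = nearly_stable(RECORDS, source="oqmd", max_above_hull_ev=0.05)
urls = download_plan(hits)
```

Separating the query from the download keeps the filter cheap to refine before any data is transferred.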

 

ACKNOWLEDGMENTS

This research was supported in part by NIST as part of the CHiMaD project funded by the U.S. Department of Commerce, National Institute of Standards and Technology, under financial assistance Award Number 70NANB14H012, and the U.S. Department of Energy under Contract DE-AC02-06CH11357. We are grateful to our partners at the National Center for Supercomputing Applications, CHiMaD, and NIST for their assistance with this project.

REFERENCES

  1. Blaiszik, B., K. Chard, J. Pruyne, R. Ananthakrishnan, S. Tuecke, and I. Foster. “The Materials Data Facility: Data services to advance materials science research.” JOM 68, no. 8 (2016): 2045-2052.
  2. Saal, J., S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton. “Materials design and discovery with high-throughput density functional theory: the open quantum materials database (OQMD).” JOM 65, no. 11 (2013): 1501.

omics.data.edu.au: a powerful and adaptable multi-omic data integration, management and analysis framework

Dr Jeff Christiansen1, Shilo Banihit2, Dr Xin-Yi Chua3, Thom Cuddihy3, Dr Dominique Gorse3, Simon Gladman4, Dr Andrew Isaac4, Dr Neil Killeen5, Wei (Wilson) Liu5, Dr Steven Manos5, Sara Ogston5, Nick Rhodes3, A/Prof Torsten Seemann4, Dr Anna Syme4, Dr Mike Thang3, Koula Tsiaplias5, Nigel Ward1, Dr Mabel Lum6, A/Prof Andrew Lonie4

1 QCIF and RCC-University of Queensland, Brisbane, Australia, jeff.christiansen@qcif.edu.au, nigel.ward@qcif.edu.au

2 QCIF and Queensland University of Technology, Brisbane, Australia, shilo.banihit@qut.edu.au

3 QFAB@QCIF and RCC-University of Queensland, Brisbane, Australia, xinyi.chua@qfab.org, t.cuddihy1@uq.edu.au, d.gorse@qfab.org, n.rhodes@qfab.org, m.thang@qfab.org

4 Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia, simon.gladman@unimelb.edu.au, aisaac@unimelb.edu.au, t.seemann@unimelb.edu.au, anna.syme@unimelb.edu.au, alonie@unimelb.edu.au

5 VicNode and Research Platform Services, University of Melbourne, Melbourne, Australia, nkilleen@unimelb.edu.au, wliu5@unimelb.edu.au, smanos@unimelb.edu.au, k.tsiaplias@unimelb.edu.au, sogston@unimelb.edu.au

6 Bioplatforms Australia, Sydney, Australia, mlum@bioplatforms.com

BACKGROUND

To understand all functions that occur within a biological system (e.g. a cell or organism) under different environmental or experimental conditions, a global profiling and analysis of all the biomolecular players in that system under the different conditions is required.

Over the past few decades, rapid technological advances in molecular profiling techniques of biological systems have made it possible for the biomolecular repertoire in such samples to be comprehensively characterised. This includes the genome (DNA which instructs the cell how to behave and found in all cells); transcriptome (different mRNAs that are copied from the DNA, whose presence and amounts are specific to the system and condition being examined); proteome (proteins that are formed according to the instructions in these mRNAs); and metabolome (small molecules produced by the organism or obtained from external sources and associated with processes such as metabolism).

Despite the technical ability to undertake such global profiling, integrating and making sense of these different ‘-omics’ data types remains very challenging for researchers. The challenge is both conceptual (integrating the information) and logistical (managing and analysing the data): there is a lack of integrated and accessible storage, compute, software methods, tools and workflows that enable the integrative analysis of such data [1].

In late 2015, VicNode, Intersect, QCIF/QFAB and Melbourne Bioinformatics embarked on a collaborative project funded through the NCRIS Research Data Services (RDS) Food and Health Flagship program [2] to bring together a team with a broad skill set (across data management, biological metadata standards, interoperability, bioinformatics tool development, training and research systems hosting) to develop omics.data.edu.au – a cloud-based system to address these challenges.

In the first phase of its funding and development, the system has been built to accommodate data from bacterial pathogens for a specific research consortium: the Bioplatforms Australia (BPA) coordinated Antibiotic Resistant Pathogens Initiative (ABPRI) [3], whose members range from microbiologists to clinical researchers and are based at many research intensive universities in Australia including the University of Queensland, the University of Sydney, the University of Melbourne, Monash University, UNSW Australia, University of Technology Sydney, and the University of Adelaide.

KEY OUTCOMES

The project team developed an integrated cloud-based framework for the ABPRI researchers to find data and undertake a wide range of bioinformatics analyses across genomic, transcriptomic, proteomic and metabolomic data. The omics.data.edu.au system includes:

  1. An underlying data management platform (DMP)
    • allowing researchers to find specific data for their own analyses based on many criteria (e.g. raw versus analysed data; experimental condition; bacterial host and associated disease; omics data type; profiling technology used).
    • with an underlying data model that is conceptually applicable to any biological system [i.e. Project > Subject (specimen) > (experimental) Method > Study (omics-type specific) > Dataset(s)], whose specific elements adhere to internationally-agreed community standards for bacterial pathogen data required by global data repositories for each omics type (i.e. European Nucleotide Archive [4], ArrayExpress [5], ProteomeXchange [6] and Metabolights [7]). These information standards have been adopted to facilitate any future exchange of data from the DMP to such repositories, an approach aligned with the FAIR Data Principles [8].
    • built on DaRIS [9] /Mediaflux [10].
  2. An associated data analysis platform (DAP)
    • that includes hundreds of tools for bacterial (and general) genomic, transcriptomic, proteomic and metabolomic data analysis.
    • whose tools cover two broad types: (a) tools that take raw instrument-derived data and convert it into meaningful analysed data; and (b) exploratory (e.g. visualisation) and analysis tools to understand comparative differences between different sets of analysed data (e.g. condition A versus condition B).
    • that caters to different bioinformatics skill levels – from novice to expert.
    • that provides a variety of access methods – from GUI-based to command-line.
    • built on the microbial flavour of the Genomics Virtual Lab (GVL) [11] (which includes Galaxy [12]); and other key services such as Pathway Tools [13].
  3. Tools and methods to move data between the DMP and DAP
    • facilitated by a GenomeSpace [14] connector.
    • with supported data transfer methods/protocols that include drag-and-drop for GUI users or SCP/SFTP/FTP.
    • that also allow transfer of data to other computational environments (e.g. institutional resources, private GVL instances in the Nectar cloud etc.).
  4. Training materials for the above
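The DMP data model hierarchy listed above (Project > Subject > Method > Study > Dataset(s)) and its criteria-based search can be sketched with nested dataclasses. This is a minimal illustration under stated assumptions, not the DaRIS/Mediaflux schema: the class attributes (`organism`, `description`, `omics_type`, `processing`) and the `find_datasets` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Dataset:
    name: str
    processing: str                     # assumed values: "raw" or "analysed"


@dataclass
class Study:
    omics_type: str                     # e.g. "genomics", "proteomics"
    datasets: list[Dataset] = field(default_factory=list)


@dataclass
class Method:
    description: str                    # experimental method or condition
    studies: list[Study] = field(default_factory=list)


@dataclass
class Subject:
    organism: str                       # the specimen being profiled
    methods: list[Method] = field(default_factory=list)


@dataclass
class Project:
    name: str
    subjects: list[Subject] = field(default_factory=list)


def find_datasets(project: Project, *,
                  organism: Optional[str] = None,
                  omics_type: Optional[str] = None,
                  processing: Optional[str] = None) -> Iterator[Dataset]:
    """Walk the hierarchy and yield datasets matching every given criterion.

    A criterion left as None is not applied.
    """
    for subject in project.subjects:
        if organism is not None and subject.organism != organism:
            continue
        for method in subject.methods:
            for study in method.studies:
                if omics_type is not None and study.omics_type != omics_type:
                    continue
                for dataset in study.datasets:
                    if processing is None or dataset.processing == processing:
                        yield dataset
```

Because each criterion is attached to the level of the hierarchy that owns it, a search can prune whole subjects or studies without inspecting their datasets.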

The project has maintained extensive and ongoing engagement with a wide range of stakeholders with varying interests and/or challenges in biological data production, distribution, management and use: the ABPRI consortium coordinators (BPA); data production facilities (Ramaciotti Centre for Genomics, Australian Genome Research Facility (AGRF), Australian Proteomic Analysis Facility (APAF), Monash Biomedical Proteomics Facility (MBPF), Metabolomics Australia (MA)); bioinformaticians across the consortium (at Melbourne Bioinformatics, AGRF, APAF, MBPF and MA); and the end user researchers.

The project has spearheaded for the first time the connection of multiple separate components that have been NCRIS-funded through previous Nectar, ANDS, RDSI and RDS eResearch investments.

The DMP and DAP have been designed to allow for future flexibility in that they: can be utilised independently of each other if required; can be adapted and extended for future research communities (e.g. mammalian, plant, population (meta-omics)); and can accommodate a very wide variety of data types arising from multiple data generation techniques and/or facilities.

CONCLUSION

In building omics.data.edu.au, we have developed, and presented to a research community, a national first: a cloud-based system for both integrated biological data management and associated informatics analysis for four broad “-omics” data types (DNA, RNA, proteins and metabolites), which enables the sharing of data and collaborative analysis amongst members of a research consortium. The platform has been designed so that it leverages existing national eResearch infrastructure and can be adapted and extended for future research communities.

REFERENCES

  1. Gomez-Cabrero, D, et al. Data integration in the era of omics: current and future challenges. BMC Systems Biology, 2014. 8(S2): I1
  2. Research Data Services (RDS) Food and Health Flagship program – https://www.rds.edu.au/omics
  3. BPA Antibiotic Resistant Pathogens Initiative (ABPRI) – http://www.bioplatforms.com/antibiotic-resistant-pathogens/
  4. European Nucleotide Archive – http://www.ebi.ac.uk/ena
  5. ArrayExpress – http://www.ebi.ac.uk/arrayexpress/
  6. ProteomeXchange – http://www.proteomexchange.org
  7. Metabolights – http://www.ebi.ac.uk/metabolights/
  8. FAIR Data Principles – https://www.force11.org/group/fairgroup/fairprinciples
  9. DaRIS – https://wiki.cloud.unimelb.edu.au/resplat/doku.php?id=data_management:daris
  10. Mediaflux – http://www.arcitecta.com/Products
  11. Genomics Virtual Lab (GVL) – https://www.gvl.org.au
  12. Galaxy – https://usegalaxy.org
  13. Pathway Tools – http://brg.ai.sri.com/ptools/
  14. GenomeSpace – http://genomespace.org

Biography

Jeff has a PhD in Biochemistry from the University of Queensland, and started his career as a researcher in the fields of cancer, molecular genetics and embryo development in both Australia and the UK, prior to moving into the management of large biological data assets through the establishment of a UK-based international database of mouse gene expression and anatomy.

Prior to joining QCIF/RCC, Jeff was based at Intersect Australia in Sydney where he was the National Manager of the RDS-funded med.data.edu.au project and also responsible for a number of biology-focused data and IT-related projects across NSW.

Prior to this, he was based in Melbourne at the Australian National Data Service (ANDS), where he was involved in commissioning and monitoring a number of biology/medicine-focused national data management projects.

http://orcid.org/0000-0002-8146-1225

About the conference

eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.
