Enabling Australian Genomics research through enhancements to the Genomics Virtual Lab

Derek Benson1*, Simon Gladman2*, Igor Makunin1, Anna Syme2, Helen van de Pol2, Christina Hall2, Nuwan Goonasekera2, Andrew Isaac2, Andrew Lonie2, Nigel Ward3, Jeff Christiansen4, Gareth Price5

  1. RCC-University of Queensland, Brisbane, Australia, {benson, i.makunin}@imb.uq.edu.au
  2. Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia, {aisaac, alonie, syme, cr.hall, helen.vanderpol, n.goonasekera, simon.gladman}@unimelb.edu.au
  3. QCIF, Brisbane, Australia, ward@qcif.edu.au
  4. QCIF and RCC-University of Queensland, Brisbane, Australia, christiansen@qcif.edu.au
  5. QFAB@QCIF, Brisbane, Australia, price@qfab.org

* These authors contributed equally and are listed alphabetically.


The rise of “next generation” sequencing has transformed biological research into a data-intensive endeavour. In order to reduce entry-level complexity for bioinformatics analyses, global efforts have focused on the generation of graphical-user interface front-ends to computational back-ends, of which the Galaxy Project is the preeminent example. Within Australia, Galaxy, R and other analysis environments have been made available inside the Genomics Virtual Lab (GVL), as both a self-installable environment and as a managed service [1, 2]. The BioDeVL project aims to provide a “data enhanced” and “user-enhanced” managed GVL service for all researchers in Australia and to up-skill the community in the use of this platform [3]. This will offer all Australian bioscience researchers the opportunity to more easily access and apply bioinformatics approaches to their research, without needing to worry about resourcing, deploying, configuring and performing other system administration tasks that currently preclude use by many researchers.

The project will:

  • Provide a world leading data advantage by:
    • Enabling all Australian bioscience researchers to access a professionally managed and appropriately resourced on-line computational service platform to underpin their biomolecular data analyses;
    • Ensuring all reference datasets are added to the service with appropriate descriptions and provenance information such that they are unambiguously identifiable – therefore affording Australian researchers an ability to better undertake reproducible analyses on the service.
  • Accelerate innovation by:
    • Enabling sophisticated biomolecular data analysis capability on top of cloud compute resources;
    • Providing reliable, quality controlled and trusted analysis tools within the service to encourage research innovation and collaboration;
    • Training and upskilling biology researchers to understand and utilise molecular biology-related reference data and analytical tools.
  • Create collaborative technology and partnerships for borderless research by:
    • Creating a single national managed computational service platform, allowing collaborative and borderless research for all Australian researchers and, by invitation, their international collaborators;
    • Continuing partnerships with existing international development and technology partners to ensure best practice tools and methodologies are applied to the Australian service to underpin borderless research.
  • Enhance the translation of research by:
    • Helping to translate basic bioinformatics research into tools for diagnostic purposes in fields such as public health by acting as a user-friendly vehicle to make tools and workflows accessible to non-bioinformatics experts.


The key outcome measures for the GVL and Galaxy Australia are user numbers and the number of tool executions; the latest figures will be shown in this presentation. Since the beginning of the project, specific effort has been focused on the Galaxy Australia component of the GVL, to provide an easy-to-use analysis platform for genomics research [4]. Galaxy Australia has been aligning its “look and feel” to international Galaxy sites, whilst maintaining the tools and reference datasets necessary to support Australian research activity. The reconciliation of computational resources to allow for efficient management of the GVL, ensuring reference datasets are up to date and cited in alignment with FAIR principles, and maintaining reliable, quality-controlled and trusted analysis tools are all aimed at increasing use of the GVL. Further, the deployment of increased resources for the training and upskilling of biology researchers will support the community to maximize their experience in the GVL.


Achieving an optimized GVL and Galaxy Australia service first involved reconfiguring existing resources hosted in Brisbane and Melbourne, which historically each operated a Galaxy service, to provide a single controlling (Head) node for Galaxy Australia. Through tool-specific resource allocation and the submission of jobs with long run times or high memory usage to dedicated resources, users will experience the fastest run time possible for their analyses. This has been achieved using a new Head Node configuration, new minimally packaged worker nodes, HTCondor job management and integrated Docker-spawned environments. Rationalisation, publication and alignment of training material maintained at Australian and global repositories, plus EMBL-ABR-led Train-the-Trainer activities, will drive the “user-enhanced” experience for Galaxy and GVL users. Finally, a single user-facing help desk will reduce query response times [5].
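The idea of sending long-running or memory-hungry jobs to dedicated resources can be illustrated with a minimal Python sketch. It is not Galaxy Australia's actual configuration: the `Job` fields, the thresholds and the pool names are all hypothetical and chosen only to show the routing logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """A submitted analysis job (hypothetical fields, for illustration only)."""
    tool: str
    est_runtime_min: float
    est_memory_gb: float


# Assumed thresholds; a real service would tune these per tool.
LONG_RUNTIME_MIN = 240.0
HIGH_MEMORY_GB = 64.0


def route(job: Job) -> str:
    """Return the name of the (hypothetical) resource pool for a job."""
    # Memory is checked first: a job that does not fit in RAM cannot run at all.
    if job.est_memory_gb >= HIGH_MEMORY_GB:
        return "high-memory"
    if job.est_runtime_min >= LONG_RUNTIME_MIN:
        return "long-running"
    return "general"
```

With these assumed thresholds, a genome assembly estimated at 128 GB would go to the `high-memory` pool, while a short quality-control job would stay on the `general` pool and not queue behind it.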


The GVL has undergone transformation and Galaxy Australia has been launched with the latest version of Galaxy, the most up-to-date reference genomes, datasets and tools, plus tools requested specifically by Australian researchers to enable their studies (Figure 1).

Figure 1: Galaxy Australia Landing Page



  1. GVL server image: https://www.gvl.org.au/get/
  2. GVL managed hosted data analysis environments (Galaxy and R-studio): https://www.gvl.org.au/use/
  3. ANDS / RDS / Nectar DeVL projects: https://www.ands-nectar-rds.org.au/researchdomainprogram
  4. GalaxyAustralia website: https://usegalaxy.org.au
  5. GalaxyAustralia User Support: help@genome.edu.au



Adding value to research data: Collaboration between APO, VIVO and Research Graph

Presenters: Ms Michelle Zwagerman1, Mr Les Kneebone1

Authors: Amanda Lawrence1, Camilo Jorquera1, Peter Vats2, Michael Conlon³, Amir Aryani2

1Analysis & Policy Observatory (APO.org.au), Hawthorn, Australia

2Research Graph, Melbourne, Australia, {peter.vats, amir.aryani}@researchgraph.org

3University of Florida, Gainesville, Florida, USA, mconlon@ufl.edu


Analysis & Policy Observatory (APO) is an early adopter of the Research Graph Augment API. In February 2018, APO joined the Duraspace pilot project to test this new cloud-hosted API. APO has leveraged this pilot and the Research Graph technology to augment social science and open policy data using the global network of scholarly works. In this presentation, we report on the outcome of the pilot and describe how the Augment API has added value to APO’s research repository by increasing the number of linked publications and datasets by 71%. We will also present how the Augment API has transformed APO’s data to VIVO RDF and provided a bridge between bibliographic records and semantically enabled data infrastructures. Finally, we will talk about the future roadmap for the social science and open policy data graph, a collaborative project between APO and international partners such as GESIS.


APO was recently ranked the 5th most important repository in Australia and 141 out of over 2,000 repositories around the world (Webometrics 2017)1. APO includes nearly 40,000 records and features the work of over 5,000 organisations and 20,000 authors. The open access database specialises in policy and practice grey literature such as commissioned reports, discussion papers, working papers, briefings, conference papers, evaluations and case studies, but also includes datasets and over 10,000 policy related journal articles.


A recent collaboration between VIVO and Research Graph [1] developed and demonstrated a repeatable process for using seed data to build first and second order graphs, and to export, transform, and load those graphs in VIVO RDF format to a hosted VIVO instance [2]. As illustrated in Figure 1, this process enriches the research repositories’ data by (1) transforming repository data to a graph database, (2) augmenting the graph with the Research Graph data, and (3) making this graph available as a VIVO instance. In February 2018, APO joined a pilot project by Duraspace2 to leverage this technology [2].

Adding APO content to the Research Graph environment resulted in an immediate increase in links from APO research objects to publications via publishers and authors. APO had 1,462 research objects with PIDs (persistent identifiers) such as ORCID, DOI and ScopusID. The PIDs play a key role in connecting APO content entities within the Research Graph. The Augment API was then able to predict matches with, and assign new PIDs to, research objects, thereby enriching the overall stock of PIDs in the APO repository. After one iteration, links from APO content to external publications grew from 25,959 to 44,542, a 71% increase. With the newly enhanced PID data, further iterations of APO exposure to the Research Graph resulted in a snowballing of connections with research objects in the graph. Figure 2 shows APO’s graph before and after augmentation.
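The role of PIDs in connecting records, and the arithmetic behind the reported growth in links, can be sketched in a few lines of Python. This is only an illustration of exact PID matching: it does not reproduce the Augment API, whose predictive matching is not described here, and the record identifiers and PID strings are hypothetical.

```python
def link_by_pid(local_records, external_records):
    """Return (local_id, external_id) pairs that share at least one PID.

    Both arguments map a record id to a set of PID strings
    (for example DOIs or ORCID iDs); the ids are hypothetical.
    """
    # Index external records by PID so each local PID is a single lookup.
    index = {}
    for ext_id, pids in external_records.items():
        for pid in pids:
            index.setdefault(pid, set()).add(ext_id)

    links = set()
    for loc_id, pids in local_records.items():
        for pid in pids:
            for ext_id in index.get(pid, ()):
                links.add((loc_id, ext_id))
    return links


def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Link counts reported in the abstract: 25,959 before, 44,542 after.
growth = percent_increase(25_959, 44_542)
```

Here `growth` evaluates to roughly 71.6, in line with the reported increase of 71%.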

Figure 1: Research Graph Augment API.


1 Cybermetrics Lab, 2017, Ranking web of world repositories: Oceania http://repositories.webometrics.info/en/Oceania/Australia

2 https://wiki.duraspace.org/display/VIVO/Proposal


Figure 2: APO’s graph before and after augmentation

The results of the Augment API can be accessed in a Neo4j graph database. The graph data is also converted into RDF, and the triples are available to search and browse interfaces including the OpenVIVO interface.
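As a rough illustration of the graph-to-RDF step, the sketch below serialises graph edges as N-Triples lines. It assumes every node and relationship is already identified by an IRI; the `example.org` IRIs are placeholders, not terms from the VIVO ontology, and literals and character escaping are deliberately left out.

```python
def to_ntriples(edges):
    """Serialise (subject, predicate, object) IRI triples as N-Triples text.

    Assumes all three terms are plain IRIs; literals, blank nodes and
    escaping are out of scope for this sketch.
    """
    lines = []
    for subj, pred, obj in edges:
        lines.append(f"<{subj}> <{pred}> <{obj}> .")
    return "\n".join(lines)


# Placeholder IRIs, for illustration only.
triples = to_ntriples([
    ("http://example.org/report/1",
     "http://example.org/vocab/cites",
     "http://example.org/publication/42"),
])
```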

In this talk, we will discuss the augmentation process and the lessons learnt from the pilot. In addition, we will present APO’s graph visualisation and describe the changes that appeared in the graph as a result of linkage to external data sources such as ORCID, Scholix, DataCite, and Crossref.


With a deluge of unstructured documents and diverse data to sift and analyse, researchers working on multidisciplinary public policy issues urgently need new digital research methods and integrated data solutions if they are to provide the evidence needed to have an impact on policy decisions and practices. By augmenting data we enable linkage of previously disconnected information. This pilot has demonstrated the possibilities. The next step is to integrate the open policy data graph within APO’s existing service offerings. Furthermore, APO is exploring the possibility of expanding its repository of social science and open policy documents by transforming the existing work to a larger graph that includes data from international partners such as GESIS in Germany and the British Library. This engagement is part of a new Research Graph collaborative project to build a domain-specific graph for social science research. As part of this presentation, we will provide further updates on this project.


[1] A. Aryani, M. Poblet, K. Unsworth, J. Wang, B. Evans, A. Devaraju, B. Hausstein, P. Klas, B. Zapilko, S. Kaplun, “A Research Graph dataset for connecting research data repositories using RD-Switchboard”, Nature Scientific Data, Volume 5, Pages 180099, 2018, http://dx.doi.org/10.1038/sdata.2018.99

[2] M. Conlon, A. Aryani. “Creating an Open Linked Data Model for Research Graph Using VIVO Ontology,” July 24, 2017. https://doi.org/10.4225/03/58ca600d726bd.


Michelle Zwagerman is the Digital Product Manager for Swinburne’s APO.org.au and the CRC for Low Carbon Living’s BuiltBetter.org Knowledge Hub. She has completed a Master of Public Policy at RMIT, a Master of Business Administration at University of NSW, and a Bachelor of Science at University of Melbourne. She has over 20 years’ experience in Information Technology having delivered numerous IT projects and managed various IT support services.

Les Kneebone has worked in information management roles in government, school, community and research sectors since 2002. He mainly contributed to managing metadata, taxonomies and cataloging standards used in these sectors. Les is currently supporting the Analysis & Policy Observatory by developing and refining metadata standards and services that will help to link policy literature with datasets.

Unlocking the mystery of our urban population history

Dr Serryn Eagleson1, Steven McEachern2, Michael Rigby1, Josh Clough1, Amir Fila1, Xavier Goldie1, Phil Greenwood1, Rob Hutton1, Ivan Widjaja1

1 AURIN, University of Melbourne, Melbourne, Australia, admin@aurin.org.au

2 ADA, Australian National University, Australia, ada@anu.edu.au



Future decision-making processes are often limited by our ability to comprehend the impacts of decisions made in the past. Currently, cities across Australia are rapidly growing and good decisions are needed to ensure that people have access to adequate housing, services and jobs. Though planners today have access to a range of high value datasets describing the current state of society, they also require an understanding of the impacts of historical decisions and how environments have changed. Over the last 200 years, Australia’s census products have collected a wealth of information on the growth and development of its population, society and economy, with much of this historical data now housed within the Australian Data Archive (ADA). If this vast archive were to be both spatially and temporally enabled, this rich resource would become available and usable by researchers and decision makers, facilitating greater understanding of how regions have grown and changed in response to activities such as the provision of infrastructure. Further, this might also reveal a myriad of unknowns to learn from in areas such as health, built and natural environment, education, and energy. This presentation outlines new workflows developed to expose historical data from the ADA through to modern mapping systems via the Australian Urban Research Infrastructure Network (AURIN). It will present an example investigation into population changes within the City of Logan in Queensland over the past 30 years.


The publishing of historical census products required data relationships to be established between the Australian Bureau of Statistics (ABS), the ADA and AURIN at both human and machine levels. The initial phase consisted of the identification of points of contact, existing data infrastructures, formats, standards, etc. This was followed by agreement on the project’s high-level purpose, understanding the data’s context and eventual context of use (Lloyd and Dykes, 2011). The resulting baseline considered the translation of data into a usable format, where usability was considered in different dimensions.

Based on these requirements, an agile workflow was designed (Figure 1) and iteratively refined with project milestones. As raw data from the ABS was originally stored on technology using legacy formats, significant effort was required by the ADA to extract, translate and load the data into a version that reflected the ABS’s data structure and AURIN’s publishing requirements. Once formalized, statistical software was used to summarise the numerical data, defining types and categories. Next, the open-source Dataverse Project (https://dataverse.org/) was used for data sharing (King, 2007), outputting both the processed and numerical summary data. This provides the first entry point for users to access the authoritative tabular data from the ABS. Next, AURIN consumed historic boundaries from the ABS, which had undergone cleaning, re-alignment and re-projection for compatibility with its data infrastructure and web solutions. Following this, the boundary geometry was generalized/optimized for web use, and joined with the processed census data from the ADA. The summary data was then combined with other human readable material to curate each census product’s metadata according to the ISO 19115 standard (ISO, 2014), to ensure that the data is FAIR (findable, accessible, interoperable and re-usable) (FORCE11, 2014). This implementation was extended to address additional data citation principles (Data Citation Synthesis, 2014) and provenance requirements (World Wide Web Consortium, 2013). Once prepared, essential legal matters were identified and agreed, and final products were registered and published via AURIN’s web applications: Map, Portal and API, as the second access point for visualization and analysis tasks.
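The step in which generalised boundary geometry is joined with the processed census data can be pictured as a key-based join. The Python sketch below assumes GeoJSON-like feature dictionaries and a shared `region_code` field; both are hypothetical and do not describe AURIN's or the ADA's actual schemas or tooling.

```python
def join_census_to_boundaries(boundaries, census_rows, key="region_code"):
    """Attach census attributes to boundary features that share a region code.

    `boundaries` is a list of GeoJSON-like feature dicts and `census_rows`
    a list of flat dicts; the `region_code` key is an assumed field name.
    Returns the joined features and the codes that had no census row.
    """
    by_code = {row[key]: row for row in census_rows}
    joined, unmatched = [], []
    for feature in boundaries:
        code = feature["properties"][key]
        row = by_code.get(code)
        if row is None:
            # Historic boundaries may not line up with every table; keep track.
            unmatched.append(code)
            continue
        # Copy rather than mutate, so the source boundary data stays untouched.
        props = {**feature["properties"], **row}
        joined.append({**feature, "properties": props})
    return joined, unmatched
```

Returning the unmatched codes alongside the joined features makes gaps between historic boundaries and tabular data visible, which matters when the result is later described in provenance metadata.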

Using the new census products, analyses were performed on population density over Logan in Queensland. The region is the centre of new data exploration by Griffith University (Regional Innovation Data Lab), leveraging the AURIN data infrastructure in conjunction with the Queensland Cyber Infrastructure Foundation. To encourage participation from stakeholders, an early prototype visualisation was designed to demonstrate the power of spatio-temporal data (MacEachren and Taylor, 1994). The purpose of this is to examine how visual thinking and exploration may be used in the design of an interactive dashboard for the Logan community. This prototype is currently undergoing user testing, with results fed back to guide the design of subsequent releases.
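A population density analysis over several census years reduces to dividing each count by the region's area and comparing the results. The sketch below shows that calculation; the years, counts and area are illustrative placeholders, not census figures for Logan.

```python
def density_series(population_by_year, area_km2):
    """Return {year: persons per square kilometre}, ordered by year."""
    if area_km2 <= 0:
        raise ValueError("area_km2 must be positive")
    return {
        year: population_by_year[year] / area_km2
        for year in sorted(population_by_year)
    }


def change_between(series, start_year, end_year):
    """Percentage change in density between two years of a series."""
    start = series[start_year]
    return (series[end_year] - start) / start * 100.0


# Placeholder values for illustration only (not real census data).
example = density_series({1986: 1000, 2001: 1500, 2016: 2000}, area_km2=10.0)
example_change = change_between(example, 1986, 2016)
```

The sketch assumes a constant area; where historic boundaries changed between censuses, each year's density would need that year's area.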

Figure 1: Workflow for publishing historical census data via AURIN


This project was funded by the National Collaborative Research Infrastructure Strategy (NCRIS) and the Australian National Data Service (ANDS). It aims to assist collaboration across the following HASS platforms: eResearch SA Limited, Australian Data Archive, Alveo, Griffith University, AARNet and TROVE as part of the Humanities, Arts and Social Sciences (HASS) Data Enhanced Virtual Lab (DEVL) project (https://hasscloud.net.au/). Additionally, we acknowledge the support of Griffith University (Regional Innovation Data Lab), the Queensland Cyber Infrastructure Foundation (QCIF) and the Australian Bureau of Statistics for providing data and contextual information.


  1. Lloyd, D. and Dykes, J. (2011) “Human-centred approaches in geovisualisation design: Investigating multiple methods through a long-term case study”, IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2498-2507
  2. King, G. (2007) “An Introduction to the Dataverse Network as an Infrastructure for Data Sharing”, Sociological Methods and Research, vol. 36, pp. 173–199. Available online: http://j.mp/2owjurr
  3. World Wide Web Consortium (2013) “PROV Model Primer”, W3C Working Group Note 30 April 2013. Available online: https://www.w3.org/TR/2013/NOTE-prov-primer-20130430/
  4. FORCE11 (2014) “Guiding Principles for Findable, Accessible, Interoperable and Re-usable Data Publishing version B1.0”, FORCE11: The Future of Research Communications and e-Scholarship. Available online: https://www.force11.org/fairprinciples
  5. Data Citation Synthesis Group (2014) “Joint Declaration of Data Citation Principles”. Martone M. (ed.) San Diego CA: FORCE11. Available online: https://doi.org/10.25490/a97f-egyk
  6. International Organization for Standardization (2014) “ISO 19115-1:2014 Geographic information — Metadata — Part 1: Fundamentals”, ISO. Available online: https://www.iso.org/standard/53798.html
  7. MacEachren, A. M., & Taylor, D. (Eds.) (1994) Visualization in Modern Cartography. First edition, London, Great Britain: Pergamon.


Serryn is Deputy Director at AURIN and Assistant Research Director at the CRC for Spatial Information. Serryn completed her PhD in Geographical Information Systems (GIS) and the design of administrative boundaries at the University of Melbourne in 2003. This research contributed to changes in the allocation of administrative boundaries across Australia, and in recognition of her work she was awarded a prestigious Victorian Fellowship. In addition to academic qualifications, Serryn has over 15 years of applied experience in spatial modelling and applying this expertise to inform urban planning for local and state governments. Serryn has also actively worked in consulting and provided advice as an Expert Witness for the City of Melbourne at the Victorian Planning Panel.

Safe Havens – enabling data driven outcomes

Ms Komathy Padmanabhan1, Dr Stevan Quenette2, Ms Anitha Kannan3, Mr Paul Bonnington4

1 Monash University, Clayton, Australia, komathy.padmanabhan@monash.edu

2 Monash University, Clayton, Australia, steve.quenette@monash.edu

3 Monash University, Clayton, Australia, Anitha.Kannan@monash.edu

4 Monash University, Clayton, Australia, paul.bonnington@monash.edu


The value of Big Data collected through various studies lies not just in the data, but in appropriate data management, where the governance allows data to be analysed to generate insights & outcomes in a timely manner, without compromising the ethics, privacy & IP of the data subjects and data custodians.

Outcomes can either be a major breakthrough discovery or a translatable method that has impact, like an integrated system utilising real-time monitoring & feedback loop using AI and Machine learning.

Democratisation of data becomes a challenge when the data is sensitive or protected, and sharing it inappropriately may lead to adverse impact that can either make the data subjects vulnerable or breach the IP/contracts/regulations.

Custodians of sensitive/protected data establish a data governance model for their data and require a secure, curated conduit for the data to be analysed by internal and external collaborators, without the data leaving the governed boundaries. Traditional data sharing techniques, like secure FTP, require data to be completely de-identified and transferred to the analyst’s computer, and from that point the data custodian loses control and oversight. Analysis of identifiable and re-identifiable data remains challenging or impossible outside the custodian’s premises.

Safe havens are secure data sharing and analysis environments that enable identifiable and re-identifiable data to be made available to the researcher across geographic locations, without having to move the data physically, through secure remote access machines hosted within the custodian’s research infrastructure. Curated data ingress and egress to and from safe havens are within custodian governance and control, thus enabling the remote analysis of sensitive or protected data within the stewardship of the data custodian.

This presentation will include:

  • Background on data safe havens
  • Enabling a data custodian driven governance model
  • Piloting for healthcare outcomes
  • Future

What are Safe havens

Safe Havens can represent a whole range of capabilities, including research data catalogue management, data access management, data anonymisation and data linkage, in addition to the core secure data sharing & analysis platform.

The analysis environment itself can be a custom virtual lab available to the collaborators of a particular repository, with a range of analysis tools, a data handbook and a read-only copy of the actual data to be analysed. The Virtual Machines are secured with the required policies to ensure appropriate access management, restricting data movement in and out of the machine and disabling internet access. They are managed/governed instances with regular vulnerability scanning, penetration testing and firewall restrictions.

The augmenting capabilities, like catalogue management, access request management, anonymisation and linking, make the whole Safe Haven suite of capabilities a one-stop shop for data custodians to enable collaboration on their invaluable asset.

Enabling data custodian driven governance model

Safe havens are built on a highly scalable and customisable architecture with a tiered permission structure at various levels, which enables custodians to implement their governance policies within the Safe Haven through processes and workflows. The Safe Havens can also be an effective vehicle to implement FAIR data principles through metadata cataloguing capabilities.
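A tiered permission structure of this kind can be sketched as an ordered set of tiers with a minimum tier per action. The tier names and actions below are hypothetical and are not taken from the Monash implementation; the sketch only shows how a custodian-defined policy could be checked, with unknown actions denied by default.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical permission tiers, ordered from least to most privileged."""
    CATALOGUE = 1   # may browse metadata only
    ANALYST = 2     # may analyse the read-only data inside the haven
    CUSTODIAN = 3   # may approve data ingress and egress requests


# Assumed policy: the minimum tier each action requires.
REQUIRED_TIER = {
    "browse_catalogue": Tier.CATALOGUE,
    "analyse_data": Tier.ANALYST,
    "approve_egress": Tier.CUSTODIAN,
}


def is_allowed(user_tier: Tier, action: str) -> bool:
    """Return True if a user at `user_tier` may perform `action`."""
    required = REQUIRED_TIER.get(action)
    if required is None:
        # Deny anything the custodian's policy does not explicitly list.
        return False
    return user_tier >= required
```

Keeping the policy in a single table means a custodian can tighten or relax a rule by changing one entry, without touching the check itself.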

Piloting healthcare outcomes

Monash eResearch Centre has piloted implementing Safe Havens for a few healthcare use cases, and this presentation will cover the various flavours of safe havens that have been implemented, as outlined below:

  • A lightweight Virtual Lab that provides data transfer capabilities and analysis tools
  • A multi component workflow enabled safe haven, that provides data cataloguing, access management and secure analysis environment
  • A comprehensive safe haven environment, leveraging international best practices & technologies, that provides a governed environment with a complete pre-installed software suite, access controls and permissions, secure remote access and data linkage.


Safe Havens are enabled through the secure IaaS capabilities within Monash eResearch Centre. The roadmap is to achieve ISO 27001 accreditation for the information security management practices for the underlying infrastructure, secure data hosting and sharing platforms. This will enable us to achieve a commercial/industry-grade secure platform and proactively address any upcoming regulatory requirements.

Unlocking the potential of the huge set of data repositories that the Monash custodians and partners hold, through data linking capabilities.

Leveraging the High Performance Computing capabilities for near-real-time data processing, linking, anonymisation, sharing and analysis to achieve healthcare and other high-value outcomes.


Komathy Padmanabhan is the Strategic Business Analyst at Monash eResearch Centre. Komathy has over 10 years of experience in leading business improvements, business-technology alignment and strategic initiatives across various industries.

Introduction to an Integrated Jupyter Notebook

Ingrid Mason1, Frankie Stevens1, Ghulam Murtaza2

1AARNet, Sydney, Australia, ingrid.mason@aarnet.edu.au, frankie.stevens@aarnet.edu.au
2Intersect Australia, Sydney, Australia, Ghulam.murtaza@intersect.org.au


The Introduction to Jupyter Notebook workshop will cover:

  • What sort of tool a Jupyter notebook is
  • Where the Integrated Jupyter notebook (as a researcher tool) fits into different researchers’ toolkits
  • What advantages there are of using the notebook in the cloud versus on the desktop
  • Basic Python programming using a Jupyter notebook


The workshop is ~3 hours (including breaks):

  1. What is a Jupyter notebook and how does it function. 15 minutes
  2. Where does a Jupyter notebook fit in different researcher toolkits. 15 minutes
  3. Why use Jupyter notebook on a desktop and why use it in the cloud. 20 minutes
  4. Basic Python programming using a Jupyter notebook. 75 minutes


The workshop is targeted at data librarians, research support, and eResearch professionals interested in what the Jupyter notebook does, how it works, and how it can be used to train researchers in basic programming skills. Workshop participants will want to know where the Jupyter notebook fits into different researchers’ toolkits (along with Excel, SPSS, Stata, RebExr, RStudio, or MATLAB). This workshop will include basic programming commands in the Python programming language.


Workshop preparation:

  • Come with a laptop and with a CloudStor account setup


Workshop breakdown:

  • Workshop is half-day and includes a hands-on component
  • Up to 40 attendees with no special seating or table requirements


Ingrid, Frankie and Ghulam are eResearch specialists with extensive experience in researcher engagement and training, and have expertise in research data and technologies across STEM and HASS research areas.


About the conference

eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.
