Fundamentals of Deep Learning for Computer Vision

Christopher Watkins1

1CSIRO, Melbourne, Australia

This workshop teaches you to apply deep learning techniques to a wide range of computer vision tasks through a series of hands-on exercises. You will work with widely used deep learning tools, frameworks, and workflows by performing neural network training and deployment on a fully configured GPU-accelerated workstation in the cloud. After a quick introduction to deep learning to start the course, you will advance to building and deploying deep learning applications for image classification and object detection, followed by modifying the applications to improve their accuracy and performance, and finish by implementing the workflow that you have learned on a final project. At the end of the workshop, you will have access to additional resources to create new deep learning applications on your own.

  • Duration: 8 hours
  • Certification: Upon successful completion of this workshop, you will receive NVIDIA DLI Certification to prove subject matter competency and support professional career growth
  • Prerequisites: Familiarity with basic programming fundamentals, like functions and variables
  • Languages: English
  • Tools, libraries, and frameworks: Caffe, DIGITS

LEARNING OBJECTIVES

At the conclusion of the workshop, you will have an understanding of the fundamentals of deep learning and be able to:

  • Implement common deep learning workflows, such as image classification and object detection
  • Experiment with data, training parameters, network structure, and other strategies to increase performance and capability of neural networks
  • Integrate and deploy neural networks in your own applications to start solving sophisticated real-world problems

CONTENT OUTLINE

  Components Description
Introduction

(45 mins)

  • Course Overview
  • Getting Started with Deep Learning
Introduction to deep learning, situations in which it is useful, key terminology, industry trends, and challenges.
Break (15 mins)
Unlocking New Capabilities

(120 mins)

  • Biological inspiration for Deep Neural Networks (DNNs)
  • Training DNNs with Big Data
Hands-on exercise: Training neural networks to perform image classification by harnessing the three main ingredients of deep learning: Deep Neural Networks, Big Data, and the GPU.
Break (45 mins)

 

Unlocking New Capabilities

(40 mins)

  • Deploying DNN models
Deployment of trained neural networks from their training environment into real applications.
Measuring and Improving Performance

(100 mins)

  • Optimizing DNN Performance
  • Incorporating Object Detection
Hands-on exercise: neural network performance optimization and applying DNNs to object detection.
Summary

(20 mins)

  • Summary of Key Learnings
Review of concepts and practical takeaways.
Break (15 mins)
Assessment

(60 mins)

  • Assessment Project: Train and Deploy a Deep Neural Network
Validate your learning by applying the deep learning application development workflow (load dataset, train and deploy model) to a new problem.
Next Steps

(15 mins)

  • Workshop Survey
  • Setting up your own GPU-enabled environment
  • Additional project ideas
  • Getting Data
Learn how to set up your GPU-enabled environment to begin work on your own projects. Get additional project ideas along with resources to get started with NVIDIA AMI on the cloud, nvidia-docker, and the NVIDIA DIGITS container.

 

This course is also available as a self-paced online option at https://courses.nvidia.com/

WHY DEEP LEARNING INSTITUTE HANDS-ON TRAINING?

  • Learn how to build deep learning and accelerated computing applications across a wide range of industry segments such as Autonomous Vehicles, Digital Content Creation, Finance, Game Development, and Healthcare
  • Obtain guided hands-on experience using the most widely used, industry-standard software, tools, and frameworks
  • Attain real world expertise through content designed in collaboration with industry leaders such as the Children’s Hospital of Los Angeles, Mayo Clinic, and PwC
  • Earn NVIDIA DLI Certification to prove your subject matter competency and support professional career growth
  • Access courses anywhere, anytime with a fully configured GPU-accelerated workstation in the cloud

Visualisation of research activity in Curtin’s virtual library

Peter Green1, Pauline Joseph2, Amanda Bellenger3, Aaron Kent4, Mathew Robinson5

1Curtin University, Perth, Australia, P.Green@curtin.edu.au

2Curtin University, Perth, Australia, P.Joseph@curtin.edu.au

3Curtin University, Perth, Australia, A.Bellenger@curtin.edu.au

4Curtin University, Perth, Australia, Aaron.J.Kent1@gmail.com

5Curtin University, Perth, Australia, Matt.Robinson@curtin.edu.au

 

Curtin University Library manages authenticated access to its online journal, book and database collections using the URL re-writing proxy service called EZproxy. EZproxy mediates the request between user and publisher platform via the Library. The proxy service is widely deployed in libraries worldwide and has been a standard authentication solution for the industry for many years. The EZproxy software creates a log entry for each request in the Combined HTTP Log format[2]. The log files are extensive, with approximately 30 million lines written per month. The log files capture information for each request such as the IP address, client ID, date and time, HTTP request and response and so forth. The Curtin Library has retained at least five years of the log files.
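To make the log structure concrete, the following is a minimal sketch of parsing one line in the Combined Log Format. The regular expression follows the commonly documented field order (host, ident, user, timestamp, request, status, bytes, referer, user agent); the sample line and the name parse_line are illustrative assumptions, not taken from the Curtin logs or the project code.

```python
import re
from datetime import datetime

# Combined Log Format field order:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # A "-" in the bytes field means no body was sent.
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

# Made-up sample line for illustration only.
sample = ('203.0.113.7 - u12345 [08/Jun/2018:10:15:32 +0800] '
          '"GET http://example.org/article/1 HTTP/1.1" 200 5120 '
          '"http://example.org/" "Mozilla/5.0"')
record = parse_line(sample)
```

At roughly 30 million lines per month, a parser of this kind would normally be applied line by line while streaming the file rather than after loading it into memory.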

This large dataset presents an opportunity to learn more about the information seeking behaviour of Curtin Library clients, but also presents a challenge. Traditional analysis of such data tends to produce aggregated usage statistics that do not reveal activity at a granular level. Immersive visualisation could provide a means to see the data in a new way and reveal insights into the information seeking behaviour of Curtin Library clients. In collaboration with Dr Pauline Joseph, Senior Lecturer (School of Media, Creative Arts and Social Inquiry) the Curtin Library proposed this work for funding under the Curtin HIVE[3] Research Internships program. The proposal was successful and a computer science student, Aaron Kent, was employed for a ten week period to produce visualisations from the EZproxy log file dataset.

The data was anonymised to protect client confidentiality whilst retaining granularity. The number of lines in the log file was reduced by removing ‘noise’. The Unity3D[4] software was chosen for its ability to provide visualisations that could be displayed on the large screens of the HIVE as well as on desktop screens. Many possibilities were discussed for visualisations that might give insight into client behaviour, but two were chosen for the internship.
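The abstract does not say how the anonymisation or noise removal was done. One possible approach, sketched here purely as an assumption, is to replace each client identifier with a keyed hash, so that the same client always maps to the same token (granularity is kept) while the original ID cannot be read from the data. The noise rule shown, dropping requests for static page assets, is likewise a guess at what ‘noise’ might mean. SECRET_KEY, pseudonymise and is_noise are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be random and stored apart from the data.
SECRET_KEY = b"replace-with-a-private-random-key"

def pseudonymise(value, key=SECRET_KEY):
    """Map an identifier to a stable token; the same input gives the same token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

# Assumed definition of 'noise': requests for static page assets.
NOISE_SUFFIXES = (".css", ".js", ".png", ".gif", ".ico")

def is_noise(url):
    """Return True if the URL path (ignoring any query string) is a static asset."""
    path = url.split("?", 1)[0].lower()
    return path.endswith(NOISE_SUFFIXES)

token_a = pseudonymise("u12345")
token_b = pseudonymise("u12345")
token_c = pseudonymise("u67890")
```

Because the tokens are stable, all requests from one client can still be followed through time in the visualisation without exposing who the client is.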

The first visualisation focusses on the behaviour of individual users in time and space and represents each information request using an inverted waterfall display on a global map as illustrated by Figure 1. Different sizes and shapes are used to present different client groups and the size of the information request is reflected in the size of the object. Geolocation information is used to anchor each request on the map.

Figure 1: Global user visualisation

 

The second visualisation focusses on the usage of particular resources over time and represents each information request as a building block in a 3D city as illustrated by Figure 2. The different client groups and the volume of requests are illustrated over time by location and size against each particular scholarly resource.

Figure 2: Scholarly resource visualisation

The successful visualisation prototypes have shown that the EZproxy log file data is a rich source of immersive visualisation and further development will yield tools that Curtin Library can use to better understand client information seeking behaviour.

REFERENCES

  1. EZproxy Documentation. Available from: https://www.oclc.org/support/services/ezproxy/documentation/learn.en.html accessed 8 June 2018.
  2. Combined Log Format. Available from: http://fileformats.archiveteam.org/wiki/Combined_Log_Format accessed 8 June 2018.
  3. Curtin HIVE (Hub for Immersive Visualisation and eResearch). Available from: https://humanities.curtin.edu.au/research/centres-institutes-groups/hive/ accessed 8 June 2018.
  4. Unity3D. Available from: https://unity3d.com/ accessed 8 June 2018.

Biography:

Peter Green is the Associate Director, Research, Collections, Systems and Infrastructure in the Curtin University Library, Australia. He is responsible for providing strategic direction, leadership and management of library services in support of research, the acquisition, management, discovery and access of scholarly information resources, and information technology, infrastructure and facilities.

Citizen Data Science can cure our data woes

Dr Linda McIver1

1Australian Data Science Education Institute, Glen Waverley, Australia

 

We are drowning in data. Every two days we generate more data than the entirety of human history up to 2003. And those are old numbers. We have scientific instruments that generate more data than we can store, much less process. And we have data on everything you can possibly dream of – and in the case of Facebook, many things you’d probably prefer not to. In short, we have more data than we can possibly begin to understand.

Coincidentally, we have an entire cohort of school students whom we are failing to engage with technology. Students whose experience of “tech” subjects is formatting Word documents or making web pages. Students who don’t see the relevance of technology to their own lives and careers. We are increasingly ruling their world with data. It’s time to engage them with it. Teaching kids Data Science has the potential to enable students to make scientific discoveries, understand the discipline that’s changing the face of the world, and engage with technology on a whole new level. I’m going to show you how.


Biography:

Dr Linda McIver started out as an Academic with a PhD in Computer Science Education. When it became apparent that High School teaching was a lot more fun, Linda began a highly successful career at John Monash Science School, where she built innovative courses in Computational and Data Science for year 10 and year 11 students.

Nominated one of the inaugural Superstars of STEM in 2017, Linda is passionate about creating authentic project experiences to motivate all students to become technologically and data literate.

While Linda loves the classroom, it was rapidly becoming clear that teachers in the Australian School system were keen to embrace Data Science, but that there was a serious lack of resources to support that. That’s why Linda created ADSEI – to support Data Science in education.

Higher-level cloud services to support data analytics

Glenn Moloney1, Paul Coddington2, Andy Botting3, Siddeswara Guru4

1Australian Research Data Commons (ARDC), University of Melbourne, Melbourne, Australia, glenn.moloney@ardc.edu.au

2Australian Research Data Commons (ARDC), eResearch SA, Adelaide, Australia, paul.coddington@ardc.edu.au

3Australian Research Data Commons (ARDC), University of Melbourne, Melbourne, Australia, andrew.botting@ardc.edu.au

4TERN, University of Queensland, Brisbane, Australia, s.guru@uq.edu.au

 

The Nectar Research Cloud has been very successful in providing cloud infrastructure-as-a-service. It now supports some additional higher-level services such as database-as-a-service, big data analytics services (Hadoop, Spark, etc), and also supports some virtual laboratories and related services that are customised for the needs of particular research domains.

The Nectar Research Cloud is now part of the new Australian Research Data Commons (ARDC). As part of the planning for the ARDC, there is interest in what higher-level cloud services the ARDC might provide. For example, we see an accelerating uptake of data analytics platforms like Jupyter notebooks, R and R-Studio, and virtual desktop environments across many fields of research, including the Humanities, Geosciences, Marine Sciences, Ecosystems Sciences and Astronomy. Many of these platforms are well suited to running on the cloud. They are also adopting container and orchestration technologies (such as Docker and Kubernetes) so that they can be deployed across cloud and non-cloud environments.

This BoF session will have a facilitated discussion on these issues:

  • what types of higher level data analytics solutions should we be aiming to support on the research cloud?
  • what are current practices for supporting these tools on the research cloud and elsewhere?
  • how can we make them easier for researchers to use?

To start the discussion there will be some short presentations on what is currently being done to support some of these tools on the Nectar Research Cloud and other commercial and international research cloud platforms.


Biography:

Glenn Moloney – Director, Research Communities, Australian Research Data Commons. Specialties: eresearch, elearning, grid computing, research collaboration, experimental particle physics, data analysis and acquisition

Dr Paul Coddington is Associate Director, Research Cloud and Storage, Australian Research Data Commons. He has over 30 years’ experience in eResearch including computational science, high performance and distributed computing, cloud computing, software development, and research data management.

Andy Botting – Senior Engineer at the Australian Research Data Commons (ARDC), Nectar Research Cloud. He is a cloud-native Systems Engineer with a background in Linux and HPC. Specialities: Linux, Android, Puppet, OpenStack and AWS.

Siddeswara Guru is a program lead for the TERN data services capability. He has experience in the development of domain-specific research e-infrastructure capabilities.

“Would you like to chat?” The Ethics of AI in Higher Education

Dr Craig Bellamy1, Mr Mohsin Murtaza1, Mr Ather Saeed1

1Charles Sturt University, Fitzroy, Australia

 

After many false dawns, AI may be gaining traction. Chatbots, Natural Language Processing, robots, autonomous vehicles, and the combination of big data and AI are all finding applications in a myriad of commercial and other contexts. AI was once about explicit commands (what you put in is what you got out), but now it is about machine learning from big data, about machines that not only learn but can also make decisions.

This ability to make decisions poses numerous thorny ethical dilemmas: can an autonomous vehicle avoid collisions ‘ethically’; can a chatbot impersonate a human for nefarious purposes; and can an autonomous military drone decipher images of illicit activity and then take action? These are not dystopian projections of a sci-fi future; rather, these are ethical issues that exist now within the province of AI and its applications.

Whilst ethicists have been quick to provide critique, debate, and numerous frameworks for an ethical AI future (indeed the Australian Government has just proposed a “technology roadmap, a standards framework and a national AI Ethics Framework”, and regulation in the space), higher education has been fairly quiet in terms of debating the impacts of AI on teaching and research and the broader higher education system. Indeed, while AI applications are not yet readily used in research, this could change quite rapidly, as has the use of ‘big data’ in research across both the digital humanities and the sciences.

Many ethical issues surround the foremost issue of IT ethics, namely privacy, but new issues also arise, particularly centred upon the interpretation of data using machine learning, transparency, and AI’s influence upon later research findings, its accreditation, and broader social influence. This is a particularly difficult issue, as AI does afford many benefits in terms of the researcher’s ability to deal with the scale and complexity of big data along with the phenomena it records and represents. But there are things that machines are good at and things that people are good at, and this intersection of machines and people, including the ethics of interpretation and decision making, needs to be considered from the very emergence of AI in research and education.

This Birds of a Feather session proposes to discuss the ethics of AI, big data and research, with the purpose of providing a rudimentary ethical framework for embryonic AI in research and teaching practice. This framework may be used as a stand-alone guide for researchers or ethics teachers or as an addendum to existing research ethics, privacy and data processing guidelines. During the BoF session, discussion materials, provocative examples, and talking points will be provided to draw on the experience of the audience to help develop the framework.

Reference:

  1. Bostrom, Nick, “Superintelligence: Paths, Dangers, Strategies”, Oxford University Press, 2014
  2. Pollit, Edward, “Budget 2018: National AI ethics framework on the way, Increased regulation signalled as part of $30m investment”, Australian Computer Society, https://ia.acs.org.au/article/2018/budget-2018–ai-boost-with-an-ethical-focus.html (Accessed 13 June, 2018).
  3. Luckin, Rose, “Enhancing Learning and Teaching with Technology: What the research says”, Institute of Education Press (IOE Press), 2018
  4. Seldon, Anthony, “The Fourth Education Revolution”, University of Buckingham Press, 2018

Biographies:

Dr Craig Bellamy is a Lecturer in IT and Ethics at Charles Sturt University’s Study Centre in Melbourne. He has a background in the Digital Humanities and has presented and published widely in the field.

Mohsin Murtaza is currently working at Study Centre Melbourne, CSU, as an Adjunct Lecturer and Course Coordinator (IT). He has worked as a Lecturer at several Australian universities, including La Trobe, Central Queensland and Federation University. He completed a Master of Telecommunication Engineering at La Trobe University and received the “Golden Key Award”.

Ather Saeed is a Course Coordinator for CSU (IT & Networking Programs) and is currently pursuing a PhD (thesis titled “Fault-tolerance in the Healthcare Wireless Sensor Networks”). He has a Masters in Information Technology and a Graduate Diploma (IT) from the University of Queensland, and a Master of Computer Science (Canadian Institute of Graduate Studies). He has published several research papers in international journals.

Scalable Research Data Visualisation Workflows: From Web to HPC

Owen L Kaluza1, Andreas Hamacher2, Toan D Nguyen3, Daniel Waghorn4, David G Barnes5, C Paul Bonnington6

1 *owen.kaluza@monash.edu,

2 *andreas.hamacher@monash.edu,

3 *toan.nguyen@monash.edu

4 *daniel.waghorn@monash.edu,

5 *david.g.barnes@monash.edu,

6 *paul.bonnington@monash.edu

*Monash University, Clayton, Victoria, Australia

 

MOTIVATION

The challenge of providing access to high-end visualisation, compute and storage resources has many aspects, including the difficulties of easing new users into accessing their capabilities and providing ways to seamlessly mesh new technological approaches with research workflows. When offering specialised cluster based visualisation hardware, such as the CAVE2 [1], that cannot run standard desktop applications, there is a need to provide researchers with a way to use them with ease and familiarity.

We attempt to provide researchers with access to high end immersive visualisation resources in a manner that scales in multiple dimensions:

  • From low-power portable devices to high-end cluster computing
  • From local data storage and processing to cloud based infrastructure
  • From researchers with simple visualisation demands and limited relevant technology experience to advanced custom vis workflows

METHODS AND DISCUSSION

We provide a set of modular components driven from a web portal based graphical interface, “previs”, which can be used to provide a pre-programmed visualisation output at first visit, but which can later become part of a much more advanced programmable pipeline.

Data sources include integration with Monash research cloud storage sources, data stored on MASSIVE [2] and sourced from various instruments, and the ability to upload local files or source data from other URLs. In this way, researchers who wish to try out visualisation facilities, or show their data to visitors can do so quickly and easily. For example, an artist can upload a laser scanned point cloud dataset for exploration from the web with no involvement from our technical staff. If there is a potential for analysis and exploration of their data within immersive or higher-powered compute environments, then further work is required; the same portal allows a breakout to higher level programmable tools.

The graphical output at each level of the hierarchy uses the same rendering techniques, scaled to best utilise the available hardware. Cluster-based rendering techniques are used on high-performance computing (HPC) facilities and data reduction and re-sampling is used to make viewing feasible on mobile devices and VR headsets. WebGL allows this capability to be scaled to mobile and desktop browsers, even for volume rendering [3], and large point clouds [4], with the potential for use in WebVR as the standard matures.
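As an illustration of the kind of data reduction mentioned above, the sketch below thins a point cloud by voxel-grid downsampling: points are binned into cubes of a chosen size and each occupied cube is replaced by the centroid of its points. This is a generic technique chosen for illustration; the abstract does not state which reduction or re-sampling method previs uses, and voxel_downsample is a hypothetical name.

```python
from collections import defaultdict

def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same cubic voxel with their centroid."""
    buckets = defaultdict(list)
    for x, y, z in points:
        # Integer voxel coordinates identify the cube a point belongs to.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z))
    reduced = []
    for pts in buckets.values():
        n = len(pts)
        # Average each coordinate across the points in this voxel.
        reduced.append(tuple(sum(c) / n for c in zip(*pts)))
    return reduced

# Tiny made-up cloud: the first two points share a voxel and are merged.
cloud = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.0, 0.0), (5.0, 5.0, 5.0)]
small = voxel_downsample(cloud, voxel_size=1.0)
```

A larger voxel size gives a coarser cloud suited to mobile devices, while a smaller one keeps more detail for desktop or cluster rendering.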

The underlying process is scriptable in Python, allowing access to the visualisation pipeline for advanced users, augmented with the multitude of other libraries available in a Python environment. Use of IPython notebooks [5] extends this capability to the browser, with built-in visualisation tools designed for interactive analysis in IPython/Jupyter.

Further advanced visualisation techniques we are developing can be integrated into the workflow, such as utilising comparative visualisation on large cluster display systems with “encube” [6].

                   

Figure 1: The previs workflow.

Figure 2: Working with a volume dataset

CONCLUSIONS: PUBLICATION OF RESULTS AND FUTURE WORK

Our framework makes it much simpler to get data into our high-end facilities, but also generates visualisations that can be used on desktop and web devices. These tools also provide the researcher with additional capabilities that can be built into their existing workflows and potentially used for publication and outreach.

The ability to generate customisable visualisations that can be taken away and viewed on portable devices helps consolidate standard desktop workflows with the exploration of data in immersive facilities and the final goal of communicating research via publication, driving us towards further development of these aspects.

3D PDF [7] and WebGL [8] visualisations have made some inroads into being accepted as a publication medium. Cloud/web server based visualisations and notebooks [9] have also been published on a non-permanent basis. For larger data, however, it becomes necessary to run cloud based visualisations on graphics processor (GPU) powered hardware. Barriers to the use of these tools exist, which may come down if standardised GPU powered cloud services become ubiquitous enough. For example, a publication could then include a Docker [10] container based visualisation with certainty that the ability to launch and view remote data and visualisation elements will exist for the long term.

REFERENCES

  1. Febretti, A. et al., CAVE2: a hybrid reality environment for immersive simulation and information analysis. Proceedings Volume 8649, The Engineering Reality of Virtual Reality 2013.
  2. Goscinski, W.J. et al., The multi-modal Australian ScienceS Imaging and Visualization Environment (MASSIVE) high performance computing infrastructure: applications in neuroscience and neuroinformatics research. Frontiers in Neuroinformatics, 27 March 2014.
  3. Congote, J. et al., Interactive visualization of volumetric data with WebGL in real-time. Web3D ’11 Proceedings of the 16th International Conference on 3D Web Technology, pages 137-146.
  4. Schuetz, M., Potree: Rendering Large Point Clouds in Web Browsers. Diploma Thesis, Vienna University of Technology, 19 Sep 2016.
  5. Perez, F. and Granger, B.E., IPython: A System for Interactive Scientific Computing. Computing in Science & Engineering, Volume 9, Issue 3, May-June 2007.
  6. Vohl, D. et al., Large-scale comparative visualisation of sets of multidimensional data. PeerJ, 10 Oct 2016.
  7. Barnes, D.G., Fluke, C.J. et al., Incorporating interactive three-dimensional graphics in astronomy research papers. New Astronomy, 2008. Volume 13, Issue 6: p. 599-605.
  8. Quayle, M.R. et al., An interactive three dimensional approach to anatomical description—the jaw musculature of the Australian laughing kookaburra (Dacelo novaeguineae). PeerJ, 1 May 2014.
  9. Shen, H., Interactive Notebooks: Sharing the code. Nature Toolbox, 5 Nov 2014. Available from: https://www.nature.com/news/interactive-notebooks-sharing-the-code-1.16261, accessed 7 Jun 2018.
  10. Cito, J. et al., Using Docker Containers to Improve Reproducibility in Software and Web Engineering Research. International Conference on Web Engineering 2016, Lecture Notes in Computer Science, vol 9671. Springer, Cham.

Biography:

Andreas is a software developer with a history in scientific visualisation applications and research. After earning his bachelor’s degree in scientific programming he was employed at RWTH Aachen University, Aachen, in the Virtual Reality Group, which also operates a CAVE system. There, Andreas worked across disciplines, helping researchers apply visualisation and VR methods to their fields.

Owen is a research software developer with a computer science and graphics programming background, who has been working with researchers at Monash since 2009, initially developing cross-platform and web-based 3D visualisation software for simulation data and later involved in parallel GPU-based implementations of imaging algorithms.

In his current role as a visualisation software specialist he develops tools for processing, viewing and interacting with scientific datasets in high-resolution immersive VR environments and exploring new scientific and creative applications of the CAVE2 facility at the Monash Immersive Visualisation Platform.

Phoebe—A tool for Fast Visualisation of Time Series Volumetric Data

Mr Oliver Cairncross1

1Research Computing Centre, University of Queensland, Brisbane, Australia, o.cairncross@uq.edu.au

 

Summary

Modern high throughput microscopy instruments, such as lattice light-sheet microscopy (LLSM), generate large volumes of data, necessitating the development of new systems to facilitate computation and analysis in a timely and efficient manner. Phoebe is such a system, and it provides a means for researchers to visually survey these datasets quickly and efficiently. The purpose of this presentation is to discuss and demonstrate a 3D real time visualiser developed at the University of Queensland’s Research Computing Centre in collaboration with the Institute of Molecular Bioscience’s Stow Lab.

Extended Abstract

This presentation will detail several components that make up an integrated visualisation system for large scale time series volumetric data. Namely:

  • A pre-processing stage, converting LLSM data into manageable datasets for consumption downstream;
  • A computationally intensive, high performance, distributed processing engine deployed on clusters to produce meshes for visualisation;
  • A multiplatform lightweight visualiser allowing researchers to explore the data as well as drive the entire process from their desktops

The pre-processing stage serves as the entry point of the visualisation pipeline. It utilises various tools (such as ImageJ, ITK or custom software) to convert raw LLSM images into a format that can easily be used downstream. Note that the system is not limited to LLSM data: various types of volume-based imaging data could be converted for use by the visualiser.

The visualiser and processing engine are tightly coupled to provide a flexible and responsive visualisation service. End users drive the process from a desktop application, which directs the backend compute processes according to the timepoint and/or parameters the user chooses to assess. As soon as the user makes a change, the system provides immediate visual feedback by continually adjusting work queues and pushing out results to the desktop. These results are computed and pushed on an individual timepoint basis (as opposed to waiting for the entire time series to be computed), enabling researchers to quickly move about in the dataset in a non-linear fashion.
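The idea of continually adjusting work queues can be sketched as a priority queue in which pending timepoints are ordered by their distance from the timepoint the user is currently viewing, and re-ordered whenever the user jumps elsewhere. This is an illustrative sketch only, not Phoebe’s implementation; TimepointQueue, focus and next_job are hypothetical names.

```python
import heapq

class TimepointQueue:
    """Orders pending timepoints by distance from the one the user is viewing."""

    def __init__(self, n_timepoints):
        self.pending = set(range(n_timepoints))
        self.heap = []
        self.focus(0)

    def focus(self, current):
        """Rebuild priorities when the user jumps to a different timepoint."""
        self.heap = [(abs(t - current), t) for t in self.pending]
        heapq.heapify(self.heap)

    def next_job(self):
        """Pop the closest pending timepoint, or None when all are done."""
        if not self.heap:
            return None
        _, t = heapq.heappop(self.heap)
        self.pending.discard(t)
        return t

queue = TimepointQueue(10)
first = [queue.next_job() for _ in range(2)]       # user starts at timepoint 0
queue.focus(7)                                      # user jumps to timepoint 7
after_jump = [queue.next_job() for _ in range(3)]   # work now centres on 7
```

In this sketch, workers that call next_job always receive the timepoint nearest the user’s current view, which is what allows results to appear first where the user is looking.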


Biography:

Oliver Cairncross is a data visualisation specialist at the Research Computing Centre, University of Queensland. He has been working in the field of scientific computing for a decade, starting with large database systems for storing electron microscope tomography data. Since then Oliver has been involved in various research fields with a focus on developing novel data visualisation techniques.

‘Cursed Forest’ – a random forest implementation for big, extremely highly dimensional data

Piotr Szul1, Aidan O’Brien2, Robert Dunne3, Denis C. Bauer

1Data61, CSIRO, Brisbane, QLD, Australia, piotr.szul@data61.csiro.au

2Health & Biosecurity, CSIRO, Sydney, NSW, Australia

3Data61, CSIRO, Sydney, NSW, Australia

 

INTRODUCTION

Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at population level rather than for a small number of individuals. This provides new power to whole genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.

As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes project [6], with genomes of 2504 individuals, includes nearly 85M genomic variants with a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches, especially if potential interactions between variants need to be considered.

Random forest [1] is one of the methods found to be useful in this context [3], both because of its propensity for parallelization and its robustness, and because of its inherent ability to model interactions [2]. The variable importance measure extracted from random forest models can be used to identify variants associated with traits of interest.

There are a number of random forest implementations available for single-machine computing, with interfaces for both R and Python. Some of them, such as Ranger [4], are specifically designed to process highly dimensional WGAS-like data sets and boast significant performance improvements over the more generic implementations.

These implementations, however, are limited by the memory size of a single machine, and larger datasets need distributed solutions in which data can be partitioned among multiple machines. This approach underpins the computational model of many leading big data technologies, including Apache Hadoop and, more recently, Apache Spark [7] – a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Notably, the Spark machine learning library (Spark MLlib [8]) comes with a random forest implementation capable of dealing with huge datasets. This implementation, however, is tuned to work on typical datasets with a large number of samples and a relatively small number of variables, and either fails or is inefficient for highly dimensional data [5].

To address these problems, we have developed CursedForest – a Spark-based, distributed implementation of random forest optimized for large, highly dimensional data sets. We have successfully applied it to datasets beyond the reach of other tools and, for smaller datasets, found its performance superior. We are currently applying CursedForest, released as part of the VariantSpark [9] toolkit, to a number of WGAS studies.

CURSED FOREST

Typically, random forest implementations operate on the traditional matrix-like data representation, with samples in rows and features in columns. In a distributed scenario this leads to ‘by row’ partitioning, which, for building decision trees on highly dimensional data, has been shown to be significantly less efficient than the alternative ‘by column’ layout and partitioning [5].

CursedForest uses the basic principle of ‘by column’ partitioning but extends it to ensembles of trees and introduces a number of optimizations aimed at efficient processing of extremely highly dimensional genomics datasets. These include a memory-efficient representation of genomics data, optimized communication patterns and computation batching.
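The ‘by column’ idea can be illustrated with a small, single-process Python sketch. It is not the CursedForest code: the function names, the round-robin assignment of columns and the use of Gini impurity are assumptions made for illustration only. Each partition holds whole feature columns (here, variants coded 0/1/2), finds its own best split locally, and only a small candidate tuple per partition is then compared.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_for_column(column, labels):
    """Best (weighted impurity, threshold) for a single feature column."""
    best = (float("inf"), None)
    n = len(labels)
    for threshold in sorted(set(column))[:-1]:
        left = [y for x, y in zip(column, labels) if x <= threshold]
        right = [y for x, y in zip(column, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, threshold)
    return best

def partition_by_column(columns, n_partitions):
    """Assign whole feature columns to partitions (round-robin, an assumption)."""
    parts = [dict() for _ in range(n_partitions)]
    for i, (name, col) in enumerate(columns.items()):
        parts[i % n_partitions][name] = col
    return parts

def local_best(partition, labels):
    """Work a single partition can do alone: scan only its own columns."""
    best = (float("inf"), None, None)
    for name, col in partition.items():
        score, threshold = best_split_for_column(col, labels)
        if threshold is not None and score < best[0]:
            best = (score, name, threshold)
    return best

def find_best_split(columns, labels, n_partitions=2):
    """Compare one small candidate tuple per partition to pick the global split."""
    parts = partition_by_column(columns, n_partitions)
    return min((local_best(p, labels) for p in parts), key=lambda t: t[0])

# Made-up toy data: three variants (columns) for four samples.
columns = {"v1": [0, 1, 0, 1], "v2": [0, 0, 2, 2], "v3": [1, 1, 1, 1]}
labels = ["a", "a", "b", "b"]
print(find_best_split(columns, labels))  # -> (0.0, 'v2', 0)
```

In this layout the sample labels must be visible to every partition, but no feature column ever has to be moved to evaluate a split, which is the property that matters when features vastly outnumber samples.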

RESULTS

We have compared the performance of CursedForest on genomics datasets against other leading implementations and tested its ability to scale on very large datasets. The results demonstrate not only that CursedForest is the only implementation able to process WGAS-size datasets, but also that it significantly outperforms other methods on smaller datasets.

Figure 1: Performance of CursedForest compared to MLlib and Ranger (left); Scaling of CursedForest to WGAS-sized datasets (right).

We have also demonstrated that CursedForest can accurately predict ethnicity from the whole-genome sequencing profiles of 2,500 individuals, and that it was the only implementation to scale to the full dataset. We trained CursedForest on the 1000 Genomes dataset, which consists of 2,504 samples with 81,047,467 features each, to predict ethnicity from genomic profiles. CursedForest achieves an out-of-bag (OOB) error of 0.01 and completes in 36 minutes 54 seconds, demonstrating its capability to run on population-scale cohorts in real-world applications.
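For readers unfamiliar with the out-of-bag error quoted above, the following Python sketch shows how such an estimate is typically computed for a bagged ensemble. It is an illustration under stated assumptions, not the CursedForest implementation: the `fit` callable and the nearest-mean stand-in learner are hypothetical, and a real forest would fit decision trees instead.

```python
import random
from collections import Counter

def fit_nearest_mean(xs, ys):
    """Stand-in learner (not a decision tree): predict the class with the closest mean."""
    means = {}
    for label in set(ys):
        vals = [x for x, l in zip(xs, ys) if l == label]
        means[label] = sum(vals) / len(vals)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def oob_error(xs, ys, fit, n_trees=50, seed=0):
    """Fraction of samples misclassified by the models that never saw them."""
    rng = random.Random(seed)
    n = len(ys)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        # Bootstrap: draw n sample indices with replacement.
        in_bag = [rng.randrange(n) for _ in range(n)]
        model = fit([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        # Only samples left out of this bootstrap may vote with this model.
        for i in set(range(n)) - set(in_bag):
            votes[i][model(xs[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    if not scored:
        return float("nan")
    wrong = sum(1 for i in scored if votes[i].most_common(1)[0][0] != ys[i])
    return wrong / len(scored)

# Made-up one-feature data with two well-separated classes.
xs = [0.1, 0.3, 0.2, 0.4, 2.1, 2.4, 2.2, 2.6]
ys = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(oob_error(xs, ys, fit_nearest_mean))
```

Because every sample is scored only by models trained without it, the estimate behaves like a built-in validation error and needs no separate hold-out set.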

CONCLUSIONS

We have developed CursedForest, a Spark-based random forest implementation optimized for highly dimensional datasets, and demonstrated that it outperforms other tools on genomics datasets in terms of both computation time and dataset size limits, and that it is capable of running on population-scale cohorts in real-world applications.

REFERENCES

  1. Breiman L. Random Forests. Machine Learning. 2001 October;45(1):5–32.
  2. Lunetta et al., Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics 2004 5:32
  3. Díaz-Uriarte R, Alvarez de Andrés S., Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3
  4. Wright, Marvin N., and Andreas Ziegler, “ranger: A fast implementation of random forests for high dimensional data in C++ and R.” arXiv preprint arXiv:1508.04409 (2015)
  5. Firas Abuzaid et al., Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale. Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain
  6. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012 Nov;491(7422):56–65.
  7. Apache Spark, available from: https://spark.apache.org/
  8. Spark MLlib, available from: https://spark.apache.org/mllib/
  9. VariantSpark, available from: https://github.com/aehrc/VariantSpark

Biography:

Piotr Szul is a Senior Engineer in CSIRO Data61. He holds an MSc degree in Computer Science and has over fifteen years of experience in developing various types of commercial and scientific software applications.

Since joining CSIRO in 2012, Mr Szul has been involved in a number of projects developing research platforms on clouds or applying emerging big data technologies to research problems. The domains of application have included image processing, social media, genomics and material sciences.

In addition to his core software engineering activities, Mr Szul has also been involved in building experimental cloud and big data processing infrastructure, has authored and co-authored a number of papers, and has delivered a number of presentations on various topics related to big data technologies and their applications.

Leveraging Integrated Spatial Biological and Environmental Data in the Atlas of Living Australia

Mr Lee Belbin1

1The Atlas of Living Australia, Carlton, Australia, lee.belbin@csiro.au

 

The Atlas of Living Australia

The Atlas of Living Australia (ALA: http://www.ala.org.au) has around 75 million observations of species. The ALA has also collected over 500 regional, national and international environmental layers that relate in some way to the understanding and management of the environment on which we all depend for life [1].

The biological data consist mainly of human observations of species at a point location, but there is also an increasing number of machine observations from camera traps and electronic tags attached to animals [2]. Two other spatial representations of species are what we term ‘expert distributions’, where scientists with relevant experience suggest where the species may or should exist, and ‘checklist areas’: areas within which several species have been observed.

The biological and environmental data, taken separately, provide a significant resource for visualizing the environment of the Australian region. For example, we can see where scores of thousands of species occur, on land, in the rivers and in the oceans. We can also see what typifies the environment of any area of land or ocean in the Australian region. Taken together, however, the data allow us to spatially intersect biological with environmental characteristics to help address significant questions, for example “What species characterize this area (which may be a reserve)?” and “Where could this species occur in 2030?” [3].

The research Portal of the Atlas

In a research context, the ALA’s functionality is best explored first through the interface provided by the Spatial Portal [4] (http://spatial.ala.org.au), where integrated data, analysis and visualization tools are available in a systematic environment (Figure 1). In this interface, users can explore species and their attributes, environments and their interactions, all with a spatial emphasis. Seventeen tools are available to help understand the interactions of points and areas; for example, point densities can be produced, or the conservation of species can be evaluated across multiple jurisdictions. Points in CSV and Excel® formats, and areas in shapefile, KML or WKT formats, can be moved into and out of the ALA’s Spatial Portal.

Alternative Atlas interfaces

The Spatial Portal, like all the Atlas of Living Australia’s functionality, is based on a suite of web services listed at http://api.ala.org.au. You can access the ALA’s data and the Spatial Portal’s analytical tools from these web services. For example, submitting the URL

http://spatial.ala.org.au/ws/intersect/el874/-27.476/153.018,

(the location of the Queensland Museum), returns

[{field: "el874",
  layername: "Temperature – annual mean (Bio01)",
  units: "degrees C",
  value: 20.4}]

This tells us that the mean annual temperature at this location is 20.4 °C. For those more comfortable or more efficient in the R environment (https://www.r-project.org/), the ALA4R package (https://cran.r-project.org/web/packages/ALA4R/index.html) can also be used to access most of the ALA data and a range of Spatial Portal tools. For example, in R and using the ALA4R function “intersect_points”, the following three lines

layers = c('cl1049','el874','el893')
points = c(-37.043,146.733,-37.120,146.672,-37.173,146.837)
intersect_points(points,layers)

produce values for the bioregion name (IBRA 7), the mean annual temperature and the annual rainfall at the three nominated locations:

1   -37.043   146.733   Highlands-Northern Fall   10.8   1267
2   -37.120   146.672   Victorian Alps             7.7   1483
3   -37.173   146.837   Highlands-Southern Fall   10.6   1158

These examples are a very small subset of the functionality in the Atlas of Living Australia that is designed to support the research community.
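The web-service call shown earlier can also be scripted. The Python sketch below is illustrative only: it assumes the URL pattern `/ws/intersect/{layer}/{latitude}/{longitude}` inferred from the single example URL above, and it assumes the service returns standard JSON with quoted keys in the shape of the example response. The function names are hypothetical, and the live service may behave differently.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the example URL in the text.
BASE = "http://spatial.ala.org.au/ws/intersect"

def intersect_url(layer_id, lat, lon, base=BASE):
    """Build an intersect URL for one layer at one point (assumed pattern)."""
    return f"{base}/{quote(layer_id)}/{lat}/{lon}"

def parse_intersect(body):
    """Map each returned layer id to its (value, units) pair."""
    return {row["field"]: (row["value"], row.get("units")) for row in json.loads(body)}

def fetch_intersect(layer_id, lat, lon):
    """Call the service over the network (defined here but not run)."""
    with urlopen(intersect_url(layer_id, lat, lon)) as resp:
        return parse_intersect(resp.read().decode("utf-8"))

# Offline demonstration using the response quoted in the text,
# rewritten as strict JSON (an assumption about the real payload).
sample = ('[{"field": "el874", "layername": "Temperature - annual mean (Bio01)", '
          '"units": "degrees C", "value": 20.4}]')
print(intersect_url("el874", -27.476, 153.018))
print(parse_intersect(sample))
```

Keeping URL construction and response parsing separate from the network call makes the logic checkable without access to the service.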

REFERENCES

  1. Belbin, L., Williams, K.J. Towards a national bio-environmental data facility: experiences from the Atlas of Living Australia. International Journal of Geographical Information Science 2015, 30(1) 108-125. http://dx.doi.org/10.1080/13658816.2015.1077962
  2. Campbell, H.A., Urbano, F., Davidson, S., Dettki, H., Cagnacci, F. A plea for standards in reporting data collected by animal-borne electronic devices. Animal Biotelemetry 2016, 4. https://doi.org/10.1186/s40317-015-0096-x
  3. Booth T.H., Williams K.J. and Belbin L. Developing biodiverse plantings suitable for changing climatic conditions 2: Using the Atlas of Living Australia. Ecological Management & Restoration 2012, 13(3), http://onlinelibrary.wiley.com/doi/10.1111/emr.12003/abstract
  4. Belbin, L. The Atlas of Living Australia’s Spatial Portal. In: Proceedings of the Environmental Information Management Conference 2011 (EIM 2011), Jones, M.B. & Gries, C. (eds.), 39-43. Santa Barbara.

Biography:

My professional career has evolved from being an exploration geologist with Cundill Meyers Pty Ltd in Canada and Australia (1970-1972) to running Blatant Fabrications Pty Ltd (2005-present). In between, I’ve lectured on computer applications in geoscience at the ANU (1972-78), done research in analytical ecology and managed environmental assessment projects in CSIRO (1979-1995), and established Australia’s first multidisciplinary science data centre at the Australian Antarctic Division (1995-2005). I have extensive experience in ecology, multivariate analysis, information management and policy, and have published over 100 papers. ORCID: https://orcid.org/0000-0001-8900-6203.

A cloud-based system to enable streamlined access to and analysis of continental-scale environmental metagenomics data by non-genomics researchers

Jeff Christiansen1, Derek Benson2, Grahame Bowland3, Samuel Chang4, Simon Gladman5, Gareth Price6, Anna Syme7, Tamas Szabo8, Mike W C Thang9, Andrew Bissett10

1QCIF and RCC-University of Queensland, Brisbane, Australia, jeff.christiansen@qcif.edu.au

2RCC-University of Queensland, Brisbane, Australia, d.benson@imb.uq.edu.au

3Centre for Comparative Genomics, Murdoch University, Perth, Australia, gbowland@ccg.murdoch.edu.au

4Centre for Comparative Genomics, Murdoch University, Perth, Australia, schang@ccg.murdoch.edu.au

5Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia, simon.gladman@unimelb.edu.au

6QFAB@QCIF, Brisbane, Australia, g.price@qfab.org

7Melbourne Bioinformatics, University of Melbourne, Melbourne, Australia, anna.syme@unimelb.edu.au

8Centre for Comparative Genomics, Murdoch University, Perth, Australia, tszabo@ccg.murdoch.edu.au  

9QFAB@QCIF, Brisbane, Australia, m.thang@qfab.org

10CSIRO, Hobart, Australia, Andrew.Bissett@csiro.au

 

BACKGROUND

‘Metagenomics’ refers to the study of genetic material from environmental samples (e.g. soil or water), where nucleic acids are sequenced using high throughput technology, and then analysed using informatics methods to identify and quantify the complex mixture of microorganisms that were present in the sample. Metagenomics as an approach has been revolutionary, demonstrating that microbial abundance and diversity in the environment are many times greater than expected [1]. For example, a gram of soil typically contains around 10,000 species of bacteria and 1,000,000,000 individuals. The application of metagenomics to environmental studies also suggests that microorganisms are fundamental to ecosystem health by mediating biogeochemical and nutrient cycling, thereby influencing crop and livestock production and mitigating waste/pollution.

To start to develop a continental-scale map of Australian environmental microbial communities (i.e., to document what microbes are present in the environment across the country), Bioplatforms Australia (BPA) and partners have formed the Australian Microbiome consortium [2] and jointly invested over $10M towards the collection of thousands of soil, inland water and marine water samples across Australia and its territories; the extraction of DNA from these samples; and the production of primary sequence data from these samples. Additionally, robust and standardised data analysis pipelines have been developed which produce primary-level derived data in the form of large gene abundance data tables (i.e., one table format lists counts for each of the ~2,000,000 specific sequence tags in each of the ~5,000 samples, and another table acts as a key to identify the closest related taxonomic grouping (e.g., species, genus etc.) that relates to each sequence tag). The consortium has also developed a data repository [2] to house the raw sequence data, derived data, and contextual metadata for each collection site and event (i.e., geolocation, time, depth, environment type, chemistry etc.).
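The relationship between the two derived tables described above (counts per sequence tag per sample, plus a key from tag to taxonomic group) can be sketched in a few lines of Python. The sample names, tag names, genus labels and the nested-dictionary layout are made up for illustration and do not reflect the consortium's actual file formats.

```python
from collections import defaultdict

# Abundance table: abundance[sample][sequence_tag] = read count.
# Stored sparsely, since most tags are absent from most samples.
abundance = {
    "sample_A": {"tag1": 10, "tag2": 5},
    "sample_B": {"tag2": 7, "tag3": 3},
}

# Key table: closest related taxonomic grouping for each sequence tag.
taxonomy = {"tag1": "Bacillus", "tag2": "Bacillus", "tag3": "Nitrospira"}

def counts_by_taxon(abundance, taxonomy, unknown="unclassified"):
    """Sum tag counts within each sample by taxonomic group."""
    out = {}
    for sample, tags in abundance.items():
        totals = defaultdict(int)
        for tag, n in tags.items():
            totals[taxonomy.get(tag, unknown)] += n
        out[sample] = dict(totals)
    return out

print(counts_by_taxon(abundance, taxonomy))
```

Joining the two tables in this way is the basic step behind taxonomic binning: the large tag-level table is collapsed to a much smaller taxon-level table per sample.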

While great progress has been made in both collection of the data and production of the primary-level derived tabular data, multiple challenges remain to make these data accessible to many environmental researchers, who need to perform ‘secondary’ level analysis – for example statistical analyses over the data (e.g., normalisation, alpha- and beta-diversity, taxonomic binning, serial group comparisons, correlations) and to have access to extensive visualisation outputs in order to interpret the results.
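To make the ‘secondary’ level analyses named above concrete, the Python sketch below shows one common form of each of three of them: normalisation to relative abundance, alpha-diversity as the Shannon index, and beta-diversity as Bray-Curtis dissimilarity. These are standard textbook definitions chosen for illustration; the abstract does not state which specific measures the project's tools apply, and the sample data are made up.

```python
import math

def relative_abundance(counts):
    """Normalise raw counts so that each sample sums to 1."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def shannon(counts):
    """Alpha-diversity: Shannon index (natural log) within one sample."""
    return -sum(p * math.log(p) for p in relative_abundance(counts).values() if p > 0)

def bray_curtis(a, b):
    """Beta-diversity: Bray-Curtis dissimilarity between two samples."""
    keys = set(a) | set(b)
    num = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    den = sum(a.get(k, 0) + b.get(k, 0) for k in keys)
    return num / den if den else 0.0

# Made-up taxon counts for two samples.
site_1 = {"Bacillus": 15, "Nitrospira": 5}
site_2 = {"Bacillus": 7, "Nitrospira": 3, "Pseudomonas": 10}

print(relative_abundance(site_1))
print(shannon(site_1), shannon(site_2))
print(bray_curtis(site_1, site_2))
```

Bray-Curtis is computed here on raw counts; applying it to the normalised values instead removes the effect of differing sequencing depth between samples.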

In late 2017, BPA acted on behalf of the Australian Microbiome community to attract funding from Nectar/ANDS/RDS under the Research Data Cloud (RDC) program [3] to establish a cloud-based system to address these challenges, especially for researchers without dedicated informatics resources at their disposal. This presentation will outline the cloud-based analysis system established.

KEY OUTCOMES

We have developed a web-accessible system that allows all Australian environmental metagenomics researchers (whether within or outside of the Australian Microbiome consortium) to undertake, through their web browser, a wide range of bioinformatics-based metagenomics analyses, from the initial primary-level molecular steps for taxonomic identification through to secondary-level microbial community analysis.

The framework has been implemented by extending and connecting two well established NCRIS-funded national computational infrastructure components: the BPA Data Portal [4] and the Galaxy-Australia service [5] (which is part of the Genomics Virtual Laboratory [6]):

  • Extensions to the BPA Data Portal
    • Implementation of support for the discipline standard BIOM (BIological Observation Matrix) format [7],
    • Improvements to increase the Findability, Accessibility, Interoperability and Reusability of datasets in the portal (e.g. adding data licences, data persistence policy, citation requirements),
    • Contributing to the extension of international/national ontologies and publishing these in vocabulary repositories where appropriate (this activity will be ongoing at the time of presentation).
  • Extensions to Galaxy-Australia
    • Installation of the QIIME [8] and Mothur [9] molecular metagenomics analysis suites on the Galaxy-Australia service for primary-level analysis,
    • Wrapping of the Rhea [10] and Phyloseq [11] R-based microbial community analysis packages (for secondary-level analyses) for use in Galaxy; deposition into the global Galaxy-Toolshed [12] for subsequent installation on any Galaxy instance; and installation on the Galaxy-Australia service.
  • Methods to move data between the BPA Data Portal and Galaxy-Australia
    • Through implementation of a Galaxy API [13] on Galaxy-Australia, a CKAN API [14] on the BPA Data Portal, and a mechanism for individual users to call each API from within the other system when required.
  • Training on the above – due for delivery end-November 2018
    • Development of self-paced online training material – to be available via Galaxy-Australia and the EcoED Ecoscience training portal [15],
    • Delivery of one 3-hour hands-on workshop across Australia utilising the EMBL-ABR ‘Hybrid’ method of delivery [16].
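As an illustration of the CKAN side of the data-movement mechanism listed above, the Python sketch below builds a dataset search request against CKAN's action API and extracts resource download URLs from a response. It is not the project's integration code: the `package_search` action and the response layout follow the general CKAN API, while the query, the sample response and the function names are hypothetical, and any authentication the BPA Data Portal requires is not shown.

```python
import json
from urllib.parse import urlencode

def package_search_url(portal, query, rows=10):
    """URL for a CKAN dataset search via the action API."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"

def resource_urls(response_body, fmt=None):
    """Collect resource URLs from a package_search response, optionally by format."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise RuntimeError("CKAN request was not successful")
    urls = []
    for dataset in payload["result"]["results"]:
        for res in dataset.get("resources", []):
            if fmt is None or res.get("format", "").lower() == fmt.lower():
                urls.append(res["url"])
    return urls

# Hypothetical response body, for offline demonstration only.
sample_response = json.dumps({
    "success": True,
    "result": {"count": 1, "results": [{
        "name": "example-dataset",
        "resources": [
            {"url": "https://example.org/otu.biom", "format": "BIOM"},
            {"url": "https://example.org/readme.txt", "format": "TXT"},
        ],
    }]},
})

print(package_search_url("https://data.bioplatforms.com/", "soil", rows=5))
print(resource_urls(sample_response, fmt="biom"))
```

URLs collected this way could then be handed to an analysis service for import; that second half of the exchange is outside the scope of this sketch.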

The project has maintained extensive, ongoing and transparent engagement with a wide range of stakeholders with varying interests and challenges in metagenomics production, distribution and use. This has been undertaken via a series of face-to-face stakeholder engagement events at locations across Australia (which have significant numbers of groups associated with the Australian Microbiome consortium), and through the use of a project blog [17], and a public Trello board which lists user requirements and tracks development sprints [18].

CONCLUSION

The cloud-based system we have developed by leveraging previous NCRIS-supported research data infrastructure represents an Australian first for end-to-end analysis and interpretation of environmental metagenomics data. A wide range of users is supported, including, critically, users who are not molecularly aware but need to interpret molecular-based metagenomics data in the context of species occurrence records in the environment.

REFERENCES

  1. Green-Tringe, S., and E. M. Rubin, Metagenomics: DNA sequencing of environmental samples. Nature Reviews Genetics, 2005. 6: 805-814.
  2. Bioplatforms Australia (BPA) Australian Microbiome Project – https://data.bioplatforms.com/organization/about/australian-microbiome
  3. ANDS-Nectar-RDS Research Data Cloud (RDC) Program – https://www.ands-nectar-rds.org.au/researchdomainprogram
  4. BPA Data Portal – https://data.bioplatforms.com/
  5. Galaxy-Australia – https://usegalaxy.org.au
  6. Genomics Virtual Laboratory (GVL) – https://www.gvl.org.au
  7. BIOM format – http://biom-format.org
  8. QIIME (Quantitative Insights Into Microbial Ecology) software – http://qiime.org
  9. Mothur (software for describing and comparing microbial communities) – https://mothur.org/
  10. Rhea (a set of R scripts for the analysis of microbial profiles) – https://lagkouvardos.github.io/Rhea/
  11. Phyloseq (a set of R scripts for the analysis of microbiome census data) – https://joey711.github.io/phyloseq/
  12. Galaxy Toolshed – https://toolshed.g2.bx.psu.edu
  13. Galaxy API – https://galaxyproject.org/develop/api/
  14. CKAN API – http://docs.ckan.org/en/ckan-2.7.3/api/
  15. EcoED – http://ecoed.org.au
  16. EMBL-ABR Hybrid Training Delivery Method – https://www.embl-abr.org.au/wp-content/uploads/2017/12/Monica2017.pdf
  17. Project blog – https://bioscience-rdc.blogspot.com.au
  18. Project Trello Board – https://trello.com/b/qsmrSuPC/rdc-development

Biography:

Jeff has a PhD in Biochemistry from the University of Queensland, and started his career conducting research in the fields of cancer, molecular genetics and embryo development in both Australia and the UK, prior to moving into the management of large biological data assets (gene sequence, images, etc.) through the establishment of EMAGE, a UK-based international database of gene expression and anatomy.

Prior to joining QCIF and RCC, Jeff was based at Intersect Australia in Sydney where he was the National Manager of the RDS-funded med.data.edu.au project and also responsible for a number of biology-focused data and IT-related projects across NSW (biobanking, omics, etc.). Prior to this, he was based in Melbourne at the Australian National Data Service (ANDS), where he was involved in commissioning and monitoring a number of biology/medicine-focused national data management projects.

About the conference

eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.

Conference Managers

Please contact the team at Conference Design with any questions regarding the conference.

© 2017 - 2018 Conference Design Pty Ltd