Immersive Visualization Technologies for eResearch

Dr Jason Haga1, Dr David Barnes2, Dr Maxine Brown3, Dr Jason Leigh4

1Cyber-Physical Cloud Research Group, AIST, Tsukuba, Japan,

2Monash Immersive Visualisation Platform, Monash Univ., Melbourne, Australia,

3Electronic Visualization Laboratory, Univ. of Illinois at Chicago, Chicago, USA,

4Univ. of Hawaii, Manoa, Honolulu, USA

 

Title Immersive Visualization Technologies for eResearch
Synopsis

One Paragraph

It is well known that data is accumulating at an unprecedented rate. These troves of big data are invaluable to all sectors of society, especially eResearch activities. However, the sheer volume of data poses significant challenges to data-intensive science. The visualization and analysis of data require an interdisciplinary effort and next-generation technologies, specifically interactive environments that can immerse the user in data and provide tools for data analytics. One such technology is virtual reality (VR), which, together with the Unity development platform, is becoming a viable, innovative solution for a wide variety of applications. To highlight this concept, we showcase two prototype VR applications: 1) a river disaster management application using over 17,000 sensors deployed throughout Japan, and 2) a “multi-player” collaborative virtual environment for scientific data that works across the scale of displays from desktop through ultra-scale CAVE2-like systems, demonstrated on two datasets: a segmented kookaburra anatomy and an archaeological dig in Laos. These applications explore how combinations of 2D and 3D representations of data can support and enhance eResearch efforts on these new VR platforms. This presentation is also intended to generate interest in the live demonstrations at the PRAGMA33 booth at eResearch.
Format of demonstration

Video, Live Demonstration, Slide Show

Video and Slide Show with reference to PRAGMA booth live demonstrations
Presenter(s)

Name, Title, Institution

Jason H. Haga, Senior Researcher, Cyber-Physical Cloud Research Group, AIST, Japan

David G. Barnes, Associate Professor and Director, Monash Immersive Visualisation Platform, Monash Univ., Melbourne, Australia

Maxine Brown, Director, Electronic Visualization Laboratory, Univ. of Illinois at Chicago

Jason Leigh, Professor, Univ. of Hawaii, Manoa

Target research community

One Sentence

Any research community looking for novel data visualization solutions.
Statement of Research Impact

One (short) Paragraph

Virtual reality and the Unity development platform are becoming a viable, innovative solution for eResearch. This Showcase presentation highlights two data visualization applications for which virtual reality is having a significant impact.
Request to schedule alongside particular conference session

Optional – List relevant conference sessions if any

Request to have our Showcase presentation early in the conference to provide sufficient time for people to visit the PRAGMA booth and experience live demos.
Any special requirements

Audio Visual Needs? Date/Time? Anything else…

 

 


 

Biography:

I am currently a senior researcher in the Cyber-Physical Cloud Research Group at the Information Technology Research Institute of AIST. My past research involved the design and implementation of applications for grid computing environments and tiled display walls. I also work with cultural heritage institutions to deploy novel interactive exhibits that engage the public in learning. My research interests are in immersive visualization and analytic environments for large datasets. I have over 13 years of collaboration with members of the PRAGMA community and continue to look for interdisciplinary collaboration opportunities.

orcid.org/0000-0002-6407-0003

Stemformatics Live Demonstration

Mr Rowland Mosbergen1

1University of Melbourne, Parkville, Australia

 

Title Stemformatics Live Demo
Synopsis Stemformatics is primarily a web-based pocket dictionary for stem cell biologists, running on the NeCTAR cloud. Part of the stem cell community for over six years, it allows biologists to quickly and easily visualise their private datasets. They can also benchmark their datasets against 330+ high-quality, preprocessed public datasets.
Format of demonstration Live Demonstration
Presenter(s) Rowland Mosbergen, Stemformatics Project manager, University of Melbourne
Target research community Biologists and individual bioinformaticians who want their users to look at their data interactively
Statement of Research Impact Stemformatics allows biologists to benchmark their datasets against 350+ public, manually curated, high-quality datasets that include stem cell, leukaemia, and infection and immunity samples. This has contributed to the recent influx of biologists wanting to identify potential cells of origin in a particular tissue for the expression of genes they are interested in.
Request to schedule alongside particular conference session I’m giving a talk on Thursday afternoon
Any special requirements Monitor to display Stemformatics

 


Biography:

Rowland Mosbergen is the Project Manager and Lead Developer for the Stemformatics.org collaboration resource. Rowland has 17 years’ experience in IT across research, corporate financial software and small business. He graduated from QUT in 1997 with a Bachelor of Engineering in Aerospace Avionics, then worked for GBST, a software company servicing the financial industry, where he worked with National Australia Bank and Merrill Lynch on their margin lending products for over 4 years. Rowland owned and ran a computer support business for over 5 years, then worked as a web developer for 2 years before joining the Wells laboratory as part of the Stemformatics team in 2010.

Rowland’s experience in the commercial and private sectors gives him a solid understanding of customer requirements when designing and implementing web resources. He has implemented scalable design solutions for database querying and data visualisation that service a growing research community. He is a key member of a diverse academic team that is product-focused, with an emphasis on quality, responsiveness and customer usefulness. He prides himself on developing web environments that are fast, useful and intuitive.


Australian Urban Research Infrastructure Network (AURIN) Workshop

Xavier Goldie – Outreach Manager, AURIN1

1Australian Urban Research Infrastructure Network, The University of Melbourne, Melbourne 3000, Australia, Xavier.goldie@unimelb.edu.au

 

GENERAL INFORMATION

  • Is this workshop half-day or full-day?

Half Day

  • Who is the primary presenter for the workshop?

Xavier Goldie

  • Does the workshop include a hands-on component?

Yes

  • Are there any constraints on the number of attendees?

Try to limit to 25 people max

  • Are there any special seating or table requirements (e.g. for breakouts, teams)?

No

  • Are there any technical requirements beyond AV and access to wireless network?

Plenty of powerboards to be provided for attendees’ laptops

DESCRIPTION

An AURIN workshop is a great opportunity to break down the AURIN Workbench and other spatial tools and to dive deep into the data access and analytics available through our platform.

Participants in the AURIN workshop will explore the extensive data repositories and extract information about Australian cities. Using the user-friendly yet sophisticated tools contained within the AURIN portal, participants can mould this information into visible and sharable knowledge. Until now, much of this information has remained behind closed doors. AURIN enables access to this data for policy decision makers and planning professionals across all urban fields, letting users discover and mash up data, information and knowledge.

Attendees will be able to expand their skills in GIS, accessing data through the AURIN Portal and AURIN Map. They will also learn how to interoperate between systems (the AURIN Portal and QGIS) for maximum analytical impact.

Participants can undertake comparative analyses to study health data, analyse revealing socio-economic information, investigate walkability of neighbourhoods and more. Familiarity with these metrics is essential to understanding patterns of urban development and to best inform smart urban growth for a sustainable future.

Please provide an outline of the workshop content using the following format.

  1. Intro to AURIN

        20 minutes

  2. Using AURIN Map

        30 minutes

  3. Using the AURIN Portal

        60 minutes

  4. Interoperating with QGIS

        60 minutes

WHO SHOULD ATTEND

The workshop is open to all academic and government researchers who wish to learn more about AURIN and the potential to incorporate a spatial decision support aspect to their research.

WHAT TO BRING

Participants should bring a laptop and should, if possible, install QGIS prior to attending the workshop (free, open-source GIS, available at www.qgis.org).


 

 

Biography

Xavier Goldie is the Outreach Manager at AURIN.

Vector Space Models and Semantic Analysis

Dr Simon Musgrave1, Dr Alice Gaby1, Mr Gede Primahadi Wijaya Rajeg1

1Monash University, Melbourne, Australia, Simon.Musgrave@monash.edu, Alice.Gaby@monash.edu, gpri21@student.monash.edu

 

INTRODUCTION

Distributional semantic analysis is based on the idea that words which occur in the same contexts tend to have similar meanings, encapsulated by J.R. Firth: “a word is characterized by the company it keeps” [1, p. 57]. One way of implementing such an approach is to use vector space models [2], [3] of word meaning. Such models represent text as a matrix which locates each word in a multi-dimensional space; words which are used in similar contexts are close to each other in the spatial model, and words which rarely co-occur in the text are far apart. Given a sufficiently large text sample, a model can be constructed which approximates the Saussurean ideal of showing the differences between every lexical element of a language. Implementations of algorithms to produce such models are now easily available [4], [5]. In this paper, we present initial results of semantic analysis using vector space models. Our case study uses the 22 verbs describing events of cutting and breaking identified by [6], [7].
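As an illustration of how such a model can be built in practice, the sketch below trains a small vector space model with gensim [5], whose word2vec implementation follows [4]. The toy corpus and all parameter values other than the 20 dimensions are placeholders, not the study’s actual configuration:

```python
# Minimal sketch (gensim >= 4.0): train a 20-dimensional vector space model.
# The two-sentence corpus is a stand-in; the study used the full COCA corpus [8].
from gensim.models import Word2Vec

corpus = [
    ["she", "cut", "the", "bread", "with", "a", "knife"],
    ["he", "chopped", "the", "onion", "and", "sliced", "the", "carrot"],
    # ... many more tokenised sentences
]

model = Word2Vec(corpus, vector_size=20, window=5, min_count=1, sg=1)

# Words used in similar contexts end up close together in the 20-d space.
print(model.wv.most_similar("cut", topn=5))
```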

METHOD & RESULTS

A 20-dimensional model was built using the entire contents (more than 500 million words) of the Corpus of Contemporary American English [8]. A vector matrix for the 22 cut/break words was extracted from the model. The matrix was then used as the basis for a hierarchical clustering analysis, resulting in the dendrogram in Figure 1, which shows seven clusters as the most parsimonious grouping of the data.
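Continuing from the sketch above (and assuming a model trained on a corpus large enough to contain all 22 verbs), the clustering step might look as follows; the linkage method and distance metric are our assumptions, as the abstract does not specify them:

```python
# Sketch of the clustering step: extract the verb vectors and build a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

verbs = ["cut", "break", "slice", "chop", "peel", "hack",
         "slash", "hew", "cleave", "saw", "scythe"]  # subset of the 22 for brevity

X = np.vstack([model.wv[v] for v in verbs])  # one 20-d row vector per verb

Z = linkage(X, method="average", metric="cosine")  # assumed settings
dendrogram(Z, labels=verbs)
plt.show()
```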

Figure 1 – Hierarchical cluster analysis of 22 verbs of cutting and breaking

DISCUSSION

We suggest that the dendrogram shows two aspects of the value of these methods in semantic analysis.

Firstly, the clustering reflects semantic intuitions in most cases. For example, the first split in the clustering contrasts words which can be viewed as more basic, such as cut and break themselves, against more specific words, such as slash and hack, which are hyponyms of the first group. As another example at a lower level of the clustering, the somewhat archaic words hew and cleave group together. However, there are anomalies in the clustering: for example, saw is in the first main group discussed, and scythe does not group with hew and cleave.

Secondly, the lowest level of clustering shows us the words which are closest to each other in the model, allowing us to ask what conceptual differences are relevant in distinguishing these words. An interesting example of this is the group slice, peel and chop. Intuition might suggest that slice and chop would be close to each other, with peel denoting a rather different type of cutting. But in the model, peel and chop are closest, with slice joining them at the next level in the hierarchy.

The anomalies in these results suggest that the next step in applying these methods is to use them in association with collocational analysis. The vector space model is built from the co-occurrence of words; therefore, a phenomenon such as the relation seen here between peel and chop may be based on a commonality in the entities the activity is applied to, rather than intrinsic properties of the activity. This suggestion is confirmed by Figure 2, which represents a network analysis of the co-occurrence patterns of the 20 words closest to each of the target verbs. The cluster in the upper right of this figure consists of words all used in recipes, which suggests that this genre may be over-represented in the data source.

FUTURE WORK

The studies from which we drew inspiration [6], [7] make comparisons across languages and we are extending our research in this direction, initially to include data from Dutch, German and Swedish (as in [7]).

REFERENCES

[1]           J. R. Firth, “A synopsis of linguistic theory 1930-1955,” in Selected Papers of J.R. Firth 1952-1959, F. R. Palmer, Ed. London: Longman, 1968, pp. 168–205.

[2]           S. Clark, “Vector Space Models of Lexical Meaning,” in The Handbook of Contemporary semantic theory, Second Edition., S. Lappin and C. Fox, Eds. Hoboken: John Wiley & Sons, 2015, pp. 493–522.

[3]           P. D. Turney and P. Pantel, “From frequency to meaning: Vector space models of semantics,” J. Artif. Intell. Res., vol. 37, no. 1, pp. 141–188, 2010.

[4]           T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” ArXiv Prepr. ArXiv13013781, 2013.

[5]           R. Řehůřek, “Scalability of Semantic Analysis in Natural Language Processing,” PhD thesis, Masaryk University, 2011.

[6]           A. Majid, J. S. Boster, and M. Bowerman, “The cross-linguistic categorization of everyday events: A study of cutting and breaking,” Cognition, vol. 109, no. 2, pp. 235–250, 2008.

[7]           A. Majid, M. Gullberg, M. van Staden, and M. Bowerman, “How similar are semantic categories in closely related languages? A comparison of cutting and breaking in four Germanic languages,” Cogn. Linguist., vol. 18, no. 2, Jan. 2007.

[8]           M. Davies, “The Corpus of Contemporary American English: 520 million words, 1990-present.” 2008.


 

Biography

http://orcid.org/0000-0003-3237-9943

Simon Musgrave is a lecturer in linguistics at Monash University who locates much of his work in recent years in the field of Digital Humanities. This continues a longstanding interest in the use of computational tools for linguistic research.  Simon is a member of the executive of the Australasian Association for Digital Humanities and of the management committee of the Australian National Corpus.

‘Cursed Forest’ – a random forest implementation for big, extremely highly dimensional data

Mr Piotr Szul1, Mr Aidan O’Brien2, Dr Rob Dunne3, Dr Denis Bauer2

1Data61, CSIRO, Brisbane, QLD, Australia, piotr.szul@data61.csiro.au

2Health & Biosecurity, CSIRO, Sydney, NSW, Australia

3Data61, CSIRO, Sydney, NSW, Australia

 

INTRODUCTION

Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at population level rather than for a small number of individuals. This provides new power to whole-genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.

As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes Project [6], with the genomes of 2,504 individuals, includes nearly 85M genomic variants and has a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches, especially if potential interactions between variants need to be considered.

Random forest [1] is one of the methods found to be useful in this context [3], because of its propensity for parallelization, its robustness, and its inherent ability to model interactions [2]. The variable importance measures extracted from random forest models can be used to identify variants associated with traits of interest.
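As a single-machine illustration of the variable-importance idea (using scikit-learn, not CursedForest itself), the following sketch recovers trait-associated variants from a synthetic genotype matrix; all values are illustrative:

```python
# Illustration only: synthetic data, rows = samples, columns = variants (0/1/2).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 10_000))   # 200 samples, 10,000 variants
y = (X[:, 42] + X[:, 4242] > 2).astype(int)  # trait driven by two variants

rf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1)
rf.fit(X, y)

print("OOB error:", 1 - rf.oob_score_)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top variants by importance:", top)   # should surface 42 and 4242
```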

A number of random forest implementations are available for single-machine computing, with interfaces for both R and Python. Some of them, such as Ranger [4], are specifically designed to process highly dimensional WGAS-like datasets and boast significant performance improvements over the more generic implementations.

These implementations, however, are limited by the size of a single machine’s memory; for larger datasets, distributed solutions, in which data can be partitioned among multiple machines, are needed. This approach underpins the computational model of many leading big data technologies, including Apache Hadoop and, more recently, Apache Spark [7] – a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Notably, the Spark machine learning library (Spark MLlib [8]) comes with a random forest implementation capable of dealing with huge datasets. This implementation, however, is tuned to work on typical datasets with a large number of samples and a relatively small number of variables, and either fails or is inefficient on highly dimensional data [5].
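For reference, the conventional MLlib interface looks like the sketch below; each row carries one sample’s full feature vector, which is precisely the row-wise layout that becomes a bottleneck at WGAS dimensionality (the data here is toy data):

```python
# Toy example of the standard Spark MLlib random forest API.
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-rf-demo").getOrCreate()

df = spark.createDataFrame(
    [(1.0, Vectors.dense([0.0, 1.0, 2.0])),   # one row per sample
     (0.0, Vectors.dense([2.0, 0.0, 0.0])),
     (1.0, Vectors.dense([0.0, 2.0, 1.0])),
     (0.0, Vectors.dense([1.0, 0.0, 0.0]))],
    ["label", "features"])

model = RandomForestClassifier(numTrees=20, maxDepth=5).fit(df)
print(model.featureImportances)

spark.stop()
```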

To address these problems, we have developed CursedForest – a Spark-based, distributed implementation of random forest optimized for large, highly dimensional datasets. We have successfully applied CursedForest to datasets beyond the reach of other tools, and for smaller datasets found its performance superior. We are currently applying CursedForest, released as part of the VariantSpark [9] toolkit, to a number of WGAS studies.

‘CURSED FOREST’

Typically, random forest implementations operate on the traditional matrix-like data representation, with samples in rows and features in columns. In a distributed scenario this leads to ‘by row’ partitioning, which for building decision trees on highly dimensional data has been shown to be significantly less efficient than the alternative ‘by column’ layout and partitioning [5].

CursedForest uses the basic principle of ‘by column’ partitioning but extends it to ensembles of trees and introduces a number of optimizations aimed at the efficient processing of extremely highly dimensional genomic datasets, including a memory-efficient representation of genomic data, optimized communication patterns and computation batching.
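The sketch below conveys the ‘by column’ idea in plain Python/NumPy rather than Spark: with a column-major layout, each worker can score candidate splits for its own slice of features independently, so only small per-feature summaries, not raw data, need to be exchanged. The thresholds and impurity measure are illustrative, not CursedForest’s internals:

```python
# Conceptual sketch of 'by column' split evaluation (NumPy stand-in for Spark).
import numpy as np

def gini(y):
    p = np.bincount(y, minlength=2) / max(len(y), 1)
    return 1.0 - np.sum(p ** 2)

def best_split_for_column(col, y):
    # genotype features take values {0, 1, 2}; try each threshold
    best = (np.inf, None)
    for t in (0, 1):
        left, right = y[col <= t], y[col > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        best = min(best, (score, t))
    return best

rng = np.random.default_rng(1)
X_cols = rng.integers(0, 3, size=(1000, 50))  # 1000 samples, 50 features
y = rng.integers(0, 2, size=1000)

# In CursedForest each worker owns a slice of columns; here we just loop.
scores = [best_split_for_column(X_cols[:, j], y) for j in range(X_cols.shape[1])]
best_j = min(range(len(scores)), key=lambda j: scores[j])
print("globally best split: feature", best_j, "(score, threshold):", scores[best_j])
```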

RESULTS

We have compared the performance of CursedForest on genomic datasets against other leading implementations and tested its ability to scale on very large datasets. The results demonstrate not only that CursedForest is the only implementation able to process WGAS-size datasets, but also that it significantly outperforms other methods on smaller datasets.

Figure 1: Performance of CursedForest compared to MLlib and Ranger (left); scaling of CursedForest to WGAS-sized datasets (right).

We have also demonstrated that CursedForest can accurately predict ethnicity from the whole-genome sequencing profiles of 2,500 individuals, being the only implementation scaling to the full dataset. We trained CursedForest on the 1000 Genomes dataset, which consists of 2,504 samples with 81,047,467 features each, to predict ethnicity from genomic profiles. CursedForest achieves an out-of-bag error of OOB=0.01 and completes in 36 minutes 54 seconds, demonstrating its capability to run on the population-scale cohorts of real-world applications.

CONCLUSIONS

We have developed CursedForest – a Spark-based random forest implementation optimized for highly dimensional datasets – and demonstrated that it outperforms other tools on genomic datasets, both in terms of computation time and dataset size limits, and is capable of running on the population-scale cohorts of real-world applications.

REFERENCES

  1. Breiman L., Random Forests. Machine Learning. 2001 October;45(1):5–32.
  2. Lunetta et al., Screening large-scale association study data: exploiting interactions using random forests. BMC Genetics 2004 5:32
  3. Díaz-Uriarte R, Alvaréz de Andres S., Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3
  4. Wright, Marvin N., and Andreas Ziegler., “ranger: A fast implementation of random forests for high dimensional data in C++ and R.” arXiv preprint arXiv:1508.04409 (2015)
  5. Firas Abuzaid et al.,  Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale.  Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016,  Barcelona, Spain
  6. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012 Nov;491(7422):56–65.
  7. Apache Spark, available from https://spark.apache.org/
  8. Spark MLLib, available from: https://spark.apache.org/mllib/
  9. VariantSpark available from: https://github.com/csirobigdata/variant-spark

 


Biography

Piotr Szul is a Senior Engineer at CSIRO Data61. He holds an MSc degree in Computer Science and has over fifteen years of experience in developing various types of commercial and scientific software applications.

Since joining CSIRO in 2012, Mr Szul has been involved in a number of projects developing research platforms and applying emerging big data technologies to research problems in the domains of social media, genomics and materials science. In addition to his core software engineering activities, Mr Szul has also been actively involved in building experimental cloud and big data processing infrastructure.

Omicxview: an interactive metabolic pathway visualisation tool

Miss Ariane Mora1, Mr Rowland Mosbergen2, Mr Steve Englart1, Mr Othmar Korn1, Associate Professor Mikael Boden3, Professor Christine Wells2,4

1 Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, QLD, Australia

ariane.mora@uq.net.au, o.korn@uq.edu.au, steve.englart@stemformatics.org

2 Department of Anatomy and Neuroscience, The University of Melbourne, Melbourne, Victoria, Australia

rowland.mosbergen@unimelb.edu.au, wells.c@unimelb.edu.au

3 School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Queensland, Australia

m.boden@uq.edu.au

4 The Walter and Eliza Hall Research Institute, Melbourne, Victoria, Australia

 

Omicxview is an interactive visualisation portal that enables researchers to display large metabolic datasets on well-defined Escher pathways [1]. It addresses the gap between very simple static views, such as the common approach of colouring KEGG pathways, and comprehensive networks such as Reactome, which can be so complex that the signal of interest is dwarfed by background information, and thus difficult for untutored users to navigate. Omicxview overlays experimental data onto Escher metabolic pathways, providing users with intuitive and interactive ways to explore large multi-omic datasets.

 

The biggest challenge faced by this project was the lack of standardised nomenclature in the Metabolomics community, leading to ambiguities when assigning a metabolite to a pathway term. Our solution was to develop a robust identification process that enables users to map their uploaded metabolites to a range of public database identifiers, including ChEBI, KEGG, BiGG, HMDB and MetaNetX.
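A hedged sketch of this mapping step is given below; the lookup table and its entries are illustrative placeholders, not Omicxview’s actual data model, but they convey how a user-supplied name can be resolved to several database identifiers at once:

```python
# Illustrative identifier-mapping table; in Omicxview the cross-references
# come from public databases such as ChEBI, KEGG, BiGG, HMDB and MetaNetX.
import pandas as pd

id_map = pd.DataFrame([
    {"name": "glucose",  "chebi": "CHEBI:17234", "kegg": "C00031", "bigg": "glc__D"},
    {"name": "pyruvate", "chebi": "CHEBI:15361", "kegg": "C00022", "bigg": "pyr"},
])

def resolve(metabolite_name):
    """Map a user-supplied metabolite name to known database identifiers."""
    hit = id_map[id_map["name"].str.lower() == metabolite_name.lower()]
    return hit.to_dict("records")[0] if not hit.empty else None

print(resolve("Glucose"))
```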

Omicxview is an open source web application running on the NeCTAR [2] cloud. It was designed to integrate with the Stemformatics.org stem cell portal. Stemformatics is a collaboration platform for the stem cell community, designed for the rapid visualisation of multi-omics data [3].

Omicxview provides an enhanced user experience and tools for interactive visualisation of metabolic pathways. The remit includes the ability to share views and datasets and the capacity to evolve and add functionality as the underlying datasets require.

Omicxview has been developed using D3 and SVG to produce an extensible, interactive environment where the activity of metabolites of interest can be explored using a range of metrics. These include significant differences between sample types in the form of fold change or the users’ choice of statistical value. Figure 1, Figure 2 and Figure 3 are examples of such filters applied to randomly generated datasets on a range of pathways.
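The kind of metric referred to here can be sketched as follows; the column names, replicate values and thresholds are hypothetical, not Omicxview’s actual schema:

```python
# Sketch: log2 fold change and a p-value filter between two sample types.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "metabolite": ["glucose", "pyruvate", "citrate"],
    "control":   [[10.1, 9.8, 10.3], [5.2, 5.0, 5.1], [7.5, 7.7, 7.4]],
    "treated":   [[12.4, 12.9, 12.1], [5.1, 5.3, 5.0], [9.9, 10.2, 10.0]],
})

df["log2fc"] = df.apply(
    lambda r: np.log2(np.mean(r["treated"]) / np.mean(r["control"])), axis=1)
df["p"] = df.apply(
    lambda r: stats.ttest_ind(r["treated"], r["control"]).pvalue, axis=1)

# Keep only metabolites worth highlighting on the pathway overlay.
significant = df[(df["p"] < 0.05) & (df["log2fc"].abs() > 0.5)]
print(significant[["metabolite", "log2fc", "p"]])
```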

By providing cross-database mapping of identifiers, Omicxview supports the development of in-house standards. By providing rapid and clear visualisation of experimental data, Omicxview aids identification of experimental trends within metabolomic data series. Furthermore, collaboration is fostered between research groups by providing the opportunity to host and share private data. These attributes are representative of Stemformatics’ commitment to developing high quality, open source tools that benefit the wider stem cell community.

Stemformatics [3] developed Omicxview in collaboration with Metabolomics Australia [4], as part of the Bioplatforms Australia initiative [5].

Figure 1: Carbohydrate metabolism pathway with a randomly generated dataset to represent metabolite expression values. The metabolites have been highlighted on arbitrarily assigned p-values to display the tools provided by Omicxview.

Figure 2: Ascorbate metabolism with a randomly generated dataset to represent metabolite expression values. The graph containing the minimum sample difference has been highlighted in orange.

Figure 3: Glycolysis tricarboxylic acid cycle with a dataset containing four sample types and randomly generated metabolite expression values. Reactions between two metabolites within the dataset have been highlighted. The opaque blue outer sphere indicates that the user has uploaded statistical information for the given metabolite. The user can view the information by hovering over, or clicking on the metabolite.

 

REFERENCES

  1. King, Z. et al., Escher: A web application for building, sharing, and embedding data-rich visualizations of biological pathways. PLOS Computational Biology 11(8): e1004321. doi:10.1371/journal.pcbi.1004321
  2. NeCTAR research cloud website, http://nectar.org.au/, accessed 12 June 2017
  3. Wells, C.A. et al., Stemformatics: Visualisation and sharing of stem cell gene expression. Stem Cell Research, DOI: http://dx.doi.org/10.1016/j.scr.2012.12.003
  4. Metabolomics Australia website, http://www.metabolomics.net.au/, accessed 12 June 2017
  5. Bioplatforms Australia website, http://www.bioplatforms.com/, accessed 12 June 2017

Biographies

Ariane Mora is completing a Bachelor of Electrical and Computer Engineering at the University of Queensland. She recently finished her honours thesis under the supervision of Rowland Mosbergen and Associate Professor Mikael Boden, developing a metabolic pathway visualisation portal. For the past year she has worked for Stemformatics developing web-based visualisation tools for stem cell researchers.

Rowland Mosbergen is the Project Manager and Lead Developer for the Stemformatics.org collaboration resource. Rowland has 17 years’ experience in IT across research, corporate financial software and small business. Rowland helped to design the Stemformatics code-base to be flexible enough to handle multi-omics datasets, and while the application is aimed at the stem cell community, the fundamentals are suitable for any community-based data-visualisation environment.

Visualisation on Popular Platforms with a bit Extra Novelty

Dr Chao Sun1

1The University Of Sydney, Camperdown, Australia, chao.sun@sydney.edu.au

 

INTRODUCTION

As the data scientist in the Faculty of Arts and Social Sciences, one of the requests I receive most often is for visualisation. In digital humanities, visualisations are mostly used for showing networks, presenting findings and analysing numeric records. A number of popular platforms are well designed and widely used for such purposes: Gephi is often used for exploring networks, NVivo is a good platform for analysing and visualising textual data, and Tableau is the trending business analytics solution for generating reports.

The existence of these tools makes it much easier to replicate and standardise some work. However, the nature of research is never limited to pre-designed functions, and many social science research projects raise unique digital requirements to assist qualitative studies.

This presentation includes two showcases in which visualisations were developed more as a research tool than as a standard visual outcome. In each case, popular software (Gephi or Tableau) was employed but tweaked to present information differently.

SHOWCASE 1: TIMELINE OF WIKIPEDIA PAGES & EDITORS

Wikipedia is a very rich knowledge resource; it also acts as a crowd media agency that gathers and records up-to-date information, especially for emergency events such as the 2014 Sydney hostage crisis (the Lindt Café siege).

One of our researchers was interested in how the relevant Wikipedia page was constructed, who contributed the edits, and how the debates on the page unfolded and eventually settled. The good news is that Wikipedia is fully open, and all revision details can be retrieved using its API. However, over the past two and a half years there have been over 3,000 revisions of this page, made by hundreds of editors. It is impossible to read through all the history and conduct a thorough qualitative study. The revision data can be gathered, cleansed and organised in a spreadsheet, but it is still not easy to find the right spot to drill into.
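As a sketch of the data-gathering step, the MediaWiki API can return the revision history directly (at most 500 revisions per call; the ‘rvcontinue’ token, elided here, pages through the full 3,000+):

```python
# Sketch: fetch revision history for the page via the MediaWiki API.
import requests

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "2014_Sydney_hostage_crisis",
    "rvprop": "timestamp|user|size|comment",
    "rvlimit": 500,
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()

page = next(iter(resp["query"]["pages"].values()))
revisions = page["revisions"]  # list of dicts: timestamp, user, size, ...
print(len(revisions), revisions[0])
```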

After much discussion, the researcher and I concluded that the best tool for approaching the research questions would be a timeline on which the active editors, significant contributions and important time points could be visually identified. However, no existing tool could generate such a timeline for this specific problem.

Figure 1. Important Revision/Editor Timeline for the “2014_Sydney_hostage_crisis” Wikipedia Page.

 

As shown in Figure 1, Python functions were written to analyse the Wikipedia revision data and generate a network graph with all coordinates; Gephi was then used to draw the nodes and edges as a visually informative research tool for studying the problem. This timeline visualisation shows only significant edits made by the most active editors. The X-axis is time and the Y-axis is the size of the article (word count). Node size represents how much change was made to the page, and node colour identifies the editor. Edits made by the same editor share a colour and are linked together, so it is easy to see when an editor made a contribution and revisited for more edits.
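A minimal sketch of the graph-generation step is given below, continuing from the `revisions` list fetched earlier. It is our reconstruction, not the project’s actual code: networkx stands in for the custom Python functions, the significance threshold is arbitrary, and the GEXF ‘viz’ attributes carry positions and sizes into Gephi:

```python
# Sketch: one node per significant edit, positioned at (time, article size),
# sized by the amount changed, with an editor's consecutive edits linked.
from datetime import datetime
import networkx as nx

G = nx.DiGraph()
prev_by_editor, prev_size = {}, 0

for i, rev in enumerate(sorted(revisions, key=lambda r: r["timestamp"])):
    t = datetime.fromisoformat(rev["timestamp"].replace("Z", "+00:00"))
    delta = abs(rev["size"] - prev_size)
    if delta > 200:  # 'significant' threshold is an arbitrary choice here
        G.add_node(i, editor=rev["user"],
                   viz={"position": {"x": t.timestamp(),
                                     "y": float(rev["size"]), "z": 0.0},
                        "size": float(delta)})
        if rev["user"] in prev_by_editor:      # link same-editor edits
            G.add_edge(prev_by_editor[rev["user"]], i)
        prev_by_editor[rev["user"]] = i
    prev_size = rev["size"]

nx.write_gexf(G, "timeline.gexf")  # open in Gephi for interactive exploration
```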

Because all nodes and edges are automatically generated, similar timelines can be quickly produced for various periods and selected editors. The interactive interface of Gephi makes the timeline an even more powerful tool, with filtering, highlighting and customised information display capabilities.

SHOWCASE 2: HIERARCHICAL INSCRIPTION DISPLAY WITH TABLEAU

In another research project, the business analytics platform Tableau is used purely as an interactive interface for displaying a Buddhist inscription with multiple hierarchical levels of meaning.

In this work, the Gāndhārī inscriptions have been closely studied, analysed and modelled using READ, a workbench specially developed for the study of ancient documents. Texts, annotations and tags of the inscription are retrieved from an online database server and then processed using Python. Each grapheme in the inscription is displayed as a tile with various colours and shapes in different panels of the Tableau workbook, depending on the hierarchy and analysis relationships. The researcher can easily filter, highlight, and click to display relevant levels and check the metadata attached to each token, as shown in Figure 2.
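A hedged sketch of that preprocessing step follows: the nested annotation structure below is hypothetical (the real data comes from a READ database server), but flattening it to one row per grapheme per hierarchy level is what lets Tableau render the tiles and panels:

```python
# Sketch: flatten nested annotations into a tidy CSV that Tableau can
# render as coloured tiles, one row per grapheme per hierarchy level.
import csv

inscription = [
    {"grapheme": "dha", "line": 1, "pos": 1,
     "levels": {"syllable": "dha", "word": "dharma", "clause": "A"}},
    {"grapheme": "rma", "line": 1, "pos": 2,
     "levels": {"syllable": "rma", "word": "dharma", "clause": "A"}},
]

with open("tiles.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["line", "pos", "grapheme", "level", "value"])
    for g in inscription:
        for level, value in g["levels"].items():
            w.writerow([g["line"], g["pos"], g["grapheme"], level, value])
```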

Figure 2. Tableau Workbook for Displaying Hierarchy Levels of Buddhist Inscription as Interactive Tiles.

Setting aside Tableau’s considerable number-crunching capacity, we turned it into an interactive text display platform on which researchers can view and explore the conceptual relationships among texts. This has proven to be a much more efficient and effective way of studying the material than exploring tables stored in a database.

CONCLUSION

Digital humanities and social sciences is still quite a new area of research, with a great deal of potential as well as problems. With a sufficient level of communication and understanding between the data scientist and the researcher, it is often necessary to design novel methodologies for approaching particular research questions with digital tools.

However, it is often not necessary to build everything from scratch. With good problem-solving skills and some creativity, we can use existing tools in different ways and achieve our goals smartly.

 


Biography

Chao Sun obtained his PhD in data mining from the University of Wollongong and joined the Faculty of Arts and Social Sciences at the University of Sydney as a data scientist in 2016. Chao has supported and collaborated on many digital humanities and social sciences research projects, providing services such as consulting, research methodology design, data collection and crunching, and visualisation. Chao also works jointly in the Sydney Informatics Hub at USyd and is the representative for the TrISMA project (QUT). Chao enthusiastically acts as a bridge for cross-discipline collaborations between the faculty and a broader network of data-specialised researchers.

eStoRED – A Distributed Platform for Research Data Evaluation, Enrichment and Stories Drafting

Mr Guillaume Prevost1, Professor Heinrich Schmidt2

1 RMIT University, Melbourne, Australia, guillaume.prevost@rmit.edu.au 
2 RMIT University, Melbourne, Australia, heinrich.schmidt@rmit.edu.au

 

In recent times, several eResearch applications have appeared that assist discipline experts in Australian universities in telling stories about research data. We describe eStoRED, a platform that not only helps gather data and quickly pull it together into a meaningful story draft, but also assists the researcher in enriching it with calculations and visualizations. Keeping a focus on research users, the platform adds value to data using calculators and connectors, fusing heterogeneous data together. It fits into the research method and alongside existing eResearch tools supporting this process. We first look at some of the use cases eStoRED has served, then describe some uncommon aspects and features that make it valuable as an eResearch platform.

 

RESEARCH FOCUSED

Since the genesis of eStoRED, adding value to data through interpretation by experts in their relevant fields has been key. The platform originated as a tool focused on providing data related to Australian and South Pacific seaports to support early-stage climate risk assessment and climate change adaptation training/planning [3]. Now more versatile and more mature, eStoRED remains a tool in the hands of the expert.

We take a brief look at cases where the platform has been used to enrich and facilitate the re-use of research data in different contexts. RMIT University gathers signature data collections used or produced by some of the University’s researchers, and needs enrichment of these very diverse collections so that users browsing them can gain an almost immediate understanding of their content and how they could be re-used for other research purposes.

The first example is a research data collection of tweets during the UK riots of 2011. It includes data on the course of the events and on the role of the software facilitating and shaping the discussion [2]. The data are a snapshot of Twitter activity returned by the Twitter streaming API over a one-week period: over 22 million tweets. This dataset is larger than 100 GB, making it difficult to quickly grasp its global traits. eStoRED was used in combination with the Neo4j graph database to pull data from the collection and show an overview of some of its aspects.
The second collection consists of open data from the UK company eCourier: actual movement data of couriers tracked over more than a month. Prof. Matt Duckham et al. [1] used the collection to analyse the modelling possibilities for reconstructing individual movements or flow based on checkpoint counts at different times.

Driven by researchers’ deep understanding of the data, eStoRED was used to calculate and generate metadata and visualizations on these large datasets, enabling the creation of valuable metadata that enriches the collections and makes them much more accessible to other researchers.

 

CALCULATORS

eStoRED stories are composed of annotated “data elements” connecting to data providers via a publish-subscribe system. This allows a researcher to add specific model-driven calculators with only a small effort and enables their seamless integration into the platform, without changing the eStoRED software itself. They simply need to be added to a curated catalog of calculators that keeps metadata for each of them for discoverability, provenance and re-usability, stored in the MyTardis data curation system [4].

Visualizers are capable of presenting data from a specific angle, while calculators can apply complex processing to the data received. An example of a built-in calculator is the asset risk estimator based on the ISO 31000 risk management standard, with risk and mitigation lists, assessment formulae and enterprise dashboards. Concrete examples of past extensions include infrastructure deterioration models calculating timber, steel and concrete deterioration under climate adaptation risk scenarios for Australian and Pacific seaports [3], driven by Excel spreadsheets containing the model as formulae.
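As a flavour of what such a model-driven calculator computes, here is a toy risk-matrix sketch in the spirit of ISO 31000; the rating scales and level thresholds are illustrative, not eStoRED’s built-in formulae:

```python
# Toy risk-matrix calculator: score assets from likelihood and consequence.
def risk_score(likelihood, consequence):
    """Both rated on an assumed 1 (lowest) to 5 (highest) scale."""
    return likelihood * consequence

def risk_level(score):
    # Thresholds are illustrative only.
    if score >= 15: return "extreme"
    if score >= 8:  return "high"
    if score >= 4:  return "medium"
    return "low"

assets = {"wharf timber": (4, 4), "crane steel": (2, 5)}
for asset, (lik, con) in assets.items():
    s = risk_score(lik, con)
    print(asset, s, risk_level(s))
```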

 

DATA FUSION

Connecting a data element to several heterogeneous sources of data enables combining them in a single calculator or visualizer, considerably augmenting the possibilities for trying things out with data. This is crucial for a platform used close to data capture, at the point when data analytics is perhaps most powerful: models, hypotheses and evaluations are still being fine-tuned, failure is a key to success, and changes in experiments can accelerate the research most. At such a stage, data is still being explored and tinkered with, varying formats, algorithmic processing and visual presentation to target the academic or sponsor community, and the data is fresh in the minds of researchers and can be described and documented with the least effort.

A simple example of data fusion in eStoRED was implemented in a proof of concept for the Australia-India Research Centre for Automation Software Engineering (AICAUSE), where the topological organization of a production line in a factory, modelled as static data, is combined in a single visualization with live sensor monitoring data from the production line as it operates [5].

 

CONNECTORS

A key extensibility feature of eStoRED is its open architecture permitting researchers to define connectors to a variety of external services federated around a RabbitMQ service bus. The benefits of connectors include links to RESTful services, live real-time data feeds, data ingestion and conversion scripting as part of connector functionality, and a federated peer-to-peer architecture in a distributed Model-View-Controller pattern. In contrast, many other visualisation and story-drafting tools depend on local data, databases and/or, often, static data.
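A minimal sketch of a connector in this style is shown below, using the pika client for RabbitMQ; the queue names, message shape and toy calculation are all assumptions rather than eStoRED’s actual interfaces:

```python
# Sketch of an eStoRED-style connector: subscribe to a feed queue on the
# RabbitMQ service bus, enrich each message, and republish for data
# elements to consume.
import json
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="raw.feed")
ch.queue_declare(queue="story.data")

def on_message(channel, method, properties, body):
    reading = json.loads(body)
    reading["fahrenheit"] = reading["celsius"] * 9 / 5 + 32  # toy calculator
    channel.basic_publish(exchange="", routing_key="story.data",
                          body=json.dumps(reading))
    channel.basic_ack(delivery_tag=method.delivery_tag)

ch.basic_consume(queue="raw.feed", on_message_callback=on_message)
ch.start_consuming()
```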

 

CURATION

eStoRED is an integral researcher-facing part of a platform that includes the MyTardis research data curation system. MyTardis is an application for cataloguing, managing and assisting the sharing of large scientific datasets privately and securely over the web [4].

 

CONTINUITY

The Chiminey parallel sweeper and smart connector to clusters and clouds [6], and the KNIME workflow engine, are major components in the software stack surrounding eStoRED. Both components are configured to take curated data and metadata from the MyTardis curation platform [4], push them through predefined analytics processes and cycle the results back to MyTardis with much of the required metadata predefined. This data-centric and cyclic research data process supports the intrinsic model-experiment-evaluate cycle underpinning the scientific method, and places eStoRED as a key research-user-facing front end for adding descriptions, metadata, calculations and scripting, all of which are themselves curated. This not only assists in automating the scientific process using existing open-source tools such as KNIME, Chiminey and others, but also supports the repeatability and reproducibility of a continuous scientific process.

These are just a few of the strengths of eStoRED. Its versatility allows it to adapt easily to various research domains, its scalability enables it to work on large amounts of data, and overall it provides an environment that supports data exploration and evaluation, enables significant enrichment and prepares researchers to tell their stories, whether in publications or in other data sharing tools.

 

REFERENCES

1. Duckham M. et al. (2016), Modeling Checkpoint-Based Movement with the Earth Mover’s Distance. In Miller J., O’Sullivan D., Wiegand N. (eds) Geographic Information Science. GIScience 2016. Lecture Notes in Computer Science, vol 9927. Springer, Cham.
2. Pond P. (2016), Software and the struggle to signify: theories, tools and techniques for reading Twitter-enabled communication during the 2011 UK Riots. PhD thesis, RMIT University, January 2016.
3. McEvoy, D, Mullett, J, Trundle, A, Hunting, A, Kong, D and Setunge, S (2016), A decision support toolkit for climate resilient seaports in the Pacific region. In Ng, Becker, Cahoon, Chen, Earl and Yang (eds) Climate Change and Adaptation Planning for Ports, Routledge, London, pp. 215-231.
4. S. Androulakis, J. Schmidberger, M. A. Bate, et al. (2008), Federated repositories of X-ray diffraction images. Acta Crystallographica Section D, 64(7):810–814, Jul 2008.
5. Prévost G., Blech J., Foster K. and Schmidt H. (2017), An Architecture for Visualization of Industrial Automation Data. In Proceedings of the 12th International Conference on Evaluation of Novel Approaches to Software Engineering - Volume 1: ENASE, ISBN 978-989-758-250-9, pages 38-46. DOI: 10.5220/0006289700380046.
6. Yusuf I I, Thomas I E, Spichkova M, et al. (2017), Chiminey: Connecting Scientists to HPC, Cloud and Big Data. Big Data Research, Volume 8, July 2017, pages 39-49.

 


Biographies

Guillaume is a Research Software Engineer at RMIT University. He obtained a Master’s degree in computer science at the European Institute of Information Technology (Epitech) in France in 2012. He worked in industry with AtoS Worldgrid for a year before moving into eResearch with RMIT in 2013.

Heinz (Heinrich) is a Professor of Software Engineering at RMIT University, where he holds the post of eResearch Director and Director of the Australia-India Research Centre for Automation Software Engineering. He is also Adjunct Professor at Mälardalen University and has been Adjunct Professor at Monash University for several years. Prior to RMIT he worked at Monash University (Professor of Software Engineering and various posts as Head of Department, Centre or Associate Dean), CSIRO and ANU, ICSI UC Berkeley and GMD (now Fraunhofer) in Germany. http://orcid.org/0000-0001-6278-4793
