Towards ‘end-to-end’ research data management support

Mrs Cassandra Sims1

1Elsevier, Chatswood, Australia, c.sims@elsevier.com

 

Information systems supporting science have come a long way and include solutions that address many of the research data management needs faced by researchers and their institutions. Yet, due to a fragmented landscape, and even with the best solutions available, researchers and institutions sometimes miss crucial insights and spend too much time searching, combining and analysing research data [1].

With this in mind, we are working to holistically address all aspects of the research lifecycle, as shown in Figure 1. The lifecycle starts with the design phase, when researchers decide on a new project, prepare their experiments and collect initial data. It then moves into the execution phase, when experiments are run and research data are collected, shared within the research group, processed, analysed and enriched. Finally, the research results are published and the main outcomes are shared within scientific community networks.

Figure 1: Research lifecycle

Throughout this process researchers use a variety of tools, both within the lab and to share their results. Research processes like this happen every day; however, no current solution provides end-to-end support of this process for researchers and institutions.

Many institutes have established internal repositories, which have their own limitations. At the same time, various open data repositories [2] have grown with their own set of data and storage/retrieval options, and many scholarly publishers now offer services to deposit and reference research datasets in conjunction with the article publication.

One challenge often faced by research institutes is developing and implementing solutions to ensure that researchers can find each other’s research across the various data silos in the ecosystem (e.g. by assigning appropriate ontologies, metadata and researcher associations). Another is to increase research impact and collaboration both inside and outside the institution, improving the quantity and quality of research output.

Making data available online can enhance the discovery and impact of research. The ability to reference details about research data, such as ownership and content, can contribute to improved citation statistics for the associated published research [3]. In addition, funders increasingly require that data from supported projects be placed in an online repository, so research institutes need to ensure that their researchers comply with these requirements.

This talk presents a suite of tools and services developed to assist researchers and institutions with their research data management needs [4]. It covers the entire spectrum, from data capture through to making data comprehensible and trusted, enabling researchers to gain proper recognition and institutions to improve their overall ranking by going “beyond the mandates”.

I will explain how it integrates through open application programming interfaces with the global ecosystem for research data management (shown in Figure 2), including:

  • DANS [7] for long-term data preservation,
  • DataCite [5] for DOIs and indexed metadata to help with data publication and inventory (a lookup sketch follows this list),
  • Scholix [6] for support of links between published articles and datasets,
  • More than 30 open data repositories for data discovery.
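
To illustrate what such an open-API integration can look like in practice, the sketch below queries the public DataCite REST API for a dataset DOI and reads back the indexed metadata. It is a minimal, illustrative example: the DOI shown is a placeholder rather than a real record, and the platform’s actual integrations are richer than a single lookup.

import requests

DOI = "10.17632/xxxxxxxx.1"  # hypothetical dataset DOI used purely for illustration

# DataCite's public REST API returns the indexed metadata for a registered DOI.
response = requests.get(f"https://api.datacite.org/dois/{DOI}", timeout=30)
response.raise_for_status()
attributes = response.json()["data"]["attributes"]

# Metadata that downstream tools can reuse for data citation and inventory.
print(attributes["titles"][0]["title"])
print(attributes["publisher"], attributes["publicationYear"])
print(attributes["url"])  # landing page of the dataset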

Figure 2: Integration with the global research data management ecosystem

The talk will conclude with an overview of current data sharing practices and a short demonstration of how we incorporate feedback from our development partners: the University of Manchester, Rensselaer Polytechnic Institute, Monash University and Nanyang Technological University.

REFERENCES

  1. de Waard, A., Cousijn, H., and Aalbersberg IJ. J., 10 aspects of highly effective research data. Elsevier Connect. Available from: https://www.elsevier.com/connect/10-aspects-of-highly-effective-research-data, accessed 15 June 2018.
  2. Registry of research data repositories. Available from: https://www.re3data.org/, accessed 15 June 2018.
  3. Vines, T.H. et al., The Availability of Research Data Declines Rapidly with Article Age. Current Biology, 2014, 24(1): p. 94-97.
  4. Elsevier research data management tools and services. Available from: https://www.elsevier.com/solutions/mendeley-data-platform, accessed 15 June 2018.
  5. DataCite. Available from: https://www.datacite.org/, accessed 15 June 2018.
  6. Scholix: a framework for scholarly link exchange. Available from: http://www.scholix.org/, accessed 15 June 2018.
  7. Data Archiving and Networked Service (DANS). Available from: https://dans.knaw.nl/en, accessed 15 June 2018.

Biography:

Senior Research Solutions Manager ANZ

Cassandra has worked for Elsevier for over six years, as Product Solutions Manager APAC and currently as Senior Research Solutions Manager ANZ. She has demonstrated experience and engagement across the Academic, Government and Health Science segments in the region, working with universities, government organisations, local area health districts, funders and industry to assist in the development of business strategies, data asset management and core enterprise objectives. Specialising in detailed analytics, collaboration mapping and bibliometric data, Cassandra builds on her wealth of knowledge in these areas to assist our customer base with innovative and superior solutions to meet their ever-changing needs. Cassandra has worked with the NHMRC, ARC, MBIE, RSNZ, AAMRI and every university in the ANZ region. She is responsible for all new business initiatives in ANZ and supports strategic initiatives across APAC.

Making the University of Adelaide Magnetotellurics data collection FAIR and onto the path towards reproducible research

Nigel Rees1, Ben Evans2, Graham Heinson3, Jingbo Wang4, Lesley Wyborn5, Kelsey Druken6, Dennis Conway7

1The Australian National University (NCI), Canberra, Australia, nigel.rees@anu.edu.au

2The Australian National University (NCI), Canberra, Australia, ben.evans@anu.edu.au

3The University of Adelaide, Adelaide, Australia, graham.heinson@adelaide.edu.au

4The Australian National University (NCI), Canberra, Australia, jingbo.wang@anu.edu.au

5The Australian National University (NCI), Canberra, Australia, lesley.wyborn@anu.edu.au

6The Australian National University (NCI), Canberra, Australia, kelsey.druken@anu.edu.au

7The University of Adelaide, Adelaide, Australia, dennis.conway@adelaide.edu.au

 

Magnetotelluric (MT) data in the research community are traditionally stored on departmental infrastructure and, when published, take the form of processed, esoteric downloadable files with limited metadata. Obtaining the raw MT time-series source data is a lengthy process: one typically has to email the data owner, and transfer is either via FTP download for local processing or, where file sizes are too large, on hard disk via Australia Post.

It has become increasingly apparent to the MT community that, in order to increase online collaboration, reduce time for analysis, and enable reproducibility and integrity of scientific discoveries both inside and beyond the MT community, datasets need to evolve to adopt the Findable, Accessible, Interoperable and Reusable (FAIR) data principles. The National Computational Infrastructure (NCI) has been working with The University of Adelaide to address these challenges as part of the 2017-2018 AuScope-ANDS-NeCTAR-RDS funded Geoscience Data Enhanced Virtual Laboratory (DeVL) project. The project aims to make the entire University of Adelaide MT data collection (from 1993-2018) FAIR. NCI has also made an assortment of MT processing and modelling software available on both its Virtual Desktop Infrastructure and the Raijin supercomputer, which has helped to reduce data processing and subsequent modelling times.

The University of Adelaide MT data collection needs to be both discoverable and accessible online, and conform to agreed international community standards to ensure interoperability with other international MT collections (e.g., AusLAMP [1], EarthScope USArray [2], SinoProbe [3]), as well as reusability for purposes other than what the data was collected for. For the process to become more transparent, the MT community will need to address fundamental issues including publishing FAIR datasets, publishing model outputs and processing regimes, re-evaluating vocabularies, semantics and data structures, and updating software to take advantage of these improvements. For example, it is no longer sufficient to only expose the processed data; the raw instrument data needs to be preserved persistently so that as algorithms improve, the original source data can be reprocessed and enhanced. Consistent with the FAIR and reproducibility principles, the MT processing and modelling tools should also be easily discoverable and accessible and where required usable in online virtual environments, with software versions citable. The journey from the raw-data to the final published models should be transparent and well documented in provenance files, so that published scientific discoveries can be easily reproduced by an independent party.

One component of this project has been to explore the value of converting raw MT time-series into open, scientific, self-describing data formats (e.g. the Network Common Data Form, netCDF), with a view to showing the potential for accessibility through data services. Such formats open up the ability to analyse the data using a much wider range of scientific software from other domains. As an example, Jupyter Notebooks have been created to show how the MT data can be accessed and processed via OPeNDAP data services. These changes alone improve the usability of the data, which can now be analysed without having to be explicitly downloaded first.
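
As a minimal sketch of the kind of access the notebooks demonstrate, the Python snippet below opens a netCDF file over an OPeNDAP endpoint with xarray and subsets it remotely; the service URL and variable names are hypothetical placeholders, not the actual NCI catalogue entries.

import xarray as xr

# Hypothetical OPeNDAP endpoint for a single MT site; only requested slices are transferred.
OPENDAP_URL = "https://dapds00.nci.org.au/thredds/dodsC/example/mt_site_001.nc"

ds = xr.open_dataset(OPENDAP_URL)   # opens the remote dataset lazily
print(ds)                           # inspect variables, dimensions and metadata

# Subset a time window server-side instead of downloading the whole file.
window = ds.sel(time=slice("2017-01-01", "2017-01-02"))
bx = window["bx"].values            # "bx" is an assumed name for a magnetic-field channel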

The Geoscience DeVL project has focused on making the University of Adelaide MT data available online as well as assembling software and workflows available in a supercomputer environment that significantly improve the processing of data. This project has also made a valuable addition to the AuScope Virtual Research Environment, which is progressively making more major Earth science data collections, software tools and processing environments accessible to the Australian Research community. The results of our work are also being presented at international MT forums such as the 24th EM Induction Workshop [4] held in Helsingør, Denmark, to ensure that the data capture, publishing, curation and processing being undertaken at NCI is in line with best practice internationally.

ACKNOWLEDGEMENTS

This work was supported by the National Computational Infrastructure, AuScope Limited, ANDS-NeCTAR-RDS and The University of Adelaide.

REFERENCES

  1. The Australian Lithospheric Architecture Magnetotelluric Project (AusLAMP). Available from: http://www.ga.gov.au/about/projects/resources/auslamp , accessed 21 June 2018.
  2. The EarthScope USArray magnetotelluric program. Available from: http://www.usarray.org/researchers/obs/magnetotelluric , accessed 22 June 2018.
  3. SinoProbe – Deep Exploration in China. Available from: http://sinoprobe.cags.ac.cn/About-Sinoprobe/ , accessed 22 June 2018.
  4. The 24th EM Induction Workshop (EMIW2018). Available from: https://emiw2018.emiw.org/ , accessed 21 June 2018.

Biography:

Nigel Rees is a Research Data Management Specialist at the National Computational Infrastructure (NCI) with a background in magnetotelluric geophysics. In his role at NCI, he supports research data needs and assists with the management, publishing and discovery of data.

An Open Question: A comparison of proprietary and open-access teaching materials for researchers

Mr Aidan Wilson1, Dr Anastasios Papaioannou1

1Intersect Australia, Sydney, Australia

 

Intersect Australia has been a significant eResearch training provider for several years. Since the first courses in eResearch tools like HPC and Microsoft Excel, the Intersect repertoire has expanded to over 25 distinct courses, delivered at our 12 member universities, hundreds of times per year to thousands of researchers.

Intersect began utilising open access training materials in 2015: teaching Software Carpentry’s Creative Commons licensed courseware in Python, Matlab, R, Unix, and Git. Shortly thereafter, two Intersect eResearch Analysts were accredited as Software Carpentry instructors. The following year this was expanded with four more accredited instructors, and in 2017, a further six instructors were accredited and Intersect joined the Software Carpentry Foundation as a silver member, a status we recently reaffirmed.

Throughout this period, Intersect has continued to maintain a proprietary catalogue of Intersect-developed courses taught alongside the Software Carpentry materials.

In this presentation, we will explore the differences, if any, in how course attendees receive Intersect-developed course material and openly available Software Carpentry material. We will also explore the differences in cost between maintaining proprietary courseware and utilising openly available materials, and analyse differences between the delivery of the two sets of courses based on other variables, such as the experience level and teaching style of the trainer.

This presentation will be valuable to similar organisations that are grappling with the logistics of running eResearch training courses and deciding on strategies for developing their own material or using material that is already openly available.

As one of Australia’s most recognised eResearch training organisations, Intersect hopes that other, similar organisations may be able to benefit from our experiences, so that the research community can ultimately benefit from high-quality training from a diverse range of providers.


Biography:

Aidan Wilson is Intersect’s eResearch Analyst for the Australian Catholic University, and coordinator of Intersect’s training platform. Aidan’s research background is in documentary linguistics, concentrating on the syntax and morphology of Australia’s Aboriginal languages. He has also been actively involved in research support, and worked as a data manager for PARADISEC, an archive of Pacific and regional digital ethnographic data, including linguistic and ethnomusicological recordings. In his time at Intersect, Aidan has been involved in a number of engineering and data science projects, including secure data movement for health, medical and imaging datasets, and genome sequencing as a service.

http://orcid.org/0000-0001-9858-5470

Anastasios Papaioannou is Intersect’s Research Data Scientist, and one of the coordinators of Intersect’s training platform. Anastasios holds a BSc in Physics and an MSc in Computational Physics, with his research focus mainly on computational physics applied in medicine and biology. He also holds a PhD in Computational Biophysics/Medical Physics from the University of Sydney. He has over four years of experience as an academic tutor and over six years in research. His role at Intersect involves working collaboratively with relevant stakeholders to develop and implement activities that ensure Intersect’s success in data science for research. He is involved in various national and state-level health and medical (and other) eResearch data projects, and combines a deep technical understanding of data with expertise in programming, data and business analysis, and analytics.

https://orcid.org/0000-0002-8959-4559

Visualisation of research activity in Curtin’s virtual library

Peter Green1, Pauline Joseph2, Amanda Bellenger3, Aaron Kent4, Mathew Robinson5

1Curtin University, Perth, Australia, P.Green@curtin.edu.au

2Curtin University, Perth, Australia, P.Joseph@curtin.edu.au

3Curtin University, Perth, Australia, A.Bellenger@curtin.edu.au

4Curtin University, Perth, Australia, Aaron.J.Kent1@gmail.com

5Curtin University, Perth, Australia, Matt.Robinson@curtin.edu.au

 

Curtin University Library manages authenticated access to its online journal, book and database collections using the URL-rewriting proxy service EZproxy [1]. EZproxy mediates the request between the user and the publisher platform via the Library. The proxy service is widely deployed in libraries worldwide and has been a standard authentication solution for the industry for many years. The EZproxy software creates a log entry for each request in the Combined HTTP Log format [2]. The log files are extensive, with approximately 30 million lines written per month, and capture information for each request such as the IP address, client ID, date and time, HTTP request and response, and so forth. The Curtin Library has retained at least five years of the log files.
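
As a minimal sketch of how such a log line can be broken into fields, the snippet below parses one Combined Log Format entry with Python’s standard library; the example line is fabricated, and the exact field layout of Curtin’s EZproxy configuration may differ.

import re

COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('203.0.113.7 - jsmith [12/Jun/2018:10:15:32 +0800] '
        '"GET /login?url=https://examplepublisher.com/article HTTP/1.1" '
        '302 512 "-" "Mozilla/5.0"')  # fabricated example entry

match = COMBINED.match(line)
if match:
    fields = match.groupdict()
    print(fields["ip"], fields["time"], fields["request"], fields["status"])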

This large dataset presents an opportunity to learn more about the information seeking behaviour of Curtin Library clients, but it also presents a challenge. Traditional analysis of such data tends to produce aggregated usage statistics that do not reveal activity at a granular level. Immersive visualisation could provide a means to see the data in a new way and reveal insights into the information seeking behaviour of Curtin Library clients. In collaboration with Dr Pauline Joseph, Senior Lecturer (School of Media, Creative Arts and Social Inquiry), the Curtin Library proposed this work for funding under the Curtin HIVE [3] Research Internships program. The proposal was successful and a computer science student, Aaron Kent, was employed for a ten-week period to produce visualisations from the EZproxy log file dataset.

The data was anonymised to protect client confidentiality whilst retaining granularity, and the number of lines in the log files was reduced by removing ‘noise’. The Unity3D [4] software was chosen for its ability to provide visualisations that could be displayed on the large screens of the HIVE as well as on desktop screens. Many possibilities were discussed for visualisations that might give insight into client behaviour, but two were chosen for the internship.
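
The sketch below shows one common way to pseudonymise identifying fields while keeping granularity: a keyed hash maps each client identifier to a stable token, so sessions remain linkable without exposing identities. This is an assumed approach for illustration, not necessarily the method used in the project.

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-secret-key"  # hypothetical key, kept separate from the dataset

def pseudonymise(value: str) -> str:
    """Return a stable, non-reversible token for an IP address or client ID."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

print(pseudonymise("203.0.113.7"))  # the same input always yields the same token
print(pseudonymise("jsmith"))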

The first visualisation focusses on the behaviour of individual users in time and space and represents each information request using an inverted waterfall display on a global map, as illustrated in Figure 1. Different sizes and shapes represent different client groups, and the size of the information request is reflected in the size of the object. Geolocation information is used to anchor each request on the map.
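
A hedged sketch of such a geolocation step is shown below, using the geoip2 library against a local MaxMind GeoLite2 database; the database file and sample address are placeholders, and the project’s actual geolocation source is not specified here.

import geoip2.database
import geoip2.errors

def locate(ip, reader):
    """Return (latitude, longitude, country) for an IP address, or None if it cannot be resolved."""
    try:
        record = reader.city(ip)
        return record.location.latitude, record.location.longitude, record.country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return None

with geoip2.database.Reader("GeoLite2-City.mmdb") as reader:  # hypothetical local database file
    print(locate("203.0.113.7", reader))  # documentation-range address; real client IPs would resolve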

Figure 1: Global user visualisation

 

The second visualisation focusses on the usage of particular resources over time and represents each information request as a building block in a 3D city as illustrated by Figure 2. The different client groups and the volume of requests are illustrated over time by location and size against each particular scholarly resource.

Figure 2: Scholarly resource visualisation

The successful visualisation prototypes have shown that the EZproxy log file data is a rich source for immersive visualisation, and further development will yield tools that the Curtin Library can use to better understand client information seeking behaviour.

REFERENCES

  1. EZproxy Documentation. Available from: https://www.oclc.org/support/services/ezproxy/documentation/learn.en.html accessed 8 June 2018.
  2. Combined Log Format. Available from: http://fileformats.archiveteam.org/wiki/Combined_Log_Format accessed 8 June 2018.
  3. Curtin HIVE (Hub for Immersive Visualisation and eResearch). Available from: https://humanities.curtin.edu.au/research/centres-institutes-groups/hive/ accessed 8 June 2018.
  4. Unity3D. Available from: https://unity3d.com/ accessed 8 June 2018.

Biography:

Peter Green is the Associate Director, Research, Collections, Systems and Infrastructure in the Curtin University Library, Australia. He is responsible for providing strategic direction, leadership and management of library services in support of research, the acquisition, management, discovery and access of scholarly information resources, and information technology, infrastructure and facilities.

Citizen Data Science can cure our data woes

Dr Linda Mciver1

1Australian Data Science Education Institute, Glen Waverley, Australia

 

We are drowning in data. Every two days we generate more data than was produced in the entirety of human history up to 2003. And those are old numbers. We have scientific instruments that generate more data than we can store, much less process. And we have data on everything you can possibly dream of – and in the case of Facebook, many things you’d probably prefer not to. In short, we have more data than we can possibly begin to understand.

Coincidentally, we have an entire cohort of school students whom we are failing to engage with technology. Students whose experience of “tech” subjects is formatting Word documents or making web pages. Students who don’t see the relevance of technology to their own lives and careers. We are increasingly ruling their world with data, and it’s time to engage them with it. Teaching kids Data Science has the potential to enable students to make scientific discoveries, understand the discipline that’s changing the face of the world, and engage with technology on a whole new level. I’m going to show you how.


Biography:

Dr Linda McIver started out as an Academic with a PhD in Computer Science Education. When it became apparent that High School teaching was a lot more fun, Linda began a highly successful career at John Monash Science School, where she built innovative courses in Computational and Data Science for year 10 and year 11 students.

Nominated as one of the inaugural Superstars of STEM in 2017, Linda is passionate about creating authentic project experiences to motivate all students to become technologically and data literate.

While Linda loves the classroom, it was rapidly becoming clear that teachers in the Australian school system were keen to embrace Data Science, but that there was a serious lack of resources to support that. That’s why Linda created ADSEI – to support Data Science in education.

Streamlining Collaboration for the Murchison Widefield Array Telescope Project with the Australian Access Federation and eduGAIN

Greg Sleap1, Alison Lynton3, John Scullen4, Scott Koranda5, Randall Wayth1, Adam Beardsley2, Benjamin Oshrin5, Heather Flanagan5

1Curtin Institute of Radio Astronomy, greg.sleap@curtin.edu.au, r.wayth@curtin.edu.au

2Arizona State University, Adam.Beardsley@asu.edu  

3Curtin University, A.Lynton@curtin.edu.au  

4Australian Access Federation, john.scullen@aaf.edu.au   

5Spherical Cow Group, New York, NY, skoranda@sphericalcowgroup.com, benno@sphericalcowgroup.com, hlflanagan@sphericalcowgroup.com

 

The Murchison Widefield Array (MWA) is a low-frequency radio telescope operating between 80 and 300 MHz. It is located at the Murchison Radio-astronomy Observatory (MRO) in Western Australia, the planned site of the future Square Kilometre Array (SKA) lowband telescope, and is one of three telescopes designated as a Precursor for the SKA. Initially developed by an international collaboration including partners from Australia, Canada, India, New Zealand, and the United States, and led by the Curtin University node of the International Centre for Radio Astronomy Research (ICRAR), today the MWA includes partners from more than 20 institutions around the world including China and Japan.

To streamline collaboration and facilitate access to data and resources, MWA staff deployed an identity management infrastructure built on the foundation of federated identity. By leveraging the existing investment Curtin University had made in federated identity infrastructure and the Australian Access Federation (AAF), MWA published its federated services into the worldwide eduGAIN interfederation and enabled single sign-on (SSO) access using home organization credentials for collaborators throughout the world.

This presentation discusses the issues encountered, processes engaged, and challenges faced when implementing the MWA eduGAIN solution from four unique perspectives:

  1. The MWA data manager and project scientist charged with enabling secure and scalable access to a growing collaboration with partners throughout the world.
  2. The Senior Systems Engineer at Curtin University tasked with facilitating access to AAF services including eduGAIN for MWA resources.
  3. AAF staff enabling subscribers to connect internationally with eduGAIN.
  4. Consultants providing technical input on scalable and sustainable federated identity architecture in support of international collaboration.

Biographies:

Scott Koranda specializes in identity management architectures that streamline and enhance collaboration for research organizations. https://orcid.org/0000-0003-4478-9026

Greg Sleap has been the Murchison Widefield Array (MWA) Data Manager since mid-2016, planning, developing and supporting the systems which allow astronomers around the world to utilise the MWA’s extensive data archive. https://orcid.org/0000-0003-0134-3884

Alison Lynton has worked for Curtin University as a Senior Systems Engineer since 2001, specialising in Unix. She has a passion for advocating for researchers’ needs within her organisation. https://orcid.org/0000-0002-6236-1915

John Scullen joined AAF in February 2016 to lead the development of new processes and tools in the Next Generation AAF project. His role has since expanded to oversee the AAF’s project portfolio.

Transforming Research Code to a Modern Web Architecture – Pipetools

Paulus Lahur, Kieran Lomas

CSIRO, Clayton, Australia, paulus.lahur@csiro.au

CSIRO, Clayton, Australia, kieran.lomas@csiro.au

 

SUMMARY

In this paper, we outline the process of transforming research code into a web application, using the Pipetools project as a case study. The goal is to reach a wide audience of users who can benefit from the code. We are constructing infrastructure and code that support and encapsulate the research code to significantly improve its usability, as well as expand its modes of usage. The project is currently moving along at a reasonable pace. We would like to share the challenges we face and our thought process in solving them. The lessons learned here will hopefully benefit researchers and software developers working on similar projects.

WHY WEB APPLICATION?

Research code is a highly valuable asset hidden deep inside research institutions. It typically runs on a very specific device and environment, and is accessible only to a few researchers. Throughout the course of the research, algorithms and code are accumulated and improved. As the research matures, the potential benefit to other people increases. In many cases, there will be people who are willing to pay to use the software. The problem is that the software is practically usable only by those who made it, or at least by those who have an intimate understanding of how it works. To make the software usable to a wider audience, another stage of software development is required: more code needs to be built around the research code to improve the user experience.

There are two major approaches here. The first is to build a “native application”, that is, software that is native to a certain operating system; the research software itself belongs to this type. The other approach is to turn it into a web application, that is, software that runs on a remote machine. Although there are many pros and cons for either approach, we opt for the latter because it is accessible to people on various operating systems, and is therefore easier to support and maintain. Software licensing and protection become simpler, and rolling out a new version is trivial. Furthermore, a web application also opens the door to collaborative work, where a number of people, possibly in different parts of the world, work on the same set of data.

THE SYSTEM

In order to develop an effective solution, we need to create a modular system, where developers can focus on specific modules. This is outlined in Figure 1. In essence, the development is split into these parts:

  • Front End. It deals with the user interface, translating user commands and sending them to the Back End.
  • Back End. It receives commands from the Front End and calls the research code to do the actual computation.
  • Infrastructure. It deals with the services that enable the Front and Back Ends to work, including containers as well as continuous integration and deployment.

Each part has its own challenges. Details of the system will be presented in the paper; a minimal sketch of how the Back End might wrap the research code follows.
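
The snippet below is a minimal sketch of such a Back End, written in Python with Flask: it accepts a command from the Front End and invokes the research code as a separate process. The endpoint name, executable and flags are hypothetical placeholders rather than the actual Pipetools interface.

import subprocess
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/run", methods=["POST"])
def run_pipeline():
    params = request.get_json(force=True)          # e.g. {"input_file": "slurry.csv"}
    # The research executable and its flags are placeholders for illustration.
    result = subprocess.run(
        ["pipetools_solver", "--input", params["input_file"]],
        capture_output=True, text=True, timeout=600,
    )
    if result.returncode != 0:
        return jsonify({"status": "error", "stderr": result.stderr}), 500
    return jsonify({"status": "ok", "stdout": result.stdout})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)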

Figure 1: Simplified layout

ACKNOWLEDGEMENT

Research code: Lionel Pullum (Mineral Resources, Clayton VIC)

Project manager: Andrew Chryss (Mineral Resources, Clayton VIC)

Team lead on IMT side: Daniel Collins (IMT, Kensington WA)

Front End: Kieran Lomas (IMT, Clayton VIC)

Back End: Paulus Lahur, Sam Moskwa (IMT, Clayton VIC)

Infrastructure: Dylan Graham, Andrew Spiers, Sam Moskwa (IMT, Clayton VIC)


Biography:

Paulus Lahur has been a member of CSIRO staff since 2015. He works in Scientific Computing within IMT.

From the soil sample jar to society: an example of collating and sharing scientific data

Hannah Mikkonen1, Ian Thomas2, Paul Bentley3, Andrew Barker4, Suzie Reichman5

1 RMIT University, Melbourne, Australia, hannah.mikkonen@student.rmit.edu.au

2 RMIT University, Melbourne, Australia, ian.edward.thomas@rmit.edu.au

3 CDM Smith, Melbourne, Australia, bentleypd@cdmsmith.com

4 CDM Smith, Melbourne, Australia, barkerao@cdmsmith.com

5 RMIT University, Melbourne, Australia, suzie.reichman@rmit.edu.au

 

Introduction

Background concentrations of metals and elements in soil are the natural, geogenic concentrations. Data on background metal and element concentrations in soil are important for assessments of agricultural health and productivity, ecological risk, mineral exploration, and pollution. However, soil surveys and the associated collection and chemical analysis of soil samples take considerable time and are expensive. Soil survey datasets are therefore a valuable resource for other scientists, land assessors and policy makers.

A website, “The Victorian Background Soil Database” (http://doi.org/10.4225/61/5a3ae6d48570c), and an interactive map titled “Soil Explorer” were developed to present and share the results of a background soil survey of Victorian soils. The database and map were developed by RMIT researchers in collaboration with data scientists at CDM Smith, the Environment Protection Authority Victoria and the Australian Contaminated Land Consultants Association, with help from the RMIT eResearch team. Soil Explorer is a Shiny [2] web-based application for visualising data (based on the R language). The app provides an interactive platform that integrates individual soil data points, soil statistics, and spatial groupings of geology and region for the background soil data.

The data collation process involved collection of soil samples from across Victoria, collation of soil sample data from publicly available environmental assessment reports, screening of the quality of the collated data, and calculation of summary statistics. The data communication process involved development of an interactive map using Shiny, licensing of the dataset, minting of a DOI, placement of the Shiny application onto a secure and reliable server, launch of the website, and recording of website use with Google Analytics.

This presentation will describe how soil scientists, eResearch support staff and the environmental industry worked together to tackle the cross-disciplinary barriers and challenges involved in collecting, analysing, visualising and communicating data using a web-based Shiny dashboard written in the R language.

Understanding what the end user wants

The need for a background soil database was identified by members of the Australian Contaminated Land Consultants Association (ACLCA), who identified misclassification of soil (due to a lack of understanding of background concentrations) as a potential cause of unsustainable disposal of natural soils to landfill. ACLCA approached the RMIT researchers to develop a HazWaste Fund proposal that was ultimately successful. Throughout the project, ACLCA and EPA Victoria (as the HazWaste Fund administrator) played key roles in scoping the project and ensuring the methods and deliverables were relevant to industry and in usable forms. One advantage of this project was that the research was undertaken by a student who also worked in the environmental assessment industry and was supervised by a researcher who had previously worked in the environmental regulatory sector.

Methods

The project development was handled using an agile development and deployment approach, with two-week “sprints” of allocated work tracked on an online task board. Changes to site source code during development were communicated between different collaborators using a source control repository.

The website, maps and summary-statistic sheets were scripted and automated using the R language [1]. R was adopted for several reasons. First, all of the statistical analysis could be automated, including the output of 126 separate summary-statistic sheets. Second, several R packages facilitate the generation of HTML dashboards and formatted reports (e.g. leaflet, crosstalk, sf [3], rmarkdown [4], knitr [5]). Finally, R is open source, which allows the code to be edited by people from different industries and institutions.

Following emerging best practices for dataset publication, steps were taken to ensure that the data was accessible and had a potentially larger reach. A Digital Object Identifier (DOI) was minted so that the dataset can easily be referenced in publications and discovered through records created in Research Data Australia. Google Analytics has been used to assess site traffic and allow us to better understand users’ interests. Beyond these automated metrics, an online form was embedded on the website to allow visitors to request further information and initiate conversations with the authors. These steps were based on the requirement that the site would serve as a starting point for further discussion and collaboration.

App deployment was managed using Bitbucket for version control, and the app is hosted on an RMIT server running Red Hat with SSL browser security. This project was a ‘proof of concept’ for research translation and for communication of environmental science using digital platforms that are bespoke, or customised, to the research project. The R language (and the packages within it) provides a complete coding environment, from data processing and analysis through to data visualisation and reporting (both PDF and web based), giving researchers a single environment in which to undertake and communicate their research. One outcome of this project is to share the roadmap with other researchers at RMIT, with the purpose of introducing them to new tools and techniques for enhancing and communicating their research practice.

Presentation of point data rather than models

Soil data is increasingly being presented as modelled spatial layers, often with little communication of the accuracy of, and confidence in, the predicted information. Background concentrations of metals in soil can vary 100-fold within a single soil sample. Therefore, at this stage it was considered most relevant to simply present the results and then provide summary statistics that clearly describe the data variability.

Next Steps

There are key directions for further research: first, expand the dataset and user interface to meet the needs of not just the environmental assessment industry but also the agricultural, mining and research sectors; second, assess how to merge local data with national datasets; and third, develop predictive spatial models for background concentrations.

Acknowledgements

The authors would like to acknowledge the financial support of the Hazwaste Fund (project S43-0065) and the Australian Contaminated Land Consultants Association (ACLCA) Victoria.  We also acknowledge and thank the R project for freely available statistical computing (http://www.r-project.org).

References

  1. R Core Team, 2016. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  2. Chang, W., Cheng, J., Allaire, J., Xie, Y. and McPherson, J., 2017. Package ‘shiny’: Web Application Framework for R. R package version 1.0.5. CRAN.
  3. Pebesma, E., Bivand, R., Racine, E., Sumner, M., Cook, I., Keitt, T., Lovelace, R., Wickham, H., Ooms, J. and Müller, K., 2018. sf: Simple Features for R. R package version 0.6-3. CRAN.
  4. Allaire, J. and Horner, J., 2017. Package ‘markdown’. R package version 0.8. CRAN.
  5. Xie, Y., 2018. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.20. CRAN.

Biography:

Ian Thomas (https://orcid.org/0000-0003-1372-9469) is a software developer and system administrator at the Research Capability Unit at RMIT University. He has worked in data curation for output of high-performance computing systems, microscopy data for materials, and screen media objects (film and television). His current work is in high-performance computing, containerized research workflows and in cloud-based platforms in support of eResearch applications.

Roles for eResearch

Nicholas May1, Sheila Mukerjee2, Samara Neilson3

1RMIT University, Melbourne, Australia, nicholas.may@rmit.edu.au

2La Trobe University, Melbourne, Australia, sheila.mukerjee@latrobe.edu.au

3Swinburne University, Hawthorn, Australia, sneilson@swin.edu.au

 

The position descriptions of roles within the eResearch industry are neither consistent nor standardized. As an example, AeRO [1] has collected over a hundred different role titles from position descriptions within the industry. This makes the recruitment of staff and career progression within the industry much harder. However, efforts are underway, as discussed at a recent AeRO Forum [2], to determine the scope of these positions and to describe the skills associated with common roles. The first step in any movement towards standardizing position descriptions is to set some boundaries and identify some common eResearch roles.

In this ‘Birds of a Feather’ session, participants will collaborate to perform a simple role modelling process, in which they will classify and transform existing position titles into a more manageable collection. An appropriate framework, which will be presented at the start of the session, will provide a basis for participants to classify the roles. This may be based on the overlapping domains that eResearch spans (such as Research, Information Technology, and Innovation) or the skill categories of SFIA [3]. The role modelling process, shown in Table 1, has been adapted from an existing ‘user role modelling’ process, as described by Cohn [4]. The steps of the process that will be performed in the session include Discovery, Organization, and Consolidation.

Participants can submit their role titles, in advance of this session, via the following URL: https://eresear.ch/roles

The resulting set of titles will subsequently be assigned appropriate skills and levels of responsibility using an appropriate framework, such as SFIA, as has already been done for various ICT roles [5].

  • Introduction (15 mins): Present the modelling process and the classification framework.
  • Discovery (15 mins): A visual representation of the framework is outlined on a whiteboard or wall. The starting list of titles is shared amongst the participants. Everyone writes role titles on sticky notes, and the notes are posted on the framework. No discussion of the role names is allowed in this step.
  • Organization (15 mins): Move the notes around the board to represent their relationships. If roles overlap, overlap the notes; the degree of overlap of the notes represents the degree to which the roles overlap.
  • Consolidation (15 mins): If notes overlap entirely, remove one or replace both with a consolidated name. If notes overlap partially, remove one if the difference is not significant, or replace one with a title that corresponds to the difference. Remove any notes for roles that are not significant, then rearrange the notes to show the important relationships and hierarchies between roles.

Inputs: List of Role Titles, Classification Framework.

Outputs: Transformed and condensed set of Role Titles.

Table 1. Session Format.

REFERENCES

  1. Australian eResearch Organisations (AeRO), http://aero.edu.au/, accessed 6 June 2018.
  2. C3DIS, AeRO Forum – eResearch Workshop, http://www.c3dis.com/1903, accessed 6 June 2018.
  3. SFIA Foundation, The Skills Framework for the Information Age – SFIA, Available at: https://www.sfia-online.org/, accessed 6 June 2018.
  4. Cohn, M., User Stories Applied, Addison-Wesley, 2004, ISBN: 0-321-20568-5.
  5. ACS, Common ICT Job Profiles and Indicators of Skills Mobility, ICT Skills White Paper, 30 December, 2013,
    Available at: https://www.acs.org.au/content/dam/acs/acs-publications/ICT%20Skills%20White%20Paper%20-%20Common%20Job%20Profiles%20and%20Skills%20Mobility%2030%20Dec%202013.pdf, accessed 6 June 2018.

Biographies:

Nicholas May is a software developer in the Research Capability unit at RMIT University. He has over twenty-nine years of varied experience within the software engineering profession, across industries and domains, and holds the Certified Professional status with the Australian Computer Society. His current role includes the responsibility for promoting research data management across the research lifecycle at RMIT University. http://orcid.org/0000-0002-1298-1622

Samara is a computer scientist and technologist working in the Research Analytics Services team at Swinburne University. In addition to being a representative of FAVeR, she is also on the Melbourne Committee for Random Hacks of Kindness (RHoK), Australia’s longest running hackathon for social good, and a member of Girl Geek Academy, supporting women in STEMM.

Caroline Gauld is the Deputy Director, Research Information Management (RIM) at Defence Science and Technology (DST) Group. The Research Information Management team supports research data management, knowledge management and records management across DST Group and works in collaboration with other technology specialists to support DST Group researchers to manage and preserve their research outputs and data.

Derived through analysis: linking policy literature and datasets

Les Kneebone1, Steven McEachern2, Janet McDougal2

1Analysis & Policy Observatory, Melbourne, Australia

2Australian Data Archive, Canberra, Australia

 

The research community has been witnessing new and innovative approaches to making data objects discoverable, outside of traditional scholarly publishing contexts. Datasets referenced within scholarly publications can be made persistently identifiable using the same identifying approaches used for the publications themselves. Datasets can be stored, discovered and reused via specialist dataset repository platforms. Therefore, creating graph databases of interlinked research data and publications is now a reality.

Linking grey literature publications to datasets presents special challenges. Analysis & Policy Observatory (APO) has, since 2004, collected grey policy literature and organized its collection with ubiquitous and emerging metadata standards. APO now focuses on expanding the reach of its collection by establishing links with other research objects. Datasets from which policy reports are derived are of special interest. APO is therefore working with the Australian Data Archive (ADA) to connect ADA datasets with APO grey literature.

The challenge for grey literature and dataset linking

Persistent identifiers (PIDs) for digital information objects are well recognized as key data points needed as the basis for links between objects. PIDs such as Digital Object Identifiers (DOIs) have enjoyed significant uptake in traditional academic publishing contexts. Minting DOIs for grey literature, in contrast, remains an exceptional practice in the policy sector. APO is taking a lead in promoting the use of DOIs for grey literature; nonetheless, DOI coverage remains sporadic in policy collections. A similar situation exists for datasets: DOIs are often minted after original publication, and only once datasets are harvested and curated within specialist data repositories. ADA, like APO, has undertaken the significant challenge of assigning PIDs to datasets. The challenge for linking grey literature, then, is that structured publication data is not always available to work with.

The response from ADA and APO

Researchers cannot wait for all research objects to become identifiable entities. As research repository custodians, we will miss opportunities to combine our collections in ways that help researchers if we wait for complete, or near complete, PID coverage. Therefore, ADA and APO are piloting approaches to linking objects using a combination of unstructured, semi-structured and structured data:

  • Text mining and natural language processing, to help predict semantic and logical links
  • Leveraging metadata, such as controlled vocabularies, to improve link prediction
  • Locating and matching existing PIDs in each repository (a minimal sketch of this step follows the list)
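
As a minimal sketch of the PID-matching step (illustrative only, not APO or ADA production code), the snippet below scans extracted grey-literature text for DOI-shaped strings so that candidate dataset links can be checked against the other repository’s identifier records.

import re

# Pattern based on the common Crossref/DataCite DOI shape; edge cases are ignored here.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+", re.IGNORECASE)

def extract_dois(text):
    """Return candidate DOIs found in a policy report's extracted text."""
    return {doi.rstrip(".,;)") for doi in DOI_PATTERN.findall(text)}

sample = ("The analysis draws on the dataset at doi:10.4225/87/EXAMPLE and the survey "
          "described at https://doi.org/10.26193/EXAMPLE2.")  # fabricated example text
print(extract_dois(sample))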

From the pilot, APO and ADA hope to learn the following:

  • What is PID coverage in our repositories?
  • At what aggregation level should links be made, i.e. collection vs item level?
  • What commonalities, and opportunities exist in respective metadata approaches?
  • How can taxonomies be leveraged to improve predictions and matches?
  • What interfaces between the repositories are scalable, reusable and in scope?

This research was funded by the Australian Research Council Linkage Infrastructure, Equipment and Facilities grant Linked Semantic Platforms (LE180100094).


Biographies:

Les Kneebone has worked in information management roles in government, school, community and research sectors since 2002. He has mainly contributed to managing metadata, taxonomies and cataloging standards used in these sectors. Les is currently supporting the Analysis & Policy Observatory by developing and refining metadata standards and services that will help to link policy literature with datasets.

Steven McEachern is Director and Manager of the Australian Data Archive at the Australian National University, where he is responsible for the daily operations and technical and strategic development of the data archive. He has high-level expertise in survey methodology and data archiving, and has been actively involved in development and application of survey research methodology and technologies over 15 years in the Australian university sector. https://orcid.org/0000-0001-7848-4912

Janet McDougall is a Senior Data Archivist at the Australian Data Archive, with a background in systems IT, data management, GIS, and social research. Her role includes outreach and curation of research data for preservation, archiving and publication. She is also involved in the ongoing implementation of metadata and standards focussed mainly in the social sciences and humanities, but also has experience with long-term ecological data from curation and procedural perspectives.
