From the soil sample jar to society: an example of collating and sharing scientific data

Hannah Mikkonen1, Ian Thomas2, Paul Bentley3, Andrew Barker4, Suzie Reichman5

1 RMIT University, Melbourne, Australia, hannah.mikkonen@student.rmit.edu.au

2 RMIT University, Melbourne, Australia, ian.edward.thomas@rmit.edu.au

3 CDM Smith, Melbourne, Australia, bentleypd@cdmsmith.com

4 CDM Smith, Melbourne, Australia, barkerao@cdmsmith.com

5 RMIT University, Melbourne, Australia, suzie.reichman@rmit.edu.au

 

Introduction

Background concentrations of metals and other elements in soil are the natural, geogenic concentrations. Data on background metal and element concentrations in soil are important for assessments of agricultural health and productivity, ecological risk, mineral exploration, and pollution. However, soil surveys, with their associated sample collection and chemical analysis, are time-consuming and expensive. Soil survey datasets are therefore a valuable resource for other scientists, land assessors and policy makers.
A website, “The Victorian Background Soil Database” (http://doi.org/10.4225/61/5a3ae6d48570c), and an interactive map titled “Soil Explorer” were developed to present and share the results of a background soil survey for Victorian soils. The database and map were developed by RMIT researchers in collaboration with data scientists at CDM Smith, the Environment Protection Authority Victoria (EPA Victoria), the Australian Contaminated Land Consultants Association, and the RMIT eResearch team. Soil Explorer is a web-based application for visualising data, built with Shiny [2] (based on the R language). The app provides an interactive platform that integrates individual soil data points, soil statistics, and spatial groupings by geology and region for the background soil data.

The data collation process involved collecting soil samples from across Victoria, collating soil sample data from publicly available environmental assessment reports, screening the quality of the collated data, and calculating summary statistics. The data communication process involved developing an interactive map using Shiny, licensing the dataset, minting a DOI, deploying the Shiny application to a secure and reliable server, launching the website, and recording website usage with Google Analytics. This presentation will describe how soil scientists, eResearch support staff and the environmental industry worked together to tackle the cross-disciplinary barriers and challenges involved in collecting, analysing, visualising and communicating data using a web-based Shiny dashboard written in the R language.
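To illustrate the general shape of such an app, the following is a minimal sketch of a Shiny and leaflet application in the spirit of Soil Explorer; the example data frame and its column names (soil, conc_mg_kg) are hypothetical, not the actual Soil Explorer schema.

```r
library(shiny)
library(leaflet)

# Hypothetical example data; the real database holds many more points.
soil <- data.frame(
  lon = c(144.96, 145.10, 143.85),
  lat = c(-37.81, -37.90, -37.56),
  element = c("As", "Pb", "As"),
  conc_mg_kg = c(4.2, 18.0, 6.7)
)

ui <- fluidPage(
  selectInput("element", "Element", choices = unique(soil$element)),
  leafletOutput("map")
)

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    pts <- soil[soil$element == input$element, ]
    leaflet(pts) %>%
      addTiles() %>%  # OpenStreetMap base layer
      addCircleMarkers(
        lng = ~lon, lat = ~lat,
        label = ~paste0(element, ": ", conc_mg_kg, " mg/kg")
      )
  })
}

shinyApp(ui, server)
```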

Understanding what the end user wants

The need for a background soil database was identified by members of the Australian Contaminated Land Consultants Association (ACLCA), who saw misclassification of soil (due to a lack of understanding of background concentrations) as a potential cause of unsustainable disposal of natural soils to landfill. ACLCA approached the RMIT researchers to develop a HazWaste Fund proposal, which was ultimately successful. Throughout the project, ACLCA and EPA Victoria (as the HazWaste Fund administrator) played key roles in scoping the project and ensuring the methods and deliverables were relevant to industry and delivered in usable forms. One advantage of this project was that the research was undertaken by a student who also worked in the environmental assessment industry, supervised by a researcher who had previously worked in environmental regulation.

Methods

The project was developed using an agile development and deployment approach, with two-week “sprints” of allocated work tracked on an online task board. Changes to the site source code during development were communicated between collaborators using a source control repository.

The website, maps and summary-statistic sheets were scripted and automated using the R language [1]. R was adopted for several reasons. First, all the statistical analysis could be automated, including the output of 126 separate summary-statistic sheets. Second, several R packages facilitate the generation of HTML dashboards and formatted reports (e.g. leaflet, crosstalk, sf [3], rmarkdown [4], knitr [5]). Finally, R is open source, which allows the code to be edited by people from different industries and institutions.
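One common way to automate output such as the 126 summary-statistic sheets is to render a parameterised R Markdown template once per group. The sketch below assumes a template file named summary_sheet.Rmd and hypothetical region/element groupings, not the project's actual files.

```r
library(rmarkdown)

# Hypothetical groupings; the real survey produced 126 sheets.
groups <- expand.grid(
  region  = c("Gippsland", "Wimmera"),
  element = c("As", "Cd", "Pb"),
  stringsAsFactors = FALSE
)

for (i in seq_len(nrow(groups))) {
  render(
    "summary_sheet.Rmd",  # parameterised template (assumed name)
    params = list(region  = groups$region[i],
                  element = groups$element[i]),
    output_file = sprintf("sheet_%s_%s.pdf",
                          groups$region[i], groups$element[i])
  )
}
```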

Following emerging best practices for dataset publication, steps were taken to ensure that the data was both accessible and had a potentially larger reach. A Digital Object Identifier (DOI) was minted so that the dataset can easily be referenced in publications and discovered through records created in Research Data Australia. Google Analytics has been used to assess site traffic and help us better understand users’ interests. Beyond these automated metrics, an online form was embedded in the website so that visitors can request further information and initiate conversations with the authors. These steps were based on the requirement that the site serve as a starting point for further discussion and collaboration.
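As a sketch of how such analytics can be wired into a Shiny app, the tracking snippet can be included in the page head. The file name google-analytics.html is a placeholder for a file containing a real tracking snippet; this is not necessarily how the project's site was configured.

```r
library(shiny)

ui <- fluidPage(
  # Include the Google Analytics JavaScript snippet (with your tracking ID)
  # stored in a separate HTML file.
  tags$head(includeHTML("google-analytics.html")),
  titlePanel("Victorian Background Soil Database")
)

server <- function(input, output, session) {}

shinyApp(ui, server)
```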

App deployment was managed using Bitbucket for version control. The app was hosted on an RMIT server running Red Hat Linux with SSL browser security. This project was a ‘proof of concept’ for research translation and for communicating environmental science using digital platforms that are bespoke (customised) to the research project. The R language (and its packages) provides a complete coding environment, from data processing and analysis through to data visualisation and reporting (both PDF and web-based), giving researchers a single environment in which to undertake and communicate their research. One outcome of this project is a roadmap to share with other researchers at RMIT, introducing them to new tools and techniques for enhancing and communicating their research practice.

Presentation of point data rather than models

Soil data is increasingly being presented as modelled spatial layers, yet the accuracy of, and confidence in, the predicted values is often poorly communicated. Background concentrations of metals in soil can vary 100-fold within a single soil sample. At this stage, therefore, it was considered most relevant to simply present the results and provide summary statistics that clearly describe the data variability.
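A minimal sketch of the kind of per-group summary statistics that can accompany the raw points is shown below; the function, groupings and values are illustrative, not the project's actual statistics.

```r
# Per-group background statistics: sample count, median and 95th percentile.
summarise_background <- function(conc, group) {
  do.call(rbind, lapply(split(conc, group), function(x) {
    data.frame(n      = length(x),
               median = median(x),
               p95    = unname(quantile(x, 0.95)))
  }))
}

# Hypothetical arsenic concentrations (mg/kg) grouped by geological unit:
conc <- c(2.1, 3.4, 5.0, 40.2, 1.2, 2.8)
geol <- c("basalt", "basalt", "basalt", "basalt", "granite", "granite")
summarise_background(conc, geol)
```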

Next Steps

There are three key directions for further research: first, expanding the dataset and user interface to meet the needs of not just the environmental assessment industry but also the agricultural, mining and research sectors; second, assessing how to merge local data with national datasets; and third, developing predictive spatial models for background concentrations.

Acknowledgements

The authors would like to acknowledge the financial support of the HazWaste Fund (project S43-0065) and the Australian Contaminated Land Consultants Association (ACLCA) Victoria. We also acknowledge and thank the R Project for freely available statistical computing (http://www.r-project.org).

References

  1.    R Core Team, 2016. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  2.    Chang, W., Cheng, J., Allaire, J., Xie, Y., McPherson, J., 2017. shiny: Web Application Framework for R. R package version 1.0.5. CRAN.
  3.    Pebesma, E., Bivand, R., Racine, E., Sumner, M., Cook, I., Keitt, T., Lovelace, R., Wickham, H., Ooms, J., Müller, K., 2018. sf: Simple Features for R. R package version 0.6-3. CRAN.
  4.    Allaire, J., Horner, J., 2017. markdown. R package version 0.8. CRAN.
  5.    Xie, Y., 2018. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.20. CRAN.

Biography:

Ian Thomas (https://orcid.org/0000-0003-1372-9469) is a software developer and system administrator at the Research Capability Unit at RMIT University. He has worked in data curation for output of high-performance computing systems, microscopy data for materials, and screen media objects (film and television). His current work is in high-performance computing, containerized research workflows and in cloud-based platforms in support of eResearch applications.
