Research Data Stories: Struggles and Successes in Environmental Science

Mr Steven Androulakis1, Mr Hamish Holewa2, Dr Nigel Ward2, Dr Andrew Treloar3, Dr Michelle Barker4

1ANDS, Nectar, RDS, Parkville, Australia,

2QCIF, St Lucia, Australia,

3ANDS, Caulfield, Australia,

4Nectar, Parkville, Australia

 

DESCRIPTION

What do researchers want, anyway?

The experience of many researchers is too often one of time-consuming struggle: installing tools, fixing code, integrating data, uncovering inconsistencies and understanding complex formats. How can eResearch practitioners help?

While eInfrastructure projects such as ANDS, Nectar, and RDS are helping to ease collaboration and analysis, there’s so much more to be done.

This BOF presents reflections of environmental science researchers to the eResearch Australasia audience to:

1.       Provide the eResearch audience with a direct perspective from researchers on working with research data and software tools

2.       Present and discuss patterns of success

3.       Seed deeper engagement between researchers and eResearch practitioners going forward

4.       Help the eResearch audience conceptually link data and tool activity in the environmental domain with other domains

Researchers will share stories of opportunity, success, and frustration in working with and understanding data and associated tools. They will offer suggestions of where eInfrastructure could help them be more efficient in their science, and invite the audience to provide their perspectives on addressing their most pressing issues.

Researchers giving presentations will be from a diverse set of scientific and organisational backgrounds, and at varying stages of their careers.

Duration: 60 minutes

Format: Short presentations, followed by a group discussion

 


Biography

Steve Androulakis is the Manager, Community Platforms for ANDS, NeCTAR and RDS. He is responsible for facilitating the strategic development and operational implementation of research community engagement, including the development and delivery of community building and engagement programs for domain, technical and method stakeholders in research communities.

Operational high resolution oceanographic circulation models for Australian waters – Application to the Great Barrier Reef

Dr Jason Antenucci1, Ms Caroline Lai1, Mr Sam Dickson1, Mr Simon Mortensen2, Mr Paul Irving3, Dr Sabine Knapp4

1DHI Water And Environment, Perth, Australia,

2DHI Water And Environment, Gold Coast, Australia,

3Australian Maritime Safety Authority, Canberra, Australia,

4Seven Ocean Research, Melbourne, Australia

INTRODUCTION

Increasing computer power and data availability, combined with advanced numerical techniques, are opening up significant new opportunities in the development of oceanographic circulation models for operational use.

The development of forecasting and risk management systems in the ocean is heavily reliant on a solid modelling foundation, as errors accumulate rapidly. This is particularly the case when forecasting the path of stricken vessels and oil spills.

In this presentation we outline a new generation of coupled hydrodynamic and wave models being used in operational settings in Australian waters. The models provide simulations under a flexible mesh approach to avoid nesting and allow resolution down to 500 metres in all ports. We will highlight the role of IMOS in the calibration and validation of the models, along with other publicly available data.

The models have been integrated with stricken drifting vessel models, oil spill models and mooring analysis models to provide the operational systems for government and industry outcomes. Probabilistic approaches are also incorporated where relevant to allow for uncertainty in forcing predictions and response. Applications of these models will be presented, including the forecasting of stricken vessels for the Australian Maritime Safety Authority, along with mooring system responses in harbours to improve operability and minimise risk.

METHODOLOGY

Standard techniques to build numerical models for oceanographic circulation generally involve the use of regular Cartesian grids. This limits the models to a uniform resolution across the model domain, requiring nesting of models if regions of higher resolution are required. We apply a flexible mesh approach using triangular elements that allows for regions of varying resolution depending on the local topographic features and model requirements. This means only a single model domain is required, allowing for significant efficiencies in computational and processing requirements.

The objective of the model constructed for the Great Barrier Reef was to resolve currents impacting stricken vessels. This requires particular focus on shipping lanes, and must include the effects of both large-scale oceanographic currents and tidally driven currents. The bathymetric representation of the reef itself is paramount, as it heavily dictates the behaviour of nearshore tidal currents, and a number of shipping lanes pass directly through the reef.

The model was constructed with a spatial resolution varying from 500 metres to 9000 metres. The flexible mesh allowed for excellent definition of key bathymetric features such as shipping channels that are not included in other models available for the region. The domain extends from Fraser Island in the south through to Papua New Guinea in the north, and from Cape York in the west to 500 km offshore of Bundaberg in the east, covering an area of 1.8 million square kilometres using approximately 300,000 elements (Figure 1). The model is constructed using the MIKE 3 FM software, and contains 36 vertical layers down to a depth of approximately 4800 metres. The size of the model requires High Performance Computing infrastructure, with the model highly parallelized across 480 cores and typically running 90 times real-time on Magnus at the Pawsey Supercomputing Centre.

The model is calibrated against water levels and current measurements collected at ports along the coast as well as data from the Integrated Marine Observing System (IMOS – www.imos.org.au).

OUTCOMES

The project has delivered a high-resolution, highly accurate hydrodynamic model to predict current velocities, primarily for use in assisting the Australian Maritime Safety Authority in managing shipping incidents in the vicinity of the reef. The model has been run to produce 6 years of hindcasts of currents at hourly resolution to be used in risk assessments associated with shipping. The model is also being operationalised, with 2-day forecasts (available after less than an hour of simulation) to be produced and fed into AMSA’s operational response system.
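
As a rough check of that turnaround figure (a back-of-the-envelope calculation, not part of the abstract), a model running at roughly 90 times real-time needs only about half an hour of wall-clock time to cover a 2-day forecast window:

    # Rough check: at ~90x real-time, a 2-day forecast takes about half an hour.
    forecast_hours = 48            # 2-day forecast window
    speedup = 90                   # model runs ~90 times real-time on 480 cores
    wallclock_minutes = forecast_hours / speedup * 60
    print(f"{wallclock_minutes:.0f} minutes")   # ~32 minutes, i.e. under an hour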

The high resolution model is available for use by other agencies, and can be coupled with water quality and agent-based models to develop risk assessments based on water quality and ecological factors. The availability of the 6 year hindcast opens up numerous opportunities for such studies, and is an important resource in ongoing management of the reef.

Figure 1: Great Barrier Reef model domain, showing sample current vectors and temperature contours.


Biography

Jason has a PhD in Environmental Fluid Mechanics and a BE(Hons) in Environmental Engineering, with 17 years of experience in surface water environments, including lakes, reservoirs, estuaries, and the coastal ocean. His modelling expertise extends to hydrodynamics, water quality, eutrophication, and aquatic ecology from zero to three dimensions. He has extensive experience in outfall design and assessment, including hydraulics and environmental mixing, and has developed new business and technologies in water and environmental management. Technology development includes software, numerical models, and real-time decision support systems.

He has published 45 peer-reviewed papers in international journals (attracting over 2000 citations), more than 40 conference papers and one book chapter on various topics associated with water management and water quality from catchment to the coastal ocean.

The Three Legged Stool of Antarctic Data Management

Mr Dave Connell1

1Australian Antarctic Division, Kingston, Australia

 

At the Australian Antarctic Data Centre (AADC) data management rests on a “three legged stool” of applications designed for users and administrators. These three applications form the basis of a successful data management strategy, a process which has been gradually refined since the AADC came into existence over twenty years ago.

These three applications consist of a science project management tool, a metadata creation and discovery tool, and a data submission tool.

THE SCIENCE PROJECT MANAGEMENT TOOL – MYSCIENCE

In order to facilitate effective data management within the Australian Antarctic program (AAp), the AADC was established in 1995. Since then the AADC has created infrastructure to support the archival of data and the creation of metadata records, as well as value-adding to Australian Antarctic data through the use of GIS and mapping tools, the creation of targeted databases, and data-analysis activities. However, until 2012 the AADC lacked the capability to directly manage the data archival needs of each AAp science project. In 2012 the AADC launched the MyScience project management application [1], which allowed AADC staff to efficiently keep track of AAp science projects, ensured that all expected datasets were accounted for, and ensured that AAp scientists were not unduly pestered for data before the expected due dates.

MyScience primarily achieved this goal through the use of Data Management Plans (DMPs).   Early on in the project application phase, scientists are required to complete a DMP for their project in order to inform the AADC what data to expect from the project and when. This creates a “shopping list” of data that can be expected from each project that the AADC can progressively check off. This also allows AADC staff to objectively rate each project/responsible scientist on its/their data management effectiveness. These ratings can be collated in large reports and presented to the AAp funding office for further evaluation.   Funding for future projects can then be prioritised to scientists who fulfil their data management obligations.
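
As a rough illustration of the “shopping list” idea, expected datasets from a DMP can be checked off as they arrive, yielding a simple compliance figure per project. The dataset names and scoring rule below are invented for the example and are not taken from MyScience:

    # Hypothetical DMP "shopping list": expected datasets are checked off as they
    # arrive, giving a simple compliance figure per project.
    expected = {"CTD profiles 2016/17", "Penguin census 2017", "Ice core chemistry"}
    received = {"CTD profiles 2016/17"}

    outstanding = expected - received
    compliance = len(expected & received) / len(expected)

    print("outstanding:", sorted(outstanding))
    print(f"compliance: {compliance:.0%}")   # could feed into the project ratings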

Furthermore, the MyScience application is linked to the other two legs of the stool, the metadata tool and the data submission tool, providing a “one stop shop” for scientists when it comes to the management of their data.

THE METADATA TOOL

Metadata is a crucial element of the AADC’s data management strategy, for without well-written metadata, many of the datasets stored at the AADC would be of little value. As such, providing a simple method for scientists and AADC staff to write metadata records was of great importance. For many years, due to limitations with technology and metadata standards, this was not possible. More often than not, a low-tech, labour-intensive approach was required to produce metadata of an adequate standard. The current metadata tool used by the AADC was released in 2015, and has proven to be very successful at delivering high-quality metadata records. It not only simplifies the metadata creation process for users, but still retains all the required complexity and detail in order to minimise the effort required by AADC administrators to evaluate and process the records.

THE DATA SUBMISSION TOOL

The final leg of the stool was to provide a way for users to reliably and safely upload their data to the AADC. The first attempt to create a data submission tool was released in 2008, but while well intentioned and well thought out, it was not well designed, was poorly developed, and was limited in capability. The tool did not integrate well with other AADC applications, and grew ever more buggy before it became irretrievably broken in 2015. A replacement tool was finally released in 2017, and unlike its predecessor has thus far proven to be well designed, well developed and very capable.

The new data submission tool has been tightly integrated into both the MyScience application and the metadata tool for ease of use and more reliable reporting by AADC staff.  The new tool also allows much greater file sizes to be uploaded to the AADC, and ensures that datasets do not become “lost in an inbox”.

IS THERE A CUSHION ON THE STOOL?

While these three applications form the basis of the administration of data management at the AADC, there are of course other factors which “sweeten” the user experience and ease the burden on administrators. These sweeteners include: dataset DOIs for an increased citation presence; search tools for downloading publicly accessible data; value-adding to the data; and integration with other metadata catalogues and data repositories for increased exposure on the world stage.

IS THE STOOL WOBBLY?

Despite the success of these three applications in enhancing data management practices at the AADC, there is still room for improvement. The data submission tool and the MyScience application need to be further linked so that when datasets are submitted to the AADC they are automatically checked off the DMP; reporting mechanisms need to be improved so that AADC data managers can more effectively manage the repository; and thought needs to be given to whether data access should remain primarily limited to downloading “flat files”, or evolve to incorporate an integrated, multi-dataset service.

And while the AADC has made a very nice stool, the AADC can’t force people to sit on it – some scientists exploit a policy loophole in the AAp which allows them to collect data under a “non-science” project, which comes with no obligation to archive the data.  Furthermore, despite a concerted effort to make data management as easy as possible, the AADC has been unable to achieve 100% compliance when it comes to data archival.

REFERENCES

  1. Finney K (2014) Managing Antarctic Data – A Practical Use Case, Data Science Journal, 13 PDA8-PDA14, doi.org/10.2481/dsj.IFPDA-02

 


Biography

Dave Connell completed a Bachelor of Science (honours) degree at the University of Tasmania, and has been working at the Australian Antarctic Division since 1998 and as the metadata officer since 1999.  His role is to catalogue and archive all scientific data collected by the Australian Antarctic program – specifically to ensure that scientists write high quality metadata records and archive their data in a timely manner. During his time at the AAD, he has overseen the transition from ANZLIC metadata to DIF metadata, and also developed tools for converting DIF metadata into various profiles of the ISO 19115 metadata standard.  Dave is also very active in the Australian Government metadata space – reviewing and adapting ISO 19115 metadata standards for use in Australian scientific organisations. He has also worked with the Ocean Acidification – International Coordination Centre to develop an ocean acidification metadata profile.

Another Year in the Evolution of the Virtual Geophysics Laboratory and the Scientific Software Solutions Centre

Dr Carsten Friedrich1, Lesley Wyborn2, David Lescinsky3, Ryan Fraser1, Geoff Squire1, Stuart Woodman1, Peter Sienkowski3

1CSIRO, Acton, Australia,

2National Computational Infrastructure, Canberra, Australia,

3Geoscience Australia, Canberra, Australia

INTRODUCTION

Much has happened in the evolution of the Virtual Geophysics Laboratory (VGL) and the Scientific Software Solutions Centre (SSSC) in 2016-2017: for VGL we have added support for the Amazon Cloud and the NCI Raijin supercomputer, improved provenance support, simplified and improved the user interface, enhanced the result visualization capabilities, added support for new science codes and data repositories, as well as started applying the VGL platform to other science domains. For the associated SSSC we have added support for HPC environments such as NCI Raijin; user authentication and authorization; support for publication and peer review processes of entries; as well as digital signing of entries and reviews. This presentation will cover these improvements and new features.

BACKGROUND

A virtual laboratory comprises three stages: selection of input data, selection of a tool to process that data and selection of the compute infrastructure to run the selected tool on the selected data.  VGL currently allows users to browse and visualize large repositories of data sets hosted at NCI and Geoscience Australia. After selecting a target dataset and region, users can select an analytical code/model that they want to run on the dataset from the SSSC (e.g., magnetic or gravity inversion for the purpose of understanding what is under the observable surface in an area). The analytic model is then submitted to an infrastructure provider of choice (e.g. Amazon cloud, NeCTAR, or NCI HPC facility) and results are made available to the user after completion. The SSSC was designed to be an app-store for analytic code and models that can automatically be discovered and executed in the cloud by any Virtual Laboratory. Researchers can submit science code they have developed to the SSSC and once the software has been reviewed and approved for release it will be automatically discoverable and executable by SSSC clients such as the Virtual Geophysics Laboratory.

VGL Enhancements in 2016-2017

1.       Amazon AWS support
Users can now run VGL jobs on the Amazon Cloud using Amazon EC2 instances for execution and AWS S3 for storing results. To avoid VGL operators facing large bills from Amazon for executing user jobs and storing results, VGL can be configured to require users to provide their own Amazon accounts, via AWS cross account authorization, and be billed for job execution and data storage.
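
For illustration, a minimal sketch of this cross-account pattern using boto3 might look like the following; the role ARN, bucket and object key are placeholders rather than VGL’s actual configuration:

    # Sketch only: assume a role in the user's own AWS account (cross-account
    # authorization) so EC2/S3 usage is billed to the user, then store a result.
    import boto3

    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::111122223333:role/vgl-job-runner",  # user-supplied role (placeholder)
        RoleSessionName="vgl-job",
    )["Credentials"]

    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    s3.put_object(Bucket="users-results-bucket", Key="job-123/output.nc", Body=b"...")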

2.       NCI Raijin support
VGL now also supports running jobs on the NCI Raijin supercomputer. Users can select Raijin as an option provided the necessary dependencies are available on that platform and they can supply their NCI user name, key, as well as a valid project code. VGL can then generate the required PBS scripts and schedule the job for execution. VGL also supports monitoring job progression, as well as preview and retrieval of results. Since the technologies we use for this, namely ssh and PBS, are widely used in the HPC community it would be relatively easy to support other HPC facilities in the future.
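
A minimal sketch of how such a submission could look, assuming standard command-line ssh/scp; the PBS directives, project code and host paths below are placeholders, not VGL’s actual script templates:

    # Sketch only: write a PBS script, copy it to the HPC system and queue it with qsub over ssh.
    import subprocess

    pbs_lines = [
        "#!/bin/bash",
        "#PBS -P ab1",                         # placeholder NCI project code
        "#PBS -q normal",
        "#PBS -l walltime=01:00:00,ncpus=16",
        "cd $PBS_O_WORKDIR",
        "./run_science_code.sh",               # placeholder wrapper for the science code
    ]
    with open("job.pbs", "w") as fh:
        fh.write("\n".join(pbs_lines) + "\n")

    # copy the script to the HPC system and submit it to the PBS queue
    subprocess.run(["scp", "job.pbs", "user@raijin.nci.org.au:job.pbs"], check=True)
    subprocess.run(["ssh", "user@raijin.nci.org.au", "qsub job.pbs"], check=True)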

3.       Improved provenance reporting
VGL now captures complete provenance for every job it executes. If configured, the provenance information will be automatically submitted to a PROV-O compliant provenance server such as PROMS.
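
For illustration, a job’s provenance could be expressed with the PROV data model along the following lines; this sketch assumes the Python “prov” package and invented entity names, and is not VGL’s implementation:

    # Sketch only: record that a job used an input dataset and generated a result.
    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace("vgl", "http://example.org/vgl#")   # hypothetical namespace

    dataset = doc.entity("vgl:input-dataset")
    result = doc.entity("vgl:inversion-result")
    job = doc.activity("vgl:gravity-inversion-job")

    doc.used(job, dataset)              # the job read the selected dataset
    doc.wasGeneratedBy(result, job)     # the result was produced by the job

    print(doc.get_provn())              # PROV-N text, ready to submit to a PROV-O compliant server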

4.       Improved User Interface and AAF Support
We have made some major and many small improvements in the Web user interface. Major improvements include faceted search in datasets based on keywords, spatial bounds, service types, publication dates, and others. We have also improved the jobs results page, making it much easier to organize jobs in folders as well as monitor job progress, and preview and download job results. Further, users are now also able to log into VGL using the Australian Access Federation (AAF), and VGL can be configured to require AAF login to access NeCTAR resources for job execution.

 

Figure 1 New Data Discovery Page with faceted search capability

Figure 2 New Job monitoring and result page with enhanced result preview functionality

5.       Support for new Science Codes and Data Repositories
We have registered more science codes in the SSSC and thus made them available for job execution in VGL, including support for escript on NCI Raijin, pyGplates on NeCTAR and AWS, as well as execution of CSIRO Workspace workflows on NeCTAR and AWS. For data, VGL users now have access to the Australian National Geophysics Data Collection on the NCI National Environmental Research Data Interoperability Platform (NERDIP – http://nci.org.au/data-collections/nerdip/).

Application of VGL in other science domains

The VL platform underlying VGL is generic and thus lends itself to application beyond the geo-science domain. For example in the Earth Observation space, it can be repurposed to apply algorithms in areas such as crop monitoring, carbon accounting, and algal bloom monitoring based on satellite images from the CEOS DataCube.

SSSC Developments in 2016-2017

1.       Support for HPC environments such as NCI Raijin
We have extended and optimized the SSSC data model to cater for a wider range of execution environments: in particular we now support code dependencies on PBS based HPC facilities, such as NCI Raijin.

2.       User authentication and authorization
While read access to published entries remains available to anonymous users, creating new entries or modifying entries now requires users to be registered, logged in, and properly authorized. Users can currently register with a valid email address and, once the email has been verified, can start creating content and participate in SSSC community activities such as applying for publication of entries or reviewing entries by other users.

3.       Digital signing of entries
The SSSC now supports digital signing of entries by content creators, which gives users enhanced assurances about content authorship and integrity. By verifying the signature an end-user can be assured that an entry has been signed by the actual author as well as verify that the entry has not been corrupted or otherwise been modified. The SSSC currently allows users to register their own public signature key with the SSSC and sign entries with the corresponding private signature key.
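
As an illustration of the sign-and-verify workflow (the abstract does not specify the SSSC’s signature scheme, so Ed25519 via the Python “cryptography” package and the entry payload below are assumptions):

    # Sketch only: Ed25519 sign/verify; payload and key handling are illustrative.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    entry = b'{"name": "gravity-inversion", "version": 2}'   # hypothetical entry payload

    private_key = Ed25519PrivateKey.generate()   # held by the content creator
    public_key = private_key.public_key()        # registered with the SSSC

    signature = private_key.sign(entry)          # author signs the entry

    try:
        public_key.verify(signature, entry)      # any user can check authorship and integrity
        print("signature valid")
    except InvalidSignature:
        print("entry modified or not signed by this author")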

4.      Automatic Versioning
Previously, any modification to a SSSC entry would overwrite and replace the previous version, which is not ideal for reproducibility, provenance reporting, or backward compatibility. The SSSC now preserves all previous versions and when an entry is modified its version number is automatically increased and the new version becomes the current version of that entry. Older versions are still accessible by explicitly referring to their version number.
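
A minimal sketch of this versioning behaviour, with class and field names invented for the example rather than taken from the SSSC data model:

    # Sketch only: every update creates a new version; old versions stay addressable.
    class VersionedEntry:
        def __init__(self, content):
            self.versions = {1: content}
            self.current_version = 1

        def update(self, new_content):
            self.current_version += 1
            self.versions[self.current_version] = new_content   # never overwrite

        def get(self, version=None):
            return self.versions[version or self.current_version]

    entry = VersionedEntry({"code_url": "https://example.org/v1.tar.gz"})
    entry.update({"code_url": "https://example.org/v2.tar.gz"})
    assert entry.get()["code_url"].endswith("v2.tar.gz")            # current version
    assert entry.get(version=1)["code_url"].endswith("v1.tar.gz")   # older version still accessible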

5.       Publication and peer review processes
The SSSC now supports a configurable review and publication process for entries and the SSSC can be configured to require an explicit publication step. When a user wants to make one of their entries available to other users, they can now request publication of that entry in the SSSC. Users with the appropriate authorizations can review entries where publication has been requested. If they are satisfied that the entry is compatible with content and quality guidelines for that SSSC instance, they can release the entry as published. Once published, an entry then becomes discoverable and accessible by other users; either directly in the web interface or through 3rd party clients such as VGL.

The SSSC also now supports creating and browsing reviews of entries. This can be used by moderators as part of a peer-review workflow to decide approval for publication requests, or more generally as a community tool to provide feedback and recommendations to other users. Optionally, reviews can be digitally signed by the reviewer which gives end users increased trust in the authorship and validity of the review.
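
The publication flow described above can be summarised as a small state machine; the state and action names below are inferred from the text, not taken from the SSSC implementation:

    # Sketch only: draft -> publication requested -> published, with reviewer approval.
    TRANSITIONS = {
        "draft": {"request_publication": "publication_requested"},
        "publication_requested": {"approve": "published", "reject": "draft"},
        "published": {},   # discoverable by other users and clients such as VGL
    }

    def transition(state, action):
        if action not in TRANSITIONS[state]:
            raise ValueError(f"action '{action}' not allowed in state '{state}'")
        return TRANSITIONS[state][action]

    state = "draft"
    state = transition(state, "request_publication")   # content creator requests publication
    state = transition(state, "approve")                # authorised reviewer releases the entry
    print(state)                                        # -> published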

EcoCloud: Towards the development of an ecosystem science community cloud

Dr Siddeswara Guru1, Hamish Holewa2, Gerhard Weis2, Hoylen Sue3, Sarah Richmond2

1 The University of Queensland, St Lucia, Australia, s.guru@uq.edu.au
2Griffith University, Brisbane, Australia, hholewa@quadrant.edu.au, g.weis@griffith.edu.au, sarah.richmond@griffith.edu.au
3QCIF Pty Ltd, Australia, h.sue@qcif.edu.au

 

INTRODUCTION

EcoCloud is a customised, National eResearch Collaboration Tools and Resources (NeCTAR) hosted cloud infrastructure for ecosystem science, built to enable more effective access to ecosystem science data, compute platforms and resources for innovative research. One of the motivations for developing EcoCloud is to bring data and compute resources closer together so that complex analyses can be performed where the data reside. EcoCloud is one of three domain-specific Science Clouds, namely the Australian Ecosystems Science Cloud (EcoCloud), the Australian Biosciences Cloud and the Australian Marine Sciences Cloud, which were started as collaborations between NeCTAR and the Terrestrial Ecosystem Research Network (TERN), Bioplatforms Australia (BPA) and the Integrated Marine Observing System (IMOS) respectively. EcoCloud is built in partnership with TERN, NeCTAR, the Queensland Cyber Infrastructure Foundation (QCIF), the National Computational Infrastructure (NCI) and the Atlas of Living Australia (ALA). EcoCloud will be a building block towards an open Australian ecosystem science cloud providing seamless access to data, tools and analysis for transparent, reusable research and development across different ecosystem domains, supporting cutting-edge research and evidence-based public policy.

RATIONALE

Ecosystem science is a multi-disciplinary science which studies the inter-relationships between living organisms, physical features, bio-chemical processes, natural phenomena and human activities [1]. TERN plays a critical role in collecting and collating terrestrial ecosystem data, and ALA does the same for biodiversity data, but several other initiatives and organisations also collect and publish extensive collections of ecosystem data. With the advancement of technology and the emergence of the cloud platforms developed by NeCTAR and Research Data Services (RDS), the ecosystem science community wants to leverage easy and seamless access to services such as network infrastructure, storage and compute to build its analysis pipelines. However, data currently reside in different data centres behind multiple search and access interfaces, and streamlined access to cloud-enabled data for compute remains challenging and often not possible.

APPROACH

EcoCloud was started with a strongly user-centric approach to address some of the challenges the ecosystem science community faces in getting streamlined access to data close to compute resources. Two sets of community engagements were run, one focusing on domain scientists and the other on application developers. Several patterns emerged about the needs of the community: in general, users wanted managed, scalable compute platforms with openly accessible data collections located close to the compute for further analysis. Some of the drivers for EcoCloud are:

  1. ensure that data is widely accessible for use across different science disciplines
  2. harmonise TERN data infrastructure and other similar data infrastructures to offer a common platform to perform a data-centric search, query and access data from different platforms and virtual labs
  3. provide scalable managed computing environment with easy access to distributed and data-intensive computation and technologies
  4. develop a support system for a cross-disciplines use of data.

The overall component architecture was drawn up to identify functional and non-functional needs for the EcoCloud. Further, the cloud-enabled architecture was also developed to identify OpenStack cloud services that would be used and required in the implementation.

As part of the phase 1 implementation, we have decided to develop three managed platforms together with an open data platform. Figure 1 shows the components of EcoCloud under development; each component and its functionality is explained below.

Figure 1: components of the phase 1 implementation of EcoCloud

SYSTEM OVERVIEW

EcoCloud will have three main components: user management, data storage and managed compute platforms. User management is an essential part of the system; users need to register and log in using an Australian Access Federation (AAF) or Google account. User information will be stored in LDAP to create user accounts for accessing the virtual desktop platform and file sync applications.

In the first phase, the data platform will offer EcoDrive and EcoStore. EcoDrive will act as a user-defined storage platform with sync functionality, allowing users to bring their own data and access it from EcoCloud compute resources and analysis platforms. A user should be able to write to EcoDrive and to access the data from the different EcoCloud analysis platforms. Users should also be able to sync EcoDrive to their local desktop, but the size of individual datasets that can be synced will be restricted to less than 500 MB to minimise large data copying. EcoStore is a managed data store to host large-scale reference data; typically these datasets are continental in scale and ready for reuse. EcoStore is primarily managed storage for quality-controlled reference datasets that are available for use by the user community. EcoStore does not offer sync functionality, again to minimise large file copying; instead, all datasets will be accessible from the managed platforms. OpenStack Swift storage will be used to implement EcoStore.
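
For illustration, reference data in a Swift-backed EcoStore could be read along the following lines; this sketch assumes the python-swiftclient package, and the endpoint, credentials, container and object names are placeholders rather than EcoCloud’s actual configuration:

    # Sketch only: list and fetch reference data from an OpenStack Swift container.
    from swiftclient.client import Connection

    conn = Connection(
        authurl="https://keystone.example.org/v3",   # hypothetical Keystone endpoint
        user="demo",
        key="secret",
        auth_version="3",
        os_options={"project_name": "ecocloud"},
    )

    _, objects = conn.get_container("ecostore-reference-data")
    for obj in objects:
        print(obj["name"], obj["bytes"])             # continental-scale reference datasets

    _, content = conn.get_object("ecostore-reference-data", "soil-grid.tif")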

The managed platforms consist of RStudio Server, JupyterHub and a virtual desktop. EcoCloud offers RStudio Server and JupyterHub as managed services accessible from a web browser; these services run as containers in a Kubernetes cluster orchestration system. A CoESRA virtual desktop environment will also be part of the platform [2]. All virtual desktops will include the applications RStudio, Canopy, QGIS, Panoply, Kepler, KNIME and OpenRefine, with an option to sync a user’s Dropbox drive. All managed application platforms will be able to access EcoDrive and EcoStore to enable data analysis and application development.

User support in eResearch is important for bringing e-infrastructure closer to the domain research community. EcoCloud will have a dedicated online support system, including online tutorials and an e-helpdesk, to disseminate system expertise and to help users accomplish their research by addressing problems arising in their eResearch needs.

REFERENCES

  1. National Ocean Service. Available from: http://oceanservice.noaa.gov/facts/ecosci.html, accessed 28 June 2017.
  2. Guru, S.M., C.I. Hanigan, H.A. Nguyen, E. Burns, J. Stein, W. Blanchard, D. B. Lindenmayer, and T. Clancy, “Development of a cloud-based platform for reproducible science: the case study of IUCN Red List of Ecosystems Assessment” Ecological Informatics, Vol. 36. 2016.

 


Biographies

Siddeswara Guru is a Data Science Director for the Terrestrial Ecosystem Research Network (TERN). He initiates, coordinates and manages ecological data, e-infrastructure and synthesis projects apart from overseeing the data and information management activities across TERN.

Dr Hoylen Sue is a data specialist at the Queensland Cyber Infrastructure Foundation (QCIF). He has experience in research and development in the areas of data management, metadata, distributed systems and cloud computing.

Building a soils data community

Mr Paul Box1, Mr Peter Wilson1, Ms Julia Martin2, Ms Melanie Barlow2

1Commonwealth Scientific and Industrial Research Organisation, Sydney, Canberra, Australia,

2Australian National Data Service, Canberra, Australia

ABSTRACT

In this presentation we will detail approaches to building a soil data sharing community to broaden the base of contributors to and consumers of Australian soils data.

This ‘soil data sharing community’ seeks to develop equitable, transparent and trusted data sharing mechanisms that would benefit all contributors and potential users within a structured and agreed environment. As the community spans government, research and industry players as well as individual farmers, all of which have differing goals, business drivers and incentives for sharing data, developing arrangements that satisfy all players is challenging. Unlike a ‘data commons’, the ‘data sharing arrangements’ for a soil data community will need to provide a more secure and trusted sharing environment, more akin to a farmers’ data market in which contributors are able to determine the conditions for reuse of their data. The sharing capabilities to be developed are a socio-technical system, comprising technical infrastructure, standards and data, together with the policies, IP arrangements, contracts, governance and other institutional arrangements necessary to build and operate a trusted marketplace comprising data that have public, private and club good characteristics.

This presentation will explore the human-centred design approach used to examine a range of social, institutional and economic issues and the perspectives of stakeholders actively engaged in, or wanting to build, data sharing arrangements. It will also provide insights into the design of the socio-technical approach to building a community and a data sharing market, based on requirements identified by stakeholders in this project and patterns identified in Sukiato activities in other domains. These insights can provide guidance and potential learnings for other data community building efforts, particularly those related to the rapid expansion of digital agriculture.

BACKGROUND

Over the last 25 years CSIRO has been collecting and curating data about Australian soils. Much of this data collection has been funded by various state and federal government departments over the years through a large number of research projects. Each project in turn has delivered various research outcomes including research papers, government reports and databases that CSIRO researchers have brought together under the banner of the Australian Soils Research Information System (ASRIS – http://www.asris.csiro.au/).

Because the data collection and research projects were funded entirely by government, the researchers tended to focus on meeting the needs of the departments they were working with at the time. The look, feel and functionality of the system was driven mostly by the contractual requirements for the various funding agreements. In order to be considered a truly national system, the platform needs to broaden its appeal to a wider community across Australia interested in soils data.

Funding to maintain soils information is limited, and large volumes of soil data are collected by private sector players (agronomists, soil testing labs and farm machinery operators) as well as by farmers themselves, increasingly through sensors deployed on their land. These data are privately owned (private goods) by individual farmers or industry, are sometimes made publicly available (public goods), and are increasingly shared through collectives for the benefit of data sharing communities (club goods).

The ownership of and access rights to data are often unclear especially in cases where third parties own sensors through which data are collected. Lack of clarity and the often contested nature of rights around data, compounded by individual farmers’ concerns about the dis-benefits to them of sharing with others (e.g. poor soil quality readings for land leading to reduced land valuation) act as disincentives to sharing. Insufficient clarity around the value proposition for sharing and the lack of Australian examples of beneficial data cooperatives act as further constraints.  However, the potential value of data sharing to farmers, agriculture industry and government is considered to be high, with spatially and temporally extensive data (notwithstanding issues of quality or completeness) of high utility for improved on-farm productivity, and a range of other third party uses.

The soils data community building project aims to explore and understand this complex set of issues to inform the design of appropriate social, institutional and technical elements of a soils data sharing mechanism. Building a vibrant community around a capability of this nature is essential to ensure the sustained flow of data that is used to deliver the necessary benefits to incentivise data provision and at the same time offer a viable business model for market operation.


Biography

Paul Box leads a CSIRO research team developing interoperable systems of systems, or ‘Information Infrastructure’. Paul has worked for more than 25 years in the geospatial information technology field. Prior to joining CSIRO in 2009 he worked for 15 years throughout Asia, Europe and Africa for United Nations, government and not-for-profit organisations, designing, implementing and managing geospatial capability across a wide diversity of application areas in sustainable development and humanitarian response.

For the past 10 years, Paul has been actively involved in research, design and implementation of large scale cross-enterprise Information Infrastructure. This work has focused primarily on the design and delivery of integrated suites of geospatial information products and improving the efficiency of information supply chains.

More recently, Paul has focused attention on addressing the social rather than technical challenges of building Information Infrastructure. Coherent, integrated approaches to addressing the social, institutional and economic challenges of infrastructure development are being elaborated through ‘social architecture’. This approach supplements traditional technical architecture-led approaches and is being used to support the design and implementation of information infrastructure in multiple domains.

Bioenvironmental Data as a Web Service

Mr Lee Belbin1

1Blatant Fabrications Pty Ltd, Carlton, Australia,

THE ATLAS OF LIVING AUSTRALIA [1]

The Atlas of Living Australia is an Australian government funded project established in 2010 and managed by CSIRO to collect and integrate information on observations and specimens of species in the Australian region. There was a recognition that the diverse and valuable data held by Australian herbaria and museums needed to be integrated and exposed publicly in a consistent form. The Atlas currently holds more than 90 million records of over 100 thousand terrestrial, marine and freshwater species and a broad range of species attributes (e.g., images) [2]. To June 2017, there have been 11.3 billion records downloaded from the Atlas.

The geographic focus of the Atlas of Living Australia is, naturally enough, the Australian region, but how do we even define this region? Notionally, it has been defined as the bounding box that contains Australia’s Exclusive Economic Zone and Australia’s external territories. There are however no fixed spatial limits to the biological and bioenvironmental data within the Atlas. For example, the Atlas currently contains observations of species in 254 countries.

THE SPATIAL PORTAL [3]

The Atlas has several web portals, one of which, the Spatial Portal, has been designed to support ecological research and environmental management. As its name suggests, the Spatial Portal provides a map focus but also adds around 500 ‘environmental layers’ [4] and a range of analytical tools to demonstrate the utility of integrated biological and environment data. The spatial extent of the Spatial Portal’s environmental layers is predominantly within the Australian region as defined above, but global extent layers are included to support the analyses of invasive species in their home ranges.

Tools such as scatterplots enable users to evaluate how species are related to their environments. If the axes on the scatterplots are replaced by classes of an environmental layer, a cross-tabulation of species occurrences, species diversity and area can be calculated for each combination of classes. More advanced tools such as MaxEnt [5], a species distribution modelling algorithm based on maximum entropy, examine the relationship between species observations and the environment to predict where species could occur.

BIOENVIRONMENTAL [6]

While the Atlas wasn’t an obvious home for these bioenvironmental layers, no alternative currently exists. The ~500 layers were sourced from 64 different agencies/departments. Layers were added to the Spatial Portal if they were evaluated as showing a potential relationship to species distributions or provided a context for species occurrences. For example, species distributions are controlled in part by temperature and precipitation. Similarly, the distribution of species across, for example, parks and reserves may contribute to how areas are managed.

The exposure of these bioenvironmental layers is via keywords and a three-level hierarchical classification: layer type, classification 1 and classification 2. The layer type is either environmental, implying gridded continuous values such as temperature or precipitation, or contextual, implying polygonal layers with a class value such as soil type or state/territory. Classification level 1 includes area management, biodiversity, climate, distance, fire, hydrology, marine, political, sensitive data, social, substrate, topography and vegetation. Level 2 classification terms include age, biodiversity, biology, boundaries, chemistry, classification, culture, energy, evaporation, exclusion zones, farming, fpar, habitat, humidity, moisture, phenology, phylogenetic diversity, physics, precipitation, region, status, temperature, topography, turbidity and wind.

The Atlas negotiated various forms of CC-BY [7] licenses for the biological data but the environmental data layers range from various creative commons licenses to “contact the creator”. Therefore, while biological data can be freely downloaded, for simplicity, the environmental data has been restricted to ‘sampling’: providing layer values at geographic points corresponding to biological observations in the Atlas or as uploaded by users.

Access to environmental layer values can either be through the Spatial Portal’s web interface, via an R-library (ALA4R: http://www.ala.org.au/faq/spatial-portal/spatial-portal-case-studies/ala4r/) or via the Atlas web services [8]. For example, to determine what state or territory an observation is in, the URL http://spatial.ala.org.au/ws/intersect/{id}/{latitude}/{longitude} can be used. The ‘id’ is a local identifier that is assigned to each layer and the latitude and longitude are to be supplied in decimal degrees. For example, http://spatial.ala.org.au/ws/intersect/cl22/-23.1/149.1 will show that the location with a latitude of -23.1 and a longitude of 149.1 is, on layer cl22 (contextual layer 22), located in the state of Queensland. Similarly, http://spatial.ala.org.au/ws/intersect/el761/-34.43/145.12 will return that the mean annual temperature – diurnal range at latitude -34.43 and longitude 145.12 is 13.575 °C.
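
For example, the same lookups can be scripted. The sketch below simply issues the requests shown above and prints the JSON response; the exact response field names are not listed here, so none are assumed:

    # Sketch only: sample bioenvironmental layer values at a point via the intersect service.
    import requests

    BASE = "http://spatial.ala.org.au/ws/intersect"

    def sample_layer(layer_id, lat, lon):
        resp = requests.get(f"{BASE}/{layer_id}/{lat}/{lon}", timeout=30)
        resp.raise_for_status()
        return resp.json()

    print(sample_layer("cl22", -23.1, 149.1))      # state/territory for the point
    print(sample_layer("el761", -34.43, 145.12))   # mean annual temperature - diurnal range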

DISCUSSION

The 500+ bioenvironmental layers in the Atlas of Living Australia require approximately one full-time position to maintain. New layers and associated metadata can take days to locate, download, check licensing, classify and process into a consistent form for efficient use within the Atlas environment. The availability of new layers is an ad hoc manual process, and new layers may either replace older versions (deprecate), or add to older versions.

Ideally, a federally funded national committee within NCRIS [9] should be established to identify relevant bioenvironmental layers for the Australian region and to establish a standard protocol for their delivery nationally and globally. Fundamental to that service should be a basic web service that takes the form:

http://bioenvironment.au/ws/intersect/{layerID}/{latitude}/{longitude}

A batch version of the above service would be a useful addition. The responsibility for the delivery of the service would then reside with the data source, errors in processing would be greatly minimised, and currency assured. Such a service would also avoid costly duplication of effort, storage and computation. Professor Henry Nix proposed an “Australian Environmental GIS (AEGIS)” back in 1986. Hopefully, this simpler recommendation won’t take another 30 years to implement.

REFERENCES

  1. The Atlas of Living Australia, http://www.ala.org.au, accessed 27 Jun 2017.
  2. The Atlas dashboard, http://dashboard.ala.org.au, accessed 27 Jun 2017.
  3. Belbin, L., The Atlas of Living Australia’s Spatial Portal, in Proceedings of the Environmental Information Management Conference 2011 (EIM 2011), Jones, M.B. & Gries, C. (eds.), 39-43. Santa Barbara, USA. http://spatial.ala.org.au, accessed 27 Jun 2017.
  4. Environmental layers, http://spatial.ala.org.au/layers, accessed 27 Jun 2017.
  5. Phillips, S.J., Anderson, R.P. and Schapire, R.E., Maximum entropy modeling of species geographic distributions. Ecological Modelling 2006 190, p. 231-259.
  6. Belbin, L. and Williams, K.J., Towards a national bio-environmental data facility: experiences from the Atlas of Living Australia. International Journal of Geographical Information Science 2016 30(1), p. 108-125, http://dx.doi.org/10.1080/13658816.2015.1077962.
  7. Creative Commons, https://creativecommons.org/licenses/by/2.0/au/, accessed 26 Jun 2017.
  8. The Atlas of Living Australia’s web services, http://api.ala.org.au, accessed 27 Jun 2017.
  9. The National Collaborative Research Infrastructure Strategy (NCRIS), https://www.education.gov.au/national-collaborative-research-infrastructure-strategy-ncris, accessed 27 Jun 2017.

Biography

I am a geoscience graduate and IT postgraduate who has evolved from exploration geology, teaching and research, analytical ecology to management, standards and policy development.  For the past 15 years, I have provided project management for, and advice to international and national information management projects. I have been referred to as a surfer with a work problem. ORCID: 0000-0001-8900-6203
