Identifying, connecting and citing research with persistent identifiers

Natasha Simons1, Andrew Janke2, Jens Klump3, Lesley Wyborn4, Adrian Burton5, Siobhann McCafferty6, Gerry Ryder7

1Australian Research Data Commons, Brisbane, Australia, natasha.simons@ardc.edu.au

2National Imaging Facility, Centre for Advanced Imaging, UQ, Brisbane, Australia, andrew.janke@uq.edu.au

3CSIRO Mineral Resources, Perth, Australia, jens.klump@csiro.au

4National Computational Infrastructure, Canberra, Australia, lesley.wyborn@anu.edu.au

5Australian Research Data Commons, Canberra, Australia, adrian.burton@ardc.edu.au

6Australian Access Federation, Brisbane, Australia, siobhann.mccafferty@aaf.edu.au

7Australian Research Data Commons, Adelaide, Australia, gerry.ryder@ardc.edu.au

 

DESCRIPTION

Increasingly, the research community, including funders and publishers, is recognising the power of ‘connected up’ research to facilitate reuse, reproducibility and transparency. Persistent identifiers (PIDs) are critical enablers for identifying and linking related research objects including datasets, people, grants, concepts, places, projects and publications. PID systems:

  • Provide social and technical infrastructure to identify and cite a research output over time
  • Enable machine readability and exchange (illustrated in the sketch after this list)
  • Collect and make available metadata that can provide further context and connections
  • Facilitate the linkage and discovery of research outputs, objects, related people and things
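To make the machine-readability point concrete, most DOI registration agencies support HTTP content negotiation, so a PID can be resolved directly to structured metadata rather than to a landing page. A minimal sketch in Python (the DOI in the usage note is a placeholder):

```python
import requests

def fetch_pid_metadata(doi: str) -> dict:
    """Resolve a DOI to machine-readable metadata via content negotiation."""
    resp = requests.get(
        f"https://doi.org/{doi}",
        # Ask the resolver for CSL JSON instead of the human-facing landing page.
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Usage (placeholder DOI):
# meta = fetch_pid_metadata("10.1000/example")
# print(meta.get("title"), meta.get("publisher"))
```

The same resolution mechanism is what lets services such as Scholix aggregate links between data and literature without scraping landing pages.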

Join this BoF to learn about recent developments in PID services and infrastructure with a particular focus on DOI (research data), ORCID (people and organisations), RAID (research activities and projects) and IGSN (physical samples and specimens).

Find out how to maximise the return on your investment in PIDs through participation in global initiatives such as Scholix and the Research Data Switchboard, which use PIDs to offer researchers and research institutions a richer, more connected experience.

AUDIENCE

This BoF will be of interest to those implementing, maintaining and supporting PID services including repository managers, developers and librarians. Participants should come along prepared to exchange knowledge, share experiences and contribute to discussions about optimising the ‘power of PIDs’.

SESSION STRUCTURE

The session will kick off with brief lightning talks presented by those working at the cutting edge of global developments in PID services and infrastructure.  Following facilitated Q&A, participants will be encouraged to contribute to an open discussion to share experiences, explore ideas and ask questions.

OUTCOMES

Participants will leave the BoF with a fresh perspective on the opportunities PIDs can offer researchers and research organisations. We envisage that many participants will be prompted to explore in greater depth the ideas raised during the session, as they might apply to their own organisations.
The BoF will also offer participants the opportunity to establish or strengthen connections with the broader PID community in Australia and internationally.


Biography:

Natasha Simons is Program Leader, Skills Policy and Resources with the Australian National Data Service.

IGSN: a persistent identifier for physical samples

Adrian Burton1, Jens Klump2, Lesley Wyborn3, Gerry Ryder4

1Australian Research Data Commons, Canberra, Australia, adrian.burton@ardc.edu.au

2CSIRO Mineral Resources, Perth, Australia, jens.klump@csiro.au

3National Computational Infrastructure, Canberra, Australia, lesley.wyborn@anu.edu.au

4Australian Research Data Commons, Adelaide, Australia, gerry.ryder@ardc.edu.au

 

OVERVIEW

The International Geo Sample Number (IGSN) is designed to provide an unambiguous, globally unique persistent identifier for physical samples. It facilitates the location, identification, and citation of physical samples used in research. While applicable to any type of physical sample, impetus for the IGSN has come largely from the earth science community, where IGSNs are assigned to geologic and environmental samples such as rocks, drill cores, soils, water and gas, as well as related sampling features such as sections, dredges, wells and drill holes.

The IGSN system is underpinned by the Handle System and is governed by an international organisation, the IGSN Implementation Organization e.V.

BENEFITS OF IGSN

There are numerous examples of the fundamental role persistent identifiers play in the global sharing of information, resources and objects.  The DOI is one widely known example, while others such as ORCID are rapidly gaining traction in the research community.

Assigning IGSN to samples:

  • facilitates the discovery, access, sharing and citation of samples
  • supports preservation and access of sample data
  • aids identification of samples in the literature
  • supports tracking of samples across laboratories and sample storage
  • advances the exchange of digital sample data among interoperable data systems, for example by enabling a sample to be linked to the:
    • data derived from it
    • literature where the sample and data are interpreted
    • curator or collector of the sample.

IGSN IN AUSTRALIA

There are four agencies in Australia implementing IGSN.  All have taken up membership of IGSN e.V. to become IGSN allocating agents for identified stakeholder groups that collect or curate earth science samples for research.

  • Curtin University: allocating agent for Curtin University facilities, staff and HDR students
  • CSIRO: allocating agent for CSIRO facilities and staff
  • Geoscience Australia: allocating agent for Geoscience Australia facilities and staff, and those associated with State Geological Surveys
  • Australian Research Data Commons (ARDC): allocating agent for University staff and those working in publicly funded research organisations not covered above

ARDC IGSN SERVICE

The ARDC IGSN service was developed in collaboration with AuScope as a key component of the Geoscience Data Enhanced Virtual Laboratory (GeoDEVL) project and was released in July 2018.

Criteria for using the ARDC IGSN service:

  • the six mandatory metadata elements must be provided at the time of registration (see the sketch after this list); providing additional descriptive metadata will increase the potential for discovery, reuse and citation of the registered sample
  • the sample being identified should be associated with an Australian research activity
  • IGSN identifiers should resolve to a metadata record describing the sample
  • the sample being identified, and associated metadata, should be curated through the research and sample lifecycle
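To make the registration workflow concrete, the sketch below shows what minting an IGSN against a registration service could look like. The endpoint, credentials and field names are hypothetical placeholders, not the actual ARDC API, and the metadata fields merely illustrate the kind of descriptive elements the six mandatory elements cover:

```python
import requests

# Hypothetical service endpoint and credentials -- placeholders only,
# not the actual ARDC IGSN API.
SERVICE_URL = "https://igsn.example.org/api/registration"

sample_metadata = {
    # Illustrative descriptive elements; the ARDC service defines six
    # mandatory elements that must be supplied at registration time.
    "name": "Drill core GA-2018-042, interval 120.5-121.0 m",
    "sampleType": "core",
    "material": "rock",
    "collector": "Example Survey Party",
    "collectionTime": "2018-05-14",
    "location": {"lat": -23.70, "lon": 133.88, "datum": "GDA94"},
}

resp = requests.post(
    SERVICE_URL,
    json=sample_metadata,
    auth=("allocating-agent", "secret"),  # placeholder credentials
    timeout=30,
)
resp.raise_for_status()
# The minted identifier should then resolve to a metadata record
# describing the sample, per the criteria above.
print("Minted IGSN:", resp.json().get("igsn"))
```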

FUTURE DIRECTIONS

While the scope of the ARDC IGSN service is currently limited to earth science samples, the ARDC is interested in working with other communities to extend the service to other physical sample types such as vegetation, archaeological and biological specimens. Development work to extend the service is anticipated to commence in Q1 2019, and the ARDC welcomes enquiries from prospective users.

It is worth noting that IGSN e.V. recently secured a Sloan Foundation grant to enable further development of IGSN technical infrastructure and governance. The ARDC will have an active role in this project, which represents a significant investment in the future sustainability of the IGSN system.


Biography:

Dr Adrian Burton is Director, Services with the Australian National Data Service.

Reviving an old and valuable collection of microscope slides of physical samples through the use of Citizen Science

Mr John Pring1, Dr Lesley Wyborn2, Mr Neal Evans1

1Geoscience Australia, Canberra, Australia, john.pring@ga.gov.au, neal.evans@ga.gov.au

2Australian National University, Canberra, Australia, lesley.wyborn@anu.edu.au

 

The importance of Australia’s mineral wealth has been well recognised since at least Federation in 1901; however, the perceived importance and value of the underlying data has fluctuated.

Through successive agencies the Australian Federal Government has collected a considerable quantity of physical samples and data over the last 100 years including historically significant samples, many of which cannot be replaced as the source locations are no longer accessible.  One of the more valuable collections now hosted by Geoscience Australia (GA) comprises 250,000+ microscope slide thin sections of these physical samples collected during hundreds of field mapping campaigns from across Australia, Papua New Guinea, Antarctica and beyond.

Figure 1: BMR Field Camp 1956

With the progress of time and technology, and the human tendency to access only what is readily available, the largely paper-based management system for the slide collection has seen use of this public collection decline greatly since its heyday in the latter half of the 20th century.

GA initiated a project to rescue the microscope collection and its metadata. Much of the metadata was recorded on handwritten cards or in log books, and needed to be captured and then updated to be compatible with current GA online management systems. With the tight fiscal constraints on the agency, there were insufficient geoscience experts available for the task, which necessitated a non-traditional approach to capturing the large quantity of card- and register-based information in a usable digital form. The project decided to make extensive use of the DigiVol [1] citizen science portal to initially transcribe the paper-based records, letter for letter and number for number, using citizen scientists with no geological expertise.

However, because of the age of the collection, it was not just a simple matter of transcribing handwritten data and making this information available as is. The legacy information had to be updated if it was to be reusable and compatible with modern GA corporate databases, particularly for content that now follows international standards and specifications for digital data that did not exist when the original samples and descriptive information were collected. A few subject matter experts (SMEs), including volunteer retirees who collected some of the material, were then involved in a consultative manner for the data validation stage. Firstly, the location of each sample needed to be translated into modern datums and spatial referencing techniques. Some of the locations needed to be retrieved from pin holes in air photographs or text-based location descriptions (e.g. “Fullerton Gully 3.5M S.S.E. Gurrumba” [2]). Because of the uncertainty of many of the locations, care was also taken to record the accuracy of each position, which in some cases was +/- several kilometres. Secondly, the SMEs provided valuable expertise to help update the information to modern standards so that it could be seamlessly integrated into the GA databases. Once there, it will be possible to make this legacy data available to industry, research and the general public through the current GA data access mechanisms.
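As an illustration of the datum translation step, the sketch below converts a legacy AGD66 coordinate to GDA94 using the pyproj library, carrying an explicit positional uncertainty alongside the result (the coordinate values are invented for illustration):

```python
from pyproj import Transformer

# AGD66 geographic (EPSG:4202) -> GDA94 geographic (EPSG:4283).
transformer = Transformer.from_crs("EPSG:4202", "EPSG:4283", always_xy=True)

# Invented legacy record: a position scaled off an old map sheet.
legacy_lon, legacy_lat = 145.1275, -17.6780
uncertainty_m = 2000  # position known only to +/- ~2 km

# With always_xy=True, inputs and outputs are (longitude, latitude).
gda94_lon, gda94_lat = transformer.transform(legacy_lon, legacy_lat)
print(f"GDA94: {gda94_lat:.5f}, {gda94_lon:.5f} (+/- {uncertainty_m} m)")
```

Recording the uncertainty as a first-class field, rather than discarding it, is what allows downstream users to judge whether a given sample location is fit for their purpose.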

Using citizen scientists for the valuable initial transcription made much more effective use of the few SMEs available to the project: they could focus on improving the quality of the information and providing consulting support to the citizen scientists.

This presentation will explore the approach taken by Geoscience Australia and the benefits to the organisation, the roles of citizen science participants (without whom this legacy collection would not have been made accessible), and the untapped potential for this valuable new data collection.

References

  1. DigiVol citizen science transcription site available from https://volunteer.ala.org.au/ accessed 21 June 2018
  2. Geoscience Australia Rock Register #2, page 34 (Reg No. 16583)

Biography:

John Pring holds a Masters of Management Studies (Project Management/Technology and Equipment) from the University of New South Wales and an Electrical Engineering Degree from the University of Southern Queensland.

He has been Senior Project Manager within the Environmental Geoscience Division of Geoscience Australia for some 10 years and has run a number of projects associated with the management of the agency’s data and physical collections over that time.

He has held similar roles within other government agencies prior to joining Geoscience Australia.

CSIRO Knowledge Network: supporting tailored data discovery and access

Jonathan Yu1, Benjamin Leighton2, Jevy Wang3, Hendra Wijaya4

1CSIRO L&W, Clayton, VIC, Australia, jonathan.yu@csiro.au

2CSIRO L&W, Clayton, VIC, Australia, ben.leighton@csiro.au

3CSIRO L&W, Black Mountain, ACT, Australia, jevy.wang@csiro.au

4CSIRO/Data61, North Ryde, NSW, Australia, hendra.wijaya@csiro.au

 

Discovery and access of data to support research projects and policy analysis is currently limited. While many services increasingly publish data, for researchers and policy analysts these data are not easily discoverable and accessible, not comprehensive, and not linked with tools and approaches that promote their use. On the other hand, data providers are often disconnected from user groups and lack the ability to capture, attribute and accrue value to justify business cases for making their data more discoverable, accessible, interoperable and reusable. This is a barrier that limits the ability to develop repeatable, evidence-based policy analysis and research in Australia.

CSIRO is developing the Knowledge Network (KN) platform (https://kn.csiro.au), which provides a gateway to data published via a range of data initiatives, including NCRIS and open government data initiatives. KN harvests and indexes known data records from multiple data repositories in government and research, then makes them available so that anyone can discover, access and share links to data at the collection level and at the individual file or service level, all in one platform.
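KN’s own harvesters are not described here, but the underlying pattern is familiar: page through a repository’s catalogue API and hand each record to an indexing step. A minimal sketch against a CKAN-style catalogue (the base URL is a placeholder; KN’s actual harvest sources and pipeline may differ):

```python
import requests

# Placeholder CKAN-style catalogue; KN's actual sources may differ.
BASE = "https://data.example.gov.au"

def harvest(rows: int = 100):
    """Page through a CKAN catalogue and yield dataset records."""
    start = 0
    while True:
        resp = requests.get(
            f"{BASE}/api/3/action/package_search",
            params={"rows": rows, "start": start},
            timeout=30,
        )
        resp.raise_for_status()
        result = resp.json()["result"]
        for record in result["results"]:
            yield record  # hand off to the indexing step
        start += rows
        if start >= result["count"]:
            break

for rec in harvest():
    # Each CKAN package carries its file/service-level "resources",
    # which is what enables file-level indexing and search.
    print(rec["name"], "-", len(rec.get("resources", [])), "resources")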

By having dataset- and file-level information available in the KN platform, researchers can leverage these in online platforms, including data analytics environments (e.g. virtual laboratories or science gateways) as well as web applications tailored for specific communities. KN is currently being used in the ‘EcoScience Research Data Cloud and Data Enhanced Virtual Laboratory’ project (ecocloud for short) [1] to enable discovery and access to third-party data for use with the ecocloud compute platform. In particular, KN powers discovery and access via the ecocloud explorer, which displays a tailored set of search results relevant to the ecological science domain. Ecocloud users, such as researchers or policy analysts, can discover and access relevant data in the ecocloud explorer and obtain code snippets for its use in ecocloud compute environments. More broadly, the current APIs provide the means for other projects and initiatives to build a tailored view of data from a comprehensive superset that aims for national coverage.

Because dataset- and file-level metadata is indexed in KN, it also provides opportunities for quantitative surveys of the data landscape, particularly in Australia, enabling analysis and reporting on its current state [2,3]. Understanding the current state of the data landscape allows data-driven insight into trends and gaps in data initiatives over time, based on the metadata and the datasets themselves. Specifically, it allows a data-driven picture of emerging topics and activities for specific scientific and research communities as well as public and private sector agencies, and in turn opportunities to assess improvements in future initiatives.

In this presentation, we provide an overview of the KN technical architecture, its use in a virtual laboratory context, and a discussion around data-driven insights that can be gained from the KN platform to inform a ‘state of the data’ picture for Australia.

REFERENCES

  1. EcoCloud, https://www.ecocloud.org.au, accessed 20 June 2018
  2. Yu, J., et al., Survey of open data and research data in the Australian context via the CSIRO Knowledge Network, eResearch Australasia, Brisbane, Australia, October 2017
  3. Yu, J., et al., Visualising the Australian open data and research data landscape, Collaborative Conference on Computational and Data Intensive Science, 2018 (C3DIS 2018), Melbourne, Australia, May 2018, DOI: 10.13140/RG.2.2.33826.32964

Biography:

Dr Jonathan Yu is a data scientist researching information and web architectures, data integration, Linked Data, data analytics and visualisation and applies his work in the environmental and earth sciences domain. He is part of the Environmental Informatics group in CSIRO Land and Water. He currently leads a number of initiatives to develop new approaches, architectures, methods and tools for transforming and connecting information flows across the environmental domain and the broader digital economy within Australia and internationally.

Changes in national ethics policy for managing and sharing human research data

Kate LeMay1

1Australian Research Data Commons, Canberra, Australia, kate.lemay@ands.org.au

 

There is a strong national and international movement from both funders and publishers of research, and in particular medical research, towards requiring digital data outputs of research to be well managed and available for appropriate reuse by other researchers. Institutional ethics policies also play a key role in determining how long and where data should be retained, and if and how it can be shared. These ethics policies are based upon the National Statement on Ethical Conduct in Human Research, which is owned by the National Health and Medical Research Council (NHMRC).

This session will examine the new version of the National Statement on Ethical Conduct in Human Research, and ways in which institutions, ethics committees and researchers can comply with the new requirements for data management and sharing.

Managing access to shared data

The Five Safes [1] framework for managing access to data is an excellent basis for planning access to sensitive data. By considering the five aspects of projects, people, data, settings and outputs, it addresses the risks in each of these areas and provides choices among a variety of ways to manage access.

Research data can be openly described in data repositories without making the data itself openly available; this is called mediated access. This approach is consistent with making data Findable, Accessible, Interoperable and Reusable (FAIR) [2], and can form part of the Five Safes framework. There are many ways of mediating access to sensitive data, and some examples will be given in this session.

Consent

Sufficient and voluntary consent for data sharing is vital. Controls around governance, access, use, release, confidentiality and privacy of the data should be made clear during the ethical approval process, and also to participants in the research when obtaining consent. Appropriate consent must be obtained from participants for the reuse of research data. Strategies to incorporate data sharing into the ethical approval and consent processes will be discussed.

When research data is reused, the reuse must comply with the consent agreement originally formed with the participants. It may be appropriate to offer participants levels of consent, e.g. over the degree of identifiability or aggregation of their data when it is made available for reuse.

Often researchers are concerned that participants will not consent to their research if they ask for permission to share the data after the conclusion of the project. However, there is a growing body of research around positive participant attitudes towards data about them, even medical data, being reused for research purposes.

Conclusion

The management, retention, and appropriate sharing of research data is increasingly recognised as an important part of the research lifecycle. This is being recognised in national policies, such as the new version of the National Statement on Ethical Conduct in Human Research. Ways in which institutions and researchers can appropriately manage and share human research data will be outlined.

REFERENCES

  1. Desai, T., Ritchie, F. and Welpton, R. Five Safes: designing data access for research. 2016. DOI: 10.13140/RG.2.1.3661.1604
  2. FAIR data. Available from: http://www.ands.org.au/working-with-data/fairdata, accessed 30 May 2018.

Biography:

Kate LeMay began her career as a Pharmacist, working in both community and hospital settings. She moved on to the University of Sydney and Woolcock Institute of Medical Research, where she worked on community pharmacy based programs to assist patients with chronic disease management. Kate is now in Canberra, Australia, at the Australian Research Data Commons (ARDC) as a Senior Research Data Specialist, focusing on health and medical data.

Medical Imaging: Federation and Compute

Chris Albone1, Ryan P Sullivan1,2

1Information and Communications Technology, University of Sydney, Sydney Australia

2Core Research Facilities, DVC-R, University of Sydney, Sydney Australia

chris.albone@sydney.edu.au, ryan.sullivan@sydney.edu.au

 

SUMMARY

XNAT is an imaging data platform that has been rapidly gaining popularity throughout Australian research institutions and facilities, and worldwide [1]. It has been adopted as part of the National Imaging Facility (NIF) Trusted Data Repository (TDR) program to provide a standard framework for medical imaging and data provenance.

Similar efforts are underway on the computational side with the Characterization Virtual Lab (CVL) under the Data Enhanced Virtual Lab (DeVL) program funded by NRDC, providing a workbench dedicated to neuroimaging. NIF@UQ has also been working on a DICOM2Cloud project to facilitate automated anonymization of data for computation on public cloud environments.

The University of Sydney is using XNAT as a key component of our Imaging Data Service and has combined it with compute on our HPC and VRD, as well as the CVL and GUI informatic pipeline platforms. We are also a participant in the C-DeVL program, developing a Windows version of the CVL workbenches. Research is inherently multi-institutional, and projects will span multiple repositories and computation infrastructure. We would like to raise the natural question of federation of these aligned projects.

FORMAT

We propose a 60 min roundtable with representatives of institutions running XNAT, or looking at deploying XNAT systems. The roundtable will discuss the following:

  1. What is the current status of deployments? Plans for the immediate future. (20 min)
  2. What might XNAT federation look like? Federated metadata search? Federated data search? (15 min)
  3. CVL is being federated. What about other characterization and informatics workflow platforms? A shared repository of Singularity/Docker pipelines to use in XNAT and/or HPC? (15 min)
  4. Should a standard anonymization toolset be adopted when transferring between these repositories and centers of compute? (10 min)

Biography:

Dr Sullivan is a biophysicist with an interest in neural implants. His research led him into software development for automatic characterization of implants and neural tissue. Dr Sullivan joined the University of Sydney in 2017 where he now works on eResearch projects focusing on characterization domains.

Imaging Data Service: Ingestion, Storage, and Compute

Ryan Sullivan1, Haofei Feng2, Vipul Patel3, Murray-Luke Peard4, Chris Albone5

1University of Sydney, Sydney, ryan.sullivan@sydney.edu.au

2University of Sydney, Sydney, haofei.feng@sydney.edu.au

3University of Sydney, Sydney, vipul.patel@sydney.edu.au

4University of Sydney, Sydney, murray-luke.peard@sydney.edu.au

5University of Sydney, Sydney, chris.albone@sydney.edu.au

 

Introduction

Imaging research, clinical, preclinical, or otherwise, is often multisite, multimodal, and compute intensive. XNAT is an imaging data platform that has been rapidly gaining popularity both worldwide and throughout Australian research institutions and facilities [1]. As part of the University of Sydney’s Core Research Facility program, we have developed our Imaging Data Service (IDS) using XNAT as one of the core technologies. IDS is able to ingest, store, and analyse data in an automated and compliant manner to facilitate clinical workflows.

We have connected instruments in the Sydney Imaging Core Facility and I-Med, a local clinical site, and will be expanding to cover instruments in three schools along with additional clinical sites over the coming year. We will discuss challenges we’ve encountered in developing these systems, as well as hurdles in dealing with patient privacy and vendor software.

PROCESS

Acquired images are passed directly from equipment to a Clinical Trials Processor (CTP) or Research Automated Project Allocator & Anonymiser (RAPPA) on site, where direct patient identifiers are stripped in a compliant manner before the data is sent to the XNAT repository. The direct identifiers are stored on-site in such a way that allows automated re-association of derivative data and analysis results on site to facilitate clinical workflows. Other patient data not captured at the instrument are stored in a separate REDCap system, linked with a common anonymised key. This allows a higher granularity of control to address the different needs of a variety of projects and sites based on patient consent. Data from other repositories, such as historical data on our Research Data Share (RDS), may also be batch uploaded to the new system.
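The sketch below illustrates the core idea of stripping direct identifiers while retaining a re-association key, using the pydicom library. It is a simplified illustration only, not the CTP or RAPPA implementation, and a compliant anonymiser must handle far more tags than shown:

```python
import hashlib
import pydicom

def anonymise(path_in: str, path_out: str, site_secret: str) -> str:
    """Strip direct identifiers from a DICOM file, keeping a linkage key."""
    ds = pydicom.dcmread(path_in)

    # Derive a stable anonymised key so derivative data and results can be
    # re-associated on site; the site secret never leaves the site.
    key = hashlib.sha256(
        (site_secret + str(ds.PatientID)).encode()
    ).hexdigest()[:16]

    # Remove direct identifiers (illustrative subset only).
    ds.PatientName = "ANON"
    ds.PatientID = key
    for keyword in ("PatientBirthDate", "PatientAddress", "OtherPatientIDs"):
        if keyword in ds:
            delattr(ds, keyword)
    ds.remove_private_tags()

    ds.save_as(path_out)
    return key  # the same key links the REDCap record for this patient
```

The common anonymised key is what allows a separate system such as REDCap to hold the richer patient data while the imaging repository holds none of the direct identifiers.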

Once stored in XNAT, researchers may access their data using AAF authentication via web browser, or through multiple clients and connected platforms using the REST API. We have implemented XNAT’s pipeline engine using containerised workflows run on Artemis, our HPC, as the backend, with the future aim of being able to run on private and public clouds. Alternatively, researchers may use resources such as Argus, Sydney’s Virtual Research Desktop, or the ARDC’s curated Characterization Virtual Lab (CVL). Finally, we look at integration with two informatics platforms, JupyterHub and Nipype, through which workflows may be developed. This gives researchers the freedom to choose the desired technologies for their particular workflows.
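For example, listing projects and their subjects through XNAT’s REST API takes only a few calls. A minimal sketch using session authentication (host and credentials are placeholders; in practice accounts are AAF-backed):

```python
import requests

XNAT = "https://xnat.example.edu.au"  # placeholder host
session = requests.Session()
session.auth = ("researcher", "password")  # placeholder credentials

# List the projects visible to this user.
projects = session.get(
    f"{XNAT}/data/projects", params={"format": "json"}, timeout=30
).json()["ResultSet"]["Result"]

for proj in projects:
    pid = proj["ID"]
    # List the subjects within each project.
    subjects = session.get(
        f"{XNAT}/data/projects/{pid}/subjects",
        params={"format": "json"},
        timeout=30,
    ).json()["ResultSet"]["Result"]
    print(pid, "-", len(subjects), "subjects")
```

The same API is what the pipeline engine and clients such as JupyterHub notebooks use to pull data into compute environments.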

From the user’s perspective, this provides a “big green button” solution to analysing their data using tested and curated pipelines, while also providing tools for power users who wish to delve deeper into informatics development.

Figure 1: High-level overview of data flow in our Imaging Data Service. Orange: systems belonging directly to the University of Sydney; blue: partner institutions; green: NCRIS capabilities. Partially transparent items are planned over the coming year but not yet in production.

FUTURE WORK

We continue to look at developing sustainable DevOps frameworks for pipelines to allow the system to be self-sustaining and get ICT “out of the way.”  Next steps are continued rollout to appropriate faculties, improved auditing and reporting frameworks for research integrity, operations ROI, and data provenance. We are also interested in discussing interfacing with other similar systems meeting the TDR standard.

REFERENCES

  1. Marcus, D. S., Olsen, T. R., Ramaratnam, M., Buckner, R. L., The extensible neuroimaging archive toolkit. Neuroinformatics, 2007. DOI: 10.1385/NI:5:1:11

Biography:

Dr Sullivan is a biophysicist with an interest in neural implants. His research led him into software development for automatic characterization of implants and neural tissue. Dr Sullivan joined the University of Sydney in 2017 where he now works on eResearch projects focusing on characterization domains.

Collecting and publishing dataset usage and citations at the ALA

Nick dos Remedios1, Javier Molina 2, Simon Bear 3, Patricia Koh4

1Atlas of Living Australia, Canberra, Australia, nick.dosremedios@csiro.au

2Atlas of Living Australia, Canberra, Australia, javier.molina@csiro.au

3Atlas of Living Australia, Canberra, Australia, simon.bear@csiro.au

4Atlas of Living Australia, Canberra, Australia, patricia.koh@csiro.au

 

The Atlas of Living Australia (ALA) [1] is an NCRIS-funded national biodiversity data aggregator. Founded on the principle of open data sharing – collect it once, share it, use it many times – the ALA provides free, online access to over 70 million occurrence records, forming the most comprehensive and accessible dataset on Australia’s biodiversity ever produced.

Dataset owners and providers are an important stakeholder group for the ALA, and one of the benefits of sharing their data with the ALA is that we are able to provide data usage and citation statistics back to them. Each dataset has a metadata web page on the ALA that provides details about the institution, research, contacts and description for that dataset. On this page, there is a detailed breakdown of how many user-generated downloads contained records from their dataset, covering the past month, 6 months, 12 months and all time.

Recently the ALA has added a new feature to data downloads, whereby a DOI is automatically generated for every user download event. Researchers are encouraged to link this data DOI to any publication DOI that uses the data. In addition, the ALA has collaborated with the Global Biodiversity Information Facility (GBIF) [2] to allocate a DOI to a large percentage of datasets, with the aim of covering all datasets in the near future. By using citation-linking tools, download DOIs can be linked to their dataset DOIs, making it possible to track publications via the DOI chain back to each dataset.
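As an illustration of the occurrence services that sit behind these downloads, the sketch below queries the ALA’s public biocache web service for records of a taxon (endpoint as documented at the time of writing; the DOI-minting download flow itself requires an authenticated download request and is not shown):

```python
import requests

# ALA biocache occurrence search (public web service).
resp = requests.get(
    "https://biocache-ws.ala.org.au/ws/occurrences/search",
    params={"q": 'taxon_name:"Phascolarctos cinereus"', "pageSize": 5},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

print("Total records:", data["totalRecords"])
for occ in data["occurrences"]:
    print(occ.get("scientificName"), "-", occ.get("stateProvince"))
```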

REFERENCES

  1. Atlas of Living Australia (ALA) – https://www.ala.org.au/
  2. Global Biodiversity Information Facility (GBIF) – https://gbif.org/

Biography:

Nick completed a PhD in comparative immunology at UTS before taking up software development in the airline industry. He then worked for an IP-focused, not-for-profit research NGO called CAMBIA before taking up a role as senior developer at the Atlas of Living Australia (CSIRO), where he now works.

Promoting Data Management “best practice” at UNSW

Jacky Kwun Lun Cho1, Adrian W. Chew2, Cecilia Stenstrom2, Luc Betbeder-Matibet2

1UNSW Sydney, NSW, 2052, Australia, E-mail: k.cho@unsw.edu.au

2UNSW Sydney, NSW, 2052, Australia

 

Introduction

Good data management is recognised by many as a key to good research (e.g. de Waard, 2016). However, achieving good data management practices (e.g. Corti, Van den Eynden, Bishop, & Woollard, 2014) at the University of New South Wales (UNSW) has proved challenging over the last five years, and UNSW’s data management approach is being modified to respond to this challenge. Two systems were initially developed to facilitate good data management practices: the ResData platform, which includes research data management planning (RDMP) and data publication capabilities; and Data Archive, UNSW’s institutional archival storage platform. These two systems, implemented from 2013, were designed by UNSW IT and the Library to address specific concerns around data management (e.g. lack of RDM planning), alignment with UNSW obligations, and evolving requirements from both local and global regulatory and funding bodies.

A recent UNSW data management review identified that the two systems, though functional, lacked crucial integrations with the various storage platforms regularly used by researchers, resulting in inconsistent data management practices. In addition, the initial approach was largely information- and expert-driven, and lacked user-driven support structures (e.g. training and a dedicated helpdesk). As a result, there was poor engagement with UNSW’s data management approach. In light of the review, UNSW has changed its approach and will implement a data management “best practice” strategy over the next few years, with a focus on designing and developing data management systems and support structures with end users in mind. Proactive engagement with end users will be crucial to the success of this strategy (see Dierkes & Wuttke, 2016).

This paper will present some findings from the review and outline plans for researcher development in data management practice. UNSW aims to adopt a user-driven approach to identify the key pain points of UNSW researchers and to promote “best practice” behaviour through a blend of training, engagement and support, aided by tools that streamline the handling of data throughout the lifecycle of a research project.

References

Corti, L., Van den Eynden, V., Bishop, L., & Woollard, M. (2014). Managing and sharing research data: A guide to good practice. Thousand Oaks, CA: SAGE Publications Inc.

de Waard, A. (2016). Research data management at Elsevier: Supporting networks of data and workflows. Information Services & Use, 36(1/2), 49-55. doi:10.3233/ISU-160805

Dierkes, J., & Wuttke, U. (2016). The Göttingen eResearch Alliance: A case study of developing and establishing institutional support for research data management. ISPRS International Journal of Geo-Information, 5(8), 133.


Biography:

Jacky Cho is a project officer in the office of PVC-Research Infrastructure at the University of New South Wales. Prior to this role, he was a researcher in surface and physical chemistry with a PhD in chemistry. In his current role, he is responsible for delivering services and support to researchers at UNSW to enhance research activities, including promoting holistic research data management. https://orcid.org/0000-0001-7591-100X

From monoliths to clusters: transforming ALA’s infrastructure

Matt Andrews1, David Martin2, Mahmoud Sadeghi3, Peter Ansell4, Adam Collins5, Miles Nicholls6

1Atlas of Living Australia, Canberra, Australia, matt.andrews@csiro.au

2Atlas of Living Australia, UK, david.martin@csiro.au

3Atlas of Living Australia, Canberra, Australia, mahmoud.sadeghi@csiro.au

4Atlas of Living Australia, Canberra, Australia, peter.ansell@csiro.au

5Atlas of Living Australia, Canberra, Australia, adam.collins@csiro.au

6Atlas of Living Australia, Canberra, Australia, miles.nicholls@csiro.au

 

The Atlas of Living Australia, as a project committed to the principles of open source software and open data, has made all of its operating code available to the public. ALA software is used by many projects around the world to build national databases of biodiversity [1]. Running production infrastructure to meet the needs of researchers and the general public is a significant challenge, with an ever-growing dataset both in terms of the number of records and the width of the data, particularly with the prospect of adding trait information.

BEGINNINGS

In early 2017, the Atlas of Living Australia faced a series of problems with its core infrastructure.  While the Atlas had been well designed from the start to operate as an ecosystem of independent services, several of the key services were looking increasingly unsustainable.  Almost all were running on single large servers: a fragile position which meant that any significant maintenance or performance issue would take down the whole service. One of the primary components of the Atlas, which we call “biocache”, is responsible for the search, display, and download of occurrence records. It was running on very old versions of Cassandra and Solr.  A full update of data, to process and index the full dataset and make it available for search, was taking several days to complete, and if it hit an error, we had to run the whole thing again. This reduced the number of times that new data quality rules, spatial information, and new taxonomic names were applied to the Atlas records. We were in an uncomfortable position, and hitting the limits of what we had.

A PLAN

To solve this combination of problems – old unsupported software versions, single monolithic servers, very slow data processing – we decided to make a fundamental change in the internal architecture of each of the services within the biocache component.

With the two primary data stores, Cassandra and Solr, this meant not only migrating from old, unsupported versions to recent ones, but also moving each to a clustered pool of servers. Data replication across the pool adds significant resilience, as the pool can tolerate the loss of a server, and allows much less disruptive system maintenance. Splitting the data into smaller segments means that processing and indexing can be performed with more parallelism, thanks to the increased IO bandwidth, making these tasks dramatically faster.
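In Cassandra, replication is configured per keyspace. A minimal sketch with the DataStax Python driver (node addresses, keyspace name and replication factor are illustrative, not the ALA’s actual settings):

```python
from cassandra.cluster import Cluster

# Illustrative contact points; the client discovers the rest of the cluster.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# Keep three copies of every row so the pool tolerates the loss of a node
# and allows rolling maintenance without downtime.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS biocache
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'datacenter1': 3
    }
""")
cluster.shutdown()
```

With a replication factor of three, any single node can be taken out for maintenance while reads and writes continue against the remaining replicas.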

Apart from the large data stores, the biocache component contains user-facing web applications.  These had been running on single servers, which was inflexible, more difficult to maintain and entailed a higher risk of downtime.  We decided to move to a model where each of these web applications would be run in a pool of servers behind load balancers. With this approach, a server can be removed from a pool for maintenance with no loss of service, and the system is resilient to a significant range of disaster scenarios.

MOVING MOUNTAINS

The development effort in building these new systems largely focused on the new clustered versions of Cassandra and Solr. For Cassandra, the move from a single large server running the long-unsupported version 1 to an up-to-date, replicated cluster across several nodes represented a fundamental shift. We started in Britain, with the Atlas closely involved in building the infrastructure for the new National Biodiversity Network (NBN) [1], which uses ALA software to host data for the UK.

After months of development effort, a new version of the core Cassandra and Solr systems was in place.  With a cluster of four Cassandra nodes, and a Solr Cloud environment with eight nodes, the UK infrastructure was performing exceptionally well, querying a large dataset successfully, and with far better fault tolerance, system resilience and much faster data ingestion and processing.

Once the new approach had been proven in the UK, the Australian team began building an equivalent set of systems for the ALA and implementing the new approach of running web applications as pools behind load balancers. Around twelve months of effort from several developers went into planning, developing and testing the project, using the UK work as a reference point.

FLEXIBILITY AND RESILIENCE: BETTER SUPPORT FOR RESEARCH

With the new clustered infrastructure in place, this core component of the Atlas is now considerably more resilient to unexpected outages or spikes in demand, while delivering dramatically better results for us in the data ingestion, processing and indexing area.

In the old systems, indexing our full dataset of around 75 million records took 24 hours or more; in the new clustered infrastructure, this operation takes around three and a half hours. Similarly, full processing and sampling of the ingested data used to take up to six days; in the new system it takes about 11 hours. These spectacular performance improvements give us the space to speed up our data ingestion cycle.

By running all our public-facing services in the biocache component as pools of servers behind load balancers, we are considerably more robust in being able to keep the services running even if a single server goes down, or indeed even if a whole data centre goes down.

The other significant advantage of running a clustered infrastructure is scalability: adding capacity to handle more records, or a wider range of fields for each record, can be achieved without pain and with little or no disruption of service. This will be valuable as we prepare to add many new fields to handle trait information. There are many more areas of improvement that lie ahead for the Atlas’ infrastructure, but overall this fundamental shift in how we run our biocache component has been a success.

REFERENCES

  1. National Biodiversity Network website, About the NBN Atlas page: https://nbnatlas.org/about-nbn-atlas/, accessed 7 June 2018.

Biography:

Matt Andrews has been involved in herding software for several years now.  His past lives include being chauffered between cities handcuffed to a briefcase, making TV ads for contraceptives in Papua New Guinea, training programmers in French bars, being escorted around China by a posse of secret police, and going to school in an Italian village.  After running tech operations for a network of European dating sites, and then for various corners of the Federal Government, he’s now managing DevOps for the Atlas of Living Australia.
