Promoting Data Management “best practice” at UNSW

Jacky Kwun Lun Cho1, Adrian W. Chew2, Cecilia Stenstrom2, Luc Betbeder-Matibet2

1UNSW Sydney, NSW, 2052, Australia, E-mail: k.cho@unsw.edu.au

2 UNSW Sydney, NSW, 2052, Australia

 

Introduction

Good data management is recognized by many as a key to good research (e.g. de Waard, 2016). However, achieving good data management practices (e.g. Corti, Van den Eynden, Bishop, & Woollard, 2014) at the University of New South Wales (UNSW) has proved challenging over the last five years, and UNSW’s data management approach is being modified to respond to this challenge. Two systems were initially developed to facilitate good data management practices: the ResData platform, which includes research data management planning (RDMP) and data publication capabilities; and Data Archive, UNSW’s institutional archival storage platform. These two systems, implemented from 2013, were designed by UNSW IT and the Library to address specific concerns around data management (e.g. lack of RDM planning), alignment with UNSW obligations, and evolving requirements from both local and global regulatory and funding bodies.

A recent UNSW data management review identified that the two systems, though functional, lacked crucial integrations with the various storage platforms regularly used by researchers, which resulted in inconsistent data management practices. In addition, the initial approach was largely information- and expert-driven, and did not have support structures (e.g. training and a dedicated helpdesk) that were user-driven. As a result, there was poor engagement with the UNSW data management approach. In light of the review, UNSW has changed its data management approach and will implement a data management “best practice” strategy over the next few years, with a focus on designing and developing data management systems and support structures with end-users in mind. Pro-active engagement with end-users will be crucial to the success of this strategy (see Dierkes & Wuttke, 2016).

This paper will present some findings from the review and outline plans for researcher development in data management practice. UNSW aims to adopt a user-driven approach to identify the key pain points of UNSW researchers and promote a behaviour of “best practice” through a synergistic blend of training, engagement and support, with the aid of tools to streamline the handling of data throughout the lifecycle of a research project.

References

Corti, L., Van den Eynden, V., Bishop, L., & Woollard, M. (2014). Managing and sharing research data: A guide to good practice. Thousand Oaks, CA: SAGE Publications Inc.

de Waard, A. (2016). Research data management at Elsevier: Supporting networks of data and workflows. Information Services & Use, 36(1/2), 49-55. doi:10.3233/ISU-160805

Dierkes, J., & Wuttke, U. (2016). The Göttingen eResearch Alliance: A case study of developing and establishing institutional support for research data management. ISPRS International Journal of Geo-Information, 5(8), 133.


Biography:

Jacky Cho is a project officer in the office of the PVC-Research Infrastructure at the University of New South Wales. Prior to this role, he was a researcher in surface and physical chemistry, with a PhD in chemistry. In his current role, he is responsible for delivering services and support to researchers at UNSW to enhance research activities, including promoting holistic research data management. https://orcid.org/0000-0001-7591-100X

From monoliths to clusters: transforming ALA’s infrastructure

Matt Andrews1, David Martin2, Mahmoud Sadeghi3, Peter Ansell4, Adam Collins5, Miles Nicholls6

1Atlas of Living Australia, Canberra, Australia, matt.andrews@csiro.au

2Atlas of Living Australia, UK, david.martin@csiro.au

3Atlas of Living Australia, Canberra, Australia, mahmoud.sadeghi@csiro.au

4Atlas of Living Australia, Canberra, Australia, peter.ansell@csiro.au

5Atlas of Living Australia, Canberra, Australia, adam.collins@csiro.au

6Atlas of Living Australia, Canberra, Australia, miles.nicholls@csiro.au

 

The Atlas of Living Australia (ALA), as a project committed to the principles of open source software and open data, has made all of its operating code available to the public. ALA software is used by many projects around the world to build national databases of biodiversity [1]. Running production infrastructure to meet the needs of researchers and the general public is a significant challenge, with an ever-growing dataset both in terms of the number of records and the width of the data, particularly with the prospect of adding trait information.

BEGINNINGS

In early 2017, the Atlas of Living Australia faced a series of problems with its core infrastructure.  While the Atlas had been well designed from the start to operate as an ecosystem of independent services, several of the key services were looking increasingly unsustainable.  Almost all were running on single large servers: a fragile position which meant that any significant maintenance or performance issue would take down the whole service. One of the primary components of the Atlas, which we call “biocache”, is responsible for the search, display, and download of occurrence records. It was running on very old versions of Cassandra and Solr.  A full update of data, to process and index the full dataset and make it available for search, was taking several days to complete, and if it hit an error, we had to run the whole thing again. This reduced the number of times that new data quality rules, spatial information, and new taxonomic names were applied to the Atlas records. We were in an uncomfortable position, and hitting the limits of what we had.

A PLAN

To solve this combination of problems – old unsupported software versions, single monolithic servers, very slow data processing – we decided to make a fundamental change in the internal architecture of each of the services within the biocache component.

With the two primary data stores, Cassandra and Solr, this meant not only migrating from old unsupported versions to more recent ones, but also moving each to a clustered pool of servers. Clustering adds significant resilience through data replication, so the pool can tolerate the loss of a server, and allows for much less disruptive system maintenance, making the system as a whole far more robust. Splitting the data into smaller segments also means that processing and indexing can be performed with more parallelism, thanks to the increased IO bandwidth, making these tasks dramatically faster.
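
The abstract does not give configuration details, but a minimal sketch of the replication idea, using the Python cassandra-driver with hypothetical node and keyspace names, might look like the following (the Solr side is analogous, with a SolrCloud collection split into shards and replicas):

```python
from cassandra.cluster import Cluster

# Contact several nodes so that losing any single server does not stop
# clients from reaching the cluster (node names are illustrative only).
cluster = Cluster(["cass-node-1", "cass-node-2", "cass-node-3"])
session = cluster.connect()

# Replicating the keyspace across three nodes lets the pool tolerate the
# loss of a server and allows rolling maintenance; the keyspace name,
# datacenter name and replication factor are assumptions for illustration.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS occurrence_store
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")
```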

Apart from the large data stores, the biocache component contains user-facing web applications.  These had been running on single servers, which was inflexible, more difficult to maintain and entailed a higher risk of downtime.  We decided to move to a model where each of these web applications would be run in a pool of servers behind load balancers. With this approach, a server can be removed from a pool for maintenance with no loss of service, and the system is resilient to a significant range of disaster scenarios.
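
As a rough illustration of why a pool behind a load balancer is more robust than a single server, the toy Python sketch below (hostnames invented, logic deliberately simplified) routes requests only to healthy members of a pool, so one member can be withdrawn without loss of service:

```python
import itertools

# Toy sketch of the pool-behind-a-load-balancer idea: traffic is spread
# across healthy servers, so a node can be removed for maintenance (or fail)
# without interrupting the service. Hostnames are illustrative only.
POOL = ["biocache-web-1", "biocache-web-2", "biocache-web-3"]
healthy = {server: True for server in POOL}
_round_robin = itertools.cycle(POOL)

def next_server() -> str:
    # Round-robin over the pool, skipping servers marked unhealthy.
    for _ in range(len(POOL)):
        server = next(_round_robin)
        if healthy[server]:
            return server
    raise RuntimeError("no healthy servers left in the pool")

healthy["biocache-web-2"] = False    # take a node out for maintenance
print(next_server(), next_server())  # requests continue on the remaining nodes
```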

MOVING MOUNTAINS

The development effort in building these new systems largely focused on the new clustered versions of Cassandra and Solr.   For Cassandra, the move from a single large server running the long-unsupported version 1 of Cassandra to an up-to-date setup running in a replicated cluster, across several nodes, represented a fundamental shift.  We started in Britain, with the Atlas closely involved in building the infrastructure for the new National Biodiversity Network (NBN)[2], which uses ALA software to host data for the UK.

After months of development effort, a new version of the core Cassandra and Solr systems was in place.  With a cluster of four Cassandra nodes, and a Solr Cloud environment with eight nodes, the UK infrastructure was performing exceptionally well, querying a large dataset successfully, and with far better fault tolerance, system resilience and much faster data ingestion and processing.

Once the new approach had been proven in the UK development, an effort began in the Australian team to build an equivalent set of systems for the ALA and to implement the new approach of running web applications as pools behind load balancers. Around twelve months of effort from several developers went into planning, developing and testing the project, using the UK work as a reference point.

FLEXIBILITY AND RESILIENCE: BETTER SUPPORT FOR RESEARCH

With the new clustered infrastructure in place, this core component of the Atlas is now considerably more resilient to unexpected outages or spikes in demand, while delivering dramatically better results for us in the data ingestion, processing and indexing area.

In the old systems, indexing our full dataset of around 75 million records took 24 hours or more. In the new clustered infrastructure, this operation takes around three and a half hours. Similarly, full processing and sampling of the ingested data used to take up to six days; in the new system it takes about 11 hours. These spectacular performance improvements give us the space to speed up our data ingestion cycle.

By running all our public-facing services in the biocache component as pools of servers behind load balancers, we can keep the services running even if a single server goes down, or indeed if a whole data centre goes down.

The other significant advantage of running a clustered infrastructure is scalability: adding capacity to handle more records, or a wider range of fields for each record, can be achieved without pain and with little or no disruption of service. This will be valuable as we prepare to add many new fields to handle trait information. There are many more areas of improvement that lie ahead for the Atlas’ infrastructure, but overall this fundamental shift in how we run our biocache component has been a success.

  1. National Biodiversity Network website, About the NBN Atlas page: https://nbnatlas.org/about-nbn-atlas/ Accessed 7 June 2018.

Biography:

Matt Andrews has been involved in herding software for several years now. His past lives include being chauffeured between cities handcuffed to a briefcase, making TV ads for contraceptives in Papua New Guinea, training programmers in French bars, being escorted around China by a posse of secret police, and going to school in an Italian village. After running tech operations for a network of European dating sites, and then for various corners of the Federal Government, he’s now managing DevOps for the Atlas of Living Australia.

Applying Lessons Learned from a Global Climate Petascale Data Repository to other Scientific Research Domains

Clare Richards1, Kate Snow2, Chris Allen3, Matt Nethery4, Kelsey Druken5, Ben Evans6

1Australian National University, Canberra, Australia, Clare.Richards@anu.edu.au

2Australian National University, Canberra, Australia, Kate.Snow@anu.edu.au

3Australian National University, Canberra, Australia, Chris.Allen@anu.edu.au

4Australian National University, Canberra, Australia, Matthew.Nethery@anu.edu.au

5Australian National University, Canberra, Australia, Kelsey.Druken@anu.edu.au

6Australian National University, Canberra, Australia, Ben.Evans@anu.edu.au

 

INTRODUCTION

NCI has established a combined Big Data/Compute repository for National Reference Data Collections, supported through the former Research Data Services (RDS) project, in which NCI led the management of the Earth System, Environmental and Geoscience Data Services, comprising primarily Climate and Weather, Satellite Earth Observations and Geophysics data. NCI has over 15 years of experience working across these domains, building the capacity, infrastructure and skills to manage the data and to make it suitable for use across these domains, as well as for uses beyond those of the domain that generated it. Over recent years it has become apparent that, as data volumes grow, discovery and reproducibility of research and workflows have to be efficient, and internationally agreed standards, data services, data accessibility and data management practices all become critically important.

One major driver for developing this type of combined Big Data/Compute infrastructure has been to address the needs of the Australian Climate community. This national priority research area is one of the most computationally demanding and data-intensive in the environmental sciences. Such data needs to reside within a focused national centre to handle the scale and dimensions of the requirements, including computation to generate the data, the computational capacity to analyse this data and the expertise to manage very large and complex data collections. A large proportion of climate data comes from the World Climate Research Programme’s Coupled Model Intercomparison Project (CMIP) and is managed and shared by an international and collaborative infrastructure called the Earth Systems Grid Federation (ESGF).

In the not too distant past, CMIP data was shipped around the globe on several hard drives. The data was difficult to keep up to date and to share with all the researchers who needed access; indeed, the data generated for CMIP has always outstripped the capacity to share it, and the volumes quickly grew beyond any capacity to distribute them in a timely fashion. For example, the CMIP3 data in 2001 was of the order of 10 TB, by 2013 CMIP5 required 1 PB, and CMIP6 is predicted to be at least 20 PB. The sheer size and complexity of the CMIP data collection makes it impossible for repositories to manage in isolation, as it is both difficult and costly to manage multiple copies of data for individual users. For climate researchers, being able to access and search such large volumes of globally distributed data for individual files that match specific criteria can be like finding several needles in many haystacks!

INTERNATIONAL COLLABORATION FOR MANAGING CLIMATE DATA

To solve this problem the ESGF, an international collaboration led by the US Department of Energy, was set up to improve the storage and sharing of the rapidly expanding petascale datasets used in Climate research globally. Since its establishment more than 10 years ago, the ESGF has continued to grow and now manages tens of petabytes of climate science and other data at dozens of sites distributed around the globe. NCI has been a Tier 1 node of the ESGF since 2009 and has invested significantly in the development of the infrastructure, the evolution of Big Data management practices, and expertise to support this international collaboration. Developing such a capability requires a long-term commitment to harmonising and maintaining a ‘system for the management, access and analysis of climate data’ that meets global community protocols whilst respecting local access policies. [1]

As an international coordinated activity, the ESGF requires intense collaboration between the data generators and repository/node partners to ensure adherence to standards and protocols that create a robust, dependable and sustainable infrastructure that improves data discovery and use across the whole global network while supporting use in the local environment. To make the system work, each node/repository is managed independently but each agrees to adhere to common ESGF protocols, software and interfaces including:

  • Compliance of data and metadata with agreed standards and conventions;
  • Version control protocols, common vocabularies, and a ‘Data Reference Syntax’; and
  • Consistent publication requirements across the distributed network.

APPLYING THE BENEFITS TO OTHER DOMAINS

One of the benefits of all participants adhering to the ESGF protocols is that a Climate researcher in Australia can search and access data locally or internationally across all participating repositories, and be confident that the data can be reliably used. This is important for scientific collaboration and sharing, verification of research, and reproducibility of results – increasingly important for the publication of research papers.  The ESGF model also demonstrates that the depth of expertise required to ensure that climate data keeps up with international standards and trends cannot be replicated across all participating nodes/repositories. However, smaller repositories can benefit from the collaborations which not only help define the standards and protocols required, but also provide expertise to develop the tools to manage this important globally distributed peta-scale data collection.

With the Big Data problem increasingly affecting other research domains, there is much that can be learned from the ESGF model of international collaboration to deliver value to other major research communities, particularly those that are seeking to be part of a shared global data services network and to move beyond managing individual stores of downloadable data files.

REFERENCES

  1. The Earth System Grid Federation, Design. https://esgf.llnl.gov/federation-design.html [Last accessed 22 June 2018].
  2. The Coupled Model Intercomparison Project (CMIP) https://www.wcrp-climate.org/wgcm-cmip [Last accessed 22 June 2018].
  3. The Earth System Grid Federation. https://esgf.llnl.gov/index.html [Last accessed 22 June 2018].

Biography:

Clare Richards is the Senior HPC Innovations Project Manager at the National Computational Infrastructure (NCI). She manages several projects and activities within the Research, Engagement and Initiatives area, including the Climate Science Data Enhanced Virtual Laboratory and other collaborative projects with NCI partners. Prior to joining ANU in 2015, she had a lengthy and diverse career at the Bureau of Meteorology and has also dabbled in marketing and media.

Trusted Data Repositories: From Pilot Projects to National Infrastructure

Keith Russell1, Andrew Mehnert2,3, Heather Leasor4, Mikaela Lawrence5

1Australian Research Data Commons

2National Imaging Facility, Australia, andrew.mehnert@uwa.edu.au

3Centre for Microscopy, Characterisation and Analysis, The University of Western Australia, Perth, Australia

4Australian National University

5CSIRO

 

DESCRIPTION

In FY 2016/17, ANDS funded the Trusted Data Repository program. This aimed to look at how to provide more trusted storage through three projects chosen to examine a number of dimensions:

  • NIF: multi-institutional (UQ, MU, UNSW, UWA), image/non-image instrument data, data generating facilities
  • ADA: single institution (ANU), social science data, data holding facility with a national role
  • CSIRO: single institution (not a university), range of data types, institutional data store

The primary focus of the program was on the trustedness of the repository containers, not on the data they contained: in other words, Trusted (Data Repositories), not (Trusted Data) Repositories. However, in the case of the NIF project both aspects were considered: (1) the requirements necessary and sufficient for a basic NIF trusted data repository service; and (2) the NIF Agreed Process (NAP) to obtain trusted data from NIF instruments.

The main challenges addressed across the program were how to:

In this BoF, the projects will present what they learned from undertaking this journey and reflect on how to generalize those lessons to the national context (noting that NIF is a national facility, ADA is a national repository, and CSIRO is a national agency).

Following this there will be an open discussion about next steps, including how to expand this initial set of projects to a national infrastructure of trusted data repositories serving a range of domains.

Format

The BoF will be a mix of presentation of content via slides (contributed by Love, McEachern and Mehnert), followed by an open discussion among all those presenting (facilitated by Treloar).

Timing

0-10: Overview of Trusted Data Repository (TDR) program run in 2017 and international relevance

10-40: 3 ten minute presentations from each of the pilot TDR projects

50-60: Role of Trusted Data Repositories in the NRDC

60-75: Open discussion

75-80: Next steps

4 years on: Evaluating the quality and effectiveness of Data Management Plans

Janice Chan1, Amy Cairns2, John Brown3

1Curtin University, Perth, Australia, Janice.chan@curtin.edu.au

2Curtin University, Perth, Australia, Amy.Cairns@curtin.edu.au

3Curtin University, Perth, Australia, John.Brown@curtin.edu.au

 

Introduction

The Data Management Planning (DMP) Tool[1] at Curtin University was developed in 2014. Since its launch, Curtin staff and students have created over 4,800 Data Management Plans (DMPs). While the high number of DMPs created is encouraging, it may or may not have a direct correlation to improved data management practice. This presentation outlines the actions being taken at Curtin University to evaluate the effectiveness of DMPs, and how the analysis of DMP data is being used to improve provision of services to support research data management.

DMPs at Curtin

The DMP Tool is embedded in the research life cycle at Curtin University. DMPs are mandatory for researchers requiring human and animal ethics approval. Higher Degree by Research students must submit a DMP on candidacy application. Researchers who require access to the Curtin research data storage facility (R drive) must also complete a DMP[2].

DMP analysis

In 2018, four years since the launch of the DMP Tool, the Library analysed DMP data and gained some useful insights. The data answered questions such as: Which faculty produced the most DMPs? How many DMPs have been updated since creation? How much storage was actually used as opposed to storage requested? When are the peak times for DMP creation? This information has been useful for providing support services.

As of 23 May, 4,843 DMPs have been created, of which 44% were created by staff, and 56% by students. Figure 1 shows the breakdown by faculty and DMP owner type.

Figure 1: Data management plans by faculty and owner type

Figure 2 shows the number of DMPs created by month in the full years between 2015 and 2017. This maps out the peaks and troughs of DMP creation, which has helped the Library to plan RDM services and schedule new DMP support workshops at the point of need.

Figure 2: DMPs created by month Jan 2015 – Dec 2017

Questions in the DMP Tool are mostly optional, and DMPs are not reviewed except for those submitted for ethics or candidacy applications. While the DMP data indicated that the majority of optional questions were not left blank, this in itself is not an indicator of quality metadata in DMPs, nor does it demonstrate that DMPs have improved research data management practice at Curtin. This requires further investigation.

Evaluating the effectiveness of DMPs

Based on the research question “Do DMPs improve research data management practice?” the Library collaborated with the School of Media, Creative Arts, and Social Inquiry at Curtin University to address this question through the work of a Masters student. Library staff are now working with the research student and her supervisor to develop the scope of the research project, methodology, and expected outputs.

Survey questions have been developed with input from the Library and the Research Office. Invitations to complete the survey will be sent out to researchers who have completed at least one DMP, and will be sent from the Research Office in order to maximize response rate. Respondents can opt in to participate in focus groups to discuss the survey questions further with the researcher.

The research project will be completed by the end of 2018, with an expected output of a Masters by Coursework thesis and a report outlining the findings and recommendations.         

REFERENCES

  1. Research Data Management Planning. Available from: https://dmp.curtin.edu.au/, accessed 28 May 2018.
  2. Research data management: Data management plan. Available from: http://libguides.library.curtin.edu.au/c.php?g=202401&p=1333108, accessed 28 May 2018.

Biography:

Janice Chan is Coordinator, Research Services at Curtin University, Perth, Western Australia. Janice’s experience is in repository management and scholarly communications. She is interested in open research, metrics and impact assessment, research data management, library-led publishing, data analysis and visualisation, and innovative practice in library service delivery. https://orcid.org/0000-0001-7300-3489

Amy Cairns is a Master of Information Management student in the Libraries, Archives, Records and Information Science Program at Curtin University in Perth, Australia. https://orcid.org/0000-0002-7656-5361

John Brown is a librarian in the Research Services team at Curtin University. John is currently involved in providing Research Data Management services and training to the researchers of Curtin. https://orcid.org/0000-0002-6118-577X

The Bioplatforms Australia Data Portal

Adam Hunter1, Grahame Bowland2, Samuel Chang3, Tamas Szabo4, Kathryn Napier5, Mabel Lum6, Anna MacDonald7, Jason Koval8, Anna Fitzgerald9, Matthew Bellgard10

1Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, aahunter@ccg.murdoch.edu.au

2Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, gbowland@ccg.murdoch.edu.au

3Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, schang@ccg.murdoch.edu.au

4Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, tszabo@ccg.murdoch.edu.au

5Curtin Institute for Computation, Curtin University, Bentley, Australia, kathryn.napier@curtin.edu.au

6Bioplatforms Australia, Sydney, Australia, mlum@bioplatforms.com

7Centre for Biodiversity Analysis, Australian National University, Canberra, Australia, anna.macdonald@anu.edu.au

8Ramaciotti Centre for Genomics, University of New South Wales, Sydney, Australia, j.koval@unsw.edu.au

9Bioplatforms Australia, Sydney, Australia, afitzgerald@bioplatforms.com

10Office of eResearch, Queensland University of Technology, Brisbane, Australia, matthew.bellgard@qut.edu.au

 

Background

Innovative life science research requires access to state of the art infrastructure, ideally developed through a strategic investment plan that promotes technology development and builds expertise for the benefit of all Australian researchers. Bioplatforms Australia enables innovation and collaboration in life science research by investing in world class infrastructure and associated expertise in molecular platforms and informatics, such as genomics, proteomics, metabolomics, and bioinformatics. Through collaborative research projects, Bioplatforms Australia creates open-data initiatives that build critical ‘omics datasets that support scientific challenges of national significance [1].

Investment funding for Bioplatforms Australia has been provided through the Commonwealth Government National Collaborative Research Infrastructure Strategy (NCRIS) with co-investments made by State Governments, research institutions and commercial entities. Infrastructure investments are hosted across Australia by a network of leading universities and research institutions, which ensures broad access through contracted services and research collaborations.

To date, Bioplatforms Australia has invested in nine collaborative open-data projects to generate biological datasets of national significance, such as the Australian Microbiome Database [2, 3], the Oz Mammals Genomics Initiative [4], and Antibiotic Resistant Sepsis Pathogens [5]. The collective aims of these open-data projects are to: i) integrate scientific infrastructure with researchers; ii) build new data resources capturing essential metadata and integrating the generated ‘omics data with other scientific data; iii) encourage, promote and facilitate multi-institutional, cross-discipline collaboration; iv) leverage co-investment from scientific, government, philanthropic and commercial partners; and v) enable participation in, and proactive engagement with, international research consortia.

While the collaborative open-data projects are aligned with national research priorities that seek to improve Australia’s health and well-being, the datasets are contributing to increasing knowledge on issues of global significance. For example, the Antibiotic Resistant Sepsis Pathogens project brings together multidisciplinary teams to identify common pathogenic pathways in order to ultimately develop new approaches to disease management. The appropriate management of such research data is therefore of critical importance to ensure the data remains a valuable asset.

In order to appropriately manage researcher and public access to raw and analysed data and associated contextual metadata from numerous collaborative open-data projects and to bring research communities together, a sustainable and scalable digital platform solution was needed. The Bioplatforms Australia Data Portal (‘Data Portal’) was thereby created through a collaboration between Bioplatforms Australia and the Centre for Comparative Genomics at Murdoch University. The Data Portal provides online access to datasets and associated metadata, empowers research communities to curate and manage the data, and is built upon open source, best-of-breed technology using international standards.

Development of the Bioplatforms Australia Data Portal

The Data Portal is a data archive repository that houses raw sequence data, analysed data, and associated contextual metadata for each of the nine collaborative open-data projects. In the development of this Data Portal, we identified several key criteria to be addressed to ensure the deployment of a sustainable and scalable digital platform that can be applicable for a broad community of users: i) open-source software adopting leading technology; and ii) purposeful application of data and adoption of the FAIR data principles [6].

Open-source software adopting CKAN

The Data Portal was originally developed as bespoke software and deployed on traditional, on-premises data storage systems. However, bespoke software development is generally time consuming, expensive, and not sustainable in the long term. In order to leverage other national investments and ensure long-term sustainability, the Data Portal adopted the Comprehensive Knowledge Archive Network (CKAN) as its core technology, replacing the bespoke software code [7]. CKAN is the world’s leading open-source data portal platform, and is used by numerous governments and public institutions to share data with the general public, including the Australian federal government [8] and Western Australian government [9] data portals, and the United Kingdom’s Natural History Museum data portal [10]. As the on-premises data storage reached its end of life, the Data Portal was migrated to Amazon Web Services with the support of Bioplatforms Australia. The code for the Data Portal and associated tools, such as extensions to the CKAN project, is open source [11].

Purposeful application of data and adoption of FAIR data principles

The Data Portal provides researchers with access to data and associated contextual metadata. Metadata is necessarily in a state of flux, from collection by field scientists through to PCR and sequencing, and needs to be updated in a reproducible manner from authorised data services. Standards have been developed and adopted for both data and processes in order to automatically ingest large amounts of sequencing data and associated metadata. The Data Portal has also established robust functionality with regard to the FAIR data principles of Findable, Accessible, Interoperable, Reusable [6]. For example, all data in the Data Portal can be accessed via its identifier, using a standardised, open and documented API, subject to authentication. Bulk data access, allowing researchers to download data en masse subject to user-defined search terms, is also available. Established international and national ontologies are also used in the databases.
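
As an illustration of what identifier- and search-based access through CKAN’s documented action API looks like, the Python sketch below uses the standard CKAN endpoints; the base URL, query and dataset identifier are assumptions for illustration, and private datasets would additionally require an API key in an Authorization header.

```python
import requests

# Base URL of a CKAN-backed portal; the host, query and dataset identifier
# below are illustrative assumptions only.
BASE = "https://data.bioplatforms.com/api/3/action"

# Search packages (datasets) matching a free-text query.
resp = requests.get(f"{BASE}/package_search", params={"q": "sepsis", "rows": 10})
resp.raise_for_status()
for pkg in resp.json()["result"]["results"]:
    print(pkg["name"], pkg.get("title"))

# Fetch a single dataset by identifier and list its downloadable resources.
pkg = requests.get(f"{BASE}/package_show", params={"id": "example-dataset-id"}).json()
for res in pkg["result"]["resources"]:
    print(res["url"])
```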

Conclusions

To date, the Data Portal has directly managed the ingestion of tens of thousands of samples constituting over 60 terabytes of data. Bioplatforms Australia enables a broad scope of research endeavours through investment in nationally collaborative programs that fund the building of new datasets and ultimately offer them as a public resource. By employing large-scale consortia to build and analyse datasets, existing academic and end-user knowledge is combined with leading ‘omics capabilities to create distinctive sample collections of national and international importance. The Bioplatforms Australia Data Portal, built upon open-source, best-of-breed technology using the same underlying technology deployed by numerous governments and organizations worldwide [12], enables effective management of, and access to, these valuable data resources to ensure their perpetual value.

REFERENCES

  1. Bioplatforms Australia. Available from: http://www.bioplatforms.com/what-we-do/, accessed 4 June 2018.
  2. Australian Microbiome. Available from: https://data.bioplatforms.com/organization/about/australian-microbiome, accessed 4 June 2018.
  3. Bissett, A., et al. Introducing BASE: the Biomes of Australian Soil Environments soil microbial diversity database. GigaScience, 2016. 5(1): p. 21.
  4. Oz Mammals Genomics Initiative. Available from: https://data.bioplatforms.com/organization/about/bpa-omg, accessed 3 June 2018.
  5. Antibiotic Resistant Sepsis Pathogens. Available from: https://data.bioplatforms.com/organization/about/bpa-sepsis, accessed 3 June 2018.
  6. Wilkinson, M.D., et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data, 2016. 3: p. 160018.
  7. CKAN. Available from: https://ckan.org/, accessed 4 June 2018.
  8. Australian Government Data Portal. Available from: https://data.gov.au/, accessed 3 June 2018.
  9. Western Australian Government Data Portal. Available from: https://data.wa.gov.au/, accessed 3 June 2018.
  10. Natural History Museum Data Portal. Available from: http://data.nhm.ac.uk/, accessed 3 June 2018.
  11. Bioplatforms Australia. Available from: https://github.com/BioplatformsAustralia, accessed 3 June 2018.
  12. CKAN. Available from: https://ckan.org/about/instances/, accessed 4 June 2018.

Biography:

Grahame Bowland is a Software Developer at the  Centre for Comparative Genomics at Murdoch University. Grahame is a senior member of the software development team which develops, deploys, and maintains eResearch software solutions such as the Bioplatforms Australia Data Portal, electronic biobank solutions, and disease registries.

Dr Kathryn Napier recently joined the Data Science team at the Curtin Institute for Computation at Curtin University. Kathryn previously worked at the Centre for Comparative Genomics at Murdoch University as a Research Associate in the areas of Bioinformatics and Health Informatics. Kathryn worked with the CCG’s software development team who develop and deploy eResearch software solutions such as disease and patient registries and the Bioplatforms Australia Data Portal.

Archives in Flight: Delivering Australian Archives to Researchers

Associate Professor Gavan McCarthy1, Mr Peter Tonoli2

1eScholarship Research Centre, University of Melbourne, Parkville, Australia, gavanjm@unimelb.edu.au

2eScholarship Research Centre, University of Melbourne, Parkville, Australia, peterct@unimelb.edu.au

 

The eScholarship Research Centre (ESRC) at the University of Melbourne has linked its archival services with AARNet’s storage service “CloudStor” by using the Filesender API. As the digital transformation of research expands, research support and collection services must also evolve through integration with national research infrastructure. Rather than attempting to provide access to privacy- and rights-compromised materials online, the ESRC is working to deliver these types of unpublishable materials directly to the researcher under explicitly articulated conditions. The service is designed to allow archival collection materials to be delivered, on request via an online form, to researchers anywhere in the world. A policy framework (where the user supplies information and agrees to the conditions by click-through), combined with the file sending and notification functions in CloudStor, supports the utility and efficacy interests of the ESRC and the research community.

A researcher searching the web (the ESRC finding aids and item descriptions are indexed by search engines) will locate the archival item they want (which has defined conditions of use), click the link that takes them to the online form, complete it, and click deliver. The process, called the Digital Archives Delivery Service (DADS), is envisaged as an extensible web service and operates as follows.

  1. The researcher’s request conveys to DADS the key metadata describing the materials, the associated obligations and the location of the material for transfer.
  2. In response to the request the DADS then batches and loads the files into CloudStor.
  3. A notification is sent through the Filesender function in CloudStor to the researcher that reiterates the obligations associated with having a copy of the materials.
  4. The researcher, if happy to comply with those obligations, can then download the files.

The archives supply process happens in a matter of seconds, through lightweight technical integration and because CloudStor operates on the national high-speed network provided by AARNet to meet the needs of research and education. The transaction is documented both as records (emails) and as data, and is provided to the archive, enabling both accountability and continuous reporting.

The DADS source code and documentation are located on GitHub (https://github.com/esrc-unimelb/DADS) for those interested in what happens technically through the use of the Filesender API, e.g. get/send commands. The ESRC has completed the first two cycles of proof-of-concept with DADS, and the capacity to report on the implementation phase is anticipated by 2018.
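
The sketch below is a simplified restatement of that four-step flow in Python; every function in it is a hypothetical placeholder rather than an actual DADS or Filesender API call (the real integration lives in the GitHub repository above).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DeliveryRequest:
    requester_email: str
    item_ids: List[str]   # identifiers of the requested archival items
    obligations: str      # conditions of use agreed to via the online form

def fetch_items(item_ids: List[str]) -> List[str]:
    # Placeholder: resolve item metadata and file locations from the archive.
    return [f"/archive/{item_id}.zip" for item_id in item_ids]

def upload_via_filesender(paths: List[str]) -> str:
    # Placeholder: batch and load the files into CloudStor, returning a
    # transfer reference; not a real Filesender API call.
    return "transfer-0001"

def notify(email: str, transfer: str, obligations: str) -> None:
    # Placeholder: notification that reiterates the obligations of use.
    print(f"Notify {email}: transfer {transfer}; obligations: {obligations}")

def deliver(request: DeliveryRequest) -> None:
    paths = fetch_items(request.item_ids)                      # step 1
    transfer = upload_via_filesender(paths)                    # step 2
    notify(request.requester_email, transfer, request.obligations)  # step 3
    # step 4: the researcher downloads the files if willing to comply

deliver(DeliveryRequest("researcher@example.edu", ["item-42"], "No redistribution"))
```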

This presentation will cover the journey of implementing DADS, from ideation, working with the different perspectives in a multi-disciplinary environment, to creating a minimum viable product (MVP). The benefits of working in an agile manner, and developing a MVP, as opposed to a ‘perfect product’ in the academic environment will also be discussed.


Biography:

Associate Professor Gavan McCarthy is Director of the University of Melbourne eScholarship Research Centre in the University Library founded in 2007. His research is in the discipline of social and cultural informatics with expertise in archival science and a long-standing interest in the history of Australian science. He contributes to research in information infrastructure development within the University and his projects highlight strong engagement with community. His distinctive cross-disciplinary research reaches into other fields such as education, social work, linguistics, anthropology, population health and history. He re-examines theoretical foundations and tests new theories through practical interventions with a focus on public knowledge domains, contextual information frameworks and knowledge archives.
https://orcid.org/0000-0001-8411-3173

Peter Tonoli is an Information Technology Consultant at the eScholarship Research Centre University of Melbourne, with a Masters Degree in Information Technology Management majoring in Information Security, Graduate level qualifications in Project Management and Management, and industry certifications in IT Project Management, Information Security and IT Service Management. He is a freelance information security, digital rights and online privacy consultant, servicing individuals and organisations across the government, media, community, non-profit and medical research sectors. His involvement in information technology security and management dates back to the establishment of the Internet in Australia. He has a long history of voluntary community involvement including holding committee positions on various IT industry peak bodies. Peter is currently a Board Member of Electronic Frontiers Australia, and a Non-executive Director of Internet Australia
https://orcid.org/0000-0003-1164-5632

How to Choose the ‘Right’ Repository for Research Data

Shawn A Ross1, Steven McEachern2, Peter Sefton3, Brian Ballsun-Stanton4

1Macquarie University, Sydney, Australia, shawn.ross@mq.edu.au

2Australian National University, Canberra, Australia, steven.mceachern@anu.edu.au

3University of Technology Sydney, Australia, peter.sefton@uts.edu.au

4Macquarie University, Sydney, Australia, brian.ballsun-stanton@mq.edu.au

DESCRIPTION

In Australia, multi-institutional, domain-specific information infrastructure projects, such as those funded through the Australian National eResearch Collaboration Tools and Resources (NeCTAR) program, are typically open-source software (OSS). National infrastructure such as AARNet’s CloudStor, built on ownCloud, is also OSS. Even publications repositories and data management planning tools are often OSS (DSpace, DMPonline, DMPTool, ReDBox, etc.). The trend in institutional research data repository software amongst institutions that prefer not to build an in-house solution, however, appears to favour proprietary software (e.g., Figshare). In comparison to Europe and North America, OSS (e.g., Dataverse, CKAN) is much less popular in Australia. Dataverse, for example, has 33 installations on five continents containing 76 institutional ‘Dataverses’ (some installations house more than one). Australia, however, has only one installation or institutional Dataverse (the Australian Data Archive) [1]. By contrast, Figshare has been or is being implemented by at least five Australian universities [2], with others actively considering it.

This BoF session examines the reasons why institutions choose proprietary versus OSS for research data infrastructure. We compare the practical advantages, disadvantages, and considerations around each approach. We propose for discussion the idea that the advantages of proprietary software are overstated, as is the burden of implementing and administering OSS. For example, costs like requirements analysis, systems integration and engagement, outreach, and training, which together likely account for the majority of a software project’s budget, are similar whether the software is proprietary or OSS. The burden of deploying and maintaining modern OSS platforms, facilitated by approaches like containerisation and automation, is lower than in the past. SaaS options for OSS are also sometimes overlooked. Proprietary software, moreover, is not always an ‘out-of-the-box’ turn-key solution for software at universities, especially regarding specialised software for research (as opposed to commodity software). As such, it may require the creation of separate but interoperable systems to fill gaps in capacity, dramatically raising costs. Conversely, the flexibility and capabilities of an OSS solution are often neglected: if a feature is missing or inadequate, it can be built (often with support from the community) and made available for reuse, without having to work around the edges of a proprietary system. The Australian Data Archive, for example, has added significant new features to Dataverse to support mediated access to sensitive data, which are available to other users. However, a deeper exploration of the tradeoffs and demands of both approaches in the context of specialised academic software is warranted. The focus of the discussion will be practical, but it may extend to the potential impact of various software business models on research data, a core output and asset of universities.

The session will be 60 minutes in duration. It will include brief presentations by the organisers based on their experience, followed by open discussion. Audience participation is essential: we encourage a candid exchange of experience with either proprietary or OSS for research data management at various institutions, so that we can learn from each other’s successes and challenges. The outcome will be information to guide decision making around repository platform procurement at universities.

REFERENCES

  1. The Dataverse Project. Available from: https://dataverse.org/ accessed 22 June 2018. See also https://dataverse.org/metrics accessed 22 June 2018.
  2. The University of Adelaide. Available from: https://www.adelaide.edu.au/figshare/. The University of Melbourne. Available from: https://melbourne.figshare.com/. Monash University. Available from: https://monash.figshare.com/. La Trobe University. Available from: https://latrobe.figshare.com/. Federation University Australia (planned). Available from: https://federation.edu.au/staff/governance/projects/current-projects.

Biographies:

Shawn Ross (Ph.D. University of Washington, 2001) is Associate Professor of History and Archaeology and Director of Data Science and eResearch at Macquarie University. A/Prof Ross’s research interests include the history and archaeology of pre-Classical Greece and the Balkans, and the application of information technology to research. He supervises a large-scale landscape archaeology and palaeo-environmental study in central and southeast Bulgaria. Since 2012, he has also directed a large information infrastructure project developing data capture and management systems for field research. Previously, A/Prof Ross worked at the University of New South Wales (Sydney, Australia) and William Paterson University (Wayne, New Jersey).

Steve McEachern is Director and Manager of the Australian Data Archive at the Australian National University, where he is responsible for the daily operations and technical and strategic development of the data archive. He has high-level expertise in survey methodology and data archiving, and has been actively involved in development and application of survey research methodology and technologies over 15 years in the Australian university sector. Steve holds a PhD in industrial relations from Deakin University, as well as a Graduate Diploma in Management Information Systems from Deakin University, and a Bachelor of Commerce with Honours from Monash University. He has research interests in data management and archiving, community and social attitude surveys, organisational surveys, new data collection methods including web and mobile phone survey techniques, and reproducible research methods. Steve has been involved in various professional associations in survey research and data archiving over the last 10 years.

Peter Sefton is the Manager, eResearch Support at the University of Technology, Sydney (UTS). Before that he was in a similar role at the University of Western Sydney (UWS). Previously he ran the Software Research and Development Laboratory at the Australian Digital Futures Institute at the University of Southern Queensland. Following a PhD in computational linguistics in the mid-nineties, he has gained extensive experience in the higher education sector in leading the development of IT and business systems to support both learning and research. At UTS Peter is leading a team which is working with key stakeholders to implement university-wide eResearch infrastructure, including an institutional data repository, as well as collaborating widely with research communities at the institution on specific research challenges. His research interests include repositories, digital libraries, and the use of the Web in scholarly communication.

Brian Ballsun-Stanton is Solutions Architect (Digital Humanities) for the Macquarie University Faculty of Arts with a PhD from UNSW in Philosophy. He is working with researchers from across Australia to deploy digital technologies and workflows for their research projects. He has developed a new methodology (The Social Data Flow Network) to explore how individuals in the field understand the nature of data. Brian’s current research interests are in exploring the Philosophy of Science’s interactions with the Open Data movement, and building tools for rapid analysis and bulk manipulation of large ancient history corpora.

Soup to Nuts: A Panel Discussion on Implementing a Research Life Cycle in the Design of Research Services

Gavin Kennedy1, Vicki Picasso2, Dr Cameron McLean3, Louise Wheeler4, Siobhann McCafferty5

1Queensland Cyber Infrastructure Foundation, Brisbane, Australia, gavin.kennedy@qcif.edu.au

2University of Newcastle, Newcastle, Australia, vicki.picasso@newcastle.edu.au

3University of Auckland, Auckland, New Zealand, ca.mclean@auckland.ac.nz

4University of Technology Sydney, Sydney, Australia, louise.wheeler@uts.edu.au  

5Australian Access Federation and Data Life Cycle Framework, Brisbane, Australia, siobhann.mccafferty@aaf.edu.au

 

This is a joint presentation and panel discussion that envisions an implementation of a “Research Life Cycle Framework” based on the capture of minimal research activity descriptions, the integration of multiple information systems and repositories, and the provision and orchestration of research services. With a ‘researcher-centric’ design, such a platform would provide users with a complete set of services from research planning (the soup), through research execution, to reporting, publishing and archival (the nuts). It would provide a consistent access point that supports the basic actions of research and inquiry, while simultaneously providing better management and reporting tools, resulting in improved quality, integrity, impact, and effectiveness/efficiency of research. That is, it works for all the stakeholders in the research ecosystem, a natural benefit of adopting a systemic/holistic life cycle view.

This is more than a thought experiment or an aspirational vision as this panel has come together through the realisation that the working components of such a platform already exist: University of Auckland’s Research Hub(1) is an established actionable service directory; University of Technology Sydney’s Provisioner(2) is a pluggable service catalog framework; ReDBox and Mint(3) provide the integration tools, workflows and user workspaces to support the researcher; and DLCF RAiDs and Group IDs(4) provide an identity and resource sharing framework to tie multiple disparate services to different groups of people. Universities such as University of Auckland, University of Newcastle and University of Technology Sydney are already considering how to develop this model and leverage these existing services. A further opportunity is for a national level platform through which national services can be provisioned, larger institutions can integrate at the level they require, and smaller institutions could benefit from accessing the complete spectrum of Research Life Cycle services “in the cloud”.

As a panel session this will provide the eResearch community the opportunity to reflect on this framework and assist the proposers to validate some of their assumptions. For example, a key question to be addressed through this forum is how do we ensure we are truly enabling research activities with the design of our services rather than just creating research management systems, or systems for managing disconnected fragments of the research process?

Intended audience

This panel will be of interest to national eResearch strategists, eResearch managers, university research office managers, data librarians and technical staff. It will describe how the architecture of the proposed platform supports the research life cycle and how it could integrate with institutional systems. It does not assume any technical knowledge, so we encourage developers, administrators and strategists to attend.

REFERENCES

  1. Introducing Provisioner: https://eresearch.uts.edu.au/2018/04/05/provisioner_1.htm Accessed 22 June, 2018
  2. http://www.redboxresearchdata.com.au Accessed 22 June, 2018
  3. https://www.dlc.edu.au/ Accessed 22 June, 2018

Biographies:

Gavin Kennedy (https://orcid.org/0000-0003-3910-0474) is the manager of Data Innovation Services at Queensland Cyber Infrastructure Foundation and leads the team responsible for ReDBox development and support.

Vicki Picasso (https://orcid.org/0000-0001-9422-5021) is Senior Librarian, Research Support Services at the University of Newcastle.

Dr Cameron McLean (https://orcid.org/0000-0002-9836-3824) is a Research and Education Enabler and Research IT Specialist at the Centre for eResearch, University of Auckland.

Louise Wheeler manages the research integrity environment for academics at UTS. This includes the development and implementation of policy, education and training, and the procedural framework for upholding responsible research practices. Louise’s role also extends to the investigation of integrity breaches, as required under the Australian Code for the Responsible Conduct of Research.

Siobhann McCafferty (https://orcid.org/0000-0002-2491-0995) is Project Manager for the Data Life Cycle Framework project, supported by ARDC, AARNet and the Australian Access Federation.

A framework for integrated research data management, with services for planning, provisioning research storage and applications and describing and packaging research data

Michael Lynch1, Peter Sefton2, Sharyn Wise3

1University of Technology Sydney, Michael.Lynch@uts.edu.au

2University of Technology Sydney, Peter.Sefton@uts.edu.au

3University of Technology Sydney, Sharyn.Wise@uts.edu.au

 

Research data management is critical for the integrity of scholarship, for the ability of researchers and institutions to re-use and share data, and for IT support staff and data librarians to be able to plan, maintain and curate data collections. It is also time-consuming and daunting for researchers, and runs the risk of becoming yet another bureaucratic hurdle to research work.

Provisioner is an open framework for integrating research data management into research tools and workflows, allowing researchers to select applications from a service catalogue and create workspaces which are linked to research data management plans, and supporting data archiving and publication.

This presentation will cover:

  • The ideas underlying the Provisioner framework, a loosely-coupled distributed system designed for high resilience to inevitable organisational and technical change
  • ReDBox 2.0, the platform in which this work is implemented
  • Using DataCrates for integrated metadata and a file-based repository
  • A case study of a research data workflow from microscopes to data modelling and simulation to an immersive visualisation

The Provisioner

The Provisioner is based on two key ideas. The first is the idea of a “workspace”, used to define a limited set of basic operations (create, share, import and export) which can be executed via APIs on a wide range of research data applications and storage services. The second is redundant, machine- and human-readable metadata stored with the data, used to build a system that is not a monolith in which identifying the creator, owner and funding agency of a dataset would depend on a centralised database.
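
A minimal sketch of how those four workspace operations might be expressed as a per-application adapter interface is shown below; the class and method names are illustrative assumptions, not the actual Provisioner API.

```python
from abc import ABC, abstractmethod
from typing import List

class Workspace(ABC):
    """One adapter per research application or storage service.

    Names here are illustrative; the real framework defines its own contracts.
    """

    @abstractmethod
    def create(self, name: str, owner: str) -> str:
        """Provision a workspace and return its identifier."""

    @abstractmethod
    def share(self, workspace_id: str, users: List[str]) -> None:
        """Grant a list of collaborators access to the workspace."""

    @abstractmethod
    def import_data(self, workspace_id: str, source_url: str) -> None:
        """Bring existing data into the workspace."""

    @abstractmethod
    def export_data(self, workspace_id: str, destination: str) -> None:
        """Export the workspace contents, e.g. as a DataCrate, for archiving."""
```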

The Provisioner framework allows us to connect diverse research applications using DataCrates as a common interchange format and manage automated pipelines of data management tasks such as exporting data, crosswalking metadata and requesting DOIs from minting services.

ReDBox 2.0

The principal way in which researchers interact with the Provisioner is through the service catalogue in the University’s research data management tool and data catalogue, Stash, implemented in ReDBox.

As part of the Provisioner project, ReDBox has been redeveloped as a modern web application, and now includes a service catalogue from which researchers can select workspaces in a range of research applications, from OMERO (an open microscopy environment) and GitLab (for maintaining and publishing software as a research output) through to research fileshares.

DataCrates

Provisioner uses the DataCrate standard to store metadata in human- and machine-readable formats with the accompanying datasets at different stages of the research life-cycle, from creation and analysis through to archiving and publication. DataCrates are directories on a filesystem with a conventional layout based on the BagIt standard and linking to contextual metadata with JSON-LD, and are suitable for both archiving and publication.
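
As a simplified illustration (not the normative DataCrate layout), the Python sketch below packages a small dataset directory together with a JSON-LD metadata file kept alongside the data; the exact file names, BagIt structure and required properties are defined by the DataCrate specification.

```python
import json
from pathlib import Path

# Write a tiny dataset into a payload directory, BagIt-style; names and
# metadata fields are illustrative only.
crate = Path("example_crate/data")
crate.mkdir(parents=True, exist_ok=True)
(crate / "observations.csv").write_text("time,count\n0,12\n")

# Minimal JSON-LD description of the dataset and its file, stored with the
# data so it remains readable by both humans and machines.
catalog = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "Example bacterial motility dataset",
    "creator": {"@type": "Person", "name": "A. Researcher"},
    "hasPart": [{"@type": "File", "name": "observations.csv"}],
}
(crate / "CATALOG.json").write_text(json.dumps(catalog, indent=2))
```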

Case Study

We present a case study of a research data workflow which starts with video microscopy of bacteria, through to a code repository and mathematical modelling of bacterial movement, to 3D visualization of simulated bacterial movement in the UTS Data Arena.

References

  1. Lynch, M. “Introducing Provisioner”, https://eresearch.uts.edu.au/2018/04/05/provisioner_1.htm, 2018, accessed 21 June 2018.
  2. Sefton, P.  “DataCrate: Formalising ways of packaging research data for re-use and dissemination”, Presentation, eResearch Australasia 2017,  https://conference.eresearch.edu.au/2017/08/datacrate-formalising-ways-of-packaging-research-data-for-re-use-and-dissemination/, accessed 22 June 2018.

Biography:

Mike Lynch is an eResearch Analyst in the eResearch Support Group at UTS. His work involves solution design, information architecture and software development supporting research data management. His other interests include data visualisation and functional programming languages.

Peter Sefton is the Manager, eResearch Support at the University of Technology, Sydney (UTS).

At UTS Peter is leading a team which is working with key stakeholders to implement university-wide eResearch infrastructure, including an institutional data repository, as well as collaborating widely with research communities at the institution on specific research challenges. His research interests include repositories, digital libraries, and the use of The Web in scholarly communication.
