Applying Lessons Learned from a Global Climate Petascale Data Repository to other Scientific Research Domains

Clare Richards1, Kate Snow2, Chris Allen3, Matt Nethery4, Kelsey Druken5, Ben Evans6

1Australian National University, Canberra, Australia, Clare.Richards@anu.edu.au

2Australian National University, Canberra, Australia, Kate.Snow@anu.edu.au

3Australian National University, Canberra, Australia, Chris.Allen@anu.edu.au

4Australian National University, Canberra, Australia, Matthew.Nethery@anu.edu.au

5Australian National University, Canberra, Australia, Kelsey.Druken@anu.edu.au

6Australian National University, Canberra, Australia, Ben.Evans@anu.edu.au

 

INTRODUCTION

NCI has established a combined Big Data/Compute repository for National Reference Data Collections, supported through the former Research Data Services (RDS) project, in which NCI led the management of the Earth System, Environmental and Geoscience Data Services, comprising primarily Climate and Weather, Satellite Earth Observations and Geophysics data. NCI has over 15 years of experience working across these domains, building the capacity, infrastructure and skills to manage the data and to make it usable both within and beyond the domains that generated it. In recent years it has become apparent that, as data volumes grow, discovery and reproducibility of research and workflows must be efficient, and internationally agreed standards, data services, data accessibility and data management practices all become critically important.

One major driver for developing this type of combined Big Data/Compute infrastructure has been to address the needs of the Australian Climate community. This national priority research area is one of the most computationally demanding and data-intensive in the environmental sciences. Such data needs to reside within a focused national centre able to handle the scale and dimensions of the requirements, including the computation to generate the data, the computational capacity to analyse it, and the expertise to manage very large and complex data collections. A large proportion of climate data comes from the World Climate Research Programme’s Coupled Model Intercomparison Project (CMIP) and is managed and shared by an international collaborative infrastructure called the Earth System Grid Federation (ESGF).

In the not-too-distant past, CMIP data was shipped around the globe on hard drives. The data was difficult to keep up to date and to share with all the researchers who needed access; indeed, the data generated for CMIP has always outstripped the capacity to distribute it in a timely fashion. For example, CMIP3 in 2001 produced data of the order of 10 TB, by 2013 CMIP5 required 1 PB, and CMIP6 is predicted to reach at least 20 PB. The sheer size and complexity of the CMIP data collection makes it impossible for repositories to manage in isolation, as it is both difficult and costly to maintain multiple copies of the data for individual users. For climate researchers, searching such large volumes of globally distributed data for individual files that match specific criteria can be like finding several needles in many haystacks!

INTERNATIONAL COLLABORATION FOR MANAGING CLIMATE DATA

To solve this problem the ESGF, an international collaboration led by the US Department of Energy, was set up to improve the storage and sharing of the rapidly expanding petascale datasets used in climate research globally. Since its establishment more than 10 years ago, the ESGF has continued to grow and now manages tens of petabytes of climate science and other data at dozens of sites distributed around the globe. NCI has been a Tier 1 node of the ESGF since 2009 and has invested significantly in the development of the infrastructure, the evolution of Big Data management practices, and the expertise to support this international collaboration. Developing such a capability requires a long-term commitment to harmonising and maintaining a ‘system for the management, access and analysis of climate data’ that meets global community protocols whilst respecting local access policies [1].

As an internationally coordinated activity, the ESGF requires close collaboration between data generators and repository/node partners to ensure adherence to standards and protocols, creating a robust, dependable and sustainable infrastructure that improves data discovery and use across the whole global network while supporting use in the local environment. To make the system work, each node/repository is managed independently, but each agrees to adhere to common ESGF protocols, software and interfaces, including:

  • Data and metadata that comply with agreed standards and conventions;
  • Version control protocols, common vocabularies, and a ‘Data Reference Syntax’ (a worked filename example follows this list); and
  • Consistent publication requirements across the distributed network.
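The ‘Data Reference Syntax’ (DRS) fixes the order of facets encoded in dataset paths and filenames so that any node in the federation can interpret them the same way. The sketch below splits a CMIP5-style filename into its DRS components; the helper function is illustrative only, and the example filename follows the published CMIP5 filename convention.

```python
# Minimal sketch: split a CMIP5-style filename into its Data Reference
# Syntax (DRS) components. The field order follows the published CMIP5
# filename convention; the example filename is illustrative only.

def parse_cmip5_filename(filename):
    """Return a dict of DRS components for a CMIP5-style NetCDF filename."""
    stem = filename.rsplit(".", 1)[0]            # drop the .nc extension
    parts = stem.split("_")
    keys = ["variable", "mip_table", "model", "experiment", "ensemble"]
    drs = dict(zip(keys, parts[:5]))
    if len(parts) > 5:                           # time-invariant fields omit the range
        drs["temporal_subset"] = parts[5]
    return drs

print(parse_cmip5_filename("tas_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc"))
# {'variable': 'tas', 'mip_table': 'Amon', 'model': 'ACCESS1-0',
#  'experiment': 'historical', 'ensemble': 'r1i1p1',
#  'temporal_subset': '185001-200512'}
```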

APPLYING THE BENEFITS TO OTHER DOMAINS

One of the benefits of all participants adhering to the ESGF protocols is that a climate researcher in Australia can search and access data locally or internationally across all participating repositories, and be confident that the data can be reliably used. This matters for scientific collaboration and sharing, verification of research, and reproducibility of results, which are increasingly required for the publication of research papers. The ESGF model also demonstrates that the depth of expertise required to keep climate data aligned with international standards and trends cannot be replicated across all participating nodes/repositories. However, smaller repositories can benefit from the collaboration, which not only helps define the required standards and protocols, but also provides the expertise to develop the tools to manage this important, globally distributed, petascale data collection.
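As a hedged illustration of this federated discovery, the sketch below queries an ESGF index with the community-developed esgf-pyclient package; the index node URL and the search facets shown are examples only, and any ESGF index node can be queried.

```python
# Hedged sketch: discover CMIP5 datasets across the federation with the
# esgf-pyclient package (pip install esgf-pyclient). The index node URL and
# facet values below are examples only.
from pyesgf.search import SearchConnection

# distrib=True asks the index to search all federated nodes, not just itself.
conn = SearchConnection("https://esgf.nci.org.au/esg-search", distrib=True)
ctx = conn.new_context(project="CMIP5",
                       experiment="historical",
                       variable="tas",
                       time_frequency="mon")
print("Matching datasets across the federation:", ctx.hit_count)

# Drill into the first matching dataset and list a few downloadable files.
results = ctx.search()
files = results[0].file_context().search()
for i, f in enumerate(files):
    print(f.download_url)
    if i >= 2:
        break
```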

With the Big Data problem increasingly affecting other research domains, much can be learned from the ESGF model of international collaboration to deliver value to other major research communities, particularly those seeking to join a shared global data services network and move beyond managing individual stores of downloadable data files.

REFERENCES

  1. The Earth System Grid Federation, Design. https://esgf.llnl.gov/federation-design.html [Last accessed 22 June 2018].
  2. The Coupled Model Intercomparison Project (CMIP) https://www.wcrp-climate.org/wgcm-cmip [Last accessed 22 June 2018].
  3. The Earth System Grid Federation. https://esgf.llnl.gov/index.html [Last accessed 22 June 2018].

Biography:

Clare Richards is the Senior HPC Innovations Project Manager at the National Computational Infrastructure (NCI). She manages several projects and activities within the Research, Engagement and Initiatives area, including the Climate Science Data Enhanced Virtual Laboratory and other collaborative projects with NCI partners. Prior to joining ANU in 2015, she had a lengthy and diverse career at the Bureau of Meteorology and has also dabbled in marketing and media.

Trusted Data Repositories: From Pilot Projects to National Infrastructure

Keith Russell1, Andrew Mehnert2,3, Heather Leasor4, Mikaela Lawrence5

1Australian Research Data Commons

2National Imaging Facility, Australia, andrew.mehnert@uwa.edu.au

3Centre for Microscopy, Characterisation and Analysis, The University of Western Australia, Perth, Australia

4Australian National University

5CSIRO

 

DESCRIPTION

In FY 2016/17, ANDS funded the Trusted Data Repository program, which examined how to provide more trusted storage through three projects chosen to cover a number of dimensions:

  • NIF: multi-institutional (UQ, MU, UNSW, UWA), image/non-image instrument data, data generating facilities
  • ADA: single institution (ANU), social science data, data holding facility with a national role
  • CSIRO: single institution (not a university), range of data types, institutional data store

The primary focus of the program was on the trustedness of the repository containers, not on the data they contained; in other words, Trusted (Data Repositories) rather than (Trusted Data) Repositories. However, in the case of the NIF project both aspects were considered: (1) the requirements necessary and sufficient for a basic NIF trusted data repository service; and (2) a NIF Agreed Process (NAP) to obtain trusted data from NIF instruments.

The main challenges addressed across the program were how to:

In this BoF, the projects will present what they learned by undertaking this journey and reflect on how to generalize what they learned to the national context (noting that NIF is a national facility, ADA is a national repository, and CSIRO is a national agency).

Following this there will be an open discussion about next steps, including how to expand this initial set of projects to a national infrastructure of trusted data repositories serving a range of domains.

Format

The BoF will be a mix of presentation of content via slides (contributed by Love, McEachern and Mehnert), followed by an open discussion among all those presenting (facilitated by Treloar).

Timing

0-10: Overview of Trusted Data Repository (TDR) program run in 2017 and international relevance

10-40: Three 10-minute presentations, one from each of the pilot TDR projects

50-60: Role of Trusted Data Repositories in the NRDC

60-75: Open discussion

75-80: Next steps

4 years on: Evaluating the quality and effectiveness of Data Management Plans

Janice Chan1, Amy Cairns2, John Brown3

1Curtin University, Perth, Australia, Janice.chan@curtin.edu.au

2Curtin University, Perth, Australia, Amy.Cairns@curtin.edu.au

3Curtin University, Perth, Australia, John.Brown@curtin.edu.au

 

Introduction

The Data Management Planning (DMP) Tool[1] at Curtin University was developed in 2014. Since its launch, Curtin staff and students have created over 4,800 Data Management Plans (DMPs). While the high number of DMPs created is encouraging, it may or may not correlate with improved data management practice. This presentation outlines the actions being taken at Curtin University to evaluate the effectiveness of DMPs, and how the analysis of DMP data is being used to improve the provision of services to support research data management.

DMPs at Curtin

The DMP Tool is embedded in the research life cycle at Curtin University. DMPs are mandatory for researchers requiring human and animal ethics approval. Higher Degree by Research students must submit a DMP on candidacy application. Researchers who require access to the Curtin research data storage facility (R drive) must also complete a DMP[2].

DMP analysis

In 2018, four years after the launch of the DMP Tool, the Library analysed the DMP data and gained some useful insights. The data answered questions such as: Which faculty produced the most DMPs? How many DMPs have been updated since creation? How much storage was actually used compared with storage requested? When are the peak times for DMP creation? This information has been useful for planning support services.
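A minimal sketch of the kind of analysis involved, assuming the DMP records can be exported to CSV; the file name and column names used here are hypothetical.

```python
# Hedged sketch of the DMP analysis, assuming an export of DMP records to CSV.
# The file name and column names (created, faculty, owner_type,
# storage_requested_gb, storage_used_gb) are hypothetical.
import pandas as pd

dmps = pd.read_csv("dmp_export.csv", parse_dates=["created"])

# Which faculty produced the most DMPs, split by staff/student owners?
by_faculty = dmps.groupby(["faculty", "owner_type"]).size().unstack(fill_value=0)

# Peak times for DMP creation, counted by calendar month.
by_month = dmps["created"].dt.to_period("M").value_counts().sort_index()

# Storage requested versus storage actually used.
storage = dmps[["storage_requested_gb", "storage_used_gb"]].sum()

print(by_faculty, by_month, storage, sep="\n\n")
```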

As of 23 May 2018, 4,843 DMPs had been created, of which 44% were created by staff and 56% by students. Figure 1 shows the breakdown by faculty and DMP owner type.

Figure 1: Data management plans by faculty and owner type

Figure 2 shows the number of DMPs created by month in the full years 2015 to 2017. This maps out the peaks and troughs of DMP creation, which helped the Library plan the RDM service and schedule new DMP support workshops at points of need.

Figure 2: DMPs created by month Jan 2015 – Dec 2017

Questions in the DMP Tool are mostly optional, and DMPs are not reviewed after creation except for those submitted for ethics or candidacy applications. While the DMP data indicated that the majority of optional questions were not left blank, this in itself is not an indicator of quality metadata in DMPs, nor does it demonstrate that DMPs have improved research data management practice at Curtin. This requires further investigation.

Evaluating the effectiveness of DMPs

To address the research question “Do DMPs improve research data management practice?”, the Library collaborated with the School of Media, Creative Arts, and Social Inquiry at Curtin University through the work of a Masters student. Library staff are now working with the research student and her supervisor to develop the scope, methodology, and expected outputs of the research project.

Survey questions have been developed with input from the Library and the Research Office. Invitations to complete the survey will be sent out to researchers who have completed at least one DMP, and will be sent from the Research Office in order to maximize response rate. Respondents can opt in to participate in focus groups to discuss the survey questions further with the researcher.

The research project will be completed by the end of 2018, with an expected output of a Masters by Coursework thesis and a report outlining the findings and recommendations.         

REFERENCES

  1. Research Data Management Planning. Available from: https://dmp.curtin.edu.au/, accessed 28 May 2018.
  2. Research data management: Data management plan. Available from: http://libguides.library.curtin.edu.au/c.php?g=202401&p=1333108, accessed 28 May 2018.

Biography:

Janice Chan is Coordinator, Research Services at Curtin University, Perth, Western Australia. Janice’s experience is in repository management and scholarly communications. She is interested in open research, metrics and impact assessment, research data management, library-led publishing, data analysis and visualisation, and innovative practice in library service delivery. https://orcid.org/0000-0001-7300-3489

Amy Cairns is a Master of Information Management student in the Libraries, Archives, Records and Information Science Program at Curtin University in Perth, Australia. https://orcid.org/0000-0002-7656-5361

John Brown is a librarian in the Research Services team at Curtin University. John is currently involved in providing Research Data Management services and training to the researchers of Curtin. https://orcid.org/0000-0002-6118-577X

The Bioplatforms Australia Data Portal

Adam Hunter1, Grahame Bowland2, Samuel Chang3, Tamas Szabo4, Kathryn Napier5, Mabel Lum6, Anna MacDonald7, Jason Koval8, Anna Fitzgerald9, Matthew Bellgard10

1Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, aahunter@ccg.murdoch.edu.au

2Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, gbowland@ccg.murdoch.edu.au

3Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, schang@ccg.murdoch.edu.au

4Centre for Comparative Genomics, Murdoch University, Murdoch, Australia, tszabo@ccg.murdoch.edu.au

5Curtin Institute for Computation, Curtin University, Bentley, Australia, kathryn.napier@curtin.edu.au

6Bioplatforms Australia, Sydney, Australia, mlum@bioplatforms.com

7Centre for Biodiversity Analysis, Australian National University, Canberra, Australia, anna.macdonald@anu.edu.au

8Ramaciotti Centre for Genomics, University of New South Wales, Sydney, Australia, j.koval@unsw.edu.au

9Bioplatforms Australia, Sydney, Australia, afitzgerald@bioplatforms.com

10Office of eResearch, Queensland University of Technology, Brisbane, Australia, matthew.bellgard@qut.edu.au

 

Background

Innovative life science research requires access to state of the art infrastructure, ideally developed through a strategic investment plan that promotes technology development and builds expertise for the benefit of all Australian researchers. Bioplatforms Australia enables innovation and collaboration in life science research by investing in world class infrastructure and associated expertise in molecular platforms and informatics, such as genomics, proteomics, metabolomics, and bioinformatics. Through collaborative research projects, Bioplatforms Australia creates open-data initiatives that build critical ‘omics datasets that support scientific challenges of national significance [1].

Investment funding for Bioplatforms Australia has been provided through the Commonwealth Government National Collaborative Research Infrastructure Strategy (NCRIS) with co-investments made by State Governments, research institutions and commercial entities. Infrastructure investments are hosted across Australia by a network of leading universities and research institutions, which ensures broad access through contracted services and research collaborations.

To date, Bioplatforms Australia has invested in nine collaborative open-data projects to generate biological datasets of national significance, such as the Australian Microbiome Database [2, 3], the Oz Mammals Genomics Initiative [4], and Antibiotic Resistant Sepsis Pathogens [5]. The specific collective aims of these open-data projects are to: i) integrate scientific infrastructure with researchers; ii) build new data resources capturing essential metadata and integrating the generated ‘omics data with other scientific data; iii) encourage, promote and facilitate multi-institutional, cross-discipline collaboration; iv) leverage co-investment from scientific, government, philanthropic and commercial partners; and v) enable participation in, and proactive engagement with, international research consortia.

While the collaborative open-data projects are aligned with national research priorities that seek to improve Australia’s health and well-being, the datasets are contributing to increasing knowledge on issues of global significance. For example, the Antibiotic Resistant Sepsis Pathogens project brings together multidisciplinary teams to identify common pathogenic pathways in order to ultimately develop new approaches to disease management. The appropriate management of such research data is therefore of critical importance to ensure the data remains a valuable asset.

In order to appropriately manage researcher and public access to raw and analysed data and associated contextual metadata from numerous collaborative open-data projects and to bring research communities together, a sustainable and scalable digital platform solution was needed. The Bioplatforms Australia Data Portal (‘Data Portal’) was thereby created through a collaboration between Bioplatforms Australia and the Centre for Comparative Genomics at Murdoch University. The Data Portal provides online access to datasets and associated metadata, empowers research communities to curate and manage the data, and is built upon open source, best-of-breed technology using international standards.

Development of the Bioplatforms Australia Data Portal

The Data Portal is a data archive repository that houses raw sequence data, analysed data, and associated contextual metadata for each of the nine collaborative open-data projects. In the development of this Data Portal, we identified several key criteria to be addressed to ensure the deployment of a sustainable and scalable digital platform that can be applicable for a broad community of users: i) open-source software adopting leading technology; and ii) purposeful application of data and adoption of the FAIR data principles [6].

Open-source software adopting CKAN

The Data Portal was originally developed as bespoke software deployed on traditional, on-premises data storage systems. However, developing a bespoke system is generally time-consuming and expensive, and is not sustainable long term. In order to leverage other national investments and ensure long-term sustainability, the Data Portal adopted the Comprehensive Knowledge Archive Network (CKAN) as the core technology, replacing the bespoke software code [7]. CKAN is the world’s leading open-source data portal platform and is used by numerous governments and public institutions to share data with the general public, including the Australian federal government [8] and Western Australian government [9] data portals, and the United Kingdom’s Natural History Museum data portal [10]. As the on-premises data storage reached its end of life, the Data Portal was migrated to Amazon Web Services with the support of Bioplatforms Australia. The code for the Data Portal and associated tools, such as extensions to the CKAN project, is open source [11].

Purposeful application of data and adoption of FAIR data principles

The Data Portal provides researchers with access to data and associated contextual metadata. Metadata is necessarily in a state of flux, from collection by field scientists through to PCR and sequencing, and needs to be updated reproducibly from authorised data services. Standards have been developed and adopted for both data and processes so that large amounts of sequencing data and associated metadata can be ingested automatically. The Data Portal has also established robust functionality in line with the FAIR data principles of Findable, Accessible, Interoperable and Reusable data [6]. For example, all data in the Data Portal can be accessed via its identifier, using a standardised, open and documented API, subject to authentication. Bulk data access, allowing researchers to download data en masse subject to user-defined search terms, is also available. Established international and national ontologies are also used in the databases.
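A minimal sketch of such programmatic access, assuming the portal exposes the standard CKAN Action API at the address given in the references; the search term, result handling and API-key value are illustrative only, and some datasets may require authenticated access while others are open.

```python
# Hedged sketch: query the Data Portal through the standard CKAN Action API.
# The search term and the API key value are illustrative only.
import requests

PORTAL = "https://data.bioplatforms.com"
HEADERS = {"Authorization": "your-api-key"}   # hypothetical key, issued on registration

# Find datasets ("packages" in CKAN) matching a user-defined search term.
search = requests.get(f"{PORTAL}/api/3/action/package_search",
                      params={"q": "sepsis", "rows": 5},
                      headers=HEADERS)
search.raise_for_status()
packages = search.json()["result"]["results"]
for pkg in packages:
    print(pkg["name"], "-", pkg["title"])

# Retrieve one dataset by its identifier and list the URLs of its file resources.
if packages:
    show = requests.get(f"{PORTAL}/api/3/action/package_show",
                        params={"id": packages[0]["id"]},
                        headers=HEADERS)
    show.raise_for_status()
    for resource in show.json()["result"]["resources"]:
        print(resource["url"])
```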

Conclusions

To date, the Data Portal has directly managed the ingestion of tens of thousands of samples constituting over 60 terabytes of data. Bioplatforms Australia enables a broad scope of research endeavours through investment in nationally collaborative programs that fund the building of new datasets and ultimately offer them as a public resource. By employing large-scale consortia to build and analyse datasets, existing academic and end-user knowledge is combined with leading ‘omics capabilities to create distinctive sample collections of national and international importance. The Bioplatforms Australia Data Portal, built on open-source, best-of-breed technology deployed by numerous governments and organisations worldwide [12], enables effective management of, and access to, these valuable data resources to ensure their enduring value.

REFERENCES

  1. Bioplatforms Australia. Available from: http://www.bioplatforms.com/what-we-do/, accessed 4 June 2018.
  2. Australian Microbiome. Available from: https://data.bioplatforms.com/organization/about/australian-microbiome, accessed 4 June 2018.
  3. Bissett, A., et al. Introducing BASE: the Biomes of Australian Soil Environments soil microbial diversity database. GigaScience, 2016. 5(1): p. 21.
  4. Oz Mammals Genomics Initiative. Available from: https://data.bioplatforms.com/organization/about/bpa-omg, accessed 3 June 2018.
  5. Antibiotic Resistant Sepsis Pathogens. Available from: https://data.bioplatforms.com/organization/about/bpa-sepsis, accessed 3 June 2018.
  6. Wilkinson, M.D., et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data, 2016. 3: p. 160018.
  7. CKAN. Available from: https://ckan.org/, accessed 4 June 2018.
  8. Australian Government Data Portal. Available from: https://data.gov.au/, accessed June 3 2018.
  9. Western Australian Government Data Portal. Available from: https://data.wa.gov.au/, accessed June 3 2018.
  10. Natural History Museum Data Portal. Available from: http://data.nhm.ac.uk/, accessed June 3 2018.
  11. Bioplatforms Australia. Available from: https://github.com/BioplatformsAustralia, accessed on June 3 2018.
  12. CKAN. Available from: https://ckan.org/about/instances/, accessed on June 4 2018.

Biography:

Grahame Bowland is a Software Developer at the  Centre for Comparative Genomics at Murdoch University. Grahame is a senior member of the software development team which develops, deploys, and maintains eResearch software solutions such as the Bioplatforms Australia Data Portal, electronic biobank solutions, and disease registries.

Dr Kathryn Napier recently joined the Data Science team at the Curtin Institute for Computation at Curtin University. Kathryn previously worked at the Centre for Comparative Genomics at Murdoch University as a Research Associate in the areas of Bioinformatics and Health Informatics. Kathryn worked with the CCG’s software development team who develop and deploy eResearch software solutions such as disease and patient registries and the Bioplatforms Australia Data Portal.

Archives in Flight: Delivering Australian Archives to Researchers

Associate Professor Gavan McCarthy1, Mr Peter Tonoli2

1eScholarship Research Centre, University of Melbourne, Parkville, Australia, gavanjm@unimelb.edu.au

2eScholarship Research Centre, University of Melbourne, Parkville, Australia, peterct@unimelb.edu.au

 

The eScholarship Research Centre (ESRC) at the University of Melbourne has linked its archival services with AARNet’s storage service CloudStor by using the FileSender API. As the digital transformation of research expands, research support and collection services are evolving through integration with national research infrastructure. Rather than attempting to provide access to privacy- and rights-compromised materials online, the ESRC is working to deliver these types of unpublishable materials directly to the researcher under explicitly articulated conditions. The service is designed to allow archival collection materials to be delivered, on request via an online form, to researchers anywhere in the world. A policy framework (where the user supplies information and agrees to the conditions by click-through), combined with the file sending and notification functions in CloudStor, supports the utility and efficacy interests of the ESRC and the research community.

A researcher searching the web (the ESRC finding aids and item descriptions are indexed by search engines) will locate the archival item they want (one that has defined conditions of use), click the link, which takes them to the online form to complete, and click deliver. The process, called the Digital Archives Delivery Service (DADS), is envisaged as an extensible web service and operates as follows.

  1. The researcher’s request conveys to DADS the key metadata describing the materials, the associated obligations and the location of the material for transfer.
  2. In response to the request the DADS then batches and loads the files into CloudStor.
  3. A notification is sent through the FileSender function in CloudStor to the researcher, reiterating the obligations associated with holding a copy of the materials.
  4. The researcher, if happy to comply with those obligations, can then download the files.

The request-and-supply process happens in a matter of seconds, thanks to lightweight technical integration and because CloudStor operates on the national high-speed network provided by AARNet for research and education. The transaction is documented both as records (emails) and as data provided to the archive, enabling accountability and continuous reporting.

The DADS source code and documentation are located on GitHub (https://github.com/esrc-unimelb/DADS) for those interested in what happens technically through the use of the FileSender API, e.g. get/send commands. The ESRC has completed the first two proof-of-concept cycles with DADS and anticipates being able to report on the implementation phase in 2018.
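As a rough illustration of steps 2 and 3 of the workflow above, the sketch below outlines how DADS-style code might hand a batch of requested files to a FileSender-backed service and trigger the researcher notification. The base URL, endpoint path, payload fields and authentication shown are assumptions for illustration only; the actual calls are documented in the DADS repository and the FileSender API documentation.

```python
# Rough, hypothetical sketch of handing a batch of archival files to a
# FileSender-backed service (steps 2-3 of the DADS workflow). The base URL,
# endpoint path, payload fields and authentication are assumptions only.
import os
import requests

FILESENDER_API = "https://cloudstor.example.edu.au/filesender/rest.php"  # placeholder URL

def send_batch(file_paths, recipient_email, obligations_text, auth):
    """Create a transfer for the requested items and trigger the notification."""
    transfer = {
        "from": "archives@example.edu.au",            # hypothetical sender account
        "recipients": [recipient_email],
        "files": [{"name": os.path.basename(p), "size": os.path.getsize(p)}
                  for p in file_paths],
        "message": obligations_text,                  # reiterates the conditions of use
    }
    resp = requests.post(f"{FILESENDER_API}/transfers", json=transfer, auth=auth)
    resp.raise_for_status()
    return resp.json()
```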

This presentation will cover the journey of implementing DADS, from ideation and working with different perspectives in a multi-disciplinary environment, to creating a minimum viable product (MVP). The benefits of working in an agile manner and developing an MVP, as opposed to a ‘perfect product’, in the academic environment will also be discussed.


Biography:

Associate Professor Gavan McCarthy is Director of the University of Melbourne eScholarship Research Centre in the University Library founded in 2007. His research is in the discipline of social and cultural informatics with expertise in archival science and a long-standing interest in the history of Australian science. He contributes to research in information infrastructure development within the University and his projects highlight strong engagement with community. His distinctive cross-disciplinary research reaches into other fields such as education, social work, linguistics, anthropology, population health and history. He re-examines theoretical foundations and tests new theories through practical interventions with a focus on public knowledge domains, contextual information frameworks and knowledge archives.
https://orcid.org/0000-0001-8411-3173

Peter Tonoli is an Information Technology Consultant at the eScholarship Research Centre University of Melbourne, with a Masters Degree in Information Technology Management majoring in Information Security, Graduate level qualifications in Project Management and Management, and industry certifications in IT Project Management, Information Security and IT Service Management. He is a freelance information security, digital rights and online privacy consultant, servicing individuals and organisations across the government, media, community, non-profit and medical research sectors. His involvement in information technology security and management dates back to the establishment of the Internet in Australia. He has a long history of voluntary community involvement including holding committee positions on various IT industry peak bodies. Peter is currently a Board Member of Electronic Frontiers Australia, and a Non-executive Director of Internet Australia
https://orcid.org/0000-0003-1164-5632

How to Choose the ‘Right’ Repository for Research Data

Shawn A Ross1, Steven McEachern2, Peter Sefton3, Brian Ballsun-Stanton4

1Macquarie University, Sydney, Australia, shawn.ross@mq.edu.au

2Australian National University, Canberra, Australia, steven.mceachern@anu.edu.au

3University of Technology Sydney, Australia, peter.sefton@uts.edu.au

4Macquarie University, Sydney, Australia, brian.ballsun-stanton@mq.edu.au

DESCRIPTION

In Australia, multi-institutional, domain-specific information infrastructure projects, such as those funded through the Australian National eResearch Collaboration Tools and Resources (NeCTAR) program, are typically open-source software (OSS). National infrastructure such as AARNet’s CloudStor, built on ownCloud, is also OSS. Even publications repositories and data management planning tools are often OSS (DSpace, DMPonline, DMPTool, ReDBox, etc.). The trend in institutional research data repository software among institutions that prefer not to build an in-house solution, however, appears to favour proprietary software (e.g., Figshare). In comparison to Europe and North America, OSS repository platforms (e.g., Dataverse, CKAN) are much less popular in Australia. Dataverse, for example, has 33 installations on five continents containing 76 institutional ‘Dataverses’ (some installations house more than one), but Australia has only one installation or institutional Dataverse (the Australian Data Archive) [1]. By contrast, Figshare has been or is being implemented by at least five Australian universities [2], with others actively considering it.

This BoF session examines the reasons why institutions choose proprietary versus OSS for research data infrastructure. We compare the practical advantages, disadvantages, and considerations around each approach. We propose for discussion the idea that the advantages of proprietary software are overstated, as is the burden of implementing and administering OSS. For example, costs such as requirements analysis, systems integration, engagement, outreach, and training – which together likely account for the majority of a software project’s budget – are similar whether the software is proprietary or OSS. The cost of deploying and maintaining modern OSS platforms, facilitated by approaches like containerisation and automation, is lower than in the past, and SaaS options for OSS are sometimes overlooked. Proprietary software, moreover, is not always an ‘out-of-the-box’ turn-key solution at universities, especially for specialised research software (as opposed to commodity software); as such, it may require the creation of separate but interoperable systems to fill gaps in capacity, dramatically raising costs. Conversely, the flexibility and capabilities of an OSS solution are often neglected: if a feature is missing or inadequate, it can be built (often with support from the community) and made available for reuse, without having to work around the edges of a proprietary system. The Australian Data Archive, for example, has added significant new features to Dataverse to support mediated access to sensitive data, which are available to other users. However, a deeper exploration of the trade-offs and demands of both approaches in the context of specialised academic software is warranted. The focus of the discussion will be practical, but it may extend to the potential impact of various software business models on research data, a core output and asset of universities.

The session will be 60 minutes in duration. It will include brief presentations by the organisers based on their experience, followed by open discussion. Audience participation is essential – we encourage a candid exchange of experience with either proprietary or OSS research data management at various institutions, so that we can learn from each other’s successes and challenges. The outcome will be information to guide decision making around repository platform procurement at universities.

REFERENCES

  1. The Dataverse Project. Available from: https://dataverse.org/ accessed 22 June 2018. See also https://dataverse.org/metrics accessed 22 June 2018.
  2. The University of Adelaide. Available from: https://www.adelaide.edu.au/figshare/. The University of Melbourne. Available from: https://melbourne.figshare.com/. Monash University. Available from: https://monash.figshare.com/. La Trobe University. Available from: https://latrobe.figshare.com/. Federation University Australia (planned). Available from: https://federation.edu.au/staff/governance/projects/current-projects.

Biographies:

Shawn Ross (Ph.D. University of Washington, 2001) is Associate Professor of History and Archaeology and Director of Data Science and eResearch at Macquarie University. A/Prof Ross's research interests include the history and archaeology of pre-Classical Greece and the Balkans, and the application of information technology to research. He supervises a large-scale landscape archaeology and palaeo-environmental study in central and southeast Bulgaria. Since 2012, he has also directed a large information infrastructure project developing data capture and management systems for field research. Previously, A/Prof Ross worked at the University of New South Wales (Sydney, Australia) and William Paterson University (Wayne, New Jersey).

Steve McEachern is Director and Manager of the Australian Data Archive at the Australian National University, where he is responsible for the daily operations and technical and strategic development of the data archive. He has high-level expertise in survey methodology and data archiving, and has been actively involved in development and application of survey research methodology and technologies over 15 years in the Australian university sector. Steve holds a PhD in industrial relations from Deakin University, as well as a Graduate Diploma in Management Information Systems from Deakin University, and a Bachelor of Commerce with Honours from Monash University. He has research interests in data management and archiving, community and social attitude surveys, organisational surveys, new data collection methods including web and mobile phone survey techniques, and reproducible research methods. Steve has been involved in various professional associations in survey research and data archiving over the last 10 years.

Peter Sefton is the Manager, eResearch Support at the University of Technology, Sydney (UTS). Before that he was in a similar role at the University of Western Sydney (UWS). Previously he ran the Software Research and Development Laboratory at the Australian Digital Futures Institute at the University of Southern Queensland. Following a PhD in computational linguistics in the mid-nineties, he has gained extensive experience in the higher education sector leading the development of IT and business systems to support both learning and research. At UTS Peter leads a team working with key stakeholders to implement university-wide eResearch infrastructure, including an institutional data repository, as well as collaborating widely with research communities at the institution on specific research challenges. His research interests include repositories, digital libraries, and the use of the Web in scholarly communication.

Brian Ballsun-Stanton is Solutions Architect (Digital Humanities) for the Macquarie University Faculty of Arts with a PhD from UNSW in Philosophy. He is working with researchers from across Australia to deploy digital technologies and workflows for their research projects. He has developed a new methodology (The Social Data Flow Network) to explore how individuals in the field understand the nature of data. Brian’s current research interests are in exploring the Philosophy of Science’s interactions with the Open Data movement, and building tools for rapid analysis and bulk manipulation of large ancient history corpora.

Soup to Nuts: A Panel Discussion on Implementing a Research Life Cycle in the Design of Research Services

Gavin Kennedy1, Vicki Picasso2, Dr Cameron McLean3, Louise Wheeler4, Siobhann McCafferty5

1Queensland Cyber Infrastructure Foundation, Brisbane, Australia, gavin.kennedy@qcif.edu.au

2University of Newcastle, Newcastle, Australia, vicki.picasso@newcastle.edu.au

3University of Auckland, Auckland, New Zealand, ca.mclean@auckland.ac.nz

4University of Technology Sydney, Sydney, Australia, louise.wheeler@uts.edu.au  

5Australian Access Federation and Data Life Cycle Framework, Brisbane, Australia, siobhann.mccafferty@aaf.edu.au

 

This is a joint presentation and panel discussion that envisions an implementation of a “Research Life Cycle Framework” based on the capture of minimal research activity descriptions, the integration of multiple information systems and repositories, and the provision and orchestration of research services. With a ‘researcher-centric’ design, such a platform would provide users with a complete set of services from research planning (the soup), through research execution, to reporting, publishing and archiving (the nuts). It would provide a consistent access point that supports the basic actions of research and inquiry, while simultaneously providing better management and reporting tools – resulting in improved quality, integrity, impact, and effectiveness/efficiency of research. That is, it works for all the stakeholders in the research ecosystem – a natural benefit of adopting a systemic/holistic life cycle view.

This is more than a thought experiment or an aspirational vision, as this panel has come together through the realisation that the working components of such a platform already exist: the University of Auckland’s Research Hub(1) is an established actionable service directory; the University of Technology Sydney’s Provisioner(2) is a pluggable service catalogue framework; ReDBox and Mint(3) provide the integration tools, workflows and user workspaces to support the researcher; and DLCF RAiDs and Group IDs(4) provide an identity and resource sharing framework to tie multiple disparate services to different groups of people. Universities such as the University of Auckland, the University of Newcastle and the University of Technology Sydney are already considering how to develop this model and leverage these existing services. A further opportunity is a national-level platform through which national services can be provisioned, larger institutions can integrate at the level they require, and smaller institutions can benefit from accessing the complete spectrum of Research Life Cycle services “in the cloud”.

As a panel session this will provide the eResearch community the opportunity to reflect on this framework and assist the proposers to validate some of their assumptions. For example, a key question to be addressed through this forum is how do we ensure we are truly enabling research activities with the design of our services rather than just creating research management systems, or systems for managing disconnected fragments of the research process?

Intended audience

This panel will be of interest to national eResearch strategists, eResearch managers, university research office managers, data librarians and technical staff. It will describe how the architecture of the proposed platform supports the research life cycle and how it could integrate with institutional systems. It does not assume any technical knowledge, so we encourage developers, administrators and strategists to attend.

REFERENCES

  1. Introducing Provisioner: https://eresearch.uts.edu.au/2018/04/05/provisioner_1.htm Accessed 22 June, 2018
  2. http://www.redboxresearchdata.com.au Accessed 22 June, 2018
  3. https://www.dlc.edu.au/ Accessed 22 June, 2018

Biographies:

Gavin Kennedy (https://orcid.org/0000-0003-3910-0474) is the manager of Data Innovation Services at Queensland Cyber Infrastructure Foundation and leads the team responsible for ReDBox development and support.

Vicki Picasso (https://orcid.org/0000-0001-9422-5021) is Senior Librarian, Research Support Services at the University of Newcastle.

Dr Cameron McLean (https://orcid.org/0000-0002-9836-3824) is a Research and Education Enabler and Research IT Specialist at the Centre for eResearch, University of Auckland.

Louise Wheeler manages the research integrity environment for academics at UTS. This includes the development and implementation of policy, education and training, and the procedural framework for upholding responsible research practices. Louise’s role also extends to the investigation of integrity breaches, as required under the Australian Code for the Responsible Conduct of Research.

Siobhann McCafferty (https://orcid.org/0000-0002-2491-0995) is Project Manager for the Data Life Cycle Framework project, supported by ARDC, AARNet and the Australian Access Federation.

A framework for integrated research data management, with services for planning, provisioning research storage and applications and describing and packaging research data

Michael Lynch1, Peter Sefton2, Sharyn Wise3

1University of Technology Sydney, Michael.Lynch@uts.edu.au

2University of Technology Sydney, Peter.Sefton@uts.edu.au

3University of Technology Sydney, Sharyn.Wise@uts.edu.au

 

Research data management is critical for the integrity of scholarship, for the ability of researchers and institutions to re-use and share data, and for IT support staff and data librarians to be able to plan, maintain and curate data collections. It is also time-consuming and daunting for researchers, and runs the risk of becoming yet another bureaucratic hurdle to research work.

Provisioner is an open framework for integrating research data management into research tools and workflows, allowing researchers to select applications from a service catalogue and create workspaces which are linked to research data management plans, and supporting data archiving and publication.

This presentation will cover:

  • The ideas underlying the Provisioner framework, which is a loosely-coupled distributed system designed for high-resilience against inevitable organisational and technical change
  • ReDBox 2.0, the platform in which this work is implemented
  • Using DataCrates for integrated metadata and a file-based repository
  • A case study of a research data workflow from microscopes to data modelling and simulation to an immersive visualisation

The Provisioner

The Provisioner is based on two key ideas. First, the idea of a “workspace” is used to define a limited set of basic operations – create, share, import and export – which can be executed via APIs on a wide range of research data applications and storage services. Second, redundant, machine- and human-readable metadata stored alongside the data is used to build a system that is not a monolith, in which identifying the creator, owner and funding agency of a dataset would depend on a centralised database.
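One way to read the “workspace” idea is as a small adapter interface that every connected application or storage service implements behind its own API. The sketch below is illustrative only; the class and method names are ours, not the Provisioner API.

```python
# Hedged sketch of the "workspace" idea: each connected research application
# or storage service exposes the same small set of operations behind its own
# API. Class and method names are illustrative, not the Provisioner API.
from abc import ABC, abstractmethod

class WorkspaceProvider(ABC):
    """Adapter for one application or storage service (e.g. GitLab, OMERO)."""

    @abstractmethod
    def create(self, plan_id: str, name: str) -> str:
        """Provision a workspace linked to a research data management plan."""

    @abstractmethod
    def share(self, workspace_id: str, user: str) -> None:
        """Grant a collaborator access to the workspace."""

    @abstractmethod
    def import_data(self, workspace_id: str, src_path: str) -> None:
        """Load existing data or metadata into the workspace."""

    @abstractmethod
    def export(self, workspace_id: str, dest_path: str) -> str:
        """Package the workspace contents (e.g. as a DataCrate) for archiving."""
```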

The Provisioner framework allows us to connect diverse research applications using DataCrates as a common interchange format and manage automated pipelines of data management tasks such as exporting data, crosswalking metadata and requesting DOIs from minting services.

ReDBox 2.0

The principal way in which researchers interact with the Provisioner is through the service catalogue in the University’s research data management tool and data catalogue, Stash, which is implemented in ReDBox.

As part of the Provisioner project, ReDBox has been redeveloped as a modern web application and now includes a service catalogue from which researchers can select workspaces in a range of research applications, from OMERO (an open microscopy environment) and GitLab (for maintaining and publishing software as a research output) to research fileshares.

DataCrates

Provisioner uses the DataCrate standard to store metadata in human- and machine-readable formats with the accompanying datasets at different stages of the research life-cycle, from creation and analysis through to archiving and publication. DataCrates are directories on a filesystem with a conventional layout based on the BagIt standard and linking to contextual metadata with JSON-LD, and are suitable for both archiving and publication.
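A hedged sketch of what such a directory might contain, assuming a BagIt-style payload and a CATALOG.json JSON-LD description; the context URL, dataset details and file placement shown here are illustrative, and the DataCrate specification defines the exact requirements.

```python
# Hedged sketch: write a minimal DataCrate-style package. The BagIt payload
# layout and the CATALOG.json name follow the DataCrate convention; the
# JSON-LD context, dataset details and file placement are illustrative only -
# consult the DataCrate specification for the exact requirements.
import json
from pathlib import Path

crate = Path("example_crate")
(crate / "data").mkdir(parents=True, exist_ok=True)

# BagIt declaration and payload (real DataCrates also carry checksum manifests).
(crate / "bagit.txt").write_text(
    "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
(crate / "data" / "observations.csv").write_text("time,count\n0,12\n")

catalog = {
    "@context": "https://schema.org/",   # illustrative; the spec defines its own context
    "@graph": [{
        "@type": "Dataset",
        "name": "Bacterial motility video microscopy (example)",
        "creator": {"@type": "Person", "name": "A. Researcher"},
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "hasPart": [{"@type": "File", "@id": "data/observations.csv"}],
    }],
}
(crate / "CATALOG.json").write_text(json.dumps(catalog, indent=2))
```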

Case Study

We present a case study of a research data workflow which starts with video microscopy of bacteria, through to a code repository and mathematical modelling of bacterial movement, to 3D visualization of simulated bacterial movement in the UTS Data Arena.

References

  1. Lynch, M. “Introducing Provisioner”, https://eresearch.uts.edu.au/2018/04/05/provisioner_1.htm, 2018, accessed 21 June 2018.
  2. Sefton, P.  “DataCrate: Formalising ways of packaging research data for re-use and dissemination”, Presentation, eResearch Australasia 2017,  https://conference.eresearch.edu.au/2017/08/datacrate-formalising-ways-of-packaging-research-data-for-re-use-and-dissemination/, accessed 22 June 2018.

Biography:

Mike Lynch is an eResearch Analyst in the eResearch Support Group at UTS. His work involves solution design, information architecture and software development supporting research data management. His other interests include data visualisation and functional programming languages.

Peter Sefton is the Manager, eResearch Support at the University of Technology, Sydney (UTS).

At UTS Peter is leading a team which is working with key stakeholders to implement university-wide eResearch infrastructure, including an institutional data repository, as well as collaborating widely with research communities at the institution on specific research challenges. His research interests include repositories, digital libraries, and the use of The Web in scholarly communication.

Implementing a Model for Integrated Research Data Services Management

Gavin Kennedy1, Andrew Brazzatti1, Shilo Banihit1, Matthew Mullholland1, Andrew White1, Peter Sefton2

1Data Innovation Services, Queensland Cyber Infrastructure Foundation, Brisbane, Australia, gavin.kennedy@qcif.edu.au, andrew@redboxresearchdata.com.au, andrew.white@qcif.edu.au, shilo@redboxresearchdata.com.au, matt@redboxresearchdata.com.au

2eResearch Services, University of Technology Sydney, Sydney, Australia, Peter.Sefton@uts.edu.au

 

Summary

In this presentation we will demonstrate how the integration capabilities of ReDBox 2.0 [2] support the whole of the data life cycle, taking a research- and researcher-centric focus: from research activity initiation and planning, to infrastructure resourcing, to publication and discovery, and finally to the archiving of research outputs.

ABSTRACT

Research data and other non-traditional research outputs are increasingly being treated as assets as valuable as publications. This recognition, combined with the overwhelming shift to born-digital research data, has helped drive development and maturity within the eResearch industry. In particular, the research data repository has moved from an eResearch cottage industry to a significant commercial activity that could rival the USD 25 billion [1] academic publishing market. However, while research data managers are spoilt for choice in platforms to manage activities, resources and assets at various stages of the research data life cycle, few options have emerged for integrating these multiple platforms to provide true end-to-end research data management capability. Using the established open-source ReDBox and Mint platforms, QCIF’s Data Innovation Services collaborated with the University of Technology Sydney’s eResearch Group and the ARDC Data Life Cycle Framework project to transform core ReDBox functionality from metadata store to service integrator. In this new model, ReDBox 2.0 can integrate and interoperate with existing institutional systems, including research management systems and academic repositories, and with shared service platforms such as national storage providers, to create an integrated research data services management platform. Taking advantage of the new ReDBox architecture, this platform uses a schema-free linked-data design that facilitates the transformation of metadata across standards and creates machine-readable, actionable microservices.

This presentation will be of interest to repository managers, data librarians and technical staff, as it will describe how the architecture of the platform supports research integrity throughout the research data life cycle, highlighting innovative features including the Mint namespace authority tool, integrated research data management planning, and an actionable service catalogue in which services are accessed through the workspace concept.

REFERENCES

  1. Ware, M. and Mabe, M., 2015. The STM Report: An Overview of Scientific and Scholarly Journal Publishing. https://www.stm-assoc.org/2015_02_20_STM_Report_2015.pdf. Accessed 22 June 2018.
  2. ReDBox Data Life Cycle. https://www.redboxresearchdata.com.au/rbdlc/. Accessed 22 June 2018.

Biographies:

Gavin Kennedy (https://orcid.org/0000-0003-3910-0474) is an IT research and solutions expert with over 30 years experience in ICT with the past 18 years in eResearch and ICT research. Gavin is the head of Data Innovation Services at the Queensland Cyber Infrastructure Foundation (QCIF), where he is responsible for business development as well as leading QCIF’s Software Engineering team, who are the key developers of ReDBox, the popular research data management and publishing platform.

Andrew Brazzatti is the Technical Lead and Solutions Architect with QCIF’s Data Innovation Services.

Describe, Manage and Discover Research Software

Dr Mingfang Wu1, Dr Jens Klump2, Ms Sue Cook2, Dr Carsten Friedrich2, Dr David Lescinsky3, Dr Lesley Wyborn4, Paola Petrelli5, Margie Smith3, Geoffrey Squire2

1 Australian Research Data Commons, mingfang.wu@ardc.edu.au

2 CSIRO, jens.klump@csiro.au, Sue.Cook@csiro.au, Carsten.Friedrich@data61.csiro.au

3 Geoscience Australia, David.Lescinsky@ga.gov.au

4 National Computational Infrastructure, lesley.wyborn@anu.edu.au

5CLEX, Centre of Excellence for Climate Extremes

 

DESCRIPTION

Software is pervasive in research. A UK research software survey of 1,000 randomly chosen researchers [2] shows that more than 90% of researchers acknowledge software is important for their own research, and about 70% say their research would not be possible without it. In a separate study, Carver et al. [3] examined 40 papers published in Nature from January to March 2016; 32 of them explicitly mentioned software. These surveys provide evidence that software plays an important role in research and hence should be treated in the same way as other research inputs and outputs that form part of the record of research, such as research data and publications. Most importantly, to enable research reproducibility, any software that underpins research should be discoverable and accessible.

Beyond making software discoverable and accessible, best practice in open source software also recommends choosing an open source licence that is compatible with third-party dependencies, to clarify the legal framework for reuse and distribution of the source code. Furthermore, the long-term sustainability of an open source project is supported by clear and transparent communication and processes describing how developers can contribute to the project and how these contributions are governed. It is important that the community (both developers and software users) is involved early in the software development process, to ensure that the developed software is more reusable and sustainable [4].

Current international initiatives working to make research software reproducible and reusable can be summarized in three areas:

  1. Open research and scholarly communication.  Working groups/projects (e.g. the FORCE11 Software Citation Implementation WG, the RDA Software Source Code Interest Group and the CodeMeta Project), repositories and catalogues (e.g. DataCite, Zenodo and Code Ocean), as well as publishers (e.g. Journal of Open Source Software, Nature, Elsevier), are setting up software dissemination, cataloguing, discovery and review processes.
  2. Sustainable software. Working towards Sustainable Software for Science (WSSSPE) and Research Software Sustainability Institutes in UK, US and elsewhere are encouraging, exchanging experiences or providing training courses for software development to ensure it is sustainable.
  3. Sustainable community. The Research Software Engineering Association (and its chapters) has been advocating career paths and funding for research software engineers. Parallel initiatives in communities such as the FORCE11 Software Citation Implementation Working Group, research groups and publishers, working on citation metrics and credit models for research software engineers, should ensure appropriate accreditation for contributions to software [1].

We propose a 60-minute BoF session that will mix presentations with round-table discussions. We will first provide an overview of international initiatives and activities across the three areas above, followed by three lightning talks on software description, curation and publishing workflows. The presentations will be followed by a round-table discussion on current practices and the barriers people face in managing and describing software. The outcome of this discussion will be actions for various software interest or working groups, including an Australian software citation interest group.
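As an illustration of the kind of machine-readable software description promoted by the CodeMeta Project mentioned above, the sketch below writes a minimal codemeta.json; the project details are invented, and the property names follow the CodeMeta 2.0 vocabulary of schema.org-derived terms.

```python
# Minimal sketch of a machine-readable software description using the CodeMeta
# vocabulary (codemeta.json). The project details are invented; the property
# names follow the CodeMeta 2.0 crosswalk of schema.org terms.
import json

codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "example-analysis-toolkit",
    "version": "1.2.0",
    "description": "Scripts used to produce the figures in the accompanying paper.",
    "author": [{"@type": "Person",
                "givenName": "Jane",
                "familyName": "Researcher",
                "@id": "https://orcid.org/0000-0000-0000-0000"}],
    "license": "https://spdx.org/licenses/Apache-2.0",
    "codeRepository": "https://github.com/example/example-analysis-toolkit",
    "programmingLanguage": "Python",
}

with open("codemeta.json", "w") as fh:
    json.dump(codemeta, fh, indent=2)
```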

This work is being done in partnership with the Earth Science Information Partners (ESIP) of the US, in particular the ESIP Software and Services Cluster. ESIP is supported by NASA, NOAA, USGS and 110+ member organizations.

REFERENCES

  1. Smith A. M., Katz D. S., Niemeyer K. E., FORCE11 Software Citation Working Group. (2016) Software Citation Principles. PeerJ Computer Science 2:e86. DOI:10.7717/peerj-cs.86.
  2. Hettrick, S., et al. (2014). UK Research Software Survey 2014 [Data set]. doi:10.5281/zenodo.14809
  3. Carver, J.C., Gesing, S., Katz, D. S., Ram, K., and Weber, N., (2018). Conceptualization of a US Research Software Sustainability Institute (URSSI), in Computing in Science & Engineering, vol. 20, no. 3, pp. 4-9, May./Jun. 2018. doi:10.1109/MCSE.2018.03221924
  4. Jiménez RC, Kuzak M, Alhamdoosh M et al., (2017). Four simple recommendations to encourage best practices in research software [version 1; referees: 3 approved]. F1000Research 2017, 6:876. doi:10.12688/f1000research.11407.1

Biographies:

Mingfang Wu is a senior business analyst at ANDS/Nectar/RDS.  https://orcid.org/0000-0003-1206-3431

Jens Klump is a geochemist by training and OCE Science Leader Earth Science Informatics in CSIRO Mineral Resources.  Follow him on Twitter as @snet_jklump.

Sue Cook is a Data Librarian with the Research Data Support team of CSIRO Information Management and Technology.

Carsten Friedrich is a Research Team Leader at CSIRO Data61.  At CSIRO he worked in a variety of areas including Cloud Computing, Cyber Security, Virtual Laboratories, and Scientific Software Registries.

David Lescinsky is currently the team lead of GA’s Informatics Team and is responsible for facilitating and managing GA’s eResearch projects.

Lesley Wyborn currently has a joint adjunct fellowship with NCI.  She is Chair of the Australian Academy of Science ‘Data for Science Committee’ and on the AGU Data Management Advisory Board and the Steering Committee of the AGU-led FAIR Data Publishing Project.
