Research Integrity and Ethics in the Cloud

Paul Wong1, Karen Mecoles2, Lien Le3, Gary Allen4, Jeff Christiansen5, Hamish Holewa6, Helen Morgan7, Nicholas Smale8

1 Australian National Data Service (ANDS), Canberra, Australia, paul.wong@ands.org.au

2 National eResearch Collaboration Tools and Resources (Nectar), Melbourne, Australia, karen.mecoles@nectar.org.au

3 Research Data Services (RDS), Brisbane, Australia, l.le2@uq.edu.au

4 Office for Research, Griffith University, Brisbane, Australia, g.allen@griffith.edu.au

5 Queensland Cyber Infrastructure Foundation (QCIF), Brisbane, Australia, j.christiansen@uq.edu.au

6 Quadrant, Brisbane, Australia, hholewa@quadrant.edu.au

7 Scholarly Communication and Repository Services, University of Queensland, Australia, helen.morgan@uq.edu.au

8 Research Ethics and Integrity, University of Melbourne, Melbourne, Australia, nicholas.smale@unimelb.edu.au

 

GENERAL INFORMATION

  • This is a half-day workshop.
  • It is a facilitated workshop that includes a panel drawn from research offices, eResearch service providers and research support staff.

DESCRIPTION

As cloud technologies become cheaper and more accessible, it is increasingly attractive and cost-effective to conduct research in the cloud, from the collection, processing, analysis and storage of data to the dissemination and sharing of research publications, software and data. However, using the cloud for research also raises a number of research integrity and ethics issues, some old and some new. As the Australian Code for the Responsible Conduct of Research is currently under review, this is a timely occasion to revisit the broad relationship between research integrity, ethics and the cloud.

In this half-day facilitated workshop, ANDS, Nectar and RDS will bring together practitioners from research offices, research support services and eResearch service providers to consider key research integrity and ethics issues around the use of the cloud for research. Over the course of the workshop, we’ll raise pertinent issues such as:

  • Privacy (of human data)
  • Confidentiality (of sensitive data)
  • Utility (ease of access driving research efficiency)
  • Shareability (data discoverability and accessibility allowing reuse)
  • Stability (long term accessibility)
  • Copyright, licensing and ownership
  • Jurisdictional and policy issues across institutional, state and national boundaries

The structure of the workshop will include:

  1. Perspectives on the relationship between research integrity, ethics and the cloud – presentations by panel: 5 x 12 minutes
  2. Facilitated panel discussions: 30 minutes
  3. Q&A from audience to panel: 30 minutes
  4. Scenario based breakout group work: 60 minutes

WHO SHOULD ATTEND

This is a workshop designed for researchers, eResearch service providers and research support staff to raise awareness of key research integrity and ethics issues around the use of cloud based technologies for research.

WHAT TO BRING

Bring a laptop or mobile device, as there’ll be a breakout session involving group work based on several case scenarios.


Biographies

Paul Wong is Senior Data Management Specialist with ANDS. In the last 18 months, Paul has been working closely with the ARC, NHMRC and Research Integrity Offices around Australia in delivering a national workshop series at the intersection between research data management and research integrity. Paul was formerly Director of the Office of Research Excellence at the ANU.

Karen Mecoles is Communication and Project Coordinator with NeCTAR. With a varied background in humanities, teaching and IT, Karen has held many roles in the University sector including academic, student advisor and IT developer. Karen has been a member of the NeCTAR team for over 6 years.

Lien Le is currently Deputy Director of RDS and responsible for the technical and strategic directions of RDS domain projects. Lien was previously the Senior Bioinformatics Team Leader at the Research Computing Centre, University of Queensland. She formed part of the Queensland EMBL Bioinformatics Resource node in conjunction with the hub at the University of Melbourne to gather bioinformatics expertise. She instigated and led the successful setup of an Australian based data repository.

Archiving, Finding and Accessing Data for Secondary Use in the Social, Behavioural and Economic Sciences: An Introduction to the Australian Data Archive

Steven McEachern1

1 Australian National University, Acton, ACT, Australia, steven.mceachern@anu.edu.au

 

 

DESCRIPTION

This workshop will provide an introduction to standards and practices for managing, storing and disseminating research and administrative data in the social sciences and related disciplines in Australia. The workshop will provide an overview of current data management practice in Australia and internationally, and discussion of recent Australian developments in data sharing and open data.

Topics to be covered will include:

  1. Data Archiving in the Social Sciences (45 minutes)

– What is data management and data archiving?

– Data sharing policies in Australia: ARC, NHMRC and government data

– Standards for data archiving: OAIS, DDI and beyond

  2. Archiving and disseminating data with the Australian Data Archive (60 minutes)

– Managing active projects

– Archiving completed projects

– Producing data

– Metadata and documentation

Coffee break (15 minutes)

  3. Accessing data in the social sciences (60 minutes)

– Finding data

– Accessing data

– Analysing data

– Data formats and storage practices

Question and Answer session (15 minutes)

– Follow up, queries and issues

WHO SHOULD ATTEND

The workshop will be of interest to those researchers with responsibilities for management of quantitative and qualitative research projects, staff from government agencies interested in disseminating data, and others interested in data access methods for sensitive data.

WHAT TO BRING

Attendees will need to bring a laptop with a web browser installed. No particular skills or knowledge are required for this workshop.

 


Biography

Dr. Steven McEachern is Director and Manager of the Australian Data Archive at the Australian National University, where he is responsible for the daily operations and technical and strategic development of the archive. He has high-level expertise in survey methodology and data archiving, and for over fifteen years has been actively involved in the development and application of survey research methodology and technologies in the Australian university sector.

Facilitating national data services discovery

Dr Adrian Burton1, Mr Joel Benn2, Ms Catherine Brady3

1 Australian National Data Service, Canberra, Australia, adrian.burton@ands.org.au

2 Australian National Data Service, Canberra, Australia, joel.benn@ands.org.au

3 Australian National Data Service, Canberra, Australia, catherine.brady@ands.org.au

OVERVIEW

Services have become an integral part of the research domain. They provide automated functions for the creation, access, processing and analysis of data. The development of data-focused services is steadily increasing in Australia; however, discovering that these services exist is often challenging for the end consumer. ANDS is addressing this challenge by expanding the scope of our national discovery portal, Research Data Australia, to support both human and machine-to-machine discovery of data services. Through this expansion, Research Data Australia will be able to improve the visibility and discoverability of a broad range of services across NCRIS facilities, the science agencies, and the university research sector.

ISSUES

The primary focus of Research Data Australia has been the discovery of data collections. Information about associated parties and activities gives context to the collections, and information about services links data to the services that can be used to act upon or access the data described in collection records. That is, it is a data-centric environment.

Increasingly, however, Research Data Australia’s contributors are publishing their data through services. Computing facilities, like those provided by Nectar’s Virtual Labs, provide services for data processing and visualization. Data consumers (human or machine) may seek services to access relevant data, or look for services and platforms they can use to process the data they have on hand. A data collector may wish to find out whether specific data has already been collected by others, to avoid collecting the same data again.

Currently, no comprehensive research data services registry (or catalogue) exists in the Australian research ecosystem. There is a risk, then, of duplicating development and under-using services and data. The proposed extension to the functionality provided by Research Data Australia seeks to fill that gap. It will help to promote use and reuse by making it easy to discover and access the right services for a data access or processing task.

SOLUTION

As demand for finding and utilising data services is emerging from the research sector (that is, separately from data collection discovery), ANDS is responding by examining how that functionality may be integrated and delivered in a national service to meet those needs.

ANDS has a natural role in cross-sector data infrastructure especially across NCRIS facilities, science agencies, universities, and other publicly funded research agencies, and as such is taking the lead on developing a registry for data services, including machine-to-machine services, as an extension to the existing functionality provided by Research Data Australia.

To be a valuable national service which is research domain agnostic, the Research Data Australia Registry needs to provide broad coverage of services across the Australian research sector. With service providers and consumers using a variety of methods and standards to describe, expose and consume services, the Registry aims to support both the ingest and publication of service descriptions across a variety of protocols and metadata formats.

To limit the scope of the initial phase of the project, ANDS chose to focus on Open Geospatial Consortium (OGC) data access services. These services have broad application in the research sector, are standardised through the International Organization for Standardization (ISO), and are relatively mature.

ANDS completed an environmental scan of current practice around OGC data service provision and consumption in the Australian research environment and established pathfinder projects with a small set of service providers and consumers to elicit requirements. Through these activities, it became apparent that the implementation of an OGC Catalog Service for the Web (CSW) component, which is commonly understood and used by both OGC service consumers and providers, would be a valuable extension to the Research Data Australia Registry. With the addition of this functionality, ANDS can harvest data and service descriptions from other OGC catalogues and services, using OGC protocols. This has enabled the discovery of services through the Research Data Australia interface as well as through the machine-to-machine protocols, Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and CSW.
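As an illustration of what machine-to-machine discovery over these two protocols looks like from the consumer side, the following minimal sketch builds the request URLs a harvester might issue. The query parameters are the standard ones defined by OAI-PMH 2.0 and CSW 2.0.2; the function names and the example.org endpoints are hypothetical placeholders and are not the Research Data Australia endpoints.

```python
from urllib.parse import urlencode


def oai_pmh_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional selective harvesting of a single set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def csw_get_records_url(base_url, start_position=1, max_records=10):
    """Build a CSW 2.0.2 GetRecords (key-value pair) request URL."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "summary",
        "resultType": "results",
        "startPosition": start_position,
        "maxRecords": max_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoints, used only to show the shape of the requests.
oai_url = oai_pmh_list_records_url("https://example.org/oai")
csw_url = csw_get_records_url("https://example.org/csw")
print(oai_url)
print(csw_url)
```

A harvester would page through the results by resubmitting with the OAI-PMH resumption token, or by advancing `startPosition` in the CSW request.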

PRESENTATION OUTLINE

This presentation will cover:

  1. Progress to date
  2. System overview
  3. Lessons learned
  4. What’s next

CONCLUSION

Working with the data services community, ANDS has identified and responded to a national, sector-wide need for central information about available data services. Beginning with pathfinder projects to explore service provider and consumer needs, ANDS has delivered an extension to the Research Data Australia Registry that enables the discovery of OGC services through machine-to-machine services as well as the Research Data Australia portal.


BIOGRAPHY

Dr Adrian Burton is Director of Services at the Australian National Data Service (ANDS). In this capacity he has a keen interest in national services that enable data publication, data discovery and data citation as well as the human support services that build the capability of researchers and research organisations to take advantage of data infrastructure. Adrian has provided strategic input into several national infrastructure initiatives, including Towards an Australian Research Data Commons, The National eResearch Architecture Taskforce, and the Australian Research Data Infrastructure Committee. Adrian is active in building national policy frameworks to unlock the value in the research data outputs of publicly funded research.

Data Reviews Online: managing the proposal and review process for data repository submissions.

Mr Nicholas May1, Dr Ian Thomas2

1RMIT University, Melbourne, Australia, nicholas.may@rmit.edu.au

2RMIT University, Melbourne, Australia, ian.edward.thomas@rmit.edu.au

INTRODUCTION

The submission of datasets to open data repositories needs to be managed, to assure the quality of data and to ensure that only appropriate datasets are accepted. However, this review process can be both onerous and inefficient. Therefore, a system that supports and manages the submission and review of proposals would speed the review of proposals and promote the openness of data repositories. Data Reviews Online is a step in this direction.

BACKGROUND

At RMIT University, the eResearch Office and the Library have been providing resources to researchers for the promotion, publication, and sharing of their datasets, through a merit based allocation of resources called the Research Data Grants Program (RDGP). In this program, the selection of proposals is based on two factors: significance and strategic alignment. The significance of the dataset is assessed on criteria defined by Russell and Winkworth [1], such as completeness, rarity, research potential, and artistic merit. The alignment is assessed against RMIT University’s research strategy, which is embodied in its Enabling Capability Platforms (ECP) [2]. Given these diverse criteria, identifying reviewers with the appropriate expertise is essential.

The proposal submission and the ingestion planning for the RDGP have been partially implemented, via a Google form and a Microsoft Word document, respectively. Although the initial number of submissions was low, the overall management of this process was manual and was found to be onerous, especially the review and selection processes. Hence a system that manages the submission and review process would be a boon for the RDGP and would allow the program to expand the number of calls issued and submissions processed. In addition, it could attract further deployments by the wider eResearch community, since this sort of program is expected to grow across Australia and around the globe.

The review of submissions into data repositories is important to maintain quality and focus. Lawrence et al. [3] surveyed example procedures and proposed a generic checklist for data reviews, which included: quality of the data and its metadata, availability and access of the data, the reliability of the data source, and the potential user community. Whilst this checklist is more complete than the review criteria established for the RDGP, some criteria are not required. For instance, the quality of metadata is not an issue for proposals to the RDGP, because resources are specifically provided to help with the extraction and refinement of metadata for the selected datasets.

There is currently no generic solution for the submission and review of data repository proposals. A system to manage calls, submissions, and peer review of conference papers, called EasyChair [4], has been available since 2002, and there are solutions for specific repositories (several examples are provided by Lawrence et al. [3, p12]). But no generic solution comparable to EasyChair exists for research data. A generic solution would allow repository owners to expand, or restrict, the community of available data sources and reviewers, and to match them based on multiple criteria, such as Field of Research (FoR) classification [5]. However, the review requirements of specific domains, such as required metadata formats and the automatic ingestion and verification of the data, could not be accommodated, given the need to support a wide range of repositories.

THE SYSTEM: DAREON

The initial development goals were to establish the framework for the overall system, including the development infrastructure and core functionality. In addition, we aimed to lay the foundation for an open-source project, through a deployment and testing infrastructure, and thus to support the ongoing development of the platform using established best practices for community driven open-source software.

The result, Data Reviews Online (Dareon) [6], is a web-based application that assists in the process of submission and review of proposals for the inclusion of datasets into a data repository. It helps with the management of calls for proposals and the associated proposal review process. An initial, high-level use case diagram is shown in Figure 1. This shows the three main user roles: Repository Owner, Dataset Owner, and Domain Reviewer. The screenshots provided show sample details for:

  1. a repository [Figure 2],
  2. a call for proposals [Figure 3],
  3. and a proposal [Figure 4].

Figure 2: a sample Repository

Figure 3: a sample Call for Proposals

Figure 4: a sample Proposal

A feature of this system is the ability to classify repositories, datasets, and reviewers using multiple, concurrent classification schemes. In the RDGP example, the classification schemes used are the significance criteria and ECP alignment for datasets, and the ECP alignment for reviewers. In the sample repository details [shown in Figure 2], the classification scheme used is the ANZSRC FoR codes [5]. In future development, these classifications will enable the smart matching of reviewers with datasets.
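To make the idea of matching across multiple, concurrent classification schemes concrete, the sketch below ranks reviewers by how many classification codes they share with a dataset. It is an illustration only and does not describe how Dareon is, or will be, implemented: the class and function names, the scoring rule (a simple count of shared codes), and the sample codes are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Classified:
    """A dataset or reviewer tagged under several classification schemes."""
    name: str
    # Maps a scheme name (e.g. "FoR", "ECP") to the codes assigned under it.
    codes: Dict[str, Set[str]] = field(default_factory=dict)


def overlap_score(a: Classified, b: Classified) -> int:
    """Count the codes shared by a and b, over the schemes both use."""
    shared_schemes = a.codes.keys() & b.codes.keys()
    return sum(len(a.codes[s] & b.codes[s]) for s in shared_schemes)


def rank_reviewers(dataset: Classified, reviewers: List[Classified]) -> List[str]:
    """Return names of reviewers with any overlap, best match first."""
    scored = [(overlap_score(dataset, r), r.name) for r in reviewers]
    # Sort by descending score, then by name so that ties are deterministic.
    return [name for score, name in sorted(scored, key=lambda t: (-t[0], t[1])) if score > 0]


# Hypothetical sample data; the codes are placeholders, not real assignments.
dataset = Classified("Sample dataset", {"FoR": {"0401", "0405"}, "ECP": {"ECP-A"}})
reviewers = [
    Classified("Reviewer A", {"FoR": {"0405"}, "ECP": {"ECP-A"}}),
    Classified("Reviewer B", {"FoR": {"0801"}}),
    Classified("Reviewer C", {"FoR": {"0401"}}),
]
ranking = rank_reviewers(dataset, reviewers)
print(ranking)
```

Keeping the codes grouped by scheme means a code under one scheme is never compared with a code under another, which matters when schemes are used concurrently.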

SUMMARY

The outcome of this project is a system, called Dareon [6], which manages the processes that govern soliciting proposals for the inclusion of new datasets into institutional research data repositories. The system oversees the submission of proposals and the review, selection, and ingestion planning processes, and supports the workflows for the three principal roles. The project has been established as an open-source platform that provides a generic solution and will support future development across the eResearch community.

REFERENCES

 


Biography

Nicholas May is a software developer in the eResearch Office of RMIT University. He has over twenty-eight years of varied experience in software engineering, across industries and domains, and holds Certified Professional status with the Australian Computer Society. His current role includes responsibility for promoting research data management across the research lifecycle. http://orcid.org/0000-0002-1298-1622

Survey of open data and research data in the Australian context via the CSIRO Knowledge Network

Dr Jonathan Yu1, Dr Simon Cox2, Mr Benjamin Leighton3, Mr Hendra Wijaya4, Mr Qifeng Bai5

1CSIRO Land and Water, Clayton, Victoria, Australia, jonathan.yu@csiro.au

2CSIRO Land and Water, Clayton, Victoria, Australia, simon.cox@csiro.au

3CSIRO Land and Water, Clayton, Victoria, Australia, ben.leighton@csiro.au

4CSIRO/Data61, North Ryde, NSW, Australia, hendra.wijaya@csiro.au

5CSIRO Land and Water, Black Mountain, ACT, Australia, qifeng.bai@csiro.au

ABSTRACT

Australia is currently ranked 2nd in the OKFN Global Open Data Index [1] and, since 2013, over 7000 datasets have been published through data.gov.au[2]. Increasing amounts of data are also being published through state based open data portals, such as data.nsw.gov.au[3] and data.vic.gov.au[4]. Recently, thematic or agency based data portals have been established, such as the Sharing and Enabling Environmental Data (SEED) data portal[5] and the NSW Office of Environment and Heritage (OEH) data portal[6]. Various NCRIS facilities also provide many data collections alongside these open government data initiatives, including AuScope, TERN, ALA, IMOS, and NCI, covering primarily earth and environmental science data. A number of institutional repositories also provide access to research datasets (for example, CSIRO’s Data Access Portal, and Research Data Australia through ANDS). Given the range of data being published online through the various government and NCRIS initiatives, a challenge is to understand the current state of the data landscape in Australia and to measure its complexity. This raises questions such as: how much data is available, how varied is it, is it interoperable, and which data is being used where?

Through the OzNome initiative[7], our team has been developing tools to (a) understand information infrastructures across Australia in greater detail and (b) enable researchers, industry and key partners to achieve productivity gains around their discovery, access and use of data. As part of this initiative, the CSIRO Knowledge Network (KN) provides a gateway to data across a wide range of initiatives in Australia. KN links to data held across multiple, heterogeneous data repositories. KN harvests, indexes, and registers each resource with KN identifiers, which enable improved linkages between data resources adding significant value to the information in the various source repositories and systems. KN currently provides search and discovery over 70k data collections and 175k spatial objects from 16 open government and research data repositories in Australia. Figure 1 shows a screenshot of analytics for information about data.gov.au and the top 100 datasets by keywords (available here: http://kn.csiro.au/about-dataset-list/data-gov-au).

Figure 1.  Top 100 datasets by keyword for datasets in data.gov.au

Using KN, we carried out a preliminary survey of open data and research data in the Australian context across 18 initiatives (9 CKAN, 2 Socrata, 2 Geonetwork instances, plus CSIRO DAP, eReefs, OzNome). These include federal, state and capital city data initiatives, NCRIS facilities and CSIRO. The data formats from these catalogues are quite heterogeneous, with CKAN portals providing the most straightforward source of information. Across the 9 CKAN instances, there were 182 different formats. Figure 2 shows the distribution of these formats and the long tail of less frequently published formats. Table 1 shows the top 5 data providers by number of data resources published (Table 1a) and the top 5 data formats by number of resources in CKAN based data portals (Table 1b).

Figure 2. Distribution of CKAN based data resource formats

Table 1: a) Top 5 CKAN resources per provider; b) Top 5 formats from CKAN instances

Data provider      No. Resources      Format    Count
data.gov.au        103212             HTML      43216
data.qld.gov.au    8505               PDF       25207
NSW OEH            7755               zip       12271
data.vic.gov.au    6673               WMS       9808
data.sa.gov.au     2871               CSV       9358
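A format tally like the one in Table 1b could be produced along the following lines. This is a minimal sketch and not the KN implementation: it assumes package records shaped like the `results` list returned by CKAN's `package_search` API action, where each package carries a `resources` list and each resource has a free-text `format` field. The function name `count_formats` and the sample records are hypothetical.

```python
from collections import Counter


def count_formats(packages):
    """Tally the declared format of every resource in a list of CKAN packages."""
    counts = Counter()
    for package in packages:
        for resource in package.get("resources", []):
            # The format field is free text and may be missing or empty.
            fmt = (resource.get("format") or "").strip()
            counts[fmt or "(unspecified)"] += 1
    return counts


# Hypothetical records standing in for a package_search response.
sample_packages = [
    {"name": "dataset-one", "resources": [{"format": "CSV"}, {"format": "PDF"}]},
    {"name": "dataset-two", "resources": [{"format": "CSV"}, {"format": ""}]},
    {"name": "dataset-three"},  # a package with no resources listed
]
format_counts = count_formats(sample_packages)
print(format_counts.most_common(5))
```

The sketch deliberately leaves the format labels as published, so "zip" and "ZIP" would be counted separately. Part of the long tail of formats may come from such labelling differences, and whether to fold them together is a design choice.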

Most open data portals provide semantic annotation through subject or keyword level metadata, but they use different labels, with values sourced from a variety of non-aligned subject vocabularies. For example, what CKAN calls “tags” (uncontrolled vocabulary), Geonetwork calls “subjects” (many from the GCMD keyword list) – see Table 2. It is unclear whether these keywords are sufficiently granular to describe the resource adequately.

Table 2: a) Top 5 CKAN tags; b) Top 5 Geonetwork Subject Keywords

CKAN tags              Count      Geonetwork subjects                              Count
“Earth Sciences”       15868      “environment”                                    346
“Oceans”               15591      “EARTH SCIENCES”                                 192
“GA Publication”       13636      “National Computational Infrastructure (NCI)”    168
“Ocean Temperature”    12052      “climatologyMeteorologyAtmosphere”               99
“Water Temperature”    11827      “ATMOSPHERIC SCIENCES”                           94
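The effect of non-aligned vocabularies can be seen by merging the two keyword lists in Table 2. The sketch below, which is illustrative only and not part of KN, folds case and whitespace before summing counts; the function names are hypothetical and the counts are those reported in Table 2.

```python
from collections import Counter


def normalise(keyword):
    """Fold case and collapse whitespace so trivially different labels match."""
    return " ".join(keyword.casefold().split())


def merged_keyword_counts(*sources):
    """Sum keyword counts from several catalogues after normalisation."""
    merged = Counter()
    for counts in sources:
        for keyword, n in counts.items():
            merged[normalise(keyword)] += n
    return merged


# Counts taken from Table 2.
ckan_tags = {
    "Earth Sciences": 15868,
    "Oceans": 15591,
    "GA Publication": 13636,
    "Ocean Temperature": 12052,
    "Water Temperature": 11827,
}
geonetwork_subjects = {
    "environment": 346,
    "EARTH SCIENCES": 192,
    "National Computational Infrastructure (NCI)": 168,
    "climatologyMeteorologyAtmosphere": 99,
    "ATMOSPHERIC SCIENCES": 94,
}
merged = merged_keyword_counts(ckan_tags, geonetwork_subjects)
print(merged["earth sciences"])
```

Case folding is enough to align “Earth Sciences” with “EARTH SCIENCES”, but it cannot relate “climatologyMeteorologyAtmosphere” to “ATMOSPHERIC SCIENCES”. Aligning those would require a mapping between the underlying vocabularies.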

While Australia is ranked highly on the open data scale, semantic annotation is relatively anarchic, and therefore indexing is incomplete. For this preliminary scan, a large proportion of the data came from environmental and earth sciences, but even here there is a large variation in formats (at least in the metadata itself). Future work is needed to gain better insight into the variety of data being published, and to develop tools for understanding how the data is actually used, which is not yet well understood.

REFERENCES

  1. OKFN Global Open Data Index, https://index.okfn.org/place, accessed 16 June 2017
  2. data.gov.au, http://data.gov.au/, accessed 16 June 2017
  3. data.nsw.gov.au, https://data.nsw.gov.au/, accessed 16 June 2017
  4. data.vic.gov.au, https://www.data.vic.gov.au, accessed 16 June 2017
  5. SEED data portal, https://www.seed.nsw.gov.au/, accessed 16 June 2017
  6. NSW OEH data portal, http://data.environment.nsw.gov.au/, accessed 16 June 2017
  7. OzNome initiative, https://research.csiro.au/oznome/, accessed 16 June 2017

BIOGRAPHY

Dr Jonathan Yu is an information specialist and is part of the Environmental Informatics group in CSIRO Land and Water. He’s currently leading and supporting the development of new approaches, methods and tools for transforming and connecting information flows across the environmental domain and the broader digital economy within Australia and internationally. His particular research interests range from understanding information supply chains in various environmental domains to developing new methods and tools for streamlining and enhancing interoperability between them. http://orcid.org/0000-0002-2237-0091

    About the conference

    eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.

    © 2018 - 2019 Conference Design Pty Ltd