TURNING BIG SHIPS (DATA) ON A DIME: CHANGE MANAGEMENT AND DATA SENTENCING

Convenor: Rhys Francis

Presenters: Mr Dave Connell1, Mrs Sandra Ennor2, Mr Nicholas McPhee4, Ms Jaye Weatherburn5

1Australian Antarctic Division, Kingston, Australia, dave.connell@aad.gov.au

2Monash University, Clayton, Australia, Sandra.Ennor@monash.edu

4Monash University, Clayton, Australia, Nicholas.McPhee@monash.edu

5University of Melbourne, Parkville, Australia, jaye.weatherburn@unimelb.edu.au

 

Achieving data management goals involves change management. For a university, a growing challenge is the shift from ‘data storage is cheap – use as much as you need’ to ‘some cold data storage options are cheap, but high-end data storage is not so cheap and it’s not unlimited. Therefore we need to assign some Terms and Conditions – especially instigating sentencing regimes’. These are not easy conversations and they often require a deep cultural change across the board.

The aim is for the convenor to open the session with a few key questions (5 mins), including:

  • What is sentencing and why is it needed at all?
  • What incentives and techniques can be used to help researchers become more aware of sentencing requirements for data and how this can assist them?
  • What are the most effective strategies to assist researchers in capturing metadata at the start of the process?
  • Before technical tools are implemented, what are some of the main issues in gaining traction for data to be effectively sentenced and managed over time?

The convenor will then introduce the three sets of speakers. Each speaker represents a different aspect of data management, and each stage has often involved difficult conversations.

1)    Dave Connell – Australian Antarctic Division – Commonwealth of Australia (10 mins) – How to easily capture metadata – a government perspective

Although the working environment of the federal government is different to that of a university, there are a number of similarities when it comes to managing data. First and foremost of those is the need to capture metadata. At the Australian Antarctic Division our remit is primarily the management of scientific data, and over the course of two decades we have experimented with several methods of metadata collection ranging from the unfortunately complex to the ludicrously simple. Further to that has been the need to bring about a change in cultural attitudes towards data management and archiving. This presentation will focus on what has and hasn’t worked with regard to metadata capture and how other extenuating factors have assisted in data management in the AAD.

2)    Sandra Ennor/Nick McPhee/Cath Nicholls – Monash University – eSolutions/MAWG/MeRC (10 mins) – Research Data Management and University Records Management – Reflections on what we have learnt and applied

Monash University (via its Monash Agent Working Group, a collaboration of Library, MeRC, eSolutions and Records and Archives) has been actively sentencing legacy electronic research data for well over a year now. The initial results have been positive, but there have been many lessons learnt during these early stages. This presentation will focus on two small case studies undertaken to date on two different sets of electronic research data. We will reflect on how well (or not) some of the traditional corporate records management activities (e.g. sentencing data) have translated across into managing research data, with a particular focus on the role of metadata, the triggers for applying sentencing actions, and the key communication and change management techniques being applied.

3)    Digital Preservation Strategy – University of Melbourne (10 mins)

Since 2016, the University of Melbourne (UoM) has been actively investing time and resources towards the establishment and implementation of digital preservation (“the series of managed activities necessary to ensure continued access to digital materials for as long as necessary” [1]) through a dedicated digital preservation project. A high-level strategy [2] and complementary roadmaps [3] have guided various project work to date, including training and skills framework development, infrastructure pilot projects, and cultural awareness improvements. Phase 1 of the UoM Digital Preservation Project concluded at the end of 2017, with much successful analysis to draw into planning a business case going forward. The business case for Phase 2 (the Implementation/Embedding project phase), including a draft preservation service architecture for the university, is currently under development.

Analysis work has shown that cultural change is essential, both to increase awareness of the importance of digital preservation and for organisations to invest in maintaining the value of digital materials over time. There is a clear need for digital preservation in different research disciplines, and for high-value data in active use. Active preservation processes must be initiated and maintained, especially for digital research material with complex dependencies (e.g. cloud, distributed data, proprietary software/hardware, complex copyright and IP), to ensure materials remain accessible and reproducible. Governance and data stewardship planning continue to be a key focus for 2018 to meet this cultural change requirement. The creation of a digital-preservation focused wiki to bring together disparate knowledge sources around curation for data is also a work in progress, as part of the collaborative cultural shift that is required for managing valuable digital assets over time. Iterative, agile approaches have been essential to drive both change and acceptance of the new processes and new capabilities that digital preservation brings to an organisation.

The goal is then to allow another 20 mins for open discussion with the audience to tackle some of the questions raised at the top. Although the three institutions represented above are tackling different elements of the data management issue, when the three are placed together there is a nice holistic vision of what might be an ideal future state for the larger institutions dealing with diverse and rapidly expanding research legacy data. In particular, three themes emerge: capturing the metadata well (and early); applying sentencing and actively moving and deleting data as sentences dictate; and having an active digital preservation strategy. However, while all of these themes are desirable, achieving buy-in from, and relevance to, the central audience (i.e. the research community itself) is not necessarily a given or easy. Our job is to make it easier for our researchers, but part of that involves having the hard chat. We hope this BoF will be an opportunity to share ideas on achieving some of these communication and change management outcomes.

REFERENCES

  1. Digital Preservation Coalition, Digital Preservation Handbook, Glossary: “Digital Preservation”, accessed 4 June 2018, https://www.dpconline.org/handbook/glossary#D
  2. University of Melbourne Digital Preservation Strategy 2015-2025 – Vision Mandate and Principles, accessed 4 June 2018, http://hdl.handle.net/11343/45135
  3. University of Melbourne Digital Preservation Strategy 2015-2025 – Implementation Roadmaps, accessed 4 June 2018, http://hdl.handle.net/11343/45136

Biographies:

Rhys Francis – Rhys spent the first decade of his career as an academic researcher in parallel and distributed computing. The next decade and a half included roles as a senior principal researcher, research programme manager and strategic leader in information and communication technologies in the Commonwealth Scientific and Industrial Research Organisation (CSIRO). His experience includes being the High Performance Scientific Computing Director for CSIRO and the National Grid Programme Manager for the Australian Partnership for Advanced Computing. From 2006 Rhys worked within the Australian Government’s National Collaborative Research Infrastructure Strategy as the facilitator for its investment plan in eResearch and subsequently as the Executive Director of the Australian eResearch Infrastructure Council. Since then, through a series of engagements, he has continued to work to harness advancing information and communication technologies to the benefit of Australian research.

Dave Connell – Dave Connell completed a Bachelor of Science (honours) degree at the University of Tasmania, and has been working at the Australian Antarctic Division since 1998 and as the metadata officer since 1999. His role is to catalogue and archive all scientific data collected by the Australian Antarctic program – specifically to ensure that scientists write high quality metadata records and archive their data in a timely manner. During his time at the AAD, he has overseen the transition from ANZLIC metadata to DIF metadata, and also developed tools for converting DIF metadata into various profiles of the ISO 19115 metadata standard. Dave is also very active in the Australian Government metadata space – reviewing and adapting ISO 19115 metadata standards for use in Australian scientific organisations. He has also worked with the Ocean Acidification – International Coordination Centre to develop an ocean acidification metadata profile.

Sandra Ennor – Sandra Ennor is a Senior Records Analyst at Monash University. Sandra has embraced a career in the Records Management industry analysing recordkeeping practices, project managing system implementations, enhancing training regimes and increasing education. These elements support her driving passions, such as understanding Information Culture and the evolution of business process. Sandra collaborates in Data Management and Big Data spaces with primary objectives encompassing frameworks such as Change Management, Compliance and Governance, Networking and Rights (including recordkeeping rights of the child/student and creating accessible systems for staff).

Nicholas McPhee – Nicholas McPhee has been part of the Monash University eResearch Centre since its creation more than ten years ago and is currently working with researchers and research groups in order to provide them with personalised information and data management strategies. Nicholas has also been involved in the development of information and data management policies and has maintained and administrated eResearch applications and data storage.

Jaye Weatherburn – Jaye Weatherburn is based in the Digital Scholarship team in Scholarly Services at the University of Melbourne, working to improve and support data stewardship and digital preservation capability.

ORCID ID: http://orcid.org/0000-0002-2325-0331

Adding Archival pathways to CloudStor

Mr Guido Aben1, Ingrid Mason1, Adam Bell1

1AARNet, Kensington, Australia

 

INSTRUCTIONS

AARNet has for several years offered the CloudStor platform to the Australian R&E community. CloudStor was designed to accept content directly from researchers and other users, including both active research data and research data that is no longer being actively used but which has been cited in publications. Although CloudStor provides vast storage capability, long-term preservation of the data stored has not been actively addressed, either in architecture or in customer-facing functionality. For the purposes of this abstract, we will define “preservation” as the management of activities that will allow the data to be discovered, accessed, rendered, deemed reliable and re-used over many years and even decades.

Digital preservation is a known challenge and is being addressed by numerous software tools, services and projects. A Research Data Shared Services (RDSS) platform being piloted in the UK by JISC [1] is designed to provide a suite of services for researchers to deposit, publish, share and preserve research data. Similar national and supra-national projects are underway in Canada [2] and the EU [3]. One of the tools being piloted as part of the RDSS is Archivematica, an open-source digital preservation system designed to ingest content, perform preservation actions, generate comprehensive technical and preservation metadata, and generate system-independent Archival Information Packages (AIPs) for long-term storage. AARNet has engaged Artefactual, the principal consulting firm behind the Archivematica codebase, to run a research data preservation pilot project for a select sample of content in CloudStor. The purpose of the pilot is to assess whether and how Archivematica’s preservation functionality can best be integrated with CloudStor, and its functionality made available to data custodians at connected institutions.

CloudStor, being a platform that hosts live data before it is due for archival, gives us the interesting opportunity of inspecting files and collections ahead of the do-or-die moment of archival, and proactively identifying content with particular preservation risks; perhaps signaling these risks to librarians associated with the collection.  For example, Archivematica includes a format identification microservice that attempts to determine the exact format and version of each file in a dataset, based on the PRONOM registry maintained by The National Archives in the UK.  The project will investigate whether datasets that contain a large proportion of files not identifiable through the PRONOM registry are indeed at risk of being unusable in the future, as identified by participating librarians.  Other preservation actions available in Archivematica include assignment of persistent identifiers, checksum generation, file format validation, metadata extraction, fixity checking, transcription, normalization to preservation formats and generation of standardized technical, preservation and audit metadata.  We expect that the addition of these capabilities to CloudStor would enable the service to provide a truly sustainable long-term solution for research data preservation, storage and re-use.
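Two of the preservation actions listed above, checksum generation and fixity checking, can be illustrated with a minimal Python sketch. This is not Archivematica's implementation or API; the function names (`build_manifest`, `check_fixity`) and the choice of SHA-256 are assumptions made purely for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Checksum generation: map each file's relative path to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict) -> list:
    """Fixity checking: list relative paths that are missing or have changed."""
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

A manifest built at deposit time can be re-checked periodically; any path returned by `check_fixity` signals a file that no longer matches what was originally stored.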

REFERENCES

[1] https://www.jisc.ac.uk/rd/projects/research-data-shared-service

[2] https://www.canarie.ca/rdm/funding-information-rdm-call-1/

[3] https://www.geant.org/News_and_Events/Pages/The-European-Open-Science-Cloud-for-Research.aspx


Biography:

Guido Aben is AARNet’s director of eResearch. He holds an MSc in physics from Utrecht University.

In his current role at AARNet, Guido is responsible for building services to researchers’ demand, and generating demand for said services, with the CloudStor / FileSender family perhaps the most widely known of those.

EuropeanaTech and Europeana Research

Ms Ingrid Mason1

1AARNet, Canberra, Australia, ingrid.mason@aarnet.edu.au

 

Europeana [1], the pan-European cultural heritage infrastructure and discovery platform, enables access to European cultural heritage data collections. The aim of this presentation is to apprise the eResearch community of the state of play of development and capability within the Europeana technology community, and to highlight the researcher-focused services provided by Europeana.

The presentation will consist of a pulse report and a précis.  The pulse report is on the EuropeanaTech conference 15-16 May 2018 held in Rotterdam, Netherlands [2] covering the major topics of interest and the themes arising from this significant triennial technology conference for the European digital cultural heritage community.  The précis is of the Europeana infrastructure and the services targeted at the research community.

The progress enabled through European investment in digital cultural heritage infrastructure stands as a useful reference point and example for the Australasian eResearch community: firstly, in its capacity to support data-intensive research in the humanities and arts, where cultural heritage data is an input, and secondly, in its strategic affiliations with European research infrastructures, e.g. CLARIN [3], DARIAH [4], EHRI [5], EUDAT [6] and Parthenos [7], and with the European research libraries community through LIBER [8].

REFERENCES

  2. EuropeanaTech Conference https://pro.europeana.eu/event/europeanatech-conference-2018 accessed 15/06/2018
  3. CLARIN (Common Language Resources and Technology Infrastructure) https://www.clarin.eu/ accessed 15/06/2018
  4. DARIAH (Digital Research Infrastructure for the Arts and Humanities) https://www.dariah.eu/ accessed 15/06/2018
  5. EHRI (European Holocaust Research Infrastructure) https://www.ehri-project.eu/ accessed 15/06/2018
  6. EUDAT https://www.eudat.eu/ accessed 18/06/2018
  7. Parthenos http://www.parthenos-project.eu/ accessed 18/06/2018
  8. LIBER https://libereurope.eu/ accessed 18/06/2018

Biography:

Ingrid Mason is a technologist, librarian, deployment strategist, and data specialist. Ingrid leads change associated with digital transformation and manages online and software development projects for academic researchers and cultural sector practitioners. She is also a self-professed metadata nerd and digital curator who has found a workplace that satisfies her interests in culture and society, humanities research, informatics, software and the web.

http://orcid.org/0000-0002-0658-6095

CiNii Research: A prototype of Japanese Research Data Discovery

Mr Fumihiro Kato1, Dr Ikki Ohmukai1, Dr Teruhito Kanazawa1, Dr Kei Kurakawa1

1National Institute Of Informatics, Chiyoda, Japan

 

INTRODUCTION

The National Institute of Informatics (NII) hosts scholarly information services for Japanese researchers and students. CiNii [1] is a discovery service provided by NII for Japanese research literature such as articles, books and dissertations. It harvests and integrates metadata of publications from institutional repositories, the National Diet Library, academic societies and other scholarly databases in Japan. As sharing and reusing research data is one of the key concepts of open science, in 2017 we launched a project called CiNii Research to enhance CiNii to support research data as a first-class citizen.

CiNii Research aims to enable search and discovery of publications and datasets produced by research projects in Japan. To achieve this goal, we are updating NII scholarly services to support research data, and we are also developing the CiNii Research system as a whole. CiNii Research consists of the three components illustrated in Figure 1. The first component aggregates metadata of research objects related to research projects in Japan. The second component extracts research objects and the relationships among them from the collected metadata in order to build a knowledge graph. The last component provides a discovery service for research objects by indexing the nodes of the knowledge graph. In this presentation, we report on the progress of the development.

Figure 1: Components of CiNii Research

Aggregation

The first component aggregates metadata of research objects related to research projects in Japan. NII has already collaborated with Japanese universities and institutions to collect the research objects behind NII scholarly information services. For instance, IRDB [2] is a national aggregator of institutional repositories in Japan. As of June 2018 it includes 2.8 million records from 681 repositories, of which 55 thousand records (2.5% of total) are datasets. As CiNii uses metadata collected by IRDB, we will update IRDB to JPCOAR Schema 1.0 [3], the latest metadata schema for Japanese institutional repositories, to support new features such as research data, identifiers and open access policies.

Another aggregator is KAKEN [4], which collects and hosts result reports of Grants-in-Aid for Scientific Research (KAKENHI), one of the major research funds provided by the Government of Japan. In addition to these existing aggregators, we also collect persistent identifiers from JaLC [5], DataCite, Crossref and ORCID. JaLC is the DOI registration agency in Japan and NII is one of its board members. Hence JaLC is our primary DOI source, as Japanese repositories use JaLC to assign DOIs to research objects including datasets.

Knowledge graph

Constructing a knowledge graph of research objects is an essential part of a modern discovery service, as links between scholarly literature and datasets help users find further related research objects. We defined the target types of research objects as products, researchers, projects, organizations and funds. Products are defined as a superset of articles, books, dissertations and datasets. We currently focus on products, researchers and projects because these types and their relationships are the most important for CiNii Research. Our system extracts research objects of these types from the aggregated metadata and identifies them using persistent identifiers and name disambiguation techniques.

Acquiring links between identified objects is the most important but also the hardest part of creating a knowledge graph, as explicit links in metadata are still rather few at this moment. Our current challenge is to extract relationships from our existing scholarly services and integrate them into the knowledge graph. KAKEN holds the reports, product lists and researchers of research projects, so the system can obtain links among products, researchers and projects. As CiNii holds products and researchers, the system can obtain links between them. However, the main issue is that each scholarly service is mostly independent, and only a subset of national researcher identifiers is currently shared. Therefore, we are concentrating on integrating research objects and their links across services.
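To make the idea of typed research objects and links concrete, here is a minimal Python sketch of such a graph. It is an illustration only, not the CiNii Research implementation: the class names, the identifier strings and the undirected link model are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Object types named in the text; "product" covers articles, books,
# dissertations and datasets.
NODE_TYPES = {"product", "researcher", "project", "organization", "fund"}


@dataclass(frozen=True)
class Node:
    identifier: str  # a persistent identifier (hypothetical values in this sketch)
    node_type: str
    label: str


class KnowledgeGraph:
    def __init__(self) -> None:
        self.nodes = {}
        self.links = defaultdict(set)

    def add_node(self, node: Node) -> None:
        if node.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node.node_type}")
        # Records that share a persistent identifier collapse into one node.
        self.nodes.setdefault(node.identifier, node)

    def add_link(self, a: str, b: str) -> None:
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must be added before linking")
        self.links[a].add(b)
        self.links[b].add(a)

    def related(self, identifier: str, node_type: Optional[str] = None) -> list:
        """Return neighbours of a node, optionally restricted to one type."""
        neighbours = (self.nodes[i] for i in self.links.get(identifier, ()))
        return sorted(
            (n for n in neighbours if node_type is None or n.node_type == node_type),
            key=lambda n: n.identifier,
        )
```

In this sketch, a project record from one service and a product record from another become connected as soon as both are registered under their identifiers and a link is added; `related(project_id, "product")` then lists the project's products.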

We also expect that metadata including identifiers of research objects and links between research objects will increase in future, as NII has been developing a new version of its institutional repository system, called WEKO3, to implement the JPCOAR schema. The current WEKO is used by about 500 Japanese universities and institutions via our hosting service. Once the hosting service has been migrated to WEKO3, we will encourage researchers and librarians to input identifiers for research objects and relationships between research objects in their public repositories. This will help us to grow and refine our knowledge graph.

Creating a knowledge graph of research objects is also important for global collaboration with other discovery and related services. Scholix [6] provides an interoperability framework for exchanging links between scholarly literature and data among global aggregators of data-literature links such as DataCite, Crossref, OpenAIRE and EMBL-EBI. OpenAIRE also provides OpenAIRE LOD services [7] to share its integrated data about research as Linked Data. Research Graph [8] creates a local graph for research management systems and links it to the larger Research Graph, which includes funding information, collections of research datasets and open access repositories. We would like to share and exchange such links between research objects in collaboration with these international activities.

Discovery service

We have been implementing an integrated search with Elasticsearch that shows information about research objects and relevant objects based on our knowledge graph, so that a user can follow relation links to find more related research objects. CiNii Research provides a simple input form for keyword search. Before entering keywords in the form, a user can select from tabs either a target type or all types of the research objects described in the Knowledge Graph section. If a user selects the “dataset” tab, search results are restricted to datasets. CiNii Research does not support the typical faceted search that many discovery services implement for their search results, because we would like to keep the results as simple as possible at this time.
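As an illustration of how a tab selection could translate into an Elasticsearch request, the following Python sketch builds a query body in the standard Elasticsearch query DSL (a `bool` query with a `multi_match` clause and an optional `term` filter). The index field names `title`, `description` and `object_type` are hypothetical; the actual CiNii Research mapping is not described in this abstract.

```python
def build_search_body(keywords: str, tab: str = "all") -> dict:
    """Build a query body for a keyword search under one tab.

    Field names are assumptions made for this sketch.
    """
    query = {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": keywords,
                        "fields": ["title", "description"],
                    }
                }
            ]
        }
    }
    if tab != "all":
        # A term filter narrows the results to one object type
        # without affecting relevance scoring.
        query["bool"]["filter"] = [{"term": {"object_type": tab}}]
    return {"query": query}
```

Selecting the “dataset” tab would correspond to `build_search_body("ocean acidification", "dataset")`, while the default tab sends the same keyword query with no type filter.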

We plan to support a way to connect CiNii Research to our research data management platform called GakuNin RDM [9]. It will enable us to import specific research data directly after finding it on CiNii Research.

REFERENCES

  1. Available from: https://ci.nii.ac.jp/en, accessed 7 Jun 2018.
  2. Available from: http://irdb.nii.ac.jp/analysis/index_e.php, accessed 7 Jun 2018.
  3. JPCOAR Schema Guidelines. Available from: https://schema.irdb.nii.ac.jp/en, accessed 7 Jun 2018.
  4. Available from: https://kaken.nii.ac.jp/en/, accessed 7 Jun 2018.
  5. Japan Link Center (JaLC). Available from: https://japanlinkcenter.org/top/english.html, accessed 7 Jun 2018.
  6. Burton, A. et al. The Scholix Framework for Interoperability in Data-Literature Information Exchange. D-Lib, 2017.
  7. Alexiou, G., et al., OpenAIRE LOD Services: Scholarly Communication Data. Save-SD 2016, Lecture Notes in Computer Science, vol 9792. 2016, p. 45-50.
  8. Aryani, A. and Wang, J. Research Graph: Building a Distributed Graph of Scholarly Works using Research Data Switchboard, in Proceedings of Open Repository 2017, 2017.
  9. Komiyama, Y. and Yamaji, K. Nationwide Research Data Management service of Japan in the Open Science Era, in Proceedings of the 6th IIAI International Congress on Advanced Applied Informatics, 2017, pp.129-133.

Biography:

Fumihiro Kato has been a researcher at Research Center of Open Science and Data Platform, National Institute of Informatics, since 2017. He is currently responsible for the development of the Japanese research data discovery service. He also works on Linked Open Data projects such as DBpedia Japanese and the IMI project to create a common vocabulary for Japanese national and local governments.

He received his Master of Media and Governance from Keio University in 2004. His research interests are web technologies, semantic web and linked open data.

https://orcid.org/0000-0001-8504-5782

Dimensions: the next generation approach to data discovery

Ms Anne Harvey1

1Digital Science, Carnegie, Australia

 

The research landscape exists in silos, often split by proprietary tools and databases that do not meet the needs of the institutions they were developed for. What if we could change that? In this session we’ll showcase Dimensions: a platform developed by Digital Science in collaboration with over 100 research organizations around the world to provide a more complete view of research from idea to impact.

We’ll discuss how the data now available enables institutions to more easily gather the insights they need to inform the most effective development of their organization’s activities, and look at how linking different sections of the scholarly ecosystem (including grants, publications, patents and data) can deliver powerful results that can then be integrated into existing systems and workflows through the use of APIs and other applications.

In particular, we’ll explore how the Dimensions approach to re-imagining discovery and access to research will transform the scholarly landscape, and the opportunities it presents for the research community.


Biography:

Anne Harvey is the Managing Director for Digital Science Asia Pacific, with overall responsibility for supporting clients with their research management objectives.

Anne has been involved in a number of projects including Big Data Computing (which refers to the ability of an organisation to create, manipulate, manage and analyze large data sets and its ability to drive knowledge creation) and Australia’s ERA 2010 and 2012 research assessment exercises.

Anne has a passion for information and research, and her previous positions include Regional Sales Manager at Elsevier and Business Development Manager at Thomson Reuters.

Enabling eResearch with Automated Data Management from Ingestion Through Distribution

Mr David Fellinger1

1iRODS Consortium, Chapel Hill, United States, davef@renci.org

 

HSM as Critical Path

The early Beowulf clusters were generally utilized to solve iterative mathematical problems, simulate environments and processes, and generate visualizations of systems that are difficult if not impossible to physically recreate. These initial clusters allowed researchers to make great strides in shortening the time required to solve complex, multi-dimensional matrices such as Schrödinger’s equation applied to specific materials and systems. We have seen widely varying uses from understanding the fusion reactions at the core of the Sun to simulating a spark plug in a car cylinder. As small clusters evolved into supercomputers, file systems also evolved to capture the voluminous data that was generated by these simulations and visualizations. Parallel file systems such as GPFS and Lustre were designed to scale in both bandwidth and depth, allowing data transitions from the cluster to the file system through multiple elements termed “gateway nodes”. These file systems utilize extremely high performance storage because compute and input/output (I/O) operations are mutually exclusive. The use of slower storage would effectively increase the length of the I/O cycle, diminishing the compute to I/O time ratio. This specialized storage is expensive and it is important that the supercomputer is the primary client to maximize efficiency. It is clear that data should never be distributed to users or researchers directly from the file system that is closely coupled to a supercomputer. Thousands of individual requests can slow down the parallel file system, reducing the effective life of the supercomputer by extending the I/O cycle time needlessly. Secondary storage and file systems are generally used for widespread research data access. These file systems do not require custom access clients and are usually compatible with NFS or CIFS, which are standard tools in existing operating systems. Data is then migrated from the expensive parallel file system to the less expensive data distribution file system by using copy commands.

Even though the storage media used for distribution is less expensive, it is usually rotating media which must have power applied to operate. Many sites store little-used data on tape, which is extremely inexpensive and does not require continuous power. This is referred to as archive storage and is the final copy location before the data is finally deleted. The described workflow has fostered the growth of software processes which are termed hierarchical storage management (HSM) systems. Several organizations have developed these systems to move data from one location to another, generally based on the age of the data and the frequency of access. While these systems are effective, they usually require a great deal of human intervention.
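The kind of rule an HSM system applies, placing data according to its age and frequency of access, can be sketched in a few lines of Python. This is a generic illustration rather than the behaviour of any particular HSM product; the tier names and the thresholds (30 days, a 90-day access window) are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    path: str
    created: datetime
    accesses_last_90_days: int


def choose_tier(
    record: FileRecord,
    now: datetime,
    hot_age: timedelta = timedelta(days=30),
    min_recent_accesses: int = 1,
) -> str:
    """Pick a storage tier from data age and access frequency."""
    if now - record.created <= hot_age:
        # Recently generated data stays close to the compute cluster.
        return "parallel"
    if record.accesses_last_90_days < min_recent_accesses:
        # Older data that nobody is reading goes to archive storage.
        return "tape"
    # Older data that is still being read is served from distribution storage.
    return "distribution"
```

A rule like this considers only age and access counts, never file content, which is the limitation of traditional HSM tools that the following sections discuss.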

The Growth of Big Data and Data Reduction

The proliferation of large scale sensor data has driven changes to both the compute and storage model. Data reduction environments now represent the majority of high performance computing in eResearch. Instruments such as scanning electron microscopes and genomic sequencers generate petabytes of data in health science research. Data reduction applications span studies from hydrology to seismic research as well as relating genotype to phenotype data in medical research. Instruments such as the Large Hadron Collider and telescopes both optical and radio generate a great deal of data that must be mined and reduced to be of value to scientists. The term “Big Data” was coined to describe this sort of sensor data having the characteristics of volume, velocity, variety, variability, and veracity. Large volumes of metadata are also generated to track and secure the provenance of the collected data to maintain veracity. The task of managing “Big Data” from creation through ingestion, reduction, and distribution cannot be easily achieved with traditional HSM tools which, in general, move data around based upon file date and type but not by content analysis. Human intervention on a file by file or object by object basis is also difficult if not impossible considering the petabytes that must be managed.

The Rise of Policy-Based Data Management

While it is often impossible to manage large quantities of data through human intervention, it is possible for data scientists and librarians to form a consensus on a policy which, in turn, dictates computer-actionable rules that can be applied at every stage of a workflow. The Integrated Rule-Oriented Data System (iRODS) has been designed to enable a very flexible policy-based management environment for a wide variety of applications. Recently developed features expand iRODS capabilities well beyond those of a traditional HSM to a complete tool that can simplify complex eResearch data manipulation and data discovery tasks.

The first phase of any data gathering and analysis project starts with ingestion of instrument or sensor data to enable analysis. This process consists of the establishment of a “landing zone”, which is a storage buffer for incoming data. This “landing zone” may be centralized but, if the instruments have buffer file systems, the “landing zone” can also be ubiquitous so that data copies are minimized and the data effectively remains in the buffer space. Fully automated, rules-based data ingest capabilities have recently been added to the iRODS feature set. A file system scanning capability has been added to iRODS which can watch external file systems for new or updated files and launch appropriate action. The action taken can be based on file size and file transfer bandwidth capability. If the files to be processed are very large, it may be more efficient to register the data in place so that it can be moved in whole or part to a file system adjacent to a compute cluster only when it is required for a data reduction process. If the files are smaller and the data transfer bandwidths are large, it might make more sense to centralize the data on a file system which can deliver data to a compute file system more efficiently. In either case, iRODS can extract metadata to enable additional discovery operations.

As an example, it is a common practice to study large numbers of genomic sequence files in parallel to identify similarities or differences. Attributes are generally associated with these files to enable reduction efficiencies so that only files with specific attributes are compared. The attributes can be extracted by iRODS and used to enrich the metadata so that only data with very specific attributes are moved to a “scratch” file system for analysis. This entire process of automated file selection and data movement can be used to trigger other operations based on policy enforcement points. These operations could include launching a data reduction process or generating a report describing the data in the “landing zone”. The file scanner is a “pull” process but iRODS can also utilize a “push” process. Parallel file systems such as Lustre can generate a change log based on file system modifications and iRODS can push data to a specific location based upon these changes and associated rules. In fact, iRODS can respond to any system state change or event to begin a data movement or analysis process. This could be useful for a tool such as a CAT scan machine that is not in continuous service but must be harvested when a scan is completed.
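
The choice between registering data in place and centralizing it can be sketched as a comparison of estimated transfer time against a budget. The Python below is only an illustration of that reasoning and is not iRODS code; the `IncomingFile` type, the `plan_ingest` function and the 600-second default budget are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IncomingFile:
    path: str
    size_bytes: int


def plan_ingest(f: IncomingFile,
                bandwidth_bytes_per_s: float,
                max_transfer_s: float = 600.0) -> str:
    """Decide how a newly detected landing-zone file should be handled.

    Returns 'register_in_place' when copying would take longer than the
    (assumed) time budget, and 'copy_to_central' otherwise.
    """
    if bandwidth_bytes_per_s <= 0:
        # No usable link: leave the data where it is and only register it.
        return "register_in_place"
    transfer_s = f.size_bytes / bandwidth_bytes_per_s
    if transfer_s > max_transfer_s:
        return "register_in_place"
    return "copy_to_central"
```

For example, a 1 TB file over a 100 MB/s link needs roughly 10,000 seconds, so under these assumptions it would be registered in place, while a 1 GB file over a 1 GB/s link would be copied to the central file system.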

The process of data ingestion is itself a workflow dedicated to both data organization and description yielding a registered and discoverable tier of storage. Subsequent processing is metadata-driven based upon rules which are written to maintain the chosen policies. For example, Fourier planes of radio astronomy data with similar characteristics collected over time can be automatically migrated to a parallel file system to enable a compute process allowing astronomical event analysis. Selection of the required files would be based upon attributes described in the rich metadata extracted during the registration process. It may be necessary to migrate data with different characteristics to an archive reserved for analysis at a later time. The new iRODS function of metadata-driven storage tiering facilitates the efficient use of storage resources in eResearch. This function is unique to iRODS and is dynamic allowing data mobility decisions to be made in real time based upon user-defined metadata attributes or harvested metadata like machine availability, storage migration data bandwidth, real time attributes that may change in priority based upon newly ingested data, and the storage resource value and framework. Literally any attribute of a file or system element can be evaluated in a real time decision tree to enable efficient data analysis operation. All of these processes can operate in a parallel fashion limited only by the number of nodes assigned to the process. The processes can also operate over a wide physical distance or organizational boundaries utilizing iRODS federation enabling eResearch collaboration over continents or the world.
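
Metadata-driven tiering can be pictured as an ordered list of rules evaluated against each object's attributes. The sketch below is a generic Python illustration, not the iRODS rule language or its tiering implementation; the attribute names, tier names and rule list are invented for the example.

```python
from typing import Dict, List, Tuple

# Each rule pairs required attribute values with a target tier.
# Attribute and tier names here are hypothetical.
Rule = Tuple[Dict[str, str], str]

RULES: List[Rule] = [
    ({"instrument": "radio_telescope", "status": "ready_for_analysis"}, "parallel_fs"),
    ({"status": "deferred"}, "archive"),
]


def matches(metadata: Dict[str, str], conditions: Dict[str, str]) -> bool:
    """True when every condition is present in the metadata with the same value."""
    return all(metadata.get(key) == value for key, value in conditions.items())


def target_tier(metadata: Dict[str, str],
                rules: List[Rule] = RULES,
                default: str = "landing_zone") -> str:
    """Return the tier of the first matching rule, or the default tier."""
    for conditions, tier in rules:
        if matches(metadata, conditions):
            return tier
    return default
```

Because the rules are data rather than code, a change in policy becomes a change to the rule list, and the same evaluation can be rerun whenever an object's metadata changes.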

Finally, the entire process of data distribution to the scientific community can be managed by iRODS. Data provenance is assured since every step of the workflow from ingestion through publication can be tracked and audited. Multiple layers of attribute assignment over the process steps assure that the published data is fully discoverable, and metadata-driven access controls assure regulatory compliance.
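
One simple way to picture an auditable workflow is an append-only event log that can be checked for the expected sequence of steps. The following Python sketch assumes three step names (`ingest`, `reduce`, `publish`) and an in-memory `AuditTrail` class, both hypothetical; it does not represent how iRODS stores audit information.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List


class AuditTrail:
    """Append-only record of workflow steps applied to data objects."""

    def __init__(self) -> None:
        self._events: List[Dict[str, str]] = []

    def record(self, object_id: str, step: str, actor: str) -> None:
        """Append one event with a UTC timestamp."""
        self._events.append({
            "object_id": object_id,
            "step": step,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def history(self, object_id: str) -> List[str]:
        """Return the steps recorded for one object, in recorded order."""
        return [e["step"] for e in self._events if e["object_id"] == object_id]


def is_complete(history: Iterable[str],
                required: Iterable[str] = ("ingest", "reduce", "publish")) -> bool:
    """True when the required steps appear in the history in that order."""
    remaining = iter(history)
    # 'step in remaining' consumes the iterator, so order is enforced.
    return all(step in remaining for step in required)
```

A published object whose history lacks an earlier step, or records the steps out of order, would fail the check and could be flagged for review.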

Modern eResearch implies the analysis and dissemination of data at a scale that is growing exponentially. The use of open source iRODS automation to enable a policy-based, rules-driven workflow can simplify the entire data lifecycle while allowing full traceability, reproducibility, and flexibility for when the policy changes in the future.


Biography:

Dave Fellinger is a Data Management Technologist with the iRODS Consortium. He has over three decades of engineering and research experience including film systems, video processing devices, ASIC design and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems. As Chief Scientist of DataDirect Networks, Inc. he focused on building an intellectual property portfolio and presenting the technology of the company at conferences with a storage focus worldwide.

In his role at the iRODS Consortium, Dave is working with users in research sites and high performance computer centers to confirm that a broad range of use cases can be fully addressed by the iRODS feature set. He helped to launch the iRODS Consortium and was a member of the founding board.

He attended Carnegie Mellon University and holds patents in diverse areas of technology.

Introducing ReDBox 2: A Hands On Exploration of ReDBox 2 and the Provisioner for Institutions

Gavin Kennedy1, Dr Peter Sefton2, Andrew Brazzatti1, Moises Sacal Bonequi2, Michael Lynch2

1Queensland Cyber Infrastructure Foundation, Brisbane, Australia, gavin.kennedy@qcif.edu.au
2University of Technology Sydney, Sydney, Australia, Peter.Sefton@uts.edu.au

DESCRIPTION

This is a hands-on workshop for institutions to preview the new ReDBox 2 Research Data Management platform. ReDBox is the most widely used research data management platform in Australian universities but, with its focus on managing and publishing the metadata for data collections, it has never reached its potential as an end-to-end solution. QCIF and UTS have collaborated to develop ReDBox 2, a comprehensive platform to support the research data life cycle. It provides an integrated data management planning capability, allows users to provision and manage research services such as storage infrastructure, and supports the ingest of data packages using the DataCrate standard, allowing ReDBox to publish the data alongside the metadata. In developing ReDBox 2, we have focussed on making an easy-to-configure web application using Sails.js, a modern JavaScript framework.

This workshop will be delivered in three parts:

PART 1 is an introduction to and discussion of ReDBox and the Provisioner (30 minutes):

  • Overview of ReDBox
  • The Mint Namespace Authority
  • Data Management Planning tools
  • Workspaces and Provisioning
  • Metadata and data harvesting and curation
  • DataCrate for data packaging
  • Publication workflows

PART 2 is a demonstration and hands-on exploration of ReDBox for institutions (120 minutes):

  • Details of the technology stack (Node.js, Sails, MongoDB)
  • How to install the platform
  • Loading data into Mint
  • Creating records, including DMPs, data records and publication records
  • How to configure forms and workflows
  • How to integrate with services to create new workflows

PART 3 is a discussion on community involvement, timelines and future developments (30 minutes)

WHO SHOULD ATTEND

This workshop will be of interest to repository managers, data librarians and technical staff, as it will describe how the architecture of the platform supports the research data lifecycle. It does not assume extensive technical knowledge, so we encourage both developers and administrators to attend.

WHAT TO BRING

Attendees will need to bring a laptop with a web browser installed. During the workshop we will install additional software such as SSH/Putty and WinSCP or Cyberduck.


BIOGRAPHIES

Dr Peter Sefton is the Manager, eResearch Support at the University of Technology, Sydney (UTS). Before that he was in a similar role at the University of Western Sydney (UWS). At UTS Peter leads a team working with key stakeholders to implement university-wide eResearch infrastructure, including an institutional data repository, as well as collaborating widely with research communities at the institution on specific research challenges.

Gavin Kennedy is an IT research and solutions expert and is the head of Data Innovation Services at the Queensland Cyber Infrastructure Foundation (QCIF). Gavin leads the QCIF-based development team responsible for ReDBox, the popular research data management and publishing platform. Gavin is a passionate advocate for Open Source platforms to support open research and the FAIR data principles. Gavin has over 30 years of IT experience in organisations as diverse as CSIRO, General Electric and British Telecom.

 

MyTardis: FAIR data management for instrument data

Conveners: Wojtek J. Goscinski1, Amr Hassan2

Presenters: Andrew Janke3, Andrew Mehnert4, Aswin Narayanan5, Dean Taylor6, James M. Wettenhall7, Jonathan Knispel8, Keith E. Schulze9, Lance Wilson10, Manish Kumar11, Samitha Amarapathy12

1Monash eResearch Centre, Monash University, Melbourne, Wojtek.Goscinski@monash.edu
2Monash eResearch Centre, Monash University, Melbourne, Amr.Hassan@monash.edu
3National Imaging Facility, Center for Advanced Imaging, The University of Queensland, Brisbane, andrew.janke@uq.edu.au
4Centre for Microscopy, Characterisation and Analysis, The University of Western Australia, Perth, andrew.mehnert@uwa.edu.au
5National Imaging Facility, Center for Advanced Imaging, The University of Queensland, Brisbane, a.narayanan@uq.edu.au
6Centre for Microscopy, Characterisation and Analysis, The University of Western Australia, Perth, dean.taylor@uwa.edu.au
7Monash eResearch Centre, Monash University, Melbourne, james.wettenhall@monash.edu
8Centre for Microscopy, Characterisation and Analysis, The University of Western Australia, Perth, jonathan.knispel@uwa.edu.au
9Monash eResearch Centre, Monash University, Melbourne, keith.schulze@monash.edu
10Monash eResearch Centre, Monash University, Melbourne, lance.wilson@monash.edu
11Monash eResearch Centre, Monash University, Melbourne, manish.kumar@monash.edu
12Monash eResearch Centre, Monash University, Melbourne, samitha.amarapathy@monash.edu

GENERAL INFORMATION

  • Workshop Length: One Day
  • This workshop will have the last 2 hours as a hands-on component

DESCRIPTION

Research data management platforms aim to meet the challenges of capturing and managing large volumes of research data, while ensuring that the data is Findable, Accessible, Interoperable and Reusable (FAIR). One such platform is MyTardis (https://www.mytardis.org), an open source research data management platform that was initially established to handle and store macromolecular crystallography data {Meyer:2014ub, Androulakis:2008ku}. Through several national projects like the NeCTAR Characterisation Virtual Laboratory (https://www.massive.org.au/cvl), ImageTrove (http://projects.ands.org.au/id/ERIC08) and the ANDS Trusted Data projects (https://projects.ands.org.au/id/GFA16), MyTardis has evolved into a general purpose research data management system, with a focus on integrating scientific instruments and instrument facilities. It is used across light microscopy, electron microscopy, proteomics, cytometry, magnetic resonance imaging (MRI), positron emission tomography (PET), and other scientific techniques. It integrates over 100 Australian instruments across Monash University, University of Queensland, University of Newcastle, University of New South Wales, RMIT, and University of Western Australia.

In this workshop, representatives from the Characterisation community will share their experience in developing and operating large deployments of MyTardis. We will emphasise how MyTardis helps to securely store and manage data from a variety of different instruments. We will also outline the short- to medium-term roadmap for MyTardis development and our plan to engage the wider community to help us build the next-generation platform for instrument data management. Finally, we will run a hands-on workshop on best practices for deploying and operating MyTardis, specifically targeted at developers and system administrators.

Workshop Contents:

Overview of MyTardis and its deployments

  • Overview of MyTardis
  • Developing and operating MyTardis at Monash University
  • NIF Trusted Data Repositories
  • Developing and operating MyTardis at the University of Queensland and NIF
  • Developing and operating MyTardis at the University of Western Australia
  • Developing and operating MyTardis at the University of Newcastle
  • MyTardis features for instrument facilities

Future Roadmap

  • The Future of MyTardis
  • Requirements from instrument facilities
  • Addressing FAIR by integrating with the experiment and trusted data repositories
  • Panel Discussion / BoF: Next-generation instrument data, future and challenges

Hands On

  • Hands on session on deployment of MyTardis

WHO SHOULD ATTEND

  • Instrument facility managers
  • Data Managers
  • IT Managers & Directors
  • Professionals in associated disciplines
  • Research Computing Specialists
  • Research Managers
  • University Representatives
  • Researchers
  • Librarians
  • Software & App engineers

WHAT TO BRING

Attendees need to bring a laptop.


BIOGRAPHIES

Dr Wojtek James Goscinski is the coordinator of MASSIVE, a national high performance computing facility for data science, and Associate Director at the Monash eResearch Centre, a role in which he leads teams to develop and implement digital strategies to nurture and underpin next-generation research. He holds a PhD in Computer Science, a Bachelor of Design (Architecture), and a Bachelor of Computer Science.

Dr Amr Hassan is the eResearch Delivery leader at the Monash eResearch Centre. He leads a team of eResearch professionals to ensure the delivery of high-quality ICT services, projects and programmes that enable the achievement of the eResearch strategic agenda of Monash University. He holds an interdisciplinary PhD in Computational Sciences, an M.Sc. in Scientific Computing, and a B.Sc. in Computer Science.

 

    About the conference

    eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.

    Conference Managers

    Please contact the team at Conference Design with any questions regarding the conference.

    © 2018 - 2019 Conference Design Pty Ltd