Measuring Data

Ms Ai-Lin Soo1, Dr Rhys Francis, Dr Leslie Almberg2, Mr Jason Lohrey2

1Restech UNSW Sydney, Australia, 2Arcitecta

Biography:

Dr. Leslie D. Almberg (she/her)

Researcher, Writer, Geoscience Educator, and Program Director

Dr. Leslie D. Almberg is a researcher and writer with a background in volcanic processes, geoscience education, innovative curriculum design, and journalism. She has a deep interest in finding simple ways to explain complex systems to broad and diverse audiences. Her current focus is on research data management, specifically the intersections between data collectors, data managers, data users, and the powers that be.

In her role with Arcitecta, she is building relationships within the Australasian research community to connect people with data solutions. As Program Director for Earth and Environmental Science at Australian Science Innovations, Dr. Almberg also leads a national initiative to engage and extend high-performing students in Earth sciences.

Dr Rhys Francis

Rhys was an academic researcher in parallel and distributed computing through the 1980s. From 1990 to 2005, his roles extended into strategic leadership in information and communication technologies for the CSIRO. From 2006, Rhys facilitated the development of a national investment plan in eResearch infrastructure for the Australian Government’s National Collaborative Research Infrastructure Strategy, which shaped the foundations of the national e-infrastructure landscape visible in Australia today. Rhys is now part of the team developing the Australian BioCommons, which is accelerating the adoption of digital technology in Australian life science research, and he also facilitates the Research Data Culture Conversation (researchdataculture.org).

Ai-Lin Soo

Ai-Lin Soo is a nationally recognised subject matter expert in research data who has drawn on her experience to contribute to institutional and national projects that improve data management practices. Her projects include activities under the Research Data Culture Conversation (researchdataculture.org), such as "The Macro View of Australian and New Zealand Aotearoa Research Data". In pursuit of better research data management practices, she is currently undertaking a Master of Philosophy. Ai-Lin is also experienced in supporting research endeavours and is the Operations Manager for the AI for Law Enforcement and Community Safety (AiLECS) Lab, a research centre partnership with law enforcement.

Abstract:

In 2017, the eResearch teams of five major universities initiated discussions on research data culture. A primary concern was the perceived rapid growth of data and its potential budgetary implications for CFOs. Contrary to general assumptions, initial measurements revealed a surprisingly low growth rate, the reasons for which remain unclear.

This finding was subsequently supported by the MacroView, a two-year project that estimated the scale of Australia's research data as at December 2021 and December 2022. Current extrapolations suggest approximately 370 petabytes (PB) of research data in 2022, growing at 22-25% a year. At that rate, the total would be 550-600 PB today. However, institutions lacked the capacity to report on significant aspects of their data holdings, leaving the detailed characteristics of this research data asset unknown.
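The extrapolation can be checked with a short compound-growth calculation. This is a minimal sketch, assuming that the 22-25% figure is an annual rate compounded yearly and that "today" is two years after the 2022 estimate; neither assumption is stated in the abstract, and the function name `project_pb` is hypothetical.

```python
def project_pb(base_pb: float, annual_rate: float, years: int) -> float:
    """Compound base_pb forward at annual_rate for a whole number of years."""
    return base_pb * (1.0 + annual_rate) ** years


BASE_PB_2022 = 370.0  # estimate quoted in the abstract
YEARS_ELAPSED = 2     # assumption: two full years of growth since 2022

low = project_pb(BASE_PB_2022, 0.22, YEARS_ELAPSED)
high = project_pb(BASE_PB_2022, 0.25, YEARS_ELAPSED)
print(f"{low:.0f}-{high:.0f} PB")  # prints "551-578 PB"
```

Under these assumptions the projection falls inside the 550-600 PB range quoted above.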

To manage research data as a valuable asset, we believe it is essential to understand its origin, initial use, replication, ongoing use, and longevity. Consequently, we are now examining data use at a research-intensive Australian university in partnership with Arcitecta. Using Arcitecta's Mediaflux system, which manages around 17 PB of data at that university, we have generated de-identified event-by-event logs of all service activities. The logs allow us to visualise changes in the state of data assets, including the age profile of data in use and the patterns of data and metadata creation and access.
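One way an age profile of data in use could be derived from event logs is sketched below. This is an illustration only: the record layout (asset identifier, event kind, date), the event kinds `create` and `access`, and the function name `age_profile` are all hypothetical and do not reflect the actual Mediaflux log format, and the sample events are synthetic.

```python
from collections import Counter
from datetime import datetime

# Synthetic events for illustration only; not drawn from any real log.
# Hypothetical layout: (asset_id, event_kind, ISO date).
EVENTS = [
    ("a1", "create", "2020-01-10"),
    ("a2", "create", "2023-06-01"),
    ("a1", "access", "2023-07-01"),
    ("a2", "access", "2023-06-20"),
    ("a2", "access", "2024-08-01"),
]


def age_profile(events):
    """Count access events by the asset's age, in whole years, when accessed."""
    created = {}
    profile = Counter()
    # ISO dates sort chronologically as plain strings.
    for asset_id, kind, stamp in sorted(events, key=lambda e: e[2]):
        when = datetime.strptime(stamp, "%Y-%m-%d")
        if kind == "create":
            created[asset_id] = when
        elif kind == "access" and asset_id in created:
            age_years = (when - created[asset_id]).days // 365
            profile[age_years] += 1
    return dict(profile)


print(age_profile(EVENTS))
```

Bucketing by whole years keeps the profile coarse enough to plot as a simple histogram; a real analysis would likely choose bins to suit the observed access patterns.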

Our overall goal is to define a set of measurable attributes that can form the basis for organisation-wide data reporting guidelines. We also hope to observe the different data life cycles that occur in practice.

The poster will detail our methodology and preliminary results.

 
