Clare Richards1, Kate Snow2, Chris Allen3, Matt Nethery4, Kelsey Druken5, Ben Evans6
1Australian National University, Canberra, Australia, Clare.Richards@anu.edu.au
2Australian National University, Canberra, Australia, Kate.Snow@anu.edu.au
3Australian National University, Canberra, Australia, Chris.Allen@anu.edu.au
4Australian National University, Canberra, Australia, Matthew.Nethery@anu.edu.au
5Australian National University, Canberra, Australia, Kelsey.Druken@anu.edu.au
6Australian National University, Canberra, Australia, Ben.Evans@anu.edu.au
The National Computational Infrastructure (NCI) has established a combined Big Data/Compute repository for National Reference Data Collections supported through the former Research Data Services (RDS) project, in which NCI led the management of the Earth System, Environmental and Geoscience Data Services, comprising primarily Climate and Weather, Satellite Earth Observations and Geophysics data. NCI has over 15 years of experience working across these domains, building the capacity, infrastructure and skills to manage the data and making it suitable for use both across these domains and beyond the domain that generated it. Over recent years it has become apparent that, as data volumes grow, discovery and reproducibility of research and workflows have to be efficient, and internationally agreed standards, data services, data accessibility and data management practices all become critically important.
One major driver for developing this type of combined Big Data/Compute infrastructure has been to address the needs of the Australian climate community. This national priority research area is one of the most computationally demanding and data-intensive in the environmental sciences. Such data needs to reside within a focused national centre that can handle the scale and dimensions of the requirements, including the computation to generate the data, the computational capacity to analyse it and the expertise to manage very large and complex data collections. A large proportion of climate data comes from the World Climate Research Programme’s Coupled Model Intercomparison Project (CMIP) and is managed and shared by an international and collaborative infrastructure called the Earth System Grid Federation (ESGF).
In the not too distant past, CMIP data was shipped around the globe on several hard drives. The data was difficult to keep up to date and to share with all the researchers who needed access. Indeed, the data generated for CMIP has always outstripped the capacity to share it, and the volumes quickly grew beyond any capacity to distribute them in a timely fashion. For example, for CMIP3 in 2001 the data was of the order of 10 TB, but by 2013 CMIP5 required 1 PB, and CMIP6 is predicted to be at least 20 PB. The sheer size and complexity of the CMIP data collection make it impossible for repositories to manage in isolation, as it is both difficult and costly to manage multiple copies of data for individual users. For climate researchers, searching such large volumes of globally distributed data for individual files that match specific criteria can be like finding several needles in many haystacks!
INTERNATIONAL COLLABORATION FOR MANAGING CLIMATE DATA
To solve this problem, the ESGF, an international collaboration led by the US Department of Energy, was set up to improve the storage and sharing of the rapidly expanding petascale datasets used in climate research globally. Since its establishment more than 10 years ago, the ESGF has continued to grow and now manages tens of petabytes of climate science and other data at dozens of sites distributed around the globe. NCI has been a Tier 1 node of the ESGF since 2009 and has invested significantly in the development of the infrastructure, the evolution of Big Data management practices, and the expertise to support this international collaboration. Developing such a capability requires a long-term commitment to harmonising and maintaining a ‘system for the management, access and analysis of climate data’ that meets global community protocols whilst respecting local access policies.
As an internationally coordinated activity, the ESGF requires intense collaboration between the data generators and repository/node partners to ensure adherence to standards and protocols. These create a robust, dependable and sustainable infrastructure that improves data discovery and use across the whole global network while supporting use in the local environment. To make the system work, each node/repository is managed independently but agrees to adhere to common ESGF protocols, software and interfaces, including:
- Data and metadata that comply with agreed standards and conventions;
- Version control protocols, common vocabularies, and ‘Data Reference Syntax’; and
- Consistent publication requirements across the distributed network.
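To illustrate why a shared ‘Data Reference Syntax’ matters for discovery, the following is a minimal sketch of how a consistently named file can be turned into searchable facets. It assumes the CMIP5-style filename pattern `<variable>_<MIP table>_<model>_<experiment>_<ensemble>[_<time range>].nc`; the class name, function name and example filename are hypothetical and are not part of any ESGF software.

```python
from typing import NamedTuple, Optional


class DrsFilename(NamedTuple):
    """Facets recovered from a filename (field names are illustrative)."""
    variable: str
    mip_table: str
    model: str
    experiment: str
    ensemble: str
    time_range: Optional[str]


def parse_drs_filename(filename: str) -> DrsFilename:
    """Split a filename into facets.

    Assumption: components are separated by underscores and never contain
    underscores themselves, as in the CMIP5-style pattern named above.
    """
    if not filename.endswith(".nc"):
        raise ValueError(f"expected a NetCDF (.nc) file, got: {filename}")
    parts = filename[: -len(".nc")].split("_")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of components in: {filename}")
    # The time range is optional (for example, time-independent fields).
    time_range = parts[5] if len(parts) == 6 else None
    return DrsFilename(parts[0], parts[1], parts[2], parts[3], parts[4], time_range)


# Hypothetical example filename, used only for illustration.
facets = parse_drs_filename("tas_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc")
print(facets.variable, facets.model, facets.experiment)
```

Because every node applies the same naming rules, the same small parser works on files from any participating repository, which is what makes a federated search over facets such as model or experiment feasible.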
APPLYING THE BENEFITS TO OTHER DOMAINS
One of the benefits of all participants adhering to the ESGF protocols is that a climate researcher in Australia can search and access data locally or internationally across all participating repositories, and be confident that the data can be reliably used. This is important for scientific collaboration and sharing, verification of research, and reproducibility of results, which is increasingly important for the publication of research papers. The ESGF model also demonstrates that the depth of expertise required to ensure that climate data keeps up with international standards and trends cannot be replicated across all participating nodes/repositories. However, smaller repositories can benefit from the collaborations, which not only help define the standards and protocols required but also provide expertise to develop the tools to manage this important globally distributed petascale data collection.
With the Big Data problem increasingly affecting other research domains, there is much that can be learned from the ESGF model of international collaboration to deliver value to other major research communities, particularly those that are seeking to be part of a shared global data services network and to move beyond managing individual stores of downloadable data files.
- The Earth System Grid Federation, Design. https://esgf.llnl.gov/federation-design.html [Last accessed 22 June 2018].
- The Coupled Model Intercomparison Project (CMIP). https://www.wcrp-climate.org/wgcm-cmip [Last accessed 22 June 2018].
- The Earth System Grid Federation. https://esgf.llnl.gov/index.html [Last accessed 22 June 2018].
Clare Richards is the Senior HPC Innovations Project Manager at the National Computational Infrastructure (NCI). She manages several projects and activities within the Research, Engagement and Initiatives area, including the Climate Science Data Enhanced Virtual Laboratory and other collaborative projects with NCI partners. Prior to joining ANU in 2015, she had a lengthy and diverse career at the Bureau of Meteorology and has also dabbled in marketing and media.