Dr Siddeswara Guru1, Hamish Holewa2, Gerhard Weis2, Hoylen Sue3, Sarah Richmond2
1 The University of Queensland, St Lucia, Australia, email@example.com
2Griffith University, Brisbane, Australia, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
3QCIF Pty Ltd, Australia, email@example.com
EcoCloud is a customised cloud infrastructure for ecosystem science, hosted by the National eResearch Collaboration Tools and Resources (NeCTAR) project. It aims to give researchers more effective access to ecosystem science data, compute platforms and resources for innovative research. One motivation for developing EcoCloud is to bring data closer to the compute resources needed for complex analysis. EcoCloud is one of three domain-specific Science Clouds: the Australian Ecosystems Science Cloud (EcoCloud), the Australian Biosciences Cloud and the Australian Marine Sciences Cloud. These were started as collaborations between NeCTAR and the Terrestrial Ecosystem Research Network (TERN), Bioplatforms Australia (BPA) and the Integrated Marine Observing System (IMOS), respectively. EcoCloud is built in partnership with TERN, NeCTAR, the Queensland Cyber Infrastructure Foundation (QCIF), the National Computational Infrastructure (NCI) and the Atlas of Living Australia (ALA). EcoCloud will be a building block towards Australia’s open ecosystem science cloud, providing seamless access to data, tools and analysis for transparent, reusable research and development across ecosystem domains, which in turn can inform cutting-edge research and evidence-based public policy.
Ecosystem science is a multi-disciplinary science that studies the inter-relationships between living organisms, physical features, bio-chemical processes, natural phenomena and human activities. TERN plays a critical role in collecting and collating terrestrial ecosystem data, and ALA does the same for biodiversity data. Several other initiatives and organisations also collect and publish extensive collections of ecosystem data. With advances in technology and the emergence of the cloud platforms developed by NeCTAR and Research Data Services (RDS), the ecosystem science community wants easy and seamless access to services such as network infrastructure, storage and compute to build its analysis pipelines. However, this remains challenging and is often not possible: data reside in different data centres, each with its own interface for search and access, and there is no streamlined way to make cloud-enabled data available to compute.
EcoCloud was started with a strongly user-centric approach, to address some of the challenges the ecosystem science community faces in getting streamlined access to data close to compute resources. Two sets of community engagements were run, one focusing on domain scientists and the other on application developers. Several patterns emerged about the needs of the community. In general, users wanted managed, scalable compute platforms with openly accessible data collections located close to the compute for further analysis. The following are some of the drivers for EcoCloud:
- ensure that data is widely accessible for use across different science disciplines
- harmonise the TERN data infrastructure with other similar data infrastructures to offer a common platform for data-centric search, query and access across different platforms and virtual labs
- provide a scalable, managed computing environment with easy access to distributed and data-intensive computation technologies
- develop a support system for cross-disciplinary use of data.
An overall component architecture was drawn up to identify the functional and non-functional requirements for EcoCloud. A cloud-enabled architecture was also developed to identify the OpenStack cloud services required for the implementation.
In phase 1 of the implementation, we decided to develop three managed platforms together with an open data platform. Figure 1 shows the components of EcoCloud under development. Each component and its functionality is described below.
Figure 1: components of the phase 1 implementation of EcoCloud
EcoCloud will have three main components: user management, data storage and managed compute platforms. User management is an essential part of the system: users need to register and log in using an Australian Access Federation (AAF) or Google account. User information will be stored in LDAP and used to create the user accounts needed to access the virtual desktop platform and file sync applications.
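As an illustration of the registration step described above, the following minimal sketch maps a federated login to an LDAP-style entry. It is not the EcoCloud implementation: the base DN, the function name `ldap_entry_from_login`, the rule for deriving `uid` from the email address and the choice of the standard `inetOrgPerson` object class are all assumptions made for this example.

```python
import re

# Hypothetical directory base; the abstract does not describe the LDAP layout.
BASE_DN = "ou=people,dc=ecocloud,dc=example"

# The two login routes named in the text.
ALLOWED_PROVIDERS = {"aaf", "google"}


def ldap_entry_from_login(provider, email, display_name):
    """Build an LDAP-style (dn, attributes) pair from a federated login.

    The attribute names (uid, cn, sn, mail) come from the standard
    inetOrgPerson schema; how EcoCloud actually populates LDAP is not
    stated in the abstract, so this mapping is an assumption.
    """
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unsupported identity provider: {provider}")

    # Assumed rule: derive a filesystem-safe uid from the email address.
    uid = re.sub(r"[^a-z0-9]+", "-", email.lower()).strip("-")

    # Use the last word of the display name as the surname, falling back
    # to the uid if the display name is empty.
    surname = (display_name.split() or [uid])[-1]

    dn = f"uid={uid},{BASE_DN}"
    attributes = {
        "objectClass": ["inetOrgPerson"],
        "uid": [uid],
        "cn": [display_name or uid],
        "sn": [surname],
        "mail": [email],
    }
    return dn, attributes
```

An entry built this way could then be written to the directory with any LDAP client, so that the virtual desktop and file sync services resolve the same account.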
In the first phase, EcoCloud will offer two data platforms: EcoDrive and EcoStore. EcoDrive will act as user-defined storage with sync functionality, so that users can bring their own data and access it from EcoCloud compute resources and analysis platforms. A user should be able to write to an EcoDrive and access the data from the different EcoCloud analysis platforms. Users should also be able to sync EcoDrive to their local desktop, but each synced dataset will be restricted to less than 500 MB to minimise large data copying. EcoStore is a managed data store that hosts large-scale, quality-controlled reference datasets for use by the user community. Typically, these datasets are continental in scale and ready for reuse. EcoStore does not offer sync functionality, again to minimise large file copying. Instead, all of its datasets will be accessible from the managed platforms. OpenStack Swift storage will be used to implement EcoStore.
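The desktop sync restriction described above can be sketched as a simple size filter. This is only an illustration under stated assumptions: the names `Dataset` and `split_for_sync` are hypothetical, and "500 MB" is read here as 500 × 10^6 bytes because the abstract does not define the unit.

```python
from dataclasses import dataclass

# Assumption: "500 MB" means 500 * 10**6 bytes; the abstract does not say
# whether decimal or binary megabytes are intended.
SYNC_LIMIT_BYTES = 500 * 10**6


@dataclass(frozen=True)
class Dataset:
    name: str        # hypothetical dataset identifier
    size_bytes: int  # total size of the dataset in bytes


def split_for_sync(datasets, limit=SYNC_LIMIT_BYTES):
    """Partition datasets into (syncable, cloud_only) lists.

    A dataset is syncable only if it is strictly smaller than the limit,
    matching the "less than 500 MB" wording. Everything else stays in
    the cloud and is reached from the managed platforms instead.
    """
    syncable = [d for d in datasets if d.size_bytes < limit]
    cloud_only = [d for d in datasets if d.size_bytes >= limit]
    return syncable, cloud_only
```

Under this reading, a dataset of exactly 500 × 10^6 bytes is not synced to the desktop and remains available only through the EcoCloud analysis platforms.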
The managed platforms consist of RStudio Server, JupyterHub and a virtual desktop. EcoCloud offers RStudio Server and JupyterHub as managed services accessible from a web browser. These services will run as containers in a Kubernetes cluster. A CoESRA virtual desktop environment will also be part of the platform. All virtual desktops will include the applications RStudio, Canopy, QGIS, Panoply, Kepler, KNIME and OpenRefine, along with an option to sync the user’s Dropbox drive. All of the managed application platforms will be able to access EcoDrive and EcoStore to enable data analysis and application development.
User support in eResearch is important for bringing e-infrastructure closer to the domain research community. EcoCloud will have a dedicated online support system, including online tutorials and an e-helpdesk, to share system expertise and help users carry out their research by addressing the problems that arise in their eResearch work.
- National Ocean Service. Available from: http://oceanservice.noaa.gov/facts/ecosci.html, accessed 28 June 2017.
- Guru, S.M., I.C. Hanigan, H.A. Nguyen, E. Burns, J. Stein, W. Blanchard, D.B. Lindenmayer, and T. Clancy, “Development of a cloud-based platform for reproducible science: A case study of an IUCN Red List of Ecosystems Assessment”, Ecological Informatics, Vol. 36, 2016.
Siddeswara Guru is a Data Science Director for the Terrestrial Ecosystem Research Network (TERN). He initiates, coordinates and manages ecological data, e-infrastructure and synthesis projects, in addition to overseeing the data and information management activities across TERN.
Dr Hoylen Sue is a data specialist at the Queensland Cyber Infrastructure Foundation (QCIF). He has experience in research and development in the areas of data management, metadata, distributed systems and cloud computing.