Mr Guido Aben1
1AARNet, Kensington, Australia
CloudStor: from data platform to research hub
Research informed roadmap
AARNet has laid out a roadmap for CloudStor based on feedback from researchers in diverse domains, with varying data storage, movement and technology requirements, and on observation of users and usage patterns. CloudStor stores data ranging from megabytes to terabytes in size: data deposited all at once or arriving in batches, data deposited or made accessible by hand or through a machine-to-machine interface, and data that serves as an input to or output of research in science, humanities, arts and social science. To make life easier and more productive for the range of researchers using CloudStor, AARNet is evaluating, and will roll out, a range of technologies and platforms as steps toward transforming the infrastructure from a synch app into a research hub.
This presentation discusses Jupyter Notebook, Kaltura, and direct S3-interface bulk storage access as candidates for integration. Each aids the evolution of CloudStor from a data storage platform into a research hub and serves diverse research infrastructure requirements.
From synch app to research hub
CloudStor was initially conceived as a data movement and synchronisation platform operating at AARNet line speed (~10 Gbps) and synchronising across the Australian continent. Designing for these basic data infrastructure requirements from the outset was a daunting prospect, and the result is an important piece of national research infrastructure that AARNet has laid down in service of research and education. What has emerged is a consolidated and stable platform that supports the day-to-day data movement and working storage operations of a sizable number of researchers (~37,000 as at June 2017).
Two new patterns of user/usage have emerged that reflect the shift from synch app to research hub:
- shared storage space (groups of researchers, administrators, and infrastructure specialists) enabling collaboration and multi-party data handling
- direct data transfer by machines (as system users) as an efficient step in the research and data curation lifecycles
Researchers and data infrastructure support specialists are actively using the platform to define groups of collaborators and to share specific subsets of their data with specific users. Examples of these groups are PARADISEC, the Australian Data Archive, the Australian Antarctic Division (for ice acoustic data) and several Centres of Excellence that use groups to bridge between academic and industry participants. Data management patterns, confirmed through interviews with users, also show that CloudStor users (as data sources) are both humans and machines: instruments have been set up to upload data directly into CloudStor (albeit logged in as a user, as mandated through AAF policy). These observations (the uptake of group functionality, and group membership that includes both humans and machines) reveal that CloudStor is no longer used exclusively as a “personal cloud folder” and platform, but is becoming a research data hub.
Augmenting the research hub ecosystem
Triggered by the research demand described above, AARNet is investigating several technologies and platforms as candidate systems for CloudStor to integrate or interface with. The candidates have been chosen on the basis of:
- International evidence of research value (access to computational notebooks)
- Engagement with researchers across all domains, and in particular the humanities, arts and social sciences (access to multimedia data processing), and scientists using instruments generating big data (access to bulk storage).
Integration of Jupyter with CloudStor: Peer e-infrastructure service providers, notably CERN, have seen high demand for a computational notebook service that is tightly coupled with cloud storage; at CERN, this combined service is called SWAN. Integrating these pieces of research infrastructure enables researchers to execute relatively simple computation and data manipulation on the active data in cloud storage, with no need to download the data, undertake the compute, and re-upload the results into storage. Scripts used for computation can themselves be kept, versioned and run directly from CloudStor; the resulting system turns CloudStor into a cloud data manipulation engine. AARNet has a trial version of a “SWAN-like” service on the CloudStor roadmap.
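As an illustration of the kind of "relatively simple computation" a notebook coupled to storage could run, the following is a minimal sketch only. It assumes the researcher's cloud storage appears to the notebook as an ordinary directory; the function names, file layout and column name are hypothetical and do not describe the CloudStor or SWAN implementation.

```python
import csv
import statistics
from pathlib import Path


def summarise_column(data_file: Path, column: str) -> dict:
    """Compute basic statistics for one numeric column of a CSV file."""
    with data_file.open(newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "max": max(values),
    }


def write_summary(data_file: Path, column: str) -> Path:
    """Write the summary beside the input file, so the result stays in storage."""
    summary = summarise_column(data_file, column)
    # e.g. readings.csv -> readings.summary.csv (hypothetical naming scheme)
    out_file = data_file.with_suffix(".summary.csv")
    with out_file.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(summary))
        writer.writeheader()
        writer.writerow(summary)
    return out_file
```

In a notebook cell, a call such as `write_summary(Path("<storage mount>/readings.csv"), "temp")` would read and write in place, with no download or re-upload step; the mount path is a placeholder for wherever the storage is exposed.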
Multimedia curation/viewing via plugin (through Kaltura): Large multimedia holdings are being stored in CloudStor, notably by groups of HASS researchers. In interviews with these research colleagues, we have discovered that direct previewing of these files from their cloud storage platform would be beneficial. Beyond this basic file viewing requirement, we understand that annotation, geo-fencing and rights management would further enhance the value of CloudStor as a multimedia data processing platform. As a result of this engagement with researchers, AARNet has added the implementation of a Kaltura plugin node to the CloudStor roadmap, and we will be working with selected HASS researchers to fine-tune the offering.
Direct one-way (upload-only) transfer of oversize datasets: For data generated in a day-to-day “trickle” pattern, the synch&share paradigm enabled by the CloudStor client apps works well. A different paradigm is being tested within CloudStor for data generated by large science instruments and transferred into storage, for two reasons. (1) The instrument data does not need to be kept in synch; it just needs to be uploaded (and is never downloaded back to the instrument). (2) For raw performance, better interfaces exist than the WebDAV protocol used by the synch clients in CloudStor. AARNet is currently trialling direct S3 bulk-storage access to the CloudStor data vaults with a number of selected research groups.
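One feature of the S3 API that can suit oversize, upload-only transfers is multipart upload, in which a large object is sent as independent parts that can be transferred in parallel and retried individually. The sketch below only plans the byte ranges of such an upload; it performs no network calls. The 5 MiB minimum part size and 10,000-part limit are those of the S3 API, while the function name and the 64 MiB default are illustrative assumptions, and whether the CloudStor trial exposes multipart upload is not stated here.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum size for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts in one upload


def plan_parts(total_size: int, part_size: int = 64 * MIB) -> list:
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB S3 minimum")
    # Grow the part size if the object would otherwise exceed MAX_PARTS parts.
    smallest_allowed = -(-total_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Each tuple describes one slice of the instrument file that a worker could read and upload on its own, which is what lets an upload-only transfer spread across several connections.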
Guido Aben is AARNet’s director of eResearch. He holds an MSc in physics from Utrecht University.
In his current role at AARNet, Guido is responsible for building services to researchers’ demand, and generating demand for said services, with CloudStor and CloudStor+ perhaps the most widely known of those.