Automating the Data Workflow and Distribution in Research Computing

Mr David Fellinger1

1iRODS Consortium, Chapel Hill, United States

 

In the early days of high performance research computing, much of the work consisted of iterative mathematics and matrix resolution, complex system simulation, and visualization. In the US, machines like ASCI Blue Pacific at Lawrence Livermore National Laboratory helped end the need for live nuclear testing, and ASCI Red at Sandia National Laboratories was used to visualize the core of the Sun, giving researchers insight into a large scale fusion reaction. Many of these jobs ran for multiple days, so scheduling could easily be done with a relatively small number of resources. Intermediate results were often written out to “scratch” file systems as checkpoints so that a job could be restarted in case of a crash or partial cluster failure. Of course, restarts were also always handled manually.

The inception of parallel file systems represented a major innovation in high performance computing. Storage systems could be designed as parallel clusters directly coupled with compute clusters, utilizing efficient network protocols and infrastructures based on Fibre Channel and InfiniBand. Input/Output (I/O) bandwidth could be scaled by increasing the number of elements in the storage cluster under the file system's management. The parallel file system Lustre was developed starting in 2001, with deployments in 2002, as an open source project initially funded by a government agency with an interest in compute efficiency. This file system found immediate acceptance in both government and academic research sites worldwide, since far less time was required to move large amounts of data to or from compute clusters. Innovations in both storage systems and networking allowed I/O bandwidths to grow from gigabytes per second to terabytes per second in just a few years, with compute clusters in excess of 100,000 cores and storage systems in the tens of petabytes.

As I/O systems and compute clusters were evolving, there was also a concurrent evolution in the types of jobs that were run on these efficient machines. Besides simulation and visualization, many machines were used for data analysis and reduction. Large data sets from physical devices like sensors, microscopes, genome sequencers, telescopes, and radio telescope arrays were placed on the parallel file systems and ingested into machine cache to achieve the required goals of the run. Data management now required dedicated staff to manage the data moved into a site for reduction. Interesting use cases evolved, such as the use of an IBM Blue Gene machine at ASTRON in the Netherlands to execute fast Fourier transforms on radio telescope data in real time upon ingestion. Experiments on instruments like the Large Hadron Collider at CERN generated petabytes of data which had to be reduced by high performance computation to be of use. Through this evolution, tools were developed to schedule jobs on all or portions of a cluster, to check job placements and optimize the schedulers, and to automate reporting so that systems administrators could understand the cluster resources engaged in the execution of a job. The migration of data to and from the cluster, however, remained a largely manual task requiring human intervention to control the data flow to and from the “scratch” file system used for each compute job.

THE CASE FOR WORKFLOW AUTOMATION

The Integrated Rule-Oriented Data System (iRODS) was introduced as open source middleware to enable the establishment of research collections. Large repositories such as EUDAT in the European Union and the large genomic sequence repository at the Wellcome Trust Sanger Institute in the UK are maintained under iRODS management. Collection maintenance can be automated, from metadata extraction upon ingestion through the application of enforceable access controls and automated distribution to subscribers. Recently, the iRODS Consortium has launched a program to closely tie iRODS to parallel file systems, and an interface to Lustre was the first to be written.

Lustre is an asymmetrical file system with a Metadata Server (MDS) retaining a custom database that relates file system names, or V-nodes, with reference designations, or I-nodes, and storage locations called Object Storage Targets (OSTs). The MDS also handles permissions, and every file transfer must start with an MDS query. The file system is extremely scalable since the intensive extent list construction and maintenance is done in multiple parallel OSTs, with the entire file system structure tracked by the efficient MDS process. The iRODS interface consists of a service node operating as an additional Lustre client that constantly monitors the MDS through the changelog, a service provided by the MDS. iRODS is not involved in any transactions relating to data transferred to or from the compute cluster, so the performance of the file system is not compromised by the addition of a data management layer.
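
To illustrate how such a connector can watch the MDS without sitting in the data path, the following is a minimal sketch (not the production iRODS-Lustre interface) that polls the Lustre changelog with the standard lfs changelog command and hands new file events to a registration callback; the MDT name, the changelog reader ID and the register_in_irods helper are assumptions for illustration only.

# Minimal sketch of a changelog watcher, not the production iRODS-Lustre connector.
# Assumptions: an MDT named "lustre-MDT0000", a registered changelog reader "cl1",
# and a register_in_irods() callback supplied by the caller.
import subprocess
import time

MDT = "lustre-MDT0000"
READER = "cl1"

def poll_changelog(register_in_irods, interval=10):
    """Poll the Lustre changelog and report CREAT/CLOSE records to iRODS."""
    last_index = 0
    while True:
        out = subprocess.run(["lfs", "changelog", MDT, str(last_index + 1)],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            fields = line.split()
            if not fields:
                continue
            index, rec_type = int(fields[0]), fields[1]
            if rec_type.endswith(("CREAT", "CLOSE")):
                # This sketch passes the last field as the target name; a production
                # parser would resolve the record's FIDs to full paths instead.
                register_in_irods(rec_type, fields[-1])
            last_index = max(last_index, index)
        # Acknowledge processed records so the MDS can purge them.
        subprocess.run(["lfs", "changelog_clear", MDT, READER, str(last_index)],
                       check=True)
        time.sleep(interval)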

iRODS is similarly asymmetrical and scalable: the iCAT, the database that catalogs and maintains data placement information, can be a separate element, with groups of servers actually executing data movement. The iRODS interface with Lustre allows the construction of, effectively, a parallel database such that the Lustre MDS and the iRODS iCAT remain consistent. The iCAT may, however, contain additional metadata which would not be practical or required in the MDS.

All operations within iRODS are driven by policies consisting of rules customized for each deployment. These rules are executed at Policy Enforcement Points (PEPs) as data collections are constructed and manipulated.
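
As a hedged sketch of what such a rule might look like under the iRODS Python rule engine plugin, the fragment below tags newly ingested data objects with an illustrative attribute at the post-put PEP; the attribute names are invented for illustration, and the PEP signature and session_vars usage follow common published examples and should be checked against the deployed iRODS version.

# Sketch of a policy for the iRODS Python rule engine plugin (core.py).
# Assumption: the deployment fires acPostProcForPut after each ingest and exposes
# the object path via session_vars, as in the consortium's published examples.
import session_vars

def acPostProcForPut(rule_args, callback, rei):
    """Policy Enforcement Point: annotate every newly ingested data object."""
    obj_path = session_vars.get_map(rei)['data_object']['object_path']
    # Attach an illustrative attribute/value/unit triple to the iCAT entry.
    callback.msiModAVUMetadata('-d', obj_path, 'add',
                               'ingest_policy', 'radio_astronomy_raw', '')
    callback.writeLine('serverLog', 'Tagged {0} on ingest'.format(obj_path))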

A workflow example might be one from radio astronomy where specific time domain Fourier planes of observed data must be manipulated by a compute cluster to determine changes that may have occurred over specific time intervals. These data could exist on an external file system, and a system scheduler like the Slurm Workload Manager could notify iRODS that the data are required on the “scratch” file system adjacent to the compute cluster. iRODS could then manage the movement of the data to Lustre, tracking the changes on the MDS. Upon data reduction, the output files would be written to Lustre, and iRODS could monitor this activity through the scheduler or by monitoring the predetermined V-node entries on the MDS. iRODS could then initiate the movement of the resultant data to an archive that could be utilized by researchers. These output data can be analyzed in flight by iRODS-driven processes so that the file headers can be read to enrich the iCAT metadata with elements such as the coordinates of movement within the Fourier planes.
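
A minimal sketch of the staging and harvest steps, written against the python-irodsclient API, is shown below; the hostnames, zone, collection and scratch paths are illustrative assumptions rather than a description of any particular site.

# Sketch of a staging/harvest step around a reduction job; all hosts, zones and
# paths are illustrative assumptions, not a description of a specific deployment.
from irods.session import iRODSSession

SCRATCH = '/lustre/scratch/obs_run_42'
COLLECTION = '/radioZone/home/survey/fourier_planes'

with iRODSSession(host='irods.example.org', port=1247, user='svc_pipeline',
                  password='changeme', zone='radioZone') as session:
    # Stage the observation planes onto the Lustre scratch file system.
    for obj in session.collections.get(COLLECTION).data_objects:
        session.data_objects.get(obj.path, '{0}/{1}'.format(SCRATCH, obj.name))

    # ... the scheduler (e.g. Slurm) runs the reduction job against SCRATCH ...

    # Harvest a result file back into the managed collection and enrich the iCAT
    # with metadata read from the output, as described above.
    result_path = COLLECTION + '/results/diff_planes.fits'
    session.data_objects.put(SCRATCH + '/diff_planes.fits', result_path)
    result = session.data_objects.get(result_path)
    result.metadata.add('plane_interval', '2017-06-01/2017-06-02')
    result.metadata.add('reduction_run', '42')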

Collaboration can be enabled by the federation capabilities of iRODS so that multiple sites can be granted access to all or part of the collection interactively. All of this activity can be fully automated and transacted based upon policies which can evolve over time to cover all phases of data lifecycle management, from analysis to publication and finally the transition to a long term archive.

The end result is a flexible data infrastructure that is immediately available and searchable after the data reduction run. In effect, a collection is built where the compute cluster is the co-author and iRODS is the policy-driven editor.

Iterative science is often driven by changes in experimental paradigms triggered by previous results. Providing results to researchers as efficiently as possible is critical to maximizing the value of any iterative experiment. iRODS can be utilized to completely automate data management and data delivery in a research computing environment, shortening this experimental “feedback loop”.

 


Biography

Dave Fellinger is a Data Management Technologist with the iRODS Consortium. He has over three decades of engineering and research experience including film systems, video processing devices, ASIC design and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems. As Chief Scientist of DataDirect Networks, Inc. he focused on building an intellectual property portfolio and representing the company at storage-focused conferences worldwide.

In his role with the iRODS Consortium, Dave works with users at research sites to ensure that the functions and features of iRODS enable fully automated data management through data ingestion, security, maintenance, and distribution. He serves on the External Advisory Board of the DataNet Federation Consortium and was a member of the founding board of the iRODS Consortium.

He attended Carnegie Mellon University and holds patents in diverse areas of technology.

Cloud Infrastructures/SDN Testbeds for eScience Research

Dr Kohei Ichikawa1, Dr Philip Papadopoulos2, Professor David Abramson3, Dr Peter Elford4, Dr David Wilde4, Dr Warrick Mitchell4

1 Nara Institute of Science and Technology, Ikoma, Japan, ichikawa@is.naist.jp

2 University of California, San Diego, San Diego, USA, ppapadopoulos@ucsd.edu

3 The University of Queensland, Brisbane, Australia, david.abramson@uq.edu.au

4 AARNet, Brisbane, Melbourne, Australia, Peter.Elford, David.Wilde, Warrick.Mitchell@aarnet.edu.au

DESCRIPTION

Effective sharable cyberinfrastructure is fundamental for collaborative research in eScience. The goal of this session is to share and discuss recent research activities, technologies, ideas and issues concerning shared cyberinfrastructures for international collaborative research. As the conference is co-located with the PRAGMA workshop, the session focuses on sharing and connecting the research of both the Australian eResearch community and the PRAGMA community [1]. The PRAGMA community has been developing a sharable cyberinfrastructure, the PRAGMA testbed [2], for many years, and its focus has recently shifted to virtualized infrastructure and Cloud computing. With this move to virtualized infrastructure, virtualized network technology is also becoming a major interest in the community, and PRAGMA has established an international SDN (Software-Defined Networking) testbed, PRAGMA-ENT (Experimental Network Testbed) [3, 4]. In the session, we will have several presentations on sharable cyberinfrastructure, including Cloud resources and SDN testbeds, from both the Australian and PRAGMA communities, and we will also discuss common research topics and issues concerning shared cyberinfrastructures for eScience research.

The duration of this session would be 60 minutes.

Topics of interest include, but are not limited to, the following:

  1. eScience cyberinfrastructure/Testbed
  2. Grid/Cloud computing
  3. Resource scheduler/allocation/optimization
  4. Software-Defined Networking/Storage/Data Center
  5. Security of shared resources
  6. Operation/management of shared resources
  7. High availability and fault tolerance of shared resources
  8. Monitoring of shared resources
  9. Visualization infrastructure for collaborative data science
  10. eScience applications on widely distributed cyberinfrastructures

REFERENCES

  1. PRAGMA project. Available from: http://www.pragma-grid.net/, accessed in June
  2. D. Abramson, A. Lynch, H. Takemiya, Y. Tanimura, S. Date, H. Nakamura, Karpjoo Jeong, Suntae Hwang, Ji Zhu, Zhong-hua Lu, C. Amoreira, K. Baldridge, Hurng-Chun Lee, Chi-Wei Wang, Horng-Liang Shih, T. Molina, Wilfred W. Li, P. W. Arzberger, “Deploying Scientific Applications to the PRAGMA Grid Testbed: Strategies and Lessons,” Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID 06); 2006:241-248.
  3. PRAGMA-ENT project. Available from: https://github.com/pragmagrid/pragma_ent/wiki, accessed in June
  4. Kohei Ichikawa, Pongsakorn U-chupala, Che Huang, Chawanat Nakasan, Te-Lung Liu, Jo-Yu Chang, Li-Chi Ku, Whey-Fone Tsai, Jason Haga, Hiroaki Yamanaka, Eiji Kawai, Yoshiyuki Kido, Susumu Date, Shinji Shimojo, Philip Papadopoulos, Mauricio Tsugawa, Matthew Collins, Kyuho Jeong, Renato Figueiredo, Jose Fortes, “PRAGMA-ENT: An International SDN Testbed for a Cyberinfrastructure in the Pacific Rim,” Concurrency And Computation: Practice And Experience, Wiley InterScience; 2017:e4138.

BIOGRAPHY

Kohei Ichikawa is an Associate Professor in the Graduate School of Information Science at the Nara Institute of Science and Technology, Japan. His past research work involved the design and development of middleware and applications for Grid computing, virtualized infrastructures and Software Defined Networking testbeds. He currently works on the PRAGMA project, developing an international Software Defined Networking testbed for use by PRAGMA researchers.

Don’t Forget the Workflows – the Drivers of the Compute and Storage Infrastructure

Mr Jake Carroll1

1The University of Queensland, QBI, St Lucia (Brisbane), Australia, jake.carroll@uq.edu.au

INTRODUCTION

Within the research computing sector, significant time, money and resources are spent on the architecture, engineering and design of infrastructure to undertake a task. Whether the expenditure is in building localised capability or in the consumption of cloud resources, the result is generally the same: a user, group or institution pays for the capability to run a task on a platform to achieve a scientifically and statistically significant output. Missing from this, however, and often left to the end user to consider for themselves, is the question of which workflows best complement the infrastructure and interlock with it to provide the best performance characteristics. Workflows [1] are not novel and have an established place in scientific computing. A growing trend, and a performance-driven objective, is the use of workflow-aware infrastructure and, conversely, infrastructure-aware workflows. This presentation will provide insight into the University of Queensland Research Computing Centre's efforts in the development, surfacing and eventual use of a data locality, caching and workflow aware storage infrastructure engine, known as the MeDiCI (Metropolitan Data Intensive Caching Infrastructure) project. The presentation will further illustrate a real-life example of how this infrastructure is being used to deliver a multi-site, multi-supercomputer, data-locality-aware scientific outcome.

DATA LOCALITY AWARE INFRASTRUCTURE

The concept of location-aware infrastructure is still relatively nascent and is the subject of significant research in industry and academia. At a macro level, in code path optimisation, a great deal of work has taken place in finding ways to reduce code path latency with respect to core-to-memory and core-to-L1/L2/L3 access times, to avoid inefficiency in pipeline execution [2]. Where data caching locality has received only light treatment, however, is at the storage I/O layer. The MeDiCI architecture and the underpinning work of Abramson et al. challenge the notion that a supercomputer should have data in one location at all times simply because of where the workflow that generated the data happened to run. Moreover, there are circumstances where the generation of data is widely distributed and disconnected from the processing hub or sink. MeDiCI aims to tie instrument and processing together via the use of intelligent policy, autonomous filesystem movement semantics and workflow-aware migration patterns.

Figure 1: The caching architecture touch points of the UQ campus for the MeDiCI framework

In several cases, workflows have been constructed to take advantage of advanced computing functionality locally, within organisational units of the University of Queensland, and equally at the remote site where the majority of the larger scale supercomputing platforms reside. One such example is the work of UQ's largest genomics research group, which uses both local and remote HPC facilities for different workloads. The use of these combined facilities provides not only a greater number of resources and more scale for the group, but also flexibility in determining where a workload is run to suit specific use cases and workload categories. An example of this can be found in a code known as GCTA (http://cnsgenomics.com/software/gcta/), which is MPI aware and benefits significantly from well coupled, low latency RDMA constructs in the infrastructure, whereas other codes such as BayesR (https://github.com/syntheke/bayesR) benefit from large shared memory spaces. Yet another code known as epiGPU (https://github.com/explodecomputer/epiGPU) benefits from massively scaled GPU arrays via CUDA frameworks. Using MeDiCI's fabric semantics and single transport namespace allows the data to surface where it needs to and, importantly, only when it needs to.

The implementation and use of the AFM (Active File Management) data transport for parallel I/O movement between cache and home nodes is the key to automating data movement upon fopen() of a file or structure on the filesystem within the workflow. MeDiCI removes the need to place data in more than one location manually, providing an autonomous mount and data distribution methodology that exposes an eventually consistent filesystem namespace to all of the computational resources mentioned above. These workflows take advantage of this: processing can occur at one computational ‘edge’ and then trigger processing at another facility for a different treatment.
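
To make the fopen()-driven behaviour concrete, the fragment below is a small illustrative sketch (the cache path is a hypothetical example): simply opening a file under the AFM cache fileset is enough for Spectrum Scale to fetch it from the home site, so the workflow code contains no explicit transfer step.

# Illustrative only: the path below is a hypothetical AFM cache fileset.
# Opening the file is the trigger; AFM fetches it from the home site on demand,
# so no explicit copy or transfer call appears anywhere in the workflow code.
import time

CACHE_PATH = "/gpfs/medici-cache/genomics/cohort01/plink.bed"

start = time.time()
with open(CACHE_PATH, "rb") as fh:   # first open may block while AFM pulls the data
    header = fh.read(3)              # subsequent reads are served from the local cache
print("read {0} bytes in {1:.2f}s (first access includes the AFM fetch from home)"
      .format(len(header), time.time() - start))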

Figure 2: Data location and transports from one platform to another, autonomously, exhibiting automated cache hierarchy in the workflow.

REFERENCES

  1. Taylor, I.J., Deelman, E., Gannon, D.B., Shields, M. Workflows for e-Science: Scientific Workflows for Grids. (2007). 1st Edition [Ebook]. New York: Springer Press, pp. 3-6. Available at http://www.springer.com/gp/book/9781846285196#otherversion=9781849966191 [Accessed 13th June, 2017]
  2. Shi, Q., Kurian, G., Devadas, S., Khan, O. LDAC: Locality-Aware Data Access Control for Large-Scale Multicore Cache Hierarchies. (2016). ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 4, Article No. 37.

 


BIOGRAPHY

Jake Carroll is the Senior ICT Manager (Research) for one of the largest neuroscientific research organisations in the world – the Queensland Brain Institute, within the University of Queensland.

Jake has worked in the research and scientific computing sector for around a decade in various roles. On the way, as he’s run all over the planet, he’s collected many wonderful colleagues, friends, collaborators and associates.  You might find it odd to hear this from a ‘once was IT guy’, but the truth is, Jake believes in people and their passion, brilliance and innovation as the most precious commodity we have. People and their passion are what makes Jake still come to work every day.

Jake is passionate about making technology go further and work harder for research, and has a track record of building things that allow scientists to do what was not previously possible at the campus, national and even international scale.

Slightly older (and if he’s very lucky, slightly wiser), Jake now takes a more quiet and considered seat in the leadership, governance and directionality of research focused technology teams, infrastructure projects and think-tanks in Australia and across the world.

MeDiCI can do that!

Mr Michael Mallon1, Mr Jake Carroll2, Dr David Abramson3

1RCC The University of Queensland, St Lucia, Australia, m.mallon@uq.edu.au

2QBI The University of Queensland, St Lucia, Australia, jake.carroll@uq.edu.au

3RCC The University of Queensland, St Lucia, Australia, david.abramson@uq.edu.au

INTRODUCTION

The increasing amount of data being collected from simulations, instruments and sensors creates challenges for existing e-Science infrastructure [1]. In particular, it requires new ways of storing, distributing and processing data in order to cope with both the volume and velocity of the data. In addition, maintaining separate silos of storage for different access technologies becomes both a barrier to migration between technologies and a significant source of inefficiency in providing data services as the volume and/or velocity of data increases. The University of Queensland has recently designed and deployed MeDiCI, a data fabric that spans the metropolitan area and provides seamless access to data regardless of where it is created, manipulated and archived. MeDiCI is novel in that it exploits temporal and spatial locality to mask network limitations. This means that data only needs to reside locally in high speed storage whilst being manipulated, and it can be archived transparently in high capacity, but slower, technologies at other times. MeDiCI is built on commercially available technologies, namely IBM's Spectrum Scale (formerly GPFS) [2] and the HPE/SGI Data Migration Facility (DMF) [3]. In this talk I will describe these innovations, in particular how we avoid the storage silo issue when providing access to data in the fabric.

CACHES ALL THE WAY DOWN

MeDiCI is designed around two fundamental concepts. First, locality makes it possible to store data centrally in a dedicated storage system but cache it temporarily where it is generated or used. Caching makes it possible to create an illusion of uniform high-speed access across all platforms, when in fact there may be a mix of speeds-and-feeds between systems. Second, while there may be multiple physical copies of data for fault tolerance, this should not be confused with the ways in which users wish to access it. Instead, we keep only one logical copy of a data set, which can then be exposed by a variety of access mechanisms to suit the applications. We may store multiple copies of the data to provide fault tolerance, but only one is regarded as the primary data instance. We implement this resilient file store on our existing DMF infrastructure.

MOVING DATA OUT OF THE DATA CENTER: ACTIVE FILE MANAGEMENT

Active File Management (AFM) is a feature of IBM's Spectrum Scale that allows a remote filesystem to be cached in a GPFS filesystem [4]. This remote (or “home”) filesystem may be either an NFS export or a GPFS remote filesystem mount. There are several modes of operation that may be employed to establish an AFM relationship between a home and a cache fileset (a logical separation for a set of files in a GPFS filesystem). In the MeDiCI context, the most commonly used caching modes are independent-writer (where multiple caches may update files on a home share), single-writer (the cache assumes that it is “correct” and makes the home match its state) and read-only. There is no file locking across the AFM relationship and data transfer is asynchronous, which makes the AFM relationship more akin to eventual consistency than immediate consistency. The GPFS clusters that make up the AFM relationship are still immediately consistent and fully POSIX compliant; it is only across the AFM relationship that immediate consistency is not implemented. This is particularly important for independent-writer mode, where files can be updated at cache or at home. Care must be taken to ensure that files are not modified at both home and cache at the same time; otherwise, the “last” write will win.
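
As a hedged illustration of the care required in independent-writer mode, the check below compares the age of the home copy (over hypothetical cache and home mounts) before allowing an update at the cache; it is a workflow-level convention sketched under stated assumptions, not an AFM feature.

# Sketch of a workflow-level guard for AFM independent-writer mode.
# Both mount points are hypothetical; AFM itself provides no cross-site locking,
# so a convention like this is enforced by the workflow, not by the filesystem.
import os
import time

CACHE_ROOT = "/gpfs/cache/project"   # local AFM cache fileset (assumption)
HOME_ROOT = "/mnt/home/project"      # NFS view of the home fileset (assumption)

def safe_to_update(relpath, window=300):
    """Refuse an update at the cache if the home copy changed within `window` seconds.

    Because AFM replication is asynchronous and the "last" write wins, a recent
    modification at home is treated as a potential conflict.
    """
    home_file = os.path.join(HOME_ROOT, relpath)
    if not os.path.exists(home_file):
        return True
    return (time.time() - os.path.getmtime(home_file)) > window

if safe_to_update("cohort01/results.csv"):
    with open(os.path.join(CACHE_ROOT, "cohort01/results.csv"), "a") as fh:
        fh.write("new result row\n")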

CROSSING INSTITUTIONAL BORDERS: ID MAPPING

One of the great challenges with sharing data between organizations is independent identities. Over the past decade, there has been significant work by the Australian Access Federation (AAF) in abstracting away identities for HTTPS workloads. While this work has been a success, there have been significant challenges in leveraging it for POSIX identities. The ID mapping feature of Spectrum Scale allows us to sidestep this issue: it provides on-the-fly mapping of POSIX user, group and ACL IDs to and from a globally unique namespace. We have chosen AAF's auEduPersonSharedToken attribute [5] as the globally unique namespace for user identities. The reason for this choice is that QCIF's POSIX identities are managed via an AAF-based portal, which can capture the attribute at account creation time; at the same time, the issuing institution stores the attribute in either LDAP or a MySQL database, linking it to a local institutional credential, so both sites have easy access to the attribute. This significantly simplifies the management of the mapping scripts and leverages existing work.
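
A hedged sketch of the kind of mapping script described above is shown below; it translates between local POSIX UIDs and the auEduPersonSharedToken namespace using a local SQLite table, whereas a production deployment would consult the institutional LDAP or MySQL store mentioned in the text. The database path and schema are assumptions.

# Sketch of a UID <-> auEduPersonSharedToken mapping helper.
# The SQLite file and schema are assumptions standing in for the institutional
# LDAP/MySQL stores described in the abstract.
import sqlite3

DB = "/etc/medici/idmap.sqlite"   # hypothetical location
# Assumed schema: CREATE TABLE idmap (uid INTEGER PRIMARY KEY, token TEXT UNIQUE);

def uid_to_token(uid):
    """Map a local POSIX UID to its globally unique auEduPersonSharedToken."""
    with sqlite3.connect(DB) as conn:
        row = conn.execute("SELECT token FROM idmap WHERE uid = ?", (uid,)).fetchone()
    return row[0] if row else None

def token_to_uid(token):
    """Map an auEduPersonSharedToken back to the local POSIX UID at this site."""
    with sqlite3.connect(DB) as conn:
        row = conn.execute("SELECT uid FROM idmap WHERE token = ?", (token,)).fetchone()
    return row[0] if row else None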

ONE COPY, ANY WAY YOU WANT IT

The final key to providing seamless access to data is breaking down the storage silo barriers. The biggest obstacle to making this a reality has been the significant differences between POSIX semantics and object/S3 semantics. One approach has been to attempt to build a POSIX-style filestore on top of objects (e.g. the S3 storage gateway [6]). While this may work for some workloads, it typically breaks down when attempting to use multiple writers. The reverse solution is to build an object gateway on top of a POSIX filesystem. This allows for a fully POSIX compliant filesystem but means that the process of mapping objects needs to be done periodically. This is usually acceptable, as an S3/Swift endpoint is eventually consistent and so S3/Swift applications typically need to be designed to tolerate delays in updates or changes to objects.

GPFS provides the second type of solution to unifying file and object workloads via the unified file and object interface for OpenStack Swift [7]. This style of object storage is deployed as a different storage policy and is backed by a custom storage backend, ‘swiftonfile’. This backend encodes account and container information as a two-layer directory tree, and objects sit underneath this. A periodic process scans the directory tree for updates. IBM has also developed a Swift middleware aimed at providing a mechanism for running Swift on high latency media [8]. This allows data to be staged in and out of the cache, eliminating any concerns about Swift HTTP timeouts while accessing data that has been evicted from all disk caches.
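
The two-layer directory encoding can be pictured with the small sketch below, which maps an account/container/object triple to an on-disk path and walks the tree the way a periodic indexing pass would; the fileset root is an assumption, and the real swiftonfile backend also handles object metadata, which this illustration omits.

# Sketch of the account/container directory layout used by a swiftonfile-style
# backend; the fileset root is an assumption and metadata handling is omitted.
import os

FILESET_ROOT = "/gpfs/medici/object"   # hypothetical unified file/object fileset

def object_path(account, container, obj):
    """Account and container form the two directory layers; the object sits below."""
    return os.path.join(FILESET_ROOT, account, container, obj)

def scan_for_objects():
    """Yield (account, container, object) the way a periodic indexing pass would."""
    for account in os.listdir(FILESET_ROOT):
        acct_dir = os.path.join(FILESET_ROOT, account)
        for container in os.listdir(acct_dir):
            cont_dir = os.path.join(acct_dir, container)
            for dirpath, _dirs, files in os.walk(cont_dir):
                for name in files:
                    rel = os.path.relpath(os.path.join(dirpath, name), cont_dir)
                    yield account, container, rel

# Example: a file created over NFS/CIFS at this path later surfaces through the
# Swift API as the object cohort01/sample.vcf in container genomes.
print(object_path("AUTH_demo", "genomes", "cohort01/sample.vcf"))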

At the time of writing, we have demonstrated surfacing a single logical copy of data via native GPFS client mounts into HPC clusters, NFS exports into QCIF managed services, CIFS mounts to PCs and scientific instruments using institutional credentials, Nextcloud, S3/Swift using institutional credentials, and Swift using NeCTAR Keystone credentials.

REFERENCES

  1. McFedries, P. “The Coming Data Deluge”, IEEE Spectrum, Feb 2011.
  2. IBM Spectrum Scale, https://www.ibm.com/us-en/marketplace/scale-out-file-and-object-storage
  3. SGI DMF, http://www.sgi.com/products/storage/tiered/dmf.html
  4. Active File Management, https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_afm.htm
  5. auEduPersonSharedToken, http://wiki.aaf.edu.au/tech-info/attributes/auedupersonsharedtoken
  6. Amazon Web Services, “AWS Storage Gateway”, Apr 2017. https://d0.awsstatic.com/whitepapers/Storage/aws-storage-gateway-file-gateway-for-hybrid-architectures.pdf
  7. Unified file and object interface, https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_unifiedaccessoverview.htm
  8. IBM redbooks, http://www.redbooks.ibm.com/redpapers/abstracts/redp5430.html

 


Biography

Michael has been working for the Research Computing Centre at UQ for 5 years in various devops and support capacities. His current role in the RCC is developing and supporting the Queensland NeCTAR and RDS facilities for QCIF. He has expertise in HPC, Cloud, Storage and Networking. He holds a Bachelor of Science (Hons) in Physics and a Bachelor of Engineering (Hons) in Software Engineering, both awarded by the University of Queensland.
