Mr David Fellinger
iRODS Consortium, Chapel Hill, United States
In the early days of high performance research computing, much of the work consisted of iterative mathematics and matrix resolution, complex system simulation, and visualization. In the US, machines like ASCI Blue Pacific at Lawrence Livermore National Laboratory supported the stockpile stewardship simulations that replaced live nuclear testing, and ASCI Red at Sandia National Laboratories was used to visualize the core of the Sun, giving researchers insight into a large-scale fusion reaction. Many of these jobs ran for multiple days, so scheduling could easily be done with a relatively small number of resources. Intermediate results were often written out to “scratch” file systems as checkpoints so that a job could be restarted in case of a crash or partial cluster failure. Restarts, of course, were always handled manually.
The inception of parallel file systems represented a major innovation in high performance computing. Storage systems could be designed as parallel clusters directly coupled with compute clusters, utilizing efficient network protocols and infrastructures based on Fibre Channel and InfiniBand. Input/Output (I/O) bandwidth could be scaled by increasing the number of elements in the storage cluster under file system management. The parallel file system Lustre was developed starting in 2001, with deployments in 2002, as an open source project initially funded by a government agency with an interest in compute efficiency. This file system found immediate acceptance in both government and academic research sites worldwide, since far less time was required to move large amounts of data to or from compute clusters. Innovations in both storage systems and networking allowed I/O bandwidths to grow from gigabytes per second to terabytes per second in just a few years, with compute clusters in excess of 100,000 cores and storage systems in the tens of petabytes.
As I/O systems and compute clusters were evolving, there was also a concurrent evolution in the types of jobs that were run on these efficient machines. Besides simulation and visualization, many machines were used for data analysis and reduction. Large data sets from physical devices like sensors, microscopes, genome sequencers, telescopes, and radio telescope arrays were placed on the parallel file systems and ingested into machine cache to achieve the required goals of the run. Data management now required dedicated staff to handle data moved into a site for reduction. Interesting use cases evolved, like the use of an IBM Blue Gene machine at ASTRON in the Netherlands to execute fast Fourier transforms on radio telescope data in real time upon ingestion. Experiments on instruments like the Large Hadron Collider at CERN generated petabytes of data which had to be reduced by high performance computation to be of use. Through this evolution, tools were developed to schedule jobs on all or portions of a cluster, monitoring tools were developed to check job placements and optimize the schedulers, and reporting functions were automated so that systems administrators could understand the cluster resources engaged in the execution of a job. The migration of data to and from the cluster, however, remained a largely manual task requiring human intervention to control the data flow to and from the “scratch” file system used for each compute job.
THE CASE FOR WORKFLOW AUTOMATION
The Integrated Rule-Oriented Data System (iRODS) was introduced as open source middleware to enable the establishment of research collections. Large repositories such as EUDAT in the European Union and the large genomic sequence repository at the Wellcome Trust Sanger Institute in the UK are maintained under iRODS management. Collection maintenance can be automated, from metadata extraction upon ingestion through the application of enforceable access controls and automated distribution to subscribers. Recently, the iRODS Consortium has launched a program to closely tie iRODS with parallel file systems and an interface to Lustre was the first to be written.
Lustre is an asymmetrical file system with a Metadata Server (MDS) that maintains a custom database relating file system names, or V-nodes, to reference designations, or I-nodes, and to storage locations called Object Storage Targets (OSTs). The MDS also handles permissions, and every file transfer must start with an MDS query. The file system is extremely scalable since the intensive extent-list construction and maintenance is done in multiple parallel OSTs, with the entire file system structure tracked by the efficient MDS process. The iRODS interface consists of a service node operating as an additional Lustre client that constantly monitors the MDS through the changelog, a service of the MDS. iRODS is not involved in any transactions relating to data transferred to or from the compute cluster, so the performance of the file system is not compromised by the addition of a data management layer.
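The changelog-driven pattern described above can be sketched in miniature. The following Python toy (not the actual iRODS Lustre interface) models a service node that replays simplified changelog records in index order and mirrors them into a catalog, keeping a bookmark so polling can resume where it left off; the record fields and paths are illustrative.

```python
from dataclasses import dataclass

# A hypothetical, simplified changelog record. Real Lustre changelog
# entries carry an index, a record type (e.g. CREAT, UNLNK), a FID,
# and a name; only the parts needed for the sketch are kept here.
@dataclass
class ChangelogRecord:
    index: int
    rec_type: str   # "CREAT" or "UNLNK" in this sketch
    path: str

class CatalogSync:
    """Mirrors file system events into a toy iRODS-style catalog.

    The service node replays changelog records in order and keeps a
    bookmark (last processed index) so a restarted poller never
    re-applies or skips events.
    """
    def __init__(self):
        self.catalog = {}     # path -> metadata dict (stands in for the iCAT)
        self.bookmark = 0     # last changelog index consumed

    def apply(self, rec: ChangelogRecord):
        if rec.rec_type == "CREAT":
            self.catalog[rec.path] = {"registered": True}
        elif rec.rec_type == "UNLNK":
            self.catalog.pop(rec.path, None)
        self.bookmark = rec.index

    def poll(self, records):
        # Consume only records newer than the bookmark, in index order.
        for rec in sorted(records, key=lambda r: r.index):
            if rec.index > self.bookmark:
                self.apply(rec)

sync = CatalogSync()
sync.poll([ChangelogRecord(1, "CREAT", "/lustre/obs/plane_0001.fits"),
           ChangelogRecord(2, "CREAT", "/lustre/obs/plane_0002.fits"),
           ChangelogRecord(3, "UNLNK", "/lustre/obs/plane_0001.fits")])
```

Because the poller only reads the changelog, the data path between compute clients and OSTs is untouched, which is the property the text emphasizes.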
Interestingly, iRODS is also asymmetrical and scalable: the database called the iCAT, which catalogs and maintains data placement information, can be a separate element, with groups of servers actually executing data movement. The iRODS interface with Lustre effectively allows the construction of a parallel database such that the Lustre MDS and the iRODS iCAT remain consistent. The iCAT may, however, contain additional metadata that would not be practical or required in the MDS.
All operations within iRODS are driven by policies consisting of specific rules custom to each deployment. These rules are executed at Policy Enforcement Points (PEPs) as data collections are constructed and manipulated.
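Production iRODS policies are written in the iRODS Rule Language or through pluggable rule engines; the following is only a language-agnostic Python sketch of the PEP pattern itself, with invented PEP names and rule functions. Rules are registered against named enforcement points and fire automatically when the corresponding operation occurs.

```python
# Toy policy engine: rules registered against named Policy Enforcement
# Points (PEPs) fire when the corresponding operation occurs. The PEP
# name "post_ingest" and both rules are illustrative, not real iRODS
# identifiers.
class PolicyEngine:
    def __init__(self):
        self.rules = {}           # PEP name -> list of rule callables

    def on(self, pep):
        def register(rule):
            self.rules.setdefault(pep, []).append(rule)
            return rule
        return register

    def enforce(self, pep, context):
        # Run every rule bound to this PEP, in registration order.
        for rule in self.rules.get(pep, []):
            rule(context)

engine = PolicyEngine()

@engine.on("post_ingest")
def extract_header_metadata(ctx):
    # Enrich the catalog entry with metadata pulled from the file header.
    ctx["metadata"]["instrument"] = ctx["header"].get("TELESCOP", "unknown")

@engine.on("post_ingest")
def apply_access_control(ctx):
    # Apply an enforceable access control as part of the same policy.
    ctx["metadata"]["acl"] = "project-group-read"

ctx = {"header": {"TELESCOP": "LOFAR"}, "metadata": {}}
engine.enforce("post_ingest", ctx)
```

The design point the sketch illustrates is that the deployment's policy lives in the registered rules, not in the engine, so a site can evolve its rules over time without changing the machinery that enforces them.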
A workflow example might be one from radio astronomy, where specific time domain Fourier planes of observed data must be manipulated by a compute cluster to determine changes that may have occurred over specific time intervals. These data could exist on an external file system, and a system scheduler like the Slurm Workload Manager could notify iRODS that the data is required on the “scratch” file system adjacent to the compute cluster. iRODS could then manage the movement of the data to Lustre, tracking the changes on the MDS. Upon data reduction, the output files would be written to Lustre, and iRODS could monitor this activity through the scheduler or by monitoring the predetermined V-node entries on the MDS. iRODS could then initiate the movement of the resultant data to an archive that could be utilized by researchers. These output data can be analyzed in flight by iRODS-driven processes, so that file headers could be read to enrich the iCAT metadata with elements such as the coordinates of movement within the Fourier planes.
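The stage-in, reduce, stage-out cycle above can be condensed into a runnable toy. Everything here is a stand-in: the directories represent the external repository, the Lustre scratch space, and the long-term archive; the "reduction" is a placeholder transform; and the header parsing stands in for reading real instrument headers in flight to enrich the catalog.

```python
import pathlib
import shutil
import tempfile

def run_reduction_job(source_dir, scratch_dir, archive_dir, catalog):
    scratch = pathlib.Path(scratch_dir)
    archive = pathlib.Path(archive_dir)
    # Stage in: move observation data onto the scratch file system.
    for f in pathlib.Path(source_dir).glob("*.dat"):
        shutil.copy(f, scratch / f.name)
    # Compute: a placeholder reduction that writes one output per input.
    for f in list(scratch.glob("*.dat")):
        out = scratch / (f.stem + ".out")
        out.write_text("REDUCED " + f.read_text())
    # Stage out: archive the results and enrich the catalog in flight
    # from the first token of each output, our stand-in for a header.
    for out in scratch.glob("*.out"):
        shutil.copy(out, archive / out.name)
        catalog[out.name] = {"header": out.read_text().split()[0]}

catalog = {}
src, scr, arc = (tempfile.mkdtemp() for _ in range(3))
pathlib.Path(src, "plane_0001.dat").write_text("fourier-plane-data")
run_reduction_job(src, scr, arc, catalog)
```

After the run, the archive holds the reduced output and the catalog entry is searchable immediately, mirroring the claim that the collection is ready as soon as the reduction completes.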
Collaboration can be enabled by the federation capabilities of iRODS, so that multiple sites can be granted access to all or part of the collection interactively. All of this activity can be fully automated and transacted based upon policies which can evolve over time to govern all phases of data lifecycle management, from analysis to publication and finally the transition to a long-term archive.
The end result is a flexible data infrastructure that is immediately available and searchable after the data reduction run. In effect, a collection is built where the compute cluster is the co-author and iRODS is the policy-driven editor.
Iterative science is often driven by changes in experimental paradigms triggered by previous results. Providing results to researchers as efficiently as possible is critical to maximizing the value of any iterative experiment. iRODS can be utilized to completely automate data management and data delivery in a research computing environment, shortening this experimental “feedback loop”.
Dave Fellinger is a Data Management Technologist with the iRODS Consortium. He has over three decades of engineering and research experience including film systems, video processing devices, ASIC design and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems. As Chief Scientist of DataDirect Networks, Inc. he focused on building an intellectual property portfolio and representing the company to conferences with a storage focus worldwide.
In his role with the iRODS Consortium Dave is working with users in research sites to assure that the functions and features of iRODS enable fully automated data management through data ingestion, security, maintenance, and distribution. He serves on the External Advisory Board of the DataNet Federation Consortium and was a member of the founding board of the iRODS Consortium.
He attended Carnegie Mellon University and holds patents in diverse areas of technology.