Enabling eResearch with Automated Data Management from Ingestion Through Distribution

Mr David Fellinger1

1iRODS Consortium, Chapel Hill, United States, davef@renci.org


HSM as Critical Path

The early Beowulf clusters were generally used to solve iterative mathematical problems, simulate environments and processes, and generate visualizations of systems that are difficult, if not impossible, to recreate physically. These initial clusters allowed researchers to make great strides in shortening the time required to solve complex, multi-dimensional problems such as Schrödinger's equation applied to specific materials and systems. Applications have varied widely, from understanding the fusion reactions at the core of the Sun to simulating a spark plug in a car cylinder.

As small clusters evolved into supercomputers, file systems also evolved to capture the voluminous data generated by these simulations and visualizations. Parallel file systems such as GPFS and Lustre were designed to scale in both bandwidth and capacity, moving data between the cluster and the file system through multiple elements termed "gateway nodes". These file systems use extremely high-performance storage because compute and input/output (I/O) operations are mutually exclusive; slower storage would lengthen the I/O cycle and diminish the ratio of compute time to I/O time. This specialized storage is expensive, so the supercomputer should remain its primary client to maximize efficiency.

It follows that data should never be distributed to users or researchers directly from the file system that is closely coupled to a supercomputer. Thousands of individual requests can slow the parallel file system, needlessly extending I/O cycle times and reducing the effective life of the supercomputer. Secondary storage and file systems are generally used for widespread research data access. These file systems do not require custom access clients and are usually compatible with NFS or CIFS, which are standard tools in existing operating systems. Data is therefore migrated from the expensive parallel file system to the less expensive distribution file system using copy commands.

Even though the storage media used for distribution are less expensive, they are usually rotating media that must remain powered to operate. Many sites store rarely used data on tape, which is extremely inexpensive and does not require continuous power. This is referred to as archive storage and is the final copy location before the data is ultimately deleted. This workflow has fostered the growth of software processes termed hierarchical storage management (HSM) systems. Several organizations have developed such systems to move data from one location to another, generally based on the age of the data and the frequency of access. While these systems are effective, they usually require a great deal of human intervention.
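The age- and access-based migration described above can be sketched as a simple tier-selection function. The tier names and thresholds below are hypothetical, chosen only to illustrate the kind of rule a traditional HSM applies; real policies are site-specific.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real HSM policies are site-specific.
HOT_MAX_AGE_DAYS = 30      # recently created data stays on the fast tier
COLD_MIN_IDLE_DAYS = 365   # data untouched for a year goes to tape

@dataclass
class FileStat:
    age_days: int   # days since creation
    idle_days: int  # days since last access

def choose_tier(f: FileStat) -> str:
    """Map a file to a storage tier by age and access recency."""
    if f.idle_days >= COLD_MIN_IDLE_DAYS:
        return "tape_archive"       # powered-off, lowest cost
    if f.age_days <= HOT_MAX_AGE_DAYS:
        return "parallel_fs"        # high-performance tier
    return "distribution_fs"        # NFS/CIFS-served secondary tier

print(choose_tier(FileStat(age_days=5, idle_days=2)))      # parallel_fs
print(choose_tier(FileStat(age_days=400, idle_days=400)))  # tape_archive
```

Note that the rule consults only file age and idle time, never file content; this is precisely the limitation that policy-based systems address later in this paper.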

The Growth of Big Data and Data Reduction

The proliferation of large-scale sensor data has driven changes to both the compute and storage models. Data reduction environments now represent the majority of high performance computing in eResearch. Instruments such as scanning electron microscopes and genomic sequencers generate petabytes of data in health science research. Data reduction applications span studies from hydrology and seismic research to relating genotype to phenotype in medical research. Instruments such as the Large Hadron Collider and telescopes, both optical and radio, generate a great deal of data that must be mined and reduced to be of value to scientists. The term "Big Data" was coined to describe this sort of sensor data, characterized by volume, velocity, variety, variability, and veracity. Large volumes of metadata are also generated to track and secure the provenance of the collected data and maintain veracity. The task of managing "Big Data" from creation through ingestion, reduction, and distribution cannot be easily achieved with traditional HSM tools, which generally move data based on file date and type rather than on content analysis. Human intervention on a file-by-file or object-by-object basis is also difficult, if not impossible, given the petabytes that must be managed.

The Rise of Policy-Based Data Management

While it is often impossible to manage large quantities of data through human intervention, it is possible for data scientists and librarians to form a consensus that dictates a policy which, in turn, dictates computer-actionable rules that can be applied at every stage of a workflow. The Integrated Rule Oriented Data System (iRODS) has been designed to enable a very flexible policy-based management environment for a wide variety of applications. Recently developed features expand iRODS capabilities well beyond those of a traditional HSM, making it a complete tool that can simplify complex eResearch data manipulation and data discovery tasks.

The first phase of any data gathering and analysis project starts with the ingestion of instrument or sensor data to enable analysis. This process begins with the establishment of a "landing zone", a storage buffer for incoming data. The "landing zone" may be centralized but, if the instruments have buffer file systems, it can also be distributed so that data copies are minimized and the data effectively remains in the buffer space. Fully automated, rules-based data ingest capabilities have recently been added to the iRODS feature set. A file system scanner can watch external file systems for new or updated files and launch the appropriate action, which can be chosen based on file size and available transfer bandwidth. If the files to be processed are very large, it may be more efficient to register the data in place so that it can be moved, in whole or in part, to a file system adjacent to a compute cluster only when required for a data reduction process. If the files are smaller and transfer bandwidth is ample, it may make more sense to centralize the data on a file system that can deliver it to a compute file system more efficiently. In either case, iRODS can extract metadata to enable additional discovery operations. For example, it is common practice to study large numbers of genomic sequence files in parallel to identify similarities or differences. Attributes are generally associated with these files to improve reduction efficiency, so that only files with specific attributes are compared. iRODS can extract these attributes and use them to enrich the metadata, ensuring that only data with very specific attributes are moved to a "scratch" file system for analysis. This entire process of automated file selection and data movement can trigger other operations through policy enforcement points.
These operations could include launching a data reduction process or generating a report describing the data in the "landing zone". The file scanner is a "pull" process, but iRODS can also operate in a "push" mode. Parallel file systems such as Lustre can generate a changelog of file system modifications, and iRODS can move data to a specific location based on these changes and the associated rules. In fact, iRODS can respond to any system state change or event to begin a data movement or analysis process. This is useful for an instrument such as a CT scanner that is not in continuous service but whose output must be harvested when a scan completes.
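The size-versus-bandwidth decision made by the scanner can be illustrated with a short sketch. This is not the actual iRODS automated-ingest event-handler API; the function name, threshold, and units are invented for illustration only.

```python
# Illustrative sketch only: the real iRODS automated-ingest framework
# expresses this logic as event handlers; names and thresholds here
# are hypothetical.

def ingest_action(size_bytes: int, link_mbps: float,
                  max_copy_seconds: float = 60.0) -> str:
    """Decide whether a newly scanned file should be registered in
    place or copied to a central resource, based on how long the
    transfer would take at the available bandwidth."""
    copy_seconds = (size_bytes * 8) / (link_mbps * 1_000_000)
    return "register_in_place" if copy_seconds > max_copy_seconds else "replicate"

# A 2 TB instrument file over a 1 Gb/s link would take hours to copy:
# register it in place and stage it only when a reduction job needs it.
print(ingest_action(2 * 10**12, 1000))   # register_in_place
# A 100 MB file on the same link copies in under a second: centralize it.
print(ingest_action(100 * 10**6, 1000))  # replicate
```

The same decision point is where metadata extraction would be attached, so that every file entering the "landing zone" becomes discoverable regardless of where its bytes reside.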

The process of data ingestion is itself a workflow dedicated to both data organization and description, yielding a registered and discoverable tier of storage. Subsequent processing is metadata-driven, based on rules written to enforce the chosen policies. For example, Fourier planes of radio astronomy data with similar characteristics, collected over time, can be automatically migrated to a parallel file system to enable a compute process for astronomical event analysis. The required files would be selected based on attributes described in the rich metadata extracted during registration. Data with different characteristics may instead be migrated to an archive reserved for later analysis. The new iRODS function of metadata-driven storage tiering facilitates the efficient use of storage resources in eResearch. This function is unique to iRODS and is dynamic, allowing data mobility decisions to be made in real time based on user-defined metadata attributes or harvested metadata such as machine availability, storage migration bandwidth, attributes whose priority may change as new data is ingested, and the value and framework of the storage resources. Virtually any attribute of a file or system element can be evaluated in a real-time decision tree to enable efficient data analysis. All of these processes can operate in parallel, limited only by the number of nodes assigned to them. They can also operate across wide physical distances or organizational boundaries using iRODS federation, enabling eResearch collaboration across continents or around the world.
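The attribute-matching step behind metadata-driven tiering can be sketched as a filter over a catalog of objects. The attribute names ("experiment", "priority") and file names below are invented for illustration and do not correspond to any fixed iRODS schema.

```python
# Sketch of a metadata-driven staging decision. In iRODS the catalog
# query and data movement are performed by the rule engine; this only
# illustrates the selection logic.

def select_for_scratch(objects: dict, wanted: dict) -> list:
    """Return the objects whose metadata matches every wanted
    attribute-value pair, i.e. the set a tiering rule would stage
    onto the scratch file system for analysis."""
    return [name for name, meta in objects.items()
            if all(meta.get(k) == v for k, v in wanted.items())]

catalog = {
    "run_001.fastq": {"experiment": "cohort_a", "priority": "high"},
    "run_002.fastq": {"experiment": "cohort_a", "priority": "low"},
    "run_003.fastq": {"experiment": "cohort_b", "priority": "high"},
}
print(select_for_scratch(catalog, {"experiment": "cohort_a",
                                   "priority": "high"}))
# ['run_001.fastq']
```

Because the selection is driven entirely by metadata rather than by file age or path, the same mechanism serves analysis staging, archiving, and any other tier transition a policy defines.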

Finally, the entire process of data distribution to the scientific community can be managed by iRODS. Data provenance is assured since every step of the workflow, from ingestion through publication, can be tracked and audited. Multiple layers of attribute assignment over the process steps ensure that the published data is fully discoverable, and metadata-driven access controls assure regulatory compliance.

Modern eResearch implies the analysis and dissemination of data at a scale that is growing exponentially. The use of open source iRODS automation to enable a policy-based, rules-driven workflow can simplify the entire data lifecycle while allowing full traceability, reproducibility, and the flexibility to accommodate future policy changes.


Dave Fellinger is a Data Management Technologist with the iRODS Consortium. He has over three decades of engineering and research experience including film systems, video processing devices, ASIC design and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems. As Chief Scientist of DataDirect Networks, Inc. he focused on building an intellectual property portfolio and presenting the technology of the company at conferences with a storage focus worldwide.

In his role at the iRODS Consortium, Dave is working with users in research sites and high performance computer centers to confirm that a broad range of use cases can be fully addressed by the iRODS feature set. He helped to launch the iRODS Consortium and was a member of the founding board.

He attended Carnegie Mellon University and holds patents in diverse areas of technology.
