How to Improve Fault Tolerance on HPC Systems?

Dr Ahmed Shamsul Arefin1

1Scientific Computing, Information Management & Technology, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Canberra, Australia


HPC systems of today are complex systems made from hardware and software that were not necessarily designed to work together as one complete system. Therefore, in addition to regular maintenance-related downtimes, hardware errors such as voltage fluctuation, temperature variation, electric breakdown and manufacturing defects, as well as software malfunctions, are common in supercomputers. There are several ways to achieve some fault tolerance, which include (but are not limited to): checkpoint/restart, programming model-based fault tolerance and algorithm-based theoretical solutions, but in real life none of these cover all the situations mentioned above [1]. In this work, we have reviewed a few of the checkpoint/restart methods and we list our experience with them, including two practical solutions to the problem.


On a small cluster consisting of only two nodes (master and compute) constructed using Bright Cluster Manager (BCM) ver. 7.3, we deployed a couple of existing methods (see below) that can support operating system and software level fault tolerance. Our experience is described below. In our model cluster, each node had 16 CPU cores in 2 x Intel Xeon CPU E5-2650 0 @ 2.00GHz, min 64GB RAM, a 500GB local HDD, and Ethernet/InfiniBand connections for networking purposes. Slurm ver. 16.05 was installed as a part of BCM's provisioning on both nodes, along with SLES 12 SP1 as the operating system. The master was set to act as both head and login node as well as the job scheduler node.


Our first test candidate was BLCR (Berkeley Lab Checkpoint/Restart). It allows programs running on Linux to be checkpointed, i.e., written entirely to a file, and then later restarted. BLCR can be useful for jobs killed unexpectedly due to power outages or exceeding run limits. It requires neither instrumentation nor modification of user source code, nor recompilation. We were able to install it on our test cluster and successfully checkpoint a basic Slurm job consisting of only one statement: sleep 99. However, we faced the following difficulties during its pre- and post-installation:

  • The BLCR website [2] lists an obsolete version of the software (ver. 0.8.5, last updated on 29 Jan 2013) which does not work with newer kernels such as the one shipped with SLES 12 SP1. We had to obtain a newer version along with a software patch from the BLCR team to get it installed on the test cluster. Slurm had to be recompiled and reconfigured several times following the BLCR installation, requiring ticket interactions with the Slurm, BCM and BLCR teams. Unfortunately, in the end the latest BLCR installation process did not work on SLES 12 SP2, again due to a newer kernel version (>4.4.21).
  • MPI incompatibility: BLCR, when installed along with Slurm, can only work with MVAPICH2 and does not support the InfiniBand network [2]. This MPI variant was not available in our HPC apps directory and therefore could not be tested.
  • Not available to interactive jobs: The BLCR + Slurm combination could not checkpoint/restart interactive jobs properly.
  • Not available to GPU jobs: The BLCR did not support GPU or Xeon Phi jobs.
  • Not available to licensed software: Talking to a license server after a checkpoint/restart did not work.

Our next candidate was DMTCP (Distributed MultiThreaded CheckPointing). This application-level tool claims to transparently checkpoint single-host or distributed MPI computations in user space, with no modifications to user code or to the O/S. As of today, Slurm cannot be integrated with DMTCP [2,4]. It was also noted that after a checkpoint/restart, computation results can become inconsistent when compared against the non-checkpointed output, as validated by a different CSIRO-IMT (User Services) team member.

Our next candidate was CRIU (Checkpoint/Restore In Userspace). It claims to freeze a running application, or at least a part of it, and checkpoint it to persistent storage as a collection of files. It works in user space (application level), rather than in the kernel. However, it provides no Slurm, MPI or InfiniBand support as of today [2,5]. We therefore have not attempted it on the test cluster.


After reviewing some of the publicly available tools that could potentially support HPC fault tolerance, we decided to propose the following practical solutions to the CSIRO HPC users.


Application-level checkpointing: In this method, the user's application (based on [1]) is set up to explicitly read and write its own checkpoints. Only the data needed for recovery is written, and checkpoints need to be taken at “good” times. However, this can result in a higher time overhead, such as several minutes to perform a single checkpoint on larger jobs (an increase in program execution time), as well as user-level development time. Therefore, users may need further support when applying this solution.
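As a minimal sketch of the idea (not the method of [1]; the file name, state layout and loop are entirely hypothetical), an application-level checkpoint writes only the data needed for recovery and does so atomically, so a crash mid-write never destroys the last good checkpoint:

```python
import json, os, tempfile

CKPT = "checkpoint.json"  # hypothetical checkpoint file name

def save_checkpoint(state, path=CKPT):
    # Write atomically: dump to a temp file, then rename over the old
    # checkpoint so a crash mid-write cannot corrupt the last good copy.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT):
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}

state = load_checkpoint()
for step in range(state["step"], 1000):
    state["total"] += step          # stands in for the real computation
    state["step"] = step + 1
    if state["step"] % 100 == 0:    # checkpoint only at "good" times
        save_checkpoint(state)
save_checkpoint(state)
```

If the job is killed and resubmitted, `load_checkpoint` returns the last saved state and the loop continues from that step instead of from zero; only the two values needed for recovery are ever written.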


Slurm supports job preemption, the act of stopping one or more “low-priority” jobs to let a “high-priority” job run. When a high-priority job has been allocated resources that have already been allocated to one or more low-priority jobs, the low-priority job(s) are preempted (suspended). The low-priority job(s) can resume once the high-priority job completes. Alternatively, the low-priority job(s) are killed, requeued and restarted using other resources. To validate this idea, we deployed a new “killable job” partition implementing a “kill and requeue” policy. This solution was successfully tested by killing low-priority jobs, and it can be further tuned to automatically requeue lost jobs following a disaster, resulting in improved fault tolerance on our HPC systems.
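As an illustration, a partition-priority preemption setup of this kind might look like the following slurm.conf fragment (partition and node names are hypothetical; our actual configuration may differ):

```ini
# slurm.conf (fragment) - hypothetical partition and node names
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Low-priority "killable" partition: jobs here may be killed and requeued
PartitionName=killable Nodes=node[01-02] PriorityTier=1  PreemptMode=REQUEUE Default=NO
# Normal partition: its jobs preempt anything running in "killable"
PartitionName=defq     Nodes=node[01-02] PriorityTier=10 Default=YES
```

Jobs submitted to the low-priority partition should be requeueable (e.g. submitted with `sbatch --requeue`) so that a killed job is restarted rather than lost.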


Using a test cluster, we investigated and evaluated some of the existing methods to provide better fault tolerance on our HPC systems. Our observations suggest that program-level checkpointing would be the best way to safeguard a running code/job, but at the cost of development time. However, we have not yet attempted a few other methods that could potentially provide an alternative solution, such as SCR (Scalable Checkpoint/Restart for MPI), FTI (Fault Tolerance Interface for hybrid systems), Docker, Charm++ and so on, which remain as our future work.


[1] DeBardeleben et al., Practical Fault Tolerance on Today’s HPC Systems, SC15 Tutorial, 2015.

[2] Hargrove, P. H., & Duell, J. C. (2006). Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.

[3] Rodríguez-Pascual, M. et al. Checkpoint/restart in Slurm: current status and new developments, SLUG 2016

[4] Ansel, J., Arya, K., & Cooperman, G. (2009). DMTCP: Transparent checkpointing for cluster computations and the desktop. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009) (pp. 1-12). IEEE.

[5] Emelyanov, P. CRIU: Checkpoint/Restore In Userspace, July 2011.


Dr Ahmed Arefin works as an IT Advisor for HPC Systems at Scientific Computing, IM&T, CSIRO. In the past, he was a Specialist Engineer for the HPC systems at the University of Southern Queensland. He completed his PhD and postdoc in the area of HPC & parallel data mining at the University of Newcastle and has published articles in PLOS ONE and Springer journals and IEEE-sponsored conference proceedings. His primary research interest focuses on the application of high performance computing in data mining, graphs/networks and visualization.

ASCI, a Service Catalog for Docker

Mr John Marcou1, Robbie Clarken1, Ron Bosworth1, Andreas Moll1

1Australian Synchrotron, Melbourne, Victoria, Australia



The Australian Synchrotron Computing Infrastructure (ASCI) is a platform that gives users easy access to a remote desktop to process their experiment data. Every experiment station, or beamline, can start and connect to its own desktop environment with a web browser, find its specific processing applications, and access its experiment data.

ASCI acts as a Service Catalog for Docker. It is used to start remote interactive desktops or processing services which run inside Docker containers.


  • Remote Desktop instances
  • Web-browser based
  • User home folders
  • Desktop sharing feature between users
  • CUDA and OpenGL support on NVIDIA GPUs
  • Run multiple sessions on multiple nodes in parallel
  • Label-based scheduler to distribute the load across the cluster


ASCI is a stateless micro-services solution. Every component runs within a Docker container. The infrastructure is based on the CoreOS Container Linux operating system, which provides the main tooling to run containerized applications. This operating system is deployed using Terraform, an infrastructure management tool that supports multiple providers and allows automated machine deployment. With ASCI, we use Terraform manifests to deploy CoreOS Container Linux to virtual machines on a VMware vSphere cluster, or as a stateless operating system on bare metal using CoreOS Matchbox.


The ASCI Control Plane provides two main components:

  • the ASCI Web UI contacts the ASCI API to interact with ASCI and start Desktops
  • the ASCI API contacts the Docker API on the compute nodes to schedule and manage new ASCI instances

All the proxy elements are used to route requests within the infrastructure in order to reach the Web UI or a specific desktop. The ASCI Admin UI is a convenient way to customize the scheduler and list the running ASCI instances.


The user connects to a web interface which lists the environments available for creation. When a desktop is requested, the ASCI API schedules the Docker container on the cluster. When the user connects to a desktop, the ASCI API generates a one-time password which is delivered to the user's web browser to establish a VNC connection over WebSockets (noVNC).

The desktop instance runs in a Docker container. A base image is built with standard tools and libraries shared by every environment, such as the NVIDIA, CUDA and VirtualGL libraries, and the graphical environment (MATE).

This image is used as the parent for every child environment, which provides specialised scientific applications.

Users can store their documents in their home folder. The sharing feature allows users to share their desktop with others.



Figure: ASCI architecture



Supporting OpenGL is complex since the Xorg implementation doesn't allow multiple desktops attached to the same GPU to process GLX instructions. A solution is the VirtualGL approach. Under this system there is a single GPU-attached Xorg server, called 3DX, and multiple non-GPU desktops, called 2DX. When an application started on a 2DX desktop needs to process a GLX instruction, the VirtualGL library catches the instruction and forwards it to the 3DX server for processing on the GPU.

Figure: VirtualGL architecture


ASCI relies on the following infrastructure services:

  • DNSmasq provides DNS resolution for the ASCI DNS sub-domain
  • BitBucket is the Git repository manager used in this environment
  • ASCI delivered applications are built as RPM packages which are stored on a local YUM Repository
  • Autobuild is an application which builds Docker images on new commit events and pushes them to the Docker Registry
  • Docker Registry stores the Docker images. These images are downloaded as needed by the Docker hosts
  • Matchbox is the PXE server that provides boot-on-LAN. It is configurable via an API. This system is used to boot ASCI workers over the network

 The monitoring solution is built with these components:

  • JournalBeat and MetricBeat run on every monitored system and collect logs and metrics to send to the logging database
  • HeartBeat monitors a list of services and reports their state in the logging database
  • ElasticSearch is a search engine used to store and index logs and metrics
  • Kibana is a Web interface for ElasticSearch
  • ElastAlert, the alert manager, watches logs and metrics to trigger alerts and notifications based on rules



I work at the Australian Synchrotron as a DevOps/HPC engineer.

HPC at The University of Sydney: balancing user experience with system utilisation

Dr Stephen Kolmann1

1The University Of Sydney, Sydney, Australia

The University of Sydney’s High Performance Computing cluster, called Artemis, first came online with 1344 cores of standard compute, 48 cores with high memory and 5 nodes with 2 K40 GPUs each. These resources were made available as one large resource pool, shared by all users, with job priority determined by PBS Professional’s fairshare algorithm [1]. This, coupled with three-month maximum walltimes, led to high system utilisation. However, wait times were long, even for small, short jobs, resulting in a sub-optimal end-user experience.

To help cater for strong demand and improve the end user experience, we expanded Artemis to 4264 cores. We knew this expansion would help lower wait times, but we did not rely on this alone. In collaboration with our managed service provider (Dell Managed Services at the time, now NTT Data Services), we designed a queue structure that still caters for a heterogeneous workload, but lowers wait times for small jobs, at the expense of some system utilisation. Figure 1 shows how we partitioned compute resources on Artemis to achieve this balance.

Figure 1: Distribution of Artemis’s compute cores. The left pie chart shows the coarse division of all Artemis’s compute cores, and the right pie chart shows the nominal distribution of compute cores within the shared area of the left pie chart.

The cores are divided into three broad categories: condominiums, strategic allocations and shared cores.

  1. Condominiums are compute nodes that we manage on behalf of condominium owners
  2. Strategic allocations are dedicated to research groups who won access via a competitive application process
  3. Shared cores are available to any Sydney University researcher who wants Artemis access

The shared cores are further sub-divided into separate resource pools that cater for different sized jobs. This division was made to segregate small, short jobs from large, long-running jobs. The idea behind this partitioning is that short, small jobs should start quickly, but larger, longer-running jobs should be willing to tolerate longer wait times.

This poster will explore our experience with this queue structure and how it has impacted metrics such as job wait times, system utilisation and researcher adoption.


1. PBS Professional Administrators Guide, p. 165. Available from:, accessed 1 Sep 2017.


Stephen Kolmann is currently working at The University of Sydney as an HPC Specialist, where he provides end-user HPC documentation and training and acts as a consultant for internal HPC-related projects. He completed a PhD in Computational Chemistry, where he made extensive use of HPC facilities both at The University of Sydney and NCI.

Challenges of Real-time Processing in HPC Environments: The ASKAP Experience

Mr Eric Bastholm1, Mr Juan Carlos Guzman1

1CSIRO, Kensington, Australia


CSIRO Astronomy and Space Science operates the Australian Square Kilometre Array Pathfinder (ASKAP), a state-of-the-art radio telescope located at the Murchison Radio-astronomy Observatory (MRO) in Western Australia. The data rates produced are significant enough that it is necessary to process the data in real time in a multi-user high performance computing (HPC) environment. This presents a number of challenges with respect to resource availability and the performance variability resulting from interactions of many parallel software and hardware components. These issues include disk performance and capacity, inter-node and intra-node parallel communication overheads, and process isolation.

Typically, supercomputing facilities are used for batch processing of large established data sources. Small intermittent processing delays resulting from temporary resource contention issues do not affect these processes as the total compute time is the dominating factor and small wait times are insignificant. However, for real-time processing of high data ingest rates these small and unpredictable contentions for resources can be problematic. In our case, even delays of seconds can have a negative effect on our ability to reliably ingest the data.

We have learned much from addressing these challenges, and our experience and solutions will provide valuable input to the radio astronomy community, especially for larger projects, like the Square Kilometre Array, where the same challenges will present at even larger scales.

Disk Performance

A key component in the system is the high-performance, high-capacity Lustre file system. Our findings suggest that even though the hardware can be shown to perform at the required specification (10 GB/s sustained read/write), in practice other factors come into play which can lower that rate significantly or introduce a higher variance in the rate, lowering its average. These include the number of active writers, the I/O patterns used, reliance on external code libraries with unknown I/O behaviours, and the remaining capacity of the disk.

Inter-process Performance

To ingest and process high data rates it is necessary to parallelise the implementation. We have found that there are complicated issues surrounding process interaction which cause performance degradation in memory access, inter-process communication and threading. This manifests as dead regions in the processing stream which can result in lost data, errors, or a reduction in output quality.

Process Isolation

Our experience indicates that it is important to isolate key processes from the rest of the system as much as possible, especially those processes responsible for data ingest. In principle this seems fairly straightforward, but in practice it is non-trivial because the HPC environment is a complex mesh of devices, software, and networks which essentially prevents total isolation of any process.


Eric Bastholm

Team Leader ASKAP Science Data Processor

Leads a team of software developers with backgrounds in astronomy, software engineering and data science whose purpose is to develop and test the data reduction software pipelines for the Australian Square Kilometre Array Pathfinder (ASKAP) radio telescope managed by CSIRO.

Teaching High Throughput Computing: An International Comparison of Andragogical Techniques

Mr Lev Lafayette1

1University Of Melbourne, Parkville, Australia


The importance of High Throughput Computing (HTC), whether through high-performance or cloud-enabled systems, is a critical issue for research institutions, as data volumes are increasing at a rate greater than the capacity of user systems [1]. As a result, nascent evidence suggests higher research output from institutions that provide access to HTC facilities. However, the necessary skills to operate HTC systems are lacking in the very research communities that would benefit from them. Apart from reducing research output, this also places additional unnecessary pressure on system administrators, who need to respond to uneducated user requests.

With empirical data spanning several years correlating training and utilization from the Victorian Partnership for Advanced Computing (VPAC), the University of Melbourne, and Goethe University Frankfurt am Main, theoretical and practical issues are raised. As advanced adult learners, the postgraduate researchers involved in the course offerings from these institutions should be an ideal fit for andragogical teaching techniques [2]. However, exploring the practices and course offerings of the listed institutions indicates that a combination of experience and cultural factors reinforces the notion of a continuum between pedagogical and andragogical techniques, and also raises the question of whether there are correlative stages for the components within this continuum or whether they are independent.

The results of this international investigation, with its variant teaching methods, classes, and results, provide opportunities for other institutions to initiate or review their own training programmes and create the conditions for dynamic HTC user communities and increased informed utilization of their systems.


  1. Martin Hilbert, Priscila López. The World’s Technological Capacity to Store, Communicate, and Compute Information, Science Vol. 332 no. 6025 1 April 2011, pp. 60-65.
  2. Malcolm Shepherd Knowles, Elwood F. Holton, Richard A. Swanson, The Adult Learner: The Definitive Classic in Adult Education and Human Resource Development, Gulf Publishing Company, 1998


Lev Lafayette is the HPC Support and Training Officer at the University of Melbourne. Prior to that, he worked for several years at the Victorian Partnership for Advanced Computing in a similar role. An active advocate of HPC and Linux, he has presented at numerous conferences and user groups on these subjects over the past decade.

Maximising Research Potential with Multipurpose Hardware

Miss Sara Drieman1

1eRSA, Thebarton, Australia


eResearch hardware is typically purchased with academic grant funding, resulting in large-scale solutions using high-end hardware at current market prices. Due to the periodic nature of the funding and the continuous advances in technology, solutions purchased today can often appear archaic in even a year’s time. Without the funding to upgrade, this tends to lead to hardware solutions exceeding warranty and falling short of researcher needs towards the end of the lifecycle. In addition, funding is increasingly uncertain and grants may not be available to replace end-of-life infrastructure.

This was the cycle eRSA needed to break out of in order to accommodate ever-changing researcher needs. With new hardware required for both HPC and Cloud infrastructure, the problem had to be approached uniquely in order to provide both. Fortunately, the solution was made possible with hardware leasing and by selecting hardware to meet both needs based on demand. Rather than taking the traditional “build it and they will come” approach and focusing on a single service capability, we were able to create a multipurpose cluster with 8 nodes being used for Cloud and 12 for HPC.

As this solution is easily scalable, the cluster can grow to meet demand, rather than being a large underutilised platform. This provides the flexibility of increasing each service as required, as well as meeting researchers' needs as they evolve. By leasing the hardware from the vendor, we are able to stay ahead of the hardware lifecycle, while ensuring capabilities progress as the technology does.

Of course, constant change of hardware components can lead to a high overhead in system administration. This challenge is handled with configuration management tools, ensuring new nodes need minimal human input to join the cluster.

This presentation details how we configured the multipurpose cluster, an overview of its technical specifications, how we overcame the challenges of increased overheads, and other cost-saving aspects we adopted; all to ensure we provide our users with a service that allows them to do what they do best, leaving the technical know-how to us.



Sara Drieman is the Operations Team Leader at eResearch SA, where she works with researchers as well as commercial users. Having an extensive background in both IT Support and System Administration roles, Sara takes pride in helping users adopt technologies as well as improving services for her team and eRSA clients.

ASCI: An interactive desktop computing platform for researchers at the Australian Synchrotron

Mr Robbie Clarken1, Mr John Marcou1, Mr Ron Bosworth1, Dr Andreas Moll1

1Australian Synchrotron, Melbourne, Australia



The volume and quality of scientific data produced at the Australian Synchrotron continues to grow rapidly due to advancements in detectors, motion control and automation. This means it is critical that researchers have access to computing infrastructure that enables them to efficiently process and extract insight from their data. To facilitate this, we have developed a compute platform to enable researchers to analyse their data in real time while at the beamline as well as post-experiment by logging in remotely. This system, named ASCI, provides a convenient web-based interface to launch Linux desktops running inside Docker containers on high-performance compute hardware. Each session has the user’s data mounted and is preconfigured with the software required for their experiment.


ASCI consists of a cluster of high performance compute nodes and a number of supporting applications. These include an application for launching and managing instances (asci-api), a web interface (asci-webui) and a proxy for relaying connections (asci-proxy).


Figure 1: Sequence for creating an ASCI desktop session

Users connect to ASCI by logging in to the web interface and selecting an environment appropriate for processing their data. The web UI sends a request for a new instance of that environment type to the asci-api. The asci-api selects the best compute node to launch the instance on based upon the requirements of the environment and the load on the cluster. Once it has picked a node, the asci-api launches a Docker container based upon the requested environment. The user is then presented with an icon representing the running session and they can connect to this desktop from their web browser.
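The actual asci-api selection logic is not reproduced here, but label-based, least-loaded scheduling of this kind can be sketched as follows (node names, labels and load figures are invented for illustration):

```python
# Minimal sketch of label-based, least-loaded node selection, assuming
# each node advertises a set of labels and a count of running instances.
nodes = [
    {"name": "node1", "labels": {"gpu"},       "running": 3},
    {"name": "node2", "labels": {"gpu", "mx"}, "running": 1},
    {"name": "node3", "labels": set(),         "running": 0},
]

def pick_node(nodes, required_labels):
    # Keep only nodes that carry every label the environment requires...
    eligible = [n for n in nodes if required_labels <= n["labels"]]
    if not eligible:
        raise RuntimeError("no node satisfies labels %s" % required_labels)
    # ...then choose the least-loaded of those.
    return min(eligible, key=lambda n: n["running"])

print(pick_node(nodes, {"gpu"})["name"])  # node2: has "gpu", lowest load
```

A GPU environment would thus land on the least-busy GPU-labelled node, while an environment with no label requirements can go anywhere on the cluster.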

When the user initiates a connection, a VNC session is created inside the Docker instance with a one-time password. This password is used to launch a noVNC connection in the user’s browser, and the user is presented with their desktop and can commence analysing their data.
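The one-time-password step can be sketched with Python's standard library (how ASCI actually generates, stores and validates the password is not shown here; the in-memory store and session IDs are assumptions):

```python
import secrets

# Hypothetical in-memory store of passwords still valid for one use.
_pending = {}

def issue_otp(session_id):
    # Generate an unguessable token and remember it for this session.
    token = secrets.token_urlsafe(32)
    _pending[session_id] = token
    return token

def redeem_otp(session_id, token):
    # Valid only if it matches AND has not been used before:
    # pop() removes the stored token, so a replayed token is rejected.
    expected = _pending.pop(session_id, None)
    return expected is not None and secrets.compare_digest(expected, token)

t = issue_otp("desktop-42")
assert redeem_otp("desktop-42", t) is True    # first use succeeds
assert redeem_otp("desktop-42", t) is False   # replay is rejected
```

Because each token is consumed on first use, a password intercepted after the VNC connection is established cannot be used to open a second session.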


Docker containers are a technology for creating isolated process environments on Linux. We chose this technology for the ASCI user environments because containers deliver almost identical performance compared with running on the bare-metal operating system, while enabling multiple users to use the node simultaneously. Docker also enables us to create predefined environments, tailored with the applications required for different types of experiments. Each environment is based on a Docker image which is defined by a text file outlining how to prepare the desktop. These image recipes support inheritance, enabling us to have a base image which installs the ASCI infrastructure applications and then child images with the specialised scientific software for the different experiments.
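A child environment's recipe might look like the following hypothetical Dockerfile (the image and package names are invented for illustration):

```dockerfile
# Inherit the shared base image, which already carries the desktop
# environment, the NVIDIA/CUDA/VirtualGL libraries and the ASCI
# infrastructure applications...
FROM asci-base:latest

# ...then layer on the specialised scientific software for one
# type of experiment.
RUN yum install -y beamline-analysis-suite && yum clean all
```

Rebuilding the base image then propagates shared fixes to every child environment without touching the experiment-specific layers.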


An early goal of the ASCI project was to allow users to connect to desktop sessions through their standard browser rather than requiring them to run a specialised application. This greatly lowers the barrier to entry and allows users to access the system from any operating system, including tablets and mobiles.

We built the web interface using Flask for the server and React for generating the front-end. For rendering the desktops in the browser, we utilise NoVNC which delivers a VNC connection over WebSockets. This results in a responsive interface that runs on all platforms, including mobile and tablet.


In order to provide GPU hardware acceleration to multiple ASCI instances on one node, we need to use a modification of the traditional X architecture. Usually, the X server has direct access to the GPU hardware, which allows graphical applications to execute OpenGL instructions. The challenge when running multiple desktops on a single node is that multiple X servers cannot share direct access to the same GPU hardware.

To address this, we run a single X server directly on the node; this is known as the 3DX server. Every ASCI instance then runs its own internal 2DX server which handles graphical applications, such as the desktop environment. When applications make use of the GPU, we launch them with special environment variables which cause them to load the VirtualGL libraries in place of the standard OpenGL libraries. The VirtualGL libraries catch and forward all OpenGL instructions to the 3DX server, which then executes them on the GPU.
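The environment-variable mechanism can be sketched as follows (the interposer library path and the display number of the 3DX server are assumptions, not ASCI's actual values):

```python
import os

def vgl_env(display=":0", interposer="/usr/lib64/libvglfaker.so"):
    # Copy the current environment and add the variables that make the
    # dynamic linker preload the VirtualGL interposer, redirecting the
    # application's OpenGL calls to the GPU-attached 3DX server.
    env = dict(os.environ)
    env["LD_PRELOAD"] = interposer   # hypothetical interposer path
    env["VGL_DISPLAY"] = display     # the 3DX (GPU-attached) display
    return env

env = vgl_env()
# A launcher would then start the application with this environment,
# e.g. subprocess.Popen(["some-gl-app"], env=env), so its GLX calls
# are forwarded to the 3DX server.
```

In practice VirtualGL ships its own wrapper for this job, so the sketch only illustrates why preloading works: the application links against the interposer first, which proxies GLX to the single GPU-attached server.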


Every component of the ASCI system runs inside its own Docker container. This enables us to precisely define the environment of each application, such as the operating system and dependencies, and to easily reproduce the applications on different hosts. It also means that when a developer tests an application on their local machine, they are doing so in the same environment as it will run in production. To facilitate deploying updates, we created an application called Autobuild which receives notifications from our Bitbucket server whenever a tag is added to an application’s git repository. When Autobuild sees a new tag, it clones the code from Bitbucket and uses Docker to build an image for the application based on a Dockerfile in the repository. The built image is then pushed to our internal Docker registry, ready for deployment.


To monitor ASCI we use a collection of open source tools known as the Elastic Stack. This includes a database, Elasticsearch, for capturing logs and metrics, and a front-end website, Kibana, for viewing logs and creating dashboards. To harvest logs, we have the applications inside the Docker containers log to standard out and configure Docker to forward the logs to journald. A utility called Journalbeat then collects the logs and sends them to an Elasticsearch pipeline based on the source of the log. The pipeline parses the log and ingests the output into Elasticsearch. For alerting, we have an application called ElastAlert monitor the Elasticsearch database and trigger a Slack notification based on certain rules. This enables us to be instantly alerted whenever an error occurs, or in the case of unusual user behaviour on the website which may be indicative of an attack on the system.


ASCI is now in use at the Australian Synchrotron for processing data being produced at the Medical Imaging, X-ray Fluorescence Microscopy and Micro-crystallography beamlines. The simple web interface and tailored environments provide an easy and intuitive platform for users to process their data and the automated build systems allow fast and painless deployment of updates. Future upgrades to the system will include supporting alternative interfaces to the environments, such as Jupyter Notebooks, and integrating a batch job submission system to distribute processing tasks across multiple nodes.


Robbie Clarken is a Scientific Software Engineer at the Australian Synchrotron. Robbie has a BSc in Nanotechnology from Flinders University and in previous roles at the Synchrotron has been a Particle Accelerator Operator and Robotics Controls Engineer. Currently he helps researchers at the Synchrotron extract insight from their data by developing data processing systems.

Automating the Data Workflow and Distribution in Research Computing

Mr David Fellinger1

1iRODS Consortium, Chapel Hill, United States


In the early days of high performance research computing, much of the work consisted of iterative mathematics and matrix resolution, complex system simulation, and visualization. In the US, machines like ASCI Blue Pacific at Lawrence Livermore National Laboratory completely ended the need for nuclear testing, and ASCI Red at Sandia National Laboratories was used to visualize the core of the Sun, giving researchers insights into a large-scale fusion reaction. Many of these jobs ran for multiple days, so scheduling could easily be done with a relatively small number of resources. Intermediate results were often written out to “scratch” file systems as checkpoints so that a job could be restarted in case of a crash or partial cluster failure. Of course, restarts were also always handled manually.

The inception of parallel file systems represented a major innovation in high performance computing. Storage systems could be designed as parallel clusters directly coupled with compute clusters, utilizing efficient network protocols and infrastructures based on Fibre Channel and InfiniBand. Input/output (I/O) bandwidth could be scaled by increasing the number of elements in the storage cluster under the file system's management. The parallel file system Lustre was developed starting in 2001, with deployments in 2002, as an open source project initially funded by a government agency with an interest in compute efficiency. This file system found immediate acceptance in both government and academic research sites worldwide, since far less time was required to move large amounts of data to or from compute clusters. Innovations in both storage systems and networking allowed I/O bandwidths to grow from gigabytes per second to terabytes per second in just a few years, with compute clusters in excess of 100,000 cores and storage systems in the tens of petabytes.

As I/O systems and compute clusters were evolving, there was also a concurrent evolution in the types of jobs that were run on these efficient machines. Besides simulation and visualization, many machines were used for data analysis and reduction. Large data sets from physical devices like sensors, microscopes, genome sequencers, telescopes, and radio telescope arrays were placed on the parallel file systems and ingested into machine cache to achieve the required goals of the run. Data management now required dedicated staff to manage the data moved into a site for reduction. Interesting use cases evolved, like the use of an IBM Blue Gene machine at ASTRON in the Netherlands to execute fast Fourier transforms on radio telescope data in real time upon ingestion. Experiments on instruments like the Large Hadron Collider at CERN generated petabytes of data which had to be reduced by high performance computation to be of use. Through this evolution, tools were developed to schedule jobs on all or portions of a cluster, reporting tools were developed to check job placements and optimize the schedulers, and reporting functions were automated so that systems administrators could understand the cluster resources engaged in executing a job. The migration of data to and from the cluster, however, remained a largely manual task, requiring human intervention to control the data flow to and from the “scratch” file system used for each compute job.


The Integrated Rule-Oriented Data System (iRODS) was introduced as open source middleware to enable the establishment of research collections. Large repositories such as EUDAT in the European Union and the large genomic sequence repository at the Wellcome Trust Sanger Institute in the UK are maintained under iRODS management. Collection maintenance can be automated, from metadata extraction upon ingestion through the application of enforceable access controls and automated distribution to subscribers. Recently, the iRODS Consortium has launched a program to closely tie iRODS with parallel file systems and an interface to Lustre was the first to be written.

Lustre is an asymmetrical file system with a Metadata Server (MDS) retaining a custom database that relates file system names, or V-nodes, with reference designations, or I-nodes, and storage locations called Object Storage Targets (OSTs). The MDS also handles permissions, and every file transfer must start with an MDS query. The file system is extremely scalable, since the intensive extent-list construction and maintenance is done in multiple parallel OSTs while the entire file system structure is tracked by the efficient MDS process. The iRODS interface consists of a service node operating as an additional Lustre client, constantly monitoring the MDS through a changelog, which is a service of the MDS. iRODS is not involved in any transactions relating to data transferred to or from the compute cluster, so the performance of the file system is not compromised by the addition of a data management layer.
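The changelog-driven design can be illustrated with a small sketch. Assuming a much-simplified record format (real Lustre changelog entries carry FIDs, flags and more fields), a service node might map file-system events to catalogue updates roughly like this:

```python
# Hypothetical, simplified changelog consumer: parse records and decide which
# catalogue actions to queue. Record layout and action names are assumptions
# for illustration, not the real Lustre changelog format or iRODS API.
def parse_record(line):
    """Parse 'index optype timestamp name' into a dict."""
    index, optype, timestamp, name = line.split(maxsplit=3)
    return {"index": int(index), "op": optype, "time": timestamp, "name": name}

def actions_for(records, ops_of_interest=("CREAT", "CLOSE", "UNLNK")):
    """Map file-system events to catalogue update actions."""
    actions = []
    for rec in records:
        if rec["op"] in ops_of_interest:
            verb = "unregister" if rec["op"] == "UNLNK" else "register"
            actions.append((verb, rec["name"]))
    return actions

records = [parse_record(l) for l in [
    "101 CREAT 10:10:00 results/run1.h5",
    "102 CLOSE 10:12:31 results/run1.h5",
    "103 UNLNK 10:15:07 scratch/tmp.dat",
]]
todo = actions_for(records)
```

Because only the changelog is read, the consumer never sits in the data path, which is what keeps the file system's I/O performance unaffected.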

Interestingly, iRODS is also asymmetrical and scalable, in that the database, called the iCAT, which catalogs and maintains data placement information, can be a separate element, with groups of servers actually executing data movement. The iRODS interface with Lustre allows the construction of, effectively, a parallel database such that the Lustre MDS and the iRODS iCAT are consistent. The iCAT may, however, contain additional metadata which would not be practical or required in the MDS.

All operations within iRODS are driven by policies consisting of specific rules custom to each deployment. These rules are executed at Policy Enforcement Points (PEPs) as data collections are constructed and manipulated.
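As a rough illustration of rule-driven policy, the following sketch shows the kind of decision a PEP might enforce on ingest. It is plain Python for exposition, not the actual iRODS rule language or rule-engine API, and the field names, resource names and threshold are hypothetical:

```python
# Illustrative policy logic for a put/ingest Policy Enforcement Point:
# extract metadata attributes and route the object according to site policy.
def on_ingest(obj_path, metadata, policy):
    """Return the (resource, AVUs) decision a PEP might enforce on ingest."""
    # Attribute-value-unit triples to attach to the new data object.
    avus = [("origin_beamline", metadata.get("beamline", "unknown"))]
    # Route large raw data to an archive-backed resource per site policy.
    size = metadata.get("size", 0)
    resource = policy["archive"] if size > policy["archive_threshold"] else policy["default"]
    return resource, avus

policy = {"default": "demoResc", "archive": "tapeResc", "archive_threshold": 10**9}
resc, avus = on_ingest(
    "/zone/home/raw/scan42.h5",
    {"beamline": "XFM", "size": 5 * 10**9},
    policy,
)
```

In a real deployment the same decision would be expressed as rules fired at the relevant PEPs, and would evolve over time as the site's data lifecycle policies change.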

A workflow example might be one from radio astronomy, where specific time-domain Fourier planes of observed data must be manipulated by a compute cluster to determine changes that may have occurred over specific time intervals. These data could exist on an external file system, and a system scheduler like the Slurm Workload Manager could notify iRODS that the data is required on the “scratch” file system adjacent to the compute cluster. iRODS could then manage the movement of the data to Lustre, tracking the changes on the MDS. Upon data reduction, the output files would be written to Lustre, and iRODS could monitor this activity through the scheduler or by monitoring the predetermined V-node entries on the MDS. iRODS could then initiate the movement of the resultant data to an archive that could be utilized by researchers. These output data can be analyzed in flight by iRODS-driven processes, so that the file headers could be read to enrich the iCAT metadata with elements such as the coordinates of movement within the Fourier planes.

Collaboration can be enabled by the federation capabilities of iRODS so that multiple sites can be granted access to all or part of the collection interactively. All of this activity can be fully automated and transacted based upon the policies which can evolve over time to effect all phases of data lifecycle management from analysis to publication and finally the transition to a long term archive.

The end result is a flexible data infrastructure that is immediately available and searchable after the data reduction run. In effect, a collection is built where the compute cluster is the co-author and iRODS is the policy driven editor.

Iterative science is often driven by changes in experimental paradigms triggered by previous results. Providing results to researchers as efficiently as possible is critical to maximizing the value of any iterative experiment. iRODS can be utilized to completely automate the data management and data delivery in a research computing environment, shortening this experimental “feedback loop”.



Dave Fellinger is a Data Management Technologist with the iRODS Consortium. He has over three decades of engineering and research experience including film systems, video processing devices, ASIC design and development, GaAs semiconductor manufacture, RAID and storage systems, and file systems. As Chief Scientist of DataDirect Networks, Inc. he focused on building an intellectual property portfolio and representing the company to conferences with a storage focus worldwide.

In his role with the iRODS Consortium Dave is working with users in research sites to assure that the functions and features of iRODS enable fully automated data management through data ingestion, security, maintenance, and distribution. He serves on the External Advisory Board of the DataNet Federation Consortium and was a member of the founding board of the iRODS Consortium.

He attended Carnegie-Mellon University and holds patents in diverse areas of technology.

Cloud Infrastructures/SDN Testbeds for eScience Research

Dr Kohei Ichikawa1, Dr Philip Papadopoulos2, Professor David Abramson3, Dr. Peter Elford4, Dr. David Wilde4, Dr. Warrick Mitchell4

1 Nara Institute of Science and Technology, Ikoma, Japan,

2 University of California, San Diego, San Diego, USA,

3 The University of Queensland, Brisbane, Australia,

4 AARNet, Brisbane, Melbourne, Australia


Effective sharable cyberinfrastructure is fundamental for collaborative research in eScience. The goal of this session is to share and discuss recent research activities, technologies, ideas and issues concerning shared cyberinfrastructures for international collaborative research. As the conference is co-located with the PRAGMA workshop, the session focuses on sharing and connecting the research of both the Australian eResearch community and the PRAGMA community [1]. The PRAGMA community has been developing a sharable cyberinfrastructure, the PRAGMA testbed [2], for many years, and the community's focus has recently shifted to virtualized infrastructure, or Cloud computing. With this move to virtualized infrastructure, virtualized network technology is also becoming a major interest in the community, and PRAGMA has established an international SDN (Software-Defined Networking) testbed, PRAGMA-ENT (Experimental Network Testbed) [3, 4]. In the session, we will have several presentations on sharable cyberinfrastructure, including Cloud resources and SDN testbeds, from both the Australian and PRAGMA communities, and we will also discuss common research topics and issues on shared cyberinfrastructures for eScience research.

The session will run for 60 minutes.

Topics of interest include, but are not limited to, the following:

  1. eScience cyberinfrastructure/Testbed
  2. Grid/Cloud computing
  3. Resource scheduler/allocation/optimization
  4. Software-Defined Networking/Storage/Data Center
  5. Security of shared resources
  6. Operation/management of shared resources
  7. High availability and fault tolerance of shared resources
  8. Monitoring of shared resources
  9. Visualization infrastructure for collaborative data science
  10. eScience applications on widely distributed cyberinfrastructures


  1. PRAGMA project. Available from:, accessed in June
  2. D. Abramson, A. Lynch, H. Takemiya, Y. Tanimura, S. Date, H. Nakamura, Karpjoo Jeong, Suntae Hwang, Ji Zhu, Zhong-hua Lu, C. Amoreira, K. Baldridge, Hurng-Chun Lee, Chi-Wei Wang, Horng-Liang Shih, T. Molina, Wilfred W. Li, P. W. Arzberger, “Deploying Scientific Applications to the PRAGMA Grid Testbed: Strategies and Lessons,” Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID 06); 2006:241-248.
  3. PRAGMA-ENT project. Available from:, accessed in June
  4. Kohei Ichikawa, Pongsakorn U-chupala, Che Huang, Chawanat Nakasan, Te-Lung Liu, Jo-Yu Chang, Li-Chi Ku, Whey-Fone Tsai, Jason Haga, Hiroaki Yamanaka, Eiji Kawai, Yoshiyuki Kido, Susumu Date, Shinji Shimojo, Philip Papadopoulos, Mauricio Tsugawa, Matthew Collins, Kyuho Jeong, Renato Figueiredo, Jose Fortes, “PRAGMA-ENT: An International SDN Testbed for a Cyberinfrastructure in the Pacific Rim,” Concurrency And Computation: Practice And Experience, Wiley InterScience; 2017:e4138.


Kohei Ichikawa is an Associate Professor in the Graduate School of Information Science at Nara Institute of Science and Technology, Japan. His past research work involved the design and development of middleware and applications for Grid, virtualized infrastructures and Software Defined Network testbed. He currently works on the PRAGMA project, developing an international Software Defined Networking Testbed for use by the PRAGMA researchers.

Don’t Forget the Workflows – the Drivers of the Compute and Storage Infrastructure

Mr Jake Carroll1

1The University of Queensland, QBI, St Lucia (Brisbane), Australia,


Within the research computing sector, significant time, money and resources are spent on the architecture, engineering and design of infrastructure to undertake a task. Whether the expenditure is in building localised capability or in consuming cloud resources, the result is generally the same: a user, group or institution pays for the capability to run a task on a platform to achieve a scientifically and statistically significant output. Missing from this, however, and oftentimes left up to the end user to consider for themselves, is the question of which workflows best complement the infrastructure, and interlock with it, to provide the best performance characteristics. Workflows [1] are not novel and have an established place in scientific computing. A growing trend, and performance-driven objective, is the use of workflow-aware infrastructure and, conversely, infrastructure-aware workflows. This presentation will provide insight into the efforts of the University of Queensland's Research Computing Centre in the development, surfacing and eventual use of a data-locality, caching and workflow-aware storage infrastructure engine, known as the MeDiCI (Metropolitan Data Intensive Caching Infrastructure) project. The presentation will further illustrate a real-life example of how this infrastructure is being used to deliver a multi-site, multi-supercomputing, multi-data-locality-aware scientific outcome.


The concept of location-aware infrastructure is still relatively nascent and is the subject of significant research in industry and academia. At a macro level, in code path optimisation, a great deal of work has taken place in finding ways to lower code path latency with respect to core-to-memory and core-to-L1/L2/L3 access times, to avoid inefficiency in pipeline execution [2]. Where data caching locality has had only light treatment, however, is at the storage I/O layer. The MeDiCI architecture, and the underpinning work of Abramson et al., challenges the notion that a supercomputer should have its data in one location at all times due to the nature of the workflow used to initially generate that data. Moreover, there are circumstances where the generation of data is widely distributed and disconnected from the processing hub or sink. MeDiCI aims to tie instrument and processing together via the use of intelligent policy, autonomous filesystem movement semantics and workflow-aware migratory patterns.

Figure 1: The caching architecture touch points of the UQ campus for the MeDiCI framework

In several cases, workflows have been constructed to take advantage of advanced computing functionality locally, within the organisational units of the University of Queensland, but equally within the remote site where the majority of the larger-scale supercomputing platforms reside. One such example is the work of UQ's largest genomics research group, who have found purpose in the utility of local HPC facilities as well as remote HPC facilities for different workloads. The use of these combined facilities provides not only a greater number of resources and more scale to the group, but flexibility in determining where a workload is run, suiting specific use cases for different workload categorisations. An example of this can be found in a code known as GCTA, which is MPI-aware and benefits significantly from the use of well-coupled, low-latency RDMA constructs in the infrastructure, whereas other codes such as BayseR benefit from large shared-memory spaces. Yet another code, known as epiGPU, benefits from the use of massively scaled GPU arrays via CUDA frameworks. Using the MeDiCI fabric semantics and single transport namespace allows the data to surface where it needs to, but, importantly, only when it needs to.

The implementation and use of the AFM (Active File Management) data transport for parallel I/O movement between cache and home nodes is the key to automating data movement upon fopen() of a file or structure on the filesystem within the workflow. MeDiCI challenges the need to manually place data in more than one location, and provides an autonomous mount and data distribution methodology, from a computational standpoint, to all of the resources mentioned above, within an eventually consistent filesystem namespace. The workflows take advantage of this, as processing can occur at one computational ‘edge’ and then trigger processing at another facility for a different treatment.
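The cache-on-open behaviour can be modelled with a toy sketch: opening a file at a cache site pulls it from the home site on demand, after which access is local. This is a simplification for illustration only, not the real AFM protocol, and the class and file names are assumptions:

```python
# Toy model of AFM-style caching between a home site and a cache site.
class HomeSite:
    def __init__(self, files):
        self.files = dict(files)  # name -> contents at the home namespace

class CacheSite:
    def __init__(self, home):
        self.home = home
        self.local = {}
        self.fetches = 0  # count of transparent home-to-cache transfers

    def open(self, name):
        # First open at the cache triggers a transparent fetch from home;
        # subsequent opens are served from the local copy.
        if name not in self.local:
            self.local[name] = self.home.files[name]
            self.fetches += 1
        return self.local[name]

home = HomeSite({"genome/cohort.bed": b"variant data"})
cache = CacheSite(home)
first = cache.open("genome/cohort.bed")   # fetched from home on demand
second = cache.open("genome/cohort.bed")  # served locally, no second fetch
```

The point of the sketch is the "only when it needs to" property: data surfaces at a compute edge on first use, rather than being manually staged to every location in advance.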

Figure 2: Data location and transports from one platform to another, autonomously, exhibiting automated cache hierarchy in the workflow.


  1. Taylor, I.J., Deelman, E., Gannon, D.B., Shields, M. Workflows for e-Science: Scientific Workflows for Grids. (2007). 1st Edition [Ebook]. New York: Springer Press, PP. 3-6. Available at [Accessed 13th June, 2017]
  2. Shi, Q., Kurian, G., Devadas., S, Khan, O. LDAC: Locality-Aware Data Access Control for Large-Scale Multicore Cache Hierarchies. (2016). ACM Transactions on Architecture and Code Optimisation (TACO), Volume 13, Issue 4, Article No. 37.



Jake Carroll is the Senior ICT Manager (Research) for one of the largest neuroscientific research organisations in the world – the Queensland Brain Institute, within the University of Queensland.

Jake has worked in the research and scientific computing sector for around a decade in various roles. On the way, as he’s run all over the planet, he’s collected many wonderful colleagues, friends, collaborators and associates.  You might find it odd to hear this from a ‘once was IT guy’, but the truth is, Jake believes in people and their passion, brilliance and innovation as the most precious commodity we have. People and their passion are what makes Jake still come to work every day.

Jake is passionate about making technology go further and work harder for research and has a track record of building things that allow scientists to do things that were not possible, previously, at the campus, the national and even the international scale.

Slightly older (and if he’s very lucky, slightly wiser), Jake now takes a more quiet and considered seat in the leadership, governance and directionality of research focused technology teams, infrastructure projects and think-tanks in Australia and across the world.
