How to Improve Fault Tolerance on HPC Systems?

Dr Ahmed Shamsul Arefin1

1Scientific Computing, Information Management & Technology, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Canberra, Australia


HPC systems of today are complex systems made from hardware and software that were not necessarily designed to work together as one complete system. Therefore, in addition to regular maintenance related downtimes, hardware errors such as voltage fluctuation, temperature variation, electric breakdown, manufacturing defects as well as software malfunctions are common in supercomputers. There are several ways to achieve some fault tolerance which includes (but not limited to): checkpoint/ restart, programming model-based fault tolerance and algorithm-based theoretical solutions, but in real-life none of these cover all the situations mentioned above [1]. In this work, we have reviewed a few of the checkpoint/restart methods and we listed our experience with them including two practical solutions to the problem.


On a small cluster consisting only two nodes (master and compute) constructed using Bright Cluster Manager (BCM) ver. 7.3, we deployed a couple of existing methods (see below) that can support operating systems and software level fault tolerances. Our experience described below. In our model cluster, each node had 16 CPU cores in 2 x Intel Xeon CPU E5-2650 0 @ 2.00GHz, min 64GB RAM, 500GB local HDD, and Ethernet/ Infiniband connections for networking purposes. Slurm ver. 16.05 was installed as a part of BCM’s provisioning on the both nodes along with SLES 12Sp1 as the operating system. The master was set to act as both head and login nodes as well as the job scheduler node.


Our first test candidate was the BLCR (Berkeley Lab Checkpoint/Restart). It can allow programs running on Linux to be checkpointed i.e., written entirely to a file and then later restarted. BLCR can be useful for jobs killed unexpectedly due to power outages or exceeding run limits. It neither requires instrumentation or modification to user source code nor recompilation. We were able to install it on our test cluster and successfully checkpointing a basic Slurm job consisting only one statement: sleep 99. However, we faced the following difficulties during its pre and post-installation:

  • The BLCR website [2] lists an obsolete version of the software (ver. 0.8.5, last updated on 29 Jan 2013) which does not work with newer kernels such as what we get on the SLES12SP1. We had to collect a newer version along with a software patch from the BCLR team to get it installed on the test cluster. Slurm had to be recompiled and reconfigured several times following the BLCR installation, requiring ticket interactions with the Slurm, BCM and BLCR teams. Unfortunately, at the end the latest BLCR installation process did not work on the SLES12SP2 due to newer kernel version again (>4.4.21).
  • MPI incompatibility: The BLCR when installed along with Slurm can only work with MVAPICH2 and doesn’t support Infiniband network [2]. This MPI variant was not available in our HPC apps directory, therefore could not be tested.
  • Not available to interactive jobs: The BLCR + Slurm combination could not checkpoint/restart interactive jobs properly.
  • Not available to GPU jobs: The BLCR did not support GPU or Xeon Phi jobs.
  • Not available to licensed software: Talking to a license server, after a checkpoint/ restart did not work.

Our next candidate was the DMTCP (Distributed Multi Threaded Checkpointing). This application level tool claims to transparently checkpoint a single-host or distributed MPI computations in user-space (application level) – with no modifications to user code or to the O/S. As of today, Slurm cannot be integrated with DMTCP [2,4]. It was also noted that after a checkpoint/restart computation results can become inconsistent when compared against the non-checkpoint output, validated by a different CSIRO-IMT (User Services) team member.

Our next candidate was the CRIU (Checkpoint/Restore In Userspace). It claims to freeze a running application or at least a part of it and checkpoint to persistent storage as a collection of files. It works in user space (application level), rathe r than in the kernel. However, it provides no Slurm, MPI and Infiniband supports as of today [2,5]. We therefore have not attempted it on the test cluster.


After reviewing some of the publicly available tools that could potentially support the HPC fault tolerance, we decided propose the following practical solutions to the CSIRO HPC users.


Application level checkpointing: In this method, user’s application (based on [1]) will be set to explicitly read and write the checkpoints. Only the data needed for recovery was written down and checkpoints need taken at “good” times. However, this can result in a higher time overhead, such as several minutes to perform a single checkpoint on larger jobs (increase in program execution time) as well as user-level development time. Therefore, users may need further supports when applying this solution.


Slurm supports job preemption, by the act of stopping one or more “low-priority” jobs to let a “high-priority” job run. When a high-priority job has been allocated resources that have already been allocated to one or more low priority jobs, the low priority job(s) are pre-empted (suspended). The low priority job(s) can resume once the high priority job completes. Alternately, the low priority job(s) are killed, requeued and restarted using other resources. To validate this idea,  we’ve  deployed a  new  “killable job”  partition implementing a  “kill  and  requeue” policy.  This  solution was successfully tested by killing low priority jobs and can be further tuned to auto requeue lost jobs following a disaster, resulting an improved fault tolerance on our HPC systems.


Using a test cluster, we investigated and evaluated some of the existing methods to provide a better fault tolerance on our HPC systems. Our observations suggests that the program level checkpointing would be the best way to safeguard a running code/ job, but at the cost of development times. However, we have not yet attempted a few other methods that could potentially provide an alternative solution, such as: SCR- Scalable Checkpoint/restart for MPI, FTI – Fault Tolerance Interface for hybrid systems Docker, Charm++ and so on, which remains as our future work.


[1] DeBardeleben et al., SC 15 Tutorial: Practical Fault Tolerance on Today’s HPC Systems SC 2015

[2] Hargrove, P. H., & Duell, J. C. (2006). Berkeley lab checkpoint/restart (blcr) for linux clusters. In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.

[3] Rodríguez-Pascual, M. et al. Checkpoint/restart in Slurm: current status and new developments, SLUG 2016

[4] Ansel, J., Arya, K., & Cooperman, G. (2009). DMTCP: Transparent checkpointing for cluster computations and the desktop. In IPDPS 2009. IEEE International Symposium on (pp. 1-12). IEEE.

[5] Emelyanov, P. CRIU: Checkpoint/Restore In Userspace, July 2011.


Dr Ahmed Arefin works as an IT Advisor for HPC Systems at the Scientific Computing, IM&T, CSIRO. In the past, he was as a Specialist Engineer for the HPC systems at the University of Southern Queensland. He has done his PhD and Postdoc in the area of HPC & parallel data mining from the University of Newcastle and  published articles in PLOS ONE and Springer journals and IEEE sponsored conference proceedings. His primary research interest focuses on the application of high performance computing in data mining, graphs/networks and visualization.

Recent Comments