Mr Gareth Elliott1
1DSTG, Australia
Biography:
Gareth is currently working with researchers at DSTG in several areas of digital science. He previously undertook research as a computational chemist.
Abstract:
Many scientific calculations scale well across larger numbers of CPUs or GPUs, allowing for faster simulation times, or for a more accurate or more complicated calculation in the same amount of time. However, some calculations are limited by poor scaling, licensing constraints or limited access to compute hardware, and must instead simply run for a longer period of time to achieve the desired result: for example, gathering sufficient statistics from molecular dynamics, converging a geometry optimisation, reaching an optimisation threshold or sufficiently searching a feature space. Many software packages offer checkpoint/restore functionality, which allows recovery should there be a hardware failure, or allows a long job to run where a wall time limit exists (as is usually the case on an HPC system). However, what happens when researchers want to use software that does not offer this functionality, and they want to run it uninterrupted for over a month? This talk will discuss the use of generic checkpointing software to provide that functionality externally, giving researchers robustness against hardware failures and wall time constraints where this was not previously possible.
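As a concrete illustration of the approach (a sketch only, not necessarily the specific tooling discussed in the talk), a generic checkpoint/restore tool such as DMTCP can wrap an unmodified binary inside a batch job script. The application name `./my_sim`, the checkpoint interval, and the Slurm wall time below are hypothetical:

```shell
#!/bin/bash
#SBATCH --time=24:00:00   # wall time limit imposed by the HPC system
# Sketch only: assumes DMTCP is available on the cluster, and that
# ./my_sim is an unmodified application with no native checkpoint support.

if ls ckpt_*.dmtcp >/dev/null 2>&1; then
    # A previous job left checkpoint images behind: resume from them.
    dmtcp_restart ckpt_*.dmtcp
else
    # First run: launch under DMTCP, checkpointing every 3600 seconds.
    dmtcp_launch -i 3600 ./my_sim
fi
```

Resubmitting the same script each time the wall time expires resumes the calculation from the most recent checkpoint, so a chain of wall-time-limited jobs can, in effect, run uninterrupted for months and also recover from hardware failure.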