Dr Ahmed Shamsul Arefin¹
¹CSIRO, Acton, Australia
Biography:
Dr Ahmed Shamsul Arefin is a Computation and HPC Specialist working within the HPC & Research Platforms Team, Scientific Computing Platforms, CSIRO. He completed his PhD in Computer Science (Data-Parallel Computing & GPUs) at the University of Newcastle, Australia, and worked as a Postdoctoral Researcher (Parallel Data Mining) at the Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine (CIBM), The University of Newcastle, Australia. His research interests focus on the application of HPC in data mining, graphs and visualization.
Abstract:
In high-performance computing (HPC), a dependable and cost-effective disaster recovery (DR) solution is of utmost importance. The ability to seamlessly replicate and back up the live cluster, including intricate details such as libraries and kernel versions, is crucial for ensuring that a recovered cluster produces the same accurate results it did under normal circumstances.
This work proposes a method of backing up HPC master nodes using the Clonezilla disk imaging/cloning utility, which supports smart copying for various file systems. The resulting backup image folder is then securely stored locally on NFS storage, providing the means for safe and successful recovery and re-provisioning of the entire cluster, as demonstrated through testing on a test HPC cluster. We find it a safe, secure, and low-cost alternative to HPC infrastructure replication on the public cloud.
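As a rough illustration of the workflow described above, the following Python sketch composes (but does not run) the two commands such a backup might involve: mounting an NFS export as the image repository and saving a whole disk with Clonezilla's `ocs-sr`. It is a sketch under stated assumptions, not the procedure used in this work: the NFS export `nfs01:/export/dr-images`, the image name `master01-img`, the disk `sda`, and the particular `ocs-sr` options are all hypothetical placeholders, and the options should be checked against `ocs-sr --help` on the Clonezilla release in use.

```python
import shlex

# Clonezilla's conventional mount point for the image repository.
PARTIMAG = "/home/partimag"


def build_backup_commands(nfs_export, image_name, disk):
    """Return the argv lists for an NFS-backed whole-disk backup.

    Nothing is executed here; the caller decides whether to print the
    commands for review or hand them to subprocess.run() from a
    Clonezilla live environment.
    """
    # Step 1: make the NFS export available as the image repository.
    mount_cmd = ["mount", "-t", "nfs", nfs_export, PARTIMAG]

    # Step 2: save the whole disk as an image folder under PARTIMAG.
    # Options are illustrative assumptions:
    #   -q2      prefer partclone (copies only used blocks on supported file systems)
    #   -j2      also save the hidden data between the MBR and first partition
    #   -z1p     parallel gzip compression
    #   -p true  take no post-action (no reboot or poweroff) when finished
    save_cmd = [
        "ocs-sr", "-q2", "-j2", "-z1p", "-p", "true",
        "savedisk", image_name, disk,
    ]
    return [mount_cmd, save_cmd]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    for cmd in build_backup_commands("nfs01:/export/dr-images", "master01-img", "sda"):
        print(shlex.join(cmd))
```

Building the commands as data first makes it possible to review or log exactly what will run on a master node before any disk is touched; restoring would follow the same pattern with `restoredisk` in place of `savedisk`.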
Safeguarding an organization's critical assets, such as HPC systems, against unforeseen disruptions is mission-critical. This work examines how Clonezilla enables an efficient and reliable DR solution that supports the continuity of essential operations during times of crisis.