Gerard Kennedy¹, Ahmed Shamsul Arefin², Steve McMahon³
¹ Research School of Engineering, ANU, Canberra, Australia
² Scientific Computing, IM&T, CSIRO, Canberra, Australia
³ IM&T, DST Group, Canberra, Australia
In this work, we present a Software Image Test (SIT) tool that tests the software image of one or more nodes in an HPC cluster system. The script comprises a collection of BATS tests that run in an automated SLURM job, and the outcomes are sent to the executing user via email. The results help to decide whether the software image is ready to be rolled out to the production cluster.
BATS (Bash Automated Testing System) [1] is a TAP (Test Anything Protocol) [2] compliant testing framework for Bash. It provides a simple way to verify that programs behave as expected. Tests are written in BATS files, which are essentially Bash scripts with a special syntax for defining test cases. If every command in a test case exits with a status code of 0 (success), the test is considered to have passed. See an example run below:
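As a minimal illustration of this syntax (not one of the SIT tests), the sketch below writes a small BATS file and runs it if `bats` is installed. The file name and test names are hypothetical; `@test`, `run`, `$status` and `$output` are standard BATS constructs.

```shell
#!/usr/bin/env bash
# Sketch: write a tiny BATS file and run it when `bats` is available.
# The file name and test descriptions are hypothetical, for illustration only.
bats_file="$(mktemp -d)/example.bats"

# The quoted heredoc body is what would normally live in its own .bats file.
cat > "$bats_file" <<'EOF'
#!/usr/bin/env bats

@test "arithmetic gives the expected result" {
  result="$(( 2 + 2 ))"
  [ "$result" -eq 4 ]          # a non-zero exit here would fail the test
}

@test "run captures the exit status of a failing command" {
  run false                    # `run` stores the exit code in $status
  [ "$status" -eq 1 ]
}

@test "run captures the output of a command" {
  run echo "hello"
  [ "$status" -eq 0 ]
  [ "$output" = "hello" ]
}
EOF

if command -v bats >/dev/null 2>&1; then
  bats --tap "$bats_file"      # print the results in TAP format
else
  echo "bats not found; test file written to $bats_file"
fi
```

Each `@test` block is one test case. A case passes only if every command inside it returns 0, which is why plain `[ ... ]` checks are enough to express assertions.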
Figure 1: BATS tests developed for the HPC Software Image Testing.
With the syntax demonstrated above, we have developed a number of BATS tests (see Figure 1):
- nvidia.bats: This script contains tests for the node GPUs. It uses the Nvidia Validation Suite [3], whose results give a quick check of the CUDA configuration and setup, ECC enablement, etc. It runs the Deployment, Memory, and PCIe/Bandwidth tests, giving a quick overview of the main components of the GPUs.
- intel.bats: This test runs the Intel Cluster Checker [4]. This package requires the ‘config.xml’, ‘packagelist.head’, ‘packagelist.node’ and ‘nodelist’ files to be set up in order to execute successfully. The ‘config.xml’ file determines the modules that will be tested, and can be altered if the user wishes. Some examples of the modules tested: ping, ssh, infiniband, mpi_local, packages (uses the package lists), storage, etc. This test requires multiple nodes to run on.
- benchmark.bats: This test runs the Intel MPI Benchmark [5], which helps to ensure that MPI has been correctly configured on the node(s) in question. This test requires multiple nodes to run on.
- apoa.bats: This test runs the NAMD [6] ApoA1 simulation to exercise the OpenMP, MPI, CUDA, etc. configurations.
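Two of the tests above need more than one node. One way to guard such tests, sketched here under the assumption that they run inside a SLURM job, is to check the node count that SLURM exports as `SLURM_JOB_NUM_NODES`. The helper name `needs_multi_node` is hypothetical and is not taken from the SIT scripts.

```shell
#!/usr/bin/env bash
# Sketch of a guard for tests that need more than one node.
# needs_multi_node is a hypothetical helper, not part of the SIT scripts.
# SLURM sets SLURM_JOB_NUM_NODES inside a job; default to 1 outside a job.

needs_multi_node() {
  local nodes="${SLURM_JOB_NUM_NODES:-1}"
  if [ "$nodes" -lt 2 ]; then
    echo "skip: needs at least 2 nodes, job has $nodes"
    return 1
  fi
  return 0
}
```

Inside a BATS test case, a failed guard would typically be followed by the built-in `skip` command, so that a single-node run reports the test as skipped rather than failed.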
Further to these tests, we have developed scripts for a few more essential tests, e.g., checking the storage mounts, SLURM partitions, SSH host keys, etc.
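A storage-mount check of this kind can be sketched as follows. This is an illustration only: the function names and mount points are hypothetical, and the check simply looks up each mount point in a table in `/proc/mounts` format.

```shell
#!/usr/bin/env bash
# Sketch of a storage-mount check; function names and paths are hypothetical.
# The mounts table uses the /proc/mounts layout, where field 2 is the mount point.

is_mounted() {
  local mount_point="$1" table="${2:-/proc/mounts}"
  # Exit 0 if any line has the requested mount point in field 2, else exit 1.
  awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$table"
}

check_mounts() {
  local table="$1"; shift
  local mp failed=0
  for mp in "$@"; do
    if is_mounted "$mp" "$table"; then
      echo "ok - $mp is mounted"
    else
      echo "not ok - $mp is not mounted"
      failed=1
    fi
  done
  return "$failed"
}
```

On a node, a call such as `check_mounts /proc/mounts /datastore /scratch` would report each mount point in turn, with the two paths standing in for whatever file systems the image is expected to provide.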
In order to execute the SIT script, the user must provide a valid set of input arguments. The possible input arguments are:
- Partition: The SIT script runs as a batch job, so the user needs to define the partition in which the node or set of nodes is located. If the nodes to be tested are spread across multiple partitions, the partitions must be entered as a comma-separated list.
- Node(s): The user can input as many nodes as they wish.
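The handling of these two arguments might look like the sketch below. It is not the SIT interface, whose option names are not given here; `split_partitions`, `validate_args` and the partition and node names are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical argument handling; the real SIT option names are not given in the text.

# Split a comma-separated partition list into one name per line.
split_partitions() {
  local IFS=','
  local -a parts
  read -r -a parts <<< "$1"
  printf '%s\n' "${parts[@]}"
}

# Check that a partition list and at least one node were supplied,
# then echo back what would be tested.
validate_args() {
  local partitions="$1"; shift
  if [ -z "$partitions" ]; then
    echo "error: no partition given" >&2
    return 1
  fi
  if [ "$#" -eq 0 ]; then
    echo "error: no node given" >&2
    return 1
  fi
  split_partitions "$partitions" | sed 's/^/partition: /'
  printf 'node: %s\n' "$@"
}
```

For example, `validate_args gpu,cpu node01 node02` lists two partitions and two nodes, while a call with an empty partition list or no nodes returns a non-zero status.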
Here are four examples of valid initialization commands and input argument combinations:
The SIT sends an email to the executing user, as shown in Figure 2. As the test outcomes are sent by email, users do not need to wait at the console. Based on the results, we further tune the software image as required.
Figure 2: SIT outcomes are sent as an email when the job is finished.
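How the email is composed is not described here. As one possibility, the TAP output of the tests could be condensed into a one-line summary suitable for a subject line. The helper `summarize_tap` below is a hypothetical sketch of that idea.

```shell
#!/usr/bin/env bash
# Sketch: condense TAP output into a one-line summary, e.g. for an email subject.
# summarize_tap is a hypothetical helper; it reads TAP lines on standard input.

summarize_tap() {
  awk '
    /^ok /     { pass++ }      # TAP lines for passing tests start with "ok"
    /^not ok / { fail++ }      # TAP lines for failing tests start with "not ok"
    END {
      total  = pass + fail
      status = (fail > 0) ? "FAILED" : "PASSED"
      printf "SIT %s: %d/%d tests passed\n", status, pass, total
    }'
}
```

A typical use would be to pipe the test output through it, for example `bats --tap nvidia.bats | summarize_tap`. SLURM can also notify the user when a job ends through the `--mail-type` and `--mail-user` options of `sbatch`; whether SIT relies on these or sends its own message is not stated.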
Conclusions and future work
We have devised a TAP-based tool that can quickly check the suitability of a software image before it is rolled out to the production cluster nodes. The script, as demonstrated above, is simple but robust enough to accommodate as many factors as we wish to test. Our future plans include creating a GUI, possibly a web version, where users can add or remove tests and view the outcomes visually. We also aim to use Nvidia’s DCGM tool, which has recently replaced the Validation Suite.
[1] Stephenson, S., “BATS”, https://github.com/sstephenson/bats
[2] Test Anything Protocol, http://testanything.org/
[3] Nvidia Validation Suite, http://docs.nvidia.com/deploy/nvvs-user-guide/index.html
[4] Intel Cluster Checker, https://software.intel.com/en-us/intel-cluster-checker
[5] Intel MPI Benchmark, https://software.intel.com/en-us/articles/intel-mpi-benchmarks
[6] NAMD, https://www.ks.uiuc.edu/Research/namd/
Gerard Kennedy: Gerard is working as a Research Assistant at the Research School of Engineering, ANU and the Australian Centre for Robotic Vision. He is developing an asparagus-picking robot and is involved with the robot’s perception system, which includes areas such as camera calibration, image segmentation and 3D reconstruction. He has a B.E. in Mechatronics, Robotics and Systems Engineering from the Australian National University.
Ahmed Arefin: Ahmed works within the High Performance Computing Systems Team at Scientific Computing, IM&T, CSIRO. He completed his PhD and postdoctoral work in HPC and parallel data mining at the University of Newcastle, and has published articles in PLOS ONE, Springer journals and IEEE-sponsored conference proceedings. His primary research interest is the application of high performance computing to data mining, graphs/networks and visualization.
Steve McMahon: Steve McMahon is an IT professional with a strong background in science, software development and IT service delivery. He understands the IT infrastructure needs of scientists and has worked with many. He has worked on negotiating, designing and establishing IT infrastructure for several large scale science projects. He has done major software development in the fields of computational fluid dynamics and biophysics simulation. He was integral in planning and implementing a broad range of data services for the federally funded Australian Research Collaboration Service (ARCS). Steve is currently working as the Engineering Manager for HPC and Computational Sciences at the DST Group.