Systems Administration in Research Computing

Conveners: Mr Greg Lehmann1, Mr Jake Carroll2

Gin Tan3, Dr Robert Bell4, Michael Mallon6, Linh Vu7, Steve McMahon5

1CSIRO, Pullenvale, Australia,
2The University of Queensland, St. Lucia, Australia,
3Monash University, Melbourne, Australia,
4CSIRO, Melbourne, Australia,
5CSIRO, Canberra, Australia,
6The University of Queensland, Brisbane, Australia,
7The University of Melbourne, Melbourne, Australia,


The workshop will be a full-day event without a hands-on component. There is no limit on the number of attendees, and no special equipment is required.


Research Computing uses tools and techniques that are specialized in nature. Systems administrators working with these tools and the scientists who use them have a different skill set to the norm in IT. This workshop will present information in this area and showcase use cases with the aim of knowledge transfer between practitioners.

1. Workshop introduction and site introductions. 5 minutes per site e.g.

a. Pawsey
c. NCI
d. DST
e. Monash
f. Swinburne
g. CQU
h. From the floor

2. Space/data management techniques. Flushing, quotas and HSM with encapsulation. Data life cycle, dataset concept. Exclude publication of datasets. – various – Rob Bell, Greg Lehmann, David Rose
45 mins
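
As a rough illustration of the "flushing" idea in this session, the sketch below lists files on a scratch filesystem that have not been accessed within a retention window. It is not any site's actual policy or tool: the function name `find_flush_candidates`, the age threshold, and the use of access time as the criterion are all assumptions made for the example.

```python
import os
import time


def find_flush_candidates(root, max_age_days, now=None):
    """Return sorted paths under root not accessed within max_age_days.

    Hypothetical helper for illustration only; real flushing policies
    usually also consider quotas, exemptions and user notification.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    candidates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished between listing and stat
            if st.st_atime < cutoff:
                candidates.append(path)
    return sorted(candidates)
```

The function only reports candidates; deleting them is left as a separate, deliberate step so that a dry run is the default behaviour.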


3. BeeGFS Use Cases in Australian HPC – Jake Carroll and Greg Lehmann
(1) Filesystems for accelerated computing – Australia’s first all-flash BeeGFS production environment

Through analysis and system observability, it has become evident that accelerated supercomputing presents a new kind of challenge to filesystems. This presentation discusses the challenges the University of Queensland faced in scaling DL, AI, ML and deconvolution workloads, and the pressure these workloads placed on traditional parallel filesystems. The University eventually arrived at an RDMA-based all-flash BeeGFS implementation, and the presentation details the architectural considerations, workloads and corner cases that motivated this approach.

(2) CSIRO’s new scratch FS – a first look a couple of months in.
30 mins

4. A Year with CephFS for HPC – Linh Vu
This presentation discusses the findings and challenges that the University of Melbourne experienced within a year of implementing CephFS as the storage solution for our growing HPC service. I will talk about our journey from a small 6-node, 768TB (raw) NLSAS proof-of-concept cluster to one more than 10 times that size, with a mix of NLSAS, SAS SSD and NVMe SSD storage pools to cater for different workloads. I will address the design, technical and managerial challenges we have faced in bringing a relatively unknown filesystem to HPC, one in which we are now investing heavily.
30 mins

5. Efficiently sharing data between HPC and cloud computing platforms – Michael Mallon
One of the guiding principles of the Medici project is to make where data lives somewhat independent of how a researcher might want to consume it. Adhering to this principle enables researchers to choose the most appropriate tool for a particular part of a workflow without incurring a mirroring or replication overhead. One of the more difficult places to adhere to this principle is the intersection of cloud computing and HPC resources in workflows. I’ll talk about how we’ve addressed this using GPFS’s unified object and file interface and SwiftHLM.
30 mins


6. Ansible for Cluster Build – Gin Tan
The new M3 cluster is a bit different from a traditional HPC cluster. The cluster sits on the Monash research cloud and instances are provisioned with Ansible – we call it cluster-in-a-box. The idea is to be able to provision a cluster anytime and anywhere we want.
30 mins

7. OpenHPC Experiences on the UQ Wiener cluster – Jake Carroll
30 mins

8. Using Bright Cluster Manager to streamline and improve HPC operations – Steve McMahon
Managing HPC systems can be complex.  There’s a lot happening and a lot of things to check to make sure they are working correctly.  This talk is about how using a product like Bright Cluster Manager can simplify HPC operations, check for common problems and improve service levels.
30 mins


9. Slurm on OzStar at Swinburne – Chris Samuel
This short talk will cover how we use Slurm on Swinburne’s OzStar GPU cluster. It will cover which plugins we use and why, as well as how we try to balance the competing requirements for scheduling our workload through fair-share, partition configurations and our Lua job submit plugin. If time permits, it will also cover as-yet-unsolved problems we wish to address.
30 mins

10. Scheduling containers in the cloud and HPC – Gin Tan
How we use the same container to run jobs in both Kubernetes and Slurm. The idea is to burst HPC workloads into the cloud, and we are also looking for suggestions from the audience. The workload will be as simple as running TensorFlow in the container.
30 mins

11. HPC procurement panel discussion – various speakers including Jake Carroll
30 mins


IT workers who maintain the underlying Computing and Data Infrastructure used by scientists to do eResearch.


No special equipment required. Some background in IT required, preferably in HPC/Cloud computing.



Greg Lehmann has 35 years of IT experience. Greg worked at the University of Queensland in his early career and has had varied mini careers in CSIRO. At present he works in the data team focused on filesystem delivery for HPC and cloud. Greg still has a strong interest in HPC systems in general, which were the focus of his previous role. He is also the InfiniBand fabric tech lead for CSIRO.

Jake Carroll is currently the Associate Director of Research Computing for UQ’s three large scientifically intensive research institutes – the Australian Institute for Bioengineering and Nanotechnology, the Institute for Molecular Bioscience and the Queensland Brain Institute.

Jake has spent the last 12 years in scientific computing, working on everything from building supercomputers to managing the strategy and complexity that comes with scientific endeavour.

Jake spends his time working to make scientific computing platforms, technology and infrastructure as good as they can be, so that world-class research can be conducted unencumbered.

