Dr Stephen Kolmann1
1The University Of Sydney, Sydney, Australia firstname.lastname@example.org
The University of Sydney’s High Performance Computing cluster, called Artemis, first came online with 1344 cores of standard compute, 48 cores with high memory and 5 nodes with 2 K40 GPUs each. These resources were made available as one large resource pool, shared by all users, with job priority determined by PBS Professional’s fairshare algorithm [1]. This, coupled with three-month maximum walltimes, led to high system utilisation. However, wait times were long, even for small, short jobs, resulting in a sub-optimal end-user experience.
To help cater for strong demand and improve the end user experience, we expanded Artemis to 4264 cores. We knew this expansion would help lower wait times, but we did not rely on this alone. In collaboration with our managed service provider (Dell Managed Services at the time, now NTT Data Services), we designed a queue structure that still caters for a heterogeneous workload, but lowers wait times for small jobs, at the expense of some system utilisation. Figure 1 shows how we partitioned compute resources on Artemis to achieve this balance.
Figure 1: Distribution of Artemis’s compute cores. The left pie chart shows the coarse division of all Artemis’s compute cores, and the right pie chart shows the nominal distribution of compute cores within the shared area of the left pie chart.
The cores are divided into three broad categories: condominiums, strategic allocations and shared cores.
- Condominiums are compute nodes that we manage on behalf of condominium owners
- Strategic allocations are dedicated to research groups who won access via a competitive application process
- Shared cores are available to any Sydney University researcher who wants Artemis access
The shared cores are further sub-divided into separate resource pools that cater for different sized jobs. This division was made to segregate small, short jobs from large, long-running jobs. The idea behind this partitioning is that short, small jobs should start quickly, while larger, longer-running jobs should be willing to tolerate longer wait times.
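As a rough illustration of this idea, the sketch below routes a job to the smallest pool whose limits it fits. The pool names, core limits and walltime limits are hypothetical values chosen for the example; they are not Artemis's actual queue configuration, and the real routing is done by the scheduler rather than by code like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str
    max_cores: int
    max_walltime_hours: float


# Hypothetical pools, ordered from smallest to largest.
# These limits are assumptions for illustration, not Artemis's real settings.
POOLS = [
    Pool("small", max_cores=4, max_walltime_hours=24.0),
    Pool("medium", max_cores=24, max_walltime_hours=168.0),
    Pool("large", max_cores=288, max_walltime_hours=2160.0),
]


def route_job(cores: int, walltime_hours: float) -> str:
    """Return the name of the first (smallest) pool whose limits fit the job."""
    if cores <= 0 or walltime_hours <= 0:
        raise ValueError("cores and walltime must be positive")
    for pool in POOLS:
        if cores <= pool.max_cores and walltime_hours <= pool.max_walltime_hours:
            return pool.name
    raise ValueError("job exceeds the limits of every pool")


if __name__ == "__main__":
    print(route_job(1, 2.0))     # a small, short job
    print(route_job(1, 100.0))   # a small but long job
    print(route_job(100, 1.0))   # a large but short job
```

Because a job must fit both the core limit and the walltime limit of a pool, a job that is small but long, or short but large, falls through to a larger pool, which keeps the smallest pool free for jobs that are both small and short.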
This poster will explore our experience with this queue structure and how it has impacted metrics such as job wait times, system utilisation and researcher adoption.
1. PBS Professional Administrators Guide, p. 165. Available from: http://www.pbsworks.com/documentation/support/PBSProAdminGuide12.pdf, accessed 1 Sep 2017.
Stephen Kolmann is currently working at The University of Sydney as an HPC Specialist, where he provides end-user HPC documentation and training and acts as a consultant for internal HPC-related projects. He completed a PhD in Computational Chemistry, where he made extensive use of HPC facilities both at The University of Sydney and NCI.