Lev Lafayette1, Mitch Turnbull2, Mark Wilcox3, Eric A. Treml4
1University of Melbourne, Parkville, Australia, firstname.lastname@example.org
2Nyriad, Cambridge, New Zealand, email@example.com
3Nyriad, Cambridge, New Zealand, firstname.lastname@example.org
4Deakin University, Geelong, Australia, email@example.com
Identifying probable dispersal routes for marine populations is a data- and processing-intensive task for which traditional high performance computing systems are suitable, even for single-threaded applications. Whilst processing dependencies exist between the datasets, a large degree of independence between sets allows job arrays to significantly improve processing time. Identification of bottlenecks within the code base suitable for GPU optimisation, however, led to additional performance improvements which can be coupled with the existing benefits of job arrays. This case study offers an example of how to optimise single-threaded applications for GPU architectures to obtain significant performance improvements. Further development is suggested with the expansion of the GPU capability of the University of Melbourne’s “Spartan” HPC system.
University of Melbourne HPC and Marine Spatial Ecology With Job Arrays
From 2011 to 2016, the University of Melbourne provided general researcher access to a medium-sized HPC cluster called “Edward”, designed in a traditional fashion. As “Edward” was being retired, an analysis of actual job metrics indicated that the overwhelming majority of jobs were single-node or even single-core, especially as job arrays. The successor system, “Spartan”, was therefore designed with a view to high throughput rather than high performance: a small traditional HPC partition with a high-speed interconnect was combined with a much larger partition built on OpenStack virtual machines from the NeCTAR research cloud. This proved to be a highly efficient and optimised approach in terms of both finances and throughput.
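The job-array approach can be sketched as a Slurm batch script. This is a minimal illustration only: the job name, input list (`reef_inputs.txt`), array size, and the commented-out MATLAB entry point are assumptions, not taken from Spartan's actual configuration.

```shell
#!/bin/bash
# Hypothetical Slurm job array: one independent array task per input dataset.
# File names and the MATLAB function are illustrative assumptions.
#SBATCH --job-name=dispersal
#SBATCH --ntasks=1
#SBATCH --array=0-441

# Each array task selects its own line from a list of input files, so all
# 442 tasks can run concurrently with no inter-task communication.
INPUT=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" reef_inputs.txt)
echo "Task ${SLURM_ARRAY_TASK_ID}: processing ${INPUT}"
# matlab -nodisplay -r "run_dispersal('${INPUT}'); exit"
```

Submitted once with `sbatch`, the scheduler expands the script into 442 queued tasks, which is how largely independent single-threaded workloads achieve high throughput on a cloud-backed partition.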
A specific example of a large number of computational tasks suited to single-threaded applications with modest memory requirements is research in marine biodiversity and population connectivity, which has significant implications for the design of marine protected areas. In particular, there is a lack of quantitative methods to incorporate, for example, larval dispersal via ocean currents, population persistence, and impacts on fisheries. The Marine Spatial Ecology and Conservation (MSEC) laboratory at the University of Melbourne has been engaged in several research projects to identify the probable dispersal routes and spatial population structure of marine species, and to integrate these connectivity estimates into marine conservation planning.
Code Review for GPGPU Optimisation
There are a number of architectural constraints on GPUs. They are, to a very large extent, independent of their host system: object code must be compiled specifically for the GPU (e.g., using OpenCL or nvcc), and as there is no memory shared between the GPU and CPU, any unprocessed data must be transferred to the GPGPU environment and the results transferred back to the CPU environment on completion. Furthermore, GPUs typically have only small amounts of cached memory, if any; the need is instead met by GPU pipelining and very high memory-transfer rates between the GPU and the host.
During the first half of 2017, Nyriad reviewed the HPC infrastructure, the existing MATLAB(R) source code, and sample data, and wrote a test suite designed to run the CPU and GPU versions at the same time. There were two review stages: the first for optimisation of the existing MATLAB(R) code base, followed by identification of functions that could be distributed and rewritten for GPUs.
Nyriad’s code review identified bottlenecks suitable for GPGPU workloads. On the University of Melbourne HPC system, “Spartan”, using a single GPU, a 90x performance improvement was achieved over the original code, and a 3.75x improvement over the CPU version with 12 threads available, for the 4.6 GB Atlantic Model simulating 442 reefs. The simulation, which previously took 8 days to complete on one of the most powerful nodes (GPU or physical), could be completed in 2 hours. For the 4 MB South Africa Benguela Region dataset, by contrast, the GPU version is faster than the original code but slower than the improved CPU implementation.
If the code is refactored to process reefs in parallel, we anticipate that utilisation of the node would improve at both the per-GPU and multi-GPU level, significantly reducing single-simulation time by fully utilising the Spartan GPU node on which it runs. With this change we predict a performance improvement of over 5x compared to the existing GPU code; while using more resources on a node, the execution time of a single simulation would be greatly reduced. Smaller datasets would also likely see some improvement, as per-GPU utilisation would increase. Figure 2 shows the performance of the current two versions, and the predicted performance of the multithreaded GPU version, when running a single simulation on the Atlantic dataset of 442 reefs over 100 days.
Nyriad’s review found significant opportunity in the use of data-integrity and mathematical-equivalence algorithmic techniques for porting code to GPUs with minimal impact on the research workflow. With notable performance improvements across a range of job profiles, a significant expansion of Spartan’s GPGPU capacity has just been implemented. The new partition, funded by Linkage Infrastructure, Equipment and Facilities (LIEF) grants from the Australian Research Council, is composed of 68 nodes and 272 NVIDIA P100 GPGPU cards. Its major usage will be for turbulent flows, theoretical and computational chemistry, and genomics, representative of the needs of the major participants.
The University of Melbourne and Nyriad will continue their research collaborations, especially in the GPGPU environment for data integrity and mathematical equivalence, scalability testing and hybrid clusters to enable more scientific programming users to progressively scale their work up to larger systems.
- Lev Lafayette, Greg Sauter, Linh Vu, Bernard Meade, “Spartan: Performance and Flexibility: An HPC-Cloud Chimera”, OpenStack Summit, Barcelona, October 27, 2016
- For example, Keyse, J., Treml, EA., Huelsken, T., Barber, P., DeBoer, T., Kochzuis, M., Muryanto, A., Gardner, J., Liu, L., Penny, S., Riginos, C. (2018), Journal of Biogeography, February 2018
- Shigeyoshi Tsutsui, Pierre Collet (eds), (2013), Massively Parallel Evolutionary Computation on GPGPUs, Springer-Verlag
Lev Lafayette is the Senior HPC Support and Training Officer at the University of Melbourne, where he has been since 2015. Prior to that he worked at the Victorian Partnership for Advanced Computing in a similar role for eight years.