Profiling Strategies for Accelerating Neural Network Training on HPC GPU Clusters

Oliver Cairncross1

1The University of Queensland, St Lucia, QLD, Australia

Abstract

When dealing with large datasets, it is important to train machine learning (ML) models quickly and efficiently, and profiling is a critical step in achieving this. Running ML code on large HPC clusters is a convenient way to tackle large ML tasks. Nevertheless, we often find that performance is still too slow to be practical. The goal of profiling is to identify and resolve bottlenecks in data pipelines, memory utilisation, and CPU/GPU usage, and thereby optimise performance. In our case we focus on training neural networks on large image datasets.

To illustrate the concepts to be discussed, a generic case study will be presented. The study involves training a deep neural network architecture on a large-scale image dataset using a multi-GPU CUDA cluster with a high-performance file system and interconnect. The TensorFlow machine learning framework is employed.

Various tools used for profiling include:
– system performance tools;
– machine learning framework profilers;
– GPU vendor-provided profilers; and
– specialized profilers supporting multi-GPU frameworks.

Different combinations of these tools are required to address various scenarios. Therefore, how and when they are used is critical to identifying, understanding, and potentially addressing performance bottlenecks. Although this presentation focuses on training neural networks, some of the concepts (such as data pipelines) can be applied to other tasks.
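As a minimal sketch of the general idea, a system-level profiler can reveal which stage of a training pipeline dominates wall-clock time. The example below uses Python's built-in cProfile on a synthetic pipeline; the stage functions (load_batch, preprocess, train_step) are hypothetical stand-ins, not part of the case study's actual code.

```python
import cProfile
import io
import pstats
import time

def load_batch():
    # Simulate I/O latency in a data-loading stage (a stand-in for
    # reading images from a parallel file system).
    time.sleep(0.01)
    return [0] * 1024

def preprocess(batch):
    # Simulate a CPU-bound transform (e.g. decoding or augmentation).
    return [x + 1 for x in batch]

def train_step(batch):
    # Placeholder for the GPU compute step.
    return sum(batch)

def run_pipeline(steps=5):
    for _ in range(steps):
        train_step(preprocess(load_batch()))

profiler = cProfile.Profile()
profiler.enable()
run_pipeline()
profiler.disable()

# Sorting by cumulative time shows which stage dominates; with the
# sleep above, load_batch should rank highest after run_pipeline,
# pointing at an I/O bottleneck in the data pipeline.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The same workflow applies with framework- or vendor-level profilers: capture a run, rank stages by time, and attack the largest contributor first.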

Biography

Oliver’s career started in corporate IT, working on large database systems. He moved to research computing about a decade ago and completed a degree in computational maths, initially working in the life sciences with a focus on imaging and visualisation. Two years ago, Oliver moved to the machine learning domain with a focus on parallel processing. His primary concern is applying high-performance clusters with multi-GPU architectures to large ML tasks.
