Testing HPC Software Stack with Virtualization

Dr Ahmed Shamsul Arefin1

1CSIRO, Canberra, Australia

INTRODUCTION

In this work, we present a simple and effective virtual cluster deployment process that gives sysadmins a playground and helps to eliminate some HPC software stack bugs, such as kernel/software incompatibilities.

METHODS

We used three main software components: VMware, Bright Cluster Manager (with the free Easy8 license) and the SLES15 ISO from Bright Computing. The hardware was a decommissioned compute node from our production HPC cluster with 16 CPU cores (2 x Intel Xeon CPU E5-2650 0 @ 2.00GHz), 128GB RAM and a 500GB local HDD. We started the deployment by installing a base operating system and the virtualization tool VMware on the physical hardware. We then created the head node VM from Bright’s SLES15 ISO image and the compute node VMs with pre-allocated disk storage and MAC addresses, but did not install any OS on the compute nodes at this stage. The compute nodes were then created through the Bright Cluster Manager admin portal, with the head node serving the OS image, IP addresses and hostnames `node[01-08]`. Finally, we tested the latest Slurm, MPI, compilers and some commercial software for compatibility with the latest OS kernel and libraries, and checked the admin scripts, cron jobs, ssh keys, and security and firewall features.
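As an illustration of the node plan described above (eight compute nodes, each with a pre-allocated MAC address and a hostname and IP address served by the head node), the following minimal Python sketch generates such a plan. It is not the Bright Cluster Manager workflow itself. The function name `plan_nodes`, the subnet `10.141.0.0/16` and the starting MAC address are hypothetical placeholders chosen for the example.

```python
import ipaddress


def plan_nodes(count=8, prefix="node", subnet="10.141.0.0/16",
               mac_base=0x005056000001):
    """Return one record per compute node: hostname, IP and MAC.

    The subnet and mac_base defaults are illustrative assumptions,
    not values taken from the deployment described in the abstract.
    """
    # hosts() yields the usable addresses of the subnet in order.
    hosts = ipaddress.ip_network(subnet).hosts()
    plan = []
    for i, ip in zip(range(1, count + 1), hosts):
        mac_int = mac_base + i - 1
        # Format the 48-bit integer as six colon-separated hex bytes.
        mac = ":".join(f"{(mac_int >> shift) & 0xFF:02x}"
                       for shift in range(40, -8, -8))
        plan.append({
            "hostname": f"{prefix}{i:02d}",  # node01 ... node08
            "ip": str(ip),
            "mac": mac,
        })
    return plan
```

A plan like this can be compared against what the head node actually hands out, which is a cheap sanity check before any OS image is pushed to the compute node VMs.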

CONCLUSION

The virtual HPC cluster gave us a simple playground for testing software incompatibility issues, though not for measuring actual HPC performance improvements. Overall, this development helped us deploy a new OS image to production, reducing bugs at later stages and enhancing the HPC user experience.


Biography:

Dr Ahmed Arefin is a Computation Scientist working within the HPC Systems Team, Scientific Computing Platforms, CSIRO. He completed his PhD in Computer Science (Data-Parallel Computing & GPUs) at the University of Newcastle, Australia and worked as a Postdoctoral Researcher (Parallel Data Mining) at the Centre for Bioinformatics, Biomarker Discovery & Information-Based Medicine (CIBM), The University of Newcastle, Australia. His research interests focus on the application of HPC in data mining, graphs and visualization.

MOVING YOUR RESEARCH TO THE NEXT LEVEL WITH ACCELERATED COMPUTING

Dr Gabriel Noaje1, Dr John Taylor2, Ms Gin Tan3, Dr Gregory Poole4

1NVIDIA, Singapore, Singapore
2Data61, CSIRO, Canberra, Australia
3Monash eResearch Centre, Monash University, Melbourne, Australia
4Centre for Astrophysics and Supercomputing, Swinburne University of Technology, Melbourne, Australia 

GPU accelerated computing is transforming computational science and AI across multiple domains.

Join the featured speakers discussing cutting-edge technologies to accelerate their research in a broad range of scientific disciplines. The session will also provide an avenue to discuss the challenges and barriers in the adoption of GPU computing by the larger higher education and research community in Australia.

Agenda:

Building advanced GPU Computing Capabilities to support AI for Science (15 min presentation)

Dr. John Taylor

Research Group Leader CSIRO, Data61 & Program Leader Defence Science and Technology Group

As the volume of high-resolution data created by scientific instruments grows, next-generation HPC requires innovative infrastructure optimisation to enable researchers to get the most from their experimental data (15 min presentation)

Ms. Gin Tan

Principal Research Systems Architect, Monash eResearch Centre, Monash University

Meeting Astronomy Community Challenges with GPU Acceleration (15 min presentation)

Dr. Gregory Poole

Astronomy Data Science Coordinator, Centre for Astrophysics and Supercomputing, Swinburne University of Technology

Moving Your Research to the Next Level with Accelerated Computing (15 min panel discussion)

Dr. Gabriel Noaje, Dr. John Taylor, Ms. Gin Tan, Dr. Gregory Poole


Biography:

Dr. Gabriel Noaje is a Senior Solutions Architect at NVIDIA APAC South specialized in HPC and DL. Gabriel has more than 12 years of experience in accelerator technologies and parallel computing. Prior to joining NVIDIA, Gabriel worked both for large OEMs like SGI and HPE, as well as large HPC centers in Singapore and France.

Gabriel holds a PhD in Computer Sciences from the University of Reims Champagne-Ardenne, France and a BSc and MSc in Computer Sciences from the Polytechnic University of Bucharest, Romania.

Dr. John Taylor is currently Computational Platforms Research Group Leader in CSIRO Data61. At CSIRO he leads complex, multi-site, large-scale interdisciplinary teams of research scientists, computational scientists, computer scientists and software engineers, drawn from across all areas of CSIRO science and from Information Management and Technology (IM&T), that are delivering high-quality strategic science. He is also Program Leader, HPC and Computational Science, at Defence Science and Technology Group, Dept. of Defence, where he is building a new HPC capability to support defence research.

He has held leadership positions managing large and diverse programs of research and teaching at prominent universities and research laboratories in both Australia and the United States of America. In these positions he has taken the lead in developing the vision, the culture of excellence, setting the strategic directions, building high performing teams and delivering on the strategic goals.

Ms. Gin Tan is the Principal Research Systems Architect at Monash University and brings over eight years of experience in running high performance and high throughput computing. In this role, she delivers computing resources to researchers and strengthens the University's eResearch capacity by providing computing solutions that may involve developing new hardware or software solutions, or re-architecting or redeploying existing open-source or commercial solutions. Gin also takes complete responsibility for delivering IT solutions to customers, managing staff, and supporting a team of developers and system admins. Over the years, she has been involved in creating, designing, and implementing new and reliable technology, which has led to the growth and expansion of the organization. Gin supports software-defined infrastructure and has a vision of driving hardware technology through software.

Dr. Gregory Poole obtained his PhD from the University of Victoria (Canada) in 2007 after completing his MSc at the University of Toronto (Canada) and his BSc at the University of Waterloo (Canada). He has contributed to published studies in a variety of fields, from the thermal properties of interstellar dust and observations of distant star clusters to simulations of colliding galaxy clusters and of the large-scale structure of the Universe.

He is currently the Astronomy Data Science Coordinator for the Centre for Astrophysics and Supercomputing at Swinburne University of Technology and manages the development efforts of the Swinburne node of the Astronomy Data and Computing Services (ADACS) program.

BREAKING 16 AI PERFORMANCE RECORDS IN LATEST MLPERF BENCHMARKS

Dr Gabriel Noaje1

1NVIDIA, Singapore, Singapore

The MLPerf consortium's mission is to “build fair and useful benchmarks” to provide an unbiased training and inference performance reference for ML hardware, software, and services. MLPerf Training v0.7 is the third round of the training benchmark, which continues to evolve to stay on the cutting edge.

This round consists of eight different workloads that cover a broad diversity of use cases, including vision, language, recommendation, and reinforcement learning.

In MLPerf Training v0.7, the new NVIDIA A100 Tensor Core GPU and the DGX SuperPOD-based Selene supercomputer set all 16 performance records across per-chip and max-scale workloads for commercially available systems. These breakthroughs were the result of a tight integration of hardware, software, and system-level technologies.

NVIDIA engineers have developed a host of innovations to achieve these levels of performance. This presentation details many of the optimizations used to deliver the outstanding scale and performance.

Many of these improvements have been made available on NGC, which is the hub for NVIDIA GPU-optimized software. The AI community can thus realize the benefits of these optimizations in their real-world applications, not just better benchmark scores.


Biography:

Dr Gabriel Noaje is a Senior Solutions Architect at NVIDIA APAC South specialized in HPC and DL. Gabriel has more than 12 years of experience in accelerator technologies and parallel computing. Prior to joining NVIDIA, Gabriel worked both for large OEMs like SGI and HPE, as well as large HPC centers in Singapore and France.

Gabriel holds a PhD in Computer Sciences from the University of Reims Champagne-Ardenne, France and a BSc and MSc in Computer Sciences from the Polytechnic University of Bucharest, Romania.

About the conference

eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information-centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.
