Challenges of Real-time Processing in HPC Environments: The ASKAP Experience

Mr Eric Bastholm1, Mr Juan Carlos Guzman1

1CSIRO, Kensington, Australia, eric.bastholm@csiro.au, juan.guzman@csiro.au

Overview

CSIRO Astronomy and Space Science operates the Australian Square Kilometre Array Pathfinder (ASKAP), a state-of-the-art radio telescope located at the Murchison Radio-astronomy Observatory (MRO) in Western Australia. The data rates it produces are high enough that the data must be processed in real time in a multi-user high performance computing (HPC) environment. This presents a number of challenges with respect to resource availability and the performance variability that results from the interaction of many parallel software and hardware components. The issues include disk performance and capacity, inter-node and intra-node parallel communication overheads, and process isolation.

Typically, supercomputing facilities are used for batch processing of large, established data sources. Small, intermittent processing delays caused by temporary resource contention do not affect these jobs, because total compute time dominates and short waits are insignificant. However, for real-time processing at high data ingest rates, these small and unpredictable contentions for resources can be problematic. In our case, delays of only seconds can reduce our ability to ingest the data reliably.

We have learned much from addressing these challenges, and our experience and solutions will provide valuable input to the radio astronomy community, especially for larger projects such as the Square Kilometre Array, where the same challenges will arise at even larger scales.

Disk Performance

A key component of the system is the high-performance, high-capacity Lustre file system. Our findings suggest that even though the hardware can be shown to meet the required specification (10 GB/s sustained read/write), in practice other factors can lower that rate significantly or increase its variance, which in turn lowers the average. These include the number of active writers, the I/O patterns used, reliance on external code libraries with unknown I/O behaviours, and the remaining capacity of the disk.
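
One way to make the gap between rated and delivered throughput visible is to time individual writes rather than only the aggregate. The following is a minimal sketch, not the ASKAP benchmark: the function names `measure_write_rates` and `summarise`, the chunk size, and the chunk count are assumptions chosen for illustration, and the example writes to a local temporary file rather than to Lustre.

```python
import os
import statistics
import tempfile
import time


def measure_write_rates(path, chunk_bytes=1024 * 1024, n_chunks=8):
    """Write n_chunks blocks to path and return the per-chunk rate in MB/s.

    Hypothetical helper for illustration; the sizes are arbitrary assumptions.
    """
    payload = b"\0" * chunk_bytes
    rates = []
    with open(path, "wb") as f:
        for _ in range(n_chunks):
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force data to storage before stopping the clock
            elapsed = time.perf_counter() - start
            rates.append(chunk_bytes / 1e6 / max(elapsed, 1e-9))
    return rates


def summarise(rates):
    """Return the mean, population standard deviation, and worst-case rate."""
    return {
        "mean": statistics.mean(rates),
        "stdev": statistics.pstdev(rates),
        "min": min(rates),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(summarise(measure_write_rates(os.path.join(tmp, "probe.bin"))))
```

For a real-time writer the minimum and the spread are likely to matter more than the mean, since a single slow chunk can stall ingest even when the average rate looks adequate.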

Inter-process Performance

To ingest and process high data rates it is necessary to parallelise the implementation. We have found that complicated interactions between processes degrade performance in memory access, inter-process communication, and threading. This manifests as dead regions in the processing stream, which can result in lost data, errors, or reduced output quality.
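
Dead regions of this kind can be located after the fact if each processed block is timestamped. A minimal sketch follows, assuming a hypothetical list of block completion times in seconds and a hypothetical expected interval between blocks; neither reflects the actual ASKAP ingest cadence, and `find_dead_regions` is an illustrative name.

```python
def find_dead_regions(timestamps, expected_interval, tolerance=2.0):
    """Return (start, end, duration) for every gap longer than
    tolerance * expected_interval.

    timestamps: ascending completion times in seconds (hypothetical input).
    """
    limit = expected_interval * tolerance
    gaps = []
    # Compare each completion time with the one before it.
    for prev, curr in zip(timestamps, timestamps[1:]):
        duration = curr - prev
        if duration > limit:
            gaps.append((prev, curr, duration))
    return gaps


if __name__ == "__main__":
    # Blocks expected every 1 s, with one stall between t=3 and t=8.
    times = [0.0, 1.0, 2.0, 3.0, 8.0, 9.0]
    print(find_dead_regions(times, expected_interval=1.0))  # [(3.0, 8.0, 5.0)]
```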

Process Isolation

Our experience indicates that it is important to isolate key processes from the rest of the system as much as possible, especially those responsible for data ingest. In principle this seems fairly straightforward, but in practice it is non-trivial, because the HPC environment is a complex mesh of devices, software, and networks, which essentially prevents total isolation of any process.
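
One partial form of isolation is pinning an ingest process to dedicated CPU cores so that the scheduler neither migrates it nor places it on cores outside that set. The sketch below uses `os.sched_setaffinity`, which Python exposes on Linux only; the helper name `pin_to_cpus` and the choice of core are assumptions for illustration. Pinning addresses CPU placement alone and does nothing about contention on shared file systems or networks.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict process pid (0 means the current process) to the given CPU ids.

    Returns the resulting affinity set, or None where the call is unavailable
    (for example on macOS or Windows). Hypothetical helper for illustration.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(pid)
    requested = set(cpus) & allowed  # ignore CPUs this process may not use
    if not requested:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(pid, requested)
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        first = min(os.sched_getaffinity(0))
        print(pin_to_cpus({first}))
```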


Biography:

Eric Bastholm

Team Leader ASKAP Science Data Processor

Leads a team of software developers with backgrounds in astronomy, software engineering and data science whose purpose is to develop and test the data reduction software pipelines for the Australian Square Kilometre Array Pathfinder (ASKAP) radio telescope managed by CSIRO.

About the conference

eResearch Australasia provides opportunities for delegates to engage, connect, and share their ideas and exemplars concerning new information centric research capabilities, and how information and communication technologies help researchers to collaborate, collect, manage, share, process, analyse, store, find, understand and re-use information.
