Mr Eric Bastholm1, Mr Juan Carlos Guzman1
1CSIRO, Kensington, Australia email@example.com firstname.lastname@example.org
CSIRO Astronomy and Space Science operates the Australian Square Kilometer Array Pathfinder (ASKAP), a state-of-the-art radio telescope located at the Murchison Radio Observatory (MRO) in Western Australia. The data rates produced are significant enough that it is necessary to process in real-time in a multi-user high performance computing (HPC) environment. This presents a number of challenges with respect to resource availability and the performance variability resulting from interactions of many parallel software and hardware components. These issues include: disk performance and capacity, inter-node and intra-node parallel communication overheads, and process isolation.
Typically, supercomputing facilities are used for batch processing of large established data sources. Small intermittent processing delays resulting from temporary resource contention issues do not affect these processes as the total compute time is the dominating factor and small wait times are insignificant. However, for real-time processing of high data ingest rates these small and unpredictable contentions for resources can be problematic. In our case, even delays of seconds can have a negative effect on our ability to reliably ingest the data.
We have learned much from addressing these challenges and our experience and solutions will provide valuable input to the radio astronomy community, especially for larger projects, like the Square Kilometer Array, where the same challenges will present at even larger scales.
A key component in the system is the high performance, high capacity Lustre file system. Our findings suggest that even though the hardware can be shown to perform at the required spec (10 GB/s sustained read/write) in practice other factors come into play which can lower that rate significantly or introduce a higher variance in the rate lowering its average. These include: the number of active writers, the I/O patterns used, reliance on external code libraries with unknown I/O behaviours, and the remaining capacity of the disk.
To ingest and process high data rates it is necessary to parallelise the implementation. We have found that there are complicated issues surrounding process interaction which cause performance degradation in memory access, inter-process communication and threading. This manifests as dead regions in the processing stream which can result in lost data, errors, or a reduction in output quality.
Our experience indicates that it is important to isolate key processes from the rest of the system as much as possible. Especially, those processes responsible for data ingest. In principle this seems fairly straight forward, but in practice it is non-trivial because the HPC environment is a complex mesh of devices, software, and networks which essentially prevents total isolation of any process.
Team Leader ASKAP Science Data Processor
Leads a team of software developers with backgrounds in astronomy, software engineering and data science whose purpose is to develop and test the data reduction software pipelines for the Australian Square Kilometre Array Pathfinder (ASKAP) radio telescope managed by CSIRO.