Miss Kerri Wait¹
¹Monash eResearch Centre, Monash University, Clayton Campus, Australia
The Telegraf-Influx-Grafana (TIG) stack is a powerful tool to explore and visualise the state of High Performance Computing (HPC) environments and their auxiliary services. Join me for a demonstration of the metrics and dashboards we’ve found most useful at the Monash eResearch Centre (MeRC).
The TIG stack consists of:
– Telegraf, plugin-driven metric collection agent
– InfluxDB, time-series database
– Grafana, visualisation and dashboard web UI
The software-defined HPC infrastructure at MeRC makes it trivial to deploy the Telegraf agent to any number of different services. There’s no need to struggle with obscure configurations in vendor-specific web UIs. Telegraf is plugin-driven and configured using a text file; enable and configure the appropriate plugins for the service in question and the Telegraf agent is ready to go. The collected data is stored in InfluxDB, then queried and visualised through Grafana.
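To make the data flow concrete, here is a minimal Python sketch of the InfluxDB line protocol, the text format in which Telegraf hands metrics to InfluxDB. The function names (`format_field`, `to_line_protocol`) and the sample values are hypothetical, chosen for illustration only; a real deployment relies on Telegraf's own serialiser rather than code like this.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape each character listed in `chars` within `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def format_field(value) -> str:
    """Render one field value using line protocol typing rules."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)  # assumes a finite float
    # Anything else is written as a double-quoted string.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns) -> str:
    """Build one line: measurement,tag=val field=val timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys are recommended for write performance
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={format_field(val)}" for key, val in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical sample point, shaped like output from the cpu plugin.
print(to_line_protocol(
    "cpu",
    {"host": "node01", "cpu": "cpu-total"},
    {"usage_idle": 97.5},
    1700000000000000000,
))
```

Tags (such as the host name) are indexed by InfluxDB and are what a Grafana dashboard typically filters or groups by, while fields carry the measured values.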
With TIG, I’m able to collect, store, and visualise spine switch hardware counters from the fabric, compute node health metrics such as cpu and diskio, detailed jobstats for our Lustre storage, operations on our OpenLDAP servers, the utilisation of FlexLM tokens on license servers, and traffic on nginx and Apache servers. I can monitor for specific disk usage patterns, alert particular team members via Slack, and troubleshoot user jobs.
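As a rough illustration of the disk usage alerting idea, the Python sketch below filters mount points against a threshold and builds the JSON body that a Slack incoming webhook accepts. It is not how alerting is implemented at MeRC; in a TIG deployment this would normally be handled by Grafana's own alerting. The function names, the 90% threshold, and the sample mount points are all assumptions made for the example.

```python
import json


def disk_alerts(usage_by_mount, threshold_pct=90.0):
    """Return (mount, percent_used) pairs at or above the threshold, worst first."""
    over = [(mount, pct) for mount, pct in usage_by_mount.items() if pct >= threshold_pct]
    return sorted(over, key=lambda item: item[1], reverse=True)


def slack_payload(host, alerts):
    """Build the JSON body for a Slack incoming webhook ({"text": ...})."""
    lines = [f"{mount}: {pct:.1f}% used" for mount, pct in alerts]
    text = f"Disk usage alert on {host}\n" + "\n".join(lines)
    return json.dumps({"text": text})


# Hypothetical readings, e.g. the latest used_percent values per mount point.
readings = {"/": 42.0, "/scratch": 95.5, "/home": 91.0}
alerts = disk_alerts(readings)
if alerts:
    # Sending is omitted here; the payload would be POSTed to a webhook URL.
    print(slack_payload("node01", alerts))
```

Keeping the threshold check separate from the message formatting means either half can be swapped out, for example to route different mount points to different team members.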
Community-maintained stacks like TIG provide rapid access to metrics, an interactive troubleshooting and exploration environment, as well as alerting and reporting functionality. They also allow you to antagonise your colleagues with the catchcry “I’ve got a dashboard for that!”
Kerri Wait is an HPC Consultant at Monash University. As an engineer, Kerri has a keen interest in pulling things apart and reassembling them in novel ways. She applies the same principles to her work in eResearch, and is passionate about making scientific research faster, more robust, and repeatable by upskilling user communities and removing entry barriers. Kerri is currently focused on monitoring and visualisation techniques for infrastructure at all levels of the Monash HPC platforms.