Building an Enterprise Lustre-based HPC Storage Solution

Dr Torben Kling Petersen

Cray Inc, Billdal, Sweden, tpetersen@cray.com

 

Storage is becoming an increasingly important component of modern HPC systems. The demand for faster and larger solutions keeps pushing the boundaries of what current hardware and software can accommodate. Building multi-petabyte systems capable of delivering 1,000+ GB/s of throughput requires careful attention to detail and the most reliable components available, without exceeding the available budget. Storage is no longer a separate addition to the computational part of an HPC system; it is becoming a fully integrated component capable of supporting the future needs of applications and their users. This means handling widely different workflows, ranging from large streaming I/O such as checkpointing to applications that create 100,000+ small files with a random access pattern. The latter usually means high-IOPS situations where hard disk drive systems rarely perform well. But all-flash systems are still much too expensive, so how do you create hybrid systems that minimize data movement? Is flash really useful today or is it still a thing of the future? What can we do today to make flash an integrated component of HPC storage?

On the file system side, there are currently several open source solutions on the market competing with commercial offerings. Today, the leading open source solution for HPC storage is Lustre, but Lustre has, in the past, had a bad reputation for being unstable, hard to manage, and lacking both enterprise features and enterprise support. Essentially, Lustre was deemed suitable only for large-scale research and academic users where cost is more important than reliability. So, how come the majority of weather services in the world today run Lustre for all day-to-day tasks, with multiple deadlines to meet on a daily basis? Most of the weather and climatology research sites used to run GPFS but have switched. The same goes for many commercial companies in the energy sector. What has changed and what does this mean for the rest of us?

Cray Inc. has a reputation for delivering the largest compute systems in the world, with no fewer than 8 systems in the top 20[1] (Fall 2017 list). What’s less known is that all 8 also contain a significant ClusterStor HPC storage solution. While there are several options for scalable and parallel file systems on the market today, these customers chose a Lustre ClusterStor solution.

While there’s ample access to marketing material on ClusterStor[2, 3], the inner workings of the solution are less well known. And while some may argue that “I don’t need to know how a car works to drive it”, most users of HPC do have an interest in understanding the differences between solutions and file system choices, and how certain important metrics are achieved.

This BoF is intended as an interactive presentation and general discussion of a number of topics that make a storage solution deliver capacity and throughput for the years it is expected to be in service. These topics include, but are not limited to:

  • Performance and Reliability
  • Data integrity
  • Management and monitoring
  • Flash vs Hard disk drives – when to use what and why …
  • Data management and tiering
  • Lustre and Lustre futures or alternatives …

REFERENCES

[1]   Top500. “Top500,” http://www.top500.org.

[2]   K. Claffey, A. Poston, and T. Kling Petersen, “Xyratex ClusterStor – World Record Performance at Massive Scale,” http://www.xyratex.com/sites/default/files/files/field_inline_files/Xyratex_white_paper_ClusterStor_The_Future_of_HPC_Storage_1-0_0.pdf, 2012.

[3]   Cray, “Cray ClusterStor,” https://www.cray.com/sites/default/files/Cray-ClusterStor-Storage-Brochure.pdf, 2017.


Biography:

Torben Kling Petersen has worked with high performance computing in one form or another since 1994. After leaving academic life in 2000, he has held technical leadership positions in a number of tech companies (mostly through acquisitions), including Sun Microsystems, Oracle, Xyratex, Seagate, and most recently Cray Inc.

At these companies, Torben has architected a significant number of HPC and HPC storage systems and has worked with engineering to bring several new products to market.

Torben has authored a large number of white papers and technical articles over the years and has presented at more conferences and events than can easily be listed.
