Achieving High Service Availability for HPC: Monitoring and Logging on ETH Clusters
Conveners
- Urban Borštnik (ETH Zurich)
- Diego Moreno (ETH Zurich)
Description
Achieving a high level of service quality and availability is a key goal for the central clusters of ETH. Monitoring the collected logs and metrics, and acting upon them, is crucial to meeting this goal.
This presentation will focus on our solutions for automating cluster maintenance. One is the Cluster Monkey tools, which act upon event- and metrics-driven triggers. Another is storage monitoring, which helps our users improve their data workflows and gives us insights into our upcoming storage platform.