3 October 2019
WSL Birmensdorf
Europe/Zurich timezone
hpc-ch forum

Session

Achieving High Service Availability for HPC

3 Oct 2019, 13:45
Room Englersaal (WSL Birmensdorf)

Zürcherstrasse 111 8903 Birmensdorf

Conveners

Achieving High Service Availability for HPC: Monitoring and Logging on ETH Clusters

  • Urban Borštnik (ETH Zurich)
  • Diego Moreno (ETH Zurich)

Description

High service quality and availability are key goals for the central clusters of ETH. Monitoring the logs and metrics we collect, and acting on them, is crucial to meeting these goals.
This presentation focuses on our solutions for automating cluster maintenance.
One is the Cluster Monkey tools, which act on event- and metrics-driven triggers. Another is storage monitoring, which helps our users improve their data workflows and gives us insights into our upcoming storage platform.
