3 October 2019
WSL Birmensdorf
Europe/Zurich timezone
hpc-ch forum

Session

Monitoring System and User Applications Performance

3 Oct 2019, 13:15
Room Englersaal (WSL Birmensdorf)

Zürcherstrasse 111 8903 Birmensdorf

Conveners

  • Victor Holanda Rusu (CSCS)

Description

The number of components and subsystems in today’s High Performance Computing systems makes it hard to understand fluctuations in application performance and to determine root causes when performance falls short of expectations. Changes in the OS, the programming environment or system components, hardware failures, resource oversubscription and poorly written applications can all contribute to poor application performance.

The problem is compounded by the monitoring burden on system and application engineers, who must continuously maintain regression tests and tools that capture and mine as much information as possible about the user experience. In general, this requires access to information from multiple sources, such as system subcomponents, user job scripts and regression logs. At CSCS, we use ReFrame, a regression framework designed to make regression tests easier to write and maintain, together with XALT, which tracks the user executables running on a cluster. Both tools send their data to the centralized log and metrics infrastructure at CSCS.

In this presentation, we will show how we combine the data acquired by ReFrame's performance tests with XALT's data to understand the status of our flagship system, Piz Daint, and whether it is undergoing a performance regime change.
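
As a rough illustration of what detecting a "performance regime change" can mean in practice, the sketch below compares the mean of a performance metric before and after each point in a time series and reports the point with the largest relative shift. It is not the method used at CSCS, which the abstract does not describe. The function name, the window size, the 10% threshold and the sample figures are all hypothetical.

```python
from statistics import mean


def detect_regime_change(values, window=5, threshold=0.10):
    """Return the index where the mean of the next `window` samples differs
    most from the mean of the previous `window` samples, provided the
    relative shift exceeds `threshold`. Return None if no shift qualifies.

    Hypothetical helper for illustration only; the window and threshold
    defaults are assumptions, not values taken from the talk.
    """
    best_index, best_shift = None, 0.0
    for i in range(window, len(values) - window + 1):
        before = mean(values[i - window:i])
        after = mean(values[i:i + window])
        if before == 0:
            continue  # relative shift is undefined for a zero baseline
        shift = abs(after - before) / abs(before)
        if shift > threshold and shift > best_shift:
            best_index, best_shift = i, shift
    return best_index


# Made-up figures standing in for one performance variable of a regression
# test, sampled once per run (e.g. bandwidth in GB/s).
samples = [100, 101, 99, 100, 102, 100, 85, 84, 86, 85, 84, 85]
change_at = detect_regime_change(samples)
print(change_at)
```

A mean-based comparison is sensitive to single outliers. A production setup would more likely use a robust statistic or a dedicated change-point method, and would correlate the detected point with system changes and with the executables recorded for user jobs.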
