Logging and Monitoring

Europe/Zurich
Room Englersaal (WSL Birmensdorf)

Zürcherstrasse 111, 8903 Birmensdorf
Michele De Lorenzi (CSCS), Thomas Kramer (WSL)
Description

Introduction

High Performance Computing systems generate a huge amount of log and metric data during operation: information about resource utilization, performance, failures, errors and so on is worth storing and analyzing.

This kind of data is often unstructured and not easily comprehensible: finding correlations, recognizing meaningful events and discarding false positives are common challenges that all HPC centers have to face.

The reward is worth the effort: post-mortem investigation, troubleshooting of problems and incidents, security threat hunting, early warning and alerting, application performance analysis and evaluation of resource utilization all benefit from careful processing of log and metric data.

A thorough understanding of the underlying infrastructure producing this information is essential to make sense of it, especially considering the complex hardware and software stack that modern large-scale systems comprise.

Key Questions

  1. What are the benefits of collecting log and metric data?
  2. How to correlate logs from different systems?
  3. Centralized collection of logs and metrics: challenges and returns
  4. Are logs and metrics Big Data?
  5. How to tackle the increasing complexity of multi-layered architectures (virtualization, containers, etc.)?
  6. Threat intelligence: proactively identifying unusual network activity and unauthorized access
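
As one illustration of the second question, correlating logs from different systems often starts with bringing all timestamps onto a common clock and merging the streams into a single timeline. The sketch below is a minimal example under stated assumptions, not a method presented at the forum: the log line format (ISO 8601 timestamp followed by a message), the source names, the dates and the messages are all made up for illustration.

```python
import heapq
from datetime import datetime, timezone


def parse_line(source, line):
    """Split 'ISO-8601-timestamp message' and normalize the time to UTC."""
    stamp, message = line.split(" ", 1)
    when = datetime.fromisoformat(stamp).astimezone(timezone.utc)
    return when, source, message


def merge_logs(logs):
    """Merge per-source logs (each already time-ordered) into one UTC timeline."""
    streams = [
        [parse_line(source, line) for line in lines]
        for source, lines in logs.items()
    ]
    # heapq.merge assumes every input stream is already sorted.
    return list(heapq.merge(*streams))


# Hypothetical log lines from two systems that report in different time zones.
logs = {
    "scheduler": [
        "2019-05-01T10:00:00+02:00 job 42 started on node n001",
        "2019-05-01T10:05:30+02:00 job 42 failed",
    ],
    "node-n001": [
        "2019-05-01T08:05:29+00:00 memory error detected",
    ],
}

timeline = merge_logs(logs)
for when, source, message in timeline:
    print(when.isoformat(), source, message)
```

With the timestamps normalized, the node-level memory error appears one second before the scheduler reports the job failure, which is the kind of relationship that is easy to miss when each log is read in its own local time.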
Participants
  • Alex Upton
  • Alexander Kashev
  • Alvise Dorigo
  • Andrei Plamada
  • Arnaud Fortier
  • Aurélien Cavelan
  • Carlo Pignedoli
  • Christoph Witzig
  • Cristian Scurtescu
  • Diego Moreno
  • Dino Conciatore
  • Enrico Favero
  • Filippo Stenico
  • Gianfranco Sciacca
  • Giuseppe Lo Re
  • Gottardo Pestalozzi
  • Gunnar Jansen
  • Hardik Kothari
  • Heinrich Billich
  • Jani Heikkinen
  • Jean-Baptiste Aubort
  • Jean-Claude De Giorgi
  • Luca Capello
  • Luca Cervigni
  • Markus Reinhardt
  • Martin Jacquot
  • Massimo Brero
  • Mattia Belluco
  • Michele De Lorenzi
  • Miguel Gila
  • Nick Holway
  • Nicolas Kowenski
  • Patrick Zosso
  • Raluca Hodoroaba
  • Riccardo Murri
  • Roberto Fabbretti
  • Silvan Hostettler
  • Thomas Kramer
  • Urban Borštnik
  • Victor Holanda Rusu
  • Warren Paulus
  • Wolfgang Zipfel
  • Yann Sagon
    • 09:30 10:00
      Coffee and registration 30m
    • 10:00 10:15
      Welcome and Introduction
      Conveners: Michele De Lorenzi (CSCS), Thomas Kramer (WSL)
    • 10:15 10:35
      Short Portrait of WSL

      WSL is a research institute of the Swiss Confederation. As a research institute of the ETH Domain, WSL is required by the Confederation to provide cutting-edge research and social benefits, particularly for Switzerland. One of WSL's important national functions is to conduct the Swiss National Forest Inventory (NFI) and long-term forest ecosystem monitoring (LWF). WSL is particularly active in applied research, but basic research is also among its duties. SLF employees develop tools and guidelines for authorities, industry and the public in order to support them in natural hazard risk management and in the analysis of climatic and environmental changes.

      Research for People and the Environment:

      1. WSL explores the dynamics of the terrestrial environment and the use and protection of natural habitats and cultural landscapes
      2. WSL monitors forests, landscapes, biodiversity, natural hazards and snow and ice
      3. WSL develops sustainable solutions for socially relevant issues, together with its partners from science and society
    • 10:35 11:05
      Logs and Metrics Collection and Visualization at CSCS

      As systems grow in complexity and scale, the amount of system-level data they record grows with them.

      Managing the vast amounts of log and metrics data is a challenge that CSCS solved with the introduction of a centralized log and metrics infrastructure based on Elasticsearch, Graylog, Kibana, and Grafana. This is a fundamental service at CSCS: it makes it easy to correlate events, bridging the gap between the computational workload and the nodes and thereby enabling failure diagnosis.

      Currently, the Elasticsearch cluster at CSCS is handling more than 30'000'000'000 online documents (one year) and another 40'000'000'000 archived (two years). The integrated environment from logging and metrics to graphical representation enables powerful dashboards and monitoring displays.

      Convener: Dino Conciatore (CSCS)
    • 11:05 11:35
      Data Analysis for Improving High Performance Computing Operations and Research
      Convener: Aurélien Cavelan (University of Basel)
    • 11:35 12:00
      Monitoring and Logging on ETH Clusters
    • 12:00 13:15
      Lunch and Networking 1h 15m
    • 13:15 13:45
      Monitoring System and User Applications Performance

      The number of components and subsystems in today’s High Performance Computing systems makes it hard to understand application performance fluctuations and to determine root causes when performance is not what is expected. Changes in the OS, programming environment or system components, hardware failures, resource oversubscription and poorly written applications can all contribute to bad application performance.

      This problem is exacerbated by the challenges that monitoring poses to system and application engineers, who must continuously maintain regression tests and tools that cover and mine as much information as possible about the user experience. In general, this requires access to information from multiple sources, such as system subcomponents, user job scripts and regression logs. At CSCS, we use ReFrame, a regression framework designed to facilitate the writing and maintenance of regression tests, together with XALT, which tracks user executables running on a cluster. Both tools send their data to the centralized log and metrics infrastructure at CSCS.

      In this presentation, we will show how we use the data acquired by ReFrame's performance tests together with XALT's data in order to understand the status of our flagship system, Piz Daint, and whether it is undergoing any performance regime change.
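
      The idea of a performance regime change can be sketched with a very simple test: compare recent measurements of a benchmark against a baseline period and flag a shift that is large relative to the baseline's normal variation. The example below is a minimal sketch under stated assumptions and is not the analysis used at CSCS; the function name `regime_changed`, the threshold of three standard deviations and the runtime values are hypothetical.

```python
from statistics import mean, stdev


def regime_changed(samples, baseline_size, threshold=3.0):
    """Return True if the mean of the recent samples lies more than
    `threshold` baseline standard deviations away from the baseline mean."""
    baseline = samples[:baseline_size]
    recent = samples[baseline_size:]
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least 2 baseline samples and 1 recent sample")
    spread = stdev(baseline)
    if spread == 0:
        # A perfectly flat baseline: any difference counts as a change.
        return mean(recent) != mean(baseline)
    return abs(mean(recent) - mean(baseline)) / spread > threshold


# Hypothetical runtimes (in seconds) of the same performance test over time.
stable = [100.2, 99.8, 100.1, 99.9, 100.0, 100.3, 99.7, 100.1]
shifted = [100.2, 99.8, 100.1, 99.9, 100.0, 104.9, 105.2, 105.0]

print(regime_changed(stable, baseline_size=5))
print(regime_changed(shifted, baseline_size=5))
```

      A fixed threshold on the mean is only a starting point; noisy or seasonal measurements usually call for a more robust change-point method.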

      Convener: Victor Holanda Rusu (CSCS)
    • 13:45 14:15
      Daint, logging at 2000 messages per second

      Our flagship supercomputer, Piz Daint, produces a huge amount of logging data every second: environmental data, hardware counters, filesystem performance, scheduler data, etc. We store and process all this information within seconds, and produce plots that help us understand the system better and make more solid business decisions.

      In this presentation, I will show how we use Elasticsearch to collect this data and extract useful information from it, and also how you can use it to store even more relevant data for your use case.
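
      At message rates like these, documents are normally sent to Elasticsearch in batches rather than one at a time. The sketch below builds a request body in the newline-delimited JSON format accepted by the Elasticsearch `_bulk` endpoint, where each document is preceded by an action line. It is an illustration only, not the CSCS pipeline: the index name `hpc-logs-example`, the field names and the sample messages are assumptions, and no request is actually sent.

```python
import json


def build_bulk_payload(index_name, documents):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint:
    one action line followed by one source line per document."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical log documents; the field names are illustrative only.
docs = [
    {"@timestamp": "2019-05-01T08:00:00Z", "host": "n001", "message": "fan speed nominal"},
    {"@timestamp": "2019-05-01T08:00:01Z", "host": "n002", "message": "link flap on port 3"},
]

payload = build_bulk_payload("hpc-logs-example", docs)
print(payload, end="")
```

      The resulting string would be sent as the body of a POST request to `/_bulk` with the `Content-Type: application/x-ndjson` header; batching many documents per request reduces the per-message overhead.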

      Convener: Miguel Gila (CSCS)
    • 14:15 14:40
      Community Development
    • 14:40 15:00
      Coffee Break 20m
    • 15:00 15:45
      Visit of WSL: Long-term Forest Ecosystem Research (LWF)
      Convener: Gottardo Pestalozzi
    • 15:45 15:50
      Farewell and end of the meeting