Logging and Monitoring

Room Englersaal (WSL Birmensdorf)

Zürcherstrasse 111, 8903 Birmensdorf
Michele De Lorenzi (CSCS), Thomas Kramer (WSL)


High Performance Computing systems generate a huge amount of log and metric data during operation: information about resource utilization, performance, failures, errors and so on is worth storing and analyzing.

This kind of data is often unstructured and not easily comprehensible: finding correlations, recognizing meaningful events and discarding false positives are common challenges that all HPC centers have to face.

The reward is worth the effort: post-mortem investigation, problem and incident troubleshooting, security threat hunting, early warning and alerting, application performance analysis and evaluation of resource utilization all benefit from careful processing of log and metric data.

A thorough understanding of the underlying infrastructure that produces this information is essential to make sense of it, especially considering the complex hardware and software stacks that modern large-scale systems comprise.

Key Questions

  1. What are the benefits of collecting logs and metrics data?
  2. How to correlate logs from different systems?
  3. Centralized collection of logs and metrics: challenges and returns
  4. Are logs and metrics Big Data?
  5. How to tackle the increasing complexity of multi-layered architectures (virtualization, containers, etc.)?
  6. Threat Intelligence: Proactively identifying unusual network activities and unauthorized accesses
Registration Form
  • Dino Conciatore
  • Gottardo Pestalozzi
  • Hardik Kothari
  • Michele De Lorenzi
  • Raluca Hodoroaba
  • Thomas Kramer
  • Victor Holanda Rusu
    • 09:30 10:00
      Coffee and registration 30m
    • 10:00 10:15
      Welcome and Introduction
      Conveners: Michele De Lorenzi (CSCS), Thomas Kramer (WSL)
    • 10:15 10:35
      Short Portrait of WSL

      WSL is a research institute of the Swiss Confederation. As a research institute within the ETH Domain, WSL is required by the Confederation to deliver cutting-edge research and social benefits, particularly for Switzerland. One of WSL's important national functions is to conduct the Swiss National Forest Inventory (NFI) and long-term forest ecosystem monitoring (LWF). WSL is particularly active in applied research, but basic research is also among its duties. SLF employees develop tools and guidelines for authorities, industry and the public to support them in natural hazard risk management and in the analysis of climatic and environmental changes.

      Research for People and the Environment:

      1. WSL explores the dynamics of the terrestrial environment and the use and protection of natural habitats and cultural landscapes
      2. WSL monitors forests, landscapes, biodiversity, natural hazards and snow and ice
      3. WSL develops sustainable solutions for socially relevant issues, together with its partners from science and society
    • 10:35 11:20
      Keynote Presentation
    • 11:20 11:50
      Logs and Metrics Collection and Visualization at CSCS

      As systems grow in complexity and scale, the amount of system-level data they record grows with them.

      Managing the vast amounts of log and metrics data is a challenge that CSCS solved by introducing a centralized log and metrics infrastructure based on Elasticsearch, Graylog, Kibana, and Grafana. This is a fundamental service at CSCS: it makes it easy to correlate events, bridging the gap from the computational workload to the nodes and enabling failure diagnosis.

      Currently, the Elasticsearch cluster at CSCS is handling more than 30'000'000'000 online documents (one year) and another 40'000'000'000 archived (two years). The integrated environment from logging and metrics to graphical representation enables powerful dashboards and monitoring displays.

      Convener: Dino Conciatore (CSCS)
    • 11:50 12:15
      Technical Presentation 25m
    • 12:15 13:15
      Lunch and Networking 1h
    • 13:15 13:45
      Monitoring and Logging User Applications
      Convener: Victor Holanda Rusu (CSCS)
    • 13:45 14:15
      Daint, logging at 2000 messages per second
      Convener: Miguel Gila (CSCS)
    • 14:15 14:40
      Community Development
    • 14:40 15:00
      Coffee Break 20m
    • 15:00 15:45
      Visit of WSL: Long-term Forest Ecosystem Research (LWF)
      Convener: Gottardo Pestalozzi
    • 15:45 15:50
      Farewell and end of the meeting