Logging and Monitoring

Europe/Zurich
Room Englersaal (WSL Birmensdorf)

Zürcherstrasse 111 8903 Birmensdorf
Michele De Lorenzi (CSCS), Thomas Kramer (WSL)
Description

Introduction

High Performance Computing systems generate a huge amount of log and metric data during operation: information about resource utilization, performance, failures, errors and so on is worth storing and analyzing.

This kind of data is often unstructured and not easily comprehensible: finding correlations, recognizing meaningful events and discarding false positives are common challenges that all HPC centers face.

The reward is worth the effort: post-mortem investigation, troubleshooting of problems and incidents, security threat hunting, early warning and alerting, application performance analysis and evaluation of resource utilization all benefit from careful processing of log and metric data.

A thorough understanding of the underlying infrastructure producing this information is essential to make sense of it, especially considering the complex hardware and software stack that modern large-scale systems comprise.

Key Questions

  1. What are the benefits of collecting logs and metrics data?
  2. How to correlate logs from different systems?
  3. Centralized collection of logs and metrics: challenges and returns
  4. Are logs and metrics Big Data?
  5. How to tackle the increasing complexity of multi-layered architectures (virtualization, containers, etc.)?
  6. Threat Intelligence: Proactively identifying unusual network activities and unauthorized accesses
Participants
  • Alex Upton
  • Alexander Kashev
  • Alvise Dorigo
  • Andrei Plamada
  • Arnaud Fortier
  • Aurélien Cavelan
  • Carlo Pignedoli
  • Christoph Witzig
  • Cristian Scurtescu
  • Diego Moreno
  • Dino Conciatore
  • Dirk Lipinski
  • Enrico Favero
  • Filippo Stenico
  • Gianfranco Sciacca
  • Giuseppe Lo Re
  • Gottardo Pestalozzi
  • Gunnar Jansen
  • Hardik Kothari
  • Heinrich Billich
  • Jani Heikkinen
  • Jean-Baptiste Aubort
  • Jean-Claude De Giorgi
  • Luca Capello
  • Luca Cervigni
  • Markus Reinhardt
  • Martin Jacquot
  • Massimo Brero
  • Mattia Belluco
  • Michele De Lorenzi
  • Miguel Gila
  • Nick Holway
  • Nicolas Kowenski
  • Nina Mujkanovic
  • Patrick Zosso
  • Raluca Hodoroaba
  • Riccardo Murri
  • Roberto Fabbretti
  • Silvan Hostettler
  • Steven Armstrong
  • Szymon Gadomski
  • Thomas Kramer
  • Urban Borštnik
  • Victor Holanda Rusu
  • Warren Paulus
  • Wolfgang Zipfel
  • Yann Sagon
    • 09:30
      Coffee and registration
    • Welcome and Introduction
      Conveners: Michele De Lorenzi (CSCS), Thomas Kramer (WSL)
    • Short Portrait of WSL

      WSL is a research institute of the Swiss Confederation. As a research institute of the ETH Domain, it is required by the Confederation to deliver cutting-edge research and social benefits, particularly for Switzerland. One of WSL's important national functions is to conduct the Swiss National Forest Inventory (NFI) and long-term forest ecosystem monitoring (LWF). WSL is particularly active in applied research, but basic research is also among its duties. SLF employees develop tools and guidelines for authorities, industry and the public to support them in natural hazard risk management and in the analysis of climatic and environmental changes.

      Research for People and the Environment:

      1. WSL explores the dynamics of the terrestrial environment and the use and protection of natural habitats and cultural landscapes
      2. WSL monitors forests, landscapes, biodiversity, natural hazards and snow and ice
      3. WSL develops sustainable solutions for socially relevant issues - together with its partners from science and society
      Convener: Gottardo Pestalozzi (WSL)
    • Logs and Metrics Collection and Visualization at CSCS

      As the complexity and scale of systems increase, so does the amount of system-level data recorded.

      Managing the vast amounts of log and metrics data is a challenge that CSCS solved by introducing a centralized log and metrics infrastructure based on Elasticsearch, Graylog, Kibana, and Grafana. This fundamental service at CSCS makes it easy to correlate events, bridging the gap between computational workloads and nodes and enabling failure diagnosis.

      Currently, the Elasticsearch cluster at CSCS is handling more than 30'000'000'000 online documents (one year) and another 40'000'000'000 archived (two years). The integrated environment from logging and metrics to graphical representation enables powerful dashboards and monitoring displays.
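      To put the online figure in perspective, the short calculation below estimates the average ingest rate it implies. It assumes that "one year" means the online documents were ingested over roughly one year, which the abstract does not state explicitly, so the result is only an order-of-magnitude estimate.

```python
# Back-of-the-envelope average ingest rate implied by the online document count.
# Assumption: the 30 billion online documents were ingested over about one year.
ONLINE_DOCS = 30_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def average_docs_per_second(total_docs: int, seconds: int) -> float:
    """Return the mean number of documents ingested per second."""
    return total_docs / seconds


rate = average_docs_per_second(ONLINE_DOCS, SECONDS_PER_YEAR)
print(f"average ingest rate: about {rate:.0f} documents per second")
```

      Under that assumption the average is on the order of a thousand documents per second; peak rates can be considerably higher than the average.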

      Convener: Dino Conciatore (CSCS)
    • Data Analysis for Improving High Performance Computing Operations and Research

      This project addresses the challenge of HPC data analysis in a reproducible and legal manner. The data originates in the HPC systems of the consortium: NEMO at University of Freiburg (NEMO-UniFR), sciCORE at University of Basel (sciCORE-UniBas), and HPC at University of Strasbourg (HPC-UniStra).
      The goals of the project are to analyze the data collected at NEMO-UniFR since July 2016 to improve its research and operations activities, and to offer monitoring, operational, and research insights that also improve the sciCORE-UniBas and HPC-UniStra activities.
      The proposed approach entails: monitoring of systems and applications; legal compliance via de-identification and anonymization; and data analysis. Monitoring data is collected in various types, formats, and sizes.
      Meaningful integration of the various types and formats is a significant challenge. This challenge can be addressed by ensuring that the HPC monitoring data follows the FAIR (findable, accessible, interoperable, and reusable) data principles already in the data collection stage.
      The outcome will be solutions that improve the HPC operations and research of the three Eucor HPC centers while satisfying data protection and privacy requirements.
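      As an illustration of the de-identification step, the following minimal sketch replaces user-identifying fields in a job record with keyed hashes (HMAC-SHA256). It is not the project's actual pipeline: the field names, the key and the sample record are hypothetical. Note that keyed hashing is pseudonymization, which on its own does not amount to full anonymization.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be generated randomly and
# stored securely, separately from the monitoring data.
SECRET_KEY = b"replace-with-a-securely-stored-key"

# Hypothetical field names, chosen only for illustration.
SENSITIVE_FIELDS = ("user", "account")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable keyed hash of value (HMAC-SHA256, first 16 hex chars)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def deidentify(record: dict) -> dict:
    """Return a copy of record with sensitive fields replaced by pseudonyms."""
    cleaned = dict(record)
    for field in SENSITIVE_FIELDS:
        if field in cleaned:
            cleaned[field] = pseudonymize(str(cleaned[field]))
    return cleaned


job = {"job_id": 1234, "user": "alice", "account": "proj42", "cpu_hours": 17.5}
print(deidentify(job))
```

      Because the same input always maps to the same pseudonym, jobs from one user can still be grouped for analysis without exposing the user name.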

      Convener: Aurélien Cavelan (University of Basel)
    • Daint, logging at 2000 messages per second

      Our flagship supercomputer, Piz Daint, produces a huge amount of logging data every second: environmental data, hardware counters, filesystem performance, scheduler data, etc. We store and process all this information in seconds, and produce plots that help us better understand the system and make sounder business decisions.

      In this presentation, I will show how we use Elasticsearch to collect this data and extract useful information from it, and also how you can use it to store even more relevant data for your use case.

      Convener: Miguel Gila (CSCS)
    • 12:05
      Lunch and Networking
    • Monitoring System and User Applications Performance

      The number of components and subsystems in today’s High Performance Computing systems makes it hard to understand application performance fluctuations and determine root causes when performance is not what is expected. Changes in the OS, programming environment or system components, hardware failures, resource oversubscription or poorly written applications can all contribute to bad application performance.

      This problem is exacerbated by the challenges that monitoring poses to systems and application engineers, who must continuously maintain regression tests and tools that cover and mine as much information as possible about the user experience. In general, this requires accessing information from multiple sources, such as system subcomponents, user job scripts and regression logs. At CSCS, we use ReFrame, a regression framework designed to facilitate the writing and maintenance of regression tests, together with XALT, which tracks user executables running on a cluster. Both tools send their data to the centralized log and metrics infrastructure at CSCS.

      In this presentation, we will show how we use the data acquired by ReFrame's performance tests together with XALT's data to understand the status of our flagship system, Piz Daint, and whether it is undergoing any change in performance regime.
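      One simple way to look for such a change, sketched here purely as an illustration and not as the method used at CSCS or as a feature of ReFrame or XALT, is to compare the mean of recent performance measurements against a baseline. The function name, the threshold and the sample values below are assumptions.

```python
from statistics import mean, stdev


def regime_changed(baseline, recent, threshold=3.0):
    """Return True if the recent mean lies more than `threshold` baseline
    standard deviations away from the baseline mean."""
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least two baseline values and one recent value")
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        # A perfectly flat baseline: any deviation counts as a change.
        return mean(recent) != base_mean
    z_score = abs(mean(recent) - base_mean) / base_std
    return z_score > threshold


# Hypothetical performance values (for example, GB/s) from a regression test.
baseline = [100.2, 99.8, 100.5, 99.9, 100.1, 100.0]
recent_ok = [100.3, 99.7, 100.1]
recent_slow = [92.0, 91.5, 92.3]

print(regime_changed(baseline, recent_ok))
print(regime_changed(baseline, recent_slow))
```

      A fixed z-score threshold is a deliberately simple choice; it ignores trends and seasonality, which a production analysis would probably need to handle.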

      Convener: Victor Holanda Rusu (CSCS)
    • Achieving High Service Availability for HPC: Monitoring and Logging on ETH Clusters

      Achieving a high level of service quality and availability is a key goal for the central clusters of ETH. Monitoring collected logs and metrics and acting upon them is crucial to meeting this goal.
      This presentation will focus on our solutions for automating cluster maintenance.
      One is the Cluster Monkey tools, which act upon event- and metrics-driven triggers. Another is storage monitoring, which helps our users improve their data workflows and gives us insights into our upcoming storage platform.
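      The general idea of metrics-driven triggers can be sketched as a list of conditions paired with actions. This is a generic illustration and not a description of how the Cluster Monkey tools are implemented; all names, metrics and thresholds are hypothetical, and the actions only return a description instead of touching a real cluster.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trigger:
    """A named condition on node metrics paired with an action to take."""
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[str], str]


def drain_node(node: str) -> str:
    # Placeholder action: a real tool would call the scheduler here.
    return f"drain {node}"


def notify_admins(node: str) -> str:
    # Placeholder action: a real tool would send a message or open a ticket.
    return f"notify admins about {node}"


# Hypothetical metrics and thresholds.
TRIGGERS = [
    Trigger("disk almost full", lambda m: m.get("disk_used_pct", 0) >= 95, notify_admins),
    Trigger("memory errors", lambda m: m.get("ecc_errors", 0) > 0, drain_node),
]


def evaluate(node: str, metrics: dict, triggers=TRIGGERS) -> list:
    """Return the actions fired for one node, in trigger order."""
    return [t.action(node) for t in triggers if t.condition(metrics)]


print(evaluate("node042", {"disk_used_pct": 97, "ecc_errors": 0}))
print(evaluate("node043", {"disk_used_pct": 40, "ecc_errors": 3}))
```

      Keeping conditions and actions as plain functions makes each trigger easy to test in isolation before it is allowed to act on production nodes.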

      Conveners: Diego Moreno (ETH Zurich), Urban Borštnik (ETH Zurich)
    • 14:45
      Coffee Break
    • Visit of WSL: Long-term Forest Ecosystem Research (LWF)
      Convener: Gottardo Pestalozzi
    • Farewell and end of the meeting