Logging and Monitoring

Name: Logging and Monitoring
Start: 2019-10-03T09:00:00+02:00
End: 2019-10-03T17:00:00+02:00
Location: WSL Birmensdorf

Thursday 3 Oct 2019, 09:00 → 17:00 Europe/Zurich

Room Englersaal (WSL Birmensdorf)

Zürcherstrasse 111 8903 Birmensdorf

Michele De Lorenzi (CSCS), Thomas Kramer (WSL)

Description

Introduction

High Performance Computing systems generate a huge amount of log and metric data during operation: information about resource utilization, performance, failures, errors and so on is worth storing and analyzing.

This kind of data is often unstructured and not easily comprehensible: finding correlations, recognizing meaningful events and discarding false positives are common challenges that all HPC centers have to face.

The reward is worth the effort: post-mortem investigation, troubleshooting of problems and incidents, security threat hunting, early warning and alerting, application performance analysis and evaluation of resource utilization all benefit from careful processing of log and metric data.

A thorough understanding of the underlying infrastructure producing this information is essential to make sense of it, especially considering the complex hardware and software stack that modern large-scale systems comprise.

Key Questions

What are the benefits of collecting log and metric data?
How can logs from different systems be correlated?
Centralized collection of logs and metrics: challenges and returns
Are logs and metrics Big Data?
How can the increasing complexity of multi-layered architectures (virtualization, containers, etc.) be tackled?
Threat Intelligence: Proactively identifying unusual network activities and unauthorized accesses
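
As an illustration of the question of correlating logs from different systems, the following is a minimal sketch and not a method presented at the meeting. It assumes that each source delivers time-ordered events with ISO 8601 timestamps; the sample events, the source names and the `parse` and `correlate` helpers are all hypothetical. Timestamps are converted to timezone-aware values so that sources logging in different time zones can be compared, and events are then paired when they fall within a short time window.

```python
from datetime import datetime, timedelta
from heapq import merge

# Hypothetical sample events: (ISO 8601 timestamp, source, message).
# The two sources deliberately log in different time zones.
scheduler_log = [
    ("2019-10-03T10:00:01+02:00", "scheduler", "job 42 started on node n001"),
    ("2019-10-03T10:05:30+02:00", "scheduler", "job 42 failed"),
]
node_log = [
    ("2019-10-03T08:05:28+00:00", "n001", "filesystem timeout"),
    ("2019-10-03T09:30:00+00:00", "n001", "fan speed warning"),
]


def parse(events):
    """Turn timestamp strings into timezone-aware datetimes."""
    return [(datetime.fromisoformat(ts), src, msg) for ts, src, msg in events]


def correlate(anchor_events, other_events, window=timedelta(seconds=5)):
    """Pair each anchor event with events from the other source
    that occurred within +/- window of it."""
    pairs = []
    for a_ts, _, a_msg in anchor_events:
        for o_ts, _, o_msg in other_events:
            if abs(a_ts - o_ts) <= window:
                pairs.append((a_msg, o_msg))
    return pairs


scheduler_events = parse(scheduler_log)
node_events = parse(node_log)

# A single timeline across both sources (each input is already time-ordered).
timeline = list(merge(scheduler_events, node_events))

# Events from the two sources that happened close together in time.
pairs = correlate(scheduler_events, node_events)

for ts, src, msg in timeline:
    print(ts.isoformat(), src, msg)
print(pairs)
```

The nested loop keeps the sketch short; at the data volumes discussed in this meeting, such a join would more plausibly be done inside a central log store than in a script.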

Participants

Alex Upton
Alexander Kashev
Alvise Dorigo
Andrei Plamada
Arnaud Fortier
Aurélien Cavelan
Carlo Pignedoli
Christoph Witzig
Cristian Scurtescu
Diego Moreno
Dino Conciatore
Dirk Lipinski
Enrico Favero
Filippo Stenico
Gianfranco Sciacca
Giuseppe Lo Re
Gottardo Pestalozzi
Gunnar Jansen
Hardik Kothari
Heinrich Billich
Jani Heikkinen
Jean-Baptiste Aubort
Jean-Claude De Giorgi
Luca Capello
Luca Cervigni
Markus Reinhardt
Martin Jacquot
Massimo Brero
Mattia Belluco
Michele De Lorenzi
Miguel Gila
Nick Holway
Nicolas Kowenski
Nina Mujkanovic
Patrick Zosso
Raluca Hodoroaba
Riccardo Murri
Roberto Fabbretti
Silvan Hostettler
Steven Armstrong
Szymon Gadomski
Thomas Kramer
Urban Borštnik
Victor Holanda Rusu
Warren Paulus
Wolfgang Zipfel
Yann Sagon

Support

raluca.hodoroaba@cscs.ch

- 09:30 → 10:00
  
  Coffee and registration 30m
- 10:00 → 10:15
  
  Welcome and Introduction
  
  Conveners: Michele De Lorenzi (CSCS), Thomas Kramer (WSL)
  
  20191003_092606.jpg
  
  Welcome_01.jpeg
  
  Welcome_02.jpeg
- 10:15 → 10:35
  Short Portrait of WSL
  WSL is a research institute of the Swiss Confederation. As part of the ETH Domain, WSL is required by the Confederation to provide cutting-edge research and social benefits, particularly for Switzerland. One of WSL's important national functions is to conduct the Swiss National Forest Inventory (NFI) and long-term forest ecosystem monitoring (LWF). WSL is particularly active in applied research, but basic research is also among its duties. SLF employees develop tools and guidelines for authorities, industry and the public to support them in natural hazard risk management and in the analysis of climatic and environmental changes.
  
  Research for People and the Environment:
  1. WSL explores the dynamics of the terrestrial environment and the use and protection of natural habitats and cultural landscapes
  2. WSL monitors forests, landscapes, biodiversity, natural hazards and snow and ice
  3. WSL develops sustainable solutions for socially relevant issues - together with its partners from science and society
  Convener: Gottardo Pestalozzi (WSL)
  
  Gottardo Pestalozzi.JPG
- 10:35 → 11:05
  
  Logs and Metrics Collection and Visualization at CSCS
  
  As the complexity and scale of systems increase, so does the amount of system-level data recorded.
  
  Managing the vast amounts of log and metrics data is a challenge that CSCS solved by introducing a centralized log and metrics infrastructure based on Elasticsearch, Graylog, Kibana, and Grafana. This is a fundamental service at CSCS that provides easy correlation of events, bridging the gap between computational workloads and nodes and thereby enabling failure diagnosis.
  
  Currently, the Elasticsearch cluster at CSCS is handling more than 30'000'000'000 online documents (one year) and another 40'000'000'000 archived (two years). The integrated environment from logging and metrics to graphical representation enables powerful dashboards and monitoring displays.
  
  Convener: Dino Conciatore (CSCS)
  
  Dino Conciatore_01.jpg
  
  Dino Conciatore_02.jpg
  
  Logs and Metrics Collector CSCS HPC-ch.pdf
- 11:05 → 11:35
  
  Data Analysis for Improving High Performance Computing Operations and Research
  
  This project addresses the challenge of analyzing HPC data in a reproducible and legally compliant manner. The data originates in the HPC systems of the consortium: NEMO at University of Freiburg (NEMO-UniFR), sciCORE at University of Basel (sciCORE-UniBas), and HPC at University of Strasbourg (HPC-UniStra).
  The goals of the project are to analyze the data collected at NEMO-UniFR since July 2016 to improve its research and operations activities, and to offer monitoring, operational, and research insights that also improve the sciCORE-UniBas and HPC-UniStra activities.
  The proposed approach entails monitoring of systems and applications, legal compliance via de-identification and anonymization, and data analysis. Monitoring data is collected in various types, formats, and sizes.
  Meaningful integration of the various types and formats is a significant challenge. This challenge can be addressed by ensuring that the HPC monitoring data follows the FAIR (findable, accessible, interoperable, and reusable) data principles from the data collection stage onwards.
  The outcome will be solutions that improve the HPC operations and research of three Eucor HPC centers while satisfying data protection and privacy requirements.
  
  Convener: Aurélien Cavelan (University of Basel)
  
  Aurélien Cavelan_02.JPG
  
  Aurélien Cavelan_03.jpg
  
  Aurélien Cavelan.jpg
  
  Presentation_Cavelan.pdf
- 11:35 → 12:05
  
  Daint, logging at 2000 messages per second
  
  Our flagship supercomputer, Piz Daint, produces a huge amount of logging data every second: environmental data, hardware counters, filesystem performance, scheduler data, etc. We store and process all this information in seconds, and produce plots that help us better understand the system and make sounder business decisions.
  
  In this presentation, I will show how we use Elasticsearch to collect and extract useful information from this data, and also how you can use it to store even more relevant data for your use case.
  
  Convener: Miguel Gila (CSCS)
  
  20191003_HPC_CH_LOGGING_DAINT_LOGGING_AT_2000_MSG_SEC.pdf
  
  Miguel Gila_01.jpg
  
  Miguel Gila_02.jpg
- 12:05 → 13:15
  
  Lunch and Networking 1h 10m
- 13:15 → 13:45
  
  Monitoring System and User Applications Performance
  
  The number of components and subsystems in today's High Performance Computing systems makes it hard to understand application performance fluctuations and to determine root causes when performance is not what is expected. Changes in the OS, programming environment or system components, hardware failures, resource oversubscription and poorly written applications can all contribute to bad application performance.
  
  This problem is exacerbated by the challenge that monitoring poses to systems and application engineers, who must continuously maintain regression tests and tools that capture and mine as much information as possible about the user experience. In general, this requires access to information from multiple sources, such as system subcomponents, user job scripts and regression logs. At CSCS, we use ReFrame, a regression framework designed to make regression tests easier to write and maintain, together with XALT, which tracks the user executables running on a cluster. Both tools send their data to the centralized log and metrics infrastructure at CSCS.
  
  In this presentation, we will show how we use the data acquired by ReFrame's performance tests together with XALT's data to understand the status of our flagship system, Piz Daint, and whether it is undergoing any change in performance regime.
  
  Convener: Victor Holanda Rusu (CSCS)
  
  hpc_ch-logging_monitoring_export.pdf
  
  Victor Holanda Rusu_01.jpg
  
  Victor Holanda Rusu_02.jpg
- 13:45 → 14:15
  
  Achieving High Service Availability for HPC: Monitoring and Logging on ETH Clusters
  
  Achieving a high level of service quality and availability is a key goal of the central clusters of the ETH. Monitoring collected logs and metrics, and acting upon them, is crucial to meeting this goal.
  This presentation will focus on our solutions for automating cluster maintenance.
  One is the Cluster Monkey tools, which act upon event- and metrics-driven triggers. Another is storage monitoring, which helps our users improve their data workflows and gives us insights into our upcoming storage platform.
  
  Conveners: Diego Moreno (ETH Zurich), Urban Borštnik (ETH Zurich)
  
  2019-10-03-LMETH.pdf
  
  Urban Borstnik_01.jpg
  
  Urban Borstnik_02.jpg
- 14:15 → 14:45
  
  Community Development
  
  Community Development_01.jpg
  
  Community Development_02.jpg
  
  Community Development_03.jpg
  
  Community Development_05.jpg
  
  Community Development_06.jpg
- 14:45 → 15:00
  
  Coffee Break 15m
- 15:00 → 15:45
  
  Visit of WSL: Long-term Forest Ecosystem Research (LWF)
  
  Convener: Gottardo Pestalozzi
  
  20191003_095235.jpg
  
  Visit_01.jpg
  
  Visit_02.jpg
  
  Visit_03.jpg
  
  Visit_04.jpg
  
  Visit_06.jpg
- 15:45 → 15:50
  
  Farewell and end of the meeting
