Virtualization - Digital Event

Europe/Zurich
Michele De Lorenzi (ETH Zurich / CSCS), Roberto Fabbretti (University of Lausanne)
Description

Although the Federal Council announced at its latest press conference the easing of some restrictions imposed in response to the coronavirus pandemic, the forum will be held in digital format only, in order to keep the risk of the virus spreading as low as possible.

Introduction

Traditional HPC centres have mainly been used to solve scientific problems posed by the so-called "hard" sciences such as physics, chemistry, or astrophysics. Their infrastructures account for most of the compute and storage resources available on most university campuses.

However, the arrival of life sciences, in particular medical sciences, and more recently human sciences has "generated" a new class of users for whom the use of traditional tools (command line, scripting in Bash or Python) is considered an insurmountable obstacle.

Nevertheless, the data analysis needs of these neophyte users are growing exponentially. Tools that try to simplify these tasks exist, but they remain outside the scope of the applications available on traditional HPC resources. This has led these researchers to install many powerful workstations under their desks that are never used optimally, are problematic to manage, and are not cost-effective.

These "novice" users are nevertheless the main group of researchers using scientific computing resources in generalist universities. It is important to address their needs by providing them with adequate resources in a timely manner, without burdening IT personnel with the management and support of heterogeneous and geographically scattered systems.

Key Questions

  • System virtualisation: The main goal of system virtualisation is to optimise the use of compute cycles and memory by pooling them in a centrally managed infrastructure. In which cases are virtualisation technologies relevant to the audience described in the introduction? Which virtualisation technology do you use? How do you efficiently deploy workstations and servers? How do you adapt the software stacks to the different needs of the users? How do you manage the lifecycle of these VMs?
     
  • On-premises cloud technologies: Cloud technologies bring an unprecedented ease of use to deploying applications on virtual infrastructures. However, on-premises cloud infrastructures are usually very complex and require a highly skilled team to manage and support the underlying systems. What are your use cases for on-premises cloud infrastructures? How do they compare with standard virtualization technologies? In the rapidly evolving landscape of cloud technologies, which one do you use? Have you considered an exit strategy in case the selected product is no longer actively maintained, or is only available commercially at outrageously expensive licensing fees?
     
  • Containers: The demand to support containers for scientific computing applications is growing. Several container technologies and orchestration platforms are competing with each other. To name a few, we can cite Docker, Singularity and Mesos for container technologies, and Swarm, Kubernetes and Chronos/Marathon for orchestration. What are the most suitable container solutions for scientific computing applications? Which scientific computing use cases are better run on a container cluster than on a classical HPC machine? Do we need to manage two different clusters for classical HPC services and container services, or can we efficiently blend the two classes of computing services on the same cluster? Combining containers with GPU computing is an ongoing trend. Through its NGC Catalog, NVIDIA supplies and supports a set of curated, GPU-optimized containerized applications running on CUDA for AI, HPC and visualization. Should we build our GPU container platforms around NVIDIA-Certified Systems to take advantage (seamlessly, as promised by the vendors) of the application packaging and distribution solution offered by NGC, or should we build our own vanilla GPU container clusters and packaging platforms?
     
  • File systems on demand: HPC infrastructures generally rely on massive central filesystems attached to all compute nodes in order to provide high throughput and IOPS. However, in some cases these filesystems themselves become the bottleneck that slows down data analysis. Some filesystem technologies make it possible to federate the local disks of the compute nodes into a transient filesystem, usually at the user's request, in order to keep that data traffic separate from the main storage. Do you think that such technologies are useful in an HPC environment? Which technologies have you tested, and for which use cases? A similar question arises on the container side. Docker, Mesos and Kubernetes support on-demand persistent storage through CSI (the Container Storage Interface). Several HPC storage vendors and products, such as IBM Spectrum Scale, WekaIO Matrix, BeeGFS and VAST Data, among many others, provide drivers that connect their storage platforms to CSI. This provides on-demand persistent file or block storage services to containers. Is this technology mature enough to be used in a production environment? Is anyone using it, and if so, what are the use cases?
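As an illustration of the on-demand persistent storage mentioned in the last question, the following minimal sketch builds a Kubernetes PersistentVolumeClaim manifest in Python. It only shows how a container workload might request storage from a CSI-backed storage class. The claim name `analysis-scratch`, the storage class `hpc-parallel-fs` and the size are hypothetical placeholders, not names taken from any of the products listed above; the storage class would have to be defined by the cluster administrator and bound to the vendor's CSI driver.

```python
import json


def make_pvc(name, storage_class, size, access_mode="ReadWriteMany"):
    """Build a Kubernetes PersistentVolumeClaim manifest as a plain dict.

    storage_class is assumed to point at a StorageClass backed by a CSI
    driver; the value used below is a hypothetical placeholder.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets several pods mount the volume at once,
            # which is the usual expectation for a shared filesystem.
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names, for illustration only.
pvc = make_pvc("analysis-scratch", "hpc-parallel-fs", "500Gi")

# kubectl accepts JSON as well as YAML manifests.
manifest_json = json.dumps(pvc, indent=2)
print(manifest_json)
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`; whether the claim is actually satisfied depends entirely on the CSI driver and storage class installed on the cluster.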

 

Participants
  • Achim Gsell
  • Adam Henderson
  • Alberto Madonna
  • Alexander Kashev
  • Alexandre Wetzel
  • Aliaksandr Yakutovich
  • Allen Neeser
  • Angelo Mangili
  • Anibal Moreno
  • Antonio Russo
  • Arnaud Fortier
  • Arnaud Hungler
  • Bastian Bukatz
  • Carlo Pignedoli
  • Chris Gamboni
  • Christian Bolliger
  • Darren Reed
  • Diego Moreno
  • Edoardo Baldi
  • Emmanuel Jeanvoine
  • Enrico Favero
  • Etienne Orliac
  • Ewan Roche
  • Felix Armborst
  • Filippo Stenico
  • Fotis Georgatos
  • Gilles Fourestey
  • Guilherme Peretti-Pezzi
  • Hardik Kothari
  • Heinz Stockinger
  • Henry Luetcke
  • Ilya Kolpakov
  • Ioannis Xenarios
  • Jan Wender
  • Jani Heikkinen
  • Jean-Baptiste Aubort
  • Jean-Claude De Giorgi
  • Lorenzo Cerutti
  • Luca Capello
  • Marcel Riedi
  • Marco Barroso
  • Mario Jurcevic
  • Martin Jacquot
  • Mattia Belluco
  • Maxime Martinasso
  • Mei-Chih Chang
  • Michael Rolli
  • Michele De Lorenzi
  • Nick Holway
  • Nicolas Alejandro Kowenski
  • Nicolas Buchschacher
  • Nicolas Kowenski
  • Olivier Byrde
  • Patrick Zosso
  • Pierre Berthier
  • Radim Janalík
  • Raluca Hodoroaba
  • Remy Ressegaire
  • Rene Windiks
  • Ricardo Silva
  • Roberto Fabbretti
  • Roman Briskine
  • Samuel Fux
  • Sean Hughes
  • Sofiane Sarni
  • Stefan Weber
  • Steven Armstrong
  • Sudershan Lakshmanan
  • Szymon Gadomski
  • Sébastien Moretti
  • Thierry Schuepbach
  • Thomas Chen
  • Thomas Kramer
  • Thomas Wüst
  • Urban Borštnik
  • Yann Sagon
Timetable
    • 10:00 10:15
      Welcome and Introduction
      Conveners: Michele De Lorenzi (CSCS), Roberto Fabbretti (University of Lausanne)
    • 10:15 11:00
      Keynote Presentation: Health2030 Genome Center: Data-Centric Virtualization Use Cases in Genomic Medicine

      Dataset sizes mandate bringing processing as close as possible to the data. At the Genome Center, the whole workflow, from data generation to final delivery to the customer, including custom and potentially external tweaking, has been centered around the data, complying with the highest standards of confidentiality.

      Conveners: Arnaud Hungler (Health2030), Ioannis Xenarios (Health2030)
    • 11:00 11:30
      Automated Provisioning of Virtual Workstations and Servers for a Broad Audience of Researchers 30m
      Speaker: Roberto Fabbretti (University of Lausanne)
    • 11:30 12:00
      Containers and Virtualization for the HPC Cluster 30m

      A story of how we performed an extreme culture and technology shift, from a mix of technologies to a fully Kubernetes-based infrastructure, including virtualization with a cloud-native, cutting-edge approach.

      Speaker: Nicolas Alejandro Kowenski (ETH Zurich)
    • 12:00 12:30
      Sarus: Highly Scalable Docker Containers for HPC Systems 30m

      Sarus is a container engine for HPC systems that provides the means to instantiate high-performance containers from Docker images. It has been designed to address the unique requirements of HPC containers, such as integration with hardware accelerators, quick deployments at scale, security and permissions enforcement on multi-tenant hosts, and parallel filesystems. Sarus leverages the Open Container Initiative (OCI) specifications to extend the capabilities of a standard runtime through dedicated hook programs, implementing container customization at deployment time and without the user's intervention.

      This presentation will highlight how OCI hooks can enable portable, infrastructure-agnostic images to achieve native performance on top of HPC-specific devices such as GPUs and high-speed interconnects.

      Thanks to their standalone nature and standard interface, OCI hooks can be independently developed to target specific features, and can be configured according to the characteristics of particular host systems. The same container image can thus be used across the whole development workflow, from early tests on a personal workstation to deployments at scale on flagship systems, while benefiting from the advantages of each platform.
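To make the hook mechanism described in this abstract more concrete, here is a minimal sketch of how a hook is declared in the `hooks` section of an OCI runtime `config.json`, written as a small Python helper. It illustrates the OCI Runtime Specification only and is not Sarus's own configuration format; the hook path, argument and environment variable are hypothetical placeholders, and the resulting dict is a fragment rather than a complete bundle configuration.

```python
import json

# Hook stages defined by the OCI Runtime Specification for POSIX platforms.
OCI_HOOK_STAGES = {
    "prestart",  # deprecated in recent spec versions, still widely supported
    "createRuntime",
    "createContainer",
    "startContainer",
    "poststart",
    "poststop",
}


def add_hook(config, stage, path, args=None, env=None):
    """Append a hook entry to the 'hooks' section of an OCI config dict."""
    if stage not in OCI_HOOK_STAGES:
        raise ValueError(f"unknown OCI hook stage: {stage}")
    hook = {"path": path}  # must be an absolute path on the host
    if args:
        hook["args"] = list(args)
    if env:
        hook["env"] = list(env)  # entries use the KEY=value form
    config.setdefault("hooks", {}).setdefault(stage, []).append(hook)
    return config


# Hypothetical hook that would make host GPU libraries visible in the container.
config = {"ociVersion": "1.0.2"}
add_hook(
    config,
    "prestart",
    "/opt/example/hooks/gpu_hook",
    args=["gpu_hook"],
    env=["EXAMPLE_HOST_LIB_DIR=/usr/lib64"],
)
print(json.dumps(config, indent=2))
```

Because each hook is a standalone executable referenced only by path, a site can enable or disable individual hooks per host system without rebuilding the container image.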

      Speaker: Alberto Madonna (CSCS)
    • 12:30 13:00
      Lunch 30m
    • 13:00 13:30
      Virtual Coffee

      Networking tool: SpatialChat

      This tool simulates the space of a room, where people can move and join discussions in small or large groups, as well as have one-on-one conversations.

    • 13:30 14:00
      AiiDAlab – an Ecosystem for Developing, Executing, and Sharing Scientific Workflows 30m

      Cloud platforms allow users to execute tasks directly from their web browser and are a key enabling technology not only for commerce but also for computational science. Research software is often developed by scientists with limited experience in (and time for) user interface design, which can make research software difficult to install and use for novices. When combined with the increasing complexity of scientific workflows (involving many steps and software packages), setting up a computational research environment becomes a major entry barrier.

      AiiDAlab is a web platform that enables computational scientists to package scientific workflows and computational environments and share them with their collaborators and peers. By leveraging the AiiDA workflow manager and its plugin ecosystem, developers get access to a growing range of simulation codes through a Python API, coupled with automatic provenance tracking of simulations for full reproducibility.

      Computational workflows can be bundled together with user-friendly graphical interfaces and made available through the AiiDAlab app store. Being fully compatible with open-science principles, AiiDAlab provides a complete infrastructure for automated workflows and provenance tracking, where incorporating new capabilities becomes intuitive, requiring only Python knowledge.

      Speaker: Aliaksandr Yakutovich (EPFL)
    • 14:00 14:30
      Dynamic Provisioning of Storage Resources: A Case Study with Burst Buffers 30m

      The needs of complex applications and workflows are often expressed exclusively in terms of computational resources on HPC systems. In many cases, other resources such as storage or network are not allocatable and are shared across the entire HPC system. Looking at storage resources in particular, any workflow or application should be able to select both its preferred data manager and its required storage capability or capacity. To achieve this goal, new mechanisms need to be introduced. In this work, we present a tool that dynamically provisions a data management system on top of storage devices. We propose a proof-of-concept that is able to deploy, on demand, a parallel filesystem across intermediate storage nodes on a Cray XC50 system. We show how this mechanism can easily be extended to support more data managers and any type of intermediate storage. Finally, we evaluate the performance of the provisioned storage system with a set of benchmarks.

      Speaker: Maxime Martinasso (CSCS)
    • 14:30 15:00
      Cloud Bursting: a First Experience 30m

      We will report on a proof-of-concept carried out by SCITAS using Google Cloud Platform (GCP) resources to analyze the state of the art and evaluate the capabilities available in GCP.

      We evaluated both the technical feasibility and the performance of the solutions from an HPC perspective (including performance benchmarks). We defined a semi-automatic procedure to deploy, in less than 10 minutes, a fully usable cluster.

      The following architectures were evaluated:

      • a cluster in the cloud, offering the same software stack as our bare-metal clusters (external elastic computing)

      • a hybrid approach with an on-site headnode and dynamic allocation of compute nodes in the cloud (hybrid elastic computing)

      • a hybrid approach with an on-site compute node using accelerators in the cloud (external compute resources)

      Speaker: Ricardo Silva (EPFL)
    • 15:00 15:30
      Community Development 30m

      Tool: MURAL (a digital workspace for visual collaboration)

      Speakers: Christian Bolliger (ETH Zurich), Michele De Lorenzi (CSCS)
    • 15:30 15:35
      Farewell and End of the Meeting 5m