High Performance Computing systems generate a huge amount of log and metric data during their operation: information about resource utilization, performance, failures, and errors is worth storing and analyzing.
This kind of data is often unstructured and not easily comprehensible: finding correlations, recognizing meaningful events, and discarding false positives are common challenges all HPC centers have to face.
The reward is worth the effort: post-mortem investigation, problem and incident troubleshooting, security threat hunting, early warning and alerting, application performance analysis, and evaluation of resource utilization are all contexts that benefit from careful processing of log and metric data.
A thorough understanding of the underlying infrastructure producing this information is essential to make sense of it, especially considering the complex hardware and software stack that modern large-scale systems comprise. Topics of interest include:
- What are the benefits of collecting log and metric data?
- How to correlate logs from different systems (see the correlation sketch after this list)
- Centralized collection of logs and metrics: challenges and rewards
- Are logs and metrics Big Data?
- How to tackle the increasing complexity of multi-layered architectures (virtualization, containers, etc.)
- Threat Intelligence: proactively identifying unusual network activity and unauthorized access (a minimal detection sketch follows below)
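As an illustration of the correlation topic above, the following is a minimal sketch of time-window correlation between two log sources. It assumes a hypothetical syslog-like line format (`<ISO timestamp> <host> <message>`) and invented sample messages; real log formats, clock-skew handling, and indexing strategies vary widely across sites.

```python
import re
from datetime import datetime, timedelta

# Hypothetical line format: "<ISO timestamp> <host> <message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<msg>.*)$")

def parse(lines):
    """Yield (timestamp, host, message) tuples from raw log lines."""
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            yield datetime.fromisoformat(m["ts"]), m["host"], m["msg"]

def correlate(events_a, events_b, window=timedelta(seconds=30)):
    """Pair events from two sources that hit the same host within `window`."""
    events_b = list(events_b)  # materialize so we can scan it repeatedly
    matches = []
    for ts_a, host_a, msg_a in events_a:
        for ts_b, host_b, msg_b in events_b:
            if host_a == host_b and abs(ts_a - ts_b) <= window:
                matches.append((host_a, ts_a, msg_a, msg_b))
    return matches

# Invented sample lines from a job scheduler and the kernel log
scheduler_logs = ["2023-04-01T12:00:00 node042 slurmd: job 1234 failed"]
kernel_logs = ["2023-04-01T12:00:03 node042 kernel: EDAC MC0: UE memory error"]

for host, ts, a, b in correlate(parse(scheduler_logs), parse(kernel_logs)):
    print(f"{ts} {host}: '{a}' coincides with '{b}'")
```

The nested scan is quadratic and only suitable for small batches; at the scale of a full HPC center, events are typically indexed by host and time bucket first, which is exactly the kind of centralized-collection challenge raised above.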
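For the threat-intelligence topic, one simple starting point is a baseline comparison that flags logins from user/source combinations never seen before. The sketch below uses invented `(user, source_ip)` tuples; a real deployment would build the baseline from parsed authentication logs and account for legitimate new sources.

```python
# Hypothetical parsed authentication events: one (user, source_ip) per login
baseline = {("alice", "10.0.0.5"), ("bob", "10.0.1.9")}  # pairs seen historically
recent = [("alice", "10.0.0.5"), ("alice", "203.0.113.7"), ("bob", "10.0.1.9")]

def flag_unusual(events, known_pairs):
    """Flag logins whose (user, source) pair never appeared in the baseline."""
    return [e for e in events if e not in known_pairs]

for user, ip in flag_unusual(recent, baseline):
    print(f"unusual access: user '{user}' from {ip}")
```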