IB Cluster Maintenance

NVIDIA InfiniBand Cluster Operation and Maintenance Guide

About This Document

This document is intended for network operators responsible for maintaining InfiniBand clusters.

This document outlines the automation tools, tests, and information needed when accepting a new cluster. It also recommends monitoring and maintenance routines, and explains how to obtain the inputs these procedures require and how to carry out the maintenance operations.

This document links to documentation that describes how to connect network events with the way UFM (Unified Fabric Manager) reports them. Scenarios are categorized by how likely they are to occur. For each issue, the document lists the UFM alerts that signal its presence and the UFM settings that must be configured to receive those alerts. Examples of these alerts are presented with explanations of their significance.

This document reflects software capabilities as of July 2023. It is intended as a resource to help network operators manage and maintain InfiniBand clusters.

Related Documentation

Product: Links

NVIDIA UFM Enterprise: UFM Enterprise User Manual, UFM Quick Start Guide, UFM REST API

NVIDIA UFM Enterprise Appliance: NVIDIA UFM Enterprise Appliance Software User Manual

NVIDIA UFM Telemetry: NVIDIA UFM Telemetry Documentation

NVIDIA UFM High-Availability: NVIDIA UFM High-Availability User Guide
