NVIDIA NVOS User Manual for InfiniBand Switches

Health Monitoring

NVOS includes a health daemon that collects health events in the system and monitors its components. These include hardware components, such as fans, power supply units, and leakage sensors, as well as software components, such as failing Docker containers.

The health daemon runs every 3 seconds and analyzes each component. If it detects an issue, the issue appears in the output of nv show system health, and the daemon publishes a health event in gNMI.
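Because the daemon re-evaluates components every 3 seconds, a client that polls the health output can compare consecutive snapshots to see which issues were raised, cleared, or changed. The following is a minimal Python sketch of that comparison. It assumes each snapshot has already been reduced to a mapping of component name to status text; the function name diff_issues and the snapshot shape are illustrative and are not part of NVOS.

```python
def diff_issues(previous, current):
    """Compare two health snapshots (dicts of component name -> status text).

    Both the function name and the snapshot shape are illustrative
    assumptions, not an NVOS interface.
    """
    # Issues present now that were absent in the previous snapshot.
    raised = {c: s for c, s in current.items() if c not in previous}
    # Issues present before that no longer appear.
    cleared = {c: s for c, s in previous.items() if c not in current}
    # Issues present in both snapshots whose status text differs.
    changed = {
        c: s for c, s in current.items()
        if c in previous and previous[c] != s
    }
    return raised, cleared, changed
```

A caller would typically keep the last snapshot, fetch a new one at an interval no shorter than the daemon's 3-second cycle, and act only on the non-empty parts of the result.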

The health daemon monitors two main areas: services and hardware.

The health daemon begins monitoring the system a few seconds after NVUE is ready to serve. As a result, system health information might not be available for approximately 11 seconds after NVUE becomes ready.
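Automation that queries system health immediately after NVUE starts may need to tolerate this startup window. The following is a minimal Python sketch of a retry loop. The fetch callable is hypothetical: it stands for whatever the caller uses to query health, and it is assumed to return None while health information is not yet available. The 30-second default timeout is likewise an illustrative choice that leaves margin over the approximately 11 seconds mentioned above.

```python
import time


def wait_for_health(fetch, timeout=30.0, interval=3.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until it returns a non-None result or the timeout elapses.

    fetch is a caller-supplied (hypothetical) function that returns None
    while health information is unavailable. sleep and clock are injectable
    so the loop can be exercised without real delays.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError("health information not available yet")
        # Wait one interval before the next attempt.
        sleep(interval)
```

The default interval matches the daemon's 3-second cycle, since polling faster than that is unlikely to return newer information.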

Service Monitoring

The health daemon ensures the continuous operation of essential system services. It monitors:

  • Critical Dockers and Services: Verifies that all vital Docker containers and services are running

Hardware Monitoring

  • Leakage Sensors: Detects any potential fluid leaks

  • Temperature Sensors: Monitors the temperature of all hardware components to prevent overheating

  • Voltage Sensors: Tracks voltage levels across hardware components to ensure proper functionality

  • Fan Speeds: Checks the speed of system fans to maintain optimal airflow and cooling

  • Power Supply Units (PSUs): Monitors the status of power supply units for stability

  • ASIC Health Status: Verifies the health of ASIC components to prevent processing issues

  • Disk: Checks the health and free space of the disk

  • CPU: Monitors CPU utilization and temperature

  • Transceivers: Monitors the status of the transceivers connected to the system

Output Examples

Example output for a healthy system:

admin@nvos:~$ nv show system health
            operational  applied  
----------  -----------  -------  
status      OK
status-led  green



Health issues
================
No Data

Example output for a faulty system:

admin@nvos:~$ nv show system health
            operational  applied  
----------  -----------  ------- 
status      Not OK
status-led  amber



Health issues
================
    Component                     Status information
    ----------------------------  ----------------------------------------------------------
    LEAKAGE-2                     detected leakage
    PMIC 1 Temp                   temperature is too hot, temperature=2008.0, threshold=125.0
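
Scripts that consume this output as text can split it into the summary fields and the list of issues. The following is a minimal Python sketch that assumes the layout shown in the examples above: columns separated by two or more spaces, ruler lines made of - or = characters, and No Data when there are no issues. The function name parse_health and the returned structure are illustrative and are not part of NVOS.

```python
import re


def parse_health(text):
    """Parse the plain-text output of `nv show system health`.

    Returns (summary, issues): summary maps 'status' and 'status-led' to
    their operational values; issues is a list of dicts with the component
    name, the status text, and any numeric key=value readings found in it.
    The layout is assumed from the example output, not from a specification.
    """
    summary, issues = {}, []
    in_issues = False
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and ruler lines such as '-----' or '====='.
        if not line or set(line) <= set("-= "):
            continue
        if line.startswith("Health issues"):
            in_issues = True
            continue
        # Columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line, maxsplit=1)
        if len(parts) != 2:
            continue  # prompt line, 'No Data', or anything unrecognized
        name, value = parts
        if not in_issues:
            if name in ("status", "status-led"):
                summary[name] = value
        elif name != "Component":  # skip the issues table header
            readings = {
                key: float(num)
                for key, num in re.findall(r"(\w+)=(-?\d+(?:\.\d+)?)", value)
            }
            issues.append(
                {"component": name, "info": value, "readings": readings}
            )
    return summary, issues
```

For the faulty example above, the second issue would carry the readings temperature and threshold as numbers, which lets a script compare them without further string handling.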

Health Monitoring Commands
