NVIDIA NVOS User Manual for InfiniBand Switches

Event Management

This feature reports specific actions (such as port disable) and events (such as port fast-recovery) in a standardized format, together with descriptions. It simplifies remote monitoring of the system state.

In the current release, the list of broadcast events can be accessed only through the CLI. In an upcoming release, gNMI will be extended to publish each event to remote clients. To subscribe to gNMI events, see the gNMI Streaming section.

Supported Events

The following table lists the supported events and their descriptions.

| Resource | Event Description | Severity |
|---|---|---|
| System | System fatal state detected | CRITICAL |
| System | System is not ready—one or more services are not up | CRITICAL |
| System | System is not ready—one or more services failed | MAJOR |
| System | Restart all syncd-ibv0 dockers | MAJOR |
| System | Performing reboot | MAJOR |
| System | Health status is not ok | WARNING |
| System | System is ready | INFORMATIONAL |
| System | System recovered from fatal state | INFORMATIONAL |
| System | Health status is ok | INFORMATIONAL |
| Sensor or service name | <Repeats a message from the system health> | WARNING |
| Sensor or service name | Hardware component goes back to normal / Service goes back to normal | INFORMATIONAL |
| Interface name | Interface admin state is {Up/Down} | INFORMATIONAL |
| Interface name | Interface operational state is {Up/Down} | INFORMATIONAL |
| Interface name | Fast-recovery error event for trigger {trigger_name} was received | INFORMATIONAL |
| System | System reboot occurred; reason: <reboot cause>; performed by user: <user who performed reboot>; reboot time: <time when reboot occurred> | INFORMATIONAL |
| ASIC name | PSC detected failure | MAJOR |

For a list of reboot causes, see Possible Reboot Causes.
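
The severity levels listed above can be used to filter events on the client side. The following is a minimal sketch, not part of NVOS: the `Event` record and the `at_least` helper are hypothetical, and the ranking (CRITICAL, then MAJOR, then WARNING, then INFORMATIONAL) is an assumption inferred from the order in which the severities appear in this section.

```python
from dataclasses import dataclass

# Assumed ranking, most severe first (inferred from this section, not confirmed).
SEVERITY_RANK = {"CRITICAL": 0, "MAJOR": 1, "WARNING": 2, "INFORMATIONAL": 3}


@dataclass(frozen=True)
class Event:
    """Hypothetical client-side record of one broadcast event."""
    resource: str
    text: str
    severity: str


def at_least(events, threshold):
    """Return events at `threshold` severity or worse, most severe first."""
    limit = SEVERITY_RANK[threshold]
    selected = [e for e in events if SEVERITY_RANK[e.severity] <= limit]
    return sorted(selected, key=lambda e: SEVERITY_RANK[e.severity])


events = [
    Event("System", "System is ready", "INFORMATIONAL"),
    Event("System", "Performing reboot", "MAJOR"),
    Event("System", "Health status is not ok", "WARNING"),
    Event("System", "System fatal state detected", "CRITICAL"),
]
urgent = at_least(events, "MAJOR")
```

Here `urgent` holds only the CRITICAL and MAJOR events, in that order.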

Clearing Events

Receiving Clearing Events

For most WARNING and CRITICAL events, the system sends a "Clearing" event once the issue is resolved. If a component experiences two or more issues, the system sends an event about the last issue, and once all the issues are resolved, it sends a "Clearing" event for the last issue only. Receiving a "Clearing" event therefore means that all issues for the component are resolved.

Example:

  • PSU2 is out of power

  • PSU2 is missing or not available

  • Cleared: PSU2 is missing or not available

Resending Unresolved Issues

If one of the earlier issues for the component still exists after the last one is resolved, the system resends the event for the issue that still exists.

Example:

  • PSU2 is out of power

  • PSU2 is missing or not available

  • PSU2 is out of power
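
The clearing and resending behavior described above can be modeled on the client side by remembering only the most recent issue per component. The following is a sketch under stated assumptions: the `ComponentStatus` class and the `(component, text)` tuple shape are hypothetical, and the "Cleared: " prefix is taken from the examples in this section rather than from a formal specification.

```python
CLEAR_PREFIX = "Cleared: "  # prefix as it appears in the examples above


class ComponentStatus:
    """Hypothetical tracker: keeps the latest unresolved issue per component."""

    def __init__(self):
        self.current_issue = {}  # component name -> most recent issue text

    def handle(self, component, text):
        if text.startswith(CLEAR_PREFIX):
            # A "Clearing" event means all issues for the component are resolved.
            self.current_issue.pop(component, None)
        else:
            # A new or resent issue replaces whatever was recorded before.
            self.current_issue[component] = text

    def is_healthy(self, component):
        return component not in self.current_issue


def replay(events):
    """Feed a sequence of (component, text) pairs through a fresh tracker."""
    status = ComponentStatus()
    for component, text in events:
        status.handle(component, text)
    return status


# Sequence from "Receiving Clearing Events".
cleared = replay([
    ("PSU2", "PSU2 is out of power"),
    ("PSU2", "PSU2 is missing or not available"),
    ("PSU2", "Cleared: PSU2 is missing or not available"),
])

# Sequence from "Resending Unresolved Issues".
resent = replay([
    ("PSU2", "PSU2 is out of power"),
    ("PSU2", "PSU2 is missing or not available"),
    ("PSU2", "PSU2 is out of power"),
])
```

After the first sequence PSU2 is considered healthy. After the second, the tracker still reports "PSU2 is out of power" as the open issue.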

Backward Compatibility

To preserve backward compatibility, the system also sends the generic event when an issue for a component is cleared. Avoid relying on generic messages, as they will be dropped in future releases.

Example:

  • Cleared: PSU2/FAN speed is out of range

  • Hardware component goes back to normal
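
A client that already handles the specific "Cleared" events can ignore the generic ones. The following is a minimal sketch: the `GENERIC_TEXTS` set is an assumption assembled from the generic messages that appear on this page, and the helper names are hypothetical.

```python
# Generic texts as they appear on this page (assumed list, may be incomplete).
GENERIC_TEXTS = {
    "Hardware component goes back to normal",
    "HW component goes back to normal",
    "Service goes back to normal",
}


def is_generic(text):
    """True if the event text is one of the generic 'back to normal' messages."""
    return text.strip() in GENERIC_TEXTS


def drop_generic(texts):
    """Keep only the specific event texts, preserving their order."""
    return [t for t in texts if not is_generic(t)]


received = [
    "Cleared: PSU2/FAN speed is out of range",
    "Hardware component goes back to normal",
]
specific_only = drop_generic(received)
```

With the example above, `specific_only` contains only the "Cleared" message.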

System Reboot

After a reboot, the system does not send "Clearing" events for issues reported before the reboot. It assumes that all of them are cleared.

Detailed Table of Events

Event Category

Event Type ID ("event" in gNMI)

Severity 

Resource ("component" in gNMI)

Text 

Failure Reason  

Suggested Repair Flow

Fan-Related Events

Fan failure 

HEALTH_NOT_OK 

WARNING 

FAN1/1 

"FAN1/1 speed is out of range, speed=40%, range=[50,100]" 

Fan speed out of range  

  • Collect tech-support and submit NVIDIA support ticket.

  • Consider number of faulty fans: more than one fan requires immediate maintenance.

  • Power-cycle the switch.

  • If the issue persists, submit NVIDIA support ticket to replace the fan module.

Fan failure 

HEALTH_NOT_OK 

WARNING 

FAN1/1 

"FAN1/1 is not working" 

Fan status reported by the hardware is not OK

Fan failure 

HEALTH_NOT_OK 

WARNING 

FAN1/1 

"Failed to get actual speed data for FAN1/1"

Failed to read some fan data from the hardware

Fan failure 

HEALTH_NOT_OK 

WARNING 

FAN1/1 

"Failed to get target speed data for FAN1/1"

Failed to read some fan data from the hardware

Fan failure 

HEALTH_NOT_OK 

WARNING 

FAN1/1 

"Failed to get speed tolerance for FAN1/1"

Failed to read some fan data from the hardware

Fan failure 

HEALTH_NOT_OK 

WARNING 

FAN1/1 

"Failed to get speed status for FAN1/1"

Failed to read some fan data from the hardware

Fan failure 

HEALTH_NOT_OK 

WARNING 

FAN1/1 

"FAN1/1 is missing" 

Fan is missing 

Fan failure 

HEALTH_NOT_OK 

WARNING 

FAN1/1 

"FAN1/1 direction is not aligned with exhaust direction intake" 

Fan direction is not aligned with other fans 

Fan failure 

HEALTH_NOT_OK 

WARNING 

FAN1/1 

"Invalid fan speed data for FAN1/1, speed=0x, target=50, tolerance=100" 

Invalid speed  

Fan health

HEALTH_OK

INFORMATIONAL

FAN1/1 

“HW component goes back to normal” 

Fan is back to normal state  

N/A

ASIC-Related Events

ASIC failure 

HEALTH_NOT_OK 

WARNING 

ASIC-HEALTH 

Switch ASIC in fatal state 

ASIC is in fatal state


  • Collect tech-support and submit NVIDIA support ticket.

  • Reboot the system.


  • If the issue persists, power-cycle the system.

ASIC failure 

SYSTEM_FATAL_DETECTED 

CRITICAL 

System 

System fatal state detected 

ASIC detected in fatal state

ASIC failure 

HEALTH_NOT_OK 

WARNING 

ASIC1 

ASIC1 temperature is too hot, temperature=120, threshold=105 

ASIC temp too high 

  • Collect tech-support and submit NVIDIA support ticket.

  • Continue to monitor switch temperature.

ASIC failure 

SYSTEM_FATAL_REMEDY 

MAJOR 

System 

Restart all syncd-ibv0 dockers 

ASIC is in fatal state, performing reboot of dockers

N/A

ASIC failure 

SYSTEM_FATAL_REMEDY 

MAJOR 

System 

Performing reboot 

ASIC is in fatal state, performing reboot

ASIC health

HEALTH_OK

INFORMATIONAL

ASIC1 

“HW component goes back to normal” 

ASIC1 is back to normal state  

ASIC health

SYSTEM_FATAL_RECOVERED 

INFORMATIONAL 

System 

System recovered from fatal state 

Recovered from fatal state

ASIC security irregularity

ASIC_SECURITY_IRREGULARITY

MAJOR

ASIC1

"PSC detected failure"

The system has detected an irregularity in physical monitoring

N/A

Leakage-Related Events

Leakage 

LEAKAGE 

CRITICAL 

LEAKAGE-1  

Leakage detected, inspect for water leakage and consider power down switch tray 

Detected leakage 

  • Collect tech-support and submit NVIDIA support ticket.

  • For additional instructions, refer to NVONLINE 1115991, chapter "NVIDIA MGX Leak Detection Strategy and Remediation".

NOTE: Relevant only for liquid-cooled systems.

Leakage 

HEALTH_NOT_OK 

WARNING 

LEAKAGE-1 

LEAKAGE-1  detected leakage 

Detected leakage 

Voltage-Related Events

Voltage 

HEALTH_NOT_OK 

WARNING 

<Voltage-sensor-name> 

Sensor voltage is out of range, voltage={}, range=[{},{}] 

Voltage sensor not in range 

  • Collect tech-support and submit NVIDIA support ticket.

  • Power-cycle the switch.

  • If the issue persists, consider replacing the system.

Voltage 

HEALTH_NOT_OK 

WARNING 

<Voltage-sensor-name> 

Sensor status is failed 

Voltage sensor status in hardware is failed 

Voltage

HEALTH_OK

INFORMATIONAL

<Voltage-sensor-name> 

“HW component goes back to normal” 

Voltage sensor value is back to normal state  

N/A

Temperature-Related Events

Temperature 

HEALTH_NOT_OK 

WARNING 

<Temp-sensor-name> 

<Temp-sensor-name> temperature is too hot, temperature={}, threshold={} 

Temperature too hot 

  • Collect tech-support and submit NVIDIA support ticket.

  • Power-cycle the switch.

  • If the issue persists, check whether the sensor is Ambient-MNG-Temp. If it is, check the environmental conditions (CDU and DC temperature).

  • If the issue still persists, consider replacing the system.

Temperature 

HEALTH_NOT_OK 

WARNING 

<Temp-sensor-name> 

Sensor status is failed 

Sensor status in hardware is failed 

Temperature 

HEALTH_OK

INFORMATIONAL

<Temp-sensor-name> 

“HW component goes back to normal” 

Temperature sensor value is back to normal state  

N/A

System-Services-Related Events

services 

HEALTH_NOT_OK 

WARNING 

<container-name> 

Container '<container-name>' is not running 

Container is not running 

Collect tech-support and submit NVIDIA support ticket.

services 

HEALTH_OK

INFORMATIONAL

<container-name> 

“Service goes back to normal” 

Service goes back to normal state  

N/A

services

CPU_USAGE

WARNING


CPU usage x% is above expected threshold y%

CPU usage is higher than the expected threshold

  • Collect tech-support and submit NVIDIA support ticket.

services

MEMORY_USAGE

WARNING


Memory usage x% is above expected threshold y%

Memory usage is higher than the expected threshold

  • Container will be restarted automatically.

  • If the issue persists, collect tech-support and submit NVIDIA support ticket.

services

CPU_USAGE

INFORMATIONAL


CPU usage is back to normal

CPU usage is back to normal state

N/A

services

MEMORY_USAGE

INFORMATIONAL


Memory usage is back to normal

Memory usage is back to normal state

N/A

System-Initialization-Related Events

Init flow 

SYSTEM_STATE_DOWN 

CRITICAL 

System 

System is not ready—one or more services are not up 

Some services are not up as part of init 

  • Collect tech-support and submit NVIDIA support ticket.

  • If the sensor is Ambient-MNG-Temp:

  • Check environmental conditions (CDU, if present, and DC temperature).

  • If the issue persists, power-cycle the switch.

  • If the issue still persists, replace the switch.

Init flow 

SYSTEM_STATE_FAILED 

MAJOR 

System 

System is not ready—one or more services failed 

Some services failed as part of init 

Init flow 

SYSTEM_STATE_UP

INFORMATIONAL

System 

System is ready

System finished initialization and is ready

N/A

Interface-Related Informational Events


interface 

INTERFACE_ADMIN_STATUS

INFORMATIONAL

<interface_name>

"Interface admin state is {admin_state}"

Informs of admin state change of interface 

N/A

interface 

INTERFACE_OPER_STATUS

INFORMATIONAL

<interface_name>

"Interface operational state is {up or down}"

Informs of operational state change of interface

N/A

interface 

INTERFACE_LOGICAL_STATE

INFORMATIONAL

<interface_name>

"Interface logical state is {logical_state}"

Informs of logical state change of interface

N/A

System-Health-Related Events
(The events below are summary events that accompany the specific errors detailed above.)

system 

HEALTH_SUMMARY_NOT_OK_CRITICAL 

CRITICAL 

System 

Health status is not ok 

At least one critical health-not-OK event is active (e.g., leakage)

Collect tech-support and submit NVIDIA support ticket.

system 

HEALTH_SUMMARY_NOT_OK 

WARNING 

System 

Health status is not ok 

At least one warning-level health-not-OK event is active

system

HEALTH_SUMMARY_OK

INFORMATIONAL

System 

Health status is ok

System health has no issues

N/A

Transceiver-Related Events

Transceiver failure

HEALTH_NOT_OK

WARNING

sw1

"Transceiver's temperature is higher than critical threshold [actual = 100, high threshold = 80]"

Temperature is critically high

N/A

Transceiver failure

HEALTH_NOT_OK

WARNING

sw1

"Transceiver's temperature is lower than critical threshold [actual = 1, low threshold = 5]"

Temperature is critically low

N/A

Transceiver health

HEALTH_OK

INFORMATIONAL

sw1

"HW component goes back to normal"

Transceiver temperature is back within the thresholds

N/A
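
Several of the event texts above embed measured values as `key=value` pairs (for example `speed=40%, range=[50,100]`). The following is a minimal sketch of how a client might extract them. The helper names are hypothetical, the pattern is an assumption based only on the sample texts on this page, and it does not cover the bracketed transceiver format (`[actual = 100, high threshold = 80]`).

```python
import re

# A key, "=", then either a bracketed range or a token up to the next comma/space.
PARAM_PATTERN = re.compile(r"(\w+)=(\[[^\]]*\]|[^,\s]+)")


def parse_params(text):
    """Return the key=value pairs found in an event text as a dict of strings."""
    return dict(PARAM_PATTERN.findall(text))


def parse_range(value):
    """Convert a '[low,high]' string into a (low, high) tuple of floats."""
    low, high = value.strip("[]").split(",")
    return float(low), float(high)


fan = parse_params("FAN1/1 speed is out of range, speed=40%, range=[50,100]")
asic = parse_params("ASIC1 temperature is too hot, temperature=120, threshold=105")
fan_range = parse_range(fan["range"])
```

Values are kept as strings because some texts carry units or non-numeric content (such as `40%` or `0x`), and the caller decides how to convert them.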


Transceiver-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| Transceiver failure | HEALTH_NOT_OK | WARNING | sw1 | "Unsupported cable detected" | Inserted cable vendor is not recognized | N/A |
| Transceiver failure | HEALTH_NOT_OK | WARNING | els1 | "HW Component health is not ok: Laser failure" | ELS laser isn't functioning properly | Replace ELS |

Power-Supply-Unit-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> is missing—Unpopulated PSU slot | A PSU is expected in the system, but the PSU slot is empty | Insert a PSU or dummy PSU into the unpopulated PSU slot |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> is out of power | PSU is out of power | |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> temperature is too hot, temperature={}, threshold={} | Temperature too hot | |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> voltage is out of range, voltage={}, range=[{},{}] | Voltage is out of range | |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> System power exceeds threshold ({}w) | Power exceeds threshold | |
| PSU | HEALTH_NOT_OK | WARNING | <PSU-name> | <PSU-name> Power supply is not providing power | No power from PSU | |
| PSU | HEALTH_OK | INFORMATIONAL | <PSU-name> | HW component goes back to normal | Health issue was resolved | N/A |


PS-Redundancy-Related Events

| Event Category | Event Type ID ("event" in gNMI) | Severity | Resource ("component" in gNMI) | Text | Failure Reason | Suggested Repair Flow |
|---|---|---|---|---|---|---|
| Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy no-redundancy requires at least <minimal-number-of-PSUs-per-system> working power supplies, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current no-redundancy policy | Insert at least the minimum number of PSUs required. To find the minimum number of PSUs, run: nv show platform ps-redundancy |
| Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy ps-redundant requires at least <minimal-number-of-PSUs-per-system + 1> working power supplies, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current ps-redundant policy | Insert at least one more PSU than the minimum number of PSUs required. To find the minimum number of PSUs, run: nv show platform ps-redundancy |
| Power redundancy policy | HEALTH_NOT_OK | WARNING | N/A | Power redundancy policy grid-redundant requires all <number-of-PSUs-in-the-system> power supplies to be working, currently system has only <number-of-working-PSUs> | Insufficient number of PSUs relative to the current grid-redundant policy | Insert PSUs into all PSU slots in the system. |
| Power redundancy policy | HEALTH_OK | INFORMATIONAL | N/A | HW component goes back to normal | Health issue was resolved | N/A |
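
The three power redundancy policies above differ only in how many working PSUs they require. The following is a sketch of that rule as read from the event texts: the function names and parameters are hypothetical, and on a real system the minimal PSU count comes from `nv show platform ps-redundancy`.

```python
def required_working_psus(policy, minimal_psus, total_psus):
    """Number of working PSUs each policy requires, per the event texts above."""
    if policy == "no-redundancy":
        return minimal_psus
    if policy == "ps-redundant":
        return minimal_psus + 1
    if policy == "grid-redundant":
        return total_psus
    raise ValueError(f"unknown policy: {policy}")


def policy_satisfied(policy, minimal_psus, total_psus, working_psus):
    """True if the number of working PSUs meets the policy requirement."""
    return working_psus >= required_working_psus(policy, minimal_psus, total_psus)


# Illustrative values only: a system with 4 PSU slots and a minimum of 2 PSUs.
needed = required_working_psus("ps-redundant", minimal_psus=2, total_psus=4)
ok = policy_satisfied("grid-redundant", minimal_psus=2, total_psus=4, working_psus=3)
```

In this illustration, ps-redundant needs three working PSUs, and grid-redundant is not satisfied with three of four PSUs working.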


Event Management Commands
