NVIDIA UFM Cable Validation Tool

Hierarchical Anomalies

Cable Validation Tool (CVT) performs hierarchical anomaly detection on each monitored port using data collected from AMBER (Advanced Monitoring of BER) telemetry. For each port, a series of health checks is evaluated in a fixed priority order. The first check that detects an issue is reported as the port's anomaly, and the remaining checks are skipped for that port. This ensures that the most fundamental problem is surfaced first.

All anomalies detected by these checks are reported under the syndrome "Anomalous-port (Signal, Temperature)" in the Reports and Circuits views. The specific remediation action shown alongside the syndrome indicates which check was triggered.

The checkpoint priority order depends on the cable type:

Optical cables: EEPROM read > BER + RX power > Module temperature > Power threshold > Firmware opcode > Link state > Bad module

Copper cables: EEPROM read > BER > Module temperature > Module voltage > Firmware opcode > Link state > Bad module

NVLink cables: EEPROM read > BER > Module temperature > Module voltage > Firmware opcode > Link state
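
The first-match-wins evaluation described above can be sketched roughly as follows. This is only an illustration, not CVT's actual code: the checkpoint names are copied from the lists above, while `PRIORITY`, `first_anomaly`, and the shape of the per-check callables are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, Optional, Tuple

# Checkpoint names copied from the priority lists above.
PRIORITY = {
    "optical": ["EEPROM read", "BER + RX power", "Module temperature",
                "Power threshold", "Firmware opcode", "Link state", "Bad module"],
    "copper": ["EEPROM read", "BER", "Module temperature", "Module voltage",
               "Firmware opcode", "Link state", "Bad module"],
    "nvlink": ["EEPROM read", "BER", "Module temperature", "Module voltage",
               "Firmware opcode", "Link state"],
}

# Hypothetical check signature: return a message if the check fails, else None.
Check = Callable[[dict], Optional[str]]


def first_anomaly(cable_type: str, port: dict,
                  checks: Dict[str, Check]) -> Optional[Tuple[str, str]]:
    """Run the checks in priority order and stop at the first one that fires."""
    for name in PRIORITY[cable_type]:
        check = checks.get(name)
        if check is None:
            continue  # no implementation supplied for this checkpoint
        message = check(port)
        if message is not None:
            return name, message  # later checks are skipped for this port
    return None  # port is healthy
```

With two toy checks registered, a port that is both too hot and stuck in Polling is reported only for temperature, because Module temperature precedes Link state in the copper order.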


The table below lists all possible anomaly messages, the checks that produce them, and their meaning.

| Anomaly Category | Message | Meaning |
| --- | --- | --- |
| EEPROM Issue | Module read issue in sample. Check if module can be read correctly manually via mlxlink/ethtool -m, if not attempt to reseat the module. | Module is plugged_enable but critical EEPROM fields are unreadable: vendor name, vendor OUI, part number, or serial number contains invalid/garbage characters, or module temperature is N/A, or (optical only) RX/TX power lane values are N/A. |
| BER + Optical Check | Link Bit Error Rate is high, issue is confirmed on the optical line. Clean the optics, make sure Rx power readings per lane are within 2dBm of each other | BER grade is Marginal or worse AND one lane's raw BER is 10x higher than the next, with its corresponding rx_power more than 2 dBm lower than the best lane. |
| BER + Optical Check | Link Bit Error Rate is high. Clean the optics | BER grade is Marginal or worse, but neither the rx power variance check nor the SNR media check flagged. |
| BER + Optical Check | Rx power abnormal variance detected and is affecting the lane BER. Clean the optics | BER is fine, but one lane's rx_power is anomalously low relative to the others (suspected bad optical side). |
| BER + Optical Check | Optical signal received on optical side is showing high noise levels. Clean the optics | The SNR media lane check flagged. |
| BER only Check | Link Bit Error Rate is high. Reseat the connector | BER grade is Marginal or worse on a copper or NVLink cable. (The Q3400_RA IB XDR switch uses a different mechanism to determine BER issues, as listed in the next table.) |
| Module Temperature | Module temperature lower than range. Check rack and system cooling | module_temperature < temperature_low_th. |
| Module Temperature | Module temperature exceeds range. Check rack and system cooling | module_temperature > temperature_high_th. |
| Optical Power Levels | No light observed on Rx/Tx. Verify connectivity and light levels on the optics and fiber | All valid lanes have TX or RX power at -40 dBm (no light), and the linkdown counter changed since the last sample. |
| Optical Power Levels | Module power out of range. Clean the optics, or replace transceiver. | At least one valid lane has TX or RX power outside the [low_th, high_th] range. |
| Module Voltage | Module Voltage lower than range. Check power supply | module_voltage < voltage_low_th. |
| Module Voltage | Module Voltage exceeds range. Check power supply | module_voltage > voltage_high_th. |
| Firmware Status Opcode | <status_message>. Clean optics | Opcode 15: firmware reports a signal integrity issue (the first part is the firmware's status_message field, e.g. "Bad signal integrity"). |
| Firmware Status Opcode | <status_message>. Verify cabling on both sides of the link | Opcode 57: firmware reports a cabling/connectivity issue. |
| Firmware Status Opcode | <status_message>. Verify cooling on device, and reseat transceiver. | Opcode 1030: firmware reports a thermal/cooling issue. |
| Firmware Status Opcode | <status_message>. Reseat transceiver, if the issue remains; test known working transceiver in the same port and test the current transceiver in known working switch. | Opcodes 59, 1026, 1027, 1031, 1033, 1048: module communication issue. |
| Firmware Status Opcode | <status_message>. Reseat the transceiver. | Any other non-valid opcode (not 0, 1, or 1024) without a specific mapped action. |
| Link State | Link negotiation issue. Verify speed and FEC configuration on both ports in the link | phy_manager_state = "Polling" and module is not unplugged. Port is stuck negotiating. |
| Link State | Module communication issue. Reseat the transceiver | phy_manager_state = "Rx_disable" and module is not unplugged. |
| Module Operational Status | Bad module in port. Verify the module was seated correctly, replace if needed. | module_oper_status is neither plugged_enable nor unplugged (e.g. plugged_disabled, module error states). |
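
As an illustration of the Firmware Status Opcode rows, the sketch below maps an opcode to the remediation text listed in the table. The opcode numbers and action strings come from the table; the function name `opcode_anomaly` and the default set of opcodes treated as valid (0, 1, and 1024) are assumptions made for this example, so adjust them to match the actual deployment.

```python
from typing import FrozenSet, Optional

# Opcode-specific actions, copied from the table above.
ACTION_BY_OPCODE = {
    15: "Clean optics",
    57: "Verify cabling on both sides of the link",
    1030: "Verify cooling on device, and reseat transceiver.",
}
MODULE_COMM_OPCODES = {59, 1026, 1027, 1031, 1033, 1048}
MODULE_COMM_ACTION = (
    "Reseat transceiver, if the issue remains; test known working transceiver "
    "in the same port and test the current transceiver in known working switch."
)
DEFAULT_ACTION = "Reseat the transceiver."


def opcode_anomaly(
    opcode: int,
    status_message: str,
    # Assumption: these opcodes mean "nothing to report" and raise no anomaly.
    valid_opcodes: FrozenSet[int] = frozenset({0, 1, 1024}),
) -> Optional[str]:
    """Return '<status_message>. <action>' for an anomalous opcode, else None."""
    if opcode in valid_opcodes:
        return None
    if opcode in ACTION_BY_OPCODE:
        action = ACTION_BY_OPCODE[opcode]
    elif opcode in MODULE_COMM_OPCODES:
        action = MODULE_COMM_ACTION
    else:
        action = DEFAULT_ACTION  # any other opcode without a mapped action
    return f"{status_message}. {action}"
```

For example, `opcode_anomaly(15, "Bad signal integrity")` yields `"Bad signal integrity. Clean optics"`.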


XDR BER Anomaly Handling

For the Q3400_RA IB XDR switch, BER anomaly reporting follows a different path than for other protocols. It uses a two-phase BER evaluation with a sliding window mechanism to reduce false positives and provide severity-graded anomalies. This switch has 4 planes and generates AMBER data separately for each plane. The grade is calculated per port and per plane, and a bad grade on any plane creates an anomaly for the port.

The two phases are determined by time_since_last_clear:

  • Pre-staging phase (first 4 hours after counters are cleared): Used for validating newly installed or reseated hardware. Individual sample grades are evaluated directly — a single bad sample is enough to trigger an anomaly.

  • On-going phase (after 4 hours): Uses a BER sliding window that accumulates sample grades over time. Anomalies are only raised when multiple bad samples are observed within the window, and severity escalates from Warning to Error based on the count and pattern of bad samples.
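
A minimal sketch of the phase selection and the per-plane aggregation described above. The 4-hour boundary comes from the text; the function names and the grade ordering (Good, then Marginal, then Poor) are assumptions for illustration only.

```python
from datetime import timedelta
from typing import Sequence

PRE_STAGING_DURATION = timedelta(hours=4)

# Assumed ordering from best to worst; used only to pick the worst plane.
GRADE_ORDER = ["Good", "Marginal", "Poor"]


def ber_phase(time_since_last_clear: timedelta) -> str:
    """Select the evaluation phase from the time since counters were cleared."""
    if time_since_last_clear <= PRE_STAGING_DURATION:
        return "pre-staging"
    return "on-going"


def port_grade(plane_grades: Sequence[str]) -> str:
    """A port takes the worst grade seen on any of its planes."""
    return max(plane_grades, key=GRADE_ORDER.index)
```

Taking the worst plane grade reflects the rule that a bad grade on any single plane is enough to raise an anomaly for the whole port.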

All XDR BER anomalies appear under the syndrome "Anomalous-port (Signal, Temperature)" in the UI, with the phase and severity indicated by a prefix in the remediation action text (Pre-staging:, Warning:, or Error:).


| Stage | Message | Meaning |
| --- | --- | --- |
| Pre-staging phase (time_since_last_clear <= 4 hours) | Pre-staging: Warning: Link down counter changed since last check. | Grade is Marginal: BER metrics are within thresholds, but the link-down counter changed since the last advanced stats snapshot. |
| Pre-staging phase (time_since_last_clear <= 4 hours) | Pre-staging: Link Bit Error Rate is high. Reseat the connector. | Grade is Poor: one or more BER thresholds exceeded during pre-staging (symbol_errors > 0, effective_ber > 1e-12, first_zero_hist > 9, or link_error_recovery_counter > 0). |
| On-going phase (time_since_last_clear > 4 hours) | Warning: Link Bit Error Rate is high. Reseat the connector. | 2-3 marked (bad BER) sampling windows within the current window duration. |
| On-going phase (time_since_last_clear > 4 hours) | Error: Link Bit Error Rate is high. Reseat the connector. | 4+ marked sampling windows, OR 2 consecutive windows at Warning severity. Persists until a full Good window completes. |
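
The pre-staging grading rules can be sketched as below. The thresholds and message strings are taken from the table; the `BerSample` container, its `link_down_changed` flag, and the function name are hypothetical and only mirror the counter names used above.

```python
from dataclasses import dataclass


@dataclass
class BerSample:
    """Hypothetical container for one pre-staging sample."""
    symbol_errors: int
    effective_ber: float
    first_zero_hist: int
    link_error_recovery_counter: int
    link_down_changed: bool  # link-down counter changed since last snapshot


PRE_STAGING_MESSAGES = {
    "Marginal": "Pre-staging: Warning: Link down counter changed since last check.",
    "Poor": "Pre-staging: Link Bit Error Rate is high. Reseat the connector.",
}


def pre_staging_grade(sample: BerSample) -> str:
    """Grade a single sample; in pre-staging one bad sample raises an anomaly."""
    threshold_exceeded = (
        sample.symbol_errors > 0
        or sample.effective_ber > 1e-12
        or sample.first_zero_hist > 9
        or sample.link_error_recovery_counter > 0
    )
    if threshold_exceeded:
        return "Poor"
    if sample.link_down_changed:
        return "Marginal"  # BER within thresholds, but the link went down
    return "Good"
```

A grade of Good produces no message; the other two grades look up their text in `PRE_STAGING_MESSAGES`.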

Anomaly Clearing Behavior

During the on-going phase, a BER window spans 4 hours. A Good window is one with 0-1 marked (bad BER) samples. Once a port reaches Warning or Error severity, the anomaly persists until a full Good window completes. Mid-window, the previous window's severity carries forward even if no new bad samples have been observed yet. When a Good window completes, the history is cleared and the port returns to normal.
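
One possible reading of the on-going phase is the small state machine below. It is a sketch under stated assumptions rather than the tool's implementation: marked samples are counted per 4-hour window, the caller decides when a window closes, and an Error never downgrades to Warning without an intervening Good window. The class and method names are hypothetical.

```python
from typing import Optional

_RANK = {None: 0, "Warning": 1, "Error": 2}


def _worst(a: Optional[str], b: Optional[str]) -> Optional[str]:
    return a if _RANK[a] >= _RANK[b] else b


def window_severity(marked: int) -> Optional[str]:
    """Severity implied by the number of marked (bad BER) samples in a window."""
    if marked >= 4:
        return "Error"
    if marked >= 2:
        return "Warning"
    return None  # 0-1 marked samples: a Good window


class BerWindowTracker:
    def __init__(self) -> None:
        self.marked = 0          # bad samples in the current window
        self.carried = None      # severity carried over from closed windows
        self.warning_streak = 0  # consecutive closed windows at Warning

    def current(self) -> Optional[str]:
        # Mid-window, the previous severity carries forward.
        return _worst(self.carried, window_severity(self.marked))

    def add_sample(self, is_bad: bool) -> Optional[str]:
        if is_bad:
            self.marked += 1
        return self.current()

    def close_window(self) -> Optional[str]:
        """Call when the 4-hour window completes."""
        severity = window_severity(self.marked)
        self.marked = 0
        if severity is None:
            # A full Good window clears the history.
            self.carried = None
            self.warning_streak = 0
            return None
        if severity == "Warning":
            self.warning_streak += 1
            if self.warning_streak >= 2:
                severity = "Error"  # two consecutive Warning windows
        else:
            self.warning_streak = 0
        self.carried = _worst(self.carried, severity)
        return self.carried
```

In this sketch a port that closes two Warning windows in a row is reported as Error, and it returns to normal only after `close_window()` is called on a window with at most one marked sample.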


