The Cable Validation Tool (CVT) performs hierarchical anomaly detection on each monitored port using data collected from the AMBER (Advanced Monitoring of BER) telemetry. For each port, a series of health checks is evaluated in a fixed priority order. The first check that detects an issue is reported as the port's anomaly, and subsequent checks are skipped for that port, so the most fundamental problem is surfaced first.
All anomalies detected by these checks are reported under the syndrome "Anomalous-port (Signal, Temperature)" in the Reports and Circuits views. The specific remediation action shown alongside the syndrome indicates which check was triggered.
The check priority order depends on the cable type:
Optical cables: EEPROM read > BER + RX power > Module temperature > Power threshold > Firmware opcode > Link state > Bad module
Copper cables: EEPROM read > BER > Module temperature > Module voltage > Firmware opcode > Link state > Bad module
NVLink cables: EEPROM read > BER > Module temperature > Module voltage > Firmware opcode > Link state
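The first-match behavior can be sketched as follows. This is a minimal illustration, not CVT's implementation: the `PRIORITY` table copies the check names from the lists above, while `first_anomaly` and the idea of passing each check as a zero-argument callable are assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

# Check names in priority order, copied from the lists above.
PRIORITY: Dict[str, List[str]] = {
    "optical": ["EEPROM read", "BER + RX power", "Module temperature",
                "Power threshold", "Firmware opcode", "Link state", "Bad module"],
    "copper": ["EEPROM read", "BER", "Module temperature", "Module voltage",
               "Firmware opcode", "Link state", "Bad module"],
    "nvlink": ["EEPROM read", "BER", "Module temperature", "Module voltage",
               "Firmware opcode", "Link state"],
}


def first_anomaly(cable_type: str,
                  checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Return the name of the first check that flags an issue, or None.

    `checks` is a hypothetical mapping from check name to a callable that
    returns True when that check detects a problem on the port.
    """
    for name in PRIORITY[cable_type]:
        check = checks.get(name)
        if check is not None and check():
            return name  # lower-priority checks are never evaluated
    return None
```

For example, if both the BER check and the module temperature check would flag a copper port, only the BER check is reported because it comes first in the copper order.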
The table below lists all possible anomaly messages, the checks that produce them, and their meaning.
| Anomaly Category | Message | Meaning |
|---|---|---|
| EEPROM Issue | Module read issue in sample. Check if module can be read correctly manually via mlxlink/ethtool -m, if not attempt to reseat the module. | Module is |
| BER + Optical Check | Link Bit Error Rate is high, issue is confirmed on the optical line. Clean the optics, make sure Rx power readings per lane are within 2dBm of each other | BER grade is Marginal or worse AND one lane's Raw BER is 10x higher than the next, with its corresponding rx_power >2dBm lower than the best lane. |
| | Link Bit Error Rate is high. Clean the optics | BER grade is Marginal or worse, but the rx power variance check didn't flag (and SNR media check didn't flag). |
| | Rx power abnormal variance detected and is affecting the lane BER. Clean the optics | BER is fine, but one lane's rx_power is anomalously low relative to others (suspected bad optical side). |
| | Optical signal received on optical side is showing high noise levels. Clean the optics | SNR media lane check flagged |
| BER only Check | Link Bit Error Rate is high. Reseat the connector | BER grade is Marginal or worse on a copper or NVLink cable. (Q3400_RA IB XDR switch uses a different mechanism to determine BER issues as listed in the next table.) |
| Module Temperature | Module temperature lower than range. Check rack and system cooling | |
| | Module temperature exceeds range. Check rack and system cooling | |
| Optical Power Levels | No light observed on Rx/Tx. Verify connectivity and light levels on the optics and fiber | All valid lanes have TX or RX power at -40 dBm (no light), and the linkdown counter changed since last sample. |
| | Module power out of range. Clean the optics, or replace transceiver. | At least one valid lane has TX or RX power outside the |
| Module Voltage | Module Voltage lower than range. Check power supply | |
| | Module Voltage exceeds range. Check power supply | |
| Firmware Status Opcode | < | Opcode |
| | <status_message>. Verify cabling on both sides of the link | Opcode |
| | <status_message>. Verify cooling on device, and reseat transceiver. | Opcode |
| | <status_message>. Reseat transceiver, if the issue remains; test known working transceiver in the same port and test the current transceiver in known working switch. | Opcodes |
| | <status_message>. Reseat the transceiver. | Any other non-valid opcode (not |
| Link State | Link negotiation issue. Verify speed and FEC configuration on both ports in the link | |
| | Module communication issue. Reseat the transceiver | |
| Module Operational Status | Bad module in port. Verify the module was seated correctly, replace if needed. | |
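As an illustration of the first BER + Optical Check row in the table above, the sketch below looks for a single lane whose raw BER stands out and whose RX power is also low. It is not CVT's implementation. The function name `outlier_lane` and its parameters are hypothetical, and two readings of the table text are assumptions: "10x higher than the next" is taken to mean at least 10 times the second-worst lane, and "the best lane" is taken to mean the lane with the highest RX power. The function is meant to be called only after the BER grade is already Marginal or worse.

```python
from typing import List, Optional


def outlier_lane(raw_ber: List[float], rx_power_dbm: List[float],
                 ber_ratio: float = 10.0, rx_gap: float = 2.0) -> Optional[int]:
    """Return the index of the lane that looks like an optical-side outlier.

    A lane qualifies when its raw BER is at least `ber_ratio` times the
    second-worst lane's raw BER and its RX power is more than `rx_gap`
    below the strongest lane. Returns None when no lane qualifies.
    """
    if len(raw_ber) < 2 or len(raw_ber) != len(rx_power_dbm):
        return None
    # Lane indices sorted from worst (highest) raw BER to best.
    order = sorted(range(len(raw_ber)), key=lambda i: raw_ber[i], reverse=True)
    worst, runner_up = order[0], order[1]
    ber_outlier = (raw_ber[worst] > 0
                   and raw_ber[worst] >= ber_ratio * raw_ber[runner_up])
    rx_low = max(rx_power_dbm) - rx_power_dbm[worst] > rx_gap
    return worst if ber_outlier and rx_low else None
```

When the function returns a lane index, the "issue is confirmed on the optical line" message would apply; when it returns None, one of the other BER + Optical Check rows would be the candidate.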
XDR BER Anomaly Handling
For the Q3400_RA IB XDR switch, BER anomaly reporting follows a different path than for other protocols. It uses a two-phase BER evaluation with a sliding window mechanism to reduce false positives and provide severity-graded anomalies. This switch has 4 planes and generates AMBER telemetry separately for each plane. The grade is calculated per port and per plane, and a bad grade on any plane creates an anomaly for the port.
The two phases are determined by time_since_last_clear:
- Pre-staging phase (first 4 hours after counters are cleared): Used for validating newly installed or reseated hardware. Individual sample grades are evaluated directly; a single bad sample is enough to trigger an anomaly.
- On-going phase (after 4 hours): Uses a BER sliding window that accumulates sample grades over time. Anomalies are only raised when multiple bad samples are observed within the window, and severity escalates from Warning to Error based on the count and pattern of bad samples.
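The phase split can be expressed as a one-line comparison. In this sketch the function name `ber_phase` is hypothetical, and passing time_since_last_clear in hours is an assumption; the text does not state the unit the tool uses internally.

```python
PRE_STAGING_HOURS = 4.0  # boundary taken from the phase descriptions above


def ber_phase(time_since_last_clear_hours: float) -> str:
    """Return which BER evaluation phase applies to a sample."""
    if time_since_last_clear_hours <= PRE_STAGING_HOURS:
        return "pre-staging"  # single bad sample is enough for an anomaly
    return "on-going"         # sliding-window evaluation applies
```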
All XDR BER anomalies appear under the syndrome "Anomalous-port (Signal, Temperature)" in the UI, with the phase and severity indicated by a prefix in the remediation action text (Pre-staging:, Warning:, or Error:).
| Stage | Message | Meaning |
|---|---|---|
| Pre-staging phase (time_since_last_clear <= 4 hours) | Pre-staging: Warning: Link down counter changed since last check. | Grade is |
| | Pre-staging: Link Bit Error Rate is high. Reseat the connector. | Grade is |
| On-going phase (time_since_last_clear > 4 hours) | Warning: Link Bit Error Rate is high. Reseat the connector. | 2-3 marked (bad BER) sampling windows within the current window duration. |
| | Error: Link Bit Error Rate is high. Reseat the connector. | 4+ marked sampling windows, OR 2 consecutive windows at Warning severity. Persists until a full Good window completes. |
Anomaly Clearing Behavior
During the on-going phase, a BER window spans 4 hours. A Good window is one with 0-1 marked (bad BER) samples. Once a port reaches Warning or Error severity, the anomaly persists until a full Good window completes. Mid-window, the previous window's severity carries forward even if no new bad samples have been observed yet. When a Good window completes, the history is cleared and the port returns to normal.
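The on-going thresholds and the clearing rule can be sketched as a small state tracker. This is an interpretation rather than the tool's actual logic: the class `BerWindowTracker` and its method names are hypothetical, the caller is assumed to invoke `close_window()` once per 4-hour window, and any non-Good window that follows a Warning or Error window is treated as Error. The text does not say whether escalation happens mid-window or only when a window closes; this sketch escalates as soon as the counts allow.

```python
from typing import Optional

RANK = {None: 0, "Warning": 1, "Error": 2}


class BerWindowTracker:
    """Tracks on-going phase severity for one port across BER windows."""

    def __init__(self) -> None:
        self.carried: Optional[str] = None  # severity of last completed window
        self.marked = 0                     # bad BER samples in current window

    def add_sample(self, bad: bool) -> None:
        if bad:
            self.marked += 1

    def _grade(self) -> Optional[str]:
        if self.marked <= 1:
            return None  # 0-1 marked samples: Good window
        if self.marked >= 4 or self.carried is not None:
            return "Error"  # 4+ marks, or a bad window right after a bad one
        return "Warning"    # 2-3 marks with a clean history

    def current_severity(self) -> Optional[str]:
        """Severity mid-window: the carried value, or higher if already earned."""
        live = self._grade()
        return live if RANK[live] > RANK[self.carried] else self.carried

    def close_window(self) -> Optional[str]:
        """Finish the window; a Good window clears the history."""
        self.carried = self._grade()
        self.marked = 0
        return self.carried
```

With this sketch, a port that closes a window at Warning keeps reporting Warning throughout the next window even with no new bad samples, and returns to normal only when that next window closes with 0-1 marked samples.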