InfiniBand Cluster Bring-up Procedure

UFM Telemetry

Unified Fabric Manager (UFM) Telemetry collects more than 120 unique counters (BER, temperature, histograms, retransmissions, and many more) for each port in the InfiniBand fabric. These counters let the user predict which cables are marginal and should be replaced during the bring-up process to avoid malfunctions in the future.
The tool collects data samples from all ports across the cluster and saves the data in a CSV file.

To collect InfiniBand Link Quality metrics, perform the following:

curl http://{machine_ip}:9002/csv/xcset/low_freq_debug >> my_telemetry_file.csv
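
For scripted collection, the same request can be issued from Python. This is a minimal sketch using only the standard library. The helper names telemetry_url and append_telemetry are illustrative and not part of UFM; the port and path are copied from the curl command above. Like ">>", the sketch appends to the output file, so repeated runs accumulate samples.

```python
import urllib.request


def telemetry_url(machine_ip, port=9002, path="/csv/xcset/low_freq_debug"):
    """Build the telemetry CSV URL used by the curl command."""
    return f"http://{machine_ip}:{port}{path}"


def append_telemetry(machine_ip, out_path="my_telemetry_file.csv", timeout=30):
    """Fetch one CSV sample and append it to out_path (like `curl ... >> file`)."""
    with urllib.request.urlopen(telemetry_url(machine_ip), timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "ab") as f:  # binary append, no re-encoding of the CSV
        f.write(data)
    return len(data)  # number of bytes written
```

Calling append_telemetry("192.0.2.10") would fetch one sample from a host at that (placeholder) address and return the number of bytes appended.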

Example (my_telemetry_file.csv):

[Image: sample contents of my_telemetry_file.csv (image-2024-3-26_18-28-20.png)]

The following sections list the key link monitoring indicators, with a description and evaluation criteria for each parameter.

Link State

Parameter: Phy_state
Description: Physical link state.
Evaluation Criteria: Verify that the link is up (enumeration value = 5).
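
The link state check can be sketched as a scan over the collected CSV. This example assumes the CSV has a column named Phy_state, matching the parameter name above; the actual column spelling should be confirmed against the file. The identifier columns node_guid and port_num in the sample data are hypothetical.

```python
import csv
import io

LINK_UP = 5  # enumeration value for "link up" per the criteria above


def ports_not_up(csv_text, state_column="Phy_state"):
    """Return the CSV rows whose physical link state is not link up."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row[state_column]) != LINK_UP]


# Hypothetical sample: identifier column names are assumptions.
sample = (
    "node_guid,port_num,Phy_state\n"
    "0xaaaa,1,5\n"
    "0xaaaa,2,2\n"
)

for row in ports_not_up(sample):
    print(row["node_guid"], row["port_num"], row["Phy_state"])  # prints: 0xaaaa 2 2
```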

Link Quality

Parameter: NDR Link Quality
Description: Link quality criteria depend on the error correction scheme type.
Evaluation Criteria:

                                | Error Correction Scheme Type | Media Type  | Post-FEC                       | Symbol
                                |                              |             | Normal   | Warning  | Error    | Normal   | Warning  | Error
Default for DAC/ACC/AOC < 100 m | Low_Latency_RS_FEC_PLR       | DAC/ACC/AOC | 1.00E-12 | 5.00E-12 | 1.00E-11 | 1.00E-15 | 5.00E-15 | 1.00E-14
Default for AOC > 100 m         | KP4_Standard_RS_FEC          | AOC         | 1.00E-15 | 5.00E-15 | 1.00E-14 | 1.00E-15 | 5.00E-15 | 1.00E-14

DAC - direct attach copper
ACC - active copper cable
AOC - active optical cable

Note: The minimum port uptime for a BER measurement is 125 minutes.
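
The NDR thresholds and the uptime note can be combined into a simple classifier. This is a sketch under stated assumptions, not an official rule: the source lists one value per level without saying how values between levels are treated, so here a BER at or below the Normal value is "normal", a BER at or above the Error value is "error", and anything in between is "warning". The function name and the "post_fec"/"symbol" keys are illustrative.

```python
MIN_UPTIME_MINUTES = 125  # minimum port uptime for a valid BER measurement

# (normal, warning, error) values copied from the NDR table.
NDR_THRESHOLDS = {
    "Low_Latency_RS_FEC_PLR": {
        "post_fec": (1.00e-12, 5.00e-12, 1.00e-11),
        "symbol": (1.00e-15, 5.00e-15, 1.00e-14),
    },
    "KP4_Standard_RS_FEC": {
        "post_fec": (1.00e-15, 5.00e-15, 1.00e-14),
        "symbol": (1.00e-15, 5.00e-15, 1.00e-14),
    },
}


def classify_ber(ber, fec_scheme, kind, uptime_minutes):
    """Classify a BER sample; `kind` is "post_fec" or "symbol"."""
    if uptime_minutes < MIN_UPTIME_MINUTES:
        return "insufficient uptime"
    # The warning value is kept for reference; this sketch treats every
    # value between normal and error as a warning (an assumption).
    normal, _warning, error = NDR_THRESHOLDS[fec_scheme][kind]
    if ber <= normal:
        return "normal"
    if ber >= error:
        return "error"
    return "warning"
```

For example, classify_ber(3e-12, "Low_Latency_RS_FEC_PLR", "post_fec", 200) returns "warning", while the same sample on a port that has been up for only 60 minutes returns "insufficient uptime".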

Parameter: XDR Link Quality
Description: Link quality criteria depend on the error correction scheme type.
Evaluation Criteria:

******** NOT OFFICIAL THRESHOLDS ********

[Image: XDR link quality thresholds (image-2025-7-2_16-42-31.png)]

PHY Errors

Parameter: Link_Down counter
Description: Total number of link down events that occurred as a result of an involuntary link shutdown.
Evaluation Criteria: If the delta from the last sample is > 0:

  • Trace the event and record the switch, port, date and time, and link down counter.

  • If the same switch and port have at least 2 link down occurrences within 24 hours, further investigation is required.

  • Note: Make sure the link down was due to an involuntary port down from the partner side (e.g. not due to a partner server reboot). The criteria are intended to catch major link down events.
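
The Link_Down criteria above can be sketched as two steps: derive events from counter deltas, then flag ports with at least 2 occurrences inside a 24-hour window. The sample layout (switch, port, timestamp, counter) and both function names are assumptions for illustration. Counter decreases (for example after a counter reset) are ignored, and the sketch cannot tell voluntary from involuntary events, so that check remains manual.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def link_down_events(samples):
    """Yield (switch, port, timestamp, counter, delta) for each positive delta."""
    last = {}
    events = []
    for switch, port, ts, counter in sorted(samples, key=lambda s: s[2]):
        key = (switch, port)
        if key in last and counter - last[key] > 0:
            events.append((switch, port, ts, counter, counter - last[key]))
        last[key] = counter
    return events


def ports_needing_investigation(events, window=timedelta(hours=24), min_occurrences=2):
    """Return the (switch, port) pairs with >= min_occurrences link downs in the window."""
    by_port = defaultdict(list)
    for switch, port, ts, _counter, delta in events:
        by_port[(switch, port)].append((ts, delta))
    flagged = set()
    for key, evs in by_port.items():
        evs.sort()
        start, total = 0, 0
        for ts, delta in evs:
            total += delta
            # Drop events that fell out of the sliding window.
            while ts - evs[start][0] > window:
                total -= evs[start][1]
                start += 1
            if total >= min_occurrences:
                flagged.add(key)
                break
    return flagged


t0 = datetime(2024, 3, 26, 12, 0)
samples = [
    ("sw1", 1, t0, 0),
    ("sw1", 1, t0 + timedelta(hours=1), 1),
    ("sw1", 1, t0 + timedelta(hours=5), 2),
    ("sw2", 3, t0, 0),
    ("sw2", 3, t0 + timedelta(hours=1), 1),
    ("sw2", 3, t0 + timedelta(hours=30), 2),
]
print(ports_needing_investigation(link_down_events(samples)))  # prints: {('sw1', 1)}
```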

Cable Information

Parameter: Module_Temperature
Description: Temperature of the transceiver (optical transceivers only).
Evaluation Criteria: Each transceiver has its own warning and alarm thresholds. Typical values are Warning [70°C, 0°C] and Alarm [80°C, -10°C], given as [high, low].
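
A temperature reading can be checked against these thresholds as follows. This sketch uses the typical values quoted above as defaults; since each transceiver carries its own thresholds, those should be passed in when available. The function name and the inclusive comparison at the boundaries are assumptions.

```python
def classify_module_temperature(temp_c, warning=(70.0, 0.0), alarm=(80.0, -10.0)):
    """Classify a module temperature in °C; thresholds are (high, low) pairs."""
    warn_high, warn_low = warning
    alarm_high, alarm_low = alarm
    if temp_c >= alarm_high or temp_c <= alarm_low:
        return "alarm"
    if temp_c >= warn_high or temp_c <= warn_low:
        return "warning"
    return "normal"
```

With the default thresholds, a reading of 72 is classified as "warning" and a reading of 85 as "alarm".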


Parameter: rx_power_lane_x and tx_power_lane_x
Description: Rx power and Tx power per transceiver lane (optical transceivers only).
Evaluation Criteria: Each transceiver has its own alarm and warning thresholds.

