IBUtils2 Utility Documentation

Bit Error Rate (BER)

The Bit Error Rate (BER) is the number of bit errors divided by the total number of transferred bits during a studied time interval. BER is a unitless performance measure, often expressed as a percentage.
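As a minimal illustration of this ratio, the sketch below divides an error count by a bit count. The function name and the sample numbers are assumptions made for the example; they are not part of ibdiagnet.

```python
def bit_error_ratio(bit_errors: int, bits_transferred: int) -> float:
    """Return the number of bit errors divided by the total transferred bits."""
    if bits_transferred <= 0:
        raise ValueError("bits_transferred must be positive")
    return bit_errors / bits_transferred


# Hypothetical sample: 5 errored bits out of 10^12 transferred bits.
ber = bit_error_ratio(5, 10**12)
print(f"{ber:.6e}")  # 5.000000e-12
```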

Parameter: --get_phy_info
Description: Collects BER information for fabric ports and validates it against specific thresholds. Errors are reported in the ibdiagnet2.log and ibdiagnet2.db_csv files.
Notes: Applicable to all EDR/HDR and future InfiniBand devices.

Parameter: --ber_test
Description: Deprecated. Provides a BER test for each port: calculates the BER of each port and checks that no BER value exceeds the BER threshold (default threshold: 10^-12).
Notes: This option is available only when using SwitchX/ConnectX-4 and ConnectX-3 devices.

Parameter: --ber_thresh <value>
Description: Deprecated. Specifies the threshold value for the BER test, given as the reciprocal of the BER. For example, for a threshold of 10^-12, the value should be 1000000000000 or 0xe8d4a51000 (10^12). If the given threshold is 0, the BER values of all ports are reported.
Notes: This option is available only when using SwitchX/ConnectX-4 and ConnectX-3 devices.

Parameter: --llr_active_cell <64|128>
Description: Deprecated. Specifies the Link Level Retransmission (LLR) active cell size for the BER test when LLR is active in the fabric.
Notes: This option is available only when using SwitchX/ConnectX-4 and ConnectX-3 devices.

Example: 

ibdiagnet --get_phy_info
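The reciprocal convention described for --ber_thresh can be checked with a short calculation. This is a sketch only: the helper name is hypothetical, and it covers positive thresholds whose reciprocal is a whole number.

```python
from fractions import Fraction


def ber_thresh_argument(threshold: Fraction) -> int:
    """Return the reciprocal of a positive BER threshold as an integer.

    The special value 0 mentioned in the description is passed to the
    option as-is and is not handled by this helper.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    reciprocal = 1 / threshold
    if reciprocal.denominator != 1:
        raise ValueError("reciprocal of the threshold is not a whole number")
    return int(reciprocal)


# A threshold of 10^-12 corresponds to the value 10^12.
value = ber_thresh_argument(Fraction(1, 10**12))
print(value)       # 1000000000000
print(hex(value))  # 0xe8d4a51000
```

Exact fractions are used instead of floats so that the reciprocal does not pick up rounding error.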

Fabric Health Validation Example

For NDR/HDR/EDR links, symbol errors (NDR/HDR) or effective errors (EDR) are the actual errors seen by the application level after error correction.

The following methodology is recommended as a first step if fabric performance is degraded:

  1. Make sure significant traffic is running in the fabric.

  2. Run: ibdiagnet --pc  --reset_phy_info  -i  <mlx_dev>

  3. Wait for some time (5-10 minutes).

  4. Run: ibdiagnet --get_phy_info  -i  <mlx_dev>

  5. Review ibdiagnet2.log.

  6. Contact Support if the Symbol/Effective BER Check finished with errors.

For a detailed description of the command-line parameters, see the previous section, “Bit Error Rate (BER)”.
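The reset, wait, and collect steps above could be scripted roughly as follows. This is a sketch under stated assumptions: the device name mlx5_0 stands in for <mlx_dev>, the function names are hypothetical, and the exit-status convention of ibdiagnet is not described here, so the sketch does not treat a non-zero status as a failure. The example at the end is a dry run that records the commands instead of executing them.

```python
import subprocess
import time


def reset_command(device: str) -> list[str]:
    """Command for step 2: reset the counters and PHY information."""
    return ["ibdiagnet", "--pc", "--reset_phy_info", "-i", device]


def collect_command(device: str) -> list[str]:
    """Command for step 4: collect BER information."""
    return ["ibdiagnet", "--get_phy_info", "-i", device]


def run_ber_check(device, wait_seconds=600, run=subprocess.run, sleep=time.sleep):
    """Reset, wait while traffic runs, then collect BER information.

    `run` and `sleep` are injectable so the sequence can be dry-run.
    """
    run(reset_command(device), check=False)
    sleep(wait_seconds)  # 5-10 minutes is the interval suggested above
    return run(collect_command(device), check=False)


# Dry run with a hypothetical device name: print the commands only.
recorded = []
run_ber_check(
    "mlx5_0",
    wait_seconds=0,
    run=lambda cmd, **kwargs: recorded.append(cmd),
    sleep=lambda seconds: None,
)
for cmd in recorded:
    print(" ".join(cmd))
```

Calling run_ber_check("mlx5_0") without the run and sleep overrides would execute the two commands for real, after which ibdiagnet2.log still needs to be reviewed as in step 5.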

BER check log file fragment:

-E- Symbol BER Check finished with errors 
-E- H-10/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 
-E- H-14/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 
-E- H-3/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_544_514_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12 
-E- H-7/U1/P1 - BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_271_257_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12 
-E- SW-1-0/U1/P4 - BER exceeds threshold - BER type: Symbol BER, FEC mode: RS_FEC_544_514, BER value = 1.500000e+01 / threshold = 5.000000e-12 
-E- SW-1-0/U1/P5 - BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 
 
--------------------------------------------- 
Fabric Summary 
 
Total Nodes             : 24 
IB Switches             : 8 
IB Channel Adapters     : 16 
IB Aggregation Nodes    : 0 
IB Routers              : 0 
 
Total number of links   : 32 
Links at 4x10           : 32 
 
High BER reported by 6 ports
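When reviewing a large log, the per-port error lines can be pulled out with a regular expression. The sketch below is modeled only on the line layout in the fragment above; the pattern and the function name are assumptions, and other ibdiagnet versions may format these lines differently.

```python
import re

# Pattern modeled on the "-E- ... BER exceeds threshold" lines shown above.
BER_LINE = re.compile(
    r"^-E- (?P<port>\S+) - BER exceeds threshold - "
    r"BER type: (?P<ber_type>[^,]+), FEC mode: (?P<fec_mode>[^,]+), "
    r"BER value = (?P<value>\S+) / threshold = (?P<threshold>\S+)"
)


def parse_ber_errors(lines):
    """Return one dict per line that reports a BER above its threshold."""
    errors = []
    for line in lines:
        match = BER_LINE.match(line.strip())
        if match is None:
            continue  # summary lines and unrelated output are skipped
        record = match.groupdict()
        record["value"] = float(record["value"])
        record["threshold"] = float(record["threshold"])
        errors.append(record)
    return errors


# Two lines copied from the fragment above; only the second one matches.
sample = [
    "-E- Symbol BER Check finished with errors ",
    "-E- H-10/U1/P1 - BER exceeds threshold - BER type: Symbol BER, "
    "FEC mode: STD-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 ",
]
for error in parse_ber_errors(sample):
    print(error["port"], error["ber_type"], error["fec_mode"])
```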

BER check error section in db_csv file: 

START_ERRORS_SYMBOL_BER_CHECK
Scope,NodeGUID,PortGUID,PortNumber,EventName,Summary
PORT,0x0002c90000000005,0x0002c90000000006,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000015,0x0002c90000000016,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000025,0x0002c90000000026,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_544_514_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000035,0x0002c90000000036,1,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: MLNX_RS_271_257_PLR, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000049,0x0002c90000000049,4,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: RS_FEC_544_514, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
PORT,0x0002c90000000049,0x0002c90000000049,5,BER_EXCEEDS_THRESHOLD,"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-LL-RS, BER value = 1.500000e+01 / threshold = 5.000000e-12 "
END_ERRORS_SYMBOL_BER_CHECK
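The same errors can be read from the db_csv file by extracting the rows between the START_ and END_ markers. This sketch assumes the section layout shown above (marker line, header row, data rows, marker line); the function name is hypothetical.

```python
import csv


def read_section(lines, name):
    """Return the rows of one START_<name> ... END_<name> section as dicts."""
    start, end = f"START_{name}", f"END_{name}"
    body, inside = [], False
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            inside = True
            continue
        if stripped == end:
            break
        if inside and stripped:
            body.append(stripped)
    # The first collected line is the header row; csv handles the quoted Summary.
    return list(csv.DictReader(body))


# Section excerpt copied from the example above (one data row).
sample = [
    "START_ERRORS_SYMBOL_BER_CHECK",
    "Scope,NodeGUID,PortGUID,PortNumber,EventName,Summary",
    'PORT,0x0002c90000000005,0x0002c90000000006,1,BER_EXCEEDS_THRESHOLD,'
    '"BER exceeds threshold - BER type: Symbol BER, FEC mode: STD-RS, '
    'BER value = 1.500000e+01 / threshold = 5.000000e-12 "',
    "END_ERRORS_SYMBOL_BER_CHECK",
]
for row in read_section(sample, "ERRORS_SYMBOL_BER_CHECK"):
    print(row["NodeGUID"], row["PortNumber"], row["EventName"])
```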
