NVIDIA NVOS User Manual for InfiniBand Switches

Link Diagnostic Per Port

When debugging a system, it is important to quickly identify the root cause of a problem. The diagnostic commands provide insight into the physical-layer components, showing information such as the cable status (plugged/unplugged) or whether Auto-Negotiation has failed.

PHY Firmware Indication

Link Diagnostic Indication

Code        Firmware PHY Indication (0–1023)
0           No issue observed
1           Port is closed by command
2–4         Auto-Negotiation failure
5–8         Link training failure
9–13        Logical mismatch between link partners
14          Remote fault received
15          Bad signal integrity
16          Compliance code mismatch (protocol mismatch between cable and port)
17          Bad signal integrity
18          Internal error
19          Internal error
22          Internal error
23          Internal error
24–32       Cable compliance code mismatch (protocol mismatch between cable and port)
34          Speed degradation
35          Speed degradation
38          Auto-Negotiation failure
39          Auto-Negotiation failure
40          VPI protocols do not match
41          Port is closed, module cannot be set to the enabled rate
42          Bad signal integrity
48          Bad signal integrity
49          Bad signal integrity
50          Internal error
52          Bad signal integrity
55          Internal error
56          module_lanes_frequency_not_synced
57          Signal not detected
60          No partner detected for a long time
128         Troubleshooting in progress
1023        Information not available
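
Because several indications cover a range of codes (for example 2–4 or 5–8), a lookup has to match on intervals rather than on single values. The following is a minimal sketch of such a lookup, assuming the code has already been read from the switch as an integer. The names `PHY_INDICATIONS` and `describe_phy_code` are hypothetical, and the list is deliberately abridged to a few rows of the table above.

```python
from typing import Optional

# Abridged, illustrative copy of a few Firmware PHY Indication rows:
# (first_code, last_code, description). This is not the complete table.
PHY_INDICATIONS = [
    (0, 0, "No issue observed"),
    (1, 1, "Port is closed by command"),
    (2, 4, "Auto-Negotiation failure"),
    (5, 8, "Link training failure"),
    (9, 13, "Logical mismatch between link partners"),
    (14, 14, "Remote fault received"),
    (15, 15, "Bad signal integrity"),
    (57, 57, "Signal not detected"),
    (1023, 1023, "Information not available"),
]


def describe_phy_code(code: int) -> Optional[str]:
    """Return the description for a code, or None if it is not in the abridged list."""
    for first, last, description in PHY_INDICATIONS:
        if first <= code <= last:
            return description
    return None


print(describe_phy_code(3))   # Auto-Negotiation failure
print(describe_phy_code(20))  # None (code 20 is not listed)
```

Returning `None` for unlisted codes keeps gaps in the table (such as 20 and 21) distinguishable from code 1023, which explicitly means that no information is available.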

Code        Firmware Management Issues (1024–2047)
1024        Cable is unplugged
1025        Long range for non-NVIDIA cable/module
1026        Bus stuck (I2C data or clock shorted)
1027        Bad/unsupported EEPROM
1028        Part number list
1029        Unsupported cable
1030        Module temperature shutdown
1031        Shorted cable
1032        Power budget exceeded
1033        Management forced the port down
1034        Module is disabled by command
1035        System power is exceeded, therefore the module is powered off
1036        Module’s PMD type is not enabled (see PMTPS)
1040        PCIe system power slot exceeded
1042        Module state machine fault
1043–1046   Module’s stamping speed degeneration
1047, 1048  Module’s DataPath FSM fault
1050–1053   Module boot error
1054        Module forced to low power by command
1055        ELS laser fiber is contaminated
1056        ELS laser failure
1057        ELS cable unplugged
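
The two tables above split the code space by numeric range: 0–1023 for firmware PHY indications and 1024–2047 for firmware management issues. As a rough sketch, a script that post-processes diagnostic output could use that split to decide which table to consult. The function name `indication_group` is hypothetical and is not part of NVOS.

```python
def indication_group(code: int) -> str:
    """Map a link diagnostic code to the table that documents it."""
    if 0 <= code <= 1023:
        return "Firmware PHY Indication"
    if 1024 <= code <= 2047:
        return "Firmware Management Issues"
    # Codes outside both documented ranges are treated as invalid input here.
    raise ValueError(f"code {code} is outside the documented ranges")


print(indication_group(57))    # Firmware PHY Indication
print(indication_group(1024))  # Firmware Management Issues
```

Raising an error for out-of-range values is a design choice of this sketch; a caller could equally return a placeholder string instead.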

Link Down Reason Indication

Code        Link Down Reason Indication
0           No_link_down_indication
1           Unknown_reason
2           Hi_SER_or_Hi_BER
3           Block_Lock_loss
4           Alignment_loss
5           FEC_sync_loss
6           PLL_lock_loss
7           FIFO_overflow
8           false_SKIP_condition
9           Minor_Error_threshold_exceeded
10          Physical_layer_retransmission_timeout
11          Heartbeat_errors
12          Link_Layer_credit_monitoring_watchdog
13          Link_Layer_integrity_threshold_exceeded
14          Link_Layer_buffer_overrun
15          Down_by_outband_command_with_healthy_link
16          Down_by_outband_command_for_link_with_hi_ber
17          Down_by_inband_command_with_healthy_link
18          Down_by_inband_command_for_link_with_hi_ber
19          Down_by_verification_GW
20          Received_Remote_Fault
21          Received_TS1
22          Down_by_management_command
23          Cable_was_unplugged
24          Cable_access_issue
25          Cable_Thermal_shutdown
26          Current_issue
27          Power_budget
28          Fast_recovery_raw_ber
29          Fast_recovery_effective_ber
30          Fast_recovery_symbol_ber
31          Fast_recovery_credit_watchdog
32          Peer_side_down_to_sleep_state
33          Peer_side_down_to_disable_state
34          Peer_side_down_to_disable_and_port_lock
35          Peer_side_down_due_to_thermal_event
36          Peer_side_down_due_to_force_event
37          Peer_side_down_due_to_reset_event
38          Reset_no_power_cycle
39          Fast_recovery_tx_plr_trigger
40          Down_due_to_HW_force_event
41          Down_due_to_thermal_event
42          L1_exit_failure
43          too_many_link_error_recoveries
44          Down_due_to_contain_mode
45          BW_loss_threshold_exceeded
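
The link down reason codes run contiguously from 0 to 45, so they can be held in an ordered sequence and looked up by index. The sketch below assumes the numeric code is already available as an integer. The helper names are hypothetical, and grouping reasons by the `Peer_side_down` name prefix is an illustrative convenience rather than a classification defined by the manual.

```python
from typing import Optional

# Reason names in code order; the position in the tuple is the code (0..45).
LINK_DOWN_REASONS = (
    "No_link_down_indication", "Unknown_reason", "Hi_SER_or_Hi_BER",
    "Block_Lock_loss", "Alignment_loss", "FEC_sync_loss", "PLL_lock_loss",
    "FIFO_overflow", "false_SKIP_condition", "Minor_Error_threshold_exceeded",
    "Physical_layer_retransmission_timeout", "Heartbeat_errors",
    "Link_Layer_credit_monitoring_watchdog",
    "Link_Layer_integrity_threshold_exceeded", "Link_Layer_buffer_overrun",
    "Down_by_outband_command_with_healthy_link",
    "Down_by_outband_command_for_link_with_hi_ber",
    "Down_by_inband_command_with_healthy_link",
    "Down_by_inband_command_for_link_with_hi_ber",
    "Down_by_verification_GW", "Received_Remote_Fault", "Received_TS1",
    "Down_by_management_command", "Cable_was_unplugged", "Cable_access_issue",
    "Cable_Thermal_shutdown", "Current_issue", "Power_budget",
    "Fast_recovery_raw_ber", "Fast_recovery_effective_ber",
    "Fast_recovery_symbol_ber", "Fast_recovery_credit_watchdog",
    "Peer_side_down_to_sleep_state", "Peer_side_down_to_disable_state",
    "Peer_side_down_to_disable_and_port_lock",
    "Peer_side_down_due_to_thermal_event",
    "Peer_side_down_due_to_force_event",
    "Peer_side_down_due_to_reset_event", "Reset_no_power_cycle",
    "Fast_recovery_tx_plr_trigger", "Down_due_to_HW_force_event",
    "Down_due_to_thermal_event", "L1_exit_failure",
    "too_many_link_error_recoveries", "Down_due_to_contain_mode",
    "BW_loss_threshold_exceeded",
)


def link_down_reason(code: int) -> Optional[str]:
    """Return the reason name for a code, or None if the code is not listed."""
    if 0 <= code < len(LINK_DOWN_REASONS):
        return LINK_DOWN_REASONS[code]
    return None


def is_peer_side_down(code: int) -> bool:
    """True if the reason name indicates the peer side brought the link down."""
    name = link_down_reason(code)
    return name is not None and name.startswith("Peer_side_down")


print(link_down_reason(23))   # Cable_was_unplugged
print(is_peer_side_down(35))  # True
```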

Link Diagnostic Commands
