NVIDIA UFM Enterprise User Manual

Appendix - Supported Port Counters and Events

Port counters and events are available in the following views:

  • Events and Port Counters area, at the bottom of the UFM window

  • Error window (Error tab) in the Manage Devices tab

  • New Monitoring Session window, opened by clicking Create New Session in the Monitor tab

  • Event Log in the Log tab (click Show Event Log)

InfiniBand Port Counters

The following tables list and describe the port counters and events currently supported:

  • InfiniBand Port Counters

  • Calculated Port Counters

InfiniBand Port Counters

Counter

Description

Xmit Data (in bytes)

Total number of data octets, divided by 4, transmitted on all VLs from the port. The count includes all octets between (and not including) the start of packet delimiter and the VCRC, and may include packets containing errors. All link packets are excluded. Results are reported as a multiple of four octets.

Rcv Data (in bytes)

Total number of data octets, divided by 4, received on all VLs at the port.

The count includes all octets between (and not including) the start of packet delimiter and the VCRC, and may include packets containing errors. All link packets are excluded. When the received packet length exceeds the maximum allowed packet length specified in C7-45, the counter may include all data octets exceeding this limit.

Results are reported as a multiple of four octets.

Xmit Packets

Total number of packets transmitted on all VLs from the port, including packets with errors and excluding link packets.

Rcv Packets

Total number of packets, including packets containing errors and excluding link packets, received from all VLs on the port.

Rcv Errors

Total number of packets containing errors that were received on the port, including:

  • Local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine)

  • Malformed data packet errors (LVer, length, VL)

  • Malformed link packet errors (operand, length, VL)

  • Packets discarded due to buffer overrun (overflow)

Xmit Discards

Total number of outbound packets discarded by the port when the port is down or congested for the following reasons:

  • Output port is not in the active state

  • Packet length has exceeded NeighborMTU

  • Switch Lifetime Limit exceeded

  • Switch HOQ Lifetime Limit exceeded, including packets discarded while in VLStalled State.

Symbol Errors

Total number of minor link errors detected on one or more physical lanes.

Link Error Recovery

Total number of times the Port Training state machine has successfully completed the link error recovery process.

Link Error Downed

Total number of times the Port Training state machine has failed the link error recovery process and downed the link.

Local Integrity Error

The number of times that the count of local physical errors exceeded the threshold specified by LocalPhyErrors.

Rcv Remote Physical Error

Total number of packets marked with the EBP delimiter received on the port.

Xmit Constraint Error

Total number of packets not transmitted from the switch physical port for the following reasons:

  • FilterRawOutbound is true and packet is raw

  • PartitionEnforcementOutbound is true and packet fails partition key check or IP version check

Rcv Constraint Error

Total number of packets received on the switch physical port that are discarded for the following reasons:

  • FilterRawInbound is true and packet is raw

  • PartitionEnforcementInbound is true and packet fails partition key check or IP version check

Excess Buffer Overrun Error

The number of times that OverrunErrors consecutive flow control update periods occurred, each having at least one overrun error.

Rcv Switch Relay Error

Total number of packets received on the port that were discarded when they could not be forwarded by the switch relay for the following reasons:

  • DLID mapping

  • VL mapping

  • Looping (output port = input port)

VL15 Dropped

Number of incoming VL15 packets dropped because of resource limitations (e.g., lack of buffers) in the port.

XmitWait

The number of ticks during which the port selected by PortSelect had data to transmit but no data was sent during the entire tick, because of either insufficient credits or lack of arbitration.
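
The Xmit Data and Rcv Data descriptions above state that results are reported as a multiple of four octets. The following is a minimal sketch of converting such a reading to octets, assuming the value you hold is the raw four-octet count rather than a figure that has already been converted. The function names are hypothetical and are not part of UFM; counter wraparound and resets are not handled.

```python
OCTETS_PER_UNIT = 4  # per the counter descriptions: data is counted in units of four octets


def data_counter_to_bytes(raw_units: int) -> int:
    """Convert a raw data-counter value (four-octet units) to octets."""
    if raw_units < 0:
        raise ValueError("counter values cannot be negative")
    return raw_units * OCTETS_PER_UNIT


def bytes_between(previous_units: int, current_units: int) -> int:
    """Octets transferred between two readings of the same counter.

    Assumes the counter did not wrap or reset between the readings;
    a smaller current reading raises ValueError.
    """
    return data_counter_to_bytes(current_units - previous_units)


if __name__ == "__main__":
    # Two hypothetical readings of one data counter.
    print(bytes_between(100, 350))
```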

InfiniBand Calculated Port Counters

Counter

Description

Normalized XmitData

Effective port bandwidth utilization, in percent:
XmitData incremental / Link Capacity

Normalized Congested Bandwidth

Amount of bandwidth that was suppressed due to congestion:
(XmitWait incremental / Time) * Link Capacity
Separate counters are used for Tier 4 ports and for the rest of the ports.
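
The two formulas above can be sketched as follows. This is an illustration under stated assumptions, not UFM's implementation: the XmitData increment is taken to be in four-octet units, Link Capacity is taken to be in bits per second, and "Time" in the congestion formula is taken to be the length of the sampling interval expressed in the same ticks as XmitWait. The function and parameter names are hypothetical.

```python
def normalized_xmit_data(xmit_data_delta_units: int,
                         interval_seconds: float,
                         link_capacity_bps: float) -> float:
    """Bandwidth utilization in percent over one sampling interval.

    Assumption: the XmitData increment is in four-octet units, so it is
    multiplied by 4 (octets) and by 8 (bits) before comparing it with the
    number of bits the link could have carried in the interval.
    """
    bits_sent = xmit_data_delta_units * 4 * 8
    capacity_bits = link_capacity_bps * interval_seconds
    return 100.0 * bits_sent / capacity_bits


def normalized_congested_bandwidth(xmit_wait_delta_ticks: int,
                                   interval_ticks: int,
                                   link_capacity_bps: float) -> float:
    """(XmitWait incremental / Time) * Link Capacity.

    Assumption: interval_ticks is the sampling interval expressed in the
    same tick unit as XmitWait, so the ratio is the fraction of the
    interval during which the port waited to transmit.
    """
    return (xmit_wait_delta_ticks / interval_ticks) * link_capacity_bps


if __name__ == "__main__":
    # Hypothetical sample: a 100 Gb/s link observed for one second.
    print(normalized_xmit_data(1_562_500_000, 1.0, 100e9))
    print(normalized_congested_bandwidth(250, 1000, 100e9))
```

Keeping the unit conversions inside the functions makes the assumptions explicit; if your counter source already reports bytes or a different tick base, adjust the multipliers accordingly.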

Supported Traps and Events

Device events are listed as VDM or CDM in the Source column of the Events table in the UFM GUI. For information about defining event policy, see Configuring Event Management.

| Alarm ID | Alarm Name | To Log | Alarm | Default Severity | Default Threshold | Default TTL | Related Object | Category | Description/Message |
|---|---|---|---|---|---|---|---|---|---|
| 64 | GID Address In Service | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | |
| 65 | GID Address Out of Service | 1 | 0 | Warning | 1 | 300 | Port | Fabric Notification | |
| 66 | New MCast Group Created | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | |
| 67 | MCast Group Deleted | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | |
| 110 | Symbol Error | 1 | 1 | Warning | 200 | 300 | Port | Hardware | |
| 111 | Link Error Recovery | 1 | 1 | Minor | 1 | 300 | Port | Hardware | |
| 112 | Link Downed | 1 | 1 | Critical | 1 | 300 | Port | Hardware | |
| 113 | Port Receive Errors | 1 | 1 | Minor | 5 | 300 | Port | Hardware | |
| 114 | Port Receive Remote Physical Errors | 0 | 0 | Minor | 5 | 300 | Port | Hardware | |
| 115 | Port Receive Switch Relay Errors | 1 | 1 | Minor | 999 | 300 | Port | Fabric Configuration | |
| 116 | Port Xmit Discards | 1 | 1 | Minor | 200 | 300 | Port | Communication Error | |
| 117 | Port Xmit Constraint Errors | 1 | 1 | Minor | 200 | 300 | Port | Communication Error | |
| 118 | Port Receive Constraint Errors | 1 | 1 | Minor | 200 | 300 | Port | Communication Error | |
| 119 | Local Link Integrity Errors | 1 | 1 | Minor | 5 | 300 | Port | Hardware | |
| 120 | Excessive Buffer Overrun Errors | 1 | 1 | Minor | 100 | 300 | Port | Communication Error | |
| 121 | VL15 Dropped | 1 | 1 | Minor | 50 | 300 | Port | Communication Error | |
| 122 | Congested Bandwidth (%) Threshold Reached | 1 | 1 | Minor | 10 | 300 | Port | Hardware | |
| 131 | Non-optimal link width (1X instead of 4X) | 1 | 1 | Minor | 1 | 0 | Port | Hardware | |
| 132 | Non-optimal link width (1X or 4X instead of 12X) | 1 | 1 | Minor | 1 | 0 | Port | Hardware | |
| 140 | Excessive Buffer Overrun Threshold Reached | 1 | 0 | Minor | 11 | 300 | Port | Hardware | |
| 141 | Flow Control Update Watchdog Timer Expired | 1 | 0 | Warning | 1 | 300 | Port | Hardware | |
| 144 | Capability Mask Modified | 1 | 0 | Info | 1 | 300 | Port | Fabric Notification | |
| 145 | System Image GUID changed | 1 | 0 | Info | 1 | 300 | Port | Communication Error | |
| 256 | Bad M_Key | 1 | 0 | Minor | 1 | 300 | Port | Security | |
| 257 | Bad P_Key | 1 | 0 | Minor | 1 | 300 | Port | Security | |
| 258 | Bad Q_Key | 1 | 0 | Minor | 1 | 300 | Port | Security | |
| 259 | Bad P_Key Switch External Port | 1 | 0 | Critical | 1 | 300 | Port | Security | |
| 301 | Logical Server State Changed | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 302 | Logical Server State Change Failed | 1 | 0 | Minor | 1 | 0 | Logical Server | Logical Model | |
| 306 | Logical Server Added | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 307 | Logical Server Removed | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 308 | Logical Server Resources Allocated | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 312 | Compute Resource Released | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 313 | Compute Resource Allocated | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 314 | Logical Server Additional Resources Allocated | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 315 | Logical Server Resources Released | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 316 | Logical Server Compute Resource is Down | 1 | 1 | Critical | 1 | 0 | Logical Server | Logical Model | |
| 317 | Logical Server Compute Resource is Up | 1 | 1 | Warning | 1 | 0 | Logical Server | Logical Model | |
| 328 | Link is Up | 1 | 0 | Info | 1 | 0 | Link | Fabric Topology | |
| 329 | Link is Down | 1 | 0 | Warning | 1 | 0 | Link | Fabric Topology | |
| 331 | Node is Down | 1 | 0 | Warning | 1 | 0 | Site | Fabric Topology | |
| 332 | Node is Up | 1 | 0 | Info | 1 | 300 | Site | Fabric Topology | |
| 336 | Port Action Succeeded | 1 | 0 | Info | 1 | 0 | Port | Maintenance | |
| 337 | Port Action Failed | 1 | 0 | Minor | 1 | 0 | Port | Maintenance | |
| 338 | Device Action Succeeded | 1 | 0 | Info | 1 | 0 | Port | Maintenance | |
| 339 | Device Action Failed | 1 | 0 | Minor | 1 | 0 | Port | Maintenance | |
| 340 | Network Interface Added | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 341 | Network Interface Removed | 1 | 0 | Info | 1 | 0 | Logical Server | Logical Model | |
| 350 | Environment Added | 1 | 0 | Info | 1 | 0 | Env | Logical Model | |
| 351 | Environment Removed | 1 | 0 | Info | 1 | 0 | Env | Logical Model | |
| 352 | Network Added | 1 | 0 | Info | 1 | 0 | Network | Logical Model | |
| 353 | Network Removed | 1 | 0 | Info | 1 | 0 | Network | Logical Model | |
| 370 | Gateway Ethernet Link State Changed | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | |
| 371 | Gateway Reregister Event Received | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | |
| 372 | Number of Gateways Changed | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | |
| 373 | Gateway will be Rebooted | 1 | 0 | Warning | 1 | 0 | Gateway | Gateway | |
| 374 | Gateway Reloading Finished | 1 | 0 | Info | 1 | 0 | Gateway | Gateway | |
| 381 | Switch Upgrade Failed | 1 | 0 | Info | 1 | 0 | Switch | Maintenance | |
| 383 | Host Upgrade Failed | 1 | 0 | Info | 1 | 0 | Computer | Maintenance | |
| 385 | Switch FW Upgrade Started | 1 | 0 | Info | 1 | 0 | Switch | Maintenance | |
| 386 | Switch SW Upgrade Started | 1 | 0 | Info | 1 | 0 | Switch | Maintenance | |
| 388 | Host FW Upgrade Started | 1 | 0 | Info | 1 | 0 | Computer | Maintenance | |
| 389 | Host SW Upgrade Started | 1 | 0 | Info | 1 | 0 | Computer | Maintenance | |
| 391 | Switch Module Removed | 1 | 0 | Info | 1 | 0 | Switch | Fabric Notification | |
| 392 | Module Temperature Threshold Reached | 1 | 0 | Info | 40 | 0 | Module | Hardware | |
| 394 | Module Status FAULT | 1 | 1 | Critical | 1 | 420 | Switch | Module Status | |
| 502 | Device Upgrade Finished | 1 | 0 | Info | 1 | 300 | Device | Maintenance | |
| 545 | SM is not responding | 1 | 1 | Critical | 1 | 300 | Grid | Maintenance | |
| 560 | User Connected | | | | | | | Security | |
| 561 | User Disconnected | | | | | | | Security | |
| 602 | UFM Server Failover | 1 | 1 | Critical | 1 | 0 | Site | Fabric Notification | |
| 701 | Non-optimal Link Speed | 1 | 1 | Minor | 1 | 0 | Port | Hardware | |
| 907 | Switch is Down | 1 | 1 | Critical | 1 | 0 | Site | Fabric Topology | |
| 908 | Switch is Up | 1 | 1 | Info | 1 | 300 | Site | Fabric Topology | |
| 909 | Director Switch is Down | 1 | 1 | Critical | 1 | 300 | Site | Fabric Topology | |
| 910 | Director Switch is Up | 1 | 1 | Info | 1 | 0 | Site | Fabric Topology | |
| 911 | Module Temperature Low Threshold Reached | 1 | 1 | Warning | 60 | 300 | Module | Hardware | |
| 912 | Module Temperature High Threshold Reached | 1 | 1 | Critical | 60 | 300 | Module | Hardware | |
| 913 | Module High Voltage | 1 | 1 | Warning | 10 | 420 | Switch | Module Status | |
| 914 | Module High Current | 1 | 1 | Warning | 10 | 420 | Switch | Module Status | |
| 915 | BER_ERROR | 1 | 1 | Critical | 1e-8 | 420 | Port | Hardware | |
| 916 | BER_WARNING | 1 | 1 | Warning | 1e-13 | 420 | Port | Hardware | |
| 917 | SYMBOL_BER_ERROR | 1 | 1 | Critical | | 420 | Port | Hardware | |
| 1300 | SM_SAKEY_VIOLATION | 1 | 1 | Warning | | 5300 | Port | Security | |
| 1301 | SM_SGID_SPOOFED | 1 | 1 | Warning | | 5300 | Port | Security | |
| 1302 | SM_RATE_LIMIT_EXCEEDED | 1 | 1 | Warning | | 5300 | Port | Security | |
| 1303 | SM_MULTICAST_GROUPS_LIMIT_EXCEEDED | 1 | 1 | Warning | | 5300 | Port | Security | |
| 1304 | SM_SERVICES_LIMIT_EXCEEDED | 1 | 1 | Warning | | 5300 | Port | Security | |
| 1305 | SM_EVENT_SUBSCRIPTION_LIMIT_EXCEEDED | 1 | 1 | Warning | | 5300 | Port | Security | |
| 1500 | New cable detected | 1 | 0 | Info | 1 | 0 | Link | Security | |
| 1502 | Cable detected in a new location | 1 | 0 | Warning | 1 | 0 | Link | Security | |
| 1503 | Duplicate Cable Detected | 1 | 0 | Critical | 1 | 0 | Link | Security | |
| 1600 | VS/CC Classes Key Violation | | | | | | | Security | |

