IB Cluster Maintenance

Operations Procedures

UFM Service Verification

Confirm the operational status of the UFM service.

  • If you are using UFM Enterprise Appliance, execute the following command via the command-line interface (CLI) after logging into the UFM appliance.

show ufm status

For more information, refer to UFM General Commands.

  • If you are using your own server, refer to Showing UFM Processes Status.

  • If you prefer using the web user interface:

      • Navigate to the "System Health" tab in the left menu.

      • Under the "UFM Health" section, click on "Create New Report."

      • Confirm that all fields are displaying green indicators.

For detailed instructions, refer to UFM Health Tab.

It is also recommended to conduct a remote test of the REST API by querying the "UFM Health" report. For instructions, refer to Reports REST API.

Fabric Health Report Generation and Validation

To generate a fabric health report and verify that all sections are green, perform the following steps using the Web UI:

  • Access the "System Health" tab on the left menu.

  • Click on "Run New Report" under the "Fabric Health" section.

  • Confirm that all fields are indicating green status.

For detailed instructions, refer to Fabric Health Tab.

  • Additionally, within the "System Health" tab:

      • Run the available tests under "Fabric Validation."

      • Verify that each outcome is either "Pass" or "Completed with No Errors."

      • For detailed instructions, refer to Fabric Validation Tab.

  • Furthermore, it is recommended to run these checks from a remote node using the REST APIs described in the following links:

      • Reports REST API

      • Fabric Validation Tests REST API

Cluster Topology Validation

Once the InfiniBand cluster is built, it is essential to create a Master Topology. This Master Topology serves as a reference during cluster operation, enabling the detection of any network configuration changes. Note that the actual cluster topology may differ from the initially planned specifications. Detecting and validating these discrepancies is crucial to ensure that the cluster functions properly.

For example, when a known TOR switch is defective due to a hardware malfunction and is scheduled for RMA, the cluster can still operate, albeit with some degradation in performance and anticipated capacity.

For more comprehensive details, refer to Topology Compare REST API.

Telemetry Metrics Collection

To collect InfiniBand port, PHY, and cable telemetry metrics, perform the following.

Access the embedded UFM Telemetry instance through its HTTP endpoint by entering the following URL in your browser address bar:

http://$ufm_ip$:9001/labels/enterprise

Replace "$ufm_ip$" with your actual UFM IP address, for example:

http://10.209.44.100:9001/labels/enterprise

Expected results:

  • PortXmitDataExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 228011616 1648987628390

  • PortRcvDataExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 228011616 1648987628390

  • PortXmitPktsExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 791707 1648987628390

  • PortRcvPktsExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 791707 1648987628390

  • SymbolErrorCounterExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 0 1648987628390


For a more compact CSV data format, access the following endpoint:

http://$ufm_ip$:9001/labels/csv/metrics

Remember to replace "$ufm_ip$" with your actual UFM IP address. Example:

http://10.209.44.100:9001/labels/csv/metrics  
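Each line in the expected results consists of a metric name, a brace-delimited set of labels, a value, and a timestamp. The following is a minimal sketch of how such lines could be parsed once the endpoint output has been saved as text. It assumes the layout seen in the sample output; the function name parse_metric_line is hypothetical and is not part of UFM.

```python
import re

# Assumed line layout (taken from the sample output in this section):
#   MetricName{key="value",...} <value> <timestamp>
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)\s+(\d+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metric_line(line):
    """Split one telemetry line into (name, labels, value, timestamp)."""
    # Normalize typographic quotes in case the text was copied from a document.
    line = line.strip().replace("\u201c", '"').replace("\u201d", '"')
    match = LINE_RE.match(line)
    if match is None:
        return None  # comment, blank line, or unexpected layout
    name, raw_labels, value, timestamp = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels))
    return name, labels, float(value), int(timestamp)


sample = (
    'SymbolErrorCounterExtended{device_name="",device_type="host",'
    'fabric="compute",hostname="swx-snap3",level="server",'
    'node_desc="swx-snap3 mlx5_0",peer_level="server",'
    'port_id="248a0703009a15fa_1"} 0 1648987628390'
)
name, labels, value, timestamp = parse_metric_line(sample)
print(name, labels["port_id"], value, timestamp)
```

Lines that do not match the assumed layout return None, so the caller can skip them instead of failing.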

Link Monitoring Key Indicators

The following table lists the link monitoring key indicators and provides their descriptions, pass/fail criteria and monitoring intervals.

Table 1. Link Monitoring Key Indicators

| Parameter | Description | Pass/Fail Criteria | Monitoring Interval |
|---|---|---|---|
| Link State | | | |
| Phy_state | Physical link state | Verify link up | ongoing |
| Logical_state | Logical link state | Verify link in ACTIVE mode | ongoing |
| speed_active | Active link speed | Verify expected speed | ongoing |
| width_active | Active link width | Verify expected width 4x, or 2x for a split cable | ongoing |
| BER | Bit Error Rate | BER thresholds | |
| Symbol_BER | BER after FEC and PLR | 1e-15 (HDR) / 1e-16 (NDR) | ongoing |
| PHY Errors | | | |
| Symbol_Errors | Errors after FEC and PLR | Defined by Symbol BER | ongoing |
| Link_Down counter | The total number of times the Port Training state machine has failed the link error recovery process and downed the link. | >1. Check the peer side as well, and make sure it is not due to a reboot. | ongoing |
| LinkErrorRecoveryCounter | The number of times the Port Training state machine has successfully completed the link error recovery process. | Clean, no errors | ongoing |
| Chip temperature | Temperature in °C | If the temperature reaches the maximum threshold, the FW performs a protective thermal shutdown. | ongoing |
| Device FW version | Switch / HCA FW version | Verify that the approved version is the latest version released by NVIDIA, and that versions are consistent across the cluster. | Days |
| Cables Information | | | |
| PN | Part number | No check required | Days |
| SN | Serial number | No check required | Days |
| FW ver | FW version | Verify that the approved version is the latest version released by NVIDIA. | Days |
| Module temperature | Optic module only | There is an alarm and threshold for each transceiver. Usually Warning [70°C, 0°C] and Alarm [80°C, -10°C]. | ongoing |
| Rx power / Tx power per lane | Optic module only | There is an alarm and threshold for each transceiver. | Minutes |
| Packet Discard | | | |
| PortRcvErrors | Total number of packets containing an error that were received on the port. | < 10 per second (perform 2 successive samples) | Minutes |
| PortXmitDiscards | Total number of outbound packets discarded by the port because the port is down or congested. | < 10 per second (perform 2 successive samples) | Minutes |
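The "< 10 per second (perform 2 successive samples)" criterion for PortRcvErrors and PortXmitDiscards amounts to dividing the counter increase by the time between the two samples. The following is a minimal sketch of that arithmetic. It assumes the counters are cumulative and that timestamps are in milliseconds; the function names and the sample values are illustrative only.

```python
def errors_per_second(count_1, time_ms_1, count_2, time_ms_2):
    """Rate of counter increase between two successive samples."""
    elapsed_s = (time_ms_2 - time_ms_1) / 1000.0
    if elapsed_s <= 0:
        raise ValueError("second sample must be taken after the first")
    return (count_2 - count_1) / elapsed_s


def passes_discard_criterion(count_1, time_ms_1, count_2, time_ms_2, limit=10.0):
    """True when the counter grows by fewer than `limit` events per second."""
    return errors_per_second(count_1, time_ms_1, count_2, time_ms_2) < limit


# Two made-up samples taken 60 seconds apart: 120 new errors, i.e. 2 per second.
rate = errors_per_second(1000, 1648987628390, 1120, 1648987688390)
print(rate, passes_discard_criterion(1000, 1648987628390, 1120, 1648987688390))
```

This sketch does not handle counter resets or wrap-around; a negative rate would indicate that one of these occurred between the samples.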


Cluster Performance Verification

The tool used for validating cluster performance is known as ClusterKit, an integral component of the HPC-X Software Toolkit.

NVIDIA® HPC-X® is a comprehensive software bundle that includes MPI and SHMEM communication libraries. The package also includes various acceleration components that enhance the performance and scalability of applications running on top of these libraries. Notably, UCX (Unified Communication X) accelerates the underlying send/receive (or put/get) messages, and HCOLL accelerates the underlying collective operations used by the MPI/PGAS languages.

For detailed documentation, along with instructions for downloading and installing HPC-X, refer to HPC-X Documentation.

 

HPC-X Functionality Verification

To verify that HPC-X operates correctly, use the simple MPI test program bundled with HPC-X. Use the following procedure:

  1. Set the HPCX_HOME environment variable to point to the HPC-X installation directory:

% export HPCX_HOME=<HPCX Directory>

  2. Initialize HPC-X environment variables:

% source $HPCX_HOME/hpcx-init.sh

% hpcx_load

  3. Execute the precompiled MPI test program hello_c. The MPI program can be executed using either of the following methods:

      • Inside a SLURM allocation or job, run:

% mpirun $HPCX_MPI_TESTS_DIR/examples/hello_c

      • Without SLURM, using SSH and explicitly setting the hosts to run on:

% mpirun --host <host1,host2,…,hostN> $HPCX_MPI_TESTS_DIR/examples/hello_c

Alternatively, you can put all hostnames into a single file (hostfile) and pass that file to mpirun (see mpirun(1) man page for details):

% mpirun --hostfile <hostfile> $HPCX_MPI_TESTS_DIR/examples/hello_c

  4. The output should contain one line for every MPI process that was executed. Each line indicates the MPI rank of the process, the total number of processes, and the version of Open MPI bundled with HPC-X. For instance:

Hello, world, I am 90 of 168, (Open MPI v4.1.5rc2, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.5rc2, repo rev: v4.1.5rc1-16-g5980bac633, Unreleased developer copy, 150)

The number of lines should match the number of cores in the allocation (see ‘of 168’ in the example).

  5. Check that the ClusterKit script is available. Run:

ls -l $HPCX_CLUSTERKIT_DIR/bin/run_clusterkit.sh

  6. Check that the file $HPCX_CLUSTERKIT_DIR/bin/run_clusterkit.sh exists and is executable.
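The hello_c output check in the procedure above (one line per MPI process, with every rank reporting the same total) can be automated once the output has been captured to a file. The following is a minimal sketch, assuming each line starts with "Hello, world, I am <rank> of <total>" as in the example and that ranks are numbered from 0. The helper name check_hello_output is hypothetical.

```python
import re

HELLO_RE = re.compile(r"^Hello, world, I am (\d+) of (\d+)")


def check_hello_output(text):
    """Return (ok, total, missing_ranks) for captured hello_c output."""
    ranks = set()
    totals = set()
    for line in text.splitlines():
        match = HELLO_RE.match(line.strip())
        if match:
            ranks.add(int(match.group(1)))
            totals.add(int(match.group(2)))
    if len(totals) != 1:
        return False, None, []  # no output, or processes disagree on the total
    total = totals.pop()
    missing = sorted(set(range(total)) - ranks)
    return not missing, total, missing


# Illustrative output from a 4-process run in which rank 2 never reported.
captured = "\n".join(
    f"Hello, world, I am {rank} of 4, (Open MPI v4.1.5rc2, ...)" for rank in (0, 1, 3)
)
print(check_hello_output(captured))
```

Reporting the missing ranks, rather than only a line count, helps narrow down which hosts failed to start their processes.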


Running ClusterKit

Prior to executing ClusterKit, it is important to have HPC-X properly set up with initialized environment variables. Additionally, ensure that the ClusterKit script (clusterkit.sh) is accessible, as instructed in the preceding section.

ClusterKit can be run inside a SLURM allocation or job, or without SLURM. When operating within a SLURM allocation, use the following command:

$HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh -d mlx5_4:1 -x "-d bw"

Where -d adapter:port selects which InfiniBand adapter and port to use and -x "-d bw" sets which test to run (bandwidth test).

If running outside a SLURM allocation, use:

$HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh -f hostfile -d mlx5_4:1 -x "-d bw"

where -f hostfile sets hostfile to use. The hostfile contains the list of nodes to use (see mpirun(1) man page for details).

You can add the -D <output dir> switch to set the output directory for the run. Without it, the output is saved into a directory named after the date and time of the run (e.g., 20230731_154932).

Two files are created in the output directory: bandwidth.json and bandwidth.txt. The bandwidth.json file can be used for automatic processing of the results, which is outside the scope of this document. The last three lines of bandwidth.txt look like this:

Minimum bandwidth: 24869.6 MB/s between node14 and node28

Maximum bandwidth: 25208.7 MB/s between node02 and node13

Average bandwidth: 25002.5 MB/s

The results are in decimal megabytes per second (10^6 bytes per second).


Results Verification

Your cluster's performance is satisfactory when the minimum achieved result is at least 95% of the maximum available bandwidth, as illustrated in the table below.

For your convenience, the technology of your cluster interconnect is shown in the header of the bandwidth.txt file.

Table 2. Expected InfiniBand Performance (for 4x Connections)

| Technology | Speed, Gb/s | 95% performance, MB/s |
|---|---|---|
| EDR | 100 | 11 515 |
| HDR | 200 | 23 030 |
| NDR | 400 | 46 060 |
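The pass criterion can be checked mechanically from the contents of bandwidth.txt. The following is a minimal sketch, assuming the summary line has the form shown in the ClusterKit example ("Minimum bandwidth: <value> MB/s ...") and taking the thresholds from Table 2. The function and dictionary names are hypothetical.

```python
import re

# 95% thresholds in MB/s for 4x connections, copied from Table 2.
THRESHOLD_MB_S = {"EDR": 11515.0, "HDR": 23030.0, "NDR": 46060.0}

MIN_BW_RE = re.compile(r"Minimum bandwidth:\s*([\d.]+)\s*MB/s")


def minimum_bandwidth(report_text):
    """Extract the minimum bandwidth (MB/s) from bandwidth.txt contents."""
    match = MIN_BW_RE.search(report_text)
    if match is None:
        raise ValueError("no 'Minimum bandwidth' line found")
    return float(match.group(1))


def bandwidth_ok(report_text, technology):
    """True when the minimum bandwidth reaches the Table 2 threshold."""
    return minimum_bandwidth(report_text) >= THRESHOLD_MB_S[technology]


report = """Minimum bandwidth: 24869.6 MB/s between node14 and node28
Maximum bandwidth: 25208.7 MB/s between node02 and node13
Average bandwidth: 25002.5 MB/s
"""
print(minimum_bandwidth(report), bandwidth_ok(report, "HDR"))
```

With the example values from this document, the minimum of 24869.6 MB/s exceeds the HDR threshold of 23 030 MB/s, so the check passes for an HDR fabric.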


Review All Unhealthy Nodes

When UFM examines the behavior of subnet nodes (switches and hosts) and identifies a node as "unhealthy" based on internal conditions, the node is displayed in the "Unhealthy Ports" list. Once a node is declared "unhealthy", the Subnet Manager either ignores, reports, isolates, or disables it. Users control the actions taken and the criteria that classify a node as "unhealthy", and can also clear nodes previously labeled as "unhealthy".

To review all unhealthy nodes using the Web UI, refer to Unhealthy Ports Window. Alternatively, use the REST API from a remote node via the Unhealthy Ports REST API.

Congestion Monitoring with UFM Telemetry

Since InfiniBand is lossless, the network does not drop packets, which may result in network congestion. The metric XmitWaitPerc provides the percentage of time in which ports had data to send but could not progress due to congestion. This metric can be obtained per topology layer (distance from the source hosts towards the destination hosts) or for each link separately.

If a switch port connected to a host shows an XmitWaitPerc above 5%, the most probable cause is that the host PCIe or its memory is not healthy.

If XmitWaitPerc is above 5% on links or layers that are not driving a host, the most probable cause is traffic that exceeds the capacity of that layer. This is normal for over-subscribed networks, where the total number of cables connecting the switches of that layer to the next one is smaller than the number of cables connected to previous layers. If the network is not over-subscribed, however, a high XmitWaitPerc can be a strong sign that adaptive routing is not used, that the applications use many-to-one traffic patterns, or that missing (unhealthy) links make a specific switch over-subscribed.
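The interpretation rules above can be written as a small decision function. The following is a minimal sketch, assuming the caller already knows whether a port faces a host and whether the layer is over-subscribed. The function name and the returned strings are illustrative only and do not correspond to any UFM output.

```python
XMIT_WAIT_THRESHOLD_PERC = 5.0  # threshold used in the text above


def interpret_xmit_wait(xmit_wait_perc, host_facing, over_subscribed):
    """Map an XmitWaitPerc reading to the likely cause described above."""
    if xmit_wait_perc <= XMIT_WAIT_THRESHOLD_PERC:
        return "no significant congestion"
    if host_facing:
        return "check host PCIe and memory health"
    if over_subscribed:
        return "traffic exceeds layer capacity (expected when over-subscribed)"
    return "check adaptive routing, many-to-one traffic, and unhealthy links"


print(interpret_xmit_wait(12.0, host_facing=True, over_subscribed=False))
```

The host-facing case is tested first because the text treats it separately from links and layers that do not drive a host.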

For more information, refer to "Top X Telemetry Sessions REST API" under Telemetry REST API.

 

Monitoring Systems Integrations

The monitoring system includes UFM Telemetry and optional streaming of its results into customer specific Network Management Systems. UFM Telemetry is responsible for collecting the vital networking metrics and forwarding them to a data-lake or other customer data analysis tools.

  • A comprehensive array of plugins, scripts, recipes and tools to facilitate UFM integration with third-party network management systems can be accessed via a publicly available GitHub repository: UFM SDK Repository.

  • Furthermore, the UFM Software Development Kit (SDK) allows extension of the capabilities of the UFM platform with additional tools.

 

The following links provide instructions for installing and using the Telemetry Streaming/Forwarding Plugins:


 

Latest SW Updates

  • In case your cluster is running Long Term Support (LTS) releases, it is recommended to check NVIDIA LTS web page for the latest LTS release. This page provides the exact version per InfiniBand product (e.g., Switch, HCA, UFM, Transceiver, etc.), that was released and tested as a bundle. To gain insights into critical bug resolutions, it is recommended to visit the release notes for each specific product: Long-Term Support (LTS) Releases.

  • In case your cluster is running a different General Availability (GA) version (non-LTS), please review the latest revisions to the relevant components via the following link: NVIDIA Firmware Downloads.

  • Keep track of critical bugs and known issues in the InfiniBand Linux stack at NVIDIA MLNX_OFED Software Releases and plan for upgrades as required.

  • Keep track of critical bugs, known issues and new features at UFM Enterprise Software Documentation and plan for upgrades as required.

  • Note that a complete maintenance window is currently required for device firmware upgrades. A feature enabling switch upgrades while the switch is operational is anticipated to become available in the future.

  • For UFM and OpenSM upgrades, a staged approach can be adopted: begin by upgrading the secondary UFM, transition to it as the master, and subsequently proceed with upgrading the prior master UFM.

Cooling System Maintenance

  • For cooling system maintenance using the Web UI, run the Check Temperature test from the Fabric Validation tab. For more information, refer to Fabric Validation Tab.

  • For cooling system maintenance using the REST APIs, issue a POST request using the following URL:

POST /ufmRest/fabricValidation/tests/CheckTemperature
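As a hedged sketch, the POST request above could be prepared with Python's standard library as follows. The HTTPS scheme, HTTP basic authentication, the host address, and the credentials are assumptions made for illustration; only the request path comes from this document. The request is built but not sent.

```python
import base64
import urllib.request

TEST_PATH = "/ufmRest/fabricValidation/tests/CheckTemperature"  # from this document


def build_check_temperature_request(ufm_host, username, password):
    """Prepare (but do not send) the POST request for the temperature test."""
    # Assumption: UFM is reachable over HTTPS and accepts HTTP basic auth.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"https://{ufm_host}{TEST_PATH}",
        method="POST",
        headers={"Authorization": f"Basic {token}"},
    )


# Placeholder host and credentials; replace with your own.
request = build_check_temperature_request("10.209.44.100", "user", "secret")
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it. The response format and any follow-up calls needed to retrieve the test result are not covered here; refer to the Fabric Validation Tests REST API documentation for those details.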
