Operations Procedures | IB Cluster Maintenance

UFM Service Verification

Confirm the operational status of the UFM service.

If you are using UFM Enterprise Appliance, execute the following command via the command-line interface (CLI) after logging into the UFM appliance.

show ufm status

For more information, refer to UFM General Commands.

If you are using your own server, refer to Showing UFM Processes Status.
If you prefer using the web user interface:
Navigate to the "System Health" tab in the left menu.
Under the "UFM Health" section, click on "Create New Report."
Confirm that all fields are displaying green indicators.

For detailed instructions, refer to UFM Health Tab.

It is also recommended to conduct a remote test of the REST API by querying the "UFM Health" report. For instructions, refer to Reports REST API.

Fabric Health Report Generation and Validation

To generate a fabric health report and verify that all sections are green, perform the following steps using the Web UI:

Access the "System Health" tab on the left menu.
Click on "Run New Report" under the "Fabric Health" section.
Confirm that all fields are indicating green status.

For detailed instructions, refer to Fabric Health Tab.

Additionally, within the "System Health" tab:
Run the available tests under "Fabric Validation."
Verify the outcomes as either "Pass" or "Completed with No Errors."
For detailed instructions, refer to Fabric Validation Tab.
Furthermore, it is recommended to conduct remote REST API tests from a remote node. This can be done using the REST APIs described in the following links:
Reports REST API
Fabric Validation Tests REST API

Cluster Topology Validation

Once the InfiniBand cluster is built, it is essential to create a Master Topology. This Master Topology serves as a reference during cluster operation, enabling the detection of any network configuration changes. It is noteworthy that the actual cluster topology may be different from the initially planned specifications. Detecting and validating these discrepancies in topology is crucial to ensure the cluster's proper functionality.

For example, even when a known TOR switch is defective due to a hardware malfunction and is scheduled for the RMA process, the cluster can still operate, albeit with some degradation in performance and anticipated capacity.

For more comprehensive details, refer to Topology Compare REST API.

Telemetry Metrics Collection

To collect telemetry metrics for InfiniBand ports, PHY and cables, perform the following.

Access the embedded UFM Telemetry instance through an HTTP endpoint by entering the following URL in your browser address bar:

http://$ufm_ip$:9001/labels/enterprise

Remember to replace "$ufm_ip$" with your actual UFM IP address, for example:

http://10.209.44.100:9001/labels/enterprise

Expected results:

PortXmitDataExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 228011616 1648987628390
PortRcvDataExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 228011616 1648987628390
PortXmitPktsExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 791707 1648987628390
PortRcvPktsExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 791707 1648987628390
SymbolErrorCounterExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 0 1648987628390
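
The lines above follow a "name{labels} value timestamp" layout, so they can be split into fields for further processing. The following is a minimal sketch, not part of UFM; the function name parse_sample is hypothetical, and the parser assumes straight double quotes and label values without embedded quotes.

```python
import re

# Matches one sample in the layout shown above: name{labels} value timestamp
_SAMPLE_RE = re.compile(
    r'^(?P<name>\w+)\{(?P<labels>.*)\}\s+(?P<value>\S+)\s+(?P<ts>\d+)$'
)
# Matches one key="value" pair inside the braces
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one telemetry line into (metric name, labels dict, value, timestamp)."""
    match = _SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized telemetry line: {line!r}")
    labels = dict(_LABEL_RE.findall(match.group("labels")))
    return (
        match.group("name"),
        labels,
        float(match.group("value")),
        int(match.group("ts")),
    )


# Sample line copied from the expected results above.
line = (
    'SymbolErrorCounterExtended{device_name="",device_type="host",'
    'fabric="compute",hostname="swx-snap3",level="server",'
    'node_desc="swx-snap3 mlx5_0",peer_level="server",'
    'port_id="248a0703009a15fa_1"} 0 1648987628390'
)
name, labels, value, ts = parse_sample(line)
print(name, labels["port_id"], value, ts)
```

Keeping the labels in a dictionary makes it straightforward to group samples by port_id or hostname when comparing counters between two collections.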

For a more compact CSV data format, access the following endpoint:

http://$ufm_ip$:9001/labels/csv/metrics

Remember to replace "$ufm_ip$" with your actual UFM IP address. Example:

http://10.209.44.100:9001/labels/csv/metrics
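
For scripted collection, the two endpoint URLs above can be built from the UFM IP address and fetched with the Python standard library. This is a sketch only: the helper names telemetry_url and fetch_telemetry are hypothetical, and the fetch assumes the endpoint is reachable over plain HTTP without authentication, as the browser instructions suggest.

```python
from urllib.request import urlopen

TELEMETRY_PORT = 9001  # port used in the URLs above


def telemetry_url(ufm_ip, csv=False):
    """Return the telemetry endpoint URL for the given UFM IP address."""
    path = "/labels/csv/metrics" if csv else "/labels/enterprise"
    return f"http://{ufm_ip}:{TELEMETRY_PORT}{path}"


def fetch_telemetry(ufm_ip, csv=False, timeout=10):
    """Download the current telemetry snapshot as text (requires network access)."""
    with urlopen(telemetry_url(ufm_ip, csv), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URLs needs no network access.
print(telemetry_url("10.209.44.100"))
print(telemetry_url("10.209.44.100", csv=True))
```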

Link Monitoring Key Indicators

The following table lists the link monitoring key indicators and provides their descriptions, pass/fail criteria and monitoring intervals.

Table 1. Link Monitoring Key Indicators

Parameter	Description	Pass/Fail Criteria	Monitoring Interval
Link State
Phy_state	Physical link state	Verify link up	ongoing
Logical_state	Logical link state	Verify link in ACTIVE mode	ongoing
speed_active	Active link speed	Verify expected speed	ongoing
width_active	Active link width	Verify expected width: 4x, or 2x for a split cable	ongoing
BER	Bit Error Rate	BER Thresholds
Symbol_BER	BER after FEC and PLR	1e-15 (HDR) / 1e-16 (NDR)	ongoing
PHY Errors
Symbol_Errors	Errors after FEC and PLR	Defined by Symbol BER	ongoing
Link_Down counter	The total number of times the Port Training state machine has failed the link error recovery process and downed the link.	If >1, investigate: check the peer side as well and make sure the link-down events are not due to a reboot.	ongoing
LinkErrorRecoveryCounter	The number of times the Port Training state machine has successfully completed the link error recovery process.	Clean, no errors	ongoing
Chip temperature	Temperature in °C	If the temperature reaches the maximum threshold, the FW performs a protective thermal shutdown.	ongoing
Device FW version	Switch / HCA FW version	Verify that the approved version is the latest version released by NVIDIA and that devices across the cluster run similar versions.	Days
Cables Information
PN	Part number	No check required	Days
SN	Serial number	No check required	Days
FW ver	FW version	Verify that the approved version is the latest version released by NVIDIA	Days
Module temperature	Optic module only	There is an alarm and threshold for each transceiver. Typical values are Warning [70°C, 0°C] and Alarm [80°C, -10°C]	ongoing
Rx power Tx power per lane	Optic module only	There is an alarm and threshold for each transceiver.	Minutes
Packet Discard
PortRcvErrors	Total number of packets containing an error that were received on the port.	< 10 per second (perform 2 successive samples)	Minutes
PortXmitDiscards	Total number of outbound packets discarded by the port because the port is down or congested.	< 10 per second (perform 2 successive samples)	Minutes
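
The packet discard criteria in Table 1 are rates, so they require two successive counter samples. The following sketch shows the arithmetic; the function names and the sample readings are hypothetical, and timestamps are assumed to be in seconds.

```python
def counter_rate(first, second):
    """Per-second increase between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = first, second
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    if v1 < v0:
        raise ValueError("counter decreased; it may have been reset between samples")
    return (v1 - v0) / (t1 - t0)


def discard_check_passes(first, second, limit_per_second=10):
    """True when the counter grew by fewer than limit_per_second events per second."""
    return counter_rate(first, second) < limit_per_second


# Hypothetical PortRcvErrors readings taken 60 seconds apart.
print(discard_check_passes((0, 1200), (60, 1500)))  # 300 errors over 60 s
print(discard_check_passes((0, 1200), (60, 2400)))  # 1200 errors over 60 s
```

The same helper applies to PortXmitDiscards, since both rows of the table use the same limit.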

Cluster Performance Verification

The tool used for validating cluster performance is known as ClusterKit, an integral component of the HPC-X Software Toolkit.

NVIDIA® HPC-X® is a comprehensive software bundle encompassing MPI and SHMEM communication libraries. The package includes various acceleration components that enhance the performance and scalability of applications running on top of these libraries. Notably, UCX (Unified Communication X) accelerates the underlying send/receive (or put/get) messages. Also included is HCOLL, which accelerates the underlying collective operations used by the MPI/PGAS languages.

For detailed documentation, along with instructions for downloading and installing HPC-X, refer to HPC-X Documentation.

HPC-X Functionality Verification

To ensure the correct operation of HPC-X, a straightforward MPI test program bundled with HPC-X can be employed. Use the following procedure:

Set the HPCX_HOME environment variable to point to the HPCX installation directory:

% export HPCX_HOME=<HPCX Directory>

Initialize HPC-X environment variables:

% source $HPCX_HOME/hpcx-init.sh

% hpcx_load

Execute the precompiled MPI test program hello_c. The MPI program can be executed using either of the following methods:
Inside a SLURM allocation or job, run:

% mpirun $HPCX_MPI_TESTS_DIR/examples/hello_c

Without SLURM, using SSH and explicitly setting the hosts to run on:

% mpirun --host <host1,host2,…,hostN> $HPCX_MPI_TESTS_DIR/examples/hello_c

Alternatively, you can put all hostnames into a single file (hostfile) and pass that file to mpirun (see mpirun(1) man page for details):

% mpirun --hostfile <hostfile> $HPCX_MPI_TESTS_DIR/examples/hello_c

The output should contain one line for every MPI process that was executed. Each line indicates the MPI rank of the process, the total number of processes, and the version of OpenMPI bundled with HPC-X. For instance:

Hello, world, I am 90 of 168, (Open MPI v4.1.5rc2, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.5rc2, repo rev: v4.1.5rc1-16-g5980bac633, Unreleased developer copy, 150)

The number of lines should match the number of cores in the allocation (see ‘of 168’ in the example).
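
When the allocation is large, counting lines by eye is impractical. The sketch below checks captured hello_c output instead; it assumes the output was saved as text and that each greeting starts with "Hello, world, I am <rank> of <total>" as in the example above. The function name check_hello_output is hypothetical.

```python
import re

_HELLO_RE = re.compile(r"^Hello, world, I am (\d+) of (\d+),")


def check_hello_output(text):
    """Return (ok, ranks_seen, expected_total) for captured hello_c output."""
    ranks = set()
    expected = None
    for line in text.splitlines():
        match = _HELLO_RE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a hello_c greeting
        ranks.add(int(match.group(1)))
        expected = int(match.group(2))
    # MPI ranks are numbered from 0, so a complete run covers 0..expected-1.
    ok = expected is not None and ranks == set(range(expected))
    return ok, len(ranks), expected


# Synthetic four-process output in the same format as the example above.
sample = "\n".join(
    f"Hello, world, I am {rank} of 4, (Open MPI v4.1.5rc2, ...)"
    for rank in range(4)
)
print(check_hello_output(sample))
```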

Check that the ClusterKit script (clusterkit.sh) is available. Run:

ls -l $HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh

Check that the file $HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh exists and is executable.

Running ClusterKit

Prior to executing ClusterKit, it is important to have HPC-X properly set up with initialized environment variables. Additionally, ensure that the ClusterKit script (clusterkit.sh) is accessible, as instructed in the preceding section.

ClusterKit can be run inside a SLURM allocation or job, or without SLURM. When operating within a SLURM allocation, use the following command:

$HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh -d mlx5_4:1 -x "-d bw"

Where -d adapter:port selects which InfiniBand adapter and port to use and -x "-d bw" sets which test to run (bandwidth test).

If running outside a SLURM allocation, use:

$HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh -f hostfile -d mlx5_4:1 -x "-d bw"

where -f hostfile sets hostfile to use. The hostfile contains the list of nodes to use (see mpirun(1) man page for details).

You can add the -D <output dir> switch to set the output directory for the run. Without it, the output is saved into a directory named after the date and time of the run (e.g., 20230731_154932).

Two files are created in the output directory: bandwidth.json and bandwidth.txt. bandwidth.json can be used for automatic processing of the results, which is out of the scope of this document. In bandwidth.txt, see the last three lines of text, which look like:

Minimum bandwidth: 24869.6 MB/s between node14 and node28

Maximum bandwidth: 25208.7 MB/s between node02 and node13

Average bandwidth: 25002.5 MB/s

The results are in decimal Megabytes per second (10⁶ Bytes per second).
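
The three summary lines can also be extracted programmatically from bandwidth.txt. The sketch below relies only on the line format shown above; the function name parse_bandwidth_summary is hypothetical, and the layout of the rest of the file is not assumed.

```python
import re

_SUMMARY_RE = re.compile(r"^(Minimum|Maximum|Average) bandwidth:\s+([\d.]+)\s+MB/s")


def parse_bandwidth_summary(text):
    """Return the summary values in MB/s, keyed by 'minimum', 'maximum', 'average'."""
    summary = {}
    for line in text.splitlines():
        match = _SUMMARY_RE.match(line.strip())
        if match:
            summary[match.group(1).lower()] = float(match.group(2))
    return summary


# Summary lines copied from the example above.
sample = """\
Minimum bandwidth: 24869.6 MB/s between node14 and node28

Maximum bandwidth: 25208.7 MB/s between node02 and node13

Average bandwidth: 25002.5 MB/s
"""
print(parse_bandwidth_summary(sample))
```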

Results Verification

Your cluster's performance is satisfactory when the minimum achieved result is at least 95% of the maximum available bandwidth, as illustrated in the table below.

For your convenience, the technology of your cluster interconnect is shown in the header of the bandwidth.txt file.

Table 2. Expected InfiniBand Performance (for 4x Connections)

Technology	Speed, Gb/s	95% performance, MB/s
EDR	100	11 515
HDR	200	23 030
NDR	400	46 060
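
The values in Table 2 can be reproduced with a short calculation. Note that the 64/66 encoding factor below is an inference from the numbers in the table, not something stated in this document: 95% of the raw line rate would give 11 875 MB/s for EDR, whereas applying a 64/66 factor first yields the tabulated 11 515 MB/s. The function name is hypothetical.

```python
def expected_min_bandwidth_mb_s(speed_gb_s, fraction=0.95, encoding=64 / 66):
    """Threshold in decimal MB/s for a link with the given nominal speed in Gb/s.

    The encoding factor is an assumption chosen because it reproduces Table 2.
    """
    payload_gb_s = speed_gb_s * encoding    # usable rate after line encoding
    payload_mb_s = payload_gb_s * 1000 / 8  # Gb/s to MB/s (decimal units)
    return payload_mb_s * fraction


for name, speed in (("EDR", 100), ("HDR", 200), ("NDR", 400)):
    print(name, int(expected_min_bandwidth_mb_s(speed)))

# The example run above reported a minimum of 24869.6 MB/s on an HDR fabric.
print(24869.6 >= expected_min_bandwidth_mb_s(200))
```

Comparing the "Minimum bandwidth" value from bandwidth.txt against this threshold gives a simple pass/fail result for the whole run.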

Review All Unhealthy Nodes

Once the UFM examines the behavior of subnet nodes, including switches and hosts, and identifies a node as "unhealthy" based on internal conditions, this node is displayed in the “Unhealthy Ports” list. Once a node is declared as "unhealthy," the Subnet Manager either ignores, reports, isolates, or disables the node. Users hold the authority to control the executed actions and the criteria that categorize a node as "unhealthy." Furthermore, the user can "clear" nodes previously labeled as "unhealthy".

To review all unhealthy nodes using the Web UI, refer to Unhealthy Ports Window. Alternatively, use the REST API from a remote node via the Unhealthy Ports REST API.

Congestion Monitoring with UFM Telemetry

Because InfiniBand is lossless, the network does not drop packets, which may cause network congestion. The metric XmitWaitPerc provides the percentage of time in which ports had data to send but could not progress due to congestion. This metric can be obtained per topology layer (distance from the source hosts towards the destination hosts) or for each link separately.

If a switch port connected to a host shows an XmitWaitPerc of >5%, the most probable cause is that the host PCIe or its memory is not healthy.

If XmitWaitPerc is >5% on links or layers that are not driving a host, the most probable cause is traffic that exceeds the capacity of that layer. This is normal for over-subscribed networks, where the total number of cables connecting the switches of that layer to the next one is smaller than the number of cables connected to previous layers. If the network is not over-subscribed, however, a high XmitWaitPerc can be a strong sign that adaptive routing is not used, that applications use many-to-one traffic patterns, or that missing (unhealthy) links make a specific switch over-subscribed.
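
The decision logic described above can be summarized as a small triage function. This is an illustrative sketch: the function name, its parameters, and the returned strings are hypothetical, and whether a port is host-facing or a layer is over-subscribed must be supplied by the caller.

```python
def classify_xmit_wait(xmit_wait_perc, host_facing, oversubscribed=False, threshold=5.0):
    """Map an XmitWaitPerc reading to the likely cause described in the text."""
    if xmit_wait_perc <= threshold:
        return "ok"
    if host_facing:
        # Switch port connected to a host
        return "check host PCIe or memory health"
    if oversubscribed:
        # Traffic exceeding layer capacity is normal in this case
        return "expected for an over-subscribed layer"
    return "check adaptive routing, many-to-one traffic, or missing links"


print(classify_xmit_wait(2.0, host_facing=True))
print(classify_xmit_wait(12.5, host_facing=True))
print(classify_xmit_wait(12.5, host_facing=False, oversubscribed=True))
print(classify_xmit_wait(12.5, host_facing=False))
```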

For more information, refer to "Top X Telemetry Sessions REST API" under Telemetry REST API.

Monitoring Systems Integrations

The monitoring system includes UFM Telemetry and optional streaming of its results into customer specific Network Management Systems. UFM Telemetry is responsible for collecting the vital networking metrics and forwarding them to a data-lake or other customer data analysis tools.

A comprehensive array of plugins, scripts, recipes and tools to facilitate UFM integration with third-party network management systems can be accessed via a publicly available GitHub repository: UFM SDK Repository.
Furthermore, the UFM Software Development Kit (SDK) allows extension of the capabilities of the UFM platform with additional tools.

The following are links to instructions detailing the installation and usage of Telemetry Streaming/Forwarding Plugins:

Latest SW Updates

In case your cluster is running Long Term Support (LTS) releases, it is recommended to check NVIDIA LTS web page for the latest LTS release. This page provides the exact version per InfiniBand product (e.g., Switch, HCA, UFM, Transceiver, etc.), that was released and tested as a bundle. To gain insights into critical bug resolutions, it is recommended to visit the release notes for each specific product: Long-Term Support (LTS) Releases.
In case your cluster is running a different General Availability (GA) version (non-LTS), please review the latest revisions to the relevant components via the following link: NVIDIA Firmware Downloads.
Keep track of critical bugs and known issues in the InfiniBand Linux stack at NVIDIA MLNX_OFED Software Releases and plan for upgrades as required.
Keep track of critical bugs, known issues and new features at UFM Enterprise Software Documentation and plan for upgrades as required.
Please note that a complete maintenance window is currently required for device firmware upgrades. A feature enabling switch upgrades while the switch is operational is anticipated to become available in the future.
For UFM and OpenSM upgrades, a staged approach can be adopted: begin by upgrading the secondary UFM, transition to it as the master, and subsequently proceed with upgrading the prior master UFM.

Cooling System Maintenance

For cooling system maintenance using the Web UI, run the Check Temperature test from the Fabric Validation tab. For more information, refer to Fabric Validation Tab.
For cooling system maintenance using the REST APIs, issue a POST request using the following URL:

POST /ufmRest/fabricValidation/tests/CheckTemperature

For more information, refer to Fabric Validation Tests REST API.
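
As an illustration, the POST request above could be prepared from a remote node with the Python standard library. This is a sketch under stated assumptions: it assumes the UFM server is reached over HTTPS and accepts HTTP basic authentication, the host name and credentials shown are placeholders, and the function name is hypothetical. Consult the Fabric Validation Tests REST API for the authoritative request and response details.

```python
import base64
from urllib.request import Request


def build_temperature_test_request(ufm_host, username, password):
    """Prepare (but do not send) the CheckTemperature POST request."""
    url = f"https://{ufm_host}/ufmRest/fabricValidation/tests/CheckTemperature"
    # Basic authentication is an assumption; adjust to your UFM setup.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return Request(url, method="POST", headers={"Authorization": f"Basic {token}"})


# Placeholder host and credentials, for illustration only.
request = build_temperature_test_request("ufm.example.com", "admin", "secret")
print(request.get_method(), request.full_url)
```

The prepared request can be sent with urllib.request.urlopen(request); a server with a self-signed certificate additionally requires a suitable SSL context.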

Last updated: August 14, 2023