NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

Known Issues

Internal Reference Number

Issues

5037230

Description: During a UFM upgrade from a previous version, the sharp_am service may log an "Invalid parameter stochastic_rounding_enabled" error, causing the service to stop and automatically restart.

Workaround: No action required. No impact on system performance or functionality; service resumes normal operation after the restart.

Keywords: UFM; sharp_am; upgrade

Discovered in Release: 3.15.0

-

Description: Using sharp_am with switch firmware version 31.2014.3000 or later requires sharp_am version 3.11.0 or newer.

Workaround: N/A

Keywords: sharp_am; switch firmware

Discovered in Release: 3.9.0
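
The version requirement above can be checked before deployment. The following is a minimal sketch, not an NVIDIA tool: the function names are hypothetical, and it assumes both versions are plain dotted numeric strings and that the firmware versions being compared belong to the same switch family. Versions are compared as integer tuples because plain string comparison would rank 3.9.0 above 3.11.0.

```python
# Firmware version from which the known issue applies, and the minimum
# sharp_am version it requires (both taken from the entry above).
FIRMWARE_THRESHOLD = "31.2014.3000"
MIN_SHARP_AM = "3.11.0"


def parse_version(text):
    """Turn a dotted version string such as '31.2014.3000' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def sharp_am_is_compatible(firmware_version, sharp_am_version):
    """Return True if this sharp_am version may be used with this firmware.

    Firmware older than the threshold is not covered by this known issue,
    so the check passes for it regardless of the sharp_am version.
    """
    if parse_version(firmware_version) < parse_version(FIRMWARE_THRESHOLD):
        return True
    return parse_version(sharp_am_version) >= parse_version(MIN_SHARP_AM)
```

For example, sharp_am_is_compatible("31.2014.3000", "3.9.0") returns False, while the same firmware with sharp_am 3.11.0 passes.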

3340353

Description: A standby management host that is reconfigured to operate as a compute host cannot run SHARP jobs until sharp_am is restarted.

If a host runs the SM process, the master SM automatically detects it as a standby SM and reports it as a standby management host.

A restart is not required if ignore_sm_guids is set to FALSE.

Workaround: N/A

Keywords: active; standby; compute host; ignore_sm_guids

Discovered in Release: 3.3.0

3371820

Description: Congestion Control cannot be configured on the same SLs used by sharp_am.

Workaround: N/A

Keywords: Congestion control; SL

Discovered in Release: 3.3.0

3305335

Description: When running mpirun with multiple groups, the following error message might be received:

[error] - AM QPAlloc confirm QP MAD response status 0x1c00

This error occurs because multiple unserialized MAD requests run in parallel.

Workaround: Set the SHARP_COLL_SERIALIZE_MADS environment variable to TRUE when running mpirun.

Keywords: mpirun; SHARP_COLL_SERIALIZE_MADS

Discovered in Release: 3.2.0
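
The workaround above can be applied from a launcher script. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and it assumes mpirun is the Open MPI launcher, where "-x NAME=VALUE" exports an environment variable to the launched processes. Other launchers may need a different mechanism.

```python
import os


def mpirun_command(app_args, num_procs):
    """Build an mpirun argv that exports SHARP_COLL_SERIALIZE_MADS to all ranks.

    Assumption: mpirun is Open MPI's launcher, where '-x NAME=VALUE'
    exports an environment variable to the launched processes.
    """
    return [
        "mpirun",
        "-np", str(num_procs),
        "-x", "SHARP_COLL_SERIALIZE_MADS=TRUE",
        *app_args,
    ]


def launch_environment(base_env=None):
    """Return a copy of the environment with the workaround variable set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SHARP_COLL_SERIALIZE_MADS"] = "TRUE"
    return env
```

A job could then be started with subprocess.run(mpirun_command(["./my_app"], 4), env=launch_environment(), check=True), where ./my_app is a placeholder for the MPI application.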

3225401

Description: The dynamic tree creation feature does not support the case in which all root switches go down and are restarted. If this happens, restart sharp_am once the root switches are up and running.

Workaround: N/A

Keywords: Aggregation Manager; sharp_am; dynamic trees

Discovered in Release: 3.1.0

3237831

Description: SHARP does not support reassignment of LID values.
If LID reassignment is required, stop all SHARP jobs, reassign the LIDs via OpenSM, and restart sharp_am once the reassignment is done.

Workaround: N/A

Keywords: Aggregation Manager; OpenSM

Discovered in Release: 3.1.0

3048427

Description: If a switch's split mode is changed (off/on), sharp_am does not handle the new number of supported ports until it is restarted.

Workaround: Restart sharp_am after changing a switch split mode definition.

Keywords: Aggregation Manager; split mode

Discovered in Release: 2.7.0

3051699

Description: Changing the configuration of SHARP switch ports using device_configuration_file does not take effect on disconnected split ports. If these ports are connected later, they keep their default configuration.

Workaround: If the new configuration is required for the split ports, restart the Aggregation Manager after connecting a split port to a host.

Keywords: Aggregation Manager; split port

Discovered in Release: 2.7.0

-

Description: In a multi-PKEY environment, UCX in SHARP can use only the default PKEY (the PKEY at index 0).

Workaround: Use sockets for communication over a non-default PKEY.

Keywords: Configuration; SMX; UCX; PKEY

Discovered in Release: 2.4.3

-

Description: High Availability for the Aggregation Manager is not supported in HPC-X/DOCA-Host packages at this time. As a result, only one instance of the Aggregation Manager can operate within the InfiniBand fabric. When there is a handover or failover of the Subnet Manager, a new instance of the Aggregation Manager should be initiated on the host where the new Master Subnet Manager is active.

Workaround: Use Aggregation Manager in UFM.

Keywords: Aggregation Manager

-

Description: The Aggregation Manager should run on the same host as the Master Subnet Manager (SM).

Workaround: N/A

Keywords: Aggregation Manager

-

Description: The Aggregation Manager should be started only after the Subnet Manager has completed fabric configuration.

Workaround: N/A

Keywords: Aggregation Manager

-

Description: Only Fat-Tree, Quasi-Fat-Tree, and Dragonfly+ topologies are supported by the Aggregation Manager.

Workaround: N/A

Keywords: Fabric Topology

-

Description: Only IB fabrics where all compute nodes are connected to NVIDIA SHARP capable switches are supported by the Aggregation Manager.

Workaround: Manually configure mapping between the compute port and the Aggregation Node.

Keywords: Fabric Topology

-

Description: After changes to the configuration file beyond the parameters in 3.3, the Aggregation Manager should be restarted to apply the new configuration.

Workaround: N/A

Keywords: Configuration

 
