InfiniBand Cluster Bring-up Procedure

SM Logs

SM logs include details of reported errors. Treat every error reported in opensm.log as an indicator of IB fabric health.

SM logs path:

  • When OpenSM is running without UFM: /var/log/opensm.log

  • When OpenSM is running with UFM in a Docker container, enter the container:

    docker exec -it ufm bash
    

    The log path inside the container is: /opt/ufm/files/log/opensm.log

The SM log file should include the message "SUBNET UP" if OpenSM was able to set up the subnet correctly.
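
The check for this message can be scripted. The following is a minimal sketch, not part of the original procedure: the function name subnet_is_up is hypothetical, and the default path assumes OpenSM running without UFM. It only confirms that the message appears somewhere in the file, not when it was logged.

```shell
# Succeed (exit status 0) if the given OpenSM log contains "SUBNET UP".
# subnet_is_up is an illustrative name; the default path is the non-UFM location.
subnet_is_up() {
    grep -q "SUBNET UP" "${1:-/var/log/opensm.log}"
}
```

For example, to print a confirmation only when the message is present:

    subnet_is_up /var/log/opensm.log && echo "SUBNET UP found"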

Logs Parameters

The SM log file size can be changed. You can also choose how often a new SM log file is created: daily, weekly (default), or monthly.

The SM log file either grows until it reaches its maximum size or is rotated according to the configured rotation period.

  1. Modify the OpenSM log maximum file size by editing the log_max_size parameter in opensm.conf:

    vi /opt/ufm/files/conf/opensm/opensm.conf
    


  2. Modify the OpenSM log rotation frequency:

    vi /etc/logrotate.d/opensm
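
Before editing, it can help to read the current size limit without opening the file. The sketch below assumes that opensm.conf holds one "name value" pair per line and uses "#" for comments. The function name opensm_log_max_size is hypothetical.

```shell
# Print the value of log_max_size from an OpenSM configuration file.
# Assumes one "name value" pair per line; commented-out lines are ignored
# because their first field is not exactly "log_max_size".
opensm_log_max_size() {
    awk '$1 == "log_max_size" { print $2 }' \
        "${1:-/opt/ufm/files/conf/opensm/opensm.conf}"
}
```

Called without arguments, the function reads the UFM path used in step 1. Pass a different file as the first argument if your configuration lives elsewhere.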
    


Useful Commands

Locate the subnet manager:

[root@fit229 ~]# sminfo
sminfo: sm lid 8 sm guid 0xa088c203007cdd36, activity count 47086 priority 15 state 3 SMINFO_MASTER

Query node description:

[root@fit229 ~]# smpquery nd 8
Node Description:...................fit232 mlx5_0
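
The two commands can be chained so that the node description of the current master SM is queried in one step. This is a sketch that relies on the "sminfo: sm lid <LID> sm guid ..." output layout shown above. The helper name sm_lid_from_sminfo is hypothetical.

```shell
# Read sminfo output on stdin and print only the SM LID.
# Relies on the "sminfo: sm lid <LID> sm guid ..." layout; prints nothing
# if the input does not match that layout.
sm_lid_from_sminfo() {
    sed -n 's/^sminfo: sm lid \([0-9][0-9]*\) .*/\1/p'
}
```

Usage:

    smpquery nd "$(sminfo | sm_lid_from_sminfo)"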

Common Errors

Error            Description
TIMEOUT          Timeout in the network. Look for a bad cable.
trap 128         The link state changed. If this occurs too often on the same cable, make sure the cable is not damaged.
trap 131         A bad cable is connected.
trap 144         Change in link width/speed or in node description.
traps 257-259    Bad partitions.
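
To see which traps occur most often and from which LID, the log can be summarized. The sketch below assumes that traps are logged as "Generic Notice" lines carrying a "num:<trap>" field and a "from LID:<lid>" field, as in the trap 128 entry in the example in this section. Lines in any other layout are skipped. The function name count_traps is hypothetical.

```shell
# Read OpenSM log lines on stdin and print "count trap_number LID",
# most frequent first. Only "Generic Notice" lines that carry both
# "num:<n>" and "from LID:<n>" are counted.
count_traps() {
    grep 'Generic Notice' |
        sed -n 's/.* num:\([0-9][0-9]*\) .* from LID:\([0-9][0-9]*\).*/\1 \2/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2, $3 }'   # normalize the padding added by uniq -c
}
```

Usage:

    count_traps < /var/log/opensm.log

A LID that dominates the trap 128 count is a natural starting point for the per-port check in the example.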

Example (Error trap 128):

Check the error by running the following command. If a port's LinkDownedCounter is unusually high, the cable on that port is likely faulty.

    for i in {1..<ports amount>}; do echo Port:$i; perfquery <LID> $i | grep LinkDownedCounter; done

Sample opensm.log entry reporting trap 128 from LID 4:

Apr 16 22:11:41 477567 [DA9C8640] 0x02 -> log_notice: Reporting Generic Notice type:1 num:128 (Link state change) from LID:4 GID:fe80::900a:8403:b3:c540

[root@l-qa-203 ~]# for i in {1..64};do echo Port:$i;perfquery 4 $i | grep LinkDownedCounter;done
Port:1
LinkDownedCounter:...............2
Port:2
LinkDownedCounter:...............0
Port:3
LinkDownedCounter:...............154222
Port:4
..
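
On a switch with many ports, the loop output can be filtered so that only suspicious ports are printed. The sketch below reads the "Port:<n>" and "LinkDownedCounter:...<count>" line pairs produced by the loop above. The function name ports_over_threshold is hypothetical, and the default threshold of 100 is an arbitrary illustration rather than a recommended value.

```shell
# Read "Port:<n>" / "LinkDownedCounter:....<count>" line pairs on stdin and
# print the ports whose counter exceeds the threshold (first argument).
# The default of 100 is an arbitrary example value.
ports_over_threshold() {
    awk -v limit="${1:-100}" '
        /^Port:/ { sub(/^Port:/, ""); port = $0; next }
        /^LinkDownedCounter:/ {
            count = $0
            sub(/^LinkDownedCounter:\.*/, "", count)   # strip label and dot padding
            if (count + 0 > limit + 0)
                print "Port " port ": LinkDownedCounter=" count
        }
    '
}
```

Usage, with the LID and port count from the example:

    for i in {1..64}; do echo Port:$i; perfquery 4 $i | grep LinkDownedCounter; done | ports_over_threshold 100

With the sample output above, only port 3 would be reported among the first three ports.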

