Deploying UFM Telemetry
UFM Telemetry can be deployed in the following modes:
-
Bare Metal - Bringup Mode
-
Docker Container Mode
-
Docker Container Mode - High Availability
-
Bare Metal Mode
-
Bare Metal Mode - High Availability
Bare Metal - Bringup Mode
NVIDIA UFM Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.
To deploy the UFM Telemetry in Bringup mode, perform the following steps:
-
Make sure the following prerequisites are installed: Python3, Python3-venv, Supervisor
-
Copy the tarball package to the targeted location.
-
Extract the package.
tar -xf ufm_telemetry-<version>.tar.gz
-
Start collection.
./bin/run_bringup.sh
CollectX: collection_start
This collects port counter and cable data every minute, uses HCA mlx5_0, and writes data to ./collection_data/clx-bringup-X for a period of 24 hours.
CollectX: help collection_start
Usage:
                  options                          defaults     Description
                  -------                          --------     -----------
collection_start  time|duration=n [s|m|h|d]        24h          Session duration
                  sample_rate=n [s|m|h|d]          60 seconds   Data sample rate
                  guids=[guid_list|guid_file]      None         Target devices guid
                  counter_set=[file.xcset]         None         Counter list to be collected
                  hca=hca_name                     mlx5_0       Device to access the fabric
                  cable|cable_info=[yes|no|once]   yes          Collect cable info
                  nvlink_info=[yes|no]             no           Collect NVLink info
                  disconnected_cable=[yes|no]      no           Collect disconnected cables info
                  reset_counters=t                 false        Reset counters of fabric devices
                  compress_data=[yes|no]           yes          Compress files (if write_files=true)
                  mads_retries=n                   2            Set number of retries for MADs
                  mads_timeout=n (msec)            500          Set timeout for MADs
                  force_hca=t                      f            Avoid HCA state check
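The time and sample_rate options in the help output above take a number with a unit (s, m, h, or d). As a rough illustration of what those values mean, the sketch below converts such a value to seconds. The function name to_seconds is hypothetical and not part of CollectX, and the compact form (for example 24h) is an assumption based on the default shown in the help text.

```shell
#!/bin/sh
# Convert a value such as "24h" or "60s" to seconds.
# Assumption: the value is a non-negative integer, optionally followed
# by one unit letter: s, m, h, or d. No unit is treated as seconds.
to_seconds() {
    value=$1
    number=${value%[smhd]}      # strip one trailing unit letter, if present
    unit=${value#"$number"}     # whatever is left is the unit (may be empty)
    case $number in
        ''|*[!0-9]*) echo "invalid duration: $value" >&2; return 1 ;;
    esac
    case $unit in
        ''|s) echo "$number" ;;
        m) echo $((number * 60)) ;;
        h) echo $((number * 3600)) ;;
        d) echo $((number * 86400)) ;;
    esac
}

to_seconds 24h   # default session duration, prints 86400
to_seconds 60s   # default sample rate, prints 60
```

With the defaults, a 24-hour session sampled every 60 seconds corresponds to 86400 / 60 = 1440 samples.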
Docker Container Mode
NVIDIA UFM Telemetry is packaged as a docker image that should be loaded and deployed on a Linux machine with docker installed. This section describes how to deploy the UFM Telemetry docker image on a Linux machine.
To deploy the UFM telemetry, perform the following steps:
-
Make sure that docker is installed on the Linux machine.
[root@r-ufm ~]# docker --version
-
Start the docker service.
[root@r-ufm ~]# sudo service docker start
-
Pull the image.
[root@r-ufm ~]# export image=mellanox/ufm-telemetry:<version>
[root@r-ufm ~]# sudo docker pull $image
-
Create the default .ini files in the local directory that is mapped to /config in the container, and initialize the container configuration.
[root@r-ufm ~]# sudo docker run -v /opt/ufm-telemetry/conf:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"
This collects port counter data every 5 minutes and uses HCA mlx5_0. It also collects cable info on the 1st, 3rd, and 5th day of the week at midnight, where:
-
sample_rate: Frequency of collecting port counters
-
hca: Card to use
-
cable_info_schedule: Time of collecting cable info data (optional)
-
Create a container of UFM telemetry.
[root@r-ufm ~]# sudo docker run --net=host --uts=host --ipc=host \
--ulimit stack=67108864 --ulimit memlock=-1 \
--security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
--device=/dev/infiniband/ \
-v "/opt/ufm-telemetry/conf:/config" \
-v "/tmp/data:/data" \
-v "/opt/ufm/files/licenses:/opt/ufm/files/licenses/" \
--rm --name ufm-telemetry -d $image
-
Verify that UFM Telemetry is running.
-
Make sure the UFM Telemetry container is up.
[root@r-ufm ~]# docker ps
-
If the container name exists, access the shell of the container.
[root@r-ufm ~]# docker exec -it ufm-telemetry bash
-
Review your configurations under /config/launch_ibdiagnet_config.ini.
-
View the UFM Telemetry configuration files.
[root@r-ufm ~]# ls -l /config/
-rw-r--r-- 1 3478 101  396 Apr 15 21:04 clx_config.ini
-rw-r--r-- 1 3478 101 2987 Apr 15 21:04 collectx.ini
-rw-r--r-- 1 3478 101 4257 Apr 15 21:04 launch_ibdiagnet_config.ini
-rw-r--r-- 1 3478 101 1912 Apr 16 12:03 supervisord.conf
-
To review the execution of the various components, check the log files under /var/log. Each component has a dedicated log file. Running the "ls -l" command displays all files in the folder. The following output shows only the relevant log files (other files have been omitted).
[root@r-ufm ~]# ls -l /var/log
-rw-r--r-- 1 root root 128393 Apr 3 10:49 launch_cableinfo.log
-rw-r--r-- 1 root root    467 Apr 3 09:35 launch_compression.log
-rw-r--r-- 1 root root 194566 Apr 3 10:49 launch_ibdiagnet.log
-rw-r--r-- 1 root root    798 Apr 3 09:35 launch_retention.log
-rw-r--r-- 1 root root   1729 Apr 3 09:56 supervisord.log
-
To exit the UFM Telemetry docker context, run "exit" to return to the Linux machine context.
-
To access the UFM Telemetry CLI, run the following command on the Linux machine:
[root@r-ufm ~]# docker exec -it ufm-telemetry clxcli
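The cable_info_schedule value used in the configuration step above packs several day/time entries into one string. The sketch below splits such a value into one line per entry, which can make a schedule easier to review before passing it to get_collectx_configs.sh. The function name describe_schedule is hypothetical, and the entry format (comma-separated day/HH:MM pairs) is an assumption inferred from the example value.

```shell
#!/bin/sh
# Print one line per entry of a cable_info_schedule value.
# Assumption: entries are comma-separated, each of the form <day>/<HH:MM>.
describe_schedule() {
    old_ifs=$IFS
    IFS=,                        # split the argument on commas
    for entry in $1; do
        day=${entry%%/*}         # text before the first "/"
        clock=${entry#*/}        # text after the first "/"
        echo "day $day at $clock"
    done
    IFS=$old_ifs
}

describe_schedule "1/00:00,3/00:00,5/00:00"
```

For the example value, this prints three lines, one each for days 1, 3, and 5 at 00:00.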
Docker Container Mode - High Availability
Requirements:
-
The HA solution requires a dedicated partition for DRBD to work with, for example /dev/sda4.
-
Install pcs and drbd-utils on both servers (using "yum install" or "apt-get install", based on your OS).
On RH/CentOS, run "yum install pcs drbd84-utils kmod-drbd84".
Procedure:
-
Load (pull) the latest UFM Telemetry Docker image on both servers.
docker pull mellanox/ufm-telemetry:latest
-
Run the Telemetry configuration command on both servers.
docker run --rm -i --name=config-telemetry \
-v /opt/ufm-telemetry/conf:/config \
-v /etc/systemd/system:/etc/systemd/system \
-v /var/run/docker.sock:/var/run/docker.sock \
mellanox/ufm-telemetry:latest \
/get_collectx_configs.sh \
--gen_service \
--config=ufm_telemetry
-
Refresh systemd on both servers:
systemctl daemon-reload
-
Create the /opt/ufm-telemetry/licenses/ directory on the master server and copy the UFM Telemetry license file there.
-
Download the UFM-HA package on both servers from this link.
-
Extract the HA package to /tmp/ and, from there, run the installation command on both servers as follows. In the command below, the disk (partition name) is assumed to be /dev/sda4.
./install -l /opt/ufm-telemetry/ -d /dev/sda4 -p telemetry
-
Run the UFM-HA configuration command ONLY on the master server, as follows:
configure_ha_nodes.sh \
--cluster-password 12345678 \
--master-ip 192.168.10.1 \
--standby-ip 192.168.10.2 \
--virtual-ip 192.168.10.5
The cluster-password must be at least 8 characters long.
Replace the values in the above command with your servers' information.
-
Start UFM Telemetry HA cluster. Run:
ufm_ha_cluster start
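Because configure_ha_nodes.sh expects a cluster password of at least 8 characters, it can be convenient to check the length before running the configuration command. The following is a minimal sketch. The function name valid_cluster_password is hypothetical and not part of the UFM-HA package, and it checks only the length requirement stated above.

```shell
#!/bin/sh
# Succeed only if the password has at least 8 characters,
# the minimum stated for --cluster-password.
valid_cluster_password() {
    [ "${#1}" -ge 8 ]
}

password="12345678"   # example value taken from the command above
if valid_cluster_password "$password"; then
    echo "password length OK"
else
    echo "password must be at least 8 characters long" >&2
fi
```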
Bare Metal Mode
NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.
To deploy the UFM Telemetry:
-
Ensure the following prerequisites are installed: Python3, Python3-venv, Supervisor
-
Copy the tarball package to the target location.
-
Extract the package.
tar -xf ufm_telemetry-<version>.tar.gz
-
Initialize and configure.
./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1"
This collects port counter data every 5 minutes, uses HCA mlx5_0, and writes data to /tmp/clx_data.
-
Start data collection.
supervisord --config /tmp/ufm_telemetry/conf/supervisord.conf
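The prerequisites listed in the first step can be checked before running the initialization script. The sketch below is one possible way to do so. The function name missing_commands is hypothetical. It assumes that Python3 and Supervisor are exposed as the python3 and supervisord executables, and that venv support can be probed by importing the venv and ensurepip modules, which may not hold on every distribution.

```shell
#!/bin/sh
# Print the name of each given command that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Assumption: the prerequisites are available under these executable names.
missing=$(missing_commands python3 supervisord)

# Assumption: a working venv needs both the venv and ensurepip modules.
python3 -c 'import venv, ensurepip' >/dev/null 2>&1 || missing="$missing python3-venv"

if [ -n "$missing" ]; then
    echo "missing prerequisites:" $missing
else
    echo "all prerequisites found"
fi
```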
Bare Metal Mode - High Availability
NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.
To deploy the UFM Telemetry:
-
Ensure the following prerequisites are installed: Python3, Python3-venv, Supervisor
-
Copy the tarball package to the target location.
-
Extract the package.
tar -xf ufm_telemetry-<version>.tar.gz
-
Initialize and configure.
./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1" --gen_systemd_service
This collects port counter data every 5 minutes, uses HCA mlx5_0, and writes data to /tmp/clx_data.
-
Download the UFM-HA package on both servers from this link.
-
Extract the HA package to /tmp/ and, from there, run the installation command on both servers as follows. In the command below, the disk (partition name) is assumed to be /dev/sda4.
./install -l /opt/ufm-telemetry/ -d /dev/sda4 -p telemetry
-
Run the UFM-HA configuration command ONLY on the master server, as follows:
configure_ha_nodes.sh \
--cluster-password 12345678 \
--master-ip 192.168.10.1 \
--standby-ip 192.168.10.2 \
--virtual-ip 192.168.10.5
The cluster-password must be at least 8 characters long.
Replace the values in the above command with your servers' information.
-
Start UFM Telemetry HA cluster. Run:
ufm_ha_cluster start
To check the status of your UFM Telemetry HA cluster, run:
ufm_ha_cluster status
To perform failover, run:
ufm_ha_cluster failover
To perform takeover, run:
ufm_ha_cluster takeover
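Before running configure_ha_nodes.sh, it may help to sanity-check the master, standby, and virtual IP addresses. The sketch below validates dotted-quad IPv4 syntax and checks that the three addresses differ. The function names is_ipv4 and ha_addresses_ok are hypothetical, and the requirement that the addresses be pairwise distinct is an assumption, not something stated by the UFM-HA package.

```shell
#!/bin/sh
# Succeed if the argument looks like a dotted-quad IPv4 address.
is_ipv4() {
    case $1 in
        *[!0-9.]*|.*|*.|*..*) return 1 ;;   # bad characters or empty octets
    esac
    old_ifs=$IFS
    IFS=.
    set -- $1                               # split into octets
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    for octet in "$@"; do
        [ "${#octet}" -le 3 ] || return 1
        [ "$octet" -le 255 ] || return 1
    done
}

# Succeed if all three addresses are valid and pairwise different.
# Assumption: master, standby, and virtual IP must not coincide.
ha_addresses_ok() {
    is_ipv4 "$1" && is_ipv4 "$2" && is_ipv4 "$3" &&
        [ "$1" != "$2" ] && [ "$1" != "$3" ] && [ "$2" != "$3" ]
}

if ha_addresses_ok 192.168.10.1 192.168.10.2 192.168.10.5; then
    echo "addresses look valid"
else
    echo "check the master, standby, and virtual IP values" >&2
fi
```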
Upgrading UFM Telemetry Software
Upgrading UFM Telemetry requires removing the previous package, pulling the new version of the UFM telemetry package, configuring the telemetry, and starting it from the new installation package.
The upgrade procedure can be done in the following three modes:
Bare Metal - Bringup Mode
-
Stop previous collection. Run:
./bin/run_bringup.sh
CollectX: collection_stop
-
Follow the instructions described in Deploying UFM Telemetry - Bare Metal - Bringup Mode with the new UFM Telemetry version.
-
If needed, apply the previous configuration changes.
Docker Container Mode
-
Stop the previous ufm-telemetry container.
[root@r-ufm ~]# docker stop ufm-telemetry
-
Pull the new UFM Telemetry image.
[root@r-ufm ~]# export image=mellanox/ufm-telemetry:rhel7.3_x86_64_ofed5.1-2.3.7_release_1.6_latest
[root@r-ufm ~]# docker pull $image
-
Create a container for new UFM Telemetry.
[root@r-ufm ~]# docker run --net=host --uts=host --ipc=host \
--ulimit stack=67108864 --ulimit memlock=-1 \
--security-opt seccomp=unconfined --cap-add=SYS_ADMIN \
--device=/dev/infiniband/ \
-v "/opt/ufm-telemetry/conf:/config" \
-v "/tmp/data:/data" \
--rm --name ufm-telemetry -d $image
-
Configure the UFM Telemetry based on the new configurations.
[root@r-ufm ~]# docker run -v /opt/ufm-telemetry/conf:/config --rm -d $image /get_collectx_configs.sh "sample_rate=300;hca=mlx5_0;cable_info_schedule=1/00:00,3/00:00,5/00:00"
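When scripting the upgrade, the stop step can be made safe to rerun by stopping the container only if it is currently running. The sketch below is one way to do this, assuming the container name ufm-telemetry used above. The helper name name_listed is hypothetical. It relies on docker ps --format '{{.Names}}', which prints one running container name per line.

```shell
#!/bin/sh
# Read names from stdin (one per line) and succeed on an exact match.
name_listed() {
    grep -Fxq -- "$1"
}

# Stop the container only if docker is available and the name is listed.
if command -v docker >/dev/null 2>&1 &&
   docker ps --format '{{.Names}}' 2>/dev/null | name_listed ufm-telemetry; then
    docker stop ufm-telemetry
else
    echo "ufm-telemetry is not running; nothing to stop"
fi
```

The exact-match options (-F and -x) keep a container with a similar name, such as ufm-telemetry-old, from being mistaken for the target.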
Bare Metal Mode
-
Stop previous collection. Run:
kill $SUPERVISORD_PID # send sigterm to the supervisord proc
-
Follow the instructions described in Deploying UFM Telemetry - Bare Metal Mode with the new UFM Telemetry version.
-
If needed, apply the previous configuration changes.
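The stop step above uses $SUPERVISORD_PID without showing where the value comes from. One possible source is the pidfile option in the [supervisord] section of the Supervisor configuration file. The sketch below reads that option and sends SIGTERM to the recorded process. The function name supervisord_pidfile is hypothetical, and it is an assumption that the generated /tmp/ufm_telemetry/conf/supervisord.conf sets a pidfile.

```shell
#!/bin/sh
# Print the pidfile path from the [supervisord] section of a config file.
supervisord_pidfile() {
    awk -F= '
        /^\[/ { in_section = ($0 ~ /^\[supervisord\]/) }
        in_section && $1 ~ /^[ \t]*pidfile[ \t]*$/ {
            value = substr($0, index($0, "=") + 1)
            sub(/[ \t]+;.*$/, "", value)          # drop an inline comment
            gsub(/^[ \t]+|[ \t]+$/, "", value)    # trim surrounding blanks
            print value
            exit
        }
    ' "$1"
}

# Assumption: this is the config path used when starting data collection.
conf=/tmp/ufm_telemetry/conf/supervisord.conf
if [ -r "$conf" ]; then
    pidfile=$(supervisord_pidfile "$conf")
    if [ -n "$pidfile" ] && [ -r "$pidfile" ]; then
        kill "$(cat "$pidfile")"    # plain kill sends SIGTERM
    fi
fi
```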