Deploying UFM Telemetry
Deploying UFM Telemetry can be done in the following modes:
Bare Metal - Bringup Mode
NVIDIA UFM Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.
To deploy the UFM Telemetry in Bringup mode, perform the following steps:
-
Make sure the following prerequisites are installed:
-
Python3
-
Python3-venv
-
Supervisor
-
Copy the tarball package to the targeted location.
-
Extract the package.
tar -xf ufm_telemetry-<version>.tar.gz
-
Start collection.
./bin/run_bringup.sh
CollectX: collection_start
This collects port counter and cable data every minute, uses HCA mlx5_0, and writes data to ./collection_data/clx-bringup-X for a period of 24 hours.
CollectX: help collection_start
Usage:            options                         defaults    Description
                  -------                         --------    -----------
collection_start  time|duration=n [s|m|h|d]       24h         Session duration
                  sample_rate=n [s|m|h|d]         60 seconds  Data sample rate
                  guids=[guid_list|guid_file]     None        Target devices guid
                  counter_set=[file.xcset]        None        Counter list to be collected
                  hca=hca_name                    mlx5_0      Device to access the fabric
                  cable|cable_info=[yes|no|once]  yes         Collect cable info
                  nvlink_info=[yes|no]            no          Collect NVLink info
                  disconnected_cable=[yes|no]     no          Collect disconnected cables info
                  reset_counters=t                false       Reset counters of fabric devices
                  compress_data=[yes|no]          yes         Compress files (if write_files=true)
                  mads_retries=n                  2           Set number of retries for MADs
                  mads_timeout=n (msec)           500         Set timeout for MADs
                  force_hca=t                     f           Avoid HCA state check
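The prerequisites named in the steps above (Python3, Python3-venv, Supervisor) can be checked before the tarball is extracted. The following is a minimal sketch, not part of the UFM Telemetry package: the function names are hypothetical, and it assumes that Supervisor is installed as the `supervisord` command and that a working Python3-venv installation makes both the `venv` and `ensurepip` modules importable (true on Debian-style systems, possibly different elsewhere).

```shell
#!/bin/sh
# Hypothetical helpers; none of these ship with UFM Telemetry.

# Print the name of every command in the argument list that is not on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Succeed only if python3 can import both venv and ensurepip.
# Assumption: a missing python3-venv package shows up as a missing ensurepip.
has_python_venv() {
    python3 -c 'import venv, ensurepip' >/dev/null 2>&1
}

# Report missing prerequisites on stderr; return non-zero if any is missing.
check_prerequisites() {
    missing=$(missing_commands python3 supervisord)
    has_python_venv || missing="$missing python3-venv"
    if [ -n "$missing" ]; then
        echo "Missing prerequisites:" $missing >&2
        return 1
    fi
    echo "All prerequisites found."
}
```

Calling `check_prerequisites` before copying the package gives an early failure instead of a partially initialized installation.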
Bare Metal Mode
NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.
To deploy the UFM Telemetry:
-
Ensure the following prerequisites are installed:
-
Python3
-
Python3-venv
-
Supervisor
-
Copy the tarball package to the target location.
-
Extract the package.
tar -xf ufm_telemetry-<version>.tar.gz
-
Initialize and configure.
./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1"
This collects port counter data every 5 minutes, uses HCA mlx5_0, and writes data to /tmp/clx_data.
-
Start data collection.
supervisord --config /tmp/ufm_telemetry/conf/supervisord.conf
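The `--config` argument in the initialization step is a single string of `key=value` pairs separated by semicolons, which is easy to mistype when edited by hand. As an illustration only, the sketch below assembles that string from separate arguments and prints the resulting command for review rather than running it. The `join_config` helper is hypothetical; the keys and values are the ones shown in the step above.

```shell
#!/bin/sh
# Hypothetical helper: join key=value arguments with ';' for --config.
# The function body runs in a subshell so the IFS change does not leak out.
join_config() (
    IFS=';'
    printf '%s\n' "$*"
)

# sample_rate=300 corresponds to the 5-minute interval described above.
CONFIG=$(join_config \
    "hca=mlx5_0" \
    "sample_rate=300" \
    "data_dir=/tmp/clx_data" \
    "plugin_env_CLX_FILE_WRITE_ENABLED=1")

# Print the command instead of executing it, so it can be inspected first.
printf '%s\n' "./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config \"$CONFIG\""
```

Keeping each pair on its own line makes it simpler to change one setting, such as the HCA name, without disturbing the separators.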
Bare Metal Mode - High Availability
NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.
To deploy the UFM Telemetry:
-
Ensure the following prerequisites are installed:
-
Python3
-
Python3-venv
-
Supervisor
-
Copy the tarball package to the target location.
-
Extract the package.
tar -xf ufm_telemetry-<version>.tar.gz
-
Initialize and configure.
./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1" --gen_systemd_service
This collects port counter data every 5 minutes, uses HCA mlx5_0, and writes data to /tmp/clx_data.
-
Download the UFM-HA package on both servers from this link.
-
Extract the HA package to /tmp/ and, from there, run the installation command on both servers as follows:
In the command below, the disk partition name is assumed to be /dev/sda4.
./install -l /opt/ufm-telemetry/ -d /dev/sda4 -p telemetry
-
Run the UFM-HA configuration command ONLY on the master server, as follows:
configure_ha_nodes.sh \
  --cluster-password 12345678 \
  --master-ip 192.168.10.1 \
  --standby-ip 192.168.10.2 \
  --virtual-ip 192.168.10.5
The cluster-password must be at least 8 characters long.
Replace the values in the above command with your servers' information.
-
Start the UFM Telemetry HA cluster. Run:
ufm_ha_cluster start
To check the status of your UFM Telemetry HA cluster, run:
ufm_ha_cluster status
To perform failover, run:
ufm_ha_cluster failover
To perform takeover, run:
ufm_ha_cluster takeover
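Because the HA configuration command requires a cluster password of at least 8 characters, it can help to validate the value before running the command on the master server. The following is a minimal sketch under that one documented constraint: both function names are hypothetical, the length check counts characters as the shell sees them (reliable for plain ASCII passwords), and the command is only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the UFM-HA package.

# Succeed if the password is at least 8 characters long.
valid_cluster_password() {
    [ "${#1}" -ge 8 ]
}

# Print the configure_ha_nodes.sh invocation for review.
# Arguments: password, master IP, standby IP, virtual IP.
# Returns non-zero without printing a command if the password is too short.
print_ha_configure_command() {
    if ! valid_cluster_password "$1"; then
        echo "cluster password must be at least 8 characters long" >&2
        return 1
    fi
    printf 'configure_ha_nodes.sh --cluster-password %s --master-ip %s --standby-ip %s --virtual-ip %s\n' \
        "$1" "$2" "$3" "$4"
}
```

For example, `print_ha_configure_command 12345678 192.168.10.1 192.168.10.2 192.168.10.5` prints the command used in the configuration step, while a 7-character password is rejected.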
Upgrading UFM Telemetry Software
Upgrading UFM Telemetry requires removing the previous package, pulling the new version of the UFM Telemetry package, configuring the telemetry, and starting it from the new installation package.
The upgrade procedure can be done in the following modes:
Bare Metal - Bringup Mode
-
Stop previous collection. Run:
./bin/run_bringup.sh
CollectX: collection_stop
-
Follow the instructions described in Deploying UFM Telemetry - Bare Metal - Bringup Mode with the new UFM Telemetry version.
-
If needed, apply the previous configuration changes.
Bare Metal Mode
-
Stop previous collection. Run:
kill $SUPERVISORD_PID # send SIGTERM to the supervisord process
-
Follow the instructions described in Deploying UFM Telemetry - Bare Metal Mode with the new UFM Telemetry version.
-
If needed, apply the previous configuration changes.
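When stopping the previous collection with `kill`, the supervisord process may take a moment to shut down, and starting the new version before it has exited could leave two collectors running. The sketch below shows one way to send SIGTERM and wait for the process to disappear. It assumes the supervisord PID is already known; the `stop_and_wait` name and the 30-second default timeout are hypothetical choices, not values from the UFM Telemetry documentation.

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM to a PID and wait until it exits.
# Arguments: PID, optional timeout in seconds (default 30, an assumed value).
# Returns 0 once the process is gone, 1 on timeout or if the signal fails.
stop_and_wait() {
    sw_pid=$1
    sw_timeout=${2:-30}
    # Nothing to do if the process is already gone
    # (or cannot be signalled by the current user).
    kill -0 "$sw_pid" 2>/dev/null || return 0
    # kill sends SIGTERM by default.
    kill "$sw_pid" || return 1
    sw_waited=0
    while kill -0 "$sw_pid" 2>/dev/null; do
        [ "$sw_waited" -ge "$sw_timeout" ] && return 1
        sleep 1
        sw_waited=$((sw_waited + 1))
    done
    return 0
}
```

A call such as `stop_and_wait "$SUPERVISORD_PID" 60` would then precede the deployment steps for the new version.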