UFM Telemetry Documentation

Software Management

Deploying UFM Telemetry

UFM Telemetry can be deployed in the following modes:

Bare Metal - Bringup Mode

NVIDIA UFM Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.

To deploy UFM Telemetry in Bringup mode, perform the following steps:

  1. Make sure the following prerequisites are installed:

    1. Python3

    2. Python3-venv

    3. Supervisor

  2. Copy the tarball package to the targeted location.

  3. Extract the package.

    tar -xf ufm_telemetry-<version>.tar.gz
    
  4. Start collection. 

    ./bin/run_bringup.sh
    
    CollectX: collection_start
    
    This collects port counter and cable data every minute for 24 hours, using HCA mlx5_0, and writes the data to ./collection_data/clx-bringup-X.
    
    CollectX: help collection_start
    
    Usage:
    
                            options                          defaults      Description
                            -------                          --------      -----------
          collection_start  time|duration=n [s|m|h|d]        24h           Session duration
                            sample_rate=n [s|m|h|d]          60 seconds    Data sample rate
                            guids=[guid_list|guid_file]      None          Target devices guid
                            counter_set=[file.xcset]         None          Counter list to be collected
                            hca=hca_name                     mlx5_0        Device to access the fabric
                            cable|cable_info=[yes|no|once]   yes           Collect cable info
                            nvlink_info=[yes|no]             no            Collect NVLink info
                            disconnected_cable=[yes|no]      no            Collect disconnected cables info
                            reset_counters=t                 false         Reset counters of fabric devices
                            compress_data=[yes|no]           yes           Compress files (if write_files=true)
                            mads_retries=n                   2             Set number of retries for MADs
                            mads_timeout=n (msec)            500           Set timeout for MADs
                            force_hca=t                      f             Avoid HCA state check
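
Step 1 of the procedure above can be checked with a short script before the package is extracted. The following is a minimal sketch and is not part of the UFM Telemetry package. It assumes that the Python3 and Supervisor prerequisites install executables named python3 and supervisord, and that a usable Python3-venv means the venv and ensurepip modules can be imported. The function name missing_commands is hypothetical.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Assumption: the Python3 and Supervisor prerequisites provide these executables.
missing=$(missing_commands python3 supervisord)

# Assumption: creating a virtual environment needs both the venv and
# ensurepip modules (some distributions ship them in a separate package).
if command -v python3 >/dev/null 2>&1 &&
   ! python3 -c 'import venv, ensurepip' >/dev/null 2>&1; then
    missing="$missing python3-venv"
fi

if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing prerequisites:" $missing
else
    echo "All prerequisites found"
fi
```

The script only reports what it finds and leaves the installation of any missing package to the distribution's package manager.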
    

Bare Metal Mode 

NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.

To deploy UFM Telemetry:

  1. Ensure the following prerequisites are installed:

    1. Python3

    2. Python3-venv

    3. Supervisor

  2. Copy the tarball package to the target location.

  3. Extract the package.

    tar -xf ufm_telemetry-<version>.tar.gz 
    
  4. Initialize and configure.

    ./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1"
    

    This collects port counter data every 5 minutes, using HCA mlx5_0, and writes the data to /tmp/clx_data.

  5. Start data collection.

    supervisord --config /tmp/ufm_telemetry/conf/supervisord.conf
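
The --config argument in step 4 is a single string of key=value pairs separated by semicolons, and sample_rate=300 corresponds to the 5-minute interval described above (300 seconds). The following sketch builds that string from shell variables so that individual values are easier to change. The helper to_seconds and the variable names are hypothetical, and the script prints the command instead of running it.

```shell
#!/bin/sh
# Convert a duration such as 90s, 5m, 2h or 1d to seconds.
# A bare number is treated as seconds.
to_seconds() {
    value=${1%[smhd]}     # numeric part
    unit=${1#"$value"}    # unit suffix, empty when none was given
    case "$unit" in
        ""|s) echo "$value" ;;
        m)    echo $(( value * 60 )) ;;
        h)    echo $(( value * 3600 )) ;;
        d)    echo $(( value * 86400 )) ;;
    esac
}

HCA=mlx5_0
DATA_DIR=/tmp/clx_data
SAMPLE_RATE=$(to_seconds 5m)   # 300

# Same keys and values as in the step 4 example.
CONFIG="hca=${HCA};sample_rate=${SAMPLE_RATE};data_dir=${DATA_DIR};plugin_env_CLX_FILE_WRITE_ENABLED=1"

# Print the command for review instead of executing it.
echo "./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config \"$CONFIG\""
```

Quoting the whole string matters when the command is run, because an unquoted semicolon would be interpreted by the shell as a command separator.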
    

Bare Metal Mode - High Availability

NVIDIA® UFM® Telemetry can be obtained as a tarball for installation on a Linux machine with all prerequisites installed.

To deploy UFM Telemetry:

  1. Ensure the following prerequisites are installed:

    1. Python3

    2. Python3-venv

    3. Supervisor

  2. Copy the tarball package to the target location.

  3. Extract the package.

    tar -xf ufm_telemetry-<version>.tar.gz
    
  4. Initialize and configure.

    ./bin/initialize_telemetry.sh --telemetry-dir /tmp/ufm_telemetry --config "hca=mlx5_0;sample_rate=300;data_dir=/tmp/clx_data;plugin_env_CLX_FILE_WRITE_ENABLED=1" --gen_systemd_service
    

    This collects port counter data every 5 minutes, using HCA mlx5_0, and writes the data to /tmp/clx_data.

  5. Download the UFM-HA package on both servers from this link.

  6. Extract the HA package to /tmp/ and, from there, run the installation command on both servers as follows:

     In the command below, the disk partition name is assumed to be /dev/sda4.

    ./install -l /opt/ufm-telemetry/ -d /dev/sda4 -p telemetry
    
  7. Run the UFM-HA configuration command ONLY on the master server, as follows: 

    configure_ha_nodes.sh \
    --cluster-password 12345678 \
    --master-ip 192.168.10.1 \
    --standby-ip 192.168.10.2 \
    --virtual-ip 192.168.10.5
    

    The cluster-password must be at least 8 characters long.

    Replace the values in the above command with your servers' information.

  8. Start the UFM Telemetry HA cluster. Run:

    ufm_ha_cluster start
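
Because step 7 requires a cluster password of at least 8 characters, the length can be checked before configure_ha_nodes.sh is run. The following is a minimal sketch. The function name valid_cluster_password is hypothetical, and the password is the placeholder from the example above, not a recommended value.

```shell
#!/bin/sh
# Succeed when the given password has at least 8 characters.
valid_cluster_password() {
    [ "${#1}" -ge 8 ]
}

CLUSTER_PASSWORD=12345678   # placeholder taken from the step 7 example

if valid_cluster_password "$CLUSTER_PASSWORD"; then
    echo "Cluster password length OK"
else
    echo "Cluster password must be at least 8 characters long" >&2
fi
```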
    

To check the status of your UFM Telemetry HA cluster, run: 

    ufm_ha_cluster status

To perform failover, run: 

    ufm_ha_cluster failover

To perform takeover, run: 

    ufm_ha_cluster takeover

Upgrading UFM Telemetry Software

Upgrading UFM Telemetry requires removing the previous package, obtaining the new version of the UFM Telemetry package, configuring the telemetry, and starting it from the new installation package.

The upgrade procedure can be done in the following modes:

Bare Metal - Bringup Mode

  1. Stop the previous collection. Run:

    ./bin/run_bringup.sh
    CollectX: collection_stop
    
  2. Follow the instructions described in Deploying UFM Telemetry - Bare Metal - Bringup Mode with the new UFM Telemetry version.

  3. If needed, apply the previous configuration changes.

Bare Metal Mode

  1. Stop the previous collection. Run:

    kill "$SUPERVISORD_PID" # send SIGTERM to the supervisord process
    
  2. Follow the instructions described in Deploying UFM Telemetry - Bare Metal Mode with the new UFM Telemetry version.

  3. If needed, apply the previous configuration changes.
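
Because the upgrade removes the previous package, it can help to copy the old configuration aside before step 1 so that earlier changes can be reapplied in step 3. The sketch below is an illustration only. It assumes the telemetry directory from the deployment example (/tmp/ufm_telemetry) with its configuration under conf. The helper name backup_conf and the backup location are hypothetical.

```shell
#!/bin/sh
# Copy a configuration directory into a backup location under a
# timestamped name and print the path of the copy.
backup_conf() {
    src=${1%/}
    dest="$2/$(basename "$src").$(date +%Y%m%d%H%M%S)"
    mkdir -p "$2" && cp -a "$src" "$dest" && echo "$dest"
}

CONF_DIR=/tmp/ufm_telemetry/conf            # directory from the deployment example
BACKUP_ROOT="$HOME/ufm_telemetry_backup"    # hypothetical backup location

if [ -d "$CONF_DIR" ]; then
    backup_conf "$CONF_DIR" "$BACKUP_ROOT"
else
    echo "No configuration directory at $CONF_DIR, nothing to back up"
fi
```

Comparing the backup with the freshly generated configuration (for example with diff -r) shows which earlier changes still need to be reapplied.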

