NVIDIA UFM Enterprise User Manual

UFM Telemetry Manager (UTM) Plugin

Plugin Release Notes

Changes and New Features

| Plugin Version | Feature |
|----------------|---------|
| 1.25.1-3       | N/A     |

Bug Fixes

| Plugin Version | Bug Fix |
|----------------|---------|
| 1.25.1-3       | N/A     |


Overview

The UFM Telemetry Manager (UTM) plugin partitions InfiniBand (IB) fabric monitoring across multiple UFM Telemetry Instances (TIs) for high-scale clusters. UTM assigns fabric ports to TIs deterministically using consistent hashing, optionally with redundancy, and manages the TI lifecycle: health monitoring, port assignment updates, and targeted restarts on fabric topology changes.

Key capabilities:

  • Stable port distribution: each port is assigned to a specific TI by consistent hashing, so the port-to-TI mapping does not reshuffle on every TI restart.

  • Configurable redundancy: a port can be monitored by multiple TIs simultaneously (port_redundancy_factor). With a factor of 2 or more, the failure of a single TI leaves no monitoring gap on its ports.

  • Targeted restart: when a topology change adds new ports, only the TIs that own the new ports are restarted; unaffected TIs keep collecting uninterrupted.

  • TI failure handling: failed TIs are kept in the active assignment during a grace period to absorb transient failures; if the TI does not recover, its ports are redistributed across the surviving TIs.
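The stable distribution, redundancy, and targeted-restart behavior described above can be illustrated with a small sketch. It uses rendezvous (highest-random-weight) hashing, which is one form of consistent hashing; this manual does not specify which algorithm UTM actually uses, so the function names, the port identifier format, and the hash choice here are all assumptions made for illustration.

```python
import hashlib

def owners(port_id, tis, redundancy=1):
    """Pick the TIs that monitor port_id (illustrative rendezvous hashing)."""
    # Clamp the requested redundancy to the number of available TIs.
    k = max(1, min(redundancy, len(tis)))

    def score(ti):
        digest = hashlib.sha256(f"{ti}|{port_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The k highest-scoring TIs own the port.
    return sorted(tis, key=score, reverse=True)[:k]

def tis_to_restart(new_ports, tis, redundancy=1):
    """Return only the TIs that own at least one newly added port."""
    affected = set()
    for port in new_ports:
        affected.update(owners(port, tis, redundancy))
    return affected

# Hypothetical TI URLs and port identifiers.
tis = ["http://127.0.0.1:9001", "http://127.0.0.1:9002", "http://127.0.0.1:9003"]
ports = [f"guid-{i:04x}_1" for i in range(200)]

before = {p: owners(p, tis)[0] for p in ports}
survivors = tis[:2]  # the third TI is gone
after = {p: owners(p, survivors)[0] for p in ports}
moved = [p for p in ports if before[p] != after[p]]
```

In this sketch, the only ports that change owner when a TI disappears are the ones that TI owned; every other port keeps its assignment, which is the property the "stable port distribution" bullet relies on.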

UTM runs two telemetry groups by default: primary (high-frequency port counters) and secondary (broader counter set, lower frequency). Each group independently covers 100% of the fabric. UFM controls how many instances are created in each group through primary_count and secondary_count in the [Telemetry] section of gv.cfg; see [UFM Clustered Telemetry].


Deployment

UTM is deployed as a UFM plugin. Two deployment paths are supported: UFM plugin mode and Kubernetes.

UFM Plugin Mode

The UTM plugin can be added either via the Command Line Interface or the Web UI.

CLI Deployment

To add the plugin:

/opt/ufm/scripts/manage_ufm_plugins.sh add -p utm

To remove the plugin:

/opt/ufm/scripts/manage_ufm_plugins.sh remove -p utm

Web-UI Deployment

  1. Navigate to the UFM Web UI and click Settings in the left panel.

  2. Open the Plugin Management tab.

  3. Right-click on the UTM plugin row and select Add.

  4. Open Telemetry Status in the left panel to access the UTM UI.

To stop the plugin: in Plugin Management, right-click the UTM row and select Disable.

Kubernetes Deployment

For deploying UTM in Kubernetes alongside UFM Enterprise, see the UFM Clustered Telemetry on Kubernetes section of [UFM Clustered Telemetry].


Configuration

The UTM configuration file utm_config.ini is located at /opt/ufm/files/conf/plugins/utm/utm_config.ini. UTM restarts its main process automatically when the file changes.

Key Tunables

These are the settings most operators tune. Anything not listed here ships with a sensible default and should not normally need to change.

| Section | Key | Default | Description |
|---------|-----|---------|-------------|
| [general] | port_redundancy_factor | 1 | Number of TIs each port is monitored by. Set to ≥2 to eliminate the monitoring gap on TI failure. Values larger than the number of live TIs are clamped at runtime; invalid values (≤0 or non-numeric) fall back to 1. |
| [general] | fabric_update_interval | 180 | Seconds between fabric snapshot fetches from UFM. |
| [general] | clear_cache_on_rebalance | false | When true, UTM clears stale telemetry data cache on TIs after a rebalance (removes data for ports no longer assigned to the TI). |
| [general] | log_level | info | Log verbosity (debug, info, warning, error). |
| [telemetry_instances] | server_<N> / group_<name>_server_<N> | 127.0.0.1:9001 / 9002 | TI URLs. Servers under group_<name>_server_* are placed in the named group; bare server_<N>= entries go to default. |
| [high_availability] | enable_ha | 0 | Set to 1 for HA active/active or active/standby deployments. |
| [high_availability] | primary_ports / secondary_ports | 9001 / 9002 | Port ranges for the primary and secondary groups in HA mode. |
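To make the port_redundancy_factor fallback rules and the [telemetry_instances] key naming concrete, the following sketch parses a sample configuration with Python's standard configparser. It is not UTM's own parser: the helper names, the sample values, and the choice to never return a redundancy below 1 are assumptions for illustration.

```python
import configparser
import re

# Hypothetical sample in the shape of utm_config.ini.
SAMPLE_CONFIG = """
[general]
port_redundancy_factor = 5

[telemetry_instances]
server_1 = 127.0.0.1:9001
group_secondary_server_1 = 127.0.0.1:9002
"""

GROUPED_KEY = re.compile(r"^group_(?P<group>.+)_server_\d+$")
BARE_KEY = re.compile(r"^server_\d+$")

def effective_redundancy(raw_value, live_ti_count):
    """Apply the documented rules: invalid -> 1, too large -> clamped."""
    try:
        value = int(raw_value)
    except (TypeError, ValueError):
        return 1  # non-numeric falls back to 1
    if value <= 0:
        return 1  # zero or negative falls back to 1
    # Assumption: never go below 1, even when no TI is currently live.
    return max(1, min(value, live_ti_count))

def group_servers(config):
    """Map group name -> list of TI addresses from [telemetry_instances]."""
    groups = {}
    if not config.has_section("telemetry_instances"):
        return groups
    for key, address in config.items("telemetry_instances"):
        grouped = GROUPED_KEY.match(key)
        if grouped:
            groups.setdefault(grouped.group("group"), []).append(address)
        elif BARE_KEY.match(key):
            groups.setdefault("default", []).append(address)
    return groups

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)
groups = group_servers(config)
raw = config.get("general", "port_redundancy_factor", fallback="1")
redundancy = effective_redundancy(raw, live_ti_count=2)
```

Here the configured factor of 5 exceeds the two live TIs assumed in the example, so the effective value is clamped to 2.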

Authentication

UTM authenticates to the UFM REST API with either an API token or a username and password. Token authentication is preferred where available; username/password is the fallback.

Token authentication (recommended):

Write the API token to a file and point ufm_token_file at it:

[ufm]
ufm_token_file = /config/ufm_token

If the file exists and is non-empty, UTM uses token auth automatically.

Username/password authentication (fallback):

For non-default UFM credentials:

[ufm]
ufm_user = <user>
ufm_pass = <password>

UTM falls back to username/password when no token file is configured.
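The selection rule above (use the token when the token file exists and is non-empty, otherwise fall back to username/password) can be sketched as follows. This is an illustration of the documented behavior, not UTM's code: choose_auth, the returned tuple shape, the demo credentials, and the choice to treat a whitespace-only file as empty are all assumptions.

```python
import tempfile
from pathlib import Path

def choose_auth(token_file, user, password):
    """Return ("token", value) or ("basic", (user, password))."""
    if token_file:
        path = Path(token_file)
        if path.is_file():
            # Assumption: surrounding whitespace is not part of the token.
            token = path.read_text(encoding="utf-8").strip()
            if token:
                return ("token", token)
    return ("basic", (user, password))

# Demonstrate the three cases with a throwaway token file.
with tempfile.TemporaryDirectory() as tmp:
    token_path = Path(tmp) / "ufm_token"

    # 1. File does not exist: username/password is used.
    missing = choose_auth(str(token_path), "demo_user", "demo_pass")

    # 2. File exists but is empty: username/password is used.
    token_path.write_text("\n", encoding="utf-8")
    empty = choose_auth(str(token_path), "demo_user", "demo_pass")

    # 3. File exists and is non-empty: token auth is used.
    token_path.write_text("abc123\n", encoding="utf-8")
    present = choose_auth(str(token_path), "demo_user", "demo_pass")
```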


GUI

The Telemetry Status page is accessible from the UFM Web UI sidebar under Telemetry Status, or directly at http://<utm-host>:8888/files/index.html.

The page contains:

  • Top pane: general info; controls to add a TI URL for monitoring; refresh-interval selector.

  • Group panes: one panel per telemetry group, showing every TI in the group with status and counters.

  • Bottom pane: system events with history navigation.

TI Status Fields

| Field | Description |
|-------|-------------|
| URL | TI URL (http://<host>:<port>). |
| Group | Telemetry group the TI belongs to (e.g. primary, secondary). |
| Mode | managed or platform. |
| Status | Down, Initializing, Running, Paused, or Restarting. |
| Uptime | TI uptime in human-readable format. |
| Collected ports | Ports successfully collected in the last sample (with +N_old_ports for unchanged data not re-exported). |
| Configured ports | Ports configured to be sampled by this TI. |
| Enabled / Discovered ports | Enabled and discovered ports of the fabric (per UTM's view). |
| Iteration time | Total iteration time of the last data-collection cycle. |

TI Management Actions

Right-click a TI row to:

  • Pause: pause a running TI; its ports are redistributed to other TIs in the group.

  • Resume: resume a paused TI.

  • Exclude: pause and remove the TI from its group (the TI itself stays on the host). Empty groups are removed automatically.


REST API

All GUI features (TI management, monitoring, configuration) are accessible via REST.

Accessing the API

In UFM plugin mode (proxied through UFM):

curl -k -u <user>:<pass> https://<UFM_HOST>/ufmRest/plugin/utm/<COMMAND>

Direct (e.g. K8s pod, port-forward):

curl http://<UTM_HOST>:8888/<COMMAND>

In plugin mode UTM listens on plain HTTP on port 8888; HTTPS termination is handled by UFM's proxy.
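For scripted access, the two URL forms above can be built with a small helper. This is a sketch using only the Python standard library; utm_url and its parameters are hypothetical names, and ufm.example is a placeholder host. It percent-encodes query values, which assumes the server decodes query strings in the standard way.

```python
from urllib.parse import urlencode

def utm_url(command, host, proxied=False, **params):
    """Build a UTM REST URL in the direct or UFM-proxied form."""
    if proxied:
        # Plugin mode: HTTPS through UFM's proxy.
        base = f"https://{host}/ufmRest/plugin/utm"
    else:
        # Direct access: plain HTTP on port 8888.
        base = f"http://{host}:8888"
    query = f"?{urlencode(params)}" if params else ""
    return f"{base}/{command.lstrip('/')}{query}"

direct_status = utm_url("status", "127.0.0.1")
proxied_add = utm_url(
    "add_server",
    "ufm.example",
    proxied=True,
    url="http://127.0.0.1:9001",
    group="primary",
)
```

The proxied form still needs UFM credentials on the request itself, as in the curl example above.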

Common Commands

The examples below use the direct form; substitute the proxied form for plugin mode.

# List all UTM endpoints
curl http://127.0.0.1:8888/help

# Status of monitored TIs
curl http://127.0.0.1:8888/status

# Add an externally-running TI to a monitoring group
curl 'http://127.0.0.1:8888/add_server?url=http://127.0.0.1:9001&group=primary'

# Pause / resume / remove a monitored TI
curl 'http://127.0.0.1:8888/pause_server?url=http://127.0.0.1:9001'
curl 'http://127.0.0.1:8888/start_server?url=http://127.0.0.1:9001'
curl 'http://127.0.0.1:8888/remove_server?url=http://127.0.0.1:9001'

# Spawn TIs by count, with automatic round-robin HCA allocation
curl -X POST 'http://127.0.0.1:8888/host/create_sessions?group=primary&count=2&sample_rate=30'

# Stop a TI by session id
curl 'http://127.0.0.1:8888/host/remove_telemetry?session_id=<id>'

POST /host/create_sessions response codes:

  • 200: at least one session created

  • 400: invalid params

  • 409: group already has running instances

  • 500: all failed

  • 503: no HCA in Active+LinkUp state
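A client script might map the POST /host/create_sessions response codes above to operator-facing messages. The sketch below does only that; the function names are hypothetical, and the idea that 503 is worth retrying later (once an HCA comes up) is an assumption, not documented UTM guidance.

```python
# Meanings taken from the response-code list for POST /host/create_sessions.
CREATE_SESSIONS_OUTCOMES = {
    200: "at least one session created",
    400: "invalid parameters",
    409: "group already has running instances",
    500: "all sessions failed",
    503: "no HCA in Active+LinkUp state",
}

def describe_create_sessions(status_code):
    """Translate an HTTP status code into a readable outcome."""
    return CREATE_SESSIONS_OUTCOMES.get(
        status_code, f"unexpected status {status_code}"
    )

def is_success(status_code):
    """200 means at least one session was created (possibly not all)."""
    return status_code == 200

def may_succeed_later(status_code):
    # Assumption: only the "no usable HCA" case is treated as transient.
    return status_code == 503
```

Note that 200 signals partial success at minimum, so a careful client would still check the response body or /status to confirm how many sessions actually started.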
