UFM Clustered Telemetry | NVIDIA UFM Enterprise User Manual

UFM Clustered Telemetry is an advanced feature that enables distributed telemetry data collection across multiple network adapters (HCAs) in your InfiniBand fabric. This feature provides improved performance and scalability for large-scale deployments through workload distribution.

Key Benefits

Better Performance: Workload distribution across multiple instances reduces collection bottlenecks
HCA Utilization: Leverages multiple network adapters for parallel data collection
Scalability: Handles larger fabric deployments more efficiently
Flexibility: Customizable instance distribution based on your infrastructure

Prerequisites

UFM Telemetry Manager (UTM) Plugin must be deployed and enabled

Deployment Types

UFM Clustered Telemetry supports two deployment scenarios. Choose the appropriate configuration method based on your deployment type:

Deployment Type	Description	Configuration Method	Bind Address
Single node (Standalone)	Single UFM node, telemetry collected locally	Manual `gv.cfg` edit	`127.0.0.1`
HA Cluster	Multiple nodes with shared configuration, telemetry aggregated across all nodes	`configure_utm_mode.py` script	`0.0.0.0`

Key Differences


Aspect	Standalone	HA Cluster
Bind Address	`127.0.0.1` (localhost only)	`0.0.0.0` (external access)
`additional_cset_urls`	Not required	Required (all node IPs)
Configuration Scope	Single node	Shared across cluster

Important: Choose the correct configuration method for your deployment. Using the wrong method may result in inaccessible telemetry endpoints or duplicate data collection.

Switching From Legacy to Clustered Telemetry (UTM) Mode

Standalone Deployment Configuration

This section applies to single node (standalone) deployments where UFM runs on a single node.

Step 1: Deploy UTM Plugin

Navigate to Settings > Plugin Management in the UFM WebUI
Locate the UFM Telemetry Manager (UTM) plugin
Click Enable to activate the plugin

Step 2: Configure Telemetry Mode

Edit the UFM configuration file:

vi /opt/ufm/files/conf/gv.cfg

Locate the [Telemetry] section and set the following parameter:

[Telemetry]
telemetry_legacy_mode = false
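
To confirm the edit took effect before restarting UFM, the flag can be read back programmatically. The following is a minimal sketch, assuming gv.cfg is parseable by Python's standard `configparser`; the helper name `is_utm_mode` is hypothetical and not part of UFM.

```python
import configparser


def is_utm_mode(cfg_text: str) -> bool:
    """Return True when [Telemetry] telemetry_legacy_mode is false (hypothetical helper)."""
    # interpolation=None keeps literal '%' characters in values from raising errors.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(cfg_text)
    # The documented default is legacy mode (true) when the key is absent.
    legacy = parser.getboolean("Telemetry", "telemetry_legacy_mode", fallback=True)
    return not legacy


sample = "[Telemetry]\ntelemetry_legacy_mode = false\n"
print(is_utm_mode(sample))
```

On a real system, pass the contents of /opt/ufm/files/conf/gv.cfg instead of the sample string.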

Step 3: (Optional) Configure Session Distribution

By default, UFM creates one primary and one secondary telemetry session and lets UTM allocate them to the locally detected HCAs in round-robin order. You can tune this without any matrix file by adding the following to gv.cfg:

[Telemetry]
primary_count = 2
secondary_count = 1
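
The round-robin allocation described above can be pictured with a short sketch. This is illustrative only: the session names, the choice to place primaries before secondaries in a single pass, and the function `allocate_sessions` are assumptions, and the actual UTM allocation order may differ.

```python
from itertools import cycle


def allocate_sessions(hcas, primary_count=1, secondary_count=1):
    """Assign session names to HCAs in one round-robin pass (illustrative only)."""
    if not hcas:
        raise ValueError("at least one HCA is required")
    ring = cycle(hcas)
    # Assumption: primaries are placed first, then secondaries continue the same cycle.
    sessions = [f"primary_{i}" for i in range(primary_count)]
    sessions += [f"secondary_{i}" for i in range(secondary_count)]
    return {name: next(ring) for name in sessions}


print(allocate_sessions(["mlx5_0", "mlx5_1"], primary_count=2, secondary_count=1))
```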

For fine-grained per-HCA control (for example, to pin specific instances to specific HCAs), enable the legacy matrix mode; see Configuration Options - Instance Matrix (Legacy, Opt-In).

Step 4: Start or Restart UFM

If UFM is not running, start it:

/etc/init.d/ufmd start

If UFM is already running, restart to apply changes:

/etc/init.d/ufmd restart

Alternatively, restart only the telemetry service:

/etc/init.d/ufmd ufm_telemetry_stop
/etc/init.d/ufmd ufm_telemetry_start

HA Cluster Deployment Configuration

This section applies to High Availability (HA) cluster deployments (Active-Active only) where multiple nodes share a common gv.cfg configuration file and telemetry needs to be aggregated across all cluster nodes.

The configure_utm_mode.py script automates the configuration by:

Setting bind addresses to 0.0.0.0 for external telemetry access
Configuring additional_cset_urls in gv.cfg for multi-node telemetry aggregation
Managing legacy mode flags
Updating environment files for proper endpoint configuration

Prerequisites

HA cluster must be configured in active-active mode.
/var/lib/ufm_ha/ha_state file present (or explicit node IPs available)
UTM Plugin deployed (can be enabled before or after configuration)
UFM configured in Infra mode

Recommended: Configure Before Starting UFM

This is the preferred approach as it avoids unnecessary service restarts.

Step 1: Deploy UTM Plugin

Ensure the UTM plugin is deployed on all cluster nodes. You can deploy via CLI or by using ufm_infra_feature_flag.py.

Step 2: Run the Configuration Script

Option A: Auto-detect node IPs from HA state file

/opt/ufm/files/scripts/configure_utm_mode.py --enable

Option B: Specify node IPs explicitly

/opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30

Step 3: Start UFM Services

Start UFM services on all cluster nodes:

ufm_ha_cluster start

Alternative: Configure After UFM is Running

If UFM is already running, you can still configure UTM mode and restart the services.

Step 1: Verify UTM Plugin is Enabled

Ensure the UTM plugin is enabled in Settings > Plugin Management.

Step 2: Run the Configuration Script

/opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30

Or with auto-detection:

/opt/ufm/files/scripts/configure_utm_mode.py --enable

Step 3: Restart UFM Services on All Nodes

systemctl restart ufm-enterprise
systemctl restart ufm-infra

Script Usage

Enable UTM Mode

Enable with auto-detected node IPs:

./configure_utm_mode.py --enable

Enable with explicit node IPs:

./configure_utm_mode.py --enable --node-ips 10.20.30.40,10.20.30.50

Disable UTM Mode

Revert to legacy mode:

./configure_utm_mode.py --disable

Show Current Status

Display current telemetry configuration:

./configure_utm_mode.py --status

Command-Line Options

Flag	Short	Description
`--enable`	`-e`	Enable UTM mode for telemetry
`--disable`	`-d`	Disable UTM mode (revert to legacy mode)
`--status`	`-s`	Show current telemetry configuration status
`--node-ips IPs`		Comma-separated list of cluster node IPs. If not provided, auto-detects from `/var/lib/ufm_ha/ha_state`
`--skip-additional-urls`		Skip updating `additional_cset_urls` configuration
`--config-file PATH`		Path to gv.cfg file (default: `/opt/ufm/files/conf/gv.cfg`)
`--log-level LEVEL`		Set logging level: DEBUG, INFO, WARNING, ERROR (default: INFO)
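
When the script is driven from automation, it can help to assemble the argument list in one place. The sketch below uses only the flags listed in the table above; the wrapper function `build_utm_command` is hypothetical and does not run anything by itself.

```python
def build_utm_command(action, node_ips=None, skip_additional_urls=False,
                      config_file=None, log_level=None,
                      script="/opt/ufm/files/scripts/configure_utm_mode.py"):
    """Build an argv list for configure_utm_mode.py (hypothetical wrapper)."""
    if action not in ("enable", "disable", "status"):
        raise ValueError(f"unsupported action: {action!r}")
    argv = [script, f"--{action}"]
    if node_ips:
        # The script expects a single comma-separated value.
        argv += ["--node-ips", ",".join(node_ips)]
    if skip_additional_urls:
        argv.append("--skip-additional-urls")
    if config_file:
        argv += ["--config-file", config_file]
    if log_level:
        argv += ["--log-level", log_level]
    return argv


print(build_utm_command("enable", node_ips=["10.20.30.40", "10.20.30.50"]))
```

The returned list can be passed to `subprocess.run(argv, check=True)` on a UFM node.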

Configuration Changes

When enabling UTM mode for HA, the script modifies the following parameters:

gv.cfg [Telemetry] section:

Parameter	Before	After
`telemetry_legacy_mode`	`true`	`false`
`primary_ip_bind_addr`	`127.0.0.1`	`0.0.0.0`
`secondary_ip_bind_addr`	`127.0.0.1`	`0.0.0.0`
`additional_cset_urls`	(empty)	Space-separated cluster URLs

Environment files:

File	Before	After
`primary_env.cfg`	`PROMETHEUS_ENDPOINT=http://127.0.0.1:9001`	`PROMETHEUS_ENDPOINT=http://0.0.0.0:9001`
`secondary_env.cfg`	`PROMETHEUS_ENDPOINT=http://127.0.0.1:9002`	`PROMETHEUS_ENDPOINT=http://0.0.0.0:9002`
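
For reference, the space-separated `additional_cset_urls` value can be derived from the node IPs as follows. This is a sketch based on the URL shape shown in the example output in this guide (port 9001, path /csv/cset/converted_enterprise); the function name `build_cset_urls` is hypothetical, and the script itself may build the value differently.

```python
def build_cset_urls(node_ips, port=9001, path="/csv/cset/converted_enterprise"):
    """Return a space-separated URL list, one URL per cluster node (illustrative)."""
    return " ".join(f"http://{ip}:{port}{path}" for ip in node_ips)


print(build_cset_urls(["10.20.30.1", "10.20.30.2"]))
```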

Example Output

Enable command output:

============================================================
UTM mode has been enabled successfully.
============================================================

Configuration changes (shared gv.cfg):
  - telemetry_legacy_mode = false
  - primary_ip_bind_addr = 0.0.0.0
  - secondary_ip_bind_addr = 0.0.0.0
  - additional_cset_urls configured with cluster nodes:
      http://10.20.30.1:9001/csv/cset/converted_enterprise
      http://10.20.30.2:9001/csv/cset/converted_enterprise

Note: Local node URLs are filtered at runtime by agent_manager.py
      to avoid duplicate telemetry collection.

------------------------------------------------------------
IMPORTANT: Please restart UFM services on all nodes to apply changes:
  systemctl restart ufm-enterprise
  systemctl restart ufm-infra
------------------------------------------------------------

Status command output:

=== Current Telemetry Configuration ===

  telemetry_legacy_mode = false
  primary_ip_bind_addr = 0.0.0.0
  secondary_ip_bind_addr = 0.0.0.0
  additional_cset_urls = http://10.20.30.1:9001/csv/cset/converted_enterprise http://10.20.30.2:9001/csv/cset/converted_enterprise

=== Mode Status ===

  Current Mode: UTM (non-legacy)

=== Environment Files ===

  Primary: PROMETHEUS_ENDPOINT=http://0.0.0.0:9001
  Secondary: PROMETHEUS_ENDPOINT=http://0.0.0.0:9002

Configuration Options - Instance Matrix (Legacy, Opt-In)

The instance-matrix configuration is a legacy advanced option. It is disabled by default (use_matrix = false in gv.cfg [Telemetry]). Set use_matrix = true to activate everything described in this section. Most deployments should use count-based configuration instead (see Step 3: (Optional) Configure Session Distribution).

Both standalone and HA deployments can customize how telemetry instances are distributed across HCAs.

Automatic Matrix Generation

When UFM starts in UTM mode, it automatically detects available HCAs and creates a default configuration (applies only with use_matrix=true).

Default Behavior:

Detects all available HCAs on the system
Creates 1 primary and 1 secondary telemetry instance on the first HCA
Configuration is stored in: /opt/ufm/files/conf/utm/{hostname}_instances_matrix.json

Example Auto-Generated Matrix:

{
  "mlx5_0": { "primary": 1, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 0 }
}
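
The default behavior can be expressed as a small sketch: the first detected HCA receives one primary and one secondary instance, and every other HCA receives none. The function `default_matrix` is hypothetical and only mirrors the rule stated above; it does not detect HCAs.

```python
import json


def default_matrix(hcas):
    """Build the default matrix: 1 primary + 1 secondary on the first HCA only."""
    return {
        hca: {"primary": int(index == 0), "secondary": int(index == 0)}
        for index, hca in enumerate(hcas)
    }


print(json.dumps(default_matrix(["mlx5_0", "mlx5_1"]), indent=2))
```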

Custom Configuration

For advanced deployments, customize the distribution of telemetry instances across HCAs using the generate_telemetry_config.sh script.

Auto-Detect and Create Configuration

Automatically detect HCAs and create a configuration file:

/opt/ufm/scripts/generate_telemetry_config.sh --auto-detect /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Manual Custom Configuration

Specify custom instance counts per HCA using the format HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT:

/opt/ufm/scripts/generate_telemetry_config.sh \
  /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json \
  mlx5_0:2:1 \
  mlx5_1:0:2 \
  mlx5_2:1:0

This creates:

mlx5_0: 2 primary instances, 1 secondary instance
mlx5_1: 0 primary instances, 2 secondary instances
mlx5_2: 1 primary instance, 0 secondary instances

Example Custom Matrix:

{
  "mlx5_0": { "primary": 2, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 2 },
  "mlx5_2": { "primary": 1, "secondary": 0 }
}
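
The mapping from HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT arguments to the matrix JSON can be sketched as below. This is not the implementation of generate_telemetry_config.sh; the functions `parse_spec` and `build_matrix` are hypothetical and only illustrate the documented format.

```python
import json


def parse_spec(spec):
    """Parse 'HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT' into (name, counts)."""
    name, primary, secondary = spec.split(":")
    counts = {"primary": int(primary), "secondary": int(secondary)}
    if not name or min(counts.values()) < 0:
        raise ValueError(f"invalid spec: {spec!r}")
    return name, counts


def build_matrix(specs):
    """Combine several specs into one matrix dictionary."""
    return dict(parse_spec(spec) for spec in specs)


print(json.dumps(build_matrix(["mlx5_0:2:1", "mlx5_1:0:2", "mlx5_2:1:0"]), indent=2))
```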

Validate Configuration

Verify your matrix file is correctly formatted:

/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Get Help

Display usage information and options:

/opt/ufm/scripts/generate_telemetry_config.sh --help

After modifying the matrix configuration file, you must restart UFM for changes to take effect.

Advanced Configuration Parameters

The following optional parameters in gv.cfg allow fine-tuning of telemetry behavior. Most users should use the default values.

Parameter	Section	Default	Description
`dashboard_interval`	`[Server]`	30	Sample rate (seconds) for primary telemetry instances
`secondary_sample_rate`	`[Telemetry]`	300	Sample rate (seconds) for secondary telemetry instances
`telemetry_legacy_mode`	`[Telemetry]`	true	Set to `false` to enable UTM mode
`primary_count`	`[Telemetry]`	1	Number of primary (high-frequency) telemetry instances UFM asks UTM to create. HCAs are allocated round-robin by UTM.
`secondary_count`	`[Telemetry]`	1	Number of secondary (low-frequency) telemetry instances UFM asks UTM to create.
`use_matrix`	`[Telemetry]`	false	Advanced. When `true`, UFM uses the per-HCA matrix JSON (see Configuration Options – Instance Matrix (Legacy, Opt-In)) instead of count-based session creation.

Note: Changing sample rates affects data frequency and may impact system performance. Consult with NVIDIA support before modifying these values in production environments.

Port Allocation

Default Port Allocation

Primary Telemetry: Base port 9001
Secondary Telemetry: Base port 9002

Multi-Instance Port Strategy

When multiple instances are configured, ports are allocated using an interleaved strategy:

Primary instances: Odd ports (9001, 9003, 9005, 9007...)
Secondary instances: Even ports (9002, 9004, 9006, 9008...)

Example - 2 primary + 2 secondary instances:

Primary: ports 9001, 9003
Secondary: ports 9002, 9004

Port Allocation with Proxy Mode

When enable_utm_proxy = true, ports 9001 and 9002 are reserved for the UTM HTTP proxy, and telemetry instances start from offset ports:

Primary instances: 9003, 9005, 9007, 9009...
Secondary instances: 9004, 9006, 9008, 9010...
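
The interleaved scheme, with and without proxy mode, reduces to simple arithmetic. The sketch below reproduces the port lists given above; the function `allocate_ports` is hypothetical and is meant only for planning firewall rules or Service port lists.

```python
def allocate_ports(primary_count, secondary_count, proxy=False):
    """Return (primary_ports, secondary_ports) using the interleaved scheme."""
    # With the UTM HTTP proxy enabled, 9001 and 9002 are reserved, so shift by 2.
    offset = 2 if proxy else 0
    primary = [9001 + offset + 2 * i for i in range(primary_count)]
    secondary = [9002 + offset + 2 * i for i in range(secondary_count)]
    return primary, secondary


print(allocate_ports(2, 2))
print(allocate_ports(2, 2, proxy=True))
```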

Troubleshooting

Verify Telemetry Status

Check if telemetry instances are running:

ps aux | grep -E "(utm|telemetry)" | grep -v grep

Check Matrix Configuration (legacy matrix mode only)

If use_matrix = true is set in gv.cfg [Telemetry], validate the matrix file:

/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

View Current Mode

Use the configuration script to display current settings:

/opt/ufm/files/scripts/configure_utm_mode.py --status

Check Lock Files

If telemetry startup hangs, check for stale lock files:

ls -la /tmp/utm_matrix_*.lock

UFM Clustered Telemetry on Kubernetes

This section describes how to configure and run Clustered Telemetry when UFM Enterprise and the UTM plugin are deployed as Kubernetes pods using Helm, how to change UTM configuration, how to enable token‑based authentication between UTM and UFM, and how to validate the setup end‑to‑end.

The Kubernetes deployment is a third deployment scenario alongside Standalone and HA Cluster. On Kubernetes, UFM and UTM run as separate pods and communicate through in‑cluster Services. This brings two operational differences the previous sections do not cover:

The operator does not edit gv.cfg or utm_config.ini directly on a host; configuration is injected through the Helm values file and Kubernetes resources.
UTM can no longer reach UFM on localhost; it must authenticate to UFM through the UFM Service (HTTPS on port 443) using a token.

Prerequisites

Before deploying the UTM plugin on Kubernetes, verify the following:

Kubernetes cluster (full Kubernetes or k3s) with at least one node that exposes InfiniBand HCAs. Run kubectl get nodes and make sure the nodes are Ready.
NVIDIA Network Operator installed in the cluster with a NicClusterPolicy whose status.state is ready. This exposes the nvidia.com/hostdev (or rdma/rdma_shared_device_a) resource on the nodes that will run UTM.
UFM Enterprise deployed in the cluster using its Helm chart. The UTM plugin reuses UFM's shared PVC, so UFM must be installed first. The UFM pod should be 1/1 Running before proceeding.
Shared PVC (ReadWriteMany) created by the UFM Helm chart (default name <ufm-release-name>-ufm-enterprise-files). For UTM configuration to survive helm uninstall/helm install cycles, the StorageClass should use reclaimPolicy: Retain.
Generic UFM plugin Helm chart (ufm-plugin-helm-template) available locally or as a packaged release artifact. This chart deploys any UFM plugin — UTM is deployed as one of its entries.
UTM plugin image loaded into the container runtime on every node that may run the UTM pod (the nodes with IB HCAs).

Note: The UTM Helm-based deployment on Kubernetes requires UFM to be installed via the UFM Helm chart. It is not compatible with Standalone or HA Cluster deployments described in the previous sections.

Architecture Overview

┌───────────────────────────────────────────────────────────────────────┐
│  Kubernetes cluster                                                   │
│                                                                       │
│  ┌──────────────────┐                ┌──────────────────────────┐     │
│  │     UFM pod      │◄───────────────┤  UTM pod                 │     │
│  │     443 / 80     │    HTTPS /     │  8888 (management)       │     │
│  │     (Apache)     │   ufmRest/…    │  9001–9010 (telemetry)   │     │
│  └──────────────────┘    + token     └──────────────────────────┘     │
│           │                                     │                     │
│           └─────── shared PVC (RWX) ────────────┘                     │
│              conf/plugins/utm, log/plugins/utm                        │
└───────────────────────────────────────────────────────────────────────┘

UFM reaches UTM through the UTM Service (<ufm-release>-plugin-utm) on port 8888 for management and 9001–9010 for telemetry collection.
UTM reaches UFM through the UFM Service (<ufm-release>) on port 443 (HTTPS through Apache), authenticated with a UFM API token.
Both pods mount the same UFM PVC under different sub‑paths. UTM's /config and /log directories live on the PVC and persist across pod restarts.

Step 1: Prepare the UTM Plugin Values File

Create a values file that will be passed to helm install. Below is the minimal working configuration; everything else in this guide is an incremental addition to this file.

# utm-plugin-values.yaml
ufmFullname: "ufm-ufm-enterprise"
rdma:
  resourceName: "nvidia.com/hostdev"   # Or rdma/rdma_shared_device_a
  resourceCount: "1"
plugins:
  entries:
    utm:
      image: mellanox/ufm-plugin-utm
      tag: "<version>"                 # e.g. "1.25.1-2"
      port: 8888
      ports: [9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008, 9009, 9010]
      strategy: Recreate
      startupProbe:
        httpGet: { path: /help, port: 8888 }
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet: { path: /help, port: 8888 }
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet: { path: /status, port: 8888 }
        periodSeconds: 10
        timeoutSeconds: 10
        failureThreshold: 3
      env:
        - name: UTM_CONFIG_OVERRIDES
          value: |
            [general]
            log_level=info
      volumes:
        - name: utm-data
          emptyDir: {}
      volumeMounts:
        - name: utm-data
          mountPath: /data

Field	Description
`ufmFullname`	UFM Helm release full name. Used to derive the PVC name and UFM Service DNS. Must match the name used when installing UFM.
`plugins.entries.utm.image` / `tag`	UTM container image and tag. The tag must correspond to an image loaded on all IB-capable nodes.
`plugins.entries.utm.port`	UTM management API port (`8888`). Exposed through the plugin Service; also targeted by the health probes.
`plugins.entries.utm.ports`	Telemetry instance ports. The generic plugin chart exposes them through the plugin Service. Primary instances use odd ports, secondary instances use even ports.
`plugins.entries.utm.env[UTM_CONFIG_OVERRIDES]`	INI content merged into `/config/utm_config.ini` on first pod start. See Step 3.

Step 2: Install the UTM Plugin

Install the Helm chart, passing the values file:

helm install ufm-plugins <plugin-chart-path-or-tgz> \
  -f utm-plugin-values.yaml \
  -n ufm-enterprise

If the plugin chart is already installed (for example because other plugins have been deployed), add the UTM entry with --reuse-values:

helm upgrade ufm-plugins <plugin-chart-path-or-tgz> \
  --reuse-values \
  -f utm-plugin-values.yaml \
  -n ufm-enterprise

Watch the pod come up:

kubectl get pods -n ufm-enterprise -l app=<ufm-release>-plugin-utm -w

Expected progression: Init:0/1 → PodInitializing → Running 1/1.

Note: The UTM init container performs a hard InfiniBand validation on every pod start. If no /dev/infiniband/uverbs* devices are visible to the container, the init container fails immediately with a clear error. Verify that the NVIDIA Network Operator and the NicClusterPolicy are configured correctly if the init container fails.

Step 3: Configure UTM

UTM configuration lives in /config/utm_config.ini inside the pod. On the first pod start, the init container builds this file from three layers, applied in this order:

Image defaults: the factory utm_config.ini shipped in the container.
User seed: the content of the UTM_CONFIG_OVERRIDES environment variable. Keys that are reserved for Kubernetes correctness (see Reserved Keys) are rejected and logged as a warning.
Kubernetes invariants: [general] force_as_plugin=1, [high_availability] enable_ha=0, and [http_proxy] enable=0. These are applied last and always win over the user layer.

After a successful first init, the init container writes the marker file /config/.utm_initialized on the PVC. On every subsequent pod start, the init container detects the marker and skips the rebuild — /config is now owned by the operator.

Path	Persistent?	Regenerated on first init?	Who owns it
`/config`	yes (PVC sub-path `conf/plugins/utm`)	yes, on first init only	operator after first install
`/log`	yes (PVC sub-path `log/plugins/utm`)	no	UTM (logs and state)
`/data`	no (`emptyDir`)	n/a	UTM (ephemeral telemetry data)
`/etc/utm/secrets`	yes (Secret, when mounted)	no	operator (via Kubernetes Secret)

Seeding Configuration at First Install

Add INI content to UTM_CONFIG_OVERRIDES before running helm install. The content is merged into /config/utm_config.ini exactly once.

plugins:
  entries:
    utm:
      env:
        - name: UTM_CONFIG_OVERRIDES
          value: |
            [general]
            log_level=debug
            state_update_interval=30
            [ib_trap]
            enable_ib_trap=1
            ib_trap_hca=mlx5_1

Note: Changes to UTM_CONFIG_OVERRIDES after the first install are ignored on subsequent pod restarts. To apply a new seed, reset the configuration — see Resetting Configuration below.

Reserved Keys

The following keys cannot be overridden through UTM_CONFIG_OVERRIDES. Attempts are logged by the init container as [UTM Init] WARN: refusing user override of reserved key <section>/<key>, and the key is skipped. This protects the pod from configurations that would break Kubernetes integration.

Section	Key	Why locked
`[general]`	`port`	Must match the Kubernetes Service (8888).
`[general]`	`log_path`, `data_path`	Must match the `/log` / `/data` volume mounts.
`[general]`	`clx_restart_file`	Must match the PVC path UFM writes to signal topology changes.
`[general]`	`force_as_plugin`	Required to force UTM into HTTP mode inside Kubernetes.
`[high_availability]`	`enable_ha`, `master_host_file`, `ha_hosts_file`	Kubernetes handles HA; UTM HA must stay disabled.
`[http_proxy]`	`enable`	Must be `0` so ports 9001+ stay free for telemetry instances.
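
The three-layer build and the reserved-key check can be modelled in a few lines. This is a sketch of the documented behavior, not the init container's code: the function `build_config`, the dictionary representation, and the constant names are assumptions; the reserved keys and invariants are taken from this section.

```python
import copy

RESERVED = {
    ("general", "port"), ("general", "log_path"), ("general", "data_path"),
    ("general", "clx_restart_file"), ("general", "force_as_plugin"),
    ("high_availability", "enable_ha"), ("high_availability", "master_host_file"),
    ("high_availability", "ha_hosts_file"), ("http_proxy", "enable"),
}
INVARIANTS = {
    "general": {"force_as_plugin": "1"},
    "high_availability": {"enable_ha": "0"},
    "http_proxy": {"enable": "0"},
}


def build_config(defaults, user_seed):
    """Merge image defaults, user seed, then Kubernetes invariants (illustrative)."""
    config = copy.deepcopy(defaults)  # layer 1: image defaults
    warnings = []
    for section, keys in user_seed.items():  # layer 2: user seed
        for key, value in keys.items():
            if (section, key) in RESERVED:
                warnings.append("[UTM Init] WARN: refusing user override of "
                                f"reserved key {section}/{key}")
                continue
            config.setdefault(section, {})[key] = value
    for section, keys in INVARIANTS.items():  # layer 3: always wins
        config.setdefault(section, {}).update(keys)
    return config, warnings


cfg, warns = build_config({"general": {"port": "8888", "log_level": "info"}},
                          {"general": {"port": "9999", "log_level": "debug"}})
print(cfg)
print(warns)
```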

Changing Configuration After Install

Because /config is preserved after the first pod start, change any non‑reserved setting in place:

POD=$(kubectl get pod -n ufm-enterprise \
  -l app=<ufm-release>-plugin-utm \
  -o jsonpath='{.items[0].metadata.name}')

# Example: switch log level to debug
kubectl exec -n ufm-enterprise "$POD" -c utm -- \
  sed -i 's/^log_level=.*/log_level=debug/' /config/utm_config.ini

# Apply the change without recreating the pod (init does not rerun)
kubectl exec -n ufm-enterprise "$POD" -c utm -- \
  supervisorctl -c /config/supervisord.conf restart utm

The change persists on the PVC and survives pod restarts, helm upgrade, and PVC‑retained reinstalls.

Resetting Configuration

Delete the marker and recreate the pod. The init container runs a full first‑time init again, re-seeding /config from the current image defaults and the current value of UTM_CONFIG_OVERRIDES.

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise

Step 4: Configure Token-Based Authentication (Recommended)

In Kubernetes, UTM and UFM are separate pods. UTM cannot authenticate to UFM as it does in Standalone or HA Cluster deployments (through localhost with an internal header); it must use an API token over HTTPS against the UFM Service. This section walks through enabling token authentication.

If no token is configured, UTM falls back to basic authentication with the default UFM credentials. This is acceptable for evaluation but not recommended for production use.

Step 4.1: Mint a Token on UFM

Execute inside the UFM pod and capture the returned access_token:

UFM_POD=$(kubectl get pod -n ufm-enterprise \
  -l app.kubernetes.io/name=ufm-enterprise \
  -o jsonpath='{.items[0].metadata.name}')
TOKEN=$(kubectl exec -n ufm-enterprise "$UFM_POD" -- \
  curl -sk -XPOST https://127.0.0.1/ufmRest/app/tokens -u admin:<password> \
  | python3 -c "import sys,json;print(json.load(sys.stdin)['access_token'])")
echo "$TOKEN"

Replace admin:<password> with valid UFM credentials. The default is admin:123456; change this to match your environment.

Step 4.2: Store the Token as a Kubernetes Secret

kubectl create secret generic utm-ufm-token \
  --from-literal=token="$TOKEN" \
  -n ufm-enterprise

The Secret name utm-ufm-token is referenced by the volume entry in Step 4.3. Use the same name or update the volume entry to match.

Step 4.3: Mount the Secret into the UTM Pod

Extend the volumes and volumeMounts sections of utm-plugin-values.yaml to mount the Secret at /etc/utm/secrets. The optional: true flag guarantees the pod still starts if the Secret is absent, in which case UTM falls back to basic auth.

plugins:
  entries:
    utm:
      volumes:
        - name: utm-data
          emptyDir: {}
        - name: utm-ufm-token
          secret:
            secretName: utm-ufm-token
            optional: true
            items:
              - key: token
                path: ufm_token
      volumeMounts:
        - name: utm-data
          mountPath: /data
        - name: utm-ufm-token
          mountPath: /etc/utm/secrets
          readOnly: true

Step 4.4: Point UTM at the Token

Add a [ufm] block to UTM_CONFIG_OVERRIDES telling UTM which UFM Service to contact, on which port, and where to find the token file:

plugins:
  entries:
    utm:
      env:
        - name: UTM_CONFIG_OVERRIDES
          value: |
            [general]
            log_level=info
            [ufm]
            ufm=https://<ufm-release>
            ufm_port=443
            ufm_rest_api_port=443
            ufm_token_file=/etc/utm/secrets/ufm_token

Replace <ufm-release> with the UFM Service DNS name (for example, ufm-ufm-enterprise).

Key	Value	Purpose
`ufm`	`https://<ufm-release>`	UFM Service DNS, HTTPS scheme.
`ufm_port`	`443`	Used for fabric snapshot requests in Kubernetes mode.
`ufm_rest_api_port`	`443`	Used for UFM REST calls.
`ufm_token_file`	`/etc/utm/secrets/ufm_token`	Path of the file mounted from the `utm-ufm-token` Secret.
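
Before installing, the seed text can be checked for the four [ufm] keys listed above. The following is a minimal sketch using Python's standard `configparser`; the function `missing_ufm_keys` is hypothetical, and the hostname in the sample is the example Service name used in this guide.

```python
import configparser

REQUIRED_UFM_KEYS = ("ufm", "ufm_port", "ufm_rest_api_port", "ufm_token_file")


def missing_ufm_keys(ini_text):
    """Return the [ufm] keys that are absent from the given INI text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section("ufm"):
        return list(REQUIRED_UFM_KEYS)
    return [key for key in REQUIRED_UFM_KEYS if not parser.has_option("ufm", key)]


seed = """[general]
log_level=info
[ufm]
ufm=https://ufm-ufm-enterprise
ufm_port=443
ufm_rest_api_port=443
ufm_token_file=/etc/utm/secrets/ufm_token
"""
print(missing_ufm_keys(seed))
```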

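Apply the change. On a first install, install the chart. If UTM is already running, upgrade the release so the pod picks up the updated values file, then reset the configuration and restart the pod:

# First install
helm install ufm-plugins <plugin-chart> \
  -f utm-plugin-values.yaml \
  -n ufm-enterprise

# Already running: upgrade with the updated values file, then reset and
# restart so init re-seeds with the new [ufm] block
helm upgrade ufm-plugins <plugin-chart> \
  --reuse-values -f utm-plugin-values.yaml \
  -n ufm-enterprise
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise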

Rotating the Token

Mint a new token, update the Secret, and restart the UTM pod. The init container does not need to rerun — only the UTM process needs to re‑read the token file.

NEW=$(kubectl exec -n ufm-enterprise "$UFM_POD" -- \
  curl -sk -XPOST https://127.0.0.1/ufmRest/app/tokens -u admin:<password> \
  | python3 -c "import sys,json;print(json.load(sys.stdin)['access_token'])")
kubectl create secret generic utm-ufm-token \
  --from-literal=token="$NEW" \
  --dry-run=client -o yaml | kubectl apply -n ufm-enterprise -f -
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise

Disabling Token Authentication

Delete the Secret and remove the [ufm] block from UTM_CONFIG_OVERRIDES, then reset the configuration so UTM falls back to basic authentication on the next pod start:

kubectl delete secret utm-ufm-token -n ufm-enterprise

# Edit utm-plugin-values.yaml to remove the [ufm] block
helm upgrade ufm-plugins <plugin-chart> \
  --reuse-values -f utm-plugin-values.yaml \
  -n ufm-enterprise
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise

Step 5: Validating the Setup

Work through these checks in order. Each check verifies a specific capability of the deployment.

5.1 Pod Health

kubectl get pod -n ufm-enterprise -l app=<ufm-release>-plugin-utm

Expected: READY 1/1, STATUS Running, RESTARTS 0.

5.2 Init Container Output

The init container log summarises the entire configuration pipeline. Retrieve it with:

kubectl logs -c utm-init -n ufm-enterprise -l app=<ufm-release>-plugin-utm

On a fresh install, expect output similar to:

[UTM Init] K8s mode detected — validating InfiniBand devices...
[UTM Init] IB devices found
[UTM Init] Seeding user config overrides from env var UTM_CONFIG_OVERRIDES
[UTM Init] user-override: general/log_level=info
[UTM Init] user-override: ufm/ufm=https://<ufm-release>
[UTM Init] user-override: ufm/ufm_rest_api_port=443
[UTM Init] user-override: ufm/ufm_port=443
[UTM Init] user-override: ufm/ufm_token_file=/etc/utm/secrets/ufm_token
[UTM Init] Applying K8s-required overrides
[UTM Init] k8s-override: general/force_as_plugin=1
[UTM Init] k8s-override: high_availability/enable_ha=0
[UTM Init] k8s-override: http_proxy/enable=0
[UTM Init] First-time init complete

On a subsequent restart the expected output is:

[UTM Init] K8s mode detected — validating InfiniBand devices...
[UTM Init] IB devices found
[UTM Init] /config/.utm_initialized present — /config already initialized; preserving existing configuration
[UTM Init] To reset, delete this marker and restart the pod.

5.3 Generated Configuration

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  cat /config/utm_config.ini

Confirm that:

[general] has force_as_plugin = 1, and your expected log_level.
[high_availability] has enable_ha = 0.
[http_proxy] has enable = 0.
If token authentication is enabled: [ufm] has ufm, ufm_port, ufm_rest_api_port, and ufm_token_file as configured.

5.4 UTM Management API

From inside the pod:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  python3 -c "import urllib.request; \
  print(urllib.request.urlopen('http://127.0.0.1:8888/help').read().decode()[:200])"

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  python3 -c "import urllib.request; \
  print(urllib.request.urlopen('http://127.0.0.1:8888/status').read().decode())"

The /help endpoint returns a short help text. The /status endpoint returns a JSON document listing the telemetry groups and their instances.

5.5 Token Authentication

If token authentication is enabled, verify the token is loaded and UFM accepts it:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  grep -E "auth_method|Loaded UFM auth token" /log/utm.log | tail -5

Expected entries include:

[utils.py] INFO Loaded UFM auth token from /etc/utm/secrets/ufm_token
auth_method: token
[ufm_api.py] INFO UFM API Request Status [200] in ... sec

If the log shows auth_method: basic even though a Secret was created, confirm the Secret is mounted inside the pod:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  sh -c '[ -s /etc/utm/secrets/ufm_token ] && echo "token present" || echo "no token"'
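
If the log has been copied off the pod, the most recent authentication method can be extracted with a short script. This sketch assumes the `auth_method: <value>` text appears in the log as shown in the expected entries above; the function `last_auth_method` is hypothetical.

```python
import re

AUTH_RE = re.compile(r"auth_method:\s*(\w+)")


def last_auth_method(log_text):
    """Return the most recent auth_method value found in the log, or None."""
    matches = AUTH_RE.findall(log_text)
    return matches[-1] if matches else None


sample_log = (
    "[utils.py] INFO Loaded UFM auth token from /etc/utm/secrets/ufm_token\n"
    "auth_method: token\n"
)
print(last_auth_method(sample_log))
```

A result of `basic` after a Secret was created suggests the token file was not mounted or not read.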

5.6 Fabric Data Collection

UTM must be able to load the fabric snapshot from UFM. Expected log entries:

[fabric_snapshot.py] INFO Fabric state was updated
[fabric_snapshot.py] INFO Loaded FULL FABRIC: <N> guid/port instances from UFM: https://<ufm-release>
[fabric_delegator.py] INFO Total size of network: <M> ports

Check directly:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  grep -E "Loaded FULL FABRIC|Fabric state was updated" /log/utm.log | tail -5

5.7 Telemetry Endpoints

Telemetry instances are exposed on ports 9001–9010 through the plugin Service. From inside the cluster, fetch a sample:

kubectl run telemetry-check --rm -it --restart=Never \
  --image=curlimages/curl -n ufm-enterprise -- \
  curl -s http://<ufm-release>-plugin-utm:9001/csv/cset/converted_enterprise | head -20

A non-empty CSV response with counter values confirms the primary telemetry instance is collecting. Repeat on port 9002 for the secondary instance.

Troubleshooting

Pod Stuck in `Init:Error`

kubectl logs -c utm-init -n ufm-enterprise -l app=<ufm-release>-plugin-utm

Typical causes:

Log snippet	Cause	Fix
`ERROR: No IB devices found (/dev/infiniband/uverbs*)`	Node lacks IB access.	Verify `NicClusterPolicy` is `ready`; confirm nvidia.com/hostdev is allocated on the node.
`ERROR: Missing /config/utm_config.ini`	PVC mount issue.	Verify PVC is `Bound` and the sub-path for the plugin exists.
`WARN: refusing user override of reserved key ...`	Reserved key in `UTM_CONFIG_OVERRIDES`.	Harmless — the key is skipped. Remove it if you want a clean log.

Pod `Running` but Not `Ready`

UTM is running but the readiness probe fails. Inspect the UTM application log:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  tail -100 /log/utm.log

Common causes are a missing fabric snapshot (UFM unreachable or wrong [ufm] settings) and telemetry instances failing to bind to their ports.

`UTM_CONFIG_OVERRIDES` Changes Not Taking Effect

The init marker prevents re-seeding after first install. Reset:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise

Token Authentication Returns 401

The token is either invalid, expired, or was not mounted. Verify:

Secret exists: kubectl get secret utm-ufm-token -n ufm-enterprise
Token file is non-empty inside the pod: kubectl exec ... -- sh -c '[ -s /etc/utm/secrets/ufm_token ] && echo ok'
If both checks pass, the token itself is likely invalid or expired: mint a new token on UFM and update the Secret (see Rotating the Token).

UFM Service Unreachable from UTM

From inside the UTM pod:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  python3 -c "import urllib.request, ssl; \
  print(urllib.request.urlopen('https://<ufm-release>/ufmRest/app/ufm_version', \
  context=ssl._create_unverified_context()).read().decode())"

If this fails, the UFM Service name in [ufm] ufm is wrong, or the UFM pod is not Ready. Confirm:

kubectl get svc -n ufm-enterprise
kubectl get pods -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise

Known Limitations

Changes to UTM_CONFIG_OVERRIDES after first install are ignored until the marker is reset. This is intentional — /config is treated as operator-owned state after the first init.
UTM image upgrades do not re-seed /config. If an upgraded image ships new defaults or adjusts Kubernetes invariants, reset the marker and restart the pod.
ClusterIP-only telemetry endpoints. Ports 9001–9010 are reachable from within the cluster. External monitoring tools need a NodePort or Ingress in addition; this is not covered by this guide.
No automatic token rotation. Tokens are long-lived. Rotate periodically by following the procedure in Rotating the Token.
Cross-reinstall persistence requires reclaimPolicy: Retain. With Delete, the PVC is wiped on helm uninstall, and the next helm install starts with an empty /config.
Non-XDR environments only. UTM (UFM Telemetry Manager) deployment on Kubernetes is supported only in non-XDR environments; deploying UTM on Kubernetes in an XDR environment is not supported in this release.

Last updated: June 03, 2026
