NVIDIA UFM Enterprise User Manual

UFM Clustered Telemetry

UFM Clustered Telemetry is an advanced feature that enables distributed telemetry data collection across multiple network adapters (HCAs) in your InfiniBand fabric. This feature provides improved performance and scalability for large-scale deployments through workload distribution.

Key Benefits

  • Better Performance: Workload distribution across multiple instances reduces collection bottlenecks

  • HCA Utilization: Leverages multiple network adapters for parallel data collection

  • Scalability: Handles larger fabric deployments more efficiently

  • Flexibility: Customizable instance distribution based on your infrastructure

Prerequisites

  • UFM Telemetry Manager (UTM) Plugin must be deployed and enabled

Deployment Types

UFM Clustered Telemetry supports two deployment scenarios. Choose the appropriate configuration method based on your deployment type:

| Deployment Type | Description | Configuration Method | Bind Address |
|---|---|---|---|
| Single node (Standalone) | Single UFM node, telemetry collected locally | Manual gv.cfg edit | 127.0.0.1 |
| HA Cluster | Multiple nodes with shared configuration, telemetry aggregated across all nodes | configure_utm_mode.py script | 0.0.0.0 |

Key Differences

| Aspect | Standalone | HA Cluster |
|---|---|---|
| Bind Address | 127.0.0.1 (localhost only) | 0.0.0.0 (external access) |
| additional_cset_urls | Not required | Required (all node IPs) |
| Configuration Scope | Single node | Shared across cluster |


Important: Choose the correct configuration method for your deployment. Using the wrong method may result in inaccessible telemetry endpoints or duplicate data collection.

Switching From Legacy to Clustered Telemetry (UTM) Mode

Standalone Deployment Configuration

This section applies to single node (standalone) deployments where UFM runs on a single node.

Step 1: Deploy UTM Plugin

  1. Navigate to Settings > Plugin Management in the UFM WebUI

  2. Locate the UFM Telemetry Manager (UTM) plugin

  3. Click Enable to activate the plugin

Step 2: Configure Telemetry Mode

Edit the UFM configuration file:

vi /opt/ufm/files/conf/gv.cfg

Locate the [Telemetry] section and set the following parameter:

[Telemetry]
telemetry_legacy_mode = false
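
To confirm the setting before restarting UFM, the value can be read back programmatically. The following is a minimal sketch, assuming gv.cfg parses as standard INI; the helper name utm_mode_enabled is hypothetical and not part of UFM. A missing key is treated as legacy mode, matching the documented default of true.

```python
import configparser


def utm_mode_enabled(cfg_text: str) -> bool:
    """Return True when [Telemetry] telemetry_legacy_mode is false."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(cfg_text)
    # The documented default is true (legacy mode), so fall back to it
    # when the section or the key is absent.
    legacy = parser.getboolean("Telemetry", "telemetry_legacy_mode", fallback=True)
    return not legacy


# In-memory example; on a UFM host, pass the contents of
# /opt/ufm/files/conf/gv.cfg instead.
sample = "[Telemetry]\ntelemetry_legacy_mode = false\n"
print("UTM mode enabled:", utm_mode_enabled(sample))
```

The check is read-only, so it does not risk rewriting comments or key ordering in the real file.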

 

Step 3: (Optional) Configure Session Distribution

By default, UFM creates one primary and one secondary telemetry session and lets UTM allocate them to the locally detected HCAs in round-robin order. You can tune this without any matrix file by adding the following to gv.cfg:

[Telemetry]
primary_count = 2
secondary_count = 1
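
The round-robin allocation can be pictured with a short sketch. This is an illustration only: the function name allocate_round_robin is hypothetical, and the placement order (primaries first, secondaries continuing the same rotation) is an assumption, since this manual does not specify how UTM orders the sessions.

```python
from itertools import cycle


def allocate_round_robin(hcas, primary_count, secondary_count):
    """Assign sessions to HCAs in round-robin order (illustrative only)."""
    if not hcas:
        raise ValueError("at least one HCA is required")
    matrix = {hca: {"primary": 0, "secondary": 0} for hca in hcas}
    rotation = cycle(hcas)
    # Assumption: primaries are placed first, then secondaries continue
    # from where the rotation stopped. UTM's real ordering may differ.
    for kind, count in (("primary", primary_count), ("secondary", secondary_count)):
        for _ in range(count):
            matrix[next(rotation)][kind] += 1
    return matrix


# primary_count = 2, secondary_count = 1 on a host with two HCAs
print(allocate_round_robin(["mlx5_0", "mlx5_1"], 2, 1))
```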

For fine-grained per-HCA control (for example, to pin specific instances to specific HCAs), enable the legacy matrix mode; see Configuration Options - Instance Matrix (Legacy, Opt-In).

Step 4: Start or Restart UFM

If UFM is not running, start it:

/etc/init.d/ufmd start

If UFM is already running, restart to apply changes:

/etc/init.d/ufmd restart

Alternatively, restart only the telemetry service:

/etc/init.d/ufmd ufm_telemetry_stop
/etc/init.d/ufmd ufm_telemetry_start

 

HA Cluster Deployment Configuration

This section applies to High Availability (HA) cluster deployments in which multiple nodes share a common gv.cfg configuration file and telemetry must be aggregated across all cluster nodes. It applies to Active-Active deployments only.

The configure_utm_mode.py script automates the configuration by:

  • Setting bind addresses to 0.0.0.0 for external telemetry access

  • Configuring additional_cset_urls in gv.cfg for multi-node telemetry aggregation

  • Managing legacy mode flags

  • Updating environment files for proper endpoint configuration

Prerequisites

  • HA cluster must be configured in active-active mode.

  • /var/lib/ufm_ha/ha_state file present (or explicit node IPs available)

  • UTM Plugin deployed (can be enabled before or after configuration)

  • UFM configured in Infra mode

Recommended: Configure Before Starting UFM

This is the preferred approach as it avoids unnecessary service restarts.

  • Step 1: Deploy UTM Plugin
    Ensure the UTM plugin is deployed on all cluster nodes. You can deploy it via the CLI or by using ufm_infra_feature_flag.py.

  • Step 2: Run the Configuration Script
    Option A: Auto-detect node IPs from HA state file

    /opt/ufm/files/scripts/configure_utm_mode.py --enable

    Option B: Specify node IPs explicitly

    /opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30

  • Step 3: Start UFM Services
    Start UFM services on all cluster nodes:

    ufm_ha_cluster start


Alternative: Configure After UFM is Running

If UFM is already running, you can still configure UTM mode and restart the services.

  • Step 1: Verify UTM Plugin is Enabled
    Ensure the UTM plugin is enabled in Settings > Plugin Management.

  • Step 2: Run the Configuration Script
    With explicit node IPs:

    /opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30

    Or with auto-detection:

    /opt/ufm/files/scripts/configure_utm_mode.py --enable

  • Step 3: Restart UFM Services on All Nodes

    systemctl restart ufm-enterprise
    systemctl restart ufm-infra

Script Usage

Enable UTM Mode

  • Enable with auto-detected node IPs:
    ./configure_utm_mode.py --enable

  • Enable with explicit node IPs:
    ./configure_utm_mode.py --enable --node-ips 10.20.30.40,10.20.30.50

Disable UTM Mode

  • Revert to legacy mode:
    ./configure_utm_mode.py --disable

Show Current Status

  • Display current telemetry configuration:
    ./configure_utm_mode.py --status

Command-Line Options

| Flag | Short | Description |
|---|---|---|
| --enable | -e | Enable UTM mode for telemetry |
| --disable | -d | Disable UTM mode (revert to legacy mode) |
| --status | -s | Show current telemetry configuration status |
| --node-ips IPs | | Comma-separated list of cluster node IPs. If not provided, auto-detects from /var/lib/ufm_ha/ha_state |
| --skip-additional-urls | | Skip updating additional_cset_urls configuration |
| --config-file PATH | | Path to gv.cfg file (default: /opt/ufm/files/conf/gv.cfg) |
| --log-level LEVEL | | Set logging level: DEBUG, INFO, WARNING, ERROR (default: INFO) |

Configuration Changes

When enabling UTM mode for HA, the script modifies the following parameters:

gv.cfg [Telemetry] section:

| Parameter | Before | After |
|---|---|---|
| telemetry_legacy_mode | true | false |
| primary_ip_bind_addr | 127.0.0.1 | 0.0.0.0 |
| secondary_ip_bind_addr | 127.0.0.1 | 0.0.0.0 |
| additional_cset_urls | (empty) | Space-separated cluster URLs |

Environment files:

| File | Before | After |
|---|---|---|
| primary_env.cfg | PROMETHEUS_ENDPOINT=http://127.0.0.1:9001 | PROMETHEUS_ENDPOINT=http://0.0.0.0:9001 |
| secondary_env.cfg | PROMETHEUS_ENDPOINT=http://127.0.0.1:9002 | PROMETHEUS_ENDPOINT=http://0.0.0.0:9002 |

Example Output

Enable command output: 

============================================================
UTM mode has been enabled successfully.
============================================================

Configuration changes (shared gv.cfg):
  - telemetry_legacy_mode = false
  - primary_ip_bind_addr = 0.0.0.0
  - secondary_ip_bind_addr = 0.0.0.0
  - additional_cset_urls configured with cluster nodes:
      http://10.20.30.1:9001/csv/cset/converted_enterprise
      http://10.20.30.2:9001/csv/cset/converted_enterprise

Note: Local node URLs are filtered at runtime by agent_manager.py
      to avoid duplicate telemetry collection.

------------------------------------------------------------
IMPORTANT: Please restart UFM services on all nodes to apply changes:
  systemctl restart ufm-enterprise
  systemctl restart ufm-infra
------------------------------------------------------------

Status command output:

=== Current Telemetry Configuration ===

  telemetry_legacy_mode = false
  primary_ip_bind_addr = 0.0.0.0
  secondary_ip_bind_addr = 0.0.0.0
  additional_cset_urls = http://10.20.30.1:9001/csv/cset/converted_enterprise http://10.20.30.2:9001/csv/cset/converted_enterprise

=== Mode Status ===

  Current Mode: UTM (non-legacy)

=== Environment Files ===

  Primary: PROMETHEUS_ENDPOINT=http://0.0.0.0:9001
  Secondary: PROMETHEUS_ENDPOINT=http://0.0.0.0:9002
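
The additional_cset_urls value in the sample output follows a regular pattern: one primary-endpoint URL per cluster node, joined by spaces. The sketch below reproduces that pattern and illustrates the idea of dropping the local node's own URL. Both function names are hypothetical, and the filtering shown is only an assumption about the behaviour the output attributes to agent_manager.py, not its actual code.

```python
from urllib.parse import urlparse

CSET_PATH = "/csv/cset/converted_enterprise"  # path as shown in the sample output
PRIMARY_PORT = 9001                           # default primary telemetry port


def build_additional_cset_urls(node_ips):
    """Join one primary-endpoint URL per cluster node with spaces."""
    return " ".join(f"http://{ip}:{PRIMARY_PORT}{CSET_PATH}" for ip in node_ips)


def drop_local_urls(urls, local_ips):
    """Remove URLs whose host is one of this node's own addresses."""
    local = set(local_ips)
    return [u for u in urls.split() if urlparse(u).hostname not in local]


urls = build_additional_cset_urls(["10.20.30.1", "10.20.30.2"])
print(urls)
print(drop_local_urls(urls, ["10.20.30.1"]))
```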


Configuration Options - Instance Matrix (Legacy, Opt-In)

The instance-matrix configuration is a legacy advanced option. It is disabled by default (use_matrix = false in gv.cfg [Telemetry]). Set use_matrix = true to activate everything described in this section. Most deployments should use count-based configuration instead (see Step 3: (Optional) Configure Session Distribution).

Both standalone and HA deployments can customize how telemetry instances are distributed across HCAs.

Automatic Matrix Generation

When UFM starts in UTM mode, it automatically detects available HCAs and creates a default configuration (applies only with use_matrix=true).

Default Behavior:

  • Detects all available HCAs on the system

  • Creates 1 primary and 1 secondary telemetry instance on the first HCA

  • Configuration is stored in: /opt/ufm/files/conf/utm/{hostname}_instances_matrix.json

Example Auto-Generated Matrix

{
  "mlx5_0": { "primary": 1, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 0 }
}

Custom Configuration

For advanced deployments, customize the distribution of telemetry instances across HCAs using the generate_telemetry_config.sh script.

Auto-Detect and Create Configuration

Automatically detect HCAs and create a configuration file:

/opt/ufm/scripts/generate_telemetry_config.sh --auto-detect /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Manual Custom Configuration

Specify custom instance counts per HCA using the format HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT

/opt/ufm/scripts/generate_telemetry_config.sh \
  /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json \
  mlx5_0:2:1 \
  mlx5_1:0:2 \
  mlx5_2:1:0

This creates:

  • mlx5_0: 2 primary instances, 1 secondary instance

  • mlx5_1: 0 primary instances, 2 secondary instances

  • mlx5_2: 1 primary instance, 0 secondary instances

Example Custom Matrix:

{
  "mlx5_0": { "primary": 2, "secondary": 1 },
  "mlx5_1": { "primary": 0, "secondary": 2 },
  "mlx5_2": { "primary": 1, "secondary": 0 }
}
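
The mapping from HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT arguments to the matrix JSON can be sketched as follows. This is an illustration of the format only; parse_specs is a hypothetical helper and does not reproduce the logic of generate_telemetry_config.sh.

```python
import json


def parse_specs(specs):
    """Turn HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT strings into a matrix dict."""
    matrix = {}
    for spec in specs:
        parts = spec.split(":")
        if len(parts) != 3:
            raise ValueError(f"expected HCA:PRIMARY:SECONDARY, got {spec!r}")
        name, primary, secondary = parts
        # int() raises ValueError for non-numeric counts.
        counts = {"primary": int(primary), "secondary": int(secondary)}
        if not name or min(counts.values()) < 0:
            raise ValueError(f"invalid spec {spec!r}")
        matrix[name] = counts
    return matrix


matrix = parse_specs(["mlx5_0:2:1", "mlx5_1:0:2", "mlx5_2:1:0"])
print(json.dumps(matrix, indent=2))
```

For the three arguments used above, the resulting dictionary matches the example custom matrix.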

Validate Configuration

Verify your matrix file is correctly formatted:

/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

Get Help

Display usage information and options:

/opt/ufm/scripts/generate_telemetry_config.sh --help

After modifying the matrix configuration file, you must restart UFM for changes to take effect.

 

Advanced Configuration Parameters

The following optional parameters in gv.cfg allow fine-tuning of telemetry behavior. Most users should use the default values.

| Parameter | Section | Default | Description |
|---|---|---|---|
| dashboard_interval | [Server] | 30 | Sample rate (seconds) for primary telemetry instances |
| secondary_sample_rate | [Telemetry] | 300 | Sample rate (seconds) for secondary telemetry instances |
| telemetry_legacy_mode | [Telemetry] | true | Set to false to enable UTM mode |
| primary_count | [Telemetry] | 1 | Number of primary (high-frequency) telemetry instances UFM asks UTM to create. HCAs are allocated round-robin by UTM. |
| secondary_count | [Telemetry] | 1 | Number of secondary (low-frequency) telemetry instances UFM asks UTM to create. |
| use_matrix | [Telemetry] | false | Advanced. When true, UFM uses the per-HCA matrix JSON (see Configuration Options - Instance Matrix (Legacy, Opt-In)) instead of count-based session creation. |


Note: Changing sample rates affects data frequency and may impact system performance. Consult with NVIDIA support before modifying these values in production environments.

 

Port Allocation

Default Port Allocation

  • Primary Telemetry: Base port 9001

  • Secondary Telemetry: Base port 9002

Multi-Instance Port Strategy

When multiple instances are configured, ports are allocated using an interleaved strategy:

  • Primary instances: Odd ports (9001, 9003, 9005, 9007...)

  • Secondary instances: Even ports (9002, 9004, 9006, 9008...)

Example - 2 primary + 2 secondary instances:

  • Primary: ports 9001, 9003

  • Secondary: ports 9002, 9004

Port Allocation with Proxy Mode

When enable_utm_proxy = true, ports 9001 and 9002 are reserved for the UTM HTTP proxy, and telemetry instances start from offset ports:

  • Primary instances: 9003, 9005, 9007, 9009...

  • Secondary instances: 9004, 9006, 9008, 9010...
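
The interleaved scheme, with and without the proxy offset, reduces to simple arithmetic. A minimal sketch follows; allocate_ports is a hypothetical name used for illustration, not a UFM function.

```python
def allocate_ports(primary_count, secondary_count, proxy_enabled=False):
    """Return (primary_ports, secondary_ports) using the interleaved scheme."""
    # With the proxy on, 9001 and 9002 are reserved, so both series shift by 2.
    offset = 2 if proxy_enabled else 0
    primary = [9001 + offset + 2 * i for i in range(primary_count)]
    secondary = [9002 + offset + 2 * i for i in range(secondary_count)]
    return primary, secondary


print(allocate_ports(2, 2))                      # ([9001, 9003], [9002, 9004])
print(allocate_ports(2, 2, proxy_enabled=True))  # ([9003, 9005], [9004, 9006])
```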

Troubleshooting

Verify Telemetry Status

Check if telemetry instances are running:

ps aux | grep -E "(utm|telemetry)" | grep -v grep

Check Matrix Configuration (legacy matrix mode only)

If use_matrix = true is set in gv.cfg [Telemetry], validate the instance matrix file:

/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json

View Current Mode

Use the configuration script to display current settings:

/opt/ufm/files/scripts/configure_utm_mode.py --status

Check Lock Files

If telemetry startup hangs, check for stale lock files:

ls -la /tmp/utm_matrix_*.lock

UFM Clustered Telemetry on Kubernetes

This section describes how to configure and run Clustered Telemetry when UFM Enterprise and the UTM plugin are deployed as Kubernetes pods using Helm, how to change UTM configuration, how to enable token‑based authentication between UTM and UFM, and how to validate the setup end‑to‑end.

The Kubernetes deployment is a third deployment scenario alongside Standalone and HA Cluster. On Kubernetes, UFM and UTM run as separate pods and communicate through in‑cluster Services. This brings two operational differences the previous sections do not cover:

  • The operator does not edit gv.cfg or utm_config.ini directly on a host; configuration is injected through the Helm values file and Kubernetes resources.

  • UTM can no longer reach UFM on localhost; it must authenticate to UFM through the UFM Service (HTTPS on port 443) using a token.

Prerequisites

Before deploying the UTM plugin on Kubernetes, verify the following:

  • Kubernetes cluster (full Kubernetes or k3s) with at least one node that exposes InfiniBand HCAs. Run kubectl get nodes and make sure the nodes are Ready.

  • NVIDIA Network Operator installed in the cluster with a NicClusterPolicy whose status.state is ready. This exposes the nvidia.com/hostdev (or rdma/rdma_shared_device_a) resource on the nodes that will run UTM.

  • UFM Enterprise deployed in the cluster using its Helm chart. The UTM plugin reuses UFM's shared PVC, so UFM must be installed first. The UFM pod should be 1/1 Running before proceeding.

  • Shared PVC (ReadWriteMany) created by the UFM Helm chart (default name <ufm-release-name>-ufm-enterprise-files). For UTM configuration to survive helm uninstall/helm install cycles, the StorageClass should use reclaimPolicy: Retain.

  • Generic UFM plugin Helm chart (ufm-plugin-helm-template) available locally or as a packaged release artifact. This chart deploys any UFM plugin — UTM is deployed as one of its entries.

  • UTM plugin image loaded into the container runtime on every node that may run the UTM pod (the nodes with IB HCAs).

Note: The UTM Helm-based deployment on Kubernetes requires UFM to be installed via the UFM Helm chart. It is not compatible with Standalone or HA Cluster deployments described in the previous sections.

Architecture Overview

┌───────────────────────────────────────────────────────────────────────┐
│                          Kubernetes cluster                           │
│                                                                       │
│  ┌──────────────────┐                ┌──────────────────────────┐     │
│  │     UFM pod      │◄───────────────┤         UTM pod          │     │
│  │     443 / 80     │    HTTPS /     │  8888 (management)       │     │
│  │     (Apache)     │   ufmRest/…    │  9001–9010 (telemetry)   │     │
│  └──────────────────┘    + token     └──────────────────────────┘     │
│           │                                     │                     │
│           └─────── shared PVC (RWX) ────────────┘                     │
│              conf/plugins/utm, log/plugins/utm                        │
└───────────────────────────────────────────────────────────────────────┘

  • UFM reaches UTM through the UTM Service (<ufm-release>-plugin-utm) on port 8888 for management and 9001–9010 for telemetry collection.

  • UTM reaches UFM through the UFM Service (<ufm-release>) on port 443 (HTTPS through Apache), authenticated with a UFM API token.

  • Both pods mount the same UFM PVC under different sub‑paths. UTM's /config and /log directories live on the PVC and persist across pod restarts.

Step 1: Prepare the UTM Plugin Values File

Create a values file that will be passed to helm install. Below is the minimal working configuration; everything else in this guide is an incremental addition to this file.

# utm-plugin-values.yaml
ufmFullname: "ufm-ufm-enterprise"

rdma:
  resourceName: "nvidia.com/hostdev"   # Or rdma/rdma_shared_device_a
  resourceCount: "1"

plugins:
  entries:
    utm:
      image: mellanox/ufm-plugin-utm
      tag: "<version>"                 # e.g. "1.25.1-2"
      port: 8888
      ports: [9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008, 9009, 9010]
      strategy: Recreate
      startupProbe:
        httpGet: { path: /help, port: 8888 }
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet: { path: /help, port: 8888 }
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet: { path: /status, port: 8888 }
        periodSeconds: 10
        timeoutSeconds: 10
        failureThreshold: 3
      env:
        - name: UTM_CONFIG_OVERRIDES
          value: |
            [general]
            log_level=info
      volumes:
        - name: utm-data
          emptyDir: {}
      volumeMounts:
        - name: utm-data
          mountPath: /data

| Field | Description |
|---|---|
| ufmFullname | UFM Helm release full name. Used to derive the PVC name and UFM Service DNS. Must match the name used when installing UFM. |
| plugins.entries.utm.image / tag | UTM container image and tag. The tag must correspond to an image loaded on all IB-capable nodes. |
| plugins.entries.utm.port | UTM management API port (8888). Exposed through the plugin Service; also targeted by the health probes. |
| plugins.entries.utm.ports | Telemetry instance ports. The generic plugin chart exposes them through the plugin Service. Primary instances use odd ports, secondary instances use even ports. |
| plugins.entries.utm.env[UTM_CONFIG_OVERRIDES] | INI content merged into /config/utm_config.ini on first pod start. See Step 3. |
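
Because the Service exposes only the ports listed in plugins.entries.utm.ports, it can be useful to check that a planned number of instances fits. The sketch below assumes the odd/even scheme described in the table starts at 9001 and 9002 and steps by 2 (the UTM proxy is disabled on Kubernetes); missing_ports is a hypothetical helper, not part of the chart.

```python
def missing_ports(exposed_ports, primary_count, secondary_count):
    """Return the ports the instances would need that the Service does not expose."""
    # Assumption: primaries start at 9001 and secondaries at 9002,
    # each series stepping by 2.
    needed = [9001 + 2 * i for i in range(primary_count)]
    needed += [9002 + 2 * i for i in range(secondary_count)]
    return sorted(set(needed) - set(exposed_ports))


exposed = list(range(9001, 9011))      # the ports list from the values file
print(missing_ports(exposed, 5, 5))    # []
print(missing_ports(exposed, 6, 1))    # [9011]
```

Under these assumptions, the ten ports in the minimal values file cover up to five primary and five secondary instances.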

Step 2: Install the UTM Plugin

Install the Helm chart, passing the values file:

helm install ufm-plugins <plugin-chart-path-or-tgz> \
  -f utm-plugin-values.yaml \
  -n ufm-enterprise

If the plugin chart is already installed (for example because other plugins have been deployed), add the UTM entry with --reuse-values:

helm upgrade ufm-plugins <plugin-chart-path-or-tgz> \
  --reuse-values \
  -f utm-plugin-values.yaml \
  -n ufm-enterprise

Watch the pod come up:

kubectl get pods -n ufm-enterprise -l app=<ufm-release>-plugin-utm -w

Expected progression: Init:0/1 → PodInitializing → Running 1/1.

Note: The UTM init container performs a hard InfiniBand validation on every pod start. If no /dev/infiniband/uverbs* devices are visible to the container, the init container fails immediately with a clear error. Verify that the NVIDIA Network Operator and the NicClusterPolicy are configured correctly if the init container fails.

Step 3: Configure UTM

UTM configuration lives in /config/utm_config.ini inside the pod. On the first pod start, the init container builds this file from three layers in this order:

  1. Image defaults — the factory utm_config.ini shipped in the container.

  2. User seed — content of the UTM_CONFIG_OVERRIDES environment variable.

    Keys that are reserved for Kubernetes correctness (see table below) are rejected and logged as a warning.

  3. Kubernetes invariants: [general] force_as_plugin=1, [high_availability] enable_ha=0, and [http_proxy] enable=0. These are applied last and always win over the user layer.

After a successful first init, the init container writes the marker file /config/.utm_initialized on the PVC. On every subsequent pod start, the init container detects the marker and skips the rebuild — /config is now owned by the operator.
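
The three-layer merge can be sketched as below. This is an illustration of the documented ordering, not the init container's code: build_config is a hypothetical name, settings are modelled as (section, key) pairs, and the default values in the example are invented for demonstration. The reserved set mirrors the Reserved Keys table in this section.

```python
RESERVED = {
    ("general", "port"), ("general", "log_path"), ("general", "data_path"),
    ("general", "clx_restart_file"), ("general", "force_as_plugin"),
    ("high_availability", "enable_ha"), ("high_availability", "master_host_file"),
    ("high_availability", "ha_hosts_file"), ("http_proxy", "enable"),
}
INVARIANTS = {
    ("general", "force_as_plugin"): "1",
    ("high_availability", "enable_ha"): "0",
    ("http_proxy", "enable"): "0",
}


def build_config(defaults, user_seed):
    """Merge the three layers; return (config, rejected_user_keys)."""
    config = dict(defaults)                  # layer 1: image defaults
    rejected = []
    for key, value in user_seed.items():     # layer 2: user seed
        if key in RESERVED:
            rejected.append(key)             # warned about and skipped
        else:
            config[key] = value
    config.update(INVARIANTS)                # layer 3: always wins
    return config, rejected


# Hypothetical defaults and seed, for illustration only.
defaults = {("general", "log_level"): "info", ("general", "port"): "8888"}
seed = {("general", "log_level"): "debug", ("general", "port"): "9999"}
config, rejected = build_config(defaults, seed)
print(config[("general", "log_level")], config[("general", "port")], rejected)
```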

| Path | Persistent? | Regenerated on first init? | Who owns it |
|---|---|---|---|
| /config | yes (PVC sub-path conf/plugins/utm) | yes, on first init only | operator after first install |
| /log | yes (PVC sub-path log/plugins/utm) | no | UTM (logs and state) |
| /data | no (emptyDir) | n/a | UTM (ephemeral telemetry data) |
| /etc/utm/secrets | yes (Secret, when mounted) | no | operator (via Kubernetes Secret) |

Seeding Configuration at First Install

Add INI content to UTM_CONFIG_OVERRIDES before running helm install. The content is merged into /config/utm_config.ini exactly once.

plugins:
  entries:
    utm:
      env:
        - name: UTM_CONFIG_OVERRIDES
          value: |
            [general]
            log_level=debug
            state_update_interval=30
            [ib_trap]
            enable_ib_trap=1
            ib_trap_hca=mlx5_1

Note: Changes to UTM_CONFIG_OVERRIDES after the first install are ignored on subsequent pod restarts. To apply a new seed, reset the configuration — see Resetting Configuration below.

Reserved Keys

The following keys cannot be overridden through UTM_CONFIG_OVERRIDES. The init container logs each attempt as [UTM Init] WARN: refusing user override of reserved key <section>/<key> and skips the key. This protects the pod from configurations that would break Kubernetes integration.

| Section | Key | Why locked |
|---|---|---|
| [general] | port | Must match the Kubernetes Service (8888). |
| [general] | log_path, data_path | Must match the /log and /data volume mounts. |
| [general] | clx_restart_file | Must match the PVC path UFM writes to signal topology changes. |
| [general] | force_as_plugin | Required to force UTM into HTTP mode inside Kubernetes. |
| [high_availability] | enable_ha, master_host_file, ha_hosts_file | Kubernetes handles HA; UTM HA must stay disabled. |
| [http_proxy] | enable | Must be 0 so ports 9001+ stay free for telemetry instances. |

Changing Configuration After Install

Because /config is preserved after the first pod start, change any non‑reserved setting in place:

POD=$(kubectl get pod -n ufm-enterprise \
  -l app=<ufm-release>-plugin-utm \
  -o jsonpath='{.items[0].metadata.name}')

# Example: switch log level to debug
kubectl exec -n ufm-enterprise "$POD" -c utm -- \
  sed -i 's/^log_level=.*/log_level=debug/' /config/utm_config.ini

# Apply the change without recreating the pod (init does not rerun)
kubectl exec -n ufm-enterprise "$POD" -c utm -- \
  supervisorctl -c /config/supervisord.conf restart utm

The change persists on the PVC and survives pod restarts, helm upgrade, and PVC‑retained reinstalls.

Resetting Configuration

Delete the marker and recreate the pod. The init container runs a full first‑time init again, re-seeding /config from the current image defaults and the current value of UTM_CONFIG_OVERRIDES.

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise

Step 4: Configure Token-Based Authentication (Recommended)

In Kubernetes, UTM and UFM are separate pods. UTM cannot authenticate to UFM as it does in Standalone or HA Cluster deployments (through localhost with an internal header); it must use an API token over HTTPS against the UFM Service. This section walks through enabling token authentication.

If no token is configured, UTM falls back to basic authentication with the default UFM credentials. This is acceptable for evaluation but not recommended for production use.
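
The fallback rule can be summarised in a few lines. The sketch below only illustrates the decision described above (token when a non-empty token file is present, basic authentication otherwise); select_auth_method is a hypothetical name and this is not UTM's implementation.

```python
import os
import tempfile


def select_auth_method(token_file):
    """Return ("token", value) for a non-empty token file, else ("basic", None)."""
    try:
        with open(token_file, encoding="utf-8") as fh:
            token = fh.read().strip()
    except OSError:
        token = ""  # a missing or unreadable file means no token is configured
    return ("token", token) if token else ("basic", None)


# Demonstration in a temporary directory instead of /etc/utm/secrets.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ufm_token")
    print(select_auth_method(path)[0])   # file absent
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("example-token\n")
    print(select_auth_method(path)[0])   # file present
```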

Step 4.1: Mint a Token on UFM

Execute inside the UFM pod and capture the returned access_token:

UFM_POD=$(kubectl get pod -n ufm-enterprise \
  -l app.kubernetes.io/name=ufm-enterprise \
  -o jsonpath='{.items[0].metadata.name}')

TOKEN=$(kubectl exec -n ufm-enterprise "$UFM_POD" -- \
  curl -sk -XPOST https://127.0.0.1/ufmRest/app/tokens -u admin:<password> \
  | python3 -c "import sys,json;print(json.load(sys.stdin)['access_token'])")

echo "$TOKEN"

Replace admin:<password> with valid UFM credentials. The default is admin:123456; change this to match your environment.

Step 4.2: Store the Token as a Kubernetes Secret

kubectl create secret generic utm-ufm-token \
  --from-literal=token="$TOKEN" \
  -n ufm-enterprise

The Secret name utm-ufm-token is referenced by the volume entry in Step 4.3. Use the same name or update the volume entry to match.

Step 4.3: Mount the Secret into the UTM Pod

Extend the volumes and volumeMounts sections of utm-plugin-values.yaml to mount the Secret at /etc/utm/secrets. The optional: true flag guarantees the pod still starts if the Secret is absent, in which case UTM falls back to basic auth.

plugins:
  entries:
    utm:
      volumes:
        - name: utm-data
          emptyDir: {}
        - name: utm-ufm-token
          secret:
            secretName: utm-ufm-token
            optional: true
            items:
              - key: token
                path: ufm_token
      volumeMounts:
        - name: utm-data
          mountPath: /data
        - name: utm-ufm-token
          mountPath: /etc/utm/secrets
          readOnly: true

Step 4.4: Point UTM at the Token

Add a [ufm] block to UTM_CONFIG_OVERRIDES telling UTM which UFM Service to contact, on which port, and where to find the token file:

plugins:
  entries:
    utm:
      env:
        - name: UTM_CONFIG_OVERRIDES
          value: |
            [general]
            log_level=info
            [ufm]
            ufm=https://<ufm-release>
            ufm_port=443
            ufm_rest_api_port=443
            ufm_token_file=/etc/utm/secrets/ufm_token

Replace <ufm-release> with the UFM Service DNS name (for example, ufm-ufm-enterprise).

| Key | Value | Purpose |
|---|---|---|
| ufm | https://<ufm-release> | UFM Service DNS, HTTPS scheme. |
| ufm_port | 443 | Used for fabric snapshot requests in Kubernetes mode. |
| ufm_rest_api_port | 443 | Used for UFM REST calls. |
| ufm_token_file | /etc/utm/secrets/ufm_token | Path of the file mounted from the utm-ufm-token Secret. |

Apply the change and reinstall (or reset and restart if UTM is already running):

# First install
helm install ufm-plugins <plugin-chart> \
  -f utm-plugin-values.yaml \
  -n ufm-enterprise

# Already running: reset and restart so init re-seeds with the new [ufm] block
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise

Rotating the Token

Mint a new token, update the Secret, and restart the UTM pod. The init container does not need to rerun — only the UTM process needs to re‑read the token file.

# UFM_POD is set as in Step 4.1
NEW=$(kubectl exec -n ufm-enterprise "$UFM_POD" -- \
  curl -sk -XPOST https://127.0.0.1/ufmRest/app/tokens -u admin:<password> \
  | python3 -c "import sys,json;print(json.load(sys.stdin)['access_token'])")

kubectl create secret generic utm-ufm-token \
  --from-literal=token="$NEW" \
  --dry-run=client -o yaml | kubectl apply -n ufm-enterprise -f -

kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise

Disabling Token Authentication

Delete the Secret and remove the [ufm] block from UTM_CONFIG_OVERRIDES, then reset the configuration so UTM falls back to basic authentication on the next pod start:

kubectl delete secret utm-ufm-token -n ufm-enterprise

# Edit utm-plugin-values.yaml to remove the [ufm] block
helm upgrade ufm-plugins <plugin-chart> \
  --reuse-values -f utm-plugin-values.yaml \
  -n ufm-enterprise

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise

Step 5: Validating the Setup

Work through these checks in order. Each check verifies a specific capability of the deployment.

5.1 Pod Health

kubectl get pod -n ufm-enterprise -l app=<ufm-release>-plugin-utm

Expected: READY 1/1, STATUS Running, RESTARTS 0.

5.2 Init Container Output

The init container log summarises the entire configuration pipeline. On a fresh install, expect output similar to:

[UTM Init] K8s mode detected — validating InfiniBand devices...
[UTM Init] IB devices found
[UTM Init] Seeding user config overrides from env var UTM_CONFIG_OVERRIDES
[UTM Init] user-override: general/log_level=info
[UTM Init] user-override: ufm/ufm=https://<ufm-release>
[UTM Init] user-override: ufm/ufm_rest_api_port=443
[UTM Init] user-override: ufm/ufm_port=443
[UTM Init] user-override: ufm/ufm_token_file=/etc/utm/secrets/ufm_token
[UTM Init] Applying K8s-required overrides
[UTM Init] k8s-override: general/force_as_plugin=1
[UTM Init] k8s-override: high_availability/enable_ha=0
[UTM Init] k8s-override: http_proxy/enable=0
[UTM Init] First-time init complete

kubectl logs -c utm-init -n ufm-enterprise -l app=<ufm-release>-plugin-utm

On a subsequent restart the expected output is:

[UTM Init] K8s mode detected — validating InfiniBand devices...
[UTM Init] IB devices found
[UTM Init] /config/.utm_initialized present — /config already initialized; preserving existing configuration
[UTM Init] To reset, delete this marker and restart the pod.
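
When checking many pods, the two log shapes above can be told apart automatically. The following sketch parses the init log text as shown in this section; summarize_init_log is a hypothetical helper, and the patterns assume the log lines keep exactly the format of the samples.

```python
import re

OVERRIDE_RE = re.compile(r"\[UTM Init\] (user-override|k8s-override): (\S+?)/(\S+?)=(.*)")


def summarize_init_log(log_text):
    """Report whether init ran fully and which overrides it applied."""
    summary = {"first_init": False, "preserved": False, "user": {}, "k8s": {}}
    for line in log_text.splitlines():
        if "First-time init complete" in line:
            summary["first_init"] = True
        elif "preserving existing configuration" in line:
            summary["preserved"] = True
        match = OVERRIDE_RE.search(line)
        if match:
            layer, section, key, value = match.groups()
            bucket = "user" if layer == "user-override" else "k8s"
            summary[bucket][f"{section}/{key}"] = value.strip()
    return summary


sample_log = """[UTM Init] IB devices found
[UTM Init] user-override: general/log_level=info
[UTM Init] k8s-override: http_proxy/enable=0
[UTM Init] First-time init complete"""
print(summarize_init_log(sample_log))
```

The log text itself can be obtained with the kubectl logs -c utm-init command shown in this step.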

5.3 Generated Configuration

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  cat /config/utm_config.ini

Confirm that:

  • [general] has force_as_plugin = 1, and your expected log_level.

  • [high_availability] has enable_ha = 0.

  • [http_proxy] has enable = 0.

  • If token authentication is enabled: [ufm] has ufm, ufm_port, ufm_rest_api_port, and ufm_token_file as configured.

5.4 UTM Management API

From inside the pod:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  python3 -c "import urllib.request; \
  print(urllib.request.urlopen('http://127.0.0.1:8888/help').read().decode()[:200])"

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  python3 -c "import urllib.request; \
  print(urllib.request.urlopen('http://127.0.0.1:8888/status').read().decode())"

The /help endpoint returns a short help text. The /status endpoint returns a JSON document listing the telemetry groups and their instances.

5.5 Token Authentication

If token authentication is enabled, verify the token is loaded and UFM accepts it:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  grep -E "auth_method|Loaded UFM auth token" /log/utm.log | tail -5

Expected entries include:

[utils.py] INFO Loaded UFM auth token from /etc/utm/secrets/ufm_token
auth_method: token
[ufm_api.py] INFO UFM API Request Status [200] in ... sec

If the log shows auth_method: basic even though a Secret was created, confirm the Secret is mounted inside the pod:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  sh -c '[ -s /etc/utm/secrets/ufm_token ] && echo "token present" || echo "no token"'

5.6 Fabric Data Collection

UTM must be able to load the fabric snapshot from UFM. Expected log entries:

[fabric_snapshot.py] INFO Fabric state was updated
[fabric_snapshot.py] INFO Loaded FULL FABRIC: <N> guid/port instances from UFM: https://<ufm-release>
[fabric_delegator.py] INFO Total size of network: <M> ports

Check directly:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  grep -E "Loaded FULL FABRIC|Fabric state was updated" /log/utm.log | tail -5

5.7 Telemetry Endpoints

Telemetry instances are exposed on ports 9001–9010 through the plugin Service. From inside the cluster, fetch a sample:

kubectl run telemetry-check --rm -it --restart=Never \
  --image=curlimages/curl -n ufm-enterprise -- \
  curl -s http://<ufm-release>-plugin-utm:9001/csv/cset/converted_enterprise | head -20

A non-empty CSV response with counter values confirms the primary telemetry instance is collecting. Repeat on port 9002 for the secondary instance.

Troubleshooting

Pod Stuck in Init:Error

kubectl logs -c utm-init -n ufm-enterprise -l app=<ufm-release>-plugin-utm

Typical causes:

| Log snippet | Cause | Fix |
|---|---|---|
| ERROR: No IB devices found (/dev/infiniband/uverbs*) | Node lacks IB access. | Verify NicClusterPolicy is ready; confirm nvidia.com/hostdev is allocated on the node. |
| ERROR: Missing /config/utm_config.ini | PVC mount issue. | Verify PVC is Bound and the sub-path for the plugin exists. |
| WARN: refusing user override of reserved key ... | Reserved key in UTM_CONFIG_OVERRIDES. | Harmless; the key is skipped. Remove it if you want a clean log. |
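
The table above can be turned into a small lookup for scripted triage of init container logs. This is a sketch only: diagnose and KNOWN_INIT_ISSUES are hypothetical names, and the substrings are taken from the log snippets listed in the table.

```python
KNOWN_INIT_ISSUES = [
    ("No IB devices found", "Node lacks IB access",
     "Verify NicClusterPolicy is ready and nvidia.com/hostdev is allocated"),
    ("Missing /config/utm_config.ini", "PVC mount issue",
     "Verify the PVC is Bound and the plugin sub-path exists"),
    ("refusing user override of reserved key", "Reserved key in UTM_CONFIG_OVERRIDES",
     "Harmless; remove the key for a clean log"),
]


def diagnose(init_log):
    """Return (cause, fix) pairs for every known snippet found in the log."""
    return [(cause, fix) for snippet, cause, fix in KNOWN_INIT_ISSUES
            if snippet in init_log]


for cause, fix in diagnose("[UTM Init] ERROR: No IB devices found (/dev/infiniband/uverbs*)"):
    print(f"{cause}: {fix}")
```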

Pod Running but Not Ready

UTM is running but the readiness probe fails. Inspect the UTM application log:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  tail -100 /log/utm.log

Common causes are missing fabric snapshot (UFM unreachable, wrong [ufm] settings) or the telemetry instances failing to bind to their ports.

UTM_CONFIG_OVERRIDES Changes Not Taking Effect

The init marker prevents re-seeding after first install. Reset:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise

Token Authentication Returns 401

The token is either invalid, expired, or was not mounted. Verify:

  1. Secret exists:

    kubectl get secret utm-ufm-token -n ufm-enterprise

  2. Token file is non-empty inside the pod:

    kubectl exec ... -- sh -c '[ -s /etc/utm/secrets/ufm_token ] && echo ok'

  3. Mint a new token on UFM and update the Secret.

UFM Service Unreachable from UTM

From inside the UTM pod:

kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  python3 -c "import urllib.request, ssl; \
  print(urllib.request.urlopen('https://<ufm-release>/ufmRest/app/ufm_version', \
  context=ssl._create_unverified_context()).read().decode())"

If this fails, the UFM Service name in [ufm] ufm is wrong, or the UFM pod is not Ready. Confirm:

kubectl get svc -n ufm-enterprise
kubectl get pods -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise

Known Limitations

  • Changes to UTM_CONFIG_OVERRIDES after first install are ignored until the marker is reset. This is intentional — /config is treated as operator-owned state after the first init.

  • UTM image upgrades do not re-seed /config. If an upgraded image ships new defaults or adjusts Kubernetes invariants, reset the marker and restart the pod.

  • ClusterIP-only telemetry endpoints. Ports 9001–9010 are reachable from within the cluster. External monitoring tools need a NodePort or Ingress in addition; this is not covered by this guide.

  • No automatic token rotation. Tokens are long-lived. Rotate periodically by following the procedure in Rotating the Token.

  • Cross-reinstall persistence requires reclaimPolicy: Retain. With Delete, the PVC is wiped on helm uninstall, and the next helm install starts with an empty /config.

  • Non-XDR environments only. Deploying UTM (UFM Telemetry Manager) on Kubernetes in an XDR environment is not supported in this release.

