UFM Clustered Telemetry is an advanced feature that enables distributed telemetry data collection across multiple network adapters (HCAs) in your InfiniBand fabric. This feature provides improved performance and scalability for large-scale deployments through workload distribution.
Key Benefits
-
Better Performance: Workload distribution across multiple instances reduces collection bottlenecks
-
HCA Utilization: Leverages multiple network adapters for parallel data collection
-
Scalability: Handles larger fabric deployments more efficiently
-
Flexibility: Customizable instance distribution based on your infrastructure
Prerequisites
-
UFM Telemetry Manager (UTM) Plugin must be deployed and enabled
Deployment Types
UFM Clustered Telemetry supports two deployment scenarios. Choose the appropriate configuration method based on your deployment type:
| Single node (Standalone) | Single UFM node, telemetry collected locally | Manual | |
|---|---|---|---|
| HA Cluster | Multiple nodes with shared configuration, telemetry aggregated across all nodes | | |
Key Differences
| | Standalone | HA Cluster |
|---|---|---|
| Bind Address | | |
| | Not required | Required (all node IPs) |
| Configuration Scope | Single node | Shared across cluster |
Important: Choose the correct configuration method for your deployment. Using the wrong method may result in inaccessible telemetry endpoints or duplicate data collection.
Switching From Legacy to Clustered Telemetry (UTM) Mode
Standalone Deployment Configuration
This section applies to single node (standalone) deployments where UFM runs on a single node.
Step 1: Deploy UTM Plugin
-
Navigate to Settings > Plugin Management in the UFM WebUI
-
Locate the UFM Telemetry Manager (UTM) plugin
-
Click Enable to activate the plugin
Step 2: Configure Telemetry Mode
Edit the UFM configuration file:
vi /opt/ufm/files/conf/gv.cfg
Locate the [Telemetry] section and set the following parameters:
[Telemetry]
telemetry_legacy_mode = false
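A quick way to confirm the setting before restarting is to parse the file and read the flag back. The following is a minimal Python sketch and is not part of UFM. The helper name is_utm_mode is hypothetical, and the sketch assumes that gv.cfg parses as standard INI and that a missing key should be treated as legacy mode.

```python
import configparser


def is_utm_mode(cfg_text):
    """Return True when [Telemetry] telemetry_legacy_mode is false."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(cfg_text)
    # Assumption: a missing section or key is treated as legacy mode.
    legacy = parser.getboolean("Telemetry", "telemetry_legacy_mode", fallback=True)
    return not legacy


if __name__ == "__main__":
    sample = "[Telemetry]\ntelemetry_legacy_mode = false\n"
    print("UTM mode" if is_utm_mode(sample) else "legacy mode")
```

On a UFM host, the same function could be given the contents of /opt/ufm/files/conf/gv.cfg instead of the inline sample.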
Step 3: (Optional) Configure Session Distribution
By default, UFM creates one primary and one secondary telemetry session and lets UTM allocate them to the locally detected HCAs in round-robin order. You can tune this without any matrix file by adding the following to gv.cfg:
[Telemetry]
primary_count = 2
secondary_count = 1
For fine-grained per-HCA control (for example, to pin specific instances to specific HCAs), enable the legacy matrix mode; see Configuration Options - Instance Matrix (Legacy, Opt-In).
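To illustrate what round-robin allocation means for the count-based settings, here is a minimal Python sketch. It is not UTM's actual allocator. The function name allocate_round_robin is hypothetical, and the sketch assumes that each session type is dealt out independently, starting from the first detected HCA.

```python
from itertools import cycle, islice


def allocate_round_robin(hcas, primary_count=1, secondary_count=1):
    """Distribute session counts over HCAs in round-robin order.

    Hypothetical helper: each session type is dealt out independently,
    starting from the first HCA in the list.
    """
    if not hcas:
        raise ValueError("at least one HCA is required")
    allocation = {hca: {"primary": 0, "secondary": 0} for hca in hcas}
    for kind, count in (("primary", primary_count), ("secondary", secondary_count)):
        # cycle() repeats the HCA list; islice() takes exactly `count` slots.
        for hca in islice(cycle(hcas), count):
            allocation[hca][kind] += 1
    return allocation


if __name__ == "__main__":
    # Mirrors the gv.cfg example: primary_count = 2, secondary_count = 1
    print(allocate_round_robin(["mlx5_0", "mlx5_1"], primary_count=2, secondary_count=1))
```

With two HCAs and the counts from the example, this sketch places one primary session on each HCA and the single secondary session on the first HCA.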
Step 4: Start or Restart UFM
If UFM is not running, start it:
/etc/init.d/ufmd start
If UFM is already running, restart to apply changes:
/etc/init.d/ufmd restart
Alternatively, restart only the telemetry service:
/etc/init.d/ufmd ufm_telemetry_stop
/etc/init.d/ufmd ufm_telemetry_start
HA Cluster Deployment Configuration
This section applies to High Availability (HA) cluster deployments (Active-Active deployments only) where multiple nodes share a common gv.cfg configuration file and telemetry must be aggregated across all cluster nodes.
The configure_utm_mode.py script automates the configuration by:
-
Setting bind addresses to 0.0.0.0 for external telemetry access
-
Configuring additional_cset_urls in gv.cfg for multi-node telemetry aggregation
-
Managing legacy mode flags
-
Updating environment files for proper endpoint configuration
Prerequisites
-
HA cluster must be configured in active-active mode.
-
/var/lib/ufm_ha/ha_statefile present (or explicit node IPs available)
-
UTM Plugin deployed (can be enabled before or after configuration)
-
UFM configured in Infra mode
Recommended: Configure Before Starting UFM
This is the preferred approach as it avoids unnecessary service restarts.
-
Step 1: Deploy UTM Plugin
Ensure the UTM plugin is deployed on all cluster nodes. You can deploy via CLI or by using ufm_infra_feature_flag.py.
-
Step 2: Run the Configuration Script
Option A: Auto-detect node IPs from HA state file
/opt/ufm/files/scripts/configure_utm_mode.py --enable
Option B: Specify node IPs explicitly
/opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30
-
Step 3: Start UFM Services
Start UFM services on all cluster nodes:
ufm_ha_cluster start
Alternative: Configure After UFM is Running
If UFM is already running, you can still configure UTM mode and restart the services.
-
Step 1: Verify UTM Plugin is Enabled
Ensure the UTM plugin is enabled in Settings > Plugin Management.
-
Step 2: Run the Configuration Script
/opt/ufm/files/scripts/configure_utm_mode.py --enable --node-ips 10.212.23.1,10.209.226.30
Or with auto-detection:
/opt/ufm/files/scripts/configure_utm_mode.py --enable
-
Step 3: Restart UFM Services on All Nodes
systemctl restart ufm-enterprise
systemctl restart ufm-infra
Script Usage
-
Enable UTM Mode
Enable with auto-detected node IPs:
./configure_utm_mode.py --enable
Enable with explicit node IPs:
./configure_utm_mode.py --enable --node-ips 10.20.30.40,10.20.30.50
-
Disable UTM Mode
Revert to legacy mode:
./configure_utm_mode.py --disable
-
Show Current Status
Display current telemetry configuration:
./configure_utm_mode.py --status
Command-Line Options
| Flag | | Description |
|---|---|---|
| --enable | | Enable UTM mode for telemetry |
| --disable | | Disable UTM mode (revert to legacy mode) |
| --status | | Show current telemetry configuration status |
| --node-ips | | Comma-separated list of cluster node IPs. If not provided, auto-detects from |
| | | Skip updating |
| | | Path to gv.cfg file (default: |
| | | Set logging level: DEBUG, INFO, WARNING, ERROR (default: INFO) |
Configuration Changes
When enabling UTM mode for HA, the script modifies the following parameters:
gv.cfg [Telemetry] section:
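| Flag | | |
|---|---|---|
| telemetry_legacy_mode | | false |
| primary_ip_bind_addr | | 0.0.0.0 |
| secondary_ip_bind_addr | | 0.0.0.0 |
| additional_cset_urls | (empty) | Space-separated cluster URLs |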
Environment files:
| Flag | | |
|---|---|---|
| | | |
| | | |
Example Output
Enable command output:
============================================================
UTM mode has been enabled successfully.
============================================================
Configuration changes (shared gv.cfg):
- telemetry_legacy_mode = false
- primary_ip_bind_addr = 0.0.0.0
- secondary_ip_bind_addr = 0.0.0.0
- additional_cset_urls configured with cluster nodes:
http://10.20.30.1:9001/csv/cset/converted_enterprise
http://10.20.30.2:9001/csv/cset/converted_enterprise
Note: Local node URLs are filtered at runtime by agent_manager.py
to avoid duplicate telemetry collection.
------------------------------------------------------------
IMPORTANT: Please restart UFM services on all nodes to apply changes:
systemctl restart ufm-enterprise
systemctl restart ufm-infra
------------------------------------------------------------
Status command output:
=== Current Telemetry Configuration ===
telemetry_legacy_mode = false
primary_ip_bind_addr = 0.0.0.0
secondary_ip_bind_addr = 0.0.0.0
additional_cset_urls = http://10.20.30.1:9001/csv/cset/converted_enterprise http://10.20.30.2:9001/csv/cset/converted_enterprise
=== Mode Status ===
Current Mode: UTM (non-legacy)
=== Environment Files ===
Primary: PROMETHEUS_ENDPOINT=http://0.0.0.0:9001
Secondary: PROMETHEUS_ENDPOINT=http://0.0.0.0:9002
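The additional_cset_urls value in the output above follows a regular pattern: one URL per cluster node, joined by spaces. The sketch below reproduces that pattern in Python and shows the kind of local-node filtering the note in the enable output refers to. It is an illustration only, not the code of configure_utm_mode.py or agent_manager.py; the function names build_cset_urls and remote_urls are hypothetical.

```python
from urllib.parse import urlparse


def build_cset_urls(node_ips, port=9001, path="/csv/cset/converted_enterprise"):
    """Return a space-separated URL string, one URL per cluster node."""
    return " ".join(f"http://{ip}:{port}{path}" for ip in node_ips)


def remote_urls(cset_urls, local_ips):
    """Drop URLs whose host is one of this node's own IPs."""
    local = set(local_ips)
    return [url for url in cset_urls.split() if urlparse(url).hostname not in local]


if __name__ == "__main__":
    urls = build_cset_urls(["10.20.30.1", "10.20.30.2"])
    print(urls)
    # On node 10.20.30.1 only the peer URL would remain.
    print(remote_urls(urls, ["10.20.30.1"]))
```

Filtering at read time rather than at write time is what allows every node to share one gv.cfg while still skipping its own endpoint.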
Configuration Options - Instance Matrix (Legacy, Opt-In)
The instance-matrix configuration is a legacy advanced option. It is disabled by default (use_matrix = false in gv.cfg [Telemetry]). Set use_matrix = true to activate everything described in this section. Most deployments should use count-based configuration instead (see Step 3: (Optional) Configure Session Distribution).
Both standalone and HA deployments can customize how telemetry instances are distributed across HCAs.
Automatic Matrix Generation
When UFM starts in UTM mode, it automatically detects available HCAs and creates a default configuration (applies only with use_matrix=true).
Default Behavior:
-
Detects all available HCAs on the system
-
Creates 1 primary and 1 secondary telemetry instance on the first HCA
-
Configuration is stored in:
/opt/ufm/files/conf/utm/{hostname}_instances_matrix.json
Example Auto-Generated Matrix:
{
"mlx5_0": { "primary": 1, "secondary": 1 },
"mlx5_1": { "primary": 0, "secondary": 0 }
}
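The default behavior described above can be sketched in a few lines of Python. This is an illustration of the resulting structure only, not UFM's generator; the function name default_matrix is hypothetical and the HCA list is supplied by the caller rather than detected.

```python
import json


def default_matrix(hcas):
    """First HCA gets 1 primary and 1 secondary instance; the rest get none."""
    return {
        hca: {"primary": int(index == 0), "secondary": int(index == 0)}
        for index, hca in enumerate(hcas)
    }


if __name__ == "__main__":
    print(json.dumps(default_matrix(["mlx5_0", "mlx5_1"]), indent=2))
```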
Custom Configuration
For advanced deployments, customize the distribution of telemetry instances across HCAs using the generate_telemetry_config.sh script.
Auto-Detect and Create Configuration
Automatically detect HCAs and create a configuration file:
/opt/ufm/scripts/generate_telemetry_config.sh --auto-detect /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json
Manual Custom Configuration
Specify custom instance counts per HCA using the format HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT:
/opt/ufm/scripts/generate_telemetry_config.sh \
/opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json \
mlx5_0:2:1 \
mlx5_1:0:2 \
mlx5_2:1:0
This creates:
-
mlx5_0: 2 primary instances, 1 secondary instance
-
mlx5_1: 0 primary instances, 2 secondary instances
-
mlx5_2: 1 primary instance, 0 secondary instances
Example Custom Matrix:
{
"mlx5_0": { "primary": 2, "secondary": 1 },
"mlx5_1": { "primary": 0, "secondary": 2 },
"mlx5_2": { "primary": 1, "secondary": 0 }
}
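The mapping from HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT arguments to the matrix file can be sketched in Python as follows. This is not the implementation of generate_telemetry_config.sh; the function names parse_spec and build_matrix, and the validation rules (non-empty name, non-negative integer counts), are assumptions made for illustration.

```python
import json


def parse_spec(spec):
    """Parse one 'HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT' argument."""
    try:
        name, primary, secondary = spec.split(":")
        counts = {"primary": int(primary), "secondary": int(secondary)}
    except ValueError as exc:
        raise ValueError(
            f"expected HCA_NAME:PRIMARY_COUNT:SECONDARY_COUNT, got {spec!r}"
        ) from exc
    # Assumed rules: the name must be non-empty and counts non-negative.
    if not name or min(counts.values()) < 0:
        raise ValueError(f"invalid HCA spec: {spec!r}")
    return name, counts


def build_matrix(specs):
    """Build the matrix dictionary from a list of spec strings."""
    return dict(parse_spec(spec) for spec in specs)


if __name__ == "__main__":
    matrix = build_matrix(["mlx5_0:2:1", "mlx5_1:0:2", "mlx5_2:1:0"])
    print(json.dumps(matrix, indent=2))
```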
Validate Configuration
Verify your matrix file is correctly formatted:
/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json
Get Help
Display usage information and options:
/opt/ufm/scripts/generate_telemetry_config.sh --help
After modifying the matrix configuration file, you must restart UFM for changes to take effect.
Advanced Configuration Parameters
The following optional parameters in gv.cfg allow fine-tuning of telemetry behavior. Most users should use the default values.
| Parameter | Section | Default | Description |
|---|---|---|---|
| | | 30 | Sample rate (seconds) for primary telemetry instances |
| | | 300 | Sample rate (seconds) for secondary telemetry instances |
| | | true | Set to |
| primary_count | Telemetry | 1 | Number of primary (high-frequency) telemetry instances UFM asks UTM to create. HCAs are allocated round-robin by UTM. |
| secondary_count | Telemetry | 1 | Number of secondary (low-frequency) telemetry instances UFM asks UTM to create. |
| use_matrix | Telemetry | false | Advanced. When |
Note: Changing sample rates affects data frequency and may impact system performance. Consult with NVIDIA support before modifying these values in production environments.
Port Allocation
Default Port Allocation
-
Primary Telemetry: Base port 9001
-
Secondary Telemetry: Base port 9002
Multi-Instance Port Strategy
When multiple instances are configured, ports are allocated using an interleaved strategy:
-
Primary instances: Odd ports (9001, 9003, 9005, 9007...)
-
Secondary instances: Even ports (9002, 9004, 9006, 9008...)
Example - 2 primary + 2 secondary instances:
-
Primary: ports 9001, 9003
-
Secondary: ports 9002, 9004
Port Allocation with Proxy Mode
When enable_utm_proxy = true, ports 9001 and 9002 are reserved for the UTM HTTP proxy, and telemetry instances start from offset ports:
-
Primary instances: 9003, 9005, 9007, 9009...
-
Secondary instances: 9004, 9006, 9008, 9010...
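The interleaved scheme, with and without the proxy offset, can be expressed as a small Python function. This is a sketch derived from the port lists above, not code from UFM or UTM; the function name allocate_ports is hypothetical.

```python
def allocate_ports(primary_count, secondary_count, proxy_enabled=False):
    """Return the ports each instance type would use under the interleaved scheme."""
    # Primary starts at 9001 and secondary at 9002; stepping by 2 keeps
    # primaries on odd ports and secondaries on even ports.
    # With the proxy enabled, 9001 and 9002 are reserved, so shift by 2.
    offset = 2 if proxy_enabled else 0
    primary = [9001 + offset + 2 * i for i in range(primary_count)]
    secondary = [9002 + offset + 2 * i for i in range(secondary_count)]
    return {"primary": primary, "secondary": secondary}


if __name__ == "__main__":
    print(allocate_ports(2, 2))
    print(allocate_ports(2, 2, proxy_enabled=True))
```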
Troubleshooting
Verify Telemetry Status
Check if telemetry instances are running:
ps aux | grep -E "(utm|telemetry)" | grep -v grep
Check Matrix Configuration (legacy matrix mode only)
If use_matrix = true is set in gv.cfg [Telemetry], validate the instance matrix file:
/opt/ufm/scripts/generate_telemetry_config.sh --validate /opt/ufm/files/conf/utm/$(hostname)_instances_matrix.json
View Current Mode
Use the configuration script to display current settings:
/opt/ufm/files/scripts/configure_utm_mode.py --status
Check Lock Files
If telemetry startup hangs, check for stale lock files:
ls -la /tmp/utm_matrix_*.lock
UFM Clustered Telemetry on Kubernetes
This section describes how to configure and run Clustered Telemetry when UFM Enterprise and the UTM plugin are deployed as Kubernetes pods using Helm, how to change UTM configuration, how to enable token‑based authentication between UTM and UFM, and how to validate the setup end‑to‑end.
The Kubernetes deployment is a third deployment scenario alongside Standalone and HA Cluster. On Kubernetes, UFM and UTM run as separate pods and communicate through in‑cluster Services. This brings two operational differences the previous sections do not cover:
-
The operator does not edit gv.cfg or utm_config.ini directly on a host; configuration is injected through the Helm values file and Kubernetes resources.
-
UTM can no longer reach UFM on localhost; it must authenticate to UFM through the UFM Service (HTTPS on port 443) using a token.
Prerequisites
Before deploying the UTM plugin on Kubernetes, verify the following:
-
Kubernetes cluster (full Kubernetes or k3s) with at least one node that exposes InfiniBand HCAs. Run kubectl get nodes and make sure the nodes are Ready.
-
NVIDIA Network Operator installed in the cluster with a NicClusterPolicy whose status.state is ready. This exposes the nvidia.com/hostdev (or rdma/rdma_shared_device_a) resource on the nodes that will run UTM.
-
UFM Enterprise deployed in the cluster using its Helm chart. The UTM plugin reuses UFM's shared PVC, so UFM must be installed first. The UFM pod should be 1/1 Running before proceeding.
-
Shared PVC (ReadWriteMany) created by the UFM Helm chart (default name <ufm-release-name>-ufm-enterprise-files). For UTM configuration to survive helm uninstall / helm install cycles, the StorageClass should use reclaimPolicy: Retain.
-
Generic UFM plugin Helm chart (ufm-plugin-helm-template) available locally or as a packaged release artifact. This chart deploys any UFM plugin; UTM is deployed as one of its entries.
-
UTM plugin image loaded into the container runtime on every node that may run the UTM pod (the nodes with IB HCAs).
Note: The UTM Helm-based deployment on Kubernetes requires UFM to be installed via the UFM Helm chart. It is not compatible with Standalone or HA Cluster deployments described in the previous sections.
Architecture Overview
┌───────────────────────────────────────────────────────────────────────┐
│                          Kubernetes cluster                           │
│                                                                       │
│  ┌──────────────────┐                ┌──────────────────────────┐     │
│  │ UFM pod          │◄───────────────┤ UTM pod                  │     │
│  │ 443 / 80         │   HTTPS /      │ 8888 (management)        │     │
│  │ (Apache)         │   ufmRest/…    │ 9001–9010 (telemetry)    │     │
│  └──────────────────┘   + token      └──────────────────────────┘     │
│           │                                     │                     │
│           └─────── shared PVC (RWX) ────────────┘                     │
│              conf/plugins/utm, log/plugins/utm                        │
└───────────────────────────────────────────────────────────────────────┘
-
UFM reaches UTM through the UTM Service (<ufm-release>-plugin-utm) on port 8888 for management and 9001–9010 for telemetry collection.
-
UTM reaches UFM through the UFM Service (<ufm-release>) on port 443 (HTTPS through Apache), authenticated with a UFM API token.
-
Both pods mount the same UFM PVC under different sub‑paths. UTM's /config and /log directories live on the PVC and persist across pod restarts.
Step 1: Prepare the UTM Plugin Values File
Create a values file that will be passed to helm install. Below is the minimal working configuration; everything else in this guide is an incremental addition to this file.
# utm-plugin-values.yaml
ufmFullname: "ufm-ufm-enterprise"
rdma:
  resourceName: "nvidia.com/hostdev"   # Or rdma/rdma_shared_device_a
  resourceCount: "1"
plugins:
  entries:
    utm:
      image: mellanox/ufm-plugin-utm
      tag: "<version>"   # e.g. "1.25.1-2"
      port: 8888
      ports: [9001, 9002, 9003, 9004, 9005, 9006, 9007, 9008, 9009, 9010]
      strategy: Recreate
      startupProbe:
        httpGet: { path: /help, port: 8888 }
        failureThreshold: 30
        periodSeconds: 10
      livenessProbe:
        httpGet: { path: /help, port: 8888 }
        periodSeconds: 30
        timeoutSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet: { path: /status, port: 8888 }
        periodSeconds: 10
        timeoutSeconds: 10
        failureThreshold: 3
      env:
        - name: UTM_CONFIG_OVERRIDES
          value: |
            [general]
            log_level=info
      volumes:
        - name: utm-data
          emptyDir: {}
      volumeMounts:
        - name: utm-data
          mountPath: /data
| Field | Description |
|---|---|
| ufmFullname | UFM Helm release full name. Used to derive the PVC name and UFM Service DNS. Must match the name used when installing UFM. |
| image / tag | UTM container image and tag. The tag must correspond to an image loaded on all IB-capable nodes. |
| port | UTM management API port ( |
| ports | Telemetry instance ports. The generic plugin chart exposes them through the plugin Service. Primary instances use odd ports, secondary instances use even ports. |
| UTM_CONFIG_OVERRIDES | INI content merged into |
Step 2: Install the UTM Plugin
Install the Helm chart, passing the values file:
helm install ufm-plugins <plugin-chart-path-or-tgz> \
  -f utm-plugin-values.yaml \
  -n ufm-enterprise
If the plugin chart is already installed (for example because other plugins have been deployed), add the UTM entry with --reuse-values:
helm upgrade ufm-plugins <plugin-chart-path-or-tgz> \
  --reuse-values \
  -f utm-plugin-values.yaml \
  -n ufm-enterprise
Watch the pod come up:
kubectl get pods -n ufm-enterprise -l app=<ufm-release>-plugin-utm -w
Expected progression: Init:0/1 → PodInitializing → Running 1/1.
Note: The UTM init container performs a hard InfiniBand validation on every pod start. If no /dev/infiniband/uverbs* devices are visible to the container, the init container fails immediately with a clear error. Verify that the NVIDIA Network Operator and the NicClusterPolicy are configured correctly if the init container fails.
Step 3: Configure UTM
UTM configuration lives in /config/utm_config.ini inside the pod. On the first pod start, the init container builds this file from three layers in this order:
-
Image defaults: the factory utm_config.ini shipped in the container.
-
User seed: content of the UTM_CONFIG_OVERRIDES environment variable. Keys that are reserved for Kubernetes correctness (see table below) are rejected and logged as a warning.
-
Kubernetes invariants: [general] force_as_plugin=1, [high_availability] enable_ha=0, [http_proxy] enable=0. These are applied last and always win over the user layer.
After a successful first init, the init container writes the marker file /config/.utm_initialized on the PVC. On every subsequent pod start, the init container detects the marker and skips the rebuild — /config is now owned by the operator.
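The three-layer build described above can be sketched with Python's configparser. This is an illustration of the layering order only, not the init container's code. The function name build_config is hypothetical, the reserved set here contains only the three invariants named above (the real reserved list is longer), and the marker-file handling is left out.

```python
import configparser

# Assumed subset of reserved keys: only the invariants named in the text.
K8S_INVARIANTS = {
    ("general", "force_as_plugin"): "1",
    ("high_availability", "enable_ha"): "0",
    ("http_proxy", "enable"): "0",
}


def build_config(image_defaults, user_seed):
    """Merge image defaults, user seed, and invariants, in that order."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(image_defaults)  # layer 1: image defaults

    seed = configparser.ConfigParser(interpolation=None)
    seed.read_string(user_seed)  # layer 2: user seed
    rejected = []
    for section in seed.sections():
        for key, value in seed.items(section):
            if (section, key) in K8S_INVARIANTS:
                rejected.append(f"{section}/{key}")  # reserved: skip it
                continue
            if not cfg.has_section(section):
                cfg.add_section(section)
            cfg.set(section, key, value)

    # layer 3: invariants are written last, so they always win
    for (section, key), value in K8S_INVARIANTS.items():
        if not cfg.has_section(section):
            cfg.add_section(section)
        cfg.set(section, key, value)
    return cfg, rejected


if __name__ == "__main__":
    defaults = "[general]\nlog_level=info\n"
    seed = "[general]\nlog_level=debug\nforce_as_plugin=0\n"
    cfg, rejected = build_config(defaults, seed)
    print(dict(cfg["general"]), rejected)
```

Applying the invariants as a separate final pass means a reserved key ends up with the required value even if the rejection check were bypassed.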
| Path | Persistent? | Regenerated on first init? | Who owns it |
|---|---|---|---|
| /config | yes (PVC sub-path | yes, on first init only | operator after first install |
| /log | yes (PVC sub-path | no | UTM (logs and state) |
| /data | no ( | n/a | UTM (ephemeral telemetry data) |
| /etc/utm/secrets | yes (Secret, when mounted) | no | operator (via Kubernetes Secret) |
Seeding Configuration at First Install
Add INI content to UTM_CONFIG_OVERRIDES before running helm install. The content is merged into /config/utm_config.ini exactly once.
plugins:
  entries:
    utm:
      env:
        - name: UTM_CONFIG_OVERRIDES
          value: |
            [general]
            log_level=debug
            state_update_interval=30
            [ib_trap]
            enable_ib_trap=1
            ib_trap_hca=mlx5_1
Note: Changes to UTM_CONFIG_OVERRIDES after the first install are ignored on subsequent pod restarts. To apply a new seed, reset the configuration (see Resetting Configuration below).
Reserved Keys
The following keys cannot be overridden through UTM_CONFIG_OVERRIDES. Attempts are logged by the init container as [UTM Init] WARN: refusing user override of reserved key <section>/<key> and silently skipped. This protects the pod from configurations that would break Kubernetes integration.
| Section | Key | Why locked |
|---|---|---|
| | | Must match the Kubernetes Service (8888). |
| | | Must match the |
| | | Must match the PVC path UFM writes to signal topology changes. |
| | | Required to force UTM into HTTP mode inside Kubernetes. |
| | | Kubernetes handles HA; UTM HA must stay disabled. |
| | | Must be |
Changing Configuration After Install
Because /config is preserved after the first pod start, change any non‑reserved setting in place:
POD=$(kubectl get pod -n ufm-enterprise \
  -l app=<ufm-release>-plugin-utm \
  -o jsonpath='{.items[0].metadata.name}')
# Example: switch log level to debug
kubectl exec -n ufm-enterprise "$POD" -c utm -- \
  sed -i 's/^log_level=.*/log_level=debug/' /config/utm_config.ini
# Apply the change without recreating the pod (init does not rerun)
kubectl exec -n ufm-enterprise "$POD" -c utm -- \
  supervisorctl -c /config/supervisord.conf restart utm
The change persists on the PVC and survives pod restarts, helm upgrade, and PVC‑retained reinstalls.
Resetting Configuration
Delete the marker and recreate the pod. The init container runs a full first‑time init again, re-seeding /config from the current image defaults and the current value of UTM_CONFIG_OVERRIDES.
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise
Step 4: Configure Token-Based Authentication (Recommended)
In Kubernetes, UTM and UFM are separate pods. UTM cannot authenticate to UFM as it does in Standalone or HA Cluster deployments (through localhost with an internal header); it must use an API token over HTTPS against the UFM Service. This section walks through enabling token authentication.
If no token is configured, UTM falls back to basic authentication with the default UFM credentials. This is acceptable for evaluation but not recommended for production use.
Step 4.1: Mint a Token on UFM
Execute inside the UFM pod and capture the returned access_token:
UFM_POD=$(kubectl get pod -n ufm-enterprise \
  -l app.kubernetes.io/name=ufm-enterprise \
  -o jsonpath='{.items[0].metadata.name}')
TOKEN=$(kubectl exec -n ufm-enterprise "$UFM_POD" -- \
  curl -sk -XPOST https://127.0.0.1/ufmRest/app/tokens -u admin:<password> \
  | python3 -c "import sys,json;print(json.load(sys.stdin)['access_token'])")
echo "$TOKEN"
Replace admin:<password> with valid UFM credentials. The default is admin:123456; change this to match your environment.
Step 4.2: Store the Token as a Kubernetes Secret
kubectl create secret generic utm-ufm-token \
  --from-literal=token="$TOKEN" \
  -n ufm-enterprise
The Secret name utm-ufm-token is referenced by the volume entry in Step 4.3. Use the same name or update the volume entry to match.
Step 4.3: Mount the Secret into the UTM Pod
Extend the volumes and volumeMounts sections of utm-plugin-values.yaml to mount the Secret at /etc/utm/secrets. The optional: true flag guarantees the pod still starts if the Secret is absent, in which case UTM falls back to basic auth.
plugins:
  entries:
    utm:
      volumes:
        - name: utm-data
          emptyDir: {}
        - name: utm-ufm-token
          secret:
            secretName: utm-ufm-token
            optional: true
            items:
              - key: token
                path: ufm_token
      volumeMounts:
        - name: utm-data
          mountPath: /data
        - name: utm-ufm-token
          mountPath: /etc/utm/secrets
          readOnly: true
Step 4.4: Point UTM at the Token
Add a [ufm] block to UTM_CONFIG_OVERRIDES telling UTM which UFM Service to contact, on which port, and where to find the token file:
plugins:
  entries:
    utm:
      env:
        - name: UTM_CONFIG_OVERRIDES
          value: |
            [general]
            log_level=info
            [ufm]
            ufm=https://<ufm-release>
            ufm_port=443
            ufm_rest_api_port=443
            ufm_token_file=/etc/utm/secrets/ufm_token
Replace <ufm-release> with the UFM Service DNS name (for example, ufm-ufm-enterprise).
| Key | Value | Purpose |
|---|---|---|
| ufm | https://<ufm-release> | UFM Service DNS, HTTPS scheme. |
| ufm_port | 443 | Used for fabric snapshot requests in Kubernetes mode. |
| ufm_rest_api_port | 443 | Used for UFM REST calls. |
| ufm_token_file | /etc/utm/secrets/ufm_token | Path of the file mounted from the |
Apply the change and reinstall (or reset and restart if UTM is already running):
# First install
helm install ufm-plugins <plugin-chart> \
  -f utm-plugin-values.yaml \
  -n ufm-enterprise
# Already running: reset and restart so init re-seeds with the new [ufm] block
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise
Rotating the Token
Mint a new token, update the Secret, and restart the UTM pod. The init container does not need to rerun — only the UTM process needs to re‑read the token file.
NEW=$(kubectl exec -n ufm-enterprise "$UFM_POD" -- \
  curl -sk -XPOST https://127.0.0.1/ufmRest/app/tokens -u admin:<password> \
  | python3 -c "import sys,json;print(json.load(sys.stdin)['access_token'])")
kubectl create secret generic utm-ufm-token \
  --from-literal=token="$NEW" \
  --dry-run=client -o yaml | kubectl apply -n ufm-enterprise -f -
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise
Disabling Token Authentication
Delete the Secret and remove the [ufm] block from UTM_CONFIG_OVERRIDES, then reset the configuration so UTM falls back to basic authentication on the next pod start:
kubectl delete secret utm-ufm-token -n ufm-enterprise
# Edit utm-plugin-values.yaml to remove the [ufm] block
helm upgrade ufm-plugins <plugin-chart> \
  --reuse-values -f utm-plugin-values.yaml \
  -n ufm-enterprise
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise
Step 5: Validating the Setup
Work through these checks in order. Each check verifies a specific capability of the deployment.
5.1 Pod Health
kubectl get pod -n ufm-enterprise -l app=<ufm-release>-plugin-utm
Expected: READY 1/1, STATUS Running, RESTARTS 0.
5.2 Init Container Output
The init container log summarises the entire configuration pipeline. Retrieve it with:
kubectl logs -c utm-init -n ufm-enterprise -l app=<ufm-release>-plugin-utm
On a fresh install, expect output similar to:
[UTM Init] K8s mode detected — validating InfiniBand devices...
[UTM Init] IB devices found
[UTM Init] Seeding user config overrides from env var UTM_CONFIG_OVERRIDES
[UTM Init] user-override: general/log_level=info
[UTM Init] user-override: ufm/ufm=https://<ufm-release>
[UTM Init] user-override: ufm/ufm_rest_api_port=443
[UTM Init] user-override: ufm/ufm_port=443
[UTM Init] user-override: ufm/ufm_token_file=/etc/utm/secrets/ufm_token
[UTM Init] Applying K8s-required overrides
[UTM Init] k8s-override: general/force_as_plugin=1
[UTM Init] k8s-override: high_availability/enable_ha=0
[UTM Init] k8s-override: http_proxy/enable=0
[UTM Init] First-time init complete
On a subsequent restart the expected output is:
[UTM Init] K8s mode detected — validating InfiniBand devices...
[UTM Init] IB devices found
[UTM Init] /config/.utm_initialized present — /config already initialized; preserving existing configuration
[UTM Init] To reset, delete this marker and restart the pod.
5.3 Generated Configuration
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  cat /config/utm_config.ini
Confirm that:
-
[general] has force_as_plugin = 1, and your expected log_level.
-
[high_availability] has enable_ha = 0.
-
[http_proxy] has enable = 0.
-
If token authentication is enabled: [ufm] has ufm, ufm_port, ufm_rest_api_port, and ufm_token_file as configured.
5.4 UTM Management API
From inside the pod:
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  python3 -c "import urllib.request; \
  print(urllib.request.urlopen('http://127.0.0.1:8888/help').read().decode()[:200])"
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  python3 -c "import urllib.request; \
  print(urllib.request.urlopen('http://127.0.0.1:8888/status').read().decode())"
The /help endpoint returns a short help text. The /status endpoint returns a JSON document listing the telemetry groups and their instances.
5.5 Token Authentication
If token authentication is enabled, verify the token is loaded and UFM accepts it:
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  grep -E "auth_method|Loaded UFM auth token" /log/utm.log | tail -5
Expected entries include:
[utils.py] INFO Loaded UFM auth token from /etc/utm/secrets/ufm_token
auth_method: token
[ufm_api.py] INFO UFM API Request Status [200] in ... sec
If the log shows auth_method: basic even though a Secret was created, confirm the Secret is mounted inside the pod:
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  sh -c '[ -s /etc/utm/secrets/ufm_token ] && echo "token present" || echo "no token"'
5.6 Fabric Data Collection
UTM must be able to load the fabric snapshot from UFM. Expected log entries:
[fabric_snapshot.py] INFO Fabric state was updated
[fabric_snapshot.py] INFO Loaded FULL FABRIC: <N> guid/port instances from UFM: https://<ufm-release>
[fabric_delegator.py] INFO Total size of network: <M> ports
Check directly:
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  grep -E "Loaded FULL FABRIC|Fabric state was updated" /log/utm.log | tail -5
5.7 Telemetry Endpoints
Telemetry instances are exposed on ports 9001–9010 through the plugin Service. From inside the cluster, fetch a sample:
kubectl run telemetry-check --rm -it --restart=Never \
  --image=curlimages/curl -n ufm-enterprise -- \
  curl -s http://<ufm-release>-plugin-utm:9001/csv/cset/converted_enterprise | head -20
A non-empty CSV response with counter values confirms the primary telemetry instance is collecting. Repeat on port 9002 for the secondary instance.
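For scripted validation, the "non-empty CSV with counter values" check can be approximated in Python. The sketch below is an assumption-laden illustration: the helper name has_counter_rows is hypothetical, and it assumes the endpoint returns a header line followed by one data row per sampled port, which this guide does not specify.

```python
import csv
import io


def has_counter_rows(csv_text):
    """True when the payload has a header line plus at least one data row."""
    # Blank lines are skipped; the first remaining row is assumed to be the header.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    return len(rows) >= 2


if __name__ == "__main__":
    # Made-up sample payload; real column names will differ.
    sample = "timestamp,port_guid,counter\n1700000000,0x1,42\n"
    print(has_counter_rows(sample))
```

The function takes the response body as a string, so it can be fed the output of the curl command above or the body of any other HTTP client call.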
Troubleshooting
Pod Stuck in Init:Error
kubectl logs -c utm-init -n ufm-enterprise -l app=<ufm-release>-plugin-utm
Typical causes:
| Log snippet | Cause | Fix |
|---|---|---|
| | Node lacks IB access. | Verify |
| | PVC mount issue. | Verify PVC is |
| | Reserved key in | Harmless; the key is skipped. Remove it if you want a clean log. |
Pod Running but Not Ready
UTM is running but the readiness probe fails. Inspect the UTM application log:
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  tail -100 /log/utm.log
Common causes are missing fabric snapshot (UFM unreachable, wrong [ufm] settings) or the telemetry instances failing to bind to their ports.
UTM_CONFIG_OVERRIDES Changes Not Taking Effect
The init marker prevents re-seeding after first install. Reset:
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  rm -f /config/.utm_initialized
kubectl rollout restart deployment/<ufm-release>-plugin-utm -n ufm-enterprise
Token Authentication Returns 401
The token is either invalid, expired, or was not mounted. Verify:
-
Secret exists: kubectl get secret utm-ufm-token -n ufm-enterprise.
-
Token file is non-empty inside the pod: kubectl exec ... -- sh -c '[ -s /etc/utm/secrets/ufm_token ] && echo ok'.
-
Mint a new token on UFM and update the Secret.
UFM Service Unreachable from UTM
From inside the UTM pod:
kubectl exec -n ufm-enterprise deploy/<ufm-release>-plugin-utm -c utm -- \
  python3 -c "import urllib.request, ssl; \
  print(urllib.request.urlopen('https://<ufm-release>/ufmRest/app/ufm_version', \
  context=ssl._create_unverified_context()).read().decode())"
If this fails, the UFM Service name in [ufm] ufm is wrong, or the UFM pod is not Ready. Confirm:
kubectl get svc -n ufm-enterprise
kubectl get pods -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
Known Limitations
-
Changes to UTM_CONFIG_OVERRIDES after first install are ignored until the marker is reset. This is intentional: /config is treated as operator-owned state after the first init.
-
UTM image upgrades do not re-seed /config. If an upgraded image ships new defaults or adjusts Kubernetes invariants, reset the marker and restart the pod.
-
ClusterIP-only telemetry endpoints. Ports 9001–9010 are reachable from within the cluster. External monitoring tools additionally need a NodePort or Ingress; this is not covered by this guide.
-
No automatic token rotation. Tokens are long-lived. Rotate periodically by following the procedure in Rotating the Token.
-
Cross-reinstall persistence requires reclaimPolicy: Retain. With Delete, the PVC is wiped on helm uninstall, and the next helm install starts with an empty /config.
-
Non-XDR environments only. UTM (UFM Telemetry Manager) deployment on Kubernetes is supported only in non-XDR environments. Deploying UTM on Kubernetes in an XDR environment is not supported in this release.