UFM on Kubernetes | NVIDIA UFM Enterprise User Manual

Overview

UFM Enterprise supports deployment on Kubernetes clusters using Helm charts. This deployment method provides:

Declarative Configuration: Define your UFM deployment using Helm values
Simplified Operations: Use standard Kubernetes tools for deployment, upgrades, and management
Plugin Support: Deploy UFM plugins as separate pods via a dedicated Helm chart

Supported Environments

Kubernetes Version

Kubernetes 1.28 or later.

Node Operating Systems

UFM on Kubernetes supports the same operating systems as UFM Enterprise. See the Installation Notes for the complete list of supported operating systems.

Hardware Requirements

UFM on Kubernetes has the same hardware requirements as UFM Enterprise. See the Installation Notes for detailed specifications.

Prerequisites

Before deploying UFM Enterprise on Kubernetes, ensure the following requirements are met:

Kubernetes Cluster

Kubernetes cluster version 1.28 or later
kubectl configured with cluster access
Cluster admin permissions for installation

Helm

Helm 3.x installed on the management workstation:

helm version

Storage

A StorageClass that supports ReadWriteMany access mode
Minimum 10GB storage capacity

NVIDIA Network Operator (Required)

UFM cannot function without access to InfiniBand devices. The NVIDIA Network Operator must be installed and configured before installing the UFM Helm chart.

1. Install Network Operator:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install network-operator nvidia/network-operator \
  --namespace nvidia-network-operator \
  --create-namespace \
  --version 25.7.0 \
  --set nfd.enabled=true \
  --set ofedDriver.deploy=false \
  --set sriovDevicePlugin.deploy=true \
  --set secondaryNetwork.deploy=true \
  --set secondaryNetwork.multus.deploy=true \
  --wait --timeout 5m

Note: Set ofedDriver.deploy=false if OFED/DOCA drivers are already installed on the host.

2. Create NicClusterPolicy:

The sriovDevicePlugin must be enabled so the nodes expose nvidia.com/hostdev. The rdmaSharedDevicePlugin must be enabled to expose rdma/hca_shared, which is required for OpenSM to access InfiniBand character devices.

kubectl apply -f - <<EOF
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  secondaryNetwork:
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.1.0
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "devices": [],
              "drivers": [],
              "pfNames": [],
              "pciAddresses": [],
              "rootDevices": [],
              "linkTypes": [],
              "isRdma": true
            }
          }
        ]
      }
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "configList": [
          {
            "resourceName": "hca_shared",
            "rdmaHcaMax": 1000,
            "devices": ["all"]
          }
        ]
      }
EOF

Wait for the policy to be ready:

kubectl get nicclusterpolicy -o jsonpath='{.items[0].status.state}'
# Expected: ready
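
For scripted installs, the readiness check above can be wrapped in a polling loop. The following is a minimal sketch: the helper name `wait_for_nic_policy`, the default timeout, and the poll interval are illustrative assumptions and are not part of the Network Operator.

```shell
# Poll the NicClusterPolicy state until it reports "ready" or the timeout expires.
# wait_for_nic_policy is a hypothetical helper; timeout/interval defaults are assumed.
wait_for_nic_policy() {
  local timeout="${1:-300}" interval="${2:-5}" elapsed=0 state=""
  while [ "$elapsed" -lt "$timeout" ]; do
    # Same query as the manual check; errors are ignored so the loop keeps polling.
    state="$(kubectl get nicclusterpolicy -o jsonpath='{.items[0].status.state}' 2>/dev/null || true)"
    if [ "$state" = "ready" ]; then
      echo "NicClusterPolicy is ready"
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "Timed out; last state: ${state:-unknown}" >&2
  return 1
}
```

For example, `wait_for_nic_policy 600 10` would poll every 10 seconds for up to 10 minutes.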

3. Verify resources are available:

kubectl get nodes -o custom-columns=NAME:.metadata.name,HOSTDEV:.status.allocatable.nvidia\\.com/hostdev
# Expected: nvidia.com/hostdev should appear on the nodes

Note: 15b3 is the NVIDIA/Mellanox PCI vendor ID.

UFM License

Valid UFM Enterprise license file

Installation

Step 1: Set Up Storage

UFM requires ReadWriteMany storage. Make sure you have a persistent storage provisioner configured (e.g., NFS).

Step 2: Create Namespace and License ConfigMap

# Create the namespace
kubectl create namespace ufm-enterprise

# Create license ConfigMap
kubectl create configmap ufm-license \
  --from-file=<license-filename>.lic=/path/to/your/<license-filename>.lic \
  -n ufm-enterprise

Step 3: Install UFM with Helm

helm install ufm-enterprise <chart> \
  --namespace ufm-enterprise \
  --set storage.className=<storage-class> \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=<memory> \
  --set resources.requests.cpu=<cpu>

Note: The chart defaults to fabric_interface = net1 (provided by HostDeviceNetwork). No need to set config.fabricInterface unless your setup differs. Resource limits are optional — only requests are required.
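
The same required values can be kept in a values file instead of repeated --set flags. This is a sketch only: the file name `ufm-values.yaml`, the storage class `nfs-client`, and the memory and CPU figures are assumptions chosen for illustration. Size the requests according to the Installation Notes.

```shell
# Write the required values to a file instead of passing repeated --set flags.
# Storage class, memory, and CPU figures are placeholders (assumptions).
cat > ufm-values.yaml <<'EOF'
storage:
  className: nfs-client        # assumed StorageClass with ReadWriteMany support
image:
  pullPolicy: Never            # must be Never, IfNotPresent, or Always
license:
  existingConfigMap: ufm-license
resources:
  requests:
    memory: 16Gi               # assumed value
    cpu: "4"                   # assumed value
EOF

# Then install with the file (replace <chart> with your chart reference):
#   helm install ufm-enterprise <chart> --namespace ufm-enterprise -f ufm-values.yaml
```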

Step 4: Verify Installation

Watch the pod status:

kubectl get pods -n ufm-enterprise -w

Expected state transitions:

NAME                              READY   STATUS              AGE
ufm-ufm-enterprise-xxxxxxxxxx     0/1     Init:0/1            5s
ufm-ufm-enterprise-xxxxxxxxxx     0/1     PodInitializing     30s
ufm-ufm-enterprise-xxxxxxxxxx     0/1     Running             45s
ufm-ufm-enterprise-xxxxxxxxxx     1/1     Running             2m

Note: The pod shows 0/1 Running while the startup probe waits for UFM to fully initialize. This can take several minutes depending on the cluster size.

Verify the HostDeviceNetwork resource created by the chart:

kubectl get hostdevicenetwork ufm-hostdevice -o yaml
kubectl get network-attachment-definition -n ufm-enterprise ufm-hostdevice -o yaml

Configuration Reference

All configuration options are set via Helm values. Use --set key=value or a values file (-f values.yaml).

Namespace Configuration

Parameter	Description	Default
`namespace.create`	Create the namespace	`false`
`namespace.name`	Namespace name	`ufm-enterprise`

Image Configuration

Parameter	Description	Default
`image.repository`	Image repository	`docker.io/mellanox/ufm-enterprise`
`image.tag`	Image tag	`latest`
`image.pullPolicy`	Image pull policy (REQUIRED)	—
`imagePullSecrets`	Image pull secrets for private registries	`[]`
`versionCheckEnabled`	Check that UFM image version >= Helm chart appVersion at startup. Fails the init container with a clear error if the image is older than the chart.	`true`

Note: image.pullPolicy must be set to one of: Never, IfNotPresent, or Always.

UFM Configuration

Parameter	Description	Default
`config.mgmtInterface`	Management network interface name	`""` (uses gv.cfg)
`config.fabricInterface`	Fabric interface override	`""` (uses gv.cfg, default `net1`)
`config.httpPort`	Apache HTTP port	`80`
`config.httpsPort`	Apache HTTPS port	`443`

HostDevice Network Configuration

UFM requires a valid InfiniBand fabric interface inside the pod. The interface is provided by the NVIDIA Network Operator HostDeviceNetwork. The pod gets a net1 interface which is a link/infiniband interface (no IP address). UFM uses fabric_interface = net1.

Parameter	Description	Default
`hostDevice.createNetwork`	Create the HostDeviceNetwork resource as part of chart install	`true`
`hostDevice.networkName`	Name of the HostDeviceNetwork resource	`ufm-hostdevice`
`hostDevice.resourceName`	Resource name referenced by the HostDeviceNetwork	`hostdev`
`hostDevice.podResourceName`	Device-plugin resource requested by the pod	`nvidia.com/hostdev`
`hostDevice.resourceCount`	Number of host-device resources per container	`1`

Set hostDevice.createNetwork=false if you want to reference a pre-created HostDeviceNetwork instead of letting the chart create it.

RDMA Configuration

The RDMA shared device plugin grants cgroup-level access to /dev/infiniband/* character devices. This is required for OpenSM to function. The HostDevice resource alone makes device files visible but does not grant the necessary cgroup permissions.

Parameter	Description	Default
`rdma.resourceName`	RDMA device plugin resource name	`rdma/hca_shared`
`rdma.resourceCount`	Number of RDMA resources to request per container	`1`

Storage Configuration

Parameter	Description	Default
`storage.enabled`	Enable PVC creation	`true`
`storage.existingClaim`	Use existing PVC instead of creating one	`""`
`storage.className`	Storage class name (REQUIRED)	—
`storage.size`	Persistent volume size	`10Gi`
`storage.accessMode`	PVC access mode	`ReadWriteMany`

Resource Requests (REQUIRED) and Limits (OPTIONAL)

Parameter	Description	Default
`resources.requests.memory`	Memory request (REQUIRED)	—
`resources.requests.cpu`	CPU request (REQUIRED)	—
`resources.limits.memory`	Memory limit (optional)	— (no cap)
`resources.limits.cpu`	CPU limit (optional)	— (no cap)

License Configuration

Parameter	Description	Default
`license.existingConfigMap`	ConfigMap containing license file(s)	`""`
`license.existingSecret`	Secret containing license file(s)	`""`

SSL Certificate Configuration

Parameter	Description	Default
`ssl.enabled`	Enable custom SSL certificates	`false`
`ssl.existingSecret`	TLS Secret name (required when ssl.enabled=true)	`""`
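
One possible way to provide the TLS Secret is `kubectl create secret tls`. The sketch below assumes the Secret is named `ufm-tls` and lives in the `ufm-enterprise` namespace. The helper name `ufm_create_tls_secret` is hypothetical, and the certificate and key paths are supplied by the caller.

```shell
# Create a TLS Secret from an existing certificate and private key.
# ufm_create_tls_secret is a hypothetical helper; "ufm-tls" is an assumed default name.
ufm_create_tls_secret() {
  local cert="$1" key="$2" name="${3:-ufm-tls}"
  kubectl create secret tls "$name" \
    --cert="$cert" --key="$key" \
    --namespace ufm-enterprise
}

# Afterwards, point the chart at the Secret, for example:
#   helm upgrade ufm-enterprise <chart> -n ufm-enterprise --reuse-values \
#     --set ssl.enabled=true --set ssl.existingSecret=ufm-tls
```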

Startup Probe Configuration

The startup probe waits for UFM to fully initialize before the liveness probe starts.

Parameter	Description	Default
`startupProbe.enabled`	Enable startup probe	`true`
`startupProbe.initialDelaySeconds`	Initial delay before probe	`2`
`startupProbe.periodSeconds`	Probe interval	`10`
`startupProbe.timeoutSeconds`	Probe timeout	`2`
`startupProbe.failureThreshold`	Failures before giving up (10s × 30 = 5 min max)	`30`
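
The startup time budget follows from the probe values: roughly the initial delay plus the probe interval multiplied by the failure threshold. The helper below is an illustrative calculation only; the function name is an assumption, and the result is approximate because individual probe timeouts are not counted.

```shell
# Approximate upper bound on startup time before Kubernetes restarts the container:
#   initialDelaySeconds + periodSeconds * failureThreshold
startup_budget_seconds() {
  local initial_delay="$1" period="$2" failure_threshold="$3"
  echo $(( initial_delay + period * failure_threshold ))
}

startup_budget_seconds 2 10 30   # chart defaults: prints 302
startup_budget_seconds 2 10 60   # doubled failure threshold: prints 602
```

If UFM needs longer to initialize on a large fabric, raising `startupProbe.failureThreshold` (for example with --set startupProbe.failureThreshold=60) extends the budget without slowing down detection once UFM is up.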

Liveness Probe Configuration

The liveness probe checks if UFM is still running after startup completes.

Parameter	Description	Default
`livenessProbe.enabled`	Enable liveness probe	`true`
`livenessProbe.initialDelaySeconds`	Initial delay before probe	`0`
`livenessProbe.periodSeconds`	Probe interval	`10`
`livenessProbe.timeoutSeconds`	Probe timeout	`2`
`livenessProbe.failureThreshold`	Failures before restart	`3`

Service Configuration

Parameter	Description	Default
`service.enabled`	Enable Kubernetes Service	`true`
`service.type`	Service type: ClusterIP, NodePort, LoadBalancer	`ClusterIP`
`service.nodePort`	NodePort number (30000-32767), auto-assign if empty	`""`

Ingress Configuration

Parameter	Description	Default
`ingress.enabled`	Enable Ingress for external access	`false`
`ingress.className`	Ingress class name (e.g., nginx, traefik)	`""`
`ingress.host`	Hostname for the Ingress	`""`
`ingress.annotations`	Ingress annotations (controller-specific)	`{}`
`ingress.tls.secretName`	TLS secret name for HTTPS	`""`

Scheduling Configuration

Parameter	Description	Default
`nodeSelector`	Node labels for pod scheduling	`{}`
`tolerations`	Tolerations for pod scheduling	`[]`
`affinity`	Affinity rules for pod scheduling	`{}`

Note: When the watchdog is enabled, the chart automatically adds a nodeAffinity rule to exclude nodes labeled unhealthy. If you also provide affinity.nodeAffinity, the watchdog expression is injected into each of your nodeSelectorTerms, preserving OR semantics between terms while ANDing the watchdog rule within each.
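
To illustrate the injection described in the note, suppose your values define two alternative node selector terms. The label key `example.com/rack`, its values, and the file name are hypothetical. The `DoesNotExist` operator shown in the trailing comment is an assumption about the shape of the injected expression and may differ from what the chart actually renders.

```shell
# User-supplied affinity with two OR-ed nodeSelectorTerms (label key/values are hypothetical).
cat > ufm-affinity-values.yaml <<'EOF'
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: example.com/rack
              operator: In
              values: ["a"]
        - matchExpressions:
            - key: example.com/rack
              operator: In
              values: ["b"]
EOF

# With the watchdog enabled, each term conceptually gains one more expression, so the
# rule reads: (rack=a AND not unhealthy) OR (rack=b AND not unhealthy).
# Assumed shape of the injected expression:
#   - key: ufm.nvidia.com/unhealthy
#     operator: DoesNotExist
```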

Deployment Configuration

Parameter	Description	Default
`deployment.enabled`	Enable the UFM deployment	`true`
`deployment.terminationGracePeriodSeconds`	Time to wait for graceful shutdown	`30`

Config File Overrides

Parameter	Description	Default
`configFiles`	Map of file path to content; override with `--set-file`	`{}`

Override chart-bundled config files without extracting the chart. Escape dots in filenames with a backslash (\.). For nested paths, use path segments as keys (e.g., configFiles.opensm.opensm\.conf for opensm/opensm.conf).
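
The key-escaping rule can be sketched as a small helper that turns a relative config file path into the matching --set-file key. The function name `configfiles_key` is hypothetical and is shown only to make the rule concrete.

```shell
# Build a --set-file key from a relative config path:
# escape literal dots first, then turn path separators into key separators.
configfiles_key() {
  printf 'configFiles.%s\n' "$(printf '%s' "$1" | sed -e 's/\./\\./g' -e 's#/#.#g')"
}

configfiles_key gv.cfg               # prints configFiles.gv\.cfg
configfiles_key opensm/opensm.conf   # prints configFiles.opensm.opensm\.conf
```

The output can then be passed to Helm in single quotes, for example --set-file 'configFiles.opensm.opensm\.conf=/path/to/opensm.conf'.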

User Scripts Configuration

Parameter	Description	Default
`userScripts`	Map of script filename to content; inject via `--set-file`	`{}`

Mount custom scripts as executable files inside the UFM pod. Scripts are mounted at /opt/ufm/scripts/user-scripts/ with mode 0755. Inject via --set-file (escape dots in filenames with \.). When no userScripts are provided, no ConfigMap or volume mount is created.

Watchdog Operator Configuration

The watchdog operator monitors UFM pods for crash loops and automatically labels problematic nodes to enable rescheduling to healthy nodes. It is enabled by default.

The operator handles two types of failures:

Failover signal: UFM's health monitoring detects a critical failure and creates a failover flag. The operator detects the flag and immediately labels the node and migrates the pod, without applying a threshold or waiting period.
Process crash: A UFM process dies. The operator counts restarts within a sliding window and migrates only if the threshold is reached.

Parameter	Description	Default
`watchdog.enabled`	Enable watchdog operator	`true`
`watchdog.image.tag`	Watchdog image tag	`.Chart.AppVersion`
`watchdog.restartThreshold`	Restarts before action (process crashes only)	`3`
`watchdog.timeWindowSeconds`	Time window for counting restarts (seconds)	`120`
`watchdog.maxLabeledNodes`	Max nodes to label unhealthy (0 = auto)	`0`
`watchdog.replicas`	Operator replicas for HA	`2`
`watchdog.unhealthyLabelKey`	Label key applied to unhealthy nodes	`ufm.nvidia.com/unhealthy`

maxLabeledNodes=0 (default): Auto-calculates as total_nodes - 1, ensuring at least one node always remains schedulable.
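
The two rules above (restart counting within a sliding window, and the automatic cap on labeled nodes) can be sketched as follows. This is an illustration of the described behavior, not the operator's code: the function names are hypothetical, and treating "threshold reached" as "count greater than or equal to threshold" is an assumption.

```shell
# Decide whether restarts inside the sliding window reach the threshold.
# Arguments: now, threshold, window seconds, then restart timestamps (epoch seconds).
should_migrate() {
  local now="$1" threshold="$2" window="$3" count=0 ts
  shift 3
  for ts in "$@"; do
    # Count only restarts that happened within the last <window> seconds.
    if [ $(( now - ts )) -le "$window" ]; then
      count=$(( count + 1 ))
    fi
  done
  [ "$count" -ge "$threshold" ]
}

# Effective cap on labeled nodes: 0 means "auto", i.e. total_nodes - 1.
effective_max_labeled_nodes() {
  local configured="$1" total_nodes="$2"
  if [ "$configured" -eq 0 ]; then
    echo $(( total_nodes - 1 ))
  else
    echo "$configured"
  fi
}

effective_max_labeled_nodes 0 4   # prints 3
if should_migrate 1000 3 120 900 950 990; then echo "migrate"; fi   # prints migrate
```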

Plugin Watchdog:

The watchdog also monitors plugin pods (deployed by the UFM Plugins Helm chart) for crash loops. Plugins are identified by the ufm.nvidia.com/watchdog-scope=plugin label. Each plugin gets its own per-plugin unhealthy label (e.g., ufm.nvidia.com/fast_api-unhealthy), so one crashing plugin does not affect other plugins or UFM scheduling.

Plugin pods can override chart-level thresholds via annotations:

ufm.nvidia.com/watchdog-restart-threshold
ufm.nvidia.com/watchdog-time-window-seconds

Parameter	Description	Default
`watchdog.plugins.enabled`	Enable plugin monitoring	`true`
`watchdog.plugins.labelSelector`	Label selector for plugin pods	`ufm.nvidia.com/watchdog-scope=plugin`
`watchdog.plugins.restartThreshold`	Restarts before action	`3`
`watchdog.plugins.timeWindowSeconds`	Time window for counting restarts	`120`
`watchdog.plugins.unhealthyLabelKeyTemplate`	Label key template	`ufm.nvidia.com/{pluginName}-unhealthy`
`watchdog.plugins.maxLabeledNodes`	Max nodes to label per plugin (0=auto)	`0`

Observability: The operator reports status through Kubernetes Events (e.g., NodeLabeledUnhealthy, MaxUnhealthyNodesReached) and Prometheus metrics exposed on port watchdog.metricsPort (default 8080).

Recovering a node: Remove the unhealthy label after the issue is resolved: kubectl label node <node-name> ufm.nvidia.com/unhealthy-
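
Before removing labels, it can help to see which nodes are currently marked. The helpers below are a minimal sketch with hypothetical function names. They assume the default label key `ufm.nvidia.com/unhealthy`, so adjust the key if `watchdog.unhealthyLabelKey` was changed.

```shell
# List nodes that currently carry the unhealthy label (label-existence selector).
ufm_list_unhealthy_nodes() {
  kubectl get nodes -l ufm.nvidia.com/unhealthy
}

# Remove the label from one node once the underlying issue is fixed.
ufm_recover_node() {
  kubectl label node "$1" ufm.nvidia.com/unhealthy-
}
```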

Plugin Deployment

UFM plugins are deployed via a separate Helm chart.

Prerequisites

UFM Enterprise must already be installed in the cluster
ufmFullname: Must match your UFM release name (e.g., ufm-ufm-enterprise). Required.
Shared PVC: The plugin chart uses the same PVC as UFM. Default claim name is {ufmFullname}-files.
UFM ConfigMap: A ConfigMap named {ufmFullname}-config with key UFM_VERSION must exist (created by the UFM Enterprise chart).
RDMA (if needed): If plugins use InfiniBand, set rdma.resourceCount and ensure the cluster has the RDMA device plugin.

Plugin Chart Values Reference

Parameter	Description	Required
`ufmFullname`	Full name of the UFM Enterprise release (e.g., `ufm-ufm-enterprise`)	Yes
`namespace`	Kubernetes namespace. Can be auto-discovered via `namespaceSearchList`. Falls back to `ufm-enterprise`.	No
`namespaceSearchList`	List of namespaces to search for `{ufmFullname}-config` at install time	No
`existingClaim`	PVC claim name for UFM files. Default: `{ufmFullname}-files`	No
`configMapName`	ConfigMap name for `plugins.yaml`. Default: `{ufmFullname}-plugins`	No
`rdma.resourceName`	RDMA resource name (e.g., `rdma/hca_shared`)	No
`rdma.resourceCount`	Number of RDMA resources per plugin pod; default `"0"`. Set to `"1"` for InfiniBand plugins.	No
`watchdog.enabled`	Enable watchdog monitoring for plugin pods	No (default: `true`)
`watchdog.restartThreshold`	Chart-level default for max restarts before marking node unhealthy	No (default: `3`)
`watchdog.timeWindowSeconds`	Chart-level default time window for counting restarts	No (default: `120`)
`plugins.defaultResources`	Default `requests`/`limits` when a plugin does not set `resources`	No
`plugins.entries`	Map of plugin definitions keyed by plugin name (see below)	Yes
`podSecurityContext`	Pod-level securityContext (e.g., `runAsNonRoot: true`)	No
`affinity`	Pod affinity rules	No
`tolerations`	Pod tolerations	No
`nodeSelector`	Pod node selector	No
`imagePullSecrets`	Image pull secrets	No

Plugin Entry Fields (`plugins.entries.<name>`)

Each plugin is a map entry keyed by its canonical name (use underscores, e.g., log_streamer). Required fields are image and tag.

Field	Description	Required
`enabled`	Set to `false` to skip this plugin. Default `true`.	No
`image`	Container image repository (no tag)	Yes
`tag`	Image tag	Yes
`imagePullPolicy`	e.g., `IfNotPresent` or `Always`	No (default: `IfNotPresent`)
`port`	Main TCP port the plugin listens on. Written to `plugins.yaml` so UFM can reach the plugin.	No
`ports`	Additional container ports (list of integers)	No
`healthEndpoint`	HTTP path for liveness probe (e.g., `/health`). Uses native httpGet.	No
`healthPort`	Port for liveness httpGet probe; defaults to `port`	No
`host`	Host written into `plugins.yaml`. Default: in-cluster DNS `{ufmFullname}-plugin-{name}.{namespace}.svc.cluster.local`	No
`rdma`	Per-plugin RDMA override: `{ resourceName, resourceCount }`	No
`resources`	`requests`/`limits` for this plugin; overrides `plugins.defaultResources`	No
`startupProbe`	Full startup probe spec	No
`livenessProbe`	Full liveness probe spec. Overrides the chart default (httpGet or tcpSocket).	No
`disableLivenessProbe`	Set to `true` to omit liveness probe entirely	No
`readinessProbe`	Full readiness probe spec. No default.	No
`mountHealthScripts`	Mount chart health-check script at `/health-scripts`	No (default: `false`)
`extraCapabilities`	Additional Linux capabilities (e.g., `["SYS_PTRACE"]`)	No
`env`	Extra environment variables for the main container	No
`volumes`	Extra volumes for the pod	No
`volumeMounts`	Extra volumeMounts for the main container	No
`runInitContainer`	When `false`, the init container is omitted. Default `true`.	No
`strategy`	Deployment strategy: `Recreate` (default) or `RollingUpdate`	No
`watchdog`	Per-plugin watchdog override: `{ enabled, restartThreshold, timeWindowSeconds }`	No
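
A minimal plugin values file might look like the sketch below. The image repository, tag, and port are hypothetical placeholders, as is the file name. Take the real values from the plugin's own documentation.

```shell
# Minimal plugin chart values with a single plugin entry.
# Image, tag, and port are hypothetical; replace them with the plugin's real values.
cat > ufm-plugins-values.yaml <<'EOF'
ufmFullname: ufm-ufm-enterprise      # must match your UFM release full name
namespace: ufm-enterprise
plugins:
  entries:
    log_streamer:
      image: registry.example.com/ufm-plugin-log-streamer   # hypothetical repository (no tag)
      tag: "1.0.0"                                           # hypothetical tag
      port: 8080                                             # hypothetical port
      healthEndpoint: /health
EOF

# Then install the plugin chart (replace <plugin-chart> with your chart reference):
#   helm install ufm-plugins <plugin-chart> -n ufm-enterprise -f ufm-plugins-values.yaml
```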

What the Plugin Chart Generates

ClusterIP Service per plugin (when port and/or ports is set): enables in-cluster DNS so UFM and other services can reach the plugin
Deployment per plugin: one Deployment per enabled entry, using Recreate strategy by default. Includes init container, shared PVC mounts, optional RDMA resources
ConfigMap plugins.yaml: consumed by UFM with plugin name, host, port, tag for each enabled plugin
Watchdog labels and annotations: when watchdog is enabled, each plugin pod gets discovery labels and threshold annotations

Incremental Plugin Upgrades

The map-based plugins.entries model lets you upgrade a single plugin without restating every other plugin.

Scenario	Recommended flag	Effect
Upgrade one plugin's image/config	`--reuse-values`	Keeps all other plugins as-is
Add a new plugin to an existing release	`--reuse-values`	Merges the new entry into existing
Disable a single plugin	`--reuse-values`	Only changes that plugin's `enabled` flag
Upgrade the chart version itself	`--reset-values`	Ensures new chart defaults apply cleanly
Full reconcile of all plugins	`--reset-values`	Sets the authoritative desired state
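
As an example of the first scenario, a single plugin's tag can be changed while every other value in the release is kept. The helper below is a sketch: the function name is hypothetical, the release name `ufm-plugins` and namespace `ufm-enterprise` are assumed, and the chart reference is supplied by the caller.

```shell
# Upgrade only one plugin's tag, keeping all other release values as-is.
# upgrade_plugin_tag is a hypothetical helper; the chart reference comes from the caller.
upgrade_plugin_tag() {
  local chart="$1" plugin="$2" tag="$3"
  helm upgrade ufm-plugins "$chart" \
    --namespace ufm-enterprise \
    --reuse-values \
    --set-string "plugins.entries.${plugin}.tag=${tag}"
}
```

A call such as `upgrade_plugin_tag <plugin-chart> log_streamer 1.2.3` would touch only the `log_streamer` entry. Using --set-string keeps version-like tags from being parsed as numbers.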

Plugin Manager Behavior on K8s

The Plugin Manager in K8s mode is read-only:

The Plugin Manager UI displays current plugin state but all modification operations are blocked
Plugin Manager REST API and shell operations only allow GET/read actions — write operations are blocked
All plugin lifecycle management (deploy, upgrade, disable) must be done via the Helm chart

Custom Configuration Files

The Helm chart includes default UFM configuration files that can be customized.

Customizing Config Files

Use --set-file (configFiles)

Override chart-bundled files. Escape dots in filenames with a backslash (\.). For nested paths, use path segments as keys.

# Override a top-level file
helm install ufm-enterprise ./ufm-enterprise \
  --set-file 'configFiles.gv\.cfg=/path/to/my-gv.cfg' \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never

# Override a file in a subdirectory
helm install ufm-enterprise ./ufm-enterprise \
  --set-file 'configFiles.opensm.opensm\.conf=/path/to/opensm.conf' \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never

Configuration Priority

Configuration is applied in this order (later wins):

Base install/upgrade — UFM default config files
Helm chart config files — Files from files/conf/ directory
configFiles (values / --set-file) — Overrides chart file content when the path is set
Helm values — config.mgmtInterface, config.fabricInterface (if provided)

Important Notes

Config files are applied after the UFM upgrade/install process completes
File ownership and permissions are preserved for existing files
New files are created with ufmapp:ufmapp ownership
helm upgrade with modified config files or configFiles overrides triggers a pod restart automatically
Pod restarts skip config application if nothing changed (checksum-based)

Operations

Start/Stop UFM

Stop UFM:

kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=0

Verify UFM is stopped:

kubectl get pods -n ufm-enterprise

Start UFM:

kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1

Wait for the pod to be ready:

kubectl get pods -n ufm-enterprise -w
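
For automation, the start sequence above can be combined with `kubectl wait` instead of watching the pod list by hand. This is a sketch with a hypothetical function name and an assumed default timeout.

```shell
# Scale UFM back up and block until the pod reports Ready.
# ufm_start_and_wait is a hypothetical helper; the 600s default timeout is assumed.
ufm_start_and_wait() {
  local timeout="${1:-600s}"
  kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1
  # Note: kubectl wait can fail with "no matching resources found" if the pod
  # has not been created yet; rerun the function if that happens.
  kubectl wait pod -n ufm-enterprise -l app=ufm-enterprise \
    --for=condition=Ready --timeout="$timeout"
}
```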

View Logs

Container Logs:

# Follow logs
kubectl logs -n ufm-enterprise -l app=ufm-enterprise -f

# Previous container logs (after crash)
kubectl logs -n ufm-enterprise -l app=ufm-enterprise --previous

UFM Application Logs:

# kubectl exec does not accept a label selector, so resolve the pod name first
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')

# List log files
kubectl exec -n ufm-enterprise "$POD" -- ls -la /opt/ufm/files/log/

# View specific log
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/console.log

# Tail a log
kubectl exec -n ufm-enterprise "$POD" -- tail -100 /opt/ufm/files/log/ufmhealth.log

Access UFM UI and REST API

https://<ingress-host>/ufm_web/

REST API

# Get UFM version
curl -k -u <user>:<password> https://<host>/ufmRest/app/ufm_version

# List resources
curl -k -u <user>:<password> https://<host>/ufmRest/resources/systems

Uninstallation

Step 1: Remove Plugins (if installed)

If you deployed plugins via the UFM Plugins Helm chart, uninstall them first:

helm uninstall ufm-plugins -n ufm-enterprise

Step 2: Remove UFM

helm uninstall ufm-enterprise -n ufm-enterprise

Warning: This deletes all UFM resources including the PersistentVolumeClaim and data.

Resource Cleanup

Remove all resources (entire namespace):

kubectl delete namespace ufm-enterprise

Remove specific resources only:

kubectl delete pvc -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
kubectl delete configmap -n ufm-enterprise ufm-license
kubectl delete secret -n ufm-enterprise ufm-tls

Monitoring

Kubernetes Probes

UFM uses two probes:

Probe	Purpose	Success condition
Startup	Wait for UFM initialization	REST API returns HTTP 200
Liveness	Detect failures	UfmHealthRunner running, no failover flag

Watchdog Operator Monitoring

The Watchdog Operator provides automatic failover. When UFM encounters a critical failure or a crash loop:

The operator labels the current node as unhealthy
Kubernetes reschedules the UFM pod to a healthy node
The same process applies to plugin pods (with per-plugin labels)

Monitoring Commands

Verify Probe Status

kubectl describe pod -n ufm-enterprise -l app=ufm-enterprise | grep -A 5 -E "Liveness:|Startup:"

Verify UFM Processes:

POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- ps aux

Check UFM Health Log:

POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/ufmhealth.log

Known Limitations

Limitation	Description	Impact / Workaround
Single Pod	Only one UFM replica supported	No horizontal scaling
sysdump Unavailable	sysdump collector doesn't work in K8s	Use manual log collection
Recreate Strategy	Rolling updates not supported	Downtime during upgrades
Plugin UI	Plugins with web UI are not supported in K8s	—
Plugin Manager Read-Only	Plugin manager UI and REST API are read-only; write operations are blocked	Use Helm chart for plugin lifecycle management
Plugin Port Configuration	User must manually specify plugin ports	Refer to plugin documentation for port values
Watchdog Label Cleanup	Watchdog does not automatically remove unhealthy labels from nodes after recovery	Manual label removal required (`kubectl label node <node> ufm.nvidia.com/unhealthy-`)
No Upgrade from 6.24.2	This version is not compatible with the previous K8s deployment	Fresh install required

Version Changes Since UFM 6.24.2

Area	UFM 6.24.2	Current version
Network	`hostNetwork: true`	HostDevice via NVIDIA Network Operator (no host network)
Security	Privileged container required	Non-privileged container
Watchdog	N/A	Watchdog Operator — automatic failover, node labeling, plugin monitoring
Plugins	Deployed via UFM Helm chart (`plugins.items[]`)	Separate Helm chart in UFM SDK repo (`ufm-plugin-helm-template`)
Config Overrides	Edit chart files before install only	Also supports `--set-file configFiles.*` without extracting chart
Resource Limits	Both requests and limits required	Requests required, limits optional
Service	Disabled by default	Enabled (ClusterIP) by default
User Scripts	N/A	ConfigMap mount at `/opt/ufm/scripts/user-scripts/`
SSL Certificates	N/A	Custom SSL cert support via TLS Secret
Version Check	N/A	Init container verifies image version >= chart appVersion
Plugin Manager	Full operations available	Read-only on K8s — write operations blocked
Upgrade from 6.24.2	—	Not supported — fresh install required

Last updated: June 03, 2026
