Overview
UFM Enterprise supports deployment on Kubernetes clusters using Helm charts. This deployment method provides:
- Declarative Configuration: Define your UFM deployment using Helm values
- Simplified Operations: Use standard Kubernetes tools for deployment, upgrades, and management
- Plugin Support: Deploy UFM plugins as separate pods via a dedicated Helm chart
Supported Environments
Kubernetes Version
Kubernetes 1.28 or later.
Node Operating Systems
UFM on Kubernetes supports the same operating systems as UFM Enterprise. See the Installation Notes for the complete list of supported operating systems.
Hardware Requirements
UFM on Kubernetes has the same hardware requirements as UFM Enterprise. See the Installation Notes for detailed specifications.
Prerequisites
Before deploying UFM Enterprise on Kubernetes, ensure the following requirements are met:
Kubernetes Cluster
- Kubernetes cluster version 1.28 or later
- kubectl configured with cluster access
- Cluster admin permissions for installation
Helm
Helm 3.x installed on the management workstation:
helm version
Storage
- A StorageClass that supports ReadWriteMany access mode
- Minimum 10GB storage capacity
NVIDIA Network Operator (Required)
UFM cannot function without access to InfiniBand devices. The NVIDIA Network Operator must be installed and configured before installing the UFM Helm chart.
1. Install Network Operator:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install network-operator nvidia/network-operator \
--namespace nvidia-network-operator \
--create-namespace \
--version 25.7.0 \
--set nfd.enabled=true \
--set ofedDriver.deploy=false \
--set sriovDevicePlugin.deploy=true \
--set secondaryNetwork.deploy=true \
--set secondaryNetwork.multus.deploy=true \
--wait --timeout 5m
Note: Set ofedDriver.deploy=false if OFED/DOCA drivers are already installed on the host.
2. Create NicClusterPolicy:
The sriovDevicePlugin must be enabled so the nodes expose nvidia.com/hostdev. The rdmaSharedDevicePlugin must be enabled to expose rdma/hca_shared, which is required for OpenSM to access InfiniBand character devices.
kubectl apply -f - <<EOF
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  secondaryNetwork:
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.1.0
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "devices": [],
              "drivers": [],
              "pfNames": [],
              "pciAddresses": [],
              "rootDevices": [],
              "linkTypes": [],
              "isRdma": true
            }
          }
        ]
      }
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "configList": [
          {
            "resourceName": "hca_shared",
            "rdmaHcaMax": 1000,
            "devices": ["all"]
          }
        ]
      }
EOF
Wait for the policy to be ready:
kubectl get nicclusterpolicy -o jsonpath='{.items[0].status.state}'
# Expected: ready
3. Verify resources are available:
kubectl get nodes -o custom-columns=NAME:.metadata.name,HOSTDEV:.status.allocatable.nvidia\\.com/hostdev
# Expected: nvidia.com/hostdev should appear on the nodes
Note: 15b3 is the NVIDIA/Mellanox PCI vendor ID.
UFM License
- Valid UFM Enterprise license file
Installation
Step 1: Set Up Storage
UFM requires ReadWriteMany storage. Make sure you have a persistent storage provisioner configured (e.g., NFS).
Step 2: Create Namespace and License ConfigMap
# Create the namespace
kubectl create namespace ufm-enterprise
# Create license ConfigMap
kubectl create configmap ufm-license \
--from-file=<license-filename>.lic=/path/to/your/<license-filename>.lic \
-n ufm-enterprise
Step 3: Install UFM with Helm
helm install ufm-enterprise <chart> \
--namespace ufm-enterprise \
--set storage.className=<storage-class> \
--set image.pullPolicy=Never \
--set license.existingConfigMap=ufm-license \
--set resources.requests.memory=<memory> \
--set resources.requests.cpu=<cpu>
Note: The chart defaults to fabric_interface = net1 (provided by HostDeviceNetwork). No need to set config.fabricInterface unless your setup differs. Resource limits are optional — only requests are required.
Step 4: Verify Installation
Watch the pod status:
kubectl get pods -n ufm-enterprise -w
Expected state transitions:
NAME READY STATUS AGE
ufm-ufm-enterprise-xxxxxxxxxx 0/1 Init:0/1 5s
ufm-ufm-enterprise-xxxxxxxxxx 0/1 PodInitializing 30s
ufm-ufm-enterprise-xxxxxxxxxx 0/1 Running 45s
ufm-ufm-enterprise-xxxxxxxxxx 1/1 Running 2m
Note: The pod shows 0/1 Running while the startup probe waits for UFM to fully initialize. This can take several minutes depending on the cluster size.
Verify the HostDeviceNetwork resource created by the chart:
kubectl get hostdevicenetwork ufm-hostdevice -o yaml
kubectl get network-attachment-definition -n ufm-enterprise ufm-hostdevice -o yaml
Configuration Reference
All configuration options are set via Helm values. Use --set key=value or a values file (-f values.yaml).
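As an illustration of the values-file route, the sketch below writes a file that mirrors the `--set` flags from Step 3. The file name `ufm-values.yaml`, the StorageClass name, and the memory and CPU figures are placeholders chosen for the example, not sizing guidance.

```shell
# Write a values file equivalent to the --set flags used in Step 3.
# Every value below is a placeholder; adjust it to the target cluster.
cat > ufm-values.yaml <<'EOF'
storage:
  className: nfs-client        # placeholder: a StorageClass with ReadWriteMany support
image:
  pullPolicy: IfNotPresent     # one of Never, IfNotPresent, Always
license:
  existingConfigMap: ufm-license
resources:
  requests:
    memory: 16Gi               # placeholder value
    cpu: "4"                   # placeholder value
EOF

# The file can then replace the individual flags:
#   helm install ufm-enterprise <chart> --namespace ufm-enterprise -f ufm-values.yaml
```

Keeping the settings in a file makes later `helm upgrade` runs repeatable, since the same file can be passed again.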
Namespace Configuration
| Parameter | Description | Default |
|---|---|---|
| | Create the namespace | |
| | Namespace name | |
Image Configuration
| Parameter | Description | Default |
|---|---|---|
| | Image repository | |
| | Image tag | |
| | Image pull policy (REQUIRED) | - |
| | Image pull secrets for private registries | |
| | Check that UFM image version >= Helm chart appVersion at startup. Fails the init container with a clear error if the image is older than the chart. | |
Note: image.pullPolicy must be set to one of: Never, IfNotPresent, or Always.
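The chart's init container logic is not reproduced here. As a rough illustration of what an "image version >= chart appVersion" comparison can look like, the sketch below relies on `sort -V` from GNU coreutils. The function names and the sample version strings are invented for the example.

```shell
# Hypothetical helper: succeeds when version $1 is greater than or equal to $2.
# Uses GNU sort's version ordering, so 1.2.10 ranks above 1.2.9.
version_ge() {
    # The lower version sorts first; if that is $2, then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical wrapper that reports the result the way an init check might.
check_image_version() {
    if version_ge "$1" "$2"; then
        echo "OK: image $1 >= chart appVersion $2"
    else
        echo "ERROR: image $1 is older than chart appVersion $2" >&2
        return 1
    fi
}

check_image_version 1.2.10 1.2.9
```

A plain string comparison would rank `1.2.10` below `1.2.9`, which is why a version-aware sort is used in this sketch.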
UFM Configuration
| Parameter | Description | Default |
|---|---|---|
| | Management network interface name | |
| | Fabric interface override | |
| | Apache HTTP port | |
| | Apache HTTPS port | |
HostDevice Network Configuration
UFM requires a valid InfiniBand fabric interface inside the pod. The interface is provided by the NVIDIA Network Operator HostDeviceNetwork. The pod gets a net1 interface which is a link/infiniband interface (no IP address). UFM uses fabric_interface = net1.
| Parameter | Description | Default |
|---|---|---|
| | Create the HostDeviceNetwork resource as part of chart install | |
| | Name of the HostDeviceNetwork resource | |
| | Resource name referenced by the HostDeviceNetwork | |
| | Device-plugin resource requested by the pod | |
| | Number of host-device resources per container | |
Set hostDevice.createNetwork=false if you want to reference a pre-created HostDeviceNetwork instead of letting the chart create it.
RDMA Configuration
The RDMA shared device plugin grants cgroup-level access to /dev/infiniband/* character devices. This is required for OpenSM to function. The HostDevice resource alone makes device files visible but does not grant the necessary cgroup permissions.
| Parameter | Description | Default |
|---|---|---|
| | RDMA device plugin resource name | |
| | Number of RDMA resources to request per container | |
Storage Configuration
| Parameter | Description | Default |
|---|---|---|
| | Enable PVC creation | |
| | Use existing PVC instead of creating one | |
| | Storage class name (REQUIRED) | - |
| | Persistent volume size | |
| | PVC access mode | |
Resource Requests (REQUIRED) and Limits (OPTIONAL)
| Parameter | Description | Default |
|---|---|---|
| | Memory request (REQUIRED) | - |
| | CPU request (REQUIRED) | - |
| | Memory limit (optional) | - (no cap) |
| | CPU limit (optional) | - (no cap) |
License Configuration
| Parameter | Description | Default |
|---|---|---|
| | ConfigMap containing license file(s) | |
| | Secret containing license file(s) | |
SSL Certificate Configuration
| Parameter | Description | Default |
|---|---|---|
| | Enable custom SSL certificates | |
| | TLS Secret name (required when ssl.enabled=true) | |
Startup Probe Configuration
The startup probe waits for UFM to fully initialize before the liveness probe starts.
| Parameter | Description | Default |
|---|---|---|
| | Enable startup probe | |
| | Initial delay before probe | |
| | Probe interval | |
| | Probe timeout | |
| | Failures before giving up (10s × 30 = 5 min max) | |
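The "10s × 30 = 5 min max" figure follows from the standard Kubernetes probe fields `periodSeconds` and `failureThreshold`. A minimal sketch of that arithmetic is shown below. The helper name is invented, and the optional third argument stands for `initialDelaySeconds`, which the figure above does not include.

```shell
# Hypothetical helper: approximate worst-case startup time in seconds.
#   $1 = periodSeconds, $2 = failureThreshold, $3 = initialDelaySeconds (optional)
startup_budget_seconds() (
    period="$1"; threshold="$2"; initial_delay="${3:-0}"
    echo $(( initial_delay + period * threshold ))
)

startup_budget_seconds 10 30       # prints 300, i.e. 5 minutes
startup_budget_seconds 10 60       # prints 600, doubling the failure threshold
```

If UFM needs longer than this budget to initialize on a large fabric, raising the failure threshold or the interval extends it proportionally.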
Liveness Probe Configuration
The liveness probe checks if UFM is still running after startup completes.
| Parameter | Description | Default |
|---|---|---|
| | Enable liveness probe | |
| | Initial delay before probe | |
| | Probe interval | |
| | Probe timeout | |
| | Failures before restart | |
Service Configuration
| Parameter | Description | Default |
|---|---|---|
| | Enable Kubernetes Service | |
| | Service type: ClusterIP, NodePort, LoadBalancer | |
| | NodePort number (30000-32767), auto-assign if empty | |
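Before passing a NodePort value to Helm, it can be checked against the range given above. The sketch below is a hypothetical pre-flight helper, not part of the chart. It assumes the default 30000-32767 range and treats an empty value as "auto-assign".

```shell
# Hypothetical helper: succeeds when $1 is empty (auto-assign) or an
# integer inside the 30000-32767 NodePort range.
valid_node_port() {
    if [ -z "$1" ]; then
        return 0
    fi
    case "$1" in
        *[!0-9]*) return 1 ;;   # reject anything that is not a plain integer
    esac
    [ "$1" -ge 30000 ] && [ "$1" -le 32767 ]
}

if valid_node_port 30443; then echo "30443 is usable"; fi
if ! valid_node_port 443; then echo "443 is outside the NodePort range"; fi
```

Clusters started with a custom service node port range would need different bounds.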
Ingress Configuration
| Parameter | Description | Default |
|---|---|---|
| | Enable Ingress for external access | |
| | Ingress class name (e.g., nginx, traefik) | |
| | Hostname for the Ingress | |
| | Ingress annotations (controller-specific) | |
| | TLS secret name for HTTPS | |
Scheduling Configuration
| Parameter | Description | Default |
|---|---|---|
| | Node labels for pod scheduling | |
| | Tolerations for pod scheduling | |
| | Affinity rules for pod scheduling | |
Note: When the watchdog is enabled, the chart automatically adds a nodeAffinity rule to exclude nodes labeled unhealthy. If you also provide affinity.nodeAffinity, the watchdog expression is injected into each of your nodeSelectorTerms, preserving OR semantics between terms while ANDing the watchdog rule within each.
Deployment Configuration
| Parameter | Description | Default |
|---|---|---|
| | Enable the UFM deployment | |
| | Time to wait for graceful shutdown | |
Config File Overrides
| Parameter | Description | Default |
|---|---|---|
| | Map of file path to content; override with | |
Override chart-bundled config files without extracting the chart. Escape dots in filenames with a backslash (\.). For nested paths, use path segments as keys (e.g., configFiles.opensm.opensm\.conf for opensm/opensm.conf).
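The escaping rule can be applied mechanically. The sketch below is a hypothetical helper that converts a relative config path into the key form described above: dots are escaped first, then path separators become key separators. The helper name is invented for the example.

```shell
# Hypothetical helper: build the --set-file key for a config file path.
# Dots inside each path segment are escaped, then "/" becomes ".".
configfiles_key() {
    printf 'configFiles.%s\n' "$(printf '%s' "$1" | sed -e 's/\./\\./g' -e 's#/#.#g')"
}

configfiles_key gv.cfg               # prints configFiles.gv\.cfg
configfiles_key opensm/opensm.conf   # prints configFiles.opensm.opensm\.conf
```

The output can be spliced into a Helm command, for example `--set-file "$(configfiles_key opensm/opensm.conf)=/path/to/opensm.conf"`.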
User Scripts Configuration
| Parameter | Description | Default |
|---|---|---|
| | Map of script filename to content; inject via | |
Mount custom scripts as executable files inside the UFM pod. Scripts are mounted at /opt/ufm/scripts/user-scripts/ with mode 0755. Inject via --set-file (escape dots in filenames with \.). When no userScripts are provided, no ConfigMap or volume mount is created.
Watchdog Operator Configuration
The watchdog operator monitors UFM pods for crash loops and automatically labels problematic nodes to enable rescheduling to healthy nodes. It is enabled by default.
The operator handles two types of failures:
- Failover signal: UFM's health detects a critical failure and creates a failover flag. The operator detects this and triggers immediate node labeling and pod migration, with no threshold and no waiting.
- Process crash: A UFM process dies. The operator counts restarts within a sliding window and migrates only if the threshold is reached.
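The process-crash path can be pictured as threshold-in-window counting. The sketch below is only an illustration of that idea and says nothing about how the operator is actually implemented. The function name, timestamps, window, and threshold are all made-up values.

```shell
# Hypothetical illustration of sliding-window restart counting.
#   $1 = current time (epoch seconds), $2 = window length (seconds),
#   $3 = restart threshold, remaining arguments = restart timestamps.
# Prints "migrate" when enough restarts fall inside the window, else "wait".
restart_decision() (
    now="$1"; window="$2"; threshold="$3"; shift 3
    count=0
    for ts in "$@"; do
        # Only restarts that happened within the last $window seconds count.
        if [ $(( now - ts )) -le "$window" ]; then
            count=$(( count + 1 ))
        fi
    done
    if [ "$count" -ge "$threshold" ]; then echo migrate; else echo wait; fi
)

# Three restarts in total, but only two inside the 300-second window.
restart_decision 1000 300 3 600 800 950    # prints wait
# All three restarts inside the window.
restart_decision 1000 300 3 750 800 950    # prints migrate
```

Old restarts age out of the window, so a pod that crashes occasionally does not accumulate toward the threshold.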
| Parameter | Description | Default |
|---|---|---|
| | Enable watchdog operator | |
| | Watchdog image tag | |
| | Restarts before action (process crashes only) | |
| | Time window for counting restarts (seconds) | |
| | Max nodes to label unhealthy (0 = auto) | |
| | Operator replicas for HA | |
| | Label key applied to unhealthy nodes | |
maxLabeledNodes=0 (default): Auto-calculates as total_nodes - 1, ensuring at least one node always remains schedulable.
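The auto-calculation rule can be written out as a small helper. This is a sketch of the stated rule only. The function name is invented, and whether the operator counts all nodes or only schedulable ones is not specified here.

```shell
# Hypothetical helper: effective cap on labeled nodes.
#   $1 = configured maxLabeledNodes (0 means auto), $2 = total node count
effective_max_labeled_nodes() (
    configured="$1"; total_nodes="$2"
    if [ "$configured" -eq 0 ]; then
        # Auto mode: always leave one node schedulable.
        echo $(( total_nodes - 1 ))
    else
        echo "$configured"
    fi
)

effective_max_labeled_nodes 0 5   # prints 4
effective_max_labeled_nodes 2 5   # prints 2
```

With cluster access, the node count could be taken from `kubectl get nodes --no-headers | wc -l`.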
Plugin Watchdog:
The watchdog also monitors plugin pods (deployed by the UFM Plugins Helm chart) for crash loops. Plugins are identified by the ufm.nvidia.com/watchdog-scope=plugin label. Each plugin gets its own per-plugin unhealthy label (e.g., ufm.nvidia.com/fast_api-unhealthy), so one crashing plugin does not affect other plugins or UFM scheduling.
Plugin pods can override chart-level thresholds via annotations:
- ufm.nvidia.com/watchdog-restart-threshold
- ufm.nvidia.com/watchdog-time-window-seconds
| Parameter | Description | Default |
|---|---|---|
| | Enable plugin monitoring | |
| | Label selector for plugin pods | |
| | Restarts before action | |
| | Time window for counting restarts | |
| | Label key template | |
| | Max nodes to label per plugin (0=auto) | |
Observability: The operator reports status through Kubernetes Events (e.g., NodeLabeledUnhealthy, MaxUnhealthyNodesReached) and Prometheus metrics exposed on port watchdog.metricsPort (default 8080).
Recovering a node: Remove the unhealthy label after the issue is resolved: kubectl label node <node-name> ufm.nvidia.com/unhealthy-
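The same removal pattern applies to per-plugin labels. The sketch below prints, rather than runs, the matching command. It assumes the default label keys shown in this section (`ufm.nvidia.com/unhealthy` and the `ufm.nvidia.com/fast_api-unhealthy` pattern); `worker-1` is a placeholder node name and the helper name is invented.

```shell
# Hypothetical helper: print the kubectl command that clears an unhealthy label.
#   $1 = node name, $2 = optional plugin name (for example fast_api)
clear_unhealthy_cmd() (
    node="$1"; plugin="${2:-}"
    if [ -n "$plugin" ]; then
        key="ufm.nvidia.com/${plugin}-unhealthy"   # per-plugin label (assumed default pattern)
    else
        key="ufm.nvidia.com/unhealthy"             # UFM label (assumed default key)
    fi
    # A trailing "-" after the key tells kubectl to remove the label.
    echo "kubectl label node ${node} ${key}-"
)

clear_unhealthy_cmd worker-1            # prints kubectl label node worker-1 ufm.nvidia.com/unhealthy-
clear_unhealthy_cmd worker-1 fast_api   # prints kubectl label node worker-1 ufm.nvidia.com/fast_api-unhealthy-
```

If the label key was customized through the chart values, the key in the command changes accordingly.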
Plugin Deployment
UFM plugins are now deployed via a separate Helm chart.
Prerequisites
- UFM Enterprise must already be installed in the cluster
- ufmFullname: Must match your UFM release name (e.g., ufm-ufm-enterprise). Required.
- Shared PVC: The plugin chart uses the same PVC as UFM. Default claim name is {ufmFullname}-files.
- UFM ConfigMap: A ConfigMap named {ufmFullname}-config with key UFM_VERSION must exist (created by the UFM Enterprise chart).
- RDMA (if needed): If plugins use InfiniBand, set rdma.resourceCount and ensure the cluster has the RDMA device plugin.
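The object names in these prerequisites derive from `ufmFullname`. A minimal sketch of that derivation follows, using the example release name from the list. The `kubectl` checks are left as comments because they need cluster access, and the namespace is assumed to be the one used during installation.

```shell
# Derive the names the plugin chart expects from ufmFullname.
ufm_fullname="ufm-ufm-enterprise"          # example value from the prerequisites
pvc_name="${ufm_fullname}-files"           # shared PVC claim name
configmap_name="${ufm_fullname}-config"    # ConfigMap holding UFM_VERSION

echo "PVC: ${pvc_name}"
echo "ConfigMap: ${configmap_name}"

# With cluster access, both objects could then be checked
# (namespace assumed to be ufm-enterprise):
#   kubectl get pvc -n ufm-enterprise "${pvc_name}"
#   kubectl get configmap -n ufm-enterprise "${configmap_name}" -o jsonpath='{.data.UFM_VERSION}'
```

An empty result from the second command would indicate that the `UFM_VERSION` key is missing.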
Plugin Chart Values Reference
| Parameter | Description | Required |
|---|---|---|
| | Full name of the UFM Enterprise release (e.g., | Yes |
| | Kubernetes namespace. Can be auto-discovered via | No |
| | List of namespaces to search for | No |
| | PVC claim name for UFM files. Default: | No |
| | ConfigMap name for | No |
| | RDMA resource name (e.g., | No |
| | Number of RDMA resources per plugin pod; default | No |
| | Enable watchdog monitoring for plugin pods | No (default: |
| | Chart-level default for max restarts before marking node unhealthy | No (default: |
| | Chart-level default time window for counting restarts | No (default: |
| | Default | No |
| | Map of plugin definitions keyed by plugin name (see below) | Yes |
| | Pod-level securityContext (e.g., | No |
| | Pod affinity rules | No |
| | Pod tolerations | No |
| | Pod node selector | No |
| | Image pull secrets | No |
Plugin Entry Fields (plugins.entries.<name>)
Each plugin is a map entry keyed by its canonical name (use underscores, e.g., log_streamer). Required fields are image and tag.
| Parameter | Description | Required |
|---|---|---|
| | Set to | No |
| | Container image repository (no tag) | Yes |
| | Image tag | Yes |
| | e.g., | No (default: |
| | Main TCP port the plugin listens on. Written to | No |
| | Additional container ports (list of integers) | No |
| | HTTP path for liveness probe (e.g., | No |
| | Port for liveness httpGet probe; defaults to | No |
| | Host written into | No |
| | Per-plugin RDMA override: | No |
| | | No |
| | Full startup probe spec | No |
| | Full liveness probe spec. Overrides the chart default (httpGet or tcpSocket). | No |
| | Set to | No |
| | Full readiness probe spec. No default. | No |
| | Mount chart health-check script at | No (default: |
| | Additional Linux capabilities (e.g., | No |
| | Extra environment variables for the main container | No |
| | Extra volumes for the pod | No |
| | Extra volumeMounts for the main container | No |
| | When | No |
| | Deployment strategy: | No |
| | Per-plugin watchdog override: | No |
What the Plugin Chart Generates
- ClusterIP Service per plugin (when port and/or ports is set): enables in-cluster DNS so UFM and other services can reach the plugin
- Deployment per plugin: one Deployment per enabled entry, using Recreate strategy by default. Includes init container, shared PVC mounts, optional RDMA resources
- ConfigMap plugins.yaml: consumed by UFM with plugin name, host, port, tag for each enabled plugin
- Watchdog labels and annotations: when watchdog is enabled, each plugin pod gets discovery labels and threshold annotations
Incremental Plugin Upgrades
The map-based plugins.entries model lets you upgrade a single plugin without restating every other plugin.
| Scenario | Approach | Effect |
|---|---|---|
| Upgrade one plugin's image/config | | Keeps all other plugins as-is |
| Add a new plugin to an existing release | | Merges the new entry into existing |
| Disable a single plugin | | Only changes that plugin's |
| Upgrade the chart version itself | | Ensures new chart defaults apply cleanly |
| Full reconcile of all plugins | | Sets the authoritative desired state |
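As one possible shape for a single-plugin upgrade, the sketch below prints a `helm upgrade` command that combines Helm's `--reuse-values` flag with a `--set` on one entry under `plugins.entries`. This is an assumption about how such an upgrade could be phrased, not the documented command. The chart path `./ufm-plugins`, the tag `1.2.3`, and the helper name are placeholders.

```shell
# Hypothetical helper: print (not run) a helm upgrade command that changes a
# single plugin's tag while reusing the values already stored in the release.
#   $1 = release name, $2 = chart reference, $3 = plugin name, $4 = new tag
plugin_tag_upgrade_cmd() (
    release="$1"; chart="$2"; plugin="$3"; tag="$4"
    echo "helm upgrade ${release} ${chart} -n ufm-enterprise --reuse-values --set plugins.entries.${plugin}.tag=${tag}"
)

plugin_tag_upgrade_cmd ufm-plugins ./ufm-plugins log_streamer 1.2.3
```

Printing the command first gives a chance to review the value path before anything is applied to the cluster.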
Plugin Manager Behavior on K8s
The Plugin Manager in K8s mode is read-only:
- The Plugin Manager UI displays current plugin state but all modification operations are blocked
- Plugin Manager REST API and shell operations only allow GET/read actions; write operations are blocked
- All plugin lifecycle management (deploy, upgrade, disable) must be done via the Helm chart
Custom Configuration Files
The Helm chart includes default UFM configuration files that can be customized.
Customizing Config Files
Use --set-file (configFiles)
Override chart-bundled files. Escape dots in filenames with a backslash (\.). For nested paths, use path segments as keys.
# Override a top-level file
helm install ufm-enterprise ./ufm-enterprise \
--set-file 'configFiles.gv\.cfg=/path/to/my-gv.cfg' \
--set storage.className=nfs-client \
--set image.pullPolicy=Never
# Override a file in a subdirectory
helm install ufm-enterprise ./ufm-enterprise \
--set-file 'configFiles.opensm.opensm\.conf=/path/to/opensm.conf' \
--set storage.className=nfs-client \
--set image.pullPolicy=Never
Configuration Priority
Configuration is applied in this order (later wins):
1. Base install/upgrade: UFM default config files
2. Helm chart config files: Files from the files/conf/ directory
3. configFiles (values / --set-file): Overrides chart file content when the path is set
4. Helm values: config.mgmtInterface, config.fabricInterface (if provided)
Important Notes
- Config files are applied after the UFM upgrade/install process completes
- File ownership and permissions are preserved for existing files
- New files are created with ufmapp:ufmapp ownership
- helm upgrade with modified config files or configFiles overrides triggers a pod restart automatically
- Pod restarts skip config application if nothing changed (checksum-based)
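A checksum-based skip can be pictured as follows. The sketch is a generic illustration of the technique using `sha256sum` from GNU coreutils; the chart's actual mechanism is not shown in this document. The function name, file names, and sample content are invented.

```shell
# Hypothetical helper: succeeds when the config content differs from what was
# last applied, and records the new checksum in a state file.
#   $1 = config file, $2 = state file holding the last applied checksum
config_changed() (
    new_sum="$(sha256sum "$1" | cut -d' ' -f1)"
    old_sum="$(cat "$2" 2>/dev/null || true)"
    if [ "$new_sum" = "$old_sum" ]; then
        return 1                      # unchanged: nothing to apply
    fi
    printf '%s\n' "$new_sum" > "$2"   # remember what was applied
    return 0
)

workdir="$(mktemp -d)"
printf 'log_level = INFO\n' > "$workdir/example.cfg"

if config_changed "$workdir/example.cfg" "$workdir/applied.sha256"; then echo apply; else echo skip; fi   # prints apply
if config_changed "$workdir/example.cfg" "$workdir/applied.sha256"; then echo apply; else echo skip; fi   # prints skip
```

Comparing checksums rather than timestamps means a restart with identical content is treated as a no-op.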
Operations
Start/Stop UFM
Stop UFM:
kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=0
Verify UFM is stopped:
kubectl get pods -n ufm-enterprise
Start UFM:
kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1
Wait for the pod to be ready:
kubectl get pods -n ufm-enterprise -w
View Logs
Container Logs:
# Follow logs
kubectl logs -n ufm-enterprise -l app=ufm-enterprise -f
# Previous container logs (after crash)
kubectl logs -n ufm-enterprise -l app=ufm-enterprise --previous
UFM Application Logs:
# kubectl exec takes a pod name, not a label selector, so resolve the pod first
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
# List log files
kubectl exec -n ufm-enterprise "$POD" -- ls -la /opt/ufm/files/log/
# View specific log
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/console.log
# Tail a log
kubectl exec -n ufm-enterprise "$POD" -- tail -100 /opt/ufm/files/log/ufmhealth.log
Access UFM UI and REST API
https://<ingress-host>/ufm_web/
REST API
# Get UFM version
curl -k -u <user>:<password> https://<host>/ufmRest/app/ufm_version
# List resources
curl -k -u <user>:<password> https://<host>/ufmRest/resources/systems
Uninstallation
Step 1: Remove Plugins (if installed)
If you deployed plugins via the UFM Plugins Helm chart, uninstall them first:
helm uninstall ufm-plugins -n ufm-enterprise
Step 2: Remove UFM
helm uninstall ufm-enterprise -n ufm-enterprise
Warning: This deletes all UFM resources including the PersistentVolumeClaim and data.
Resource Cleanup
Remove all resources (entire namespace):
kubectl delete namespace ufm-enterprise
Remove specific resources only:
kubectl delete pvc -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
kubectl delete configmap -n ufm-enterprise ufm-license
kubectl delete secret -n ufm-enterprise ufm-tls
Monitoring
Kubernetes Probes
UFM uses two probes:
| Probe | Purpose | Success Condition |
|---|---|---|
| Startup | Wait for UFM initialization | REST API returns HTTP 200 |
| Liveness | Detect failures | UfmHealthRunner running, no failover flag |
Watchdog Operator Monitoring
The Watchdog Operator provides automatic failover capabilities. When UFM encounters a critical failure or crash loop, the operator:
- Labels the current node as unhealthy
- Kubernetes reschedules the UFM pod to a healthy node
- The same process applies to plugin pods (with per-plugin labels)
Monitoring Commands
Verify Probe Status
kubectl describe pod -n ufm-enterprise -l app=ufm-enterprise | grep -A 5 -E "Liveness:|Startup:"
Verify UFM Processes:
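# kubectl exec takes a pod name, not a label selector, so resolve the pod first
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- ps aux
Check UFM Health Log:
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/ufmhealth.log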
Known Limitations
| Limitation | Description | Impact / Workaround |
|---|---|---|
| Single Pod | Only one UFM replica supported | No horizontal scaling |
| sysdump Unavailable | sysdump collector doesn't work in K8s | Use manual log collection |
| Recreate Strategy | Rolling updates not supported | Downtime during upgrades |
| Plugin UI | Plugins with web UI are not supported in K8s | - |
| Plugin Manager Read-Only | Plugin manager UI and REST API are read-only; write operations are blocked | Use Helm chart for plugin lifecycle management |
| Plugin Port Configuration | User must manually specify plugin ports | Refer to plugin documentation for port values |
| Watchdog Label Cleanup | Watchdog does not automatically remove unhealthy labels from nodes after recovery | Manual label removal required ( |
| No Upgrade from 6.24.2 | This version is not compatible with the previous K8s deployment | Fresh install required |
Version Changes Since UFM 6.24.2
| Area | UFM 6.24.2 | This Version |
|---|---|---|
| Network | | HostDevice via NVIDIA Network Operator (no host network) |
| Security | Privileged container required | Non-privileged container |
| Watchdog | N/A | Watchdog Operator: automatic failover, node labeling, plugin monitoring |
| Plugins | Deployed via UFM Helm chart ( | Separate Helm chart in UFM SDK repo ( |
| Config Overrides | Edit chart files before install only | Also supports |
| Resource Limits | Both requests and limits required | Requests required, limits optional |
| Service | Disabled by default | Enabled (ClusterIP) by default |
| User Scripts | N/A | ConfigMap mount at |
| SSL Certificates | N/A | Custom SSL cert support via TLS Secret |
| Version Check | N/A | Init container verifies image version >= chart appVersion |
| Plugin Manager | Full operations available | Read-only on K8s; write operations blocked |
| Upgrade from 6.24.2 | - | Not supported; fresh install required |