NVIDIA UFM Enterprise User Manual

UFM on Kubernetes

Overview

UFM Enterprise supports deployment on Kubernetes clusters using Helm charts. This deployment method provides:

  • Declarative Configuration: Define your UFM deployment using Helm values

  • Simplified Operations: Use standard Kubernetes tools for deployment, upgrades, and management

  • Plugin Support: Deploy UFM plugins as separate pods via a dedicated Helm chart

Supported Environments

Kubernetes Version

Kubernetes 1.28 or later.

Node Operating Systems

UFM on Kubernetes supports the same operating systems as UFM Enterprise. See the Installation Notes for the complete list of supported operating systems.

Hardware Requirements

UFM on Kubernetes has the same hardware requirements as UFM Enterprise. See the Installation Notes for detailed specifications.

Prerequisites

Before deploying UFM Enterprise on Kubernetes, ensure the following requirements are met:

Kubernetes Cluster

  • Kubernetes cluster version 1.28 or later

  • kubectl configured with cluster access

  • Cluster admin permissions for installation

Helm

Helm 3.x installed on the management workstation:

helm version

Storage

  • A StorageClass that supports ReadWriteMany access mode

  • Minimum 10GB storage capacity

NVIDIA Network Operator (Required)

UFM cannot function without access to InfiniBand devices. The NVIDIA Network Operator must be installed and configured before installing the UFM Helm chart.

1. Install Network Operator: 

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install network-operator nvidia/network-operator \
  --namespace nvidia-network-operator \
  --create-namespace \
  --version 25.7.0 \
  --set nfd.enabled=true \
  --set ofedDriver.deploy=false \
  --set sriovDevicePlugin.deploy=true \
  --set secondaryNetwork.deploy=true \
  --set secondaryNetwork.multus.deploy=true \
  --wait --timeout 5m

Note: Set ofedDriver.deploy=false if OFED/DOCA drivers are already installed on the host.

2. Create NicClusterPolicy:

The sriovDevicePlugin must be enabled so the nodes expose nvidia.com/hostdev. The rdmaSharedDevicePlugin must be enabled to expose rdma/hca_shared, which is required for OpenSM to access InfiniBand character devices. 

kubectl apply -f - <<EOF
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  secondaryNetwork:
    multus:
      image: multus-cni
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.1.0
    cniPlugins:
      image: plugins
      repository: nvcr.io/nvidia/mellanox
      version: network-operator-v25.7.0
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "devices": [],
              "drivers": [],
              "pfNames": [],
              "pciAddresses": [],
              "rootDevices": [],
              "linkTypes": [],
              "isRdma": true
            }
          }
        ]
      }
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: network-operator-v25.7.0
    config: |
      {
        "configList": [
          {
            "resourceName": "hca_shared",
            "rdmaHcaMax": 1000,
            "devices": ["all"]
          }
        ]
      }
EOF


Wait for the policy to be ready:

kubectl get nicclusterpolicy -o jsonpath='{.items[0].status.state}'
# Expected: ready
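
The operator may need some time to reconcile, so the state check above can be wrapped in a polling loop. The following is a minimal sketch for a POSIX shell; the helper name wait_for_output and the attempt and delay values are illustrative assumptions, not part of the Network Operator or UFM tooling.

```shell
# wait_for_output EXPECTED ATTEMPTS DELAY_SECONDS COMMAND...
# Re-runs COMMAND until its output equals EXPECTED or the attempts run out.
wait_for_output() {
  expected=$1; attempts=$2; delay=$3; shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$("$@" 2>/dev/null)" = "$expected" ]; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Example (hypothetical limits: 60 attempts, 5 seconds apart):
#   wait_for_output ready 60 5 \
#     kubectl get nicclusterpolicy -o jsonpath='{.items[0].status.state}'
```

The function returns a non-zero status on timeout, so it can gate later steps in an installation script.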


3. Verify resources are available:

kubectl get nodes -o custom-columns=NAME:.metadata.name,HOSTDEV:.status.allocatable.nvidia\\.com/hostdev
# Expected: nvidia.com/hostdev should appear on the nodes


Note: 15b3 is the NVIDIA/Mellanox PCI vendor ID.

UFM License

  • Valid UFM Enterprise license file

Installation

Step 1: Set Up Storage

UFM requires ReadWriteMany storage. Make sure you have a persistent storage provisioner configured (e.g., NFS).

Step 2: Create Namespace and License ConfigMap

# Create the namespace
kubectl create namespace ufm-enterprise

# Create license ConfigMap
kubectl create configmap ufm-license \
  --from-file=<license-filename>.lic=/path/to/your/<license-filename>.lic \
  -n ufm-enterprise


Step 3: Install UFM with Helm

helm install ufm-enterprise <chart> \
  --namespace ufm-enterprise \
  --set storage.className=<storage-class> \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=<memory> \
  --set resources.requests.cpu=<cpu>


Note: The chart defaults to fabric_interface = net1 (provided by HostDeviceNetwork). No need to set config.fabricInterface unless your setup differs. Resource limits are optional — only requests are required.
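
The same settings can be kept in a values file and passed with -f instead of repeated --set flags. The sketch below writes such a file; the file name and every value in it (nfs-client, IfNotPresent, 8Gi, 4) are placeholders chosen for illustration, not recommended sizes.

```shell
# Write the Step 3 settings to a values file (placeholder values; adjust for your cluster).
cat > ufm-values.yaml <<'EOF'
storage:
  className: nfs-client        # assumed StorageClass name
image:
  pullPolicy: IfNotPresent     # one of Never, IfNotPresent, Always
license:
  existingConfigMap: ufm-license
resources:
  requests:
    memory: 8Gi                # placeholder; size per the Installation Notes
    cpu: "4"                   # placeholder
EOF

# Install using the file:
#   helm install ufm-enterprise <chart> --namespace ufm-enterprise -f ufm-values.yaml
```

Keeping the file under version control gives a record of the deployed configuration.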

Step 4: Verify Installation

Watch the pod status: 

kubectl get pods -n ufm-enterprise -w


Expected state transitions:

NAME                              READY   STATUS              AGE
ufm-ufm-enterprise-xxxxxxxxxx     0/1     Init:0/1            5s
ufm-ufm-enterprise-xxxxxxxxxx     0/1     PodInitializing     30s
ufm-ufm-enterprise-xxxxxxxxxx     0/1     Running             45s
ufm-ufm-enterprise-xxxxxxxxxx     1/1     Running             2m



Note: The pod shows 0/1 Running while the startup probe waits for UFM to fully initialize. This can take several minutes depending on the cluster size.

Verify the HostDeviceNetwork resource created by the chart: 

kubectl get hostdevicenetwork ufm-hostdevice -o yaml
kubectl get network-attachment-definition -n ufm-enterprise ufm-hostdevice -o yaml



Configuration Reference

All configuration options are set via Helm values. Use --set key=value or a values file (-f values.yaml).

Namespace Configuration

Parameter

Description

Default

namespace.create

Create the namespace

false

namespace.name

Namespace name

ufm-enterprise

Image Configuration

Parameter

Description

Default

image.repository

Image repository

docker.io/mellanox/ufm-enterprise

image.tag

Image tag

latest

image.pullPolicy

Image pull policy (REQUIRED)

(none; must be set)

imagePullSecrets

Image pull secrets for private registries

[]

versionCheckEnabled

Check that UFM image version >= Helm chart appVersion at startup. Fails the init container with a clear error if the image is older than the chart.

true


Note: image.pullPolicy must be set to one of: Never, IfNotPresent, or Always.

UFM Configuration

Parameter

Description

Default

config.mgmtInterface

Management network interface name

"" (uses gv.cfg)

config.fabricInterface

Fabric interface override

"" (uses gv.cfg, default net1)

config.httpPort

Apache HTTP port

80

config.httpsPort

Apache HTTPS port

443

HostDevice Network Configuration

UFM requires a valid InfiniBand fabric interface inside the pod. The interface is provided by the NVIDIA Network Operator HostDeviceNetwork. The pod gets a net1 interface which is a link/infiniband interface (no IP address). UFM uses fabric_interface = net1.

Parameter

Description

Default

hostDevice.createNetwork

Create the HostDeviceNetwork resource as part of chart install

true

hostDevice.networkName

Name of the HostDeviceNetwork resource

ufm-hostdevice

hostDevice.resourceName

Resource name referenced by the HostDeviceNetwork

hostdev

hostDevice.podResourceName

Device-plugin resource requested by the pod

nvidia.com/hostdev

hostDevice.resourceCount

Number of host-device resources per container

1

Set hostDevice.createNetwork=false if you want to reference a pre-created HostDeviceNetwork instead of letting the chart create it.

RDMA Configuration

The RDMA shared device plugin grants cgroup-level access to /dev/infiniband/* character devices. This is required for OpenSM to function. The HostDevice resource alone makes device files visible but does not grant the necessary cgroup permissions.

Parameter

Description

Default

rdma.resourceName

RDMA device plugin resource name

rdma/hca_shared

rdma.resourceCount

Number of RDMA resources to request per container

1

Storage Configuration

Parameter

Description

Default

storage.enabled

Enable PVC creation

true

storage.existingClaim

Use existing PVC instead of creating one

""

storage.className

Storage class name (REQUIRED)

(none; must be set)

storage.size

Persistent volume size

10Gi

storage.accessMode

PVC access mode

ReadWriteMany

Resource Requests (REQUIRED) and Limits (OPTIONAL)

Parameter

Description

Default

resources.requests.memory

Memory request (REQUIRED)

(none; must be set)

resources.requests.cpu

CPU request (REQUIRED)

(none; must be set)

resources.limits.memory

Memory limit (optional)

— (no cap)

resources.limits.cpu

CPU limit (optional)

— (no cap)

License Configuration

Parameter

Description

Default

license.existingConfigMap

ConfigMap containing license file(s)

""

license.existingSecret

Secret containing license file(s)

""

SSL Certificate Configuration

Parameter

Description

Default

ssl.enabled

Enable custom SSL certificates

false

ssl.existingSecret

TLS Secret name (required when ssl.enabled=true)

""

Startup Probe Configuration

The startup probe waits for UFM to fully initialize before the liveness probe starts.

Parameter

Description

Default

startupProbe.enabled

Enable startup probe

true

startupProbe.initialDelaySeconds

Initial delay before probe

2

startupProbe.periodSeconds

Probe interval

10

startupProbe.timeoutSeconds

Probe timeout

2

startupProbe.failureThreshold

Failures before giving up (10s × 30 = 5 min max)

30
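
The time the startup probe allows before the container is restarted is roughly the initial delay plus periodSeconds multiplied by failureThreshold. A small sketch of that arithmetic follows; the helper name is illustrative, and the result is an approximation that ignores per-probe timeouts.

```shell
# probe_budget_seconds INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Approximate number of seconds before the probe gives up.
probe_budget_seconds() {
  echo $(( $1 + $2 * $3 ))
}

probe_budget_seconds 2 10 30   # chart defaults: prints 302 (about 5 minutes)
probe_budget_seconds 2 10 60   # failureThreshold raised to 60: prints 602
```

If UFM needs longer to initialize on a large fabric, raising startupProbe.failureThreshold (for example with --set startupProbe.failureThreshold=60) extends the window accordingly.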

Liveness Probe Configuration

The liveness probe checks if UFM is still running after startup completes.

Parameter

Description

Default

livenessProbe.enabled

Enable liveness probe

true

livenessProbe.initialDelaySeconds

Initial delay before probe

0

livenessProbe.periodSeconds

Probe interval

10

livenessProbe.timeoutSeconds

Probe timeout

2

livenessProbe.failureThreshold

Failures before restart

3

Service Configuration

Parameter

Description

Default

service.enabled

Enable Kubernetes Service

true

service.type

Service type: ClusterIP, NodePort, LoadBalancer

ClusterIP

service.nodePort

NodePort number (30000-32767), auto-assign if empty

""

Ingress Configuration

Parameter

Description

Default

ingress.enabled

Enable Ingress for external access

false

ingress.className

Ingress class name (e.g., nginx, traefik)

""

ingress.host

Hostname for the Ingress

""

ingress.annotations

Ingress annotations (controller-specific)

{}

ingress.tls.secretName

TLS secret name for HTTPS

""

Scheduling Configuration

Parameter

Description

Default

nodeSelector

Node labels for pod scheduling

{}

tolerations

Tolerations for pod scheduling

[]

affinity

Affinity rules for pod scheduling

{}

Note: When the watchdog is enabled, the chart automatically adds a nodeAffinity rule to exclude nodes labeled unhealthy. If you also provide affinity.nodeAffinity, the watchdog expression is injected into each of your nodeSelectorTerms, preserving OR semantics between terms while ANDing the watchdog rule within each.

Deployment Configuration

Parameter

Description

Default

deployment.enabled

Enable the UFM deployment

true

deployment.terminationGracePeriodSeconds

Time to wait for graceful shutdown

30

Config File Overrides

Parameter

Description

Default

configFiles

Map of file path to content; override with --set-file

{}

Override chart-bundled config files without extracting the chart. Escape dots in filenames with a backslash (\.). For nested paths, use path segments as keys (e.g., configFiles.opensm.opensm\.conf for opensm/opensm.conf).
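
The key format described above can be derived mechanically from a file's relative path. The helper below is an illustrative sketch (config_key is not part of the chart): it escapes the dots in the path first, then turns directory separators into key separators.

```shell
# config_key RELATIVE_PATH
# Prints the configFiles key to use with --set-file for the given path.
config_key() {
  printf 'configFiles.%s\n' "$(printf '%s' "$1" | sed -e 's/\./\\./g' -e 's|/|.|g')"
}

config_key gv.cfg               # prints configFiles.gv\.cfg
config_key opensm/opensm.conf   # prints configFiles.opensm.opensm\.conf
```

The printed key goes on the left side of a --set-file argument and should be single-quoted so the shell keeps the backslash, for example --set-file 'configFiles.gv\.cfg=/path/to/my-gv.cfg'.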

User Scripts Configuration

Parameter

Description

Default

userScripts

Map of script filename to content; inject via --set-file

{}

Mount custom scripts as executable files inside the UFM pod. Scripts are mounted at /opt/ufm/scripts/user-scripts/ with mode 0755. Inject via --set-file (escape dots in filenames with \.). When no userScripts are provided, no ConfigMap or volume mount is created.
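
As an illustration, the sketch below creates a sample script and builds the matching --set-file argument. The script name collect_info.sh and its content are hypothetical.

```shell
# Create a sample script (hypothetical name and content).
cat > collect_info.sh <<'EOF'
#!/bin/sh
# Example only: print the hostname and date from inside the UFM pod.
hostname
date
EOF

# Build the --set-file argument; the dot in the filename is escaped with a backslash.
set_file_arg='userScripts.collect_info\.sh=collect_info.sh'
printf '%s\n' "--set-file $set_file_arg"
```

Appending the printed argument to the helm install or helm upgrade command should make the script available at /opt/ufm/scripts/user-scripts/collect_info.sh inside the pod.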

Watchdog Operator Configuration

The watchdog operator monitors UFM pods for crash loops and automatically labels problematic nodes to enable rescheduling to healthy nodes. It is enabled by default.

The operator handles two types of failures:

  • Failover signal: UFM's health detects a critical failure and creates a failover flag. The operator detects this and triggers immediate node labeling and pod migration — no threshold, no waiting.

  • Process crash: A UFM process dies. The operator counts restarts within a sliding window and migrates only if the threshold is reached.
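
The sliding-window rule for process crashes can be pictured with the sketch below. It is only an illustration of the counting logic under assumed inputs (restart timestamps in seconds), not the operator's actual implementation.

```shell
# should_migrate THRESHOLD WINDOW_SECONDS NOW RESTART_TIMESTAMP...
# Succeeds when at least THRESHOLD restarts fall inside the window ending at NOW.
should_migrate() {
  threshold=$1; window=$2; now=$3; shift 3
  count=0
  for ts in "$@"; do
    # Count only restarts that happened within the last WINDOW_SECONDS.
    if [ "$ts" -le "$now" ] && [ $(( now - ts )) -le "$window" ]; then
      count=$(( count + 1 ))
    fi
  done
  [ "$count" -ge "$threshold" ]
}

# Three restarts in the last 120 seconds: threshold reached.
if should_migrate 3 120 1000 900 950 990; then echo migrate; else echo stay; fi
# One restart is older than the window: only two count.
if should_migrate 3 120 1000 700 950 990; then echo migrate; else echo stay; fi
```

With the default values (threshold 3, window 120 seconds), isolated restarts spread over a long period would not trigger migration under this rule.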

Parameter

Description

Default

watchdog.enabled

Enable watchdog operator

true

watchdog.image.tag

Watchdog image tag

.Chart.AppVersion

watchdog.restartThreshold

Restarts before action (process crashes only)

3

watchdog.timeWindowSeconds

Time window for counting restarts (seconds)

120

watchdog.maxLabeledNodes

Max nodes to label unhealthy (0 = auto)

0

watchdog.replicas

Operator replicas for HA

2

watchdog.unhealthyLabelKey

Label key applied to unhealthy nodes

ufm.nvidia.com/unhealthy




maxLabeledNodes=0 (default): Auto-calculates as total_nodes - 1, ensuring at least one node always remains schedulable.
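
The automatic calculation can be sketched as follows. The helper name is illustrative, and the handling of non-zero values (used as given) is an assumption rather than documented behavior.

```shell
# effective_max_labeled_nodes CONFIGURED TOTAL_NODES
effective_max_labeled_nodes() {
  configured=$1; total_nodes=$2
  if [ "$configured" -eq 0 ]; then
    # Auto mode: leave one node schedulable.
    echo $(( total_nodes - 1 ))
  else
    # Assumption: an explicit value is used as-is.
    echo "$configured"
  fi
}

effective_max_labeled_nodes 0 4   # prints 3
effective_max_labeled_nodes 2 4   # prints 2
```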




Plugin Watchdog:

The watchdog also monitors plugin pods (deployed by the UFM Plugins Helm chart) for crash loops. Plugins are identified by the ufm.nvidia.com/watchdog-scope=plugin label. Each plugin gets its own per-plugin unhealthy label (e.g., ufm.nvidia.com/fast_api-unhealthy), so one crashing plugin does not affect other plugins or UFM scheduling.

Plugin pods can override chart-level thresholds via annotations:

  • ufm.nvidia.com/watchdog-restart-threshold

  • ufm.nvidia.com/watchdog-time-window-seconds

Parameter

Description

Default

watchdog.plugins.enabled

Enable plugin monitoring

true

watchdog.plugins.labelSelector

Label selector for plugin pods

ufm.nvidia.com/watchdog-scope=plugin

watchdog.plugins.restartThreshold

Restarts before action

3

watchdog.plugins.timeWindowSeconds

Time window for counting restarts

120

watchdog.plugins.unhealthyLabelKeyTemplate

Label key template

ufm.nvidia.com/{pluginName}-unhealthy

watchdog.plugins.maxLabeledNodes

Max nodes to label per plugin (0=auto)

0

Observability: The operator reports status through Kubernetes Events (e.g., NodeLabeledUnhealthy, MaxUnhealthyNodesReached) and through Prometheus metrics exposed on the port set by watchdog.metricsPort (default 8080).

Recovering a node: Remove the unhealthy label after the issue is resolved: kubectl label node <node-name> ufm.nvidia.com/unhealthy-

Plugin Deployment

UFM plugins are deployed via a separate Helm chart, not through the UFM Enterprise chart.

Prerequisites

  • UFM Enterprise must already be installed in the cluster

  • ufmFullname: Must match your UFM release name (e.g., ufm-ufm-enterprise). Required.

  • Shared PVC: The plugin chart uses the same PVC as UFM. Default claim name is {ufmFullname}-files.

  • UFM ConfigMap: A ConfigMap named {ufmFullname}-config with key UFM_VERSION must exist (created by the UFM Enterprise chart).

  • RDMA (if needed): If plugins use InfiniBand, set rdma.resourceCount and ensure the cluster has the RDMA device plugin.

Plugin Chart Values Reference

Parameter

Description

Required

ufmFullname

Full name of the UFM Enterprise release (e.g., ufm-ufm-enterprise)

Yes

namespace

Kubernetes namespace. Can be auto-discovered via namespaceSearchList. Falls back to ufm-enterprise.

No

namespaceSearchList

List of namespaces to search for {ufmFullname}-config at install time

No

existingClaim

PVC claim name for UFM files. Default: {ufmFullname}-files

No

configMapName

ConfigMap name for plugins.yaml. Default: {ufmFullname}-plugins

No

rdma.resourceName

RDMA resource name (e.g., rdma/hca_shared)

No

rdma.resourceCount

Number of RDMA resources per plugin pod; default "0". Set to "1" for InfiniBand plugins.

No

watchdog.enabled

Enable watchdog monitoring for plugin pods

No (default: true)

watchdog.restartThreshold

Chart-level default for max restarts before marking node unhealthy

No (default: 3)

watchdog.timeWindowSeconds

Chart-level default time window for counting restarts

No (default: 120)

plugins.defaultResources

Default requests/limits when a plugin does not set resources

No

plugins.entries

Map of plugin definitions keyed by plugin name (see below)

Yes

podSecurityContext

Pod-level securityContext (e.g., runAsNonRoot: true)

No

affinity

Pod affinity rules

No

tolerations

Pod tolerations

No

nodeSelector

Pod node selector

No

imagePullSecrets

Image pull secrets

No

Plugin Entry Fields (plugins.entries.<name>)

Each plugin is a map entry keyed by its canonical name (use underscores, e.g., log_streamer). Required fields are image and tag.

Parameter

Description

Required

enabled

Set to false to skip this plugin. Default true.

No

image

Container image repository (no tag)

Yes

tag

Image tag

Yes

imagePullPolicy

e.g., IfNotPresent or Always

No (default: IfNotPresent)

port

Main TCP port the plugin listens on. Written to plugins.yaml so UFM can reach the plugin.

No

ports

Additional container ports (list of integers)

No

healthEndpoint

HTTP path for liveness probe (e.g., /health). Uses native httpGet.

No

healthPort

Port for liveness httpGet probe; defaults to port

No

host

Host written into plugins.yaml. Default: in-cluster DNS {ufmFullname}-plugin-{name}.{namespace}.svc.cluster.local

No

rdma

Per-plugin RDMA override: { resourceName, resourceCount }

No

resources

requests/limits for this plugin; overrides plugins.defaultResources

No

startupProbe

Full startup probe spec

No

livenessProbe

Full liveness probe spec. Overrides the chart default (httpGet or tcpSocket).

No

disableLivenessProbe

Set to true to omit liveness probe entirely

No

readinessProbe

Full readiness probe spec. No default.

No

mountHealthScripts

Mount chart health-check script at /health-scripts

No (default: false)

extraCapabilities

Additional Linux capabilities (e.g., ["SYS_PTRACE"])

No

env

Extra environment variables for the main container

No

volumes

Extra volumes for the pod

No

volumeMounts

Extra volumeMounts for the main container

No

runInitContainer

When false, the init container is omitted. Default true.

No

strategy

Deployment strategy: Recreate (default) or RollingUpdate

No

watchdog

Per-plugin watchdog override: { enabled, restartThreshold, timeWindowSeconds }

No
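
A minimal values file for the plugin chart might look like the sketch below. The image repository, tag, port, and health path are hypothetical placeholders, and <plugin-chart> stands for wherever the plugin chart is available in your environment.

```shell
# Write a minimal plugin values file (placeholder image, tag, and port).
cat > plugin-values.yaml <<'EOF'
ufmFullname: ufm-ufm-enterprise
namespace: ufm-enterprise
plugins:
  entries:
    log_streamer:
      image: registry.example.com/ufm-plugin-log-streamer   # hypothetical repository
      tag: "1.0.0"                                           # hypothetical tag
      port: 8989                                             # hypothetical; see the plugin documentation
      healthEndpoint: /health                                # hypothetical path
EOF

# Install the plugin chart with this file:
#   helm install ufm-plugins <plugin-chart> -n ufm-enterprise -f plugin-values.yaml
```

Only ufmFullname and each entry's image and tag are required; the remaining fields fall back to the defaults described in the tables above.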

What the Plugin Chart Generates

  • ClusterIP Service per plugin (when port and/or ports is set): enables in-cluster DNS so UFM and other services can reach the plugin

  • Deployment per plugin: one Deployment per enabled entry, using Recreate strategy by default. Includes init container, shared PVC mounts, optional RDMA resources

  • ConfigMap plugins.yaml: consumed by UFM with plugin name, host, port, tag for each enabled plugin

  • Watchdog labels and annotations: when watchdog is enabled, each plugin pod gets discovery labels and threshold annotations

Incremental Plugin Upgrades

The map-based plugins.entries model lets you upgrade a single plugin without restating every other plugin.

Scenario

Helm flag

Effect

Upgrade one plugin's image/config

--reuse-values

Keeps all other plugins as-is

Add a new plugin to an existing release

--reuse-values

Merges the new entry into existing

Disable a single plugin

--reuse-values

Only changes that plugin's enabled flag

Upgrade the chart version itself

--reset-values

Ensures new chart defaults apply cleanly

Full reconcile of all plugins

--reset-values

Sets the authoritative desired state
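
For the single-plugin scenarios, the upgrade command changes one field under plugins.entries and reuses everything else. The sketch below only prints the command; the release name ufm-plugins matches the uninstall example in this guide, while the chart path ./ufm-plugins, the plugin name, and the tag are hypothetical.

```shell
# plugin_set_cmd CHART PLUGIN FIELD VALUE
# Prints a helm upgrade command that changes one field of one plugin entry.
plugin_set_cmd() {
  printf 'helm upgrade ufm-plugins %s -n ufm-enterprise --reuse-values --set plugins.entries.%s.%s=%s\n' \
    "$1" "$2" "$3" "$4"
}

plugin_set_cmd ./ufm-plugins log_streamer tag 1.0.1        # upgrade one plugin's image tag
plugin_set_cmd ./ufm-plugins log_streamer enabled false    # disable one plugin
```

Review the printed command before running it; when upgrading the chart version itself, --reset-values with a complete values file is the safer choice, as the table indicates.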

Plugin Manager Behavior on K8s

The Plugin Manager in K8s mode is read-only:

  • The Plugin Manager UI displays current plugin state but all modification operations are blocked

  • Plugin Manager REST API and shell operations only allow GET/read actions — write operations are blocked

  • All plugin lifecycle management (deploy, upgrade, disable) must be done via the Helm chart

Custom Configuration Files

The Helm chart includes default UFM configuration files that can be customized.

Customizing Config Files

Use --set-file (configFiles)

Override chart-bundled files without extracting the chart. Escape dots in filenames with a backslash (\.). For nested paths, use path segments as keys.

# Override a top-level file
helm install ufm-enterprise ./ufm-enterprise \
  --set-file 'configFiles.gv\.cfg=/path/to/my-gv.cfg' \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never

# Override a file in a subdirectory
helm install ufm-enterprise ./ufm-enterprise \
  --set-file 'configFiles.opensm.opensm\.conf=/path/to/opensm.conf' \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never


Configuration Priority

Configuration is applied in this order (later wins):

  1. Base install/upgrade: UFM default config files

  2. Helm chart config files: Files from the files/conf/ directory

  3. configFiles (values / --set-file): Overrides chart file content when the path is set

  4. Helm values: config.mgmtInterface and config.fabricInterface (if provided)

Important Notes

  • Config files are applied after the UFM upgrade/install process completes

  • File ownership and permissions are preserved for existing files

  • New files are created with ufmapp:ufmapp ownership

  • helm upgrade with modified config files or configFiles overrides triggers a pod restart automatically

  • Pod restarts skip config application if nothing changed (checksum-based)

Operations

Start/Stop UFM

Stop UFM: 

kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=0

Verify UFM is stopped:



kubectl get pods -n ufm-enterprise

Start UFM:


kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1

Wait for the pod to be ready:


kubectl get pods -n ufm-enterprise -w

View Logs

Container Logs:


# Follow logs
kubectl logs -n ufm-enterprise -l app=ufm-enterprise -f

# Previous container logs (after crash)
kubectl logs -n ufm-enterprise -l app=ufm-enterprise --previous

UFM Application Logs:

# Resolve the UFM pod name (kubectl exec does not accept a label selector)
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')

# List log files
kubectl exec -n ufm-enterprise "$POD" -- ls -la /opt/ufm/files/log/

# View specific log
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/console.log

# Tail a log
kubectl exec -n ufm-enterprise "$POD" -- tail -100 /opt/ufm/files/log/ufmhealth.log


Access UFM UI and REST API

https://<ingress-host>/ufm_web/


REST API 

# Get UFM version
curl -k -u <user>:<password> https://<host>/ufmRest/app/ufm_version

# List resources
curl -k -u <user>:<password> https://<host>/ufmRest/resources/systems

Uninstallation

Step 1: Remove Plugins (if installed)

If you deployed plugins via the UFM Plugins Helm chart, uninstall them first:

helm uninstall ufm-plugins -n ufm-enterprise

Step 2: Remove UFM

helm uninstall ufm-enterprise -n ufm-enterprise



Warning: This deletes all UFM resources including the PersistentVolumeClaim and data.

Resource Cleanup

Remove all resources (entire namespace):

kubectl delete namespace ufm-enterprise


Remove specific resources only:

kubectl delete pvc -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
kubectl delete configmap -n ufm-enterprise ufm-license
kubectl delete secret -n ufm-enterprise ufm-tls


Monitoring

Kubernetes Probes

UFM uses two probes:

Probe

Purpose

Check

Startup

Wait for UFM initialization

REST API returns HTTP 200

Liveness

Detect failures

UfmHealthRunner running, no failover flag

Watchdog Operator Monitoring

The Watchdog Operator provides automatic failover capabilities. When UFM encounters a critical failure or crash loop, the operator:

  1. Labels the current node as unhealthy

  2. Kubernetes reschedules the UFM pod to a healthy node

  3. The same process applies to plugin pods (with per-plugin labels)

Monitoring Commands

Verify Probe Status:

kubectl describe pod -n ufm-enterprise -l app=ufm-enterprise | grep -A 5 -E "Liveness:|Startup:"

Verify UFM Processes:

# kubectl exec does not accept a label selector, so resolve the pod name first
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n ufm-enterprise "$POD" -- ps aux

Check UFM Health Log:

kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/ufmhealth.log


Known Limitations

Limitation

Description

Impact / Workaround

Single Pod

Only one UFM replica supported

No horizontal scaling

sysdump Unavailable

sysdump collector doesn't work in K8s

Use manual log collection

Recreate Strategy

Rolling updates not supported

Downtime during upgrades

Plugin UI

Plugins with web UI are not supported in K8s

Plugin Manager Read-Only

Plugin manager UI and REST API are read-only; write operations are blocked

Use Helm chart for plugin lifecycle management

Plugin Port Configuration

User must manually specify plugin ports

Refer to plugin documentation for port values

Watchdog Label Cleanup

Watchdog does not automatically remove unhealthy labels from nodes after recovery

Manual label removal required (kubectl label node <node> ufm.nvidia.com/unhealthy-)

No Upgrade from 6.24.2

This version is not compatible with the previous K8s deployment

Fresh install required


Version Changes Since UFM 6.24.2

Area

UFM 6.24.2

Current Version

Network

hostNetwork: true

HostDevice via NVIDIA Network Operator (no host network)

Security

Privileged container required

Non-privileged container

Watchdog

N/A

Watchdog Operator — automatic failover, node labeling, plugin monitoring

Plugins

Deployed via UFM Helm chart (plugins.items[])

Separate Helm chart in UFM SDK repo (ufm-plugin-helm-template)

Config Overrides

Edit chart files before install only

Also supports --set-file configFiles.* without extracting chart

Resource Limits

Both requests and limits required

Requests required, limits optional

Service

Disabled by default

Enabled (ClusterIP) by default

User Scripts

N/A

ConfigMap mount at /opt/ufm/scripts/user-scripts/

SSL Certificates

N/A

Custom SSL cert support via TLS Secret

Version Check

N/A

Init container verifies image version >= chart appVersion

Plugin Manager

Full operations available

Read-only on K8s — write operations blocked

Upgrade from 6.24.2

Not supported — fresh install required
