NVIDIA UFM Enterprise User Manual

UFM on Kubernetes

UFM Enterprise Kubernetes Deployment Guide

This guide provides comprehensive instructions for deploying NVIDIA UFM Enterprise on Kubernetes using Helm charts.

Overview

What's New

UFM Enterprise now supports deployment on Kubernetes clusters using Helm charts. This deployment method provides:

  • Declarative Configuration: Define your UFM deployment using Helm values

  • Simplified Operations: Use standard Kubernetes tools for deployment, upgrades, and management

  • Plugin Support: Deploy UFM plugins as separate pods with automatic configuration

  • Ingress Integration: Expose UFM through Kubernetes Ingress controllers

  • Persistent Storage: Use Kubernetes PersistentVolumeClaims for data persistence

Supported Environments

Kubernetes Version

Kubernetes 1.28 or later.

Node Operating Systems

UFM on Kubernetes supports the same operating systems as UFM Enterprise. See the Installation Notes for the complete list of supported operating systems.

Hardware Requirements

UFM on Kubernetes has the same hardware requirements as UFM Enterprise. See the Installation Notes for detailed specifications.

Prerequisites

Before deploying UFM Enterprise on Kubernetes, ensure the following requirements are met:

Kubernetes Cluster

  • Kubernetes cluster version 1.28 or later

  • kubectl configured with cluster access

  • Cluster admin permissions for installation

Helm

  • Helm 3.x installed on the management workstation. Run: 

    # Verify Helm installation
    helm version

Storage

  • A StorageClass that supports ReadWriteMany access mode

  • Minimum 10GB storage capacity

InfiniBand

  • At least one node with InfiniBand interface

  • DOCA drivers installed on the worker node where UFM will be deployed

  • InfiniBand port configured and in "up" state. Run: 

    # Verify InfiniBand interface
    ip link show | grep -E 'ib[0-9]|ibp'
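The check above lists every matching interface regardless of its state. As a minimal sketch, the helper below narrows the list to interfaces whose flags include UP. The function name ib_up_interfaces is illustrative and not part of UFM, and it assumes interface names begin with ib, as in the pattern above. The UP flag reflects the administrative state only, so a link-level check may still be needed.

```shell
# Hypothetical helper: print InfiniBand interfaces whose flags include UP.
# Reads `ip -o link show` output on stdin (one interface per line).
ib_up_interfaces() {
  # Field 2 is the interface name, field 3 starts with the <FLAGS> list.
  # LOWER_UP is not matched because UP must follow "<" or ",".
  awk -F': ' '$2 ~ /^ib/ && $3 ~ /[<,]UP[,>]/ { print $2 }'
}

# Usage on the worker node:
#   ip -o link show | ib_up_interfaces
```

If the helper prints nothing, no interface matching the assumed naming pattern is administratively up on that node.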

UFM License


  • Valid UFM Enterprise license file

  • License file accessible from the management workstation

Network Ports

UFM uses host network mode. Ensure the following ports are available on the target node (UFM may use additional ports):

Port     Protocol  Purpose                           Configurable
80/443   TCP       Apache HTTP/HTTPS                 Yes
8000     TCP       UFM Internal REST Server          No
8081     TCP       OpenSM Plugin Communication       Yes
8082     TCP       OpenSM Traps Listening            Yes
8087     TCP       Auth Service                      Yes
9001     TCP       Telemetry/Prometheus Endpoint     Yes
9002     TCP       Secondary Telemetry Endpoint      Yes
8401+    TCP       Plugin Ports (varies per plugin)  Yes


When using Ingress, UFM automatically switches to ports 8080/18443 to avoid conflicts with the Ingress controller.
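Because UFM binds directly to host ports, it can help to check for existing listeners before installing. The following is a sketch only: busy_ports is a hypothetical helper, not a UFM tool, and it assumes the column layout printed by `ss -ltn`, where the local address and port are in the fourth column.

```shell
# Hypothetical helper: report which of the given ports already have a listener.
# Reads `ss -ltn` output on stdin; ports to check are passed as arguments.
busy_ports() {
  awk -v want="$*" '
    BEGIN { n = split(want, w, " "); for (i = 1; i <= n; i++) need[w[i]] = 1 }
    {
      k = split($4, part, ":")
      if (k > 0 && (part[k] in need)) busy[part[k]] = 1
    }
    END { for (i = 1; i <= n; i++) if (w[i] in busy) print w[i] }
  '
}

# Usage on the target node (fixed ports from the table above):
#   ss -ltn | busy_ports 80 443 8000 8081 8082 8087 9001 9002
```

Any port printed is already in use on the node and would need to be freed or, where the table marks it as configurable, moved.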

Installation

Step 1: Set Up Storage

UFM requires ReadWriteMany storage. Make sure persistent storage is configured and exposed through a StorageClass that supports this access mode.

Step 2: UFM Docker Image

The UFM Docker image must be available to your Kubernetes cluster.

It can be preloaded or hosted in your own registry.

Step 3: Create Namespace and License ConfigMap 

# Create the namespace
kubectl create namespace ufm-enterprise

# Create license ConfigMap
kubectl create configmap ufm-license \
  --from-file=<license-filename>.lic=/path/to/your/<license-filename>.lic \
  -n ufm-enterprise


Step 4: Install UFM with Helm 

helm install ufm ufm-enterprise-<version>-helm.tgz \
  --namespace ufm-enterprise \
  --set config.fabricInterface=<your_ib_interface> \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4


Replace <your_ib_interface> with your InfiniBand interface name (e.g., ib0, ibp4s0f0).

Note: The Helm chart is distributed as a .tgz package.

Step 5: Verify Installation

Watch the pod status: 

kubectl get pods -n ufm-enterprise -w


Expected state transitions:

NAME                              READY   STATUS        AGE
ufm-ufm-enterprise-xxxxxxxxxx     0/1     Init:0/1      5s
ufm-ufm-enterprise-xxxxxxxxxx     0/1     PodInitializing   30s
ufm-ufm-enterprise-xxxxxxxxxx     0/1     Running       45s
ufm-ufm-enterprise-xxxxxxxxxx     1/1     Running       2m


Note: The pod shows 0/1 Running while the startup probe waits for UFM to fully initialize. This can take several minutes, depending on the cluster size.

Configuration Reference

All configuration options are set via Helm values. Use --set key=value or a values file (-f values.yaml).

Namespace Configuration

Parameter         Description           Default
namespace.create  Create the namespace  false
namespace.name    Namespace name        ufm-enterprise

Image Configuration

Parameter         Description                                Default
image.repository  Image repository                           docker.io/mellanox/ufm-enterprise
image.tag         Image tag                                  latest
image.pullPolicy  Image pull policy (Required)               -
imagePullSecrets  Image pull secrets for private registries  []


Note: image.pullPolicy must be set to one of: Never, IfNotPresent, or Always.
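Since image.pullPolicy has no default, a wrapper script can reject a mistyped value before Helm is invoked. This is a minimal sketch; valid_pull_policy is a hypothetical helper name, and the accepted values are the three listed in the note above (they are case-sensitive).

```shell
# Hypothetical pre-flight check: succeed only for an accepted pull policy.
valid_pull_policy() {
  case "$1" in
    Never|IfNotPresent|Always) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   valid_pull_policy "$PULL_POLICY" || { echo "invalid image.pullPolicy" >&2; exit 1; }
```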

UFM Configuration

Parameter               Description                        Default
config.fabricInterface  InfiniBand fabric interface name   "" (uses gv.cfg)
config.mgmtInterface    Management network interface name  "" (uses gv.cfg)
config.httpPort         Apache HTTP port                   80 (or 8080 with Ingress)
config.httpsPort        Apache HTTPS port                  443 (or 18443 with Ingress)

Storage Configuration

Parameter              Description                    Default
storage.enabled        Enable PVC creation            true
storage.existingClaim  Use existing PVC name          ""
storage.className      Storage class name (Required)  -
storage.size           Persistent volume size         10Gi
storage.accessMode     PVC access mode                ReadWriteMany

Resource Limits (Required)

Parameter                  Description                Default
resources.requests.memory  Memory request (Required)  -
resources.requests.cpu     CPU request (Required)     -
resources.limits.memory    Memory limit (Required)    -
resources.limits.cpu       CPU limit (Required)       -

License Configuration

Parameter                  Description                           Default
license.existingConfigMap  ConfigMap containing license file(s)  ""
license.existingSecret     Secret containing license file(s)     ""

Startup Probe Configuration

Parameter                         Description                Default
startupProbe.enabled              Enable startup probe       true
startupProbe.initialDelaySeconds  Initial delay              2
startupProbe.periodSeconds        Probe interval             10
startupProbe.timeoutSeconds       Probe timeout              2
startupProbe.failureThreshold     Failures before giving up  30


Note: With default settings, UFM has up to 5 minutes (10s × 30) to fully start.
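The same arithmetic can be used to size the probe for a fabric that needs longer to start. The helpers below are an illustrative sketch (the function names are not part of the chart): one multiplies the probe interval by the failure threshold, and the other rounds up to the threshold needed for a target startup window.

```shell
# Startup window in seconds: periodSeconds x failureThreshold.
startup_budget_seconds() {
  period=$1
  threshold=$2
  echo $(( period * threshold ))
}

# Smallest failureThreshold that covers a desired window (rounds up).
threshold_for_budget() {
  budget=$1
  period=$2
  echo $(( (budget + period - 1) / period ))
}

# Usage:
#   startup_budget_seconds 10 30   # default settings
#   threshold_for_budget 900 10    # threshold for a 15-minute window
```

The resulting threshold would then be passed as --set startupProbe.failureThreshold=<value> at install or upgrade time.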

Liveness Probe Configuration

Parameter                          Description              Default
livenessProbe.enabled              Enable liveness probe    true
livenessProbe.initialDelaySeconds  Initial delay            0
livenessProbe.periodSeconds        Probe interval           10
livenessProbe.timeoutSeconds       Probe timeout            2
livenessProbe.failureThreshold     Failures before restart  3

Service Configuration

Parameter         Description                    Default
service.enabled   Enable Kubernetes Service      false
service.type      Service type                   ClusterIP
service.nodePort  NodePort number (30000-32767)  ""


Note: A Service is automatically created when Ingress is enabled. Use service.enabled=true only if you need a standalone Service without Ingress (e.g., LoadBalancer type in cloud environments).

Ingress Configuration

Parameter               Description                                                             Default
ingress.enabled         Expose UFM via Ingress controller for external access                   false
ingress.className       Ingress controller to use (e.g., nginx, traefik)                        ""
ingress.host            DNS hostname for accessing UFM (e.g., ufm.example.com)                  ""
ingress.annotations     Controller-specific annotations (e.g., backend protocol, timeouts)      {}
ingress.tls.secretName  Kubernetes TLS Secret for HTTPS (created via kubectl create secret tls)  ""

Scheduling Configuration

Parameter     Description                                                                          Default
nodeSelector  Schedule UFM on nodes with specific labels (e.g., kubernetes.io/hostname: ufm-node)  {}
tolerations   Allow UFM to run on tainted nodes (e.g., dedicated infrastructure nodes)             []
affinity      Advanced scheduling rules for node or pod affinity/anti-affinity                     {}

Example - Schedule on specific node: 

--set nodeSelector."kubernetes\.io/hostname"=ufm-node
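The backslash in the example is needed because Helm's --set syntax treats an unescaped dot as a separator between nested keys. As a small sketch, a helper such as the following (escape_set_key is a hypothetical name) can escape label keys before they are placed in a --set argument.

```shell
# Hypothetical helper: escape dots in a label key for use with `helm --set`.
escape_set_key() {
  printf '%s\n' "$1" | sed 's/\./\\./g'
}

# Usage:
#   key=$(escape_set_key "kubernetes.io/hostname")
#   helm install ... --set "nodeSelector.${key}=ufm-node"
```

In a values file no escaping is needed, because the label key is written as an ordinary YAML map key.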


Plugin Configuration

Parameter                 Description                                                     Default
plugins.items             List of plugins to deploy (see example below)                   []
plugins.defaultResources  Default resource limits for plugins if not specified per-plugin  See below

Example - Deploy with a plugin: 

helm install ufm ufm-enterprise-<version>-helm.tgz \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set plugins.items[0].name=<plugin-name> \
  --set plugins.items[0].image=<plugin-image> \
  --set plugins.items[0].tag=<plugin-version> \
  --set plugins.items[0].port=<plugin-port> \
  --set plugins.items[0].imagePullPolicy=Always



Deployment Options

Option 1: Host Network Mode (Default)

This is the default and simplest deployment mode. UFM binds directly to the host's network ports. 

helm install ufm ./ufm-enterprise \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4


Access UFM at: https://<node-ip>:443

Option 2: With Ingress Controller

Use an Ingress controller for external access with TLS termination and hostname-based routing.

Step 1: Install Ingress Controller (if not installed)


Step 2: Deploy UFM with Ingress

helm install ufm ./ufm-enterprise \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set ingress.enabled=true \
  --set ingress.className=traefik \
  --set ingress.host=ufm.example.com


Access UFM at: https://ufm.example.com

Note: When Ingress is enabled, UFM automatically switches to ports 8080/18443 to avoid conflicts.

Option 3: Using a Values File

For complex configurations, use a YAML values file: 

# my-values.yaml
namespace:
  name: ufm-enterprise

image:
  pullPolicy: Never

config:
  fabricInterface: ib0
  mgmtInterface: eth0

storage:
  className: nfs-client
  size: 50Gi

resources:
  requests:
    memory: 8Gi
    cpu: 4
  limits:
    memory: 16Gi
    cpu: 8

license:
  existingConfigMap: ufm-license

ingress:
  enabled: true
  className: nginx
  host: ufm.example.com
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"

nodeSelector:
  kubernetes.io/hostname: ufm-node


Deploy with the values file: 

helm install ufm ./ufm-enterprise -f my-values.yaml -n ufm-enterprise


Plugin Deployment

UFM plugins run as separate pods with pod affinity to ensure they are scheduled on the same node as UFM.

Plugin Configuration Fields

Limitation: In this version, you must manually specify the plugin port number. Refer to the plugin documentation for the correct port value.


Field                      Description                             Required
name                       Plugin name without ufm-plugin- prefix  Yes
image                      Plugin Docker image repository          Yes
tag                        Plugin image tag                        Yes
port                       Plugin service port (omit if no HTTP)   No
imagePullPolicy            Image pull policy                       No (default: IfNotPresent)
healthEndpoint             HTTP health endpoint path               No
healthPort                 Port for health endpoint                No (defaults to port)
livenessInitialDelay       Seconds before first liveness probe     No (default: 60)
livenessPeriod             Seconds between liveness probes         No (default: 30)
livenessTimeout            Seconds before probe times out          No (default: 15)
livenessFailureThreshold   Failures before restart                 No (default: 3)
readinessInitialDelay      Seconds before first readiness probe    No (default: 10)
readinessPeriod            Seconds between readiness probes        No (default: 10)
readinessTimeout           Seconds before probe times out          No (default: 15)
readinessFailureThreshold  Failures before not-ready               No (default: 3)

Deploy Single Plugin

helm install ufm ./ufm-enterprise \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set plugins.items[0].name=<plugin-name> \
  --set plugins.items[0].image=<plugin-image> \
  --set plugins.items[0].tag=<plugin-version> \
  --set plugins.items[0].port=<plugin-port> \
  --set plugins.items[0].imagePullPolicy=Always


Deploy Multiple Plugins 

helm install ufm ufm-enterprise-<version>-helm.tgz \
  --namespace ufm-enterprise \
  --set config.fabricInterface=ib0 \
  --set storage.className=nfs-client \
  --set image.pullPolicy=Never \
  --set license.existingConfigMap=ufm-license \
  --set resources.requests.memory=4Gi \
  --set resources.requests.cpu=2 \
  --set resources.limits.memory=8Gi \
  --set resources.limits.cpu=4 \
  --set plugins.items[0].name=<plugin1-name> \
  --set plugins.items[0].image=<plugin1-image> \
  --set plugins.items[0].tag=<plugin1-version> \
  --set plugins.items[0].port=<plugin1-port> \
  --set plugins.items[0].imagePullPolicy=Always \
  --set plugins.items[1].name=<plugin2-name> \
  --set plugins.items[1].image=<plugin2-image> \
  --set plugins.items[1].tag=<plugin2-version> \
  --set plugins.items[1].port=<plugin2-port> \
  --set plugins.items[1].imagePullPolicy=Always

Important: Plugin array indices must be sequential starting from 0.
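Writing the indices by hand is error-prone once several plugins are involved. As a sketch under stated assumptions, the helper below generates the --set flags from a plain list with one plugin per line in the form "name image tag [port]". The function name plugin_set_flags and the input format are illustrative, not part of the chart. Indices are assigned sequentially from 0, and the port flag is omitted when no port is given.

```shell
# Hypothetical helper: emit sequentially indexed --set flags for plugins.
# Input (stdin): one plugin per line as "name image tag [port]".
# Lines with fewer than three fields (including blank lines) are skipped.
plugin_set_flags() {
  awk '
    NF >= 3 {
      printf "--set plugins.items[%d].name=%s\n",  i, $1
      printf "--set plugins.items[%d].image=%s\n", i, $2
      printf "--set plugins.items[%d].tag=%s\n",   i, $3
      if (NF >= 4) printf "--set plugins.items[%d].port=%s\n", i, $4
      i++
    }
  '
}

# Usage (plugins.txt is an assumed file name):
#   plugin_set_flags < plugins.txt
```

The output can be reviewed and then appended to the helm install command. For more than a few plugins, a values file is usually easier to maintain.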

Plugin Without HTTP Port

Some plugins don't expose an HTTP port. Omit the port field: 

--set plugins.items[0].name=<plugin-name> \
--set plugins.items[0].image=<plugin-image> \
--set plugins.items[0].tag=<plugin-version> \
--set plugins.items[0].imagePullPolicy=Always


Plugins with Values File

# plugins-values.yaml
plugins:
  items:
    - name: <plugin-name>
      image: <plugin-image>
      tag: <plugin-version>
      port: <plugin-port>
      imagePullPolicy: Always

helm install ufm ufm-enterprise-<version>-helm.tgz -f my-values.yaml -f plugins-values.yaml -n ufm-enterprise



Custom Configuration Files

The Helm chart includes default UFM configuration files that can be customized.

Included Config Files

File                                                      Description
gv.cfg                                                    Main UFM configuration
opensm/opensm.conf                                        OpenSM configuration
sharp/sharp_am.cfg                                        SHARP AM configuration
telemetry_defaults/primary_env.cfg                        Primary telemetry environment
telemetry_defaults/launch_ibdiagnet_config.ini            IBDiagNet configuration
secondary_telemetry_defaults/launch_ibdiagnet_config.ini  Secondary telemetry config

Method 1: Edit Files Before Install

  1. Extract the chart:

     tar xzf ufm-enterprise-<version>-helm.tgz

  2. Edit the configuration files:

     vim ufm-enterprise/files/config/gv.cfg
     vim ufm-enterprise/files/config/opensm/opensm.conf

  3. Install with the modified files:

     helm install ufm ./ufm-enterprise -n ufm-enterprise \
       --set storage.className=nfs-client \
       --set image.pullPolicy=Never \
       --set resources.requests.memory=4Gi \
       --set resources.requests.cpu=2 \
       --set resources.limits.memory=8Gi \
       --set resources.limits.cpu=4


Configuration Priority

Configuration is applied in this order (later wins):

  1. Base install/upgrade - UFM default config files

  2. Helm chart config files - Files from files/config/ directory

  3. Helm values - config.fabricInterface, config.mgmtInterface
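The "later wins" rule can be illustrated with a toy model. This is not how the chart is implemented: effective_value is a hypothetical function that takes one candidate value per layer, in the order listed above, and returns the last non-empty one. An empty Helm value (the default for config.fabricInterface) therefore leaves the value from the configuration file in place.

```shell
# Toy model of the priority order: the last non-empty layer wins.
# Arguments: base default, chart config file value, Helm value.
effective_value() {
  result=""
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      result=$candidate
    fi
  done
  printf '%s\n' "$result"
}

# Usage:
#   effective_value "ib0" "ib1" ""      # Helm value unset: file value is kept
#   effective_value "ib0" "ib1" "ib2"   # Helm value set: it overrides the files
```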

Adding Custom Counter Sets

Add custom Prometheus counter set files for telemetry customization:

  1. Extract the chart:

     tar xzf ufm-enterprise-<version>-helm.tgz

  2. Add the custom cset file:

     mkdir -p ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/
     cp my-custom-counters.cset ufm-enterprise/files/config/telemetry_defaults/prometheus_configs/cset/

  3. Install:

     helm install ufm ./ufm-enterprise -n ufm-enterprise \
       --set storage.className=nfs-client \
       --set image.pullPolicy=Never \
       --set resources.requests.memory=4Gi \
       --set resources.requests.cpu=2 \
       --set resources.limits.memory=8Gi \
       --set resources.limits.cpu=4


Operations

Start/Stop UFM

Stop UFM

Scale down the deployment to 0 replicas: 

kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=0


Verify UFM is stopped:

kubectl get pods -n ufm-enterprise


Start UFM

Scale back up to 1 replica:

kubectl scale deployment -n ufm-enterprise -l app=ufm-enterprise --replicas=1


Wait for the pod to be ready:

kubectl get pods -n ufm-enterprise -w
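For scripted use, where an interactive watch is not practical, `kubectl wait` can block until the pod reports Ready. The wrapper below is a sketch; wait_for_ufm is a hypothetical name, and the default timeout of 300 seconds is an assumption taken from the default startup probe window.

```shell
# Hypothetical wrapper: block until the UFM pod is Ready or the timeout expires.
# Argument: timeout in seconds (defaults to 300).
wait_for_ufm() {
  timeout_s=${1:-300}
  kubectl wait --namespace ufm-enterprise --for=condition=Ready pod \
    --selector app=ufm-enterprise --timeout="${timeout_s}s"
}

# Usage:
#   wait_for_ufm 600 && echo "UFM is ready"
```

If the command is run before the pod object exists, `kubectl wait` may report that no matching resources were found. A short retry after scaling up avoids this.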


View Logs

Container Logs

Follow logs:

kubectl logs -n ufm-enterprise -l app=ufm-enterprise -f

Previous container logs (after crash):

kubectl logs -n ufm-enterprise -l app=ufm-enterprise --previous


UFM Application Logs

# Resolve the UFM pod (kubectl exec needs a pod name, not a label selector)
POD=$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o name | head -n 1)

# List log files
kubectl exec -n ufm-enterprise "$POD" -- ls -la /opt/ufm/files/log/

# View a specific log
kubectl exec -n ufm-enterprise "$POD" -- cat /opt/ufm/files/log/console.log

# Tail a log
kubectl exec -n ufm-enterprise "$POD" -- tail -100 /opt/ufm/files/log/ufmhealth.log

Access UFM UI and REST API

Web UI

https://<node-ip>:443/ufm_web/

REST API

# Get UFM version
curl https://<node-ip>:443/ufmRest/app/ufm_version

# List resources
curl https://<node-ip>:443/ufmRest/resources/systems


Uninstallation

Remove UFM

Run: 

helm uninstall ufm -n ufm-enterprise


Warning: This deletes all UFM resources including the PersistentVolumeClaim and data.

Resource Cleanup

Remove All Resources

To delete the entire namespace and all associated resources: 

kubectl delete namespace ufm-enterprise

Remove Specific Resources Only

To delete selected resources instead of the full namespace:

kubectl delete pvc -n ufm-enterprise -l app.kubernetes.io/name=ufm-enterprise
kubectl delete configmap -n ufm-enterprise ufm-license
kubectl delete secret -n ufm-enterprise ufm-tls


Networking

Port Reference

Port         Service       Description
80 / 8080    Apache HTTP   Web UI and REST API (HTTP)
443 / 18443  Apache HTTPS  Web UI and REST API (HTTPS)
8000         Flask         Internal REST server
8081         OpenSM        Plugin communication
8082         OpenSM        Trap listener
8087         Auth          Authentication service
9001         Telemetry     Prometheus metrics endpoint
9002         Telemetry     Secondary metrics endpoint
8401+        Plugins       Plugin-specific ports


Host Network Architecture

UFM is deployed with hostNetwork: true, enabling direct access to:

  • InfiniBand interfaces for fabric management

  • Host ports for external connectivity

  • Low-latency communication with OpenSM

Implications:

  • UFM pods bind directly to the node’s network stack

  • Required ports must be available on the host

  • Port conflicts may prevent pod startup

Traffic Flow with Ingress

Client → Ingress Controller (80/443) → UFM Service → UFM Pod (8080/18443)

When Ingress is enabled:

  • UFM listens internally on ports 8080 and 18443

  • The Ingress controller handles external traffic on ports 80 and 443

  • TLS may be terminated at the Ingress or passed through to UFM
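The port switch described above can be summarized as a small lookup. This is an illustrative sketch only: ufm_listen_ports is a hypothetical helper that assumes the default values of config.httpPort and config.httpsPort have not been overridden.

```shell
# Hypothetical lookup: default HTTP and HTTPS ports UFM listens on.
# Argument: "true" if ingress.enabled is set, "false" otherwise.
ufm_listen_ports() {
  case "$1" in
    true)  echo "8080 18443" ;;
    false) echo "80 443" ;;
    *)     echo "usage: ufm_listen_ports true|false" >&2; return 1 ;;
  esac
}

# Usage:
#   ufm_listen_ports true    # ports behind the Ingress controller
#   ufm_listen_ports false   # ports in plain host network mode
```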

Storage

Data Persistence

All data stored under /opt/ufm/files/ is persisted via a PersistentVolumeClaim (PVC), ensuring data retention across pod restarts.

Security Considerations

Privileged Container Requirement

UFM runs in privileged mode to allow:

  • Direct access to InfiniBand hardware

  • Loading of kernel modules

  • Management of the InfiniBand Subnet Manager

Security Impact: Privileged containers have elevated access to the host and should be deployed with caution.


Host Network Implications

Using hostNetwork: true means:

  • UFM can access all host network interfaces

  • Service ports are exposed directly on the node

  • Kubernetes NetworkPolicies do not apply to pod traffic


Best Practices

  1. Dedicated Nodes – Deploy UFM on dedicated infrastructure nodes.

  2. Node Taints – Apply taints to prevent unrelated workloads from scheduling on UFM nodes.

  3. Network Segmentation – Isolate UFM nodes on a management network.

  4. RBAC Controls – Restrict access to the UFM namespace using Kubernetes RBAC.

  5. Secrets Management – Store sensitive data in Kubernetes Secrets.

  6. Regular Updates – Keep UFM and Kubernetes components up to date.

Monitoring

Kubernetes Probes

UFM uses two probes:

Probe     Purpose                      Check
Startup   Wait for UFM initialization  REST API returns HTTP 200
Liveness  Detect failures              UfmHealthRunner running, no failover flag

Health Check Details

Startup Probe:

  • Calls /app/versioning/ on UFM Web server

  • Returns 503 during initialization, 200 when ready

  • Allows up to 5 minutes for startup

Liveness Probe:

  • Verifies UfmHealthRunner process is running

  • Checks for failover flag (critical failure indicator)

  • Verifies config_watcher.sh is running

Monitoring Commands

Verify Probe Status

Run the following command to review the liveness and startup probe configuration and status: 

kubectl describe pod -n ufm-enterprise -l app=ufm-enterprise | grep -A 5 -E "Liveness:|Startup:"


Verify UFM Processes

Use the command below to list running UFM-related processes inside the pod:

kubectl exec -n ufm-enterprise "$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o name | head -n 1)" -- ps aux


Check UFM Health Log

Run the following command to inspect the UFM health log file: 

kubectl exec -n ufm-enterprise "$(kubectl get pods -n ufm-enterprise -l app=ufm-enterprise -o name | head -n 1)" -- cat /opt/ufm/files/log/ufmhealth.log
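
kubectl exec addresses a single pod (or a resource such as a deployment) by name rather than by label selector, so repeated in-pod checks benefit from a small wrapper that resolves the pod once per call. The following is a sketch; ufm_exec is a hypothetical helper, and it assumes the app=ufm-enterprise label used elsewhere in this guide.

```shell
# Hypothetical wrapper: run a command inside the first pod matching the UFM label.
ufm_exec() {
  pod=$(kubectl get pods --namespace ufm-enterprise \
          --selector app=ufm-enterprise --output name | head -n 1)
  if [ -z "$pod" ]; then
    echo "no UFM pod found in namespace ufm-enterprise" >&2
    return 1
  fi
  # $pod has the form pod/<name>, which kubectl exec accepts directly.
  kubectl exec --namespace ufm-enterprise "$pod" -- "$@"
}

# Usage:
#   ufm_exec ps aux
#   ufm_exec tail -100 /opt/ufm/files/log/ufmhealth.log
```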

Known Limitations

Limitation                 Description                              Impact
Single Pod                 Only one replica supported               No horizontal scaling
No Automatic Failover      Pod won't migrate on node failure        Manual intervention required
No High Availability       HA mode not supported in K8s             Use Docker HA for HA requirements
Privileged Mode            Container requires privileged access     Security considerations
Host Network               Uses host networking                     Port conflicts possible
sysdump Unavailable        sysdump collector doesn't work           Use manual log collection
Recreate Strategy          Rolling updates not supported            Brief downtime during upgrades
Plugin Operations          Not all plugin operations are supported  Some plugin features may not work
Plugin Port Configuration  User must manually specify plugin ports  Refer to plugin documentation for port values
