NVIDIA UFM Enterprise User Manual

UFM Infra

The UFM Infra feature introduces a structured architecture where services are divided into two categories, each deployed differently based on functionality:

  • UFM Infra: A set of persistent infrastructure services that run on all nodes. These services support system-level operations and ensure distributed availability.

  • UFM Enterprise: Services that run exclusively on the master node, responsible for management, orchestration, and user-facing functionality.

Key Benefits

  • Faster API Availability after Failover: Because the infrastructure services already run on all nodes, fewer services need to transition when a node fails, which significantly reduces recovery time.

  • Improved Modularity: Separating core infrastructure from enterprise logic simplifies maintenance and troubleshooting.

  • Enhanced Scalability: Services can be scaled and managed independently across nodes.

Users can enable or disable the UFM Infra feature without reinstalling UFM. For more information, refer to Enabling or Disabling UFM Infra.

Installation instructions are available at UFM Infra Installation.

Prerequisites

The Valkey image must be loaded, or the is_external_redis flag must be enabled in gv.cfg.

Service Architecture

ufm-infra.service

Manages the following infrastructure components:

Component               Description
Valkey Server           Inter-node communication and topology storage
Apache Web Server       HTTP/HTTPS web server for UFM API and UI
Authentication Server   User authentication and session management
UFM Health (Infra)      Infrastructure health monitoring
Infra Plugins           Plugins running in infra context (e.g., Fast API)
UTM Telemetry           Telemetry services (when UTM mode is enabled)

ufm-enterprise.service

Manages the following enterprise components:

Component            Description
OpenSM               Subnet Manager for InfiniBand fabric
UFM Main Process     Core UFM fabric management engine
Enterprise Plugins   Plugins running in enterprise context
Topology Publishing  Publishes fabric topology to Valkey (Infra mode)

Shared Resources

In Infra mode, the following resources are shared between services:

  • Docker Volume (ufm-shared-data): Apache configuration shared between containers

  • Shared Configuration Files: /opt/ufm/files/ is mounted to both containers

  • Valkey: Used for topology publishing and inter-service communication

Configuring UFM Infra

Key                   Type     Default Value  Description
enabled               boolean  false          Enable or disable UFM Infra mode
redis_host            string   localhost      Valkey server hostname or IP address
redis_port            integer  6379           Valkey server port number
redis_socket_timeout  integer  5              Valkey connection timeout in seconds
is_external_redis     boolean  false          Use an external Redis/Valkey server instead of the internal one
is_tls_redis          boolean  false          Enable TLS encryption for Valkey connections
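
The following is a minimal sketch of how the keys in the table above could be read with their documented defaults. It assumes that all of the keys live in the [UFMInfra] section of gv.cfg (the manual only shows enabled in that section) and that the file follows standard INI syntax. The function name load_infra_settings is hypothetical and is not part of UFM.

```python
import configparser


def load_infra_settings(cfg_text: str) -> dict:
    """Parse gv.cfg-style text and return typed settings.

    Assumption: every key sits in the [UFMInfra] section. Missing keys
    fall back to the default values listed in the table above.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section("UFMInfra"):
        parser.add_section("UFMInfra")
    section = parser["UFMInfra"]
    return {
        "enabled": section.getboolean("enabled", fallback=False),
        "redis_host": section.get("redis_host", fallback="localhost"),
        "redis_port": section.getint("redis_port", fallback=6379),
        "redis_socket_timeout": section.getint("redis_socket_timeout", fallback=5),
        "is_external_redis": section.getboolean("is_external_redis", fallback=False),
        "is_tls_redis": section.getboolean("is_tls_redis", fallback=False),
    }


if __name__ == "__main__":
    sample = "[UFMInfra]\nenabled = true\nis_external_redis = true\n"
    print(load_infra_settings(sample))
```

Reading the text into a string first keeps the sketch independent of where gv.cfg is stored on a given system.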

Fast API Configuration

The following parameters can be modified within the Fast API configuration file:

Section            Default Value  Description
smCommunicator     600            Default time-to-live (TTL) for SM-related transactions before expiration (in seconds)
sharpCommunicator  600            Default time-to-live (TTL) for SHARP-related transactions before expiration (in seconds)
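
To illustrate what the TTL means in practice, the sketch below checks whether a transaction has outlived its TTL. It is illustrative only: the function name is_expired and the choice to treat a transaction as expired exactly at the TTL boundary are assumptions, not documented UFM behavior.

```python
DEFAULT_TTL_SECONDS = 600  # default value from the table above


def is_expired(created_at: float, now: float,
               ttl_seconds: int = DEFAULT_TTL_SECONDS) -> bool:
    """Return True once a transaction is at least ttl_seconds old.

    Both timestamps are seconds on the same clock (for example, time.time()).
    """
    return (now - created_at) >= ttl_seconds


if __name__ == "__main__":
    import time

    submitted = time.time()
    print(is_expired(submitted, submitted + 30))   # still within the TTL
    print(is_expired(submitted, submitted + 900))  # past the 600-second TTL
```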

Enabling or Disabling UFM Infra

UFM Infra mode can be enabled or disabled after installation using the ufm_infra_feature_flag.py script.

Script Location

/opt/ufm/files/scripts/ufm_infra_feature_flag.py

Command Line Options

Usage:  

ufm_infra_feature_flag.py [-h] (-e | -d) [--rootless]
                          [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                          [--timeout-seconds TIMEOUT_SECONDS] [--ufm-user UFM_USER]
                          [--force] [--skip-ha-validation]
                          [--infra-plugins-dir <path>]

Control UFM Infra feature flags


Flag                  Description
-e, --enable          Enable the Infra feature
-d, --disable         Disable the Infra feature
--rootless            Use rootless Podman mode (default: root Docker mode)
--log-level           Set logging level (default: INFO)
--timeout-seconds     Timeout for waiting for containers to stop (default: 120)
--ufm-user            User for rootless Podman commands (default: ufmadm)
--force               Automatically stop/start UFM services
--skip-ha-validation  Skip HA configuration validation
--infra-plugins-dir   Directory containing plugin images to load and install
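
The option set above can be modeled with Python's argparse, which is a convenient way to see how the flags combine (for example, that exactly one of --enable or --disable is required). This is a reconstruction from the usage text and the flag table, not the source of ufm_infra_feature_flag.py; build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="ufm_infra_feature_flag.py",
        description="Control UFM Infra feature flags",
    )
    # Exactly one of -e / -d must be given, matching "(-e | -d)" in the usage.
    action = parser.add_mutually_exclusive_group(required=True)
    action.add_argument("-e", "--enable", action="store_true",
                        help="Enable the Infra feature")
    action.add_argument("-d", "--disable", action="store_true",
                        help="Disable the Infra feature")
    parser.add_argument("--rootless", action="store_true",
                        help="Use rootless Podman mode (default: root Docker mode)")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                        help="Set logging level")
    parser.add_argument("--timeout-seconds", type=int, default=120,
                        help="Timeout for waiting for containers to stop")
    parser.add_argument("--ufm-user", default="ufmadm",
                        help="User for rootless Podman commands")
    parser.add_argument("--force", action="store_true",
                        help="Automatically stop/start UFM services")
    parser.add_argument("--skip-ha-validation", action="store_true",
                        help="Skip HA configuration validation")
    parser.add_argument("--infra-plugins-dir", metavar="<path>",
                        help="Directory containing plugin images to load and install")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--enable", "--rootless", "--force"])
    print(args)
```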

Enabling Infra Mode

Standalone Mode (Docker)

Without Automatic Service Management
  1. Stop UFM services manually: 

    systemctl stop ufm-enterprise
    systemctl stop ufm-infra
    


  2. Enable Infra mode: 

    cd /opt/ufm/files/scripts/
    ./ufm_infra_feature_flag.py --enable
    


  3. Start UFM services manually: 

    systemctl start ufm-infra
    systemctl start ufm-enterprise
    

    The script automatically detects whether the system is running in HA mode and manages cluster resources accordingly.

Disabling Infra Mode

Standalone Mode (Docker) 

    cd /opt/ufm/files/scripts/
    ./ufm_infra_feature_flag.py --disable --force

Standalone Mode (Rootless Podman)

    cd /opt/ufm/files/scripts/
    ./ufm_infra_feature_flag.py --disable --rootless --force

High Availability (HA) Mode

    cd /opt/ufm/files/scripts/
    ./ufm_infra_feature_flag.py --disable --force
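
For automation, the command lines above can be assembled programmatically. The helper below is a hypothetical sketch (build_command is not part of UFM) that uses only the script path and flags documented in this section. It returns the argument list without executing anything.

```python
from typing import List

# Script location as documented above.
SCRIPT = "/opt/ufm/files/scripts/ufm_infra_feature_flag.py"


def build_command(action: str, rootless: bool = False,
                  force: bool = False) -> List[str]:
    """Return the argument list for enabling or disabling Infra mode."""
    if action not in ("enable", "disable"):
        raise ValueError("action must be 'enable' or 'disable'")
    cmd = [SCRIPT, f"--{action}"]
    if rootless:
        cmd.append("--rootless")  # rootless Podman instead of root Docker
    if force:
        cmd.append("--force")     # let the script stop/start UFM services
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; on a UFM host it could be
    # passed to subprocess.run(cmd, check=True).
    print(" ".join(build_command("disable", rootless=True, force=True)))
```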

Script Behavior

When Enabling Infra Mode

The script performs the following actions:

  • Stops UFM services (standalone) or the HA cluster

  • Waits for all UFM containers to stop

  • Updates gv.cfg to set: 

    [UFMInfra]
    enabled = true
    


  • Updates the Valkey trigger file to enabled

  • Validates HA resources (if running in HA mode)

  • Loads and installs Infra plugins if --infra-plugins-dir is specified

  • Restarts UFM services or the HA cluster


When Disabling Infra Mode

The script performs the following actions:

  • Stops UFM services (standalone) or the HA cluster

  • Waits for all UFM containers to stop

  • Updates gv.cfg to set:

    [UFMInfra]
    enabled = false
    


  • Updates the Valkey trigger file to disabled

  • Restarts UFM services or the HA cluster
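
The gv.cfg update step can be pictured with the sketch below, which toggles enabled in the [UFMInfra] section using Python's configparser. This is an illustration under assumptions, not the script's actual implementation: set_infra_enabled is a hypothetical name, and the location of gv.cfg is left to the caller. Note that configparser drops comments when it rewrites a file, so try this on a copy rather than on a live configuration.

```python
import configparser
from pathlib import Path


def set_infra_enabled(cfg_path: Path, enabled: bool) -> None:
    """Set [UFMInfra] enabled to true or false in an INI-style file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read(cfg_path)
    if not parser.has_section("UFMInfra"):
        parser.add_section("UFMInfra")
    parser.set("UFMInfra", "enabled", "true" if enabled else "false")
    with cfg_path.open("w") as handle:
        parser.write(handle)


if __name__ == "__main__":
    import tempfile

    # Work on a throwaway copy rather than a real gv.cfg.
    with tempfile.TemporaryDirectory() as tmp:
        cfg = Path(tmp) / "gv.cfg"
        cfg.write_text("[UFMInfra]\nenabled = false\n")
        set_infra_enabled(cfg, True)
        print(cfg.read_text())
```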

Communication Flow: Fast API, Valkey, SM/SHARP Components

As part of the updated architecture, a Fast API plugin can be deployed as an Infra plugin, and a Valkey server is required for inter-service communication. Valkey can be configured in two ways:

  • As an internal service (installed with UFM)

  • As an external Redis/Valkey instance, depending on deployment needs.

The following sequence describes how communication is handled between Fast API, Valkey, and SM/SHARP components:

  1. Request Submission via Fast API
    Users send REST API requests (e.g., for PKey creation or SHARP reservation actions) to the Fast API. These requests are placed into Valkey queues, and a Transaction ID (TID) is returned to the user for tracking purposes.

  2. Processing by Communicators

    • The SM Communicator or SHARP Communicator monitors Valkey queues for new requests.

    • Upon receiving a request, the communicator forwards it to the relevant component (SM or SHARP) for execution.

    • After processing, the communicator captures the response and status.

  3. Status Updates
    The communicators update the status of each request back into Valkey. Users can query the status of their transaction using the TID provided during request submission. 

  4. Configuration Storage and Retrieval

    • Communicators store the configuration in Valkey.

    • This allows the Fast API to retrieve the configuration data and expose it via REST APIs, giving users visibility into cluster-level settings.
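
The request, processing, and status steps above can be sketched with an in-memory stand-in for Valkey. Everything in this example is illustrative: the class FakeValkey, the function names, the state strings ("pending", "completed", "failed"), and the payload fields are assumptions, and a real deployment would use Valkey queues and keys rather than Python objects.

```python
import uuid
from collections import deque


class FakeValkey:
    """In-memory stand-in for Valkey: one request queue plus a status table."""

    def __init__(self):
        self.queue = deque()
        self.status = {}


def submit_request(store: FakeValkey, payload: dict) -> str:
    """Step 1: enqueue the request and return a Transaction ID (TID)."""
    tid = str(uuid.uuid4())
    store.queue.append((tid, payload))
    store.status[tid] = {"state": "pending", "response": None}
    return tid


def run_communicator(store: FakeValkey, handler) -> None:
    """Steps 2 and 3: forward each queued request and record the outcome."""
    while store.queue:
        tid, payload = store.queue.popleft()
        try:
            response = handler(payload)  # stands in for SM or SHARP
            store.status[tid] = {"state": "completed", "response": response}
        except Exception as exc:
            store.status[tid] = {"state": "failed", "response": str(exc)}


def get_status(store: FakeValkey, tid: str):
    """Status query by TID, as a user would issue through the Fast API."""
    return store.status.get(tid)


if __name__ == "__main__":
    store = FakeValkey()
    tid = submit_request(store, {"action": "create_pkey", "pkey": "0x1"})
    print(get_status(store, tid)["state"])
    run_communicator(store, lambda request: {"accepted": True})
    print(get_status(store, tid))
```

Returning the TID immediately and resolving the result later is what lets the API respond without waiting for SM or SHARP to finish.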


