NVIDIA UFM Enterprise User Manual

Installing UFM Infra Using Rootless with Podman

UFM Infra Rootless Deployment with Podman (Restricted to Oracle Linux Only)

Prerequisites

  1. Download the UFM and plugins bundle tar file to /tmp.

  2. Extract the contents using the command: 

     tar -xvf <bundle tar>
    

This archive (tar file) includes the following components:

  • Relevant UFM container image

  • Relevant FAST-API container image

  • Relevant Infra container image (for internal Valkey usage). Refer to Redis-Related Configuration for more information.

  • Default plugin bundle for UFM

  • UFM-HA package

HA Installation Requirements

To enable the UFM Infra feature, UFM HA must be installed in a new mode (`external-storage`), using a new product (`enterprise-multinode`).

Additionally, NFS must be configured as follows:

NFS Setup Prerequisites

  • Select a dedicated NFS server to host the shared directories.

  • Create a shared directory on the NFS server for UFM configuration and logs.

  • Install the NFS client on each UFM node if not already present.

Enable HA ports in Firewall

If your firewall rules block non-standard ports, open the high-availability ports so that the HA services on the HA nodes can communicate with each other. To do so, run these commands:

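# Persistent rule: survives reboots, takes effect after the reload below
firewall-cmd --permanent --add-service=high-availability
# or a runtime-only rule: takes effect immediately, lost on reload or reboot
firewall-cmd --add-service=high-availability
# after adding the permanent rule, reload the rules
firewall-cmd --reload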

Create and Mount the UFM Directory

At this stage, perform step 2 (Mount the UFM directory) only on the master node.

The remaining nodes are mounted later, as part of Running in HA Mode.

  1. Create the UFM directory: 

     mkdir -p /opt/ufm/shared_files
    


  2. Mount the UFM directory:

    • If using NFS 4.2: 

       mount -t nfs4 -o context="system_u:object_r:container_file_t:s0" <server>:/shared_folder /opt/ufm/shared_files
      
    • If using NFS 3:

       mount -t nfs -o vers=3,context="system_u:object_r:container_file_t:s0" <server>:/shared_folder /opt/ufm/shared_files
      



  3. Ensure the NFS version and mount options are compatible with the NFS server. 

  4. Verify that the following HA packages are installed: pcs, pacemaker, and corosync. Install them if they are missing.  

  5. Follow the HA installation steps in Run the HA Installation.
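
Steps 3 and 4 above can be checked with a short script before moving on to the HA installation. The following is a minimal sketch, not part of the UFM tooling: the helper names `mount_fstype` and `missing_commands` are hypothetical, and it assumes that the HA packages provide the commands `pcs`, `pacemakerd`, and `corosync` in the PATH of the user running it.

```shell
# Sketch: confirm the shared mount and the HA tooling before running the HA installer.
# MOUNT_POINT and the command names checked below are assumptions; adjust as needed.
MOUNT_POINT="${MOUNT_POINT:-/opt/ufm/shared_files}"

# Print the filesystem type recorded for an exact mount point.
# $1 = mount point, $2 = mounts table to read (defaults to /proc/mounts)
mount_fstype() {
    awk -v mp="$1" '$2 == mp { fstype = $3 } END { if (fstype != "") print fstype }' "${2:-/proc/mounts}"
}

# Print every command from the argument list that is not found in PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

fstype="$(mount_fstype "$MOUNT_POINT" 2>/dev/null)" || fstype=""
case "$fstype" in
    nfs|nfs4) echo "OK: $MOUNT_POINT is mounted as $fstype" ;;
    "")       echo "WARNING: $MOUNT_POINT is not mounted" ;;
    *)        echo "WARNING: $MOUNT_POINT is mounted as $fstype, not NFS" ;;
esac

missing="$(missing_commands pcs pacemakerd corosync)"
if [ -n "$missing" ]; then
    echo "Missing HA commands:" $missing
fi
```

On Oracle Linux, `rpm -q pcs pacemaker corosync` gives the same answer at the package level. The script only reports problems; it does not install or mount anything.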

Run the HA Installation

Follow the HA installation instructions at UFM High-Availability Installation and Configuration.

When running the HA installation script, use the following command: 

    ./install.sh -p enterprise-multinode -l /opt/ufm/shared_files

  • The -l flag must always point to the shared directory path: /opt/ufm/shared_files

  • No need to provide the DRBD disk argument to the installation script. 

Installation Instructions

  1. Check firewall status: 

    systemctl status firewalld
    
  2. Configure the firewall (if active): 

    # Check whether firewalld is running
    systemctl status firewalld
    # Permanently add port 8443 to firewalld
    firewall-cmd --permanent --add-port=8443/tcp
    # Reload the firewalld configuration
    firewall-cmd --reload
    
    

  3.  Create UFM directory: 

    mkdir -p /opt/ufm 

  4. Create UFM group: 

    groupadd ufmadm -g 733

  5.  Create a UFM user: 

    useradd -d /opt/ufm -m -u 733 -g ufmadm ufmadm

  6. Set directory ownership: 

    chown -R ufmadm:ufmadm /opt/ufm
    chown -R ufmadm:ufmadm /opt/ufm/shared_files

  7. Configure SubUID and SubGID: 

    echo "ufmadm:100000:65536" >> /etc/subuid
    echo "ufmadm:100000:65536" >> /etc/subgid
      

  8. Enable login linger for the UFM user: 

    loginctl enable-linger ufmadm


  9. Configure rootless Podman storage:

    sudo -u ufmadm mkdir -p /opt/ufm/.config/containers
    cat <<EOF | sudo -u ufmadm tee /opt/ufm/.config/containers/storage.conf > /dev/null
    [storage]
    driver = "overlay"
    runroot = "/run/user/733"
    EOF

  10. Create Podman UFM socket: 

    cat <<EOF > /usr/lib/systemd/system/podman-ufm.socket
    [Unit]
    Description=Podman API Socket For Nvidia UFM
    
    [Socket]
    SocketUser=ufmadm
    SocketGroup=ufmadm
    ListenStream=%t/podman-ufm/podman-ufm.sock
    SocketMode=0660
    
    [Install]
    WantedBy=sockets.target
    EOF

  11.  Create Podman UFM service

    cat <<EOF > /usr/lib/systemd/system/podman-ufm.service
    [Unit]
    Description=Podman API Service for Nvidia UFM
    Requires=podman-ufm.socket
    After=podman-ufm.socket
    StartLimitIntervalSec=0
    
    [Service]
    Delegate=true
    Type=exec
    User=ufmadm
    Group=ufmadm
    KillMode=process
    Environment=LOGGING="--log-level=info"
    ExecStart=/usr/bin/podman \$LOGGING system service
    LimitMEMLOCK=infinity
    
    [Install]
    WantedBy=default.target
    EOF

  12.  Create Podman cleanup service:

    cat <<EOF > /usr/lib/systemd/system/podman-ufm-cleanup.service
    [Unit]
    Description=podman-ufm-cleanup - clean stuck rootless containers at boot
    After=podman-ufm.service
    Before=ufm-enterprise.service
    
    [Service]
    Type=oneshot
    User=ufmadm
    Group=ufmadm
    ExecStart=/usr/bin/podman system migrate
    
    [Install]
    WantedBy=multi-user.target
    EOF

  13.  Enable and start Podman services:

    systemctl daemon-reload
    systemctl enable --now podman-ufm.socket
    systemctl enable --now podman-ufm.service
    systemctl enable --now podman-ufm-cleanup.service

  14. Create Udev Rules for InfiniBand Devices

    cat <<EOF > /etc/udev/rules.d/70-umad.rules
    KERNEL=="umad*", SUBSYSTEM=="infiniband_mad", MODE="0600", OWNER="ufmadm", GROUP="ufmadm"
    KERNEL=="issm*", SUBSYSTEM=="infiniband_mad", MODE="0600", OWNER="ufmadm", GROUP="ufmadm"
    EOF
    
    udevadm control --reload-rules
    udevadm trigger

  15. Clean and create UFM directories

    rm -rf /opt/ufm/systemd
    sudo -u ufmadm mkdir -p /opt/ufm/ufm_plugins_data
    sudo -u ufmadm mkdir -p /opt/ufm/systemd
    sudo -u ufmadm mkdir -p /opt/ufm/etc/apache2

  16. Load UFM image and extract version: 

    # UFM_IMAGE_FILE must point to the UFM image file extracted from the bundle
    UFM_IMAGE_FILE="<PATH TO UFM IMAGE FILE>"

    # Extract UFM version from filename (e.g., ufm_6.22.0-7.ubuntu24.x86_64-docker.img.gz -> 6_22_0_7)
    UFM_VERSION=$(basename "$UFM_IMAGE_FILE" | sed 's/ufm_\([0-9][^.]*\.[^.]*\.[^.]*-[^.]*\)\.ubuntu.*/\1/' | tr '.-' '_')
    echo "UFM Version: $UFM_VERSION"
    
    # Load the UFM image
    sudo -u ufmadm podman load -i "$UFM_IMAGE_FILE"
     

  17.  Create a soft link: 

    # Remove existing files link if it exists
    rm -f /opt/ufm/files
    
    # Create soft link
    sudo -u ufmadm ln -s /opt/ufm/shared_files/ /opt/ufm/files
    
    # Verify the soft link
    ls -la /opt/ufm/files

  18.  Run UFM installer: 

    sudo -u ufmadm podman run -it --rm --name=ufm_installer \
                              -v /run/podman-ufm/podman-ufm.sock:/var/run/docker.sock \
                              -v /opt/ufm/:/installation/ufm_files/ \
                              -v /opt/ufm/files:/installation/ufm_files/files \
                              -v /opt/ufm/systemd:/etc/systemd_files/ \
                              mellanox/ufm-enterprise:latest \
                              --install \
                              --fabric-interface ib0 \
                              --rootless \
                              --plugin-path /opt/ufm/ufm_plugins_data \
                              --ufm-user ufmadm \
                              --ufm-group ufmadm \
                              --ufm-infra
     
    **Note**: Replace `ib0` with the name of your InfiniBand interface if it differs from the default.
    **Note**: All other UFM installation flags are supported and can be added to the command.
     

  19. Load the Valkey image (only if you are not using an external Redis/Valkey): 

    sudo -u ufmadm podman load -i "<PATH TO GIVEN VALKEY IMAGE>"


  20. Load the FAST-API plugin image:

    # FAST_API_VERSION must be set before running this command
    sudo -u ufmadm podman run --hostname $HOSTNAME --rm --name=ufm_plugin_mgmt --entrypoint="" \
        -v /run/podman-ufm/podman-ufm.sock:/var/run/docker.sock \
        -v /opt/ufm/files:/opt/ufm/shared_config_files \
        -v /dev/log:/dev/log \
        -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
        -v /lib/modules:/lib/modules:ro \
        -v /opt/ufm/ufm_plugins_data:/opt/ufm/ufm_plugins_data \
        -e UFM_CONTEXT=ufm-infra \
        mellanox/ufm-enterprise:latest \
        /opt/ufm/scripts/manage_ufm_plugins.sh add -p fast_api -t ${FAST_API_VERSION} -c ufm-infra

  21. Install service files: 

    mv /opt/ufm/systemd/ufm-enterprise.service /etc/systemd/system/ufm-enterprise.service
    mv /opt/ufm/systemd/ufm-infra.service /etc/systemd/system/ufm-infra.service
    systemctl daemon-reload
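
Two of the steps above are easy to get subtly wrong when the procedure is re-run: step 7 appends a duplicate line to /etc/subuid and /etc/subgid on every run, and step 16 depends on the image file name matching the expected pattern. The following is a minimal sketch of two helpers for these steps. The function names `ensure_subid_entry` and `ufm_version_from_file` are hypothetical, and the /tmp path is only an assumed download location.

```shell
# Sketch of two helpers for steps 7 and 16; names and sample values are illustrative.

# Append "<user>:<start>:<count>" to a subuid/subgid file only when the user has
# no entry yet, so that re-running step 7 does not create duplicate lines.
# $1 = file, $2 = user, $3 = first subordinate ID, $4 = number of IDs
ensure_subid_entry() {
    grep -q "^$2:" "$1" 2>/dev/null || printf '%s:%s:%s\n' "$2" "$3" "$4" >> "$1"
}

# Derive the underscore-separated version string from an image file name,
# using the same sed/tr pipeline as step 16.
ufm_version_from_file() {
    basename "$1" | sed 's/ufm_\([0-9][^.]*\.[^.]*\.[^.]*-[^.]*\)\.ubuntu.*/\1/' | tr '.-' '_'
}

# Example with the file name quoted in step 16 (the /tmp location is an assumption).
UFM_IMAGE_FILE="/tmp/ufm_6.22.0-7.ubuntu24.x86_64-docker.img.gz"
UFM_VERSION="$(ufm_version_from_file "$UFM_IMAGE_FILE")"
echo "UFM Version: $UFM_VERSION"   # prints "UFM Version: 6_22_0_7"
```

As root, the first helper would be called as `ensure_subid_entry /etc/subuid ufmadm 100000 65536` and again with /etc/subgid. If the file name does not match the pattern, the sed expression leaves it unchanged, so it is worth checking the echoed value before using it.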
    
    
     

To start UFM as a standalone instance, run: 


systemctl daemon-reload 
systemctl start ufm-infra 
systemctl start ufm-enterprise

Running in HA Mode 

Do not manually start any services.

  1. Ensure UFM and UFM-HA are installed on all nodes as described in the above sections.

  2. Mount /opt/ufm/files on all standby nodes, as described in step 2 (Mount the UFM directory) of Create and Mount the UFM Directory.

  3. On one node, edit the HA configuration file: 

    /etc/ufm_ha/ha_nodes.cfg
    

    Fill in the parameters for each node:

    [Node.1]
    # valid role options: master/standby
    role = master
    # Mandatory
    primary_ip =
    # Mandatory if dual_link = true 
    secondary_ip =
    
    [Node.2]
    role = standby
    primary_ip =
    secondary_ip =
    
    [Node.3]
    role = standby
    primary_ip =
    secondary_ip =
    
  4. Ensure the file sync mode is set to external-storage, and that the shared file system is mounted prior to HA configuration.

    [FileSync]
    # valid options are: drbd/external-storage
    # in case of external-storage the user MUST mount the files system PRIOR to ha configuration
    mode = external-storage
    
  5.  Copy the edited file to all nodes at the same path.

  6. Configure the cluster, starting from standby nodes and ending with the master node: 

    ufm_ha_cluster config -p <password>

    Use the same password on all nodes.

  7. After finishing the configuration on all nodes, run:

    ufm_ha_cluster status
    


  8. Start the cluster:

    ufm_ha_cluster start
    


  9. Check cluster status again to ensure all services have started successfully.
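
Before copying ha_nodes.cfg to the other nodes (step 5), it can help to check the edited file for the most common mistakes. The following is a minimal sketch, not part of the UFM-HA tooling: the function name `check_ha_nodes` is hypothetical, and it assumes that the [Node.N] and [FileSync] sections shown above live in the same file and use plain `key = value` lines.

```shell
# Sketch: sanity-check ha_nodes.cfg before copying it to the other nodes.
# Checks: exactly one master, a primary_ip for every node, external-storage mode.
check_ha_nodes() {
    awk '
        # Strip surrounding whitespace from a string.
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

        BEGIN { masters = 0; bad = 0; mode = "" }

        # Skip comments and blank lines.
        /^[ \t]*#/ || /^[ \t]*$/ { next }

        # Section header such as [Node.1] or [FileSync].
        /^[ \t]*\[/ {
            section = trim($0)
            gsub(/^\[|\]$/, "", section)
            if (section ~ /^Node\./) nodes[section] = 1
            next
        }

        # key = value line inside the current section.
        {
            eq = index($0, "=")
            if (eq == 0) next
            key = trim(substr($0, 1, eq - 1))
            val = trim(substr($0, eq + 1))
            if (section ~ /^Node\./ && key == "role" && val == "master") masters++
            if (section ~ /^Node\./ && key == "primary_ip" && val != "") has_ip[section] = 1
            if (section == "FileSync" && key == "mode") mode = val
        }

        END {
            if (masters != 1) { printf "expected exactly one master, found %d\n", masters; bad = 1 }
            for (n in nodes) if (!(n in has_ip)) { printf "%s: primary_ip is empty\n", n; bad = 1 }
            if (mode != "external-storage") { printf "FileSync mode is \"%s\", expected external-storage\n", mode; bad = 1 }
            exit bad
        }
    ' "$1"
}

HA_CFG="${HA_CFG:-/etc/ufm_ha/ha_nodes.cfg}"
if [ -r "$HA_CFG" ]; then
    if check_ha_nodes "$HA_CFG"; then
        echo "$HA_CFG looks consistent"
    else
        echo "Fix the issues above before copying $HA_CFG to the other nodes"
    fi
fi
```

The function returns a non-zero exit status when any check fails, so it can gate a copy loop. It does not validate IP address syntax or the dual-link settings.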

