DOCA SDK Documentation

DPL Container Deployment

1. Introduction

The DPL Runtime Service supports BlueField devices (DPU mode) and ConnectX-9 (Host mode).

There are a few differences between preparing the system for the DPL Runtime Service in DPU mode (BlueField) and in Host mode (ConnectX-9). These differences are outlined in the relevant sections below.

When deploying on the host, please refer to the instructions in section "DOCA Services for Host" in the DOCA Container Deployment Guide.

2. BlueField Preparations

This section is specific to DPU mode when running on BlueField devices.

2.1. Setting BlueField to DPU Mode

BlueField must run in DPU mode to use the DPL Runtime Service. For details on how to change modes, refer to the BlueField Modes of Operation documentation.

2.2. Determining Your BlueField Variant

Your BlueField may be installed in a host server or it may be a standalone server.

If your BlueField is a standalone server, ignore the parts that mention the host server or SR-IOV. You may still use Scalable Functions (SFs) if your BlueField is a standalone server.

2.3. Setting Up DPU Management Access and Updating BlueField-Bundle

These pages provide detailed information about DPU management access, software installation, and updates:

Systems with a host server typically use RShim (i.e., the tmfifo_net0 interface). Standalone systems must use the OOB interface option for management access.

3. Device Configuration

3.1. Changing the eSwitch to switchdev mode

Do this before creating SR-IOV Virtual Functions (VFs). If VFs already exist for the interface, remove them before changing the mode.

The DPL Runtime Service can only start if the eSwitch is in switchdev mode. If it's not, an error will be logged on startup and the process will exit.

If the platform is BlueField in DPU mode, run the command below in the DPU shell; otherwise (e.g., ConnectX-9), run it in the host shell.
Your BlueField DPU may already be configured in switchdev mode after the BFB installation, in which case this step may be unnecessary.

Find the PCI address of the interface that you would like to use with the DPL Runtime Service, then run the following command (replace the pci/<addr> part with the correct value):

Example
sudo devlink dev eswitch set pci/0000:03:00.0 mode switchdev

Here are a few options for commands that may help you find your PCI address:

  • lspci -D

  • mst status -v

  • ip -d link

  • ethtool -i <interface name>
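If you already know the interface name, its PCI address can also be read with a small helper. The following is a minimal sketch, assuming the interface is a PCI-backed netdev (such as a physical function) and that `ethtool -i` prints a `bus-info:` line; it falls back to the sysfs `device` symlink. The function names `bus_info_from_ethtool` and `pci_addr_of` are hypothetical helpers introduced only for this example.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): print the PCI address of a network interface.

# Parse the "bus-info:" line from `ethtool -i` output read on stdin.
bus_info_from_ethtool() {
    awk -F': ' '$1 == "bus-info" { print $2; exit }'
}

# pci_addr_of <interface>: try ethtool first, then fall back to sysfs,
# where /sys/class/net/<if>/device is a symlink to the PCI device directory.
pci_addr_of() {
    local ifname="$1" addr
    addr=$(ethtool -i "$ifname" 2>/dev/null | bus_info_from_ethtool)
    if [ -z "$addr" ] && [ -e "/sys/class/net/$ifname/device" ]; then
        addr=$(basename "$(readlink -f "/sys/class/net/$ifname/device")")
    fi
    # Fail (non-zero status) if no address could be determined.
    [ -n "$addr" ] && echo "$addr"
}
```

For example, `pci_addr_of eth2` prints the PCI address of eth2 (if it has one), which you then pass to devlink as `pci/<address>`.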

devlink settings are not persistent across reboots.
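Because the setting may need to be reapplied, it can help to check the current mode before changing it. This is a sketch only, assuming `devlink dev eswitch show` prints the mode on a single line such as `pci/0000:03:00.0: mode legacy inline-mode none encap-mode basic` (the exact fields can differ between devlink versions). `eswitch_mode_from_output` and `ensure_switchdev` are hypothetical helpers, not part of DOCA or DPL.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): switch to switchdev only if needed.
# Assumes `devlink dev eswitch show` prints "... mode <mode> ..." on one line.

# Print the word following the exact field "mode" from devlink output on stdin.
# Fields such as "inline-mode" and "encap-mode" do not match.
eswitch_mode_from_output() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mode") { print $(i + 1); exit } }'
}

# ensure_switchdev <pci-address>, e.g. 0000:03:00.0
ensure_switchdev() {
    local dev="pci/$1" mode
    mode=$(sudo devlink dev eswitch show "$dev" | eswitch_mode_from_output)
    if [ "$mode" = "switchdev" ]; then
        echo "$dev is already in switchdev mode"
    else
        echo "$dev is in '${mode:-unknown}' mode; switching to switchdev"
        sudo devlink dev eswitch set "$dev" mode switchdev
    fi
}
```

Usage: `ensure_switchdev 0000:03:00.0`. Any existing VFs still have to be removed before the mode can change.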

3.2. Enabling Multiport eSwitch Mode (Optional)

This step is optional and depends on your DPL program and setup needs.

Multiport eSwitch mode allows traffic forwarding between multiple physical ports and their VFs/SFs (e.g., between p0 and p1).

Before enabling this mode:

  1. Ensure LAG_RESOURCE_ALLOCATION is enabled in firmware:

    Example

    sudo mlxconfig -d 0000:03:00.0 s LAG_RESOURCE_ALLOCATION=1
    

    Refer to the Using mlxconfig guide for more information.

  2. After a reboot or firmware reset (required for the firmware setting to take effect), enable esw_multiport mode:

    Example

    sudo devlink dev param set pci/0000:03:00.0 name esw_multiport value 1 cmode runtime
    

devlink settings are not persistent across reboots.
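Since esw_multiport must be set again after each reboot, a quick check of its current runtime value can be useful. The sketch below assumes that `devlink dev param show pci/<addr> name esw_multiport` prints a line of the form `cmode runtime value true`; verify this against your devlink version. `runtime_param_value` and `multiport_enabled` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): check whether esw_multiport is enabled.
# Assumes devlink prints a line such as "cmode runtime value true".

# Print the value that follows "cmode runtime ... value" in output on stdin.
runtime_param_value() {
    awk '{ for (i = 1; i < NF; i++)
             if ($i == "cmode" && $(i + 1) == "runtime") {
                 for (j = i + 2; j < NF; j++)
                     if ($j == "value") { print $(j + 1); exit }
             } }'
}

# multiport_enabled <pci-address>: succeeds if the runtime value is true/1.
multiport_enabled() {
    local val
    val=$(sudo devlink dev param show "pci/$1" name esw_multiport | runtime_param_value)
    [ "$val" = "true" ] || [ "$val" = "1" ]
}
```

For example: `multiport_enabled 0000:03:00.0 || sudo devlink dev param set pci/0000:03:00.0 name esw_multiport value 1 cmode runtime`.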

3.3. Creating SR-IOV Virtual Functions

To use SR-IOV, first create Virtual Functions (VFs) on the host server:

Example
Bash
sudo -s  # enter sudo shell
echo 4 > /sys/class/net/eth2/device/sriov_numvfs
exit     # exit sudo shell

A root shell is necessary because sudo applies only to the echo command, not to the redirection (>). The redirection is performed by your unprivileged shell and would otherwise fail with "Permission denied."

This example creates 4 VFs under Physical Function eth2. Adjust the number as needed.

If a PF already has VFs and you would like to change the number, first set sriov_numvfs to 0, then apply the new value.
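Alternatively, the root shell can be avoided with `sudo tee`, which performs the write itself as root. The following sketch, assuming a sysfs path like the one in the example above, also applies the reset-to-0 rule automatically. `set_numvfs` and the `SUDO` override (set it to empty when already running as root) are hypothetical, introduced only for this example.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helper): set the VF count without a root shell.
# `sudo tee` performs the write with root privileges, so the redirection
# problem described above does not arise. Set SUDO= (empty) if already root.

# set_numvfs <sriov_numvfs-file> <count>
set_numvfs() {
    local file="$1" count="$2" current
    current=$(cat "$file")
    # Nothing to do if the requested count is already set.
    if [ "$current" = "$count" ]; then
        return 0
    fi
    # A non-zero VF count cannot be changed directly; reset it to 0 first.
    if [ "$current" != "0" ]; then
        echo 0 | ${SUDO-sudo} tee "$file" > /dev/null
    fi
    echo "$count" | ${SUDO-sudo} tee "$file" > /dev/null
}
```

Usage: `set_numvfs /sys/class/net/eth2/device/sriov_numvfs 4`.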

3.4. Creating Scalable Functions (Optional)

This step is optional and depends on your DPL program and setup needs.

For more information, see the BlueField Scalable Functions documentation.

If you create SFs, refer to their representors in the configuration file.

4. Installing the DPL Runtime Service

4.1. Downloading Container Resources from NGC

Start by downloading and installing the NGC CLI tool (ngc-cli). For example:

  • For DPU mode, download the Arm ngc-cli tool:

    Example

    Bash
    wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.5.2/files/ngccli_arm64.zip -O ngccli_arm64.zip
    unzip ngccli_arm64.zip
    
  • For Host mode, download the appropriate ngc-cli tool for your system architecture:

    Example for x86_64

    Bash
    wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.5.2/files/ngccli_linux.zip -O ngccli_linux.zip
    unzip ngccli_linux.zip
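To avoid picking the wrong archive, the download can be keyed on `uname -m`. This is a sketch only: it reuses the URL pattern, file names, and version 4.5.2 from the examples above, and `ngc_zip_for_arch` and `download_ngc_cli` are hypothetical helper names. Architectures other than aarch64 and x86_64 are rejected rather than guessed.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): fetch the ngc-cli build for this CPU.
# URL pattern and file names are copied from the documented examples.

# ngc_zip_for_arch <uname -m output>: print the matching archive name.
ngc_zip_for_arch() {
    case "$1" in
        aarch64|arm64) echo "ngccli_arm64.zip" ;;
        x86_64)        echo "ngccli_linux.zip" ;;
        *)             echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# download_ngc_cli [version]: download and unpack ngc-cli for this machine.
download_ngc_cli() {
    local version="${1:-4.5.2}" zip
    zip=$(ngc_zip_for_arch "$(uname -m)") || return 1
    wget --content-disposition \
        "https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${version}/files/${zip}" \
        -O "$zip" && unzip -o "$zip"
}
```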
    

Once the ngc-cli tool has been downloaded, use it to download the latest dpl_rt_service resources:

Bash
./ngc-cli/ngc registry resource download-version "nvidia/doca/dpl_rt_service"

This creates a directory named in the format dpl_rt_service_va.b.c-docax.y.z, where a.b.c is the DPL Runtime Service version number and x.y.z is the DOCA version number (e.g., dpl_rt_service_v1.2.0-doca3.1.0).
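Because the directory name encodes both versions, scripts can parse it instead of hard-coding it. A minimal sketch, assuming the naming format described above and GNU `sort -V` for picking the newest bundle when several have been downloaded; `parse_bundle_dir` and `latest_bundle_dir` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers) for bundle directories named
# dpl_rt_service_va.b.c-docax.y.z.

# parse_bundle_dir <dir-name>: print "<dpl-version> <doca-version>".
parse_bundle_dir() {
    local re='^dpl_rt_service_v([0-9]+\.[0-9]+\.[0-9]+)-doca([0-9]+\.[0-9]+\.[0-9]+)$'
    if [[ "$(basename "$1")" =~ $re ]]; then
        echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
    else
        return 1
    fi
}

# latest_bundle_dir: print the newest bundle directory in the current
# directory (with a trailing slash), using version-aware sorting.
latest_bundle_dir() {
    ls -d dpl_rt_service_v*-doca*/ 2>/dev/null | sort -V | tail -n 1
}
```

For example, `cd "$(latest_bundle_dir)"` can replace the placeholder directory name in the commands that follow.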

You can find available versions at NGC Catalog.

Each release includes a kubelet.d YAML file that the dpl_rt_service_ctl.sh script uses to retrieve the correct container image for either DPU or Host mode.

4.2. Running the Preparation Script

Run the dpl_system_setup.sh script to configure the system:

  • For DPU mode:

    Bash
    cd dpl_rt_service_va.b.c-docax.y.z
    chmod +x ./scripts/dpl_system_setup.sh  
    sudo ./scripts/dpl_system_setup.sh
    sudo systemctl restart kubelet.service
    sudo systemctl restart containerd.service
    

    In DPU mode, kubelet and containerd must be restarted whenever the hugepages configuration changes, for the change to take effect.


  • For Host mode, use the --dev option to specify each ConnectX device to configure for DPL use (repeat the option to configure multiple devices):

    Bash
    cd dpl_rt_service_va.b.c-docax.y.z
    chmod +x ./scripts/dpl_system_setup.sh
    sudo ./scripts/dpl_system_setup.sh --dev 0000:08:00.0
    

The dpl_system_setup.sh script performs the following:

  • Configures mlxconfig values:

    • FLEX_PARSER_PROFILE_ENABLE=4

    • PROG_PARSE_GRAPH=true

    • SRIOV_EN=1

  • Enables SR-IOV

  • Sets up initial DPL Runtime Service configuration folder at /etc/dpl_rt_service/

  • Configures hugepages
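After the script has run, you may want to confirm the firmware values it sets. The sketch below assumes the usual `mlxconfig -d <dev> q` layout of one `<NAME>  <VALUE>` pair per line, with booleans shown as e.g. `True(1)`. Note that this query reports the "Next Boot" values, which only take effect after a reboot or firmware reset. `mlx_value` and `check_dpl_fw` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): compare mlxconfig query output with the
# values dpl_system_setup.sh is documented to set.

# mlx_value <NAME>: read mlxconfig query output on stdin, print the value,
# reducing forms such as "True(1)" to the number in parentheses.
mlx_value() {
    local v re='\(([0-9]+)\)$'
    v=$(awk -v key="$1" '$1 == key { print $2; exit }')
    if [[ "$v" =~ $re ]]; then
        v="${BASH_REMATCH[1]}"
    fi
    echo "$v"
}

# check_dpl_fw <query-output>: print OK/MISMATCH per setting; fail on mismatch.
check_dpl_fw() {
    local out="$1" rc=0 pair name want got
    for pair in FLEX_PARSER_PROFILE_ENABLE=4 PROG_PARSE_GRAPH=1 SRIOV_EN=1; do
        name="${pair%%=*}" want="${pair#*=}"
        got=$(printf '%s\n' "$out" | mlx_value "$name")
        if [ "$got" = "$want" ]; then
            echo "OK        $name=$got"
        else
            echo "MISMATCH  $name=${got:-<missing>} (expected $want)"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: `check_dpl_fw "$(sudo mlxconfig -d 0000:03:00.0 q)"`.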

The dpl_system_setup.sh script also accepts optional arguments that control the hugepages configuration.

In DPU mode, if you configure more than 4GB of hugepages, you must also raise the spec->resources->limits->hugepages-2Mi limit in dpl_rt_service.yaml.
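To see whether you are past that threshold, the reserved amount can be computed from /proc/meminfo as HugePages_Total × Hugepagesize (for example, 2048 pages × 2048 kB = 4096 MiB). A minimal sketch, treating 4GB as 4096 MiB and counting only pages of the default size reported by Hugepagesize; `hugepages_mib` and `check_hugepages_limit` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): report reserved hugepage memory and flag
# when it exceeds 4096 MiB. Only default-size hugepages are counted.

# hugepages_mib [meminfo-file]: print HugePages_Total * Hugepagesize in MiB.
hugepages_mib() {
    awk '/^HugePages_Total:/ { total = $2 }
         /^Hugepagesize:/    { size_kb = $2 }
         END { printf "%d\n", total * size_kb / 1024 }' "${1:-/proc/meminfo}"
}

check_hugepages_limit() {
    local mib
    mib=$(hugepages_mib "$@")
    echo "Hugepages reserved: ${mib} MiB"
    if [ "$mib" -gt 4096 ]; then
        echo "More than 4GB: raise spec->resources->limits->hugepages-2Mi in dpl_rt_service.yaml"
    fi
}
```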

4.3. Editing the Configuration Files

Create one or more device configuration files based on the provided template configuration file. See DPL Service Configuration for details.

For example:
Bash
sudo cp /etc/dpl_rt_service/devices.d/NAME.conf.template /etc/dpl_rt_service/devices.d/1000.conf
# Then update /etc/dpl_rt_service/devices.d/1000.conf as needed.
sudo vim /etc/dpl_rt_service/devices.d/1000.conf

You must create at least one device configuration file; otherwise, the DPL Runtime Service container cannot start.
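A pre-start check for this requirement can be scripted. The sketch assumes that device files are the `*.conf` files in /etc/dpl_rt_service/devices.d/ (as in the 1000.conf example); the template, NAME.conf.template, does not match that pattern and is not counted. `count_device_confs` and `require_device_conf` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): make sure at least one device
# configuration file exists before starting the container.

# count_device_confs [dir]: print the number of *.conf files in dir.
count_device_confs() {
    local dir="${1:-/etc/dpl_rt_service/devices.d}" n=0 f
    for f in "$dir"/*.conf; do
        # An unmatched glob stays literal; -f filters it out.
        [ -f "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

require_device_conf() {
    local n
    n=$(count_device_confs "$@")
    if [ "$n" -eq 0 ]; then
        echo "No device configuration files found; the container will not start." >&2
        return 1
    fi
    echo "Found $n device configuration file(s)."
}
```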

4.4. Firewall Configuration to Open gRPC Server Ports

The DPL Runtime Service runs several gRPC servers, each listening on a dedicated TCP port and serving a corresponding DPL Developer tool. Make sure these ports are accessible from the system(s) on which you plan to run the DPL Developer tools, because each tool connects to the DPL Runtime Service through its gRPC server's TCP port.

The ports are configurable (see server_tcp_port settings at DPL Service Configuration). By default, they have the following values:

gRPC server            TCP port
P4 Runtime             9559
DPL Admin              9600
DPL Nspect/Debugger    9560

Example for allowing the ports on RHEL 9
Bash
sudo firewall-cmd --permanent --add-port=9559/tcp
sudo firewall-cmd --permanent --add-port=9600/tcp
sudo firewall-cmd --permanent --add-port=9560/tcp
sudo firewall-cmd --reload

# List Configurations to confirm ports were allowed.
sudo firewall-cmd --list-all
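Once the service is running, you can confirm from the developer-tools machine that the ports are actually reachable, since a firewall on either host or on the network path can still block them. This is a sketch, assuming bash was built with `/dev/tcp` support and that coreutils `timeout` is available; the default ports above are hard-coded, and `port_open` and `check_dpl_ports` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): test TCP reachability of the default
# DPL gRPC ports. Adjust the list if server_tcp_port settings were changed.

# port_open <host> <port> [timeout-seconds]: succeeds if a TCP connect works.
port_open() {
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# check_dpl_ports <host>: report each port; fail if any is unreachable.
check_dpl_ports() {
    local host="$1" port rc=0
    for port in 9559 9600 9560; do
        if port_open "$host" "$port"; then
            echo "port $port: reachable"
        else
            echo "port $port: NOT reachable"
            rc=1
        fi
    done
    return "$rc"
}
```

Usage: `check_dpl_ports <address-of-DPL-system>`. The ports only accept connections while the DPL Runtime Service is running.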

4.5. Starting the DPL Runtime Service Container

Once your configuration files are ready, use the dpl_rt_service_ctl.sh script to start the container. Before running the script for the first time, grant it execute permission:

Bash
sudo chmod +x ./scripts/dpl_rt_service_ctl.sh
sudo ./scripts/dpl_rt_service_ctl.sh --start

For DPU mode, the script copies the YAML file into the /etc/kubelet.d/ directory, which triggers the automatic creation and start of the DPL RT Service Pod and container.

For Host mode, the script will start a Docker container named dpl-rt-service.

Allow a few minutes for the container to start. To monitor status:

  • For DPU mode:



    • Check logs:

      Bash
      sudo journalctl -u kubelet --since -5m
      
    • List images:

      Bash
      sudo crictl images
      
    • List pods:

      Bash
      sudo crictl pods
      
  • For Host mode:

    • Check logs:

      Bash
      sudo docker logs dpl-rt-service
      
    • List images:

      Bash
      sudo docker images
      
  • For both modes, view the runtime log:

    Bash
    sudo tail -f /var/log/doca/dpl_rt_service/dpl_rtd.log
    

If the container fails to start because of configuration errors, /var/log/doca/dpl_rt_service/dpl_rtd.log may be empty or may not contain the relevant errors. In that case, view the container logs directly:

  • For DPU:

    Bash
    sudo crictl logs $(sudo crictl ps -a | grep dpl-rt-service | awk '{print $1}')
    
  • For Host:

    Bash
    sudo docker logs dpl-rt-service
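Instead of re-running the status commands by hand, you can poll until the container is running. A minimal sketch, assuming the container name dpl-rt-service used above; the 1-second poll interval is an arbitrary choice, and `wait_for`, `host_running`, and `dpu_running` are hypothetical helpers. Note that "running" does not necessarily mean the service has finished initializing, so still check the logs.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): wait until the dpl-rt-service container runs.

# wait_for <timeout-seconds> <command...>: rerun command until it succeeds
# or the timeout elapses (then return 1).
wait_for() {
    local timeout="$1" start
    shift
    start=$SECONDS
    until "$@"; do
        if (( SECONDS - start >= timeout )); then
            return 1
        fi
        sleep 1
    done
}

# Host mode: docker lists the container ID when a matching container runs.
host_running() {
    [ -n "$(sudo docker ps -q --filter name=dpl-rt-service)" ]
}

# DPU mode: crictl lists IDs of running containers whose name matches.
dpu_running() {
    [ -n "$(sudo crictl ps -q --name dpl-rt-service)" ]
}
```

Usage: `wait_for 300 dpu_running && echo "dpl-rt-service is running"` (or `host_running` in Host mode).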
    

4.6. Stopping the DPL Runtime Service Container

Stop the container by using the dpl_rt_service_ctl.sh script:


Bash
sudo ./scripts/dpl_rt_service_ctl.sh --stop

For DPU mode, the script will remove the YAML file from the /etc/kubelet.d/ directory.

For Host mode, the script will stop the Docker container named dpl-rt-service.

To confirm that the pod (DPU) or container (Host) is gone (this may take a few seconds):

  • For DPU:

    sudo crictl pods | grep dpl-rt-service
    
  • For Host:

    sudo docker ps | grep dpl-rt-service
    

4.7. Restarting the DPL Runtime Service After Configuration Changes

Once the DPL Runtime Service container is up and running, any change to a file under the /etc/dpl_rt_service/ configuration folder requires restarting the container for the change to take effect.

Perform the following steps to restart the container:

  1. Stop the container.

    Bash
    sudo ./scripts/dpl_rt_service_ctl.sh --stop
    
  2. Wait for the container to stop (see Stopping the DPL Runtime Service Container for how to confirm this).

  3. Start the container:

    Bash
    sudo ./scripts/dpl_rt_service_ctl.sh --start
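The three steps can be combined into one function that waits for the pod or container to disappear before starting again. This is a sketch under stated assumptions: it runs from the bundle directory, the mode is passed as `dpu` or `host`, and the 120-second timeout and 2-second poll interval are arbitrary. `restart_dpl`, `still_present`, and the `CTL`/`SUDO` overrides are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): stop, wait until gone, then start.

CTL="${CTL:-./scripts/dpl_rt_service_ctl.sh}"

# still_present <dpu|host>: succeeds while the pod (dpu) or container (host) exists.
still_present() {
    case "$1" in
        dpu)  [ -n "$(sudo crictl pods -q --name dpl-rt-service)" ] ;;
        host) [ -n "$(sudo docker ps -q --filter name=dpl-rt-service)" ] ;;
    esac
}

# restart_dpl <dpu|host> [timeout-seconds]
restart_dpl() {
    local mode="$1" timeout="${2:-120}" deadline
    case "$mode" in
        dpu|host) ;;
        *) echo "usage: restart_dpl dpu|host [timeout]" >&2; return 2 ;;
    esac
    ${SUDO-sudo} "$CTL" --stop || return 1
    deadline=$((SECONDS + timeout))
    while still_present "$mode"; do
        if (( SECONDS >= deadline )); then
            echo "dpl-rt-service still present after ${timeout}s" >&2
            return 1
        fi
        sleep 2
    done
    ${SUDO-sudo} "$CTL" --start
}
```

Usage: `restart_dpl dpu` after editing files under /etc/dpl_rt_service/.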
    

4.8. End-to-End Installation Steps

Replace device IDs and filenames as appropriate for your setup.

  • DPU example:

    Bash
    # Download NGC CLI tool:
    wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.5.2/files/ngccli_arm64.zip -O ngccli_arm64.zip
    unzip ngccli_arm64.zip
    
    # Download the DPL Runtime Service Resources bundle:
    ./ngc-cli/ngc registry resource download-version "nvidia/doca/dpl_rt_service"
    
    # Prepare DPU and restart services:
    cd dpl_rt_service_va.b.c-docax.y.z
    chmod +x ./scripts/dpl_system_setup.sh
    sudo ./scripts/dpl_system_setup.sh
    sudo systemctl restart kubelet.service
    sudo systemctl restart containerd.service
    
    # Create a device configuration file with relevant interfaces info:
    sudo cp /etc/dpl_rt_service/devices.d/NAME.conf.template /etc/dpl_rt_service/devices.d/1000.conf
    sudo vim /etc/dpl_rt_service/devices.d/1000.conf
    
    # Launch the Pod and container:
    sudo ./scripts/dpl_rt_service_ctl.sh --start
    
  • Host example:

    Bash
    # Download NGC CLI tool:
    wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.5.2/files/ngccli_linux.zip -O ngccli_linux.zip
    unzip ngccli_linux.zip
    
    # Download the DPL Runtime Service Resources bundle:
    ./ngc-cli/ngc registry resource download-version "nvidia/doca/dpl_rt_service"
    
    # Prepare the host:
    cd dpl_rt_service_va.b.c-docax.y.z
    chmod +x ./scripts/dpl_system_setup.sh
    sudo ./scripts/dpl_system_setup.sh --dev 0000:08:00.0
    
    # Create a device configuration file with relevant interfaces info:
    sudo cp /etc/dpl_rt_service/devices.d/NAME.conf.template /etc/dpl_rt_service/devices.d/1000.conf
    sudo vim /etc/dpl_rt_service/devices.d/1000.conf
    
    # Launch the Docker container:
    sudo ./scripts/dpl_rt_service_ctl.sh --start
    

5. Troubleshooting

For additional troubleshooting steps and deeper explanations, refer to the BlueField Container Deployment Guide.

  • View recent kubelet logs (DPU only):

    sudo journalctl -u kubelet --since -5m

  • View logs of the dpl-rt-service container (helpful if /var/log/doca/dpl_rt_service/dpl_rtd.log is missing or incomplete):

    • For DPU:

      sudo crictl logs $(sudo crictl ps -a | grep dpl-rt-service | awk '{print $1}')

    • For Host:

      sudo docker logs dpl-rt-service

  • List pulled container images:

    • For DPU:

      sudo crictl images

    • For Host:

      sudo docker images

  • List all created pods (DPU only):

    sudo crictl pods

  • List running containers:

    • For DPU:

      sudo crictl ps

    • For Host:

      sudo docker ps

  • View the DPL service log file: /var/log/doca/dpl_rt_service/dpl_rtd.log

Make sure the following conditions are met before or during deployment:

  • VFs were created before deploying the container (if using SR-IOV).

  • All required configuration files exist under /etc/dpl_rt_service/, are correctly named, and include valid device IDs.

  • Network interface names and MTU settings match the physical and virtual network topology.

  • Firmware is up to date and matches DOCA compatibility requirements.

  • For DPU mode, BlueField is operating in DPU mode (verify with sudo mlxconfig -d <pci-device> q).
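When asking for help, it can be convenient to capture the troubleshooting checks above in one place. The following is a sketch only: it runs the same commands listed above (run the script as root, e.g. via sudo), records missing tools as "skipped" instead of aborting, and uses a file layout chosen purely for illustration. `run_check`, `dpu_container_logs`, and `collect_dpl_diag` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch (hypothetical helpers): save troubleshooting output into one directory.

# run_check <outdir> <name> <command...>: save stdout and stderr to <outdir>/<name>.txt
run_check() {
    local out="$1/$2.txt"
    shift 2
    if command -v "$1" > /dev/null 2>&1; then
        # On failure, append the command's exit status to its output file.
        "$@" > "$out" 2>&1 || echo "(exit status $?)" >> "$out"
    else
        echo "skipped: $1 not found" > "$out"
    fi
}

# DPU: logs of the first dpl-rt-service container ID reported by crictl.
dpu_container_logs() {
    local id
    id=$(crictl ps -a -q --name dpl-rt-service | head -n 1)
    if [ -z "$id" ]; then
        echo "no dpl-rt-service container found"
        return 1
    fi
    crictl logs "$id"
}

# collect_dpl_diag <dpu|host> [outdir]: print the output directory when done.
collect_dpl_diag() {
    local mode="$1" dir="${2:-dpl-diag-$(date +%Y%m%d-%H%M%S)}"
    case "$mode" in
        dpu|host) ;;
        *) echo "usage: collect_dpl_diag dpu|host [outdir]" >&2; return 2 ;;
    esac
    mkdir -p "$dir"
    if [ "$mode" = dpu ]; then
        run_check "$dir" kubelet journalctl -u kubelet --since -5m --no-pager
        run_check "$dir" container-logs dpu_container_logs
        run_check "$dir" images crictl images
        run_check "$dir" pods crictl pods
        run_check "$dir" containers crictl ps -a
    else
        run_check "$dir" container-logs docker logs dpl-rt-service
        run_check "$dir" images docker images
        run_check "$dir" containers docker ps -a
    fi
    run_check "$dir" dpl_rtd cat /var/log/doca/dpl_rt_service/dpl_rtd.log
    echo "$dir"
}
```

Usage: `sudo bash -c 'source ./collect_dpl_diag.sh && collect_dpl_diag dpu'`, where collect_dpl_diag.sh is a hypothetical file holding the functions above.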
