For better formatting of the code, follow this guide in the source GitHub repo at github.com/NVIDIA/doca-platform, under docs/public/user-guides/host-trusted/use-cases/hbn/README.md.
In addition, for this use case, the Top of Rack (ToR) switch must support BGP and EVPN.
The HBN DPUService has the following additional requirements:
Software Prerequisites
This guide uses the following tools, which must be installed on the machine where the commands in this guide are run.
kubectl
helm
envsubst
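One way to confirm that these tools are available before starting is to check that each is on the PATH. The following is a minimal sketch; the function name `check_tools` is hypothetical and is not part of DPF.

```shell
# Hypothetical helper (not part of DPF): report any tool missing from PATH.
# Returns 0 when every named tool is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_tools kubectl helm envsubst` prints one line per missing tool and returns a non-zero status if any is absent.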
Kubernetes Prerequisites
Control plane setup is complete before starting this guide
CNI installed before starting this guide
Worker nodes are not added until indicated by this guide
High-speed ports are used for secondary workload network and not for primary CNI
Virtual Functions
A number of virtual functions (VFs) are created on the hosts when DPUs are provisioned. Some of these VFs are reserved for specific uses:
The first VF (vf0) is used by provisioning components.
The remaining VFs are allocated by the SR-IOV Device Plugin.
Installation Guide
0. Required Variables
The following variables are required by this guide. A sensible default is provided where it makes sense, but many will be specific to the target infrastructure.
Commands in this guide are run in the same directory that contains this readme.
Environment variables file
## Virtual IP used by the load balancer for the DPU Cluster. Must be a reserved IP from the management subnet and not allocated by DHCP.
export DPUCLUSTER_VIP=
## Interface on which the DPUCluster load balancer will listen. Should be the management interface of the control plane node.
export DPUCLUSTER_INTERFACE=
## The repository URL for the NVIDIA Helm chart registry.
## Usually this is the NVIDIA Helm NGC registry. For development purposes, this can be set to a different repository.
export HELM_REGISTRY_REPO_URL=https://helm.ngc.nvidia.com/nvidia/doca
## The repository URL for the HBN container image.
## Usually this is the NVIDIA NGC registry. For development purposes, this can be set to a different repository.
export HBN_NGC_IMAGE_URL=nvcr.io/nvidia/doca/doca_hbn
## The DPF REGISTRY is the Helm repository URL where the DPF Operator Chart resides.
## Usually this is the NVIDIA Helm NGC registry. For development purposes, this can be set to a different repository.
export REGISTRY=https://helm.ngc.nvidia.com/nvidia/doca
## The DPF TAG is the version of the DPF components which will be deployed in this guide.
export TAG=v26.4.0
## URL to the BFB used in the `bfb.yaml` and linked by the DPUSet.
export BFB_URL="https://content.mellanox.com/BlueField/BFBs/Ubuntu24.04/bf-bundle-3.4.0-92_26.04_ubuntu-24.04_64k_prod.bfb"
Modify the variables in manifests/00-env-vars/envvars.env to fit your environment, then source the file:
source manifests/00-env-vars/envvars.env
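Because some variables in the file have no default (for example `DPUCLUSTER_VIP` and `DPUCLUSTER_INTERFACE`), it can help to confirm that none were left empty after sourcing. The following is a minimal sketch; `require_vars` is a hypothetical helper, not something shipped with this guide.

```shell
# Hypothetical helper: fail if any named variable is unset or empty.
# Names are passed without the leading "$" and must be trusted input,
# since eval is used to look each one up.
require_vars() {
  status=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: $name" >&2
      status=1
    fi
  done
  return "$status"
}
```

A possible invocation is `require_vars DPUCLUSTER_VIP DPUCLUSTER_INTERFACE HELM_REGISTRY_REPO_URL HBN_NGC_IMAGE_URL REGISTRY TAG BFB_URL`, which returns a non-zero status and names each variable that still needs a value.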
1. DPF Operator Installation
Dependencies
Before deploying the DPF Operator, ensure that Helm is properly configured according to the Helm prerequisites.
This is a critical prerequisite step that must be completed for the DPF Operator to function properly.
These verification commands may need to be run multiple times to ensure the condition is met.
Verify the DPF Operator installation with:
## Ensure the DPF Operator deployment is available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-operator-controller-manager
## Ensure all pods in the DPF Operator system are ready.
kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all
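Since the verification commands may need several attempts, they can be wrapped in a small retry loop instead of being re-run by hand. This is only a sketch; the `retry` function and the attempt count and delay used with it are assumptions, not part of the guide.

```shell
# Hypothetical helper: run a command until it succeeds.
#   $1 = maximum number of attempts, $2 = seconds to sleep between attempts,
#   remaining arguments = the command to run.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, `retry 10 30 kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all` would try up to 10 times with 30 seconds between attempts. The same wrapper can be applied to the verification commands in later steps.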
2. DPF System Installation
This section creates the DPF system components and the basic infrastructure required for a functioning DPF-enabled cluster.
DPUCluster to serve as Kubernetes control plane for DPU nodes
YAML
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUCluster
metadata:
  name: dpu-cplane-tenant1
  namespace: dpu-cplane-tenant1
spec:
  type: kamaji
  maxNodes: 1000
  clusterEndpoint:
    # Deploy keepalived instances on the nodes that match the given nodeSelector.
    keepalived:
      # Interface on which keepalived will listen. Should be the OOB interface of the control plane node.
      interface: $DPUCLUSTER_INTERFACE
      # Virtual IP reserved for the DPU Cluster load balancer. Must not be allocatable by DHCP.
      vip: $DPUCLUSTER_VIP
      # virtualRouterID must be in the range [1,255]. Make sure it does not duplicate the ID of any keepalived process already running on the host.
      virtualRouterID: 126
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
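The manifest above contains `$DPUCLUSTER_INTERFACE` and `$DPUCLUSTER_VIP` placeholders. If it is rendered with `envsubst`, note that `envsubst` replaces an unset variable with an empty string rather than failing, so it can be useful to list the placeholders a manifest references before applying it. The sketch below assumes placeholders use the plain `$NAME` form (not `${NAME}`); `manifest_vars` is a hypothetical helper.

```shell
# Hypothetical helper: print the distinct $NAME placeholders found on stdin,
# one per line, without the leading "$".
# Comment lines are skipped so that placeholders mentioned only in comments
# are not reported.
manifest_vars() {
  grep -v '^[[:space:]]*#' \
    | grep -o '\$[A-Za-z_][A-Za-z0-9_]*' \
    | sort -u \
    | tr -d '$'
}
```

Feeding a manifest to `manifest_vars` on stdin yields the variable names it depends on, which can then be checked against the current environment before the manifest is rendered and applied.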
Verification
These verification commands may need to be run multiple times to ensure the condition is met.
Verify the DPF System with:
## Ensure the provisioning and DPUService controller manager deployments are available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-provisioning-controller-manager dpuservice-controller-manager
## Ensure all other deployments in the DPF Operator system are Available.
kubectl rollout status deployment --namespace dpf-operator-system
## Ensure the DPUCluster is ready for nodes to join.
kubectl wait --for=condition=ready --namespace dpu-cplane-tenant1 dpucluster --all
3. Enable Accelerated Interfaces
Traffic can be routed through HBN on the worker node by mounting the DPU physical interface into a pod.
The NodeSRIOVDevicePluginConfig defines which VFs on the DPU physical functions are exposed as SR-IOV device plugin resources on the host node. The DPF Operator's NodeSRIOVDevicePluginController (enabled in the DPFOperatorConfig) manages the SR-IOV device plugin pods based on this configuration.
The NodeSRIOVDevicePluginConfig is linked to DPUs via the noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig annotation on the DPU object. This annotation is set in the DPUDeployment's dpuAnnotations field.
Verification
These verification commands may need to be run multiple times to ensure the condition is met.
Verify the accelerated network prerequisites with:
## Ensure all pods in the nvidia-network-operator namespace are ready.
kubectl wait --for=condition=Ready --namespace nvidia-network-operator pods --all
## Expect the Multus Daemonset to be successfully rolled out.
kubectl rollout status daemonset --namespace nvidia-network-operator kube-multus-ds
4. DPU Provisioning and Service Installation
In this step we deploy our DPUs and the services that will run on them.
The user is expected to create a DPUDeployment object that reflects a set of DPUServices that should run on a set of DPUs.
Create the DPUDeployment, DPUServiceConfig, DPUServiceTemplate and other necessary objects
If more than one DPU exists per node, apply the relevant selector in the DPUDeployment to select the appropriate DPU. See DPUDeployment - DPUs Configuration for more about the selectors.
These verification commands may need to be run multiple times to ensure the condition is met.
Note that the DPUService name will have a random suffix. For example, doca-hbn-l2xsl.
Verify the DPU and Service installation with:
## Ensure the BFB is ready
kubectl wait --for=jsonpath='{.status.phase}'=Ready --namespace dpf-operator-system bfb bf-bundle-$TAG --timeout=600s
## Ensure the DPUServices are created and have been reconciled.
kubectl wait --for=condition=ApplicationsReconciled --namespace dpf-operator-system dpuservices -l svc.dpu.nvidia.com/owned-by-dpudeployment=dpf-operator-system_hbn-only
## Ensure the DPUServiceIPAMs have been reconciled
kubectl wait --for=condition=DPUIPAMObjectReconciled --namespace dpf-operator-system dpuserviceipam --all
## Ensure the DPUServiceInterfaces have been reconciled
kubectl wait --for=condition=ServiceInterfaceSetReconciled --namespace dpf-operator-system dpuserviceinterface --all
## Ensure the DPUServiceChains have been reconciled
kubectl wait --for=condition=ServiceChainSetReconciled --namespace dpf-operator-system dpuservicechain --all
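Because each DPUService name carries a random suffix, a specific service is easier to find by its prefix than by its full name. The sketch below filters a list of object names; `filter_by_prefix` is a hypothetical helper, and the suffix format (lowercase letters and digits) is an assumption based on the `doca-hbn-l2xsl` example.

```shell
# Hypothetical helper: keep only names of the form "<prefix>-<suffix>",
# where the suffix is assumed to be lowercase letters and digits.
# A leading "kind/" segment, as printed by `kubectl get -o name`, is tolerated.
#   $1 = name prefix, for example doca-hbn
filter_by_prefix() {
  grep -E "^([^/]+/)?$1-[a-z0-9]+\$"
}
```

One possible use is `kubectl get dpuservices --namespace dpf-operator-system -o name | filter_by_prefix doca-hbn`, which prints only the DPUService objects whose names start with `doca-hbn-`.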
5. Test Traffic
Add worker nodes to the cluster
At this point, workers should be added to the cluster. Each worker node should be configured in line with the prerequisites. As workers are added to the cluster, DPUs are provisioned and DPUServices begin to start.
You can verify the status of the DPUDeployment and its components with the following command:
HBN functionality can be tested by pinging between the pods and services deployed in the default namespace.
Uninstall
This section describes how to clean up the DPF components installed in this guide. It is recommended to run this section only after the DPF Operator and DPUCluster are no longer needed.
Note: a race condition can occur when deleting the underlying Kamaji cluster, which runs the DPU cluster control plane in this guide. If that happens, it may be necessary to remove finalizers manually from the DPUCluster and Datastore objects.
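If finalizers do need to be removed manually, one common approach is a `kubectl patch` with a merge patch that sets `metadata.finalizers` to null. The following is a sketch only; `clear_finalizers` is a hypothetical helper, and clearing finalizers bypasses the controller's own cleanup, so it is best kept as a last resort.

```shell
# Hypothetical helper: clear all finalizers on one object with a merge patch.
#   $1 = resource kind, $2 = object name, $3 = namespace
# KUBECTL can be overridden (for example with "echo") to preview the command
# without contacting the cluster.
clear_finalizers() {
  kind=$1
  name=$2
  namespace=$3
  ${KUBECTL:-kubectl} patch "$kind" "$name" --namespace "$namespace" \
    --type merge --patch '{"metadata":{"finalizers":null}}'
}
```

For the DPUCluster created in this guide, the call would be `clear_finalizers dpucluster dpu-cplane-tenant1 dpu-cplane-tenant1`. The Datastore objects would need their own kind and name, which this guide does not specify.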