For better code formatting, follow this guide from the source GitHub repo at github.com/NVIDIA/doca-platform, in docs/public/user-guides/zero-trust/use-cases/hbn/README.md.
This configuration provides instructions for deploying the NVIDIA DOCA Platform Framework (DPF) on high-performance, bare-metal infrastructure in Zero Trust mode, utilizing DPU BMC and Redfish. It focuses on provisioning NVIDIA® BlueField®-3 DPUs using DPF, installing the HBN DPUService on those DPUs and enabling workload traffic to pass through HBN before leaving the DPU.
Prerequisites
To run this guide, clone the repo from github.com/NVIDIA/doca-platform and change to the docs/public/user-guides/zero-trust/use-cases/hbn directory.
The system is set up as described in the prerequisites.
In addition, for this use case, the Top of Rack (ToR) switch must support BGP and EVPN. It should be configured for unnumbered BGP towards the two ports of the DPU, where HBN acts as the BGP peer, and should advertise routes over BGP to allow ECMP from the DPU. For details, see the RDG for DPF Zero Trust (DPF-ZT) with HBN DPU Service.
Software prerequisites
The following tools must be installed on the machine where the commands contained in this guide run:
- kubectl
- helm
- envsubst
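As a quick pre-flight check, the tool list above can be verified from the shell. This is a minimal sketch; the `check_tools` helper is hypothetical and not part of the repository.

```shell
# Report any required tool that is not found on PATH.
# Returns 0 when every tool is present, 1 otherwise.
# (check_tools is an illustrative helper, not part of DPF.)
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if ! command -v "$ct_tool" >/dev/null 2>&1; then
            echo "missing: $ct_tool" >&2
            ct_missing=1
        fi
    done
    return "$ct_missing"
}

# Usage:
#   check_tools kubectl helm envsubst
```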
Installation guide
This guide assumes that the setup includes only 2 workers with DPUs. If your setup has more than 2 workers, then you will need to set additional variables to enable the rest of the DPUs.
0. Required variables
The following variables are required by this guide. A sensible default is provided where it makes sense, but many will be specific to the target infrastructure.
Commands in this guide are run in the same directory that contains this readme.
Modify the variables in manifests/00-env-vars/envvars.env to fit your environment, then source the file:
source manifests/00-env-vars/envvars.env
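After sourcing the file, it can help to confirm that the variables used later on the command line are actually set. The sketch below assumes only the three variable names that appear in this guide's commands (REGISTRY, TAG, BMC_ROOT_PASSWORD); envvars.env may define more, and `require_vars` is a hypothetical helper.

```shell
# Fail if any named variable is unset or empty.
# (require_vars is an illustrative helper, not part of DPF.)
require_vars() {
    rv_status=0
    for rv_name in "$@"; do
        # Indirect lookup of the variable whose name is in $rv_name.
        eval "rv_value=\${$rv_name:-}"
        if [ -z "$rv_value" ]; then
            echo "unset or empty: $rv_name" >&2
            rv_status=1
        fi
    done
    return "$rv_status"
}

# Usage, with the variables referenced by commands in this guide:
#   require_vars REGISTRY TAG BMC_ROOT_PASSWORD
```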
1. DPF Operator installation
Create DPU BMC shared password secret
In Zero Trust mode, provisioning DPUs requires authentication with Redfish. To enable this, set the same BMC root password on all DPUs that DPF is going to manage.
For more information on how to set the BMC root password, refer to the BlueField DPU Administrator Quick Start Guide.
The password is provided to DPF by creating the following secret:
kubectl create secret generic -n dpf-operator-system bmc-shared-password --from-literal=password="$BMC_ROOT_PASSWORD"
Additional Dependencies
Before deploying the DPF Operator, ensure that Helm is properly configured according to the Helm prerequisites.
This is a critical prerequisite step that must be completed for the DPF Operator to function properly.
Deploy the DPF Operator
A number of environment variables must be set before running this command.
HTTP Registry (default)
If the $REGISTRY is an HTTP Registry (default value) use this command:
helm repo add --force-update dpf-repository ${REGISTRY}
helm repo update
helm upgrade --install -n dpf-operator-system dpf-operator dpf-repository/dpf-operator --version=$TAG
OCI Registry
For development purposes, if the $REGISTRY is an OCI Registry use this command:
helm upgrade --install -n dpf-operator-system dpf-operator $REGISTRY/dpf-operator --version=$TAG
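The choice between the two variants can be derived from the registry value itself. The sketch below assumes that an OCI registry is written with the `oci://` scheme (which is how Helm addresses OCI registries) and treats anything else as an HTTP chart repository. Both helper names are hypothetical, and the second one only prints the commands rather than running them.

```shell
# Classify a registry reference by its scheme.
# Assumption: OCI registries are given as oci://..., everything else is HTTP.
registry_kind() {
    case "$1" in
        oci://*) echo oci ;;
        *)       echo http ;;
    esac
}

# Print (not run) the install commands from this guide that match the registry.
# $1: registry, $2: chart version
print_install_cmds() {
    if [ "$(registry_kind "$1")" = oci ]; then
        echo "helm upgrade --install -n dpf-operator-system dpf-operator $1/dpf-operator --version=$2"
    else
        echo "helm repo add --force-update dpf-repository $1"
        echo "helm repo update"
        echo "helm upgrade --install -n dpf-operator-system dpf-operator dpf-repository/dpf-operator --version=$2"
    fi
}

# Usage:
#   print_install_cmds "$REGISTRY" "$TAG"
```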
Verification
These verification commands may need to be run multiple times to ensure the condition is met.
Verify the DPF Operator installation with:
## Ensure the DPF Operator deployment is available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-operator-controller-manager
## Ensure all pods in the DPF Operator system are ready.
kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all
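Because the verification commands may need several attempts, a small retry wrapper can save manual re-running. This is a generic sketch and not something DPF provides; the `retry` name, attempt count and delay are illustrative.

```shell
# Run a command until it succeeds.
# $1: maximum number of attempts, $2: seconds to sleep between attempts,
# remaining arguments: the command to run.
retry() {
    retry_max=$1
    retry_delay=$2
    shift 2
    retry_n=1
    until "$@"; do
        if [ "$retry_n" -ge "$retry_max" ]; then
            echo "gave up after $retry_n attempts: $*" >&2
            return 1
        fi
        retry_n=$((retry_n + 1))
        sleep "$retry_delay"
    done
    return 0
}

# Usage:
#   retry 30 10 kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all
```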
2. DPF system installation
This section involves creating the DPF system components and some basic infrastructure required for a functioning DPF-enabled cluster.
Deploy the DPF System components
A number of environment variables must be set before running this command.
kubectl create ns dpu-cplane-tenant1
cat manifests/02-dpf-system-installation/*.yaml | envsubst | kubectl apply -f -
This creates the objects defined in the manifests under manifests/02-dpf-system-installation.
Verification
These verification commands may need to be run multiple times to ensure the condition is met.
Verify the DPF System with:
## Ensure the provisioning and DPUService controller manager deployments are available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-provisioning-controller-manager dpuservice-controller-manager
## Ensure all other deployments in the DPF Operator system are Available.
kubectl rollout status deployment --namespace dpf-operator-system
## Ensure bfb-registry pod is running.
kubectl wait --for=condition=ready --namespace dpf-operator-system pod/bfb-registry --timeout=600s
## Ensure bfb-registry service exists.
kubectl get svc bfb-registry --namespace dpf-operator-system
## Ensure the DPUCluster is ready for nodes to join.
kubectl wait --for=condition=ready --namespace dpu-cplane-tenant1 dpucluster --all
3. DPU Provisioning and Service Installation
There are two installation options. The first uses only the PFs of the host; the second uses both PFs and VFs. Choose the one that best fits your use case.
In the following section, we provision our DPUs and the services that will run on them. The user is expected to create a DPUDeployment object that reflects a set of DPUServices that should run on a set of DPUs.
If you want to learn more about DPUDeployments, feel free to check the DPUDeployment documentation.
Using PFs
In this scenario, the PF0 and PF1 are connected to separate VRFs which means that:
- PF0 on Host 1 will be able to communicate with PF0 on Host 2
- PF0 on Host 1 will not be able to communicate with PF1 on Host 1 and 2
- PF1 on Host 1 will be able to communicate with PF1 on Host 2
- PF1 on Host 1 will not be able to communicate with PF0 on Host 1 and 2
We make use of a PF on the host to test traffic.
Create the DPUDeployment, DPUServiceConfig, DPUServiceTemplate and other necessary objects
In case more than 1 DPU exists per node, the relevant selector should be applied in the DPUDeployment to select the appropriate DPU. See DPUDeployment - DPUs Configuration to understand more about the selectors.
A number of environment variables must be set before running this command.
cat manifests/03.1-dpudeployment-installation-pf/*.yaml | envsubst | kubectl apply -f -
This deploys the objects defined in the manifests under manifests/03.1-dpudeployment-installation-pf.
Verification
These verification commands may need to be run multiple times to ensure the condition is met.
Note that the DPUService name will have a random suffix. For example, doca-hbn-l2xsl.
Verify the DPU and Service installation with:
## Ensure the BFB is ready
kubectl wait --for=jsonpath='{.status.phase}'=Ready --namespace dpf-operator-system bfb bf-bundle-$TAG --timeout=600s
## Ensure the DPUServices are created and have been reconciled.
kubectl wait --for=condition=ApplicationsReconciled --namespace dpf-operator-system dpuservices -l svc.dpu.nvidia.com/owned-by-dpudeployment=dpf-operator-system_hbn
## Ensure the DPUServiceIPAMs have been reconciled
kubectl wait --for=condition=DPUIPAMObjectReconciled --namespace dpf-operator-system dpuserviceipam --all
## Ensure the DPUServiceInterfaces have been reconciled
kubectl wait --for=condition=ServiceInterfaceSetReconciled --namespace dpf-operator-system dpuserviceinterface --all
## Ensure the DPUServiceChains have been reconciled
kubectl wait --for=condition=ServiceChainSetReconciled --namespace dpf-operator-system dpuservicechain --all
## Ensure the DPUs have the condition Initialized (this may take time)
kubectl wait --for=condition=Initialized --namespace dpf-operator-system dpu --all
or with dpfctl:
$ kubectl -n dpf-operator-system exec deploy/dpf-operator-controller-manager -- /dpfctl describe dpudeployments
NAME NAMESPACE STATUS REASON SINCE MESSAGE
DPFOperatorConfig/dpfoperatorconfig dpf-operator-system
│ ├─Ready False Pending 17m The following conditions are not ready:
│ │ * SystemComponentsReady
│ └─SystemComponentsReady False Error 16m System components must be ready for DPF Operator to continue:
│ * nvidia-k8s-ipam: DPUService dpf-operator-system/nvidia-k8s-ipam is not ready
└─DPUDeployments
└─DPUDeployment/hbn dpf-operator-system
│ ├─Ready False Pending 11m The following conditions are not ready:
│ │ * DPUSetsReady
│ └─DPUSetsReady False Pending 11m Objects are not ready:
│ * dpf-operator-system/hbn-dpuset1
├─DPUServiceChains
│ └─DPUServiceChain/hbn-8kkjz dpf-operator-system Ready: True Success 11m
├─DPUServiceInterfaces
│ └─4 DPUServiceInterfaces... dpf-operator-system Ready: True Success 11m See doca-hbn-p0-if-mcqp4, doca-hbn-p1-if-6x2hh, doca-hbn-pf0hpf-if-q9lvk, doca-hbn-pf1hpf-if-979t7
├─DPUSets
│ └─DPUSet/hbn-dpuset1 dpf-operator-system
│ ├─BFB/bf-bundle dpf-operator-system Ready: True Ready 13m File: bf-bundle-3.2.1-34_25.11_ubuntu-24.04_64k_prod.bfb, DOCA: 3.2.1
│ └─DPUs
│ ├─DPU/dpu-node-mt2402xz0f6v-mt2402xz0f6v dpf-operator-system
│ │ └─Ready False OS Installing 8m39s
│ └─DPU/dpu-node-mt2404xz0c98-mt2404xz0c98 dpf-operator-system
│ └─Ready False OS Installing 8m39s
└─Services
├─DPUServiceTemplates
│ └─DPUServiceTemplate/doca-hbn dpf-operator-system Ready: True Success 13m
└─DPUServices
└─DPUService/doca-hbn-jmj45 dpf-operator-system Ready: True Success 11m
Releasing the Node Effect Hold
Since the DPUDeployment is configured with nodeEffect.hold: true, the DPUs will pause at the "Node Effect" phase and wait for external action before proceeding with provisioning. This gives the administrator control over when the node effect is applied.
To check that DPUNodeMaintenance objects have been created and are in the hold state:
kubectl get dpunodemaintenances -n dpf-operator-system
Once you are ready for provisioning to proceed, release the hold by setting the annotation on the DPUNodeMaintenance objects to "false". You can do this per-node or all at once:
kubectl annotate --overwrite dpunodemaintenances -n dpf-operator-system --all provisioning.dpu.nvidia.com/wait-for-external-nodeeffect=false
After releasing the hold, the DPUs will proceed through the remaining provisioning phases (BFB installation, OS installation, etc.).
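The command above releases all nodes at once. A per-node variant can be sketched as follows; it reuses the same annotation and namespace, while the `release_hold` helper and the object names passed to it are placeholders (use the names shown by `kubectl get dpunodemaintenances`).

```shell
# Release the node-effect hold on selected DPUNodeMaintenance objects only.
# Arguments: one or more DPUNodeMaintenance object names.
# (release_hold is an illustrative helper, not part of DPF.)
release_hold() {
    for rh_name in "$@"; do
        kubectl annotate --overwrite dpunodemaintenances "$rh_name" \
            -n dpf-operator-system \
            provisioning.dpu.nvidia.com/wait-for-external-nodeeffect=false || return 1
    done
}

# Usage (the name is a placeholder):
#   release_hold <dpunodemaintenance-name>
```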
Making the DPUs Ready
To make the DPUs ready, we need to manually power cycle the host. To avoid corruption, do this gracefully: shut down the host and DPU, power off the server, and then power it back on. Do this only once the DPU objects report the WaitingForManualPowerCycleOrReboot condition checked below. The described flow can be automated by the admin depending on the infrastructure.
The following verification command may need to be run multiple times to ensure the condition is met.
## Ensure the DPUs have the condition WaitingForManualPowerCycleOrReboot (this may take time)
kubectl wait --for=condition=WaitingForManualPowerCycleOrReboot --namespace dpf-operator-system dpu --all
or with dpfctl:
$ kubectl -n dpf-operator-system exec deploy/dpf-operator-controller-manager -- /dpfctl describe dpudeployments
NAME NAMESPACE STATUS REASON SINCE MESSAGE
DPFOperatorConfig/dpfoperatorconfig dpf-operator-system
│ ├─Ready False Pending 66m The following conditions are not ready:
│ │ * SystemComponentsReady
│ └─SystemComponentsReady False Error 66m System components must be ready for DPF Operator to continue:
│ * nvidia-k8s-ipam: DPUService dpf-operator-system/nvidia-k8s-ipam is not ready
└─DPUDeployments
└─DPUDeployment/hbn dpf-operator-system
│ ├─Ready False Pending 61m The following conditions are not ready:
│ │ * DPUSetsReady
│ └─DPUSetsReady False Pending 61m Objects are not ready:
│ * dpf-operator-system/hbn-dpuset1
├─DPUServiceChains
│ └─DPUServiceChain/hbn-8kkjz dpf-operator-system Ready: True Success 61m
├─DPUServiceInterfaces
│ └─4 DPUServiceInterfaces... dpf-operator-system Ready: True Success 61m See doca-hbn-p0-if-mcqp4, doca-hbn-p1-if-6x2hh, doca-hbn-pf0hpf-if-q9lvk, doca-hbn-pf1hpf-if-979t7
├─DPUSets
│ └─DPUSet/hbn-dpuset1 dpf-operator-system
│ ├─BFB/bf-bundle dpf-operator-system Ready: True Ready 62m File: bf-bundle-3.2.1-34_25.11_ubuntu-24.04_64k_prod.bfb, DOCA: 3.2.1
│ └─DPUs
│ ├─DPU/dpu-node-mt2402xz0f6v-mt2402xz0f6v dpf-operator-system
│ │ ├─Rebooted False WaitingForManualPowerCycleOrReboot 11m
│ │ └─Ready False Rebooting 11m
│ └─DPU/dpu-node-mt2404xz0c98-mt2404xz0c98 dpf-operator-system
│ ├─Rebooted False WaitingForManualPowerCycleOrReboot 5m49s
│ └─Ready False Rebooting 5m49s
└─Services
├─DPUServiceTemplates
│ └─DPUServiceTemplate/doca-hbn dpf-operator-system Ready: True Success 62m
└─DPUServices
└─DPUService/doca-hbn-jmj45 dpf-operator-system Ready: True Success 61m
At this point, we have to power cycle the hosts. Once all the hosts are back online, we have to remove an annotation from the DPUNodes. The user can choose to remove this annotation node by node but to make it simpler in this guide, we do that all at once.
kubectl annotate dpunodes -n dpf-operator-system --all provisioning.dpu.nvidia.com/dpunode-external-reboot-required-
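For the node-by-node alternative, the same annotation can be removed from one DPUNode at a time, for example right after each host comes back online. This is a sketch only; `clear_reboot_required` is a hypothetical helper and the DPUNode names are placeholders (list them with `kubectl get dpunodes -n dpf-operator-system`).

```shell
# Remove the external-reboot annotation from selected DPUNode objects.
# Arguments: one or more DPUNode object names.
# (clear_reboot_required is an illustrative helper, not part of DPF.)
clear_reboot_required() {
    for cr_node in "$@"; do
        # The trailing "-" tells kubectl to remove the annotation.
        kubectl annotate dpunodes "$cr_node" -n dpf-operator-system \
            provisioning.dpu.nvidia.com/dpunode-external-reboot-required- || return 1
    done
}

# Usage (the name is a placeholder):
#   clear_reboot_required <dpunode-name>
```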
After this is done, we should expect that all DPUs become Ready:
kubectl wait --for="jsonpath={.status.phase}=Ready" --namespace dpf-operator-system dpu --all
or with dpfctl:
$ kubectl -n dpf-operator-system exec deploy/dpf-operator-controller-manager -- /dpfctl describe dpudeployments
NAME NAMESPACE STATUS REASON SINCE MESSAGE
DPFOperatorConfig/dpfoperatorconfig dpf-operator-system Ready: True Success 8m19s
└─DPUDeployments
└─DPUDeployment/hbn dpf-operator-system Ready: True Success 19s
├─DPUServiceChains
│ └─DPUServiceChain/hbn-8kkjz dpf-operator-system Ready: True Success 90m
├─DPUServiceInterfaces
│ └─4 DPUServiceInterfaces... dpf-operator-system Ready: True Success 48s See doca-hbn-p0-if-mls69, doca-hbn-p1-if-dv6ds, doca-hbn-pf0hpf-if-q9lvk, doca-hbn-pf1hpf-if-979t7
├─DPUSets
│ └─DPUSet/hbn-dpuset1 dpf-operator-system
│ ├─BFB/bf-bundle dpf-operator-system Ready: True Ready 91m File: bf-bundle-3.2.1-34_25.11_ubuntu-24.04_64k_prod.bfb, DOCA: 3.2.1
│ └─DPUs
│ └─2 DPUs... dpf-operator-system Ready: True DPUReady 25m See dpu-node-mt2402xz0f6v-mt2402xz0f6v, dpu-node-mt2404xz0c98-mt2404xz0c98
└─Services
├─DPUServiceTemplates
│ └─DPUServiceTemplate/doca-hbn dpf-operator-system Ready: True Success 91m
└─DPUServices
└─DPUService/doca-hbn-6rhsx dpf-operator-system Ready: True Success 21s
Test Traffic
After the DPUs are provisioned and the rest of the objects are Ready, we can test traffic by assigning an IP to PF0 on the host for each DPU and running a simple ping. Although the configuration enables both PFs, we focus on PF0 for testing traffic. Assuming PF0 is named ens5f0np0:
On the host with DPU with serial number DPU1_SERIAL:
ip link set dev ens5f0np0 up
ip addr add 10.0.121.1/29 dev ens5f0np0
ip route add 10.0.121.0/24 dev ens5f0np0 via 10.0.121.2
On the host with DPU with serial number DPU2_SERIAL:
ip link set dev ens5f0np0 up
ip addr add 10.0.121.9/29 dev ens5f0np0
ip route add 10.0.121.0/24 dev ens5f0np0 via 10.0.121.10
On the host with DPU with serial number DPU1_SERIAL:
$ ping 10.0.121.9 -c3
PING 10.0.121.9 (10.0.121.9) 56(84) bytes of data.
64 bytes from 10.0.121.9: icmp_seq=1 ttl=64 time=0.387 ms
64 bytes from 10.0.121.9: icmp_seq=2 ttl=64 time=0.344 ms
64 bytes from 10.0.121.9: icmp_seq=3 ttl=64 time=0.396 ms
--- 10.0.121.9 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.344/0.375/0.396/0.022 ms
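The addressing used in this test can be checked with a little arithmetic. A /29 covers 8 addresses, so the two hosts (10.0.121.1 and 10.0.121.9) sit in different /29 subnets, and each host shares its subnet only with its own next hop. That is why the broader 10.0.121.0/24 route via the next hop is needed for the ping to leave the local subnet. The `net29` helper below is illustrative.

```shell
# Print the /29 network that contains an IPv4 address.
# A /29 spans 8 addresses, so the network base is the last octet
# with its low 3 bits cleared.
net29() {
    n29_prefix=${1%.*}     # first three octets, e.g. 10.0.121
    n29_last=${1##*.}      # last octet, e.g. 9
    echo "$n29_prefix.$(( n29_last & ~7 ))/29"
}

net29 10.0.121.1     # host 1 PF0          -> prints 10.0.121.0/29
net29 10.0.121.2     # next hop on host 1  -> prints 10.0.121.0/29
net29 10.0.121.9     # host 2 PF0          -> prints 10.0.121.8/29
net29 10.0.121.10    # next hop on host 2  -> prints 10.0.121.8/29
```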
Using PFs + VFs
In this scenario, the PF0, PF1, VF10 of the PF0 and VF10 of the PF1 are connected to separate VRFs which means that:
- PF0 on Host 1 will be able to communicate with PF0 on Host 2
- PF0 on Host 1 will not be able to communicate with PF1 on Host 1 and 2
- PF0 on Host 1 will not be able to communicate with PF0VF10 on Host 1 and 2
- PF0 on Host 1 will not be able to communicate with PF1VF10 on Host 1 and 2
- PF1 on Host 1 will be able to communicate with PF1 on Host 2
- PF1 on Host 1 will not be able to communicate with PF0 on Host 1 and 2
- PF1 on Host 1 will not be able to communicate with PF0VF10 on Host 1 and 2
- PF1 on Host 1 will not be able to communicate with PF1VF10 on Host 1 and 2
- PF0VF10 on Host 1 will be able to communicate with PF0VF10 on Host 2
- PF0VF10 on Host 1 will not be able to communicate with PF0 on Host 1 and 2
- PF0VF10 on Host 1 will not be able to communicate with PF1 on Host 1 and 2
- PF0VF10 on Host 1 will not be able to communicate with PF1VF10 on Host 1 and 2
- PF1VF10 on Host 1 will be able to communicate with PF1VF10 on Host 2
- PF1VF10 on Host 1 will not be able to communicate with PF0 on Host 1 and 2
- PF1VF10 on Host 1 will not be able to communicate with PF1 on Host 1 and 2
- PF1VF10 on Host 1 will not be able to communicate with PF0VF10 on Host 1 and 2
We make use of a PF and a VF on the host to test traffic.
Create the DPUDeployment, DPUServiceConfig, DPUServiceTemplate and other necessary objects
A number of environment variables must be set before running this command.
cat manifests/03.2-dpudeployment-installation-pf-vf/*.yaml | envsubst | kubectl apply -f -
This deploys the objects defined in the manifests under manifests/03.2-dpudeployment-installation-pf-vf.
Verification
These verification commands may need to be run multiple times to ensure the condition is met.
Note that the DPUService name will have a random suffix. For example, doca-hbn-l2xsl.
Verify the DPU and Service installation with:
## Ensure the BFB is ready
kubectl wait --for=jsonpath='{.status.phase}'=Ready --namespace dpf-operator-system bfb bf-bundle-$TAG --timeout=600s
## Ensure the DPUServices are created and have been reconciled.
kubectl wait --for=condition=ApplicationsReconciled --namespace dpf-operator-system dpuservices -l svc.dpu.nvidia.com/owned-by-dpudeployment=dpf-operator-system_hbn
## Ensure the DPUServiceIPAMs have been reconciled
kubectl wait --for=condition=DPUIPAMObjectReconciled --namespace dpf-operator-system dpuserviceipam --all
## Ensure the DPUServiceInterfaces have been reconciled
kubectl wait --for=condition=ServiceInterfaceSetReconciled --namespace dpf-operator-system dpuserviceinterface --all
## Ensure the DPUServiceChains have been reconciled
kubectl wait --for=condition=ServiceChainSetReconciled --namespace dpf-operator-system dpuservicechain --all
## Ensure the DPUs have the condition Initialized (this may take time)
kubectl wait --for=condition=Initialized --namespace dpf-operator-system dpu --all
or with dpfctl:
$ kubectl -n dpf-operator-system exec deploy/dpf-operator-controller-manager -- /dpfctl describe dpudeployments
NAME NAMESPACE STATUS REASON SINCE MESSAGE
DPFOperatorConfig/dpfoperatorconfig dpf-operator-system
│ ├─Ready False Pending 3m13s The following conditions are not ready:
│ │ * SystemComponentsReady
│ └─SystemComponentsReady False Error 2m28s System components must be ready for DPF Operator to continue:
│ * nvidia-k8s-ipam: DPUService dpf-operator-system/nvidia-k8s-ipam is not ready
└─DPUDeployments
└─DPUDeployment/hbn dpf-operator-system
│ ├─Ready False Pending 77s The following conditions are not ready:
│ │ * DPUSetsReady
│ └─DPUSetsReady False Pending 79s Objects are not ready:
│ * dpf-operator-system/hbn-dpuset1
├─DPUServiceChains
│ └─DPUServiceChain/hbn-5zgs4 dpf-operator-system Ready: True Success 79s
├─DPUServiceInterfaces
│ └─6 DPUServiceInterfaces... dpf-operator-system Ready: True Success 79s See doca-hbn-p0-if-w6f6b, doca-hbn-p1-if-p7565, doca-hbn-pf0hpf-if-wb84j, doca-hbn-pf0vf10-if-mr6fj,
│ doca-hbn-pf1hpf-if-cnbz8, doca-hbn-pf1vf10-if-7r6r6
├─DPUSets
│ └─DPUSet/hbn-dpuset1 dpf-operator-system
│ ├─BFB/bf-bundle dpf-operator-system Ready: True Ready 105s File: bf-bundle-3.2.1-34_25.11_ubuntu-24.04_64k_prod.bfb, DOCA: 3.2.1
│ └─DPUs
│ ├─DPU/dpu-node-mt2402xz0f6v-mt2402xz0f6v dpf-operator-system
│ │ └─Ready False OS Installing 72s
│ └─DPU/dpu-node-mt2404xz0c98-mt2404xz0c98 dpf-operator-system
│ └─Ready False OS Installing 69s
└─Services
├─DPUServiceTemplates
│ └─DPUServiceTemplate/doca-hbn dpf-operator-system Ready: True Success 104s
└─DPUServices
└─DPUService/doca-hbn-bjqbh dpf-operator-system Ready: True Success 77s
Releasing the Node Effect Hold
Since the DPUDeployment is configured with nodeEffect.hold: true, the DPUs will pause at the "Node Effect" phase and wait for external action before proceeding with provisioning. This gives the administrator control over when the node effect is applied.
To check that DPUNodeMaintenance objects have been created and are in the hold state:
kubectl get dpunodemaintenances -n dpf-operator-system
Once you are ready for provisioning to proceed, release the hold by setting the annotation on the DPUNodeMaintenance objects to "false". You can do this per-node or all at once:
kubectl annotate --overwrite dpunodemaintenances -n dpf-operator-system --all provisioning.dpu.nvidia.com/wait-for-external-nodeeffect=false
After releasing the hold, the DPUs will proceed through the remaining provisioning phases (BFB installation, OS installation, etc.).
Making the DPUs Ready
To make the DPUs ready, we need to manually power cycle the host. To avoid corruption, do this gracefully: shut down the host and DPU, power off the server, and then power it back on. Do this only once the DPU objects report the WaitingForManualPowerCycleOrReboot condition checked below. The described flow can be automated by the admin depending on the infrastructure.
The following verification command may need to be run multiple times to ensure the condition is met.
## Ensure the DPUs have the condition WaitingForManualPowerCycleOrReboot (this may take time)
kubectl wait --for=condition=WaitingForManualPowerCycleOrReboot --namespace dpf-operator-system dpu --all
or with dpfctl:
$ kubectl -n dpf-operator-system exec deploy/dpf-operator-controller-manager -- /dpfctl describe dpudeployments
NAME NAMESPACE STATUS REASON SINCE MESSAGE
DPFOperatorConfig/dpfoperatorconfig dpf-operator-system
│ ├─Ready False Pending 17m The following conditions are not ready:
│ │ * SystemComponentsReady
│ └─SystemComponentsReady False Error 16m System components must be ready for DPF Operator to continue:
│ * nvidia-k8s-ipam: DPUService dpf-operator-system/nvidia-k8s-ipam is not ready
└─DPUDeployments
└─DPUDeployment/hbn dpf-operator-system
│ ├─Ready False Pending 15m The following conditions are not ready:
│ │ * DPUSetsReady
│ └─DPUSetsReady False Pending 15m Objects are not ready:
│ * dpf-operator-system/hbn-dpuset1
├─DPUServiceChains
│ └─DPUServiceChain/hbn-5zgs4 dpf-operator-system Ready: True Success 15m
├─DPUServiceInterfaces
│ └─6 DPUServiceInterfaces... dpf-operator-system Ready: True Success 15m See doca-hbn-p0-if-w6f6b, doca-hbn-p1-if-p7565, doca-hbn-pf0hpf-if-wb84j, doca-hbn-pf0vf10-if-mr6fj,
│ doca-hbn-pf1hpf-if-cnbz8, doca-hbn-pf1vf10-if-7r6r6
├─DPUSets
│ └─DPUSet/hbn-dpuset1 dpf-operator-system
│ ├─BFB/bf-bundle dpf-operator-system Ready: True Ready 15m File: bf-bundle-3.2.1-34_25.11_ubuntu-24.04_64k_prod.bfb, DOCA: 3.2.1
│ └─DPUs
│ ├─DPU/dpu-node-mt2402xz0f6v-mt2402xz0f6v dpf-operator-system
│ │ ├─Rebooted False WaitingForManualPowerCycleOrReboot 2m36s
│ │ └─Ready False Rebooting 2m36s
│ └─DPU/dpu-node-mt2404xz0c98-mt2404xz0c98 dpf-operator-system
│ ├─Rebooted False WaitingForManualPowerCycleOrReboot 2m36s
│ └─Ready False Rebooting 2m36s
└─Services
├─DPUServiceTemplates
│ └─DPUServiceTemplate/doca-hbn dpf-operator-system Ready: True Success 15m
└─DPUServices
└─DPUService/doca-hbn-bjqbh dpf-operator-system Ready: True Success 15m
At this point, we have to power cycle the hosts. Once all the hosts are back online, we have to remove an annotation from the DPUNodes. The user can choose to remove this annotation node by node but to make it simpler in this guide, we do that all at once.
kubectl annotate dpunodes -n dpf-operator-system --all provisioning.dpu.nvidia.com/dpunode-external-reboot-required-
After this is done, we should expect that all DPUs become Ready:
kubectl wait --for="jsonpath={.status.phase}=Ready" --namespace dpf-operator-system dpu --all
or with dpfctl:
$ kubectl -n dpf-operator-system exec deploy/dpf-operator-controller-manager -- /dpfctl describe dpudeployments
NAME NAMESPACE STATUS REASON SINCE MESSAGE
DPFOperatorConfig/dpfoperatorconfig dpf-operator-system Ready: True Success 6m5s
└─DPUDeployments
└─DPUDeployment/hbn dpf-operator-system Ready: True Success 2s
├─DPUServiceChains
│ └─DPUServiceChain/hbn-5zgs4 dpf-operator-system Ready: True Success 36s
├─DPUServiceInterfaces
│ └─6 DPUServiceInterfaces... dpf-operator-system Ready: True Success 6s See doca-hbn-p0-if-w6f6b, doca-hbn-p1-if-p7565, doca-hbn-pf0hpf-if-wb84j, doca-hbn-pf0vf10-if-mr6fj,
│ doca-hbn-pf1hpf-if-cnbz8, doca-hbn-pf1vf10-if-7r6r6
├─DPUSets
│ └─DPUSet/hbn-dpuset1 dpf-operator-system
│ ├─BFB/bf-bundle dpf-operator-system Ready: True Ready 28m File: bf-bundle-3.2.1-34_25.11_ubuntu-24.04_64k_prod.bfb, DOCA: 3.2.1
│ └─DPUs
│ └─2 DPUs... dpf-operator-system Ready: True DPUReady 5m52s See dpu-node-mt2402xz0f6v-mt2402xz0f6v, dpu-node-mt2404xz0c98-mt2404xz0c98
└─Services
├─DPUServiceTemplates
│ └─DPUServiceTemplate/doca-hbn dpf-operator-system Ready: True Success 28m
└─DPUServices
└─DPUService/doca-hbn-bjqbh dpf-operator-system Ready: True Success 3s
Test Traffic
After the DPUs are provisioned and the rest of the objects are Ready, we can test traffic by assigning an IP to PF0 on the host for each DPU and running a simple ping. Although the configuration enables both PFs, we focus on PF0 for testing traffic. Assuming PF0 is named ens5f0np0:
On the host with DPU with serial number DPU1_SERIAL:
ip link set dev ens5f0np0 up
ip addr add 10.0.121.1/29 dev ens5f0np0
ip route add 10.0.121.0/24 dev ens5f0np0 via 10.0.121.2
On the host with DPU with serial number DPU2_SERIAL:
ip link set dev ens5f0np0 up
ip addr add 10.0.121.9/29 dev ens5f0np0
ip route add 10.0.121.0/24 dev ens5f0np0 via 10.0.121.10
On the host with DPU with serial number DPU1_SERIAL:
$ ping 10.0.121.9 -c3
PING 10.0.121.9 (10.0.121.9) 56(84) bytes of data.
64 bytes from 10.0.121.9: icmp_seq=1 ttl=64 time=0.387 ms
64 bytes from 10.0.121.9: icmp_seq=2 ttl=64 time=0.344 ms
64 bytes from 10.0.121.9: icmp_seq=3 ttl=64 time=0.396 ms
--- 10.0.121.9 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.344/0.375/0.396/0.022 ms
In addition, we can test traffic by assigning an IP to VF10 of PF0 on the host for each DPU and running a simple ping. Any VF could be used, but the DPUDeployment and DPUServiceInterface would need to be adjusted accordingly. First, create the VFs on the host of each DPU:
echo 12 > /sys/class/net/ens5f0np0/device/sriov_numvfs
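A slightly more defensive version of this step can be sketched as below. It relies on the standard Linux SR-IOV sysfs files `sriov_totalvfs` and `sriov_numvfs`; the `set_numvfs` helper is hypothetical. Note that, assuming zero-based VF numbering, VF10 only exists when at least 11 VFs are created, and that resetting the count to 0 removes any existing VFs.

```shell
# Set the number of VFs on a PF after checking the device limit.
# $1: sysfs device directory, e.g. /sys/class/net/ens5f0np0/device
# $2: number of VFs to create
# (set_numvfs is an illustrative helper, not part of DPF.)
set_numvfs() {
    sv_total=$(cat "$1/sriov_totalvfs") || return 1
    sv_current=$(cat "$1/sriov_numvfs") || return 1
    if [ "$2" -gt "$sv_total" ]; then
        echo "requested $2 VFs, device supports at most $sv_total" >&2
        return 1
    fi
    # Nothing to do if the requested count is already configured.
    if [ "$sv_current" -eq "$2" ]; then
        return 0
    fi
    # The kernel refuses to change one non-zero VF count to another,
    # so reset to 0 first. This destroys the existing VFs.
    if [ "$sv_current" -ne 0 ]; then
        echo 0 > "$1/sriov_numvfs" || return 1
    fi
    echo "$2" > "$1/sriov_numvfs"
}

# Usage (run as root on the host):
#   set_numvfs /sys/class/net/ens5f0np0/device 12
```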
Then, assuming the VF is named ens5f0v10:
On the host with DPU with serial number DPU1_SERIAL:
ip link set dev ens5f0v10 up
ip addr add 10.0.123.1/29 dev ens5f0v10
ip route add 10.0.123.0/24 dev ens5f0v10 via 10.0.123.2
On the host with DPU with serial number DPU2_SERIAL:
ip link set dev ens5f0v10 up
ip addr add 10.0.123.9/29 dev ens5f0v10
ip route add 10.0.123.0/24 dev ens5f0v10 via 10.0.123.10
On the host with DPU with serial number DPU1_SERIAL:
$ ping 10.0.123.9 -c3
PING 10.0.123.9 (10.0.123.9) 56(84) bytes of data.
64 bytes from 10.0.123.9: icmp_seq=1 ttl=64 time=0.387 ms
64 bytes from 10.0.123.9: icmp_seq=2 ttl=64 time=0.344 ms
64 bytes from 10.0.123.9: icmp_seq=3 ttl=64 time=0.396 ms
--- 10.0.123.9 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.344/0.375/0.396/0.022 ms
Uninstall
This section covers only the DPF related components and not the prerequisites as these must be managed by the admin.
Delete the DPF Operator system and DPF Operator
kubectl delete -n dpf-operator-system dpfoperatorconfig dpfoperatorconfig --wait
helm uninstall -n dpf-operator-system dpf-operator --wait
Note: there can be a race condition with deleting the underlying Kamaji cluster which runs the DPU cluster control plane in this guide. If that happens it may be necessary to remove finalizers manually from DPUCluster and Datastore objects.
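If finalizers do have to be removed by hand, one common way is a merge patch that sets `metadata.finalizers` to null. The sketch below is generic Kubernetes usage, not a DPF-specific procedure: `clear_finalizers` is a hypothetical helper, the object names are placeholders, and the exact resource types to pass for the DPUCluster and Datastore objects are an assumption to verify with `kubectl api-resources`. Use it only on objects whose deletion is confirmed to be stuck, since it bypasses the controller's cleanup.

```shell
# Clear all finalizers from one object so a stuck deletion can complete.
# $1: resource type, $2: object name, $3: namespace (omit for cluster-scoped)
# (clear_finalizers is an illustrative helper, not part of DPF.)
clear_finalizers() {
    if [ -n "${3:-}" ]; then
        kubectl patch "$1" "$2" -n "$3" --type=merge \
            -p '{"metadata":{"finalizers":null}}'
    else
        kubectl patch "$1" "$2" --type=merge \
            -p '{"metadata":{"finalizers":null}}'
    fi
}

# Usage (the object name is a placeholder):
#   clear_finalizers dpucluster <dpucluster-name> dpu-cplane-tenant1
```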