OVN Kubernetes with Host Based Networking and SNAP Block Storage
For better code formatting, read this guide in the source GitHub repository at github.com/NVIDIA/doca-platform, under docs/public/user-guides/host-trusted/use-cases/hbn-ovnk-snap/README.md.
To run this guide, clone the repository from github.com/NVIDIA/doca-platform and change to the docs/public/user-guides/host-trusted/use-cases/hbn-ovnk-snap directory.
A number of virtual functions (VFs) are created on each host when DPUs are provisioned. Some of these VFs are reserved for specific uses:
The first VF (vf0) is used by provisioning components.
The second VF (vf1) is used by ovn-kubernetes.
The remaining VFs are allocated by SR-IOV Device Plugin. Each pod using OVN Kubernetes in DPU mode as its primary CNI will have one of these VFs injected at Pod creation time.
Installation Guide
0. Required Variables
The following variables are required by this guide. A sensible default is provided where it makes sense, but many will be specific to the target infrastructure.
Commands in this guide are run in the same directory that contains this readme.
Environment variables file
## IP Address for the Kubernetes API server of the target cluster on which DPF is installed.
## This should never include a scheme or a port.
## e.g. 10.10.10.10
export TARGETCLUSTER_API_SERVER_HOST=
## Port for the Kubernetes API server of the target cluster on which DPF is installed.
export TARGETCLUSTER_API_SERVER_PORT=6443
## IP address range for hosts in the target cluster on which DPF is installed.
## This is a CIDR in the form e.g. 10.10.10.0/24
export TARGETCLUSTER_NODE_CIDR=
## Virtual IP used by the load balancer for the DPU Cluster. Must be a reserved IP from the management subnet and not allocated by DHCP.
export DPUCLUSTER_VIP=
## Interface on which the DPUCluster load balancer will listen. Should be the management interface of the control plane node.
export DPUCLUSTER_INTERFACE=
## The repository URL for the NVIDIA Helm chart registry.
## Usually this is the NVIDIA Helm NGC registry. For development purposes, this can be set to a different repository.
export HELM_REGISTRY_REPO_URL=https://helm.ngc.nvidia.com/nvidia/doca
## The repository URL for the HBN container image.
## Usually this is the NVIDIA NGC registry. For development purposes, this can be set to a different repository.
export HBN_NGC_IMAGE_URL=nvcr.io/nvidia/doca/doca_hbn
## The repository URL for the SNAP VFS container image.
## Usually this is the NVIDIA NGC registry. For development purposes, this can be set to a different repository.
export SNAP_NGC_IMAGE_URL=nvcr.io/nvidia/doca/doca_vfs
## The repository URL for the OVN-Kubernetes Helm chart.
## Usually this is the NVIDIA GHCR repository. For development purposes, this can be set to a different repository.
export OVN_KUBERNETES_REPO_URL=oci://ghcr.io/mellanox/charts
## The tag of the OVN-Kubernetes Helm chart.
export OVN_KUBERNETES_CHART_TAG=v26.4.0
## POD_CIDR is the CIDR used for pods in the target Kubernetes cluster.
export POD_CIDR=10.233.64.0/18
## SERVICE_CIDR is the CIDR used for services in the target Kubernetes cluster.
## This is a CIDR in the form e.g. 10.10.10.0/24
export SERVICE_CIDR=10.233.0.0/18
## The DPF REGISTRY is the Helm repository URL where the DPF Operator Chart resides.
## Usually this is the NVIDIA Helm NGC registry. For development purposes, this can be set to a different repository.
export REGISTRY=https://helm.ngc.nvidia.com/nvidia/doca
## The DPF TAG is the version of the DPF components which will be deployed in this guide.
export TAG=v26.4.0
## URL to the BFB used in the `bfb.yaml` and linked by the DPUSet.
export BFB_URL="https://content.mellanox.com/BlueField/BFBs/Ubuntu24.04/bf-bundle-3.4.0-92_26.04_ubuntu-24.04_64k_prod.bfb"
Modify the variables in manifests/00-env-vars/envvars.env to fit your environment, then source the file:
source manifests/00-env-vars/envvars.env
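Several variables in the file have no default, so it can help to confirm they are all populated before continuing. The following is a minimal sketch, not part of the guide's manifests; the function name `check_required_vars` is hypothetical, and the variable names are the ones listed in the environment file above.

```shell
# Report which of the named variables are unset or empty.
# Returns 0 if all are set, 1 otherwise.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage, after sourcing manifests/00-env-vars/envvars.env:
#   check_required_vars TARGETCLUSTER_API_SERVER_HOST TARGETCLUSTER_NODE_CIDR \
#     DPUCLUSTER_VIP DPUCLUSTER_INTERFACE || echo "fill in envvars.env first"
```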
1. CNI Installation
OVN Kubernetes is used as the primary CNI for the cluster. On worker nodes the primary CNI will be accelerated by offloading work to the DPU. On control plane nodes OVN Kubernetes will run without offloading.
Create the Namespace
kubectl create ns ovn-kubernetes
Install OVN Kubernetes from the helm chart
Install the OVN Kubernetes CNI components from the helm chart. A number of environment variables must be set before running this command.
commonManifests:
  enabled: true
nodeWithoutDPUManifests:
  enabled: true
controlPlaneManifests:
  enabled: true
nodeWithDPUManifests:
  enabled: true
  nodeMgmtPortDpResourceName: nvidia.com/ovnk-mgmt-vf
  dpuServiceAccountNamespace: dpf-operator-system
gatewayOpts: --gateway-interface=derive-from-mgmt-port
## Note this CIDR is followed by a trailing /24 which informs OVN Kubernetes on how to split the CIDR per node.
podNetwork: $POD_CIDR/24
serviceNetwork: $SERVICE_CIDR
k8sAPIServer: https://$TARGETCLUSTER_API_SERVER_HOST:$TARGETCLUSTER_API_SERVER_PORT
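The trailing /24 on podNetwork can look like a typo, so here is a short worked sketch of the arithmetic behind it. It assumes each node receives one subnet of the given prefix length carved out of the pod CIDR, as the comment in the values describes. The function name `node_subnet_count` is hypothetical.

```shell
# Count how many per-node subnets a pod CIDR can supply.
# $1: pod CIDR (e.g. 10.233.64.0/18); $2: per-node prefix length (e.g. 24).
node_subnet_count() {
  cluster_prefix=${1#*/}   # text after the slash, e.g. 18
  echo $(( 1 << ($2 - $cluster_prefix) ))
}

POD_CIDR=${POD_CIDR:-10.233.64.0/18}
# The rendered value keeps both prefixes: <network>/<cluster prefix>/<node prefix>.
echo "podNetwork: $POD_CIDR/24"
echo "node subnets available: $(node_subnet_count "$POD_CIDR" 24)"
```

With the default POD_CIDR of 10.233.64.0/18 this gives 2^(24-18) = 64 node subnets, which is worth comparing against the expected number of nodes.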
Verification
These verification commands may need to be run multiple times to ensure the condition is met.
Verify the CNI installation with:
## Ensure all nodes in the cluster are ready.
kubectl wait --for=condition=ready nodes --all
## Ensure all pods in the ovn-kubernetes namespace are ready.
kubectl wait --for=condition=ready --namespace ovn-kubernetes pods --all --timeout=300s
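Because the verification commands may need several runs, a small retry wrapper can save manual repetition. This is a sketch only; the `retry` helper is hypothetical, and the attempt count and delay in the usage line are arbitrary values, not recommendations from the guide.

```shell
# Re-run a command until it succeeds or the attempt budget is exhausted.
# $1: maximum attempts; $2: seconds to sleep between attempts; rest: the command.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage:
#   retry 10 30 kubectl wait --for=condition=ready nodes --all
```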
2. DPF Operator Installation
Dependencies
Before deploying the DPF Operator, ensure that Helm is properly configured according to the Helm prerequisites.
This is a critical prerequisite step that must be completed for the DPF Operator to function properly.
After applying the additional dependencies you MUST ensure that the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables are set in the node-feature-discovery-worker DaemonSet.
Node Feature Discovery (NFD) must target the VIP because NFD has to be up before cluster services can work.
Example commands to set the environment variables:
kubectl -n dpf-operator-system set env daemonset/node-feature-discovery-worker \
KUBERNETES_SERVICE_HOST=$TARGETCLUSTER_API_SERVER_HOST \
KUBERNETES_SERVICE_PORT=$TARGETCLUSTER_API_SERVER_PORT
These verification commands may need to be run multiple times to ensure the condition is met.
Verify the DPF Operator installation with:
## Ensure the DPF Operator deployment is available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-operator-controller-manager
## Ensure all pods in the DPF Operator system are ready.
kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all
3. DPF System Installation
This section involves creating the DPF system components and some basic infrastructure required for a functioning DPF-enabled cluster.
DPUCluster to serve as Kubernetes control plane for DPU nodes
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUCluster
metadata:
  name: dpu-cplane-tenant1
  namespace: dpu-cplane-tenant1
spec:
  type: kamaji
  maxNodes: 1000
  clusterEndpoint:
    # deploy keepalived instances on the nodes that match the given nodeSelector.
    keepalived:
      # interface on which keepalived will listen. Should be the oob interface of the control plane node.
      interface: $DPUCLUSTER_INTERFACE
      # Virtual IP reserved for the DPU Cluster load balancer. Must not be allocatable by DHCP.
      vip: $DPUCLUSTER_VIP
      # virtualRouterID must be in range [1,255], make sure the given virtualRouterID does not duplicate with any existing keepalived process running on the host
      virtualRouterID: 126
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
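If the virtualRouterID is changed from the value shown, the range constraint in the comment can be checked before applying the manifest. The sketch below covers only the [1,255] range; it does not detect a clash with another keepalived instance on the host. The function name `valid_vrid` is hypothetical.

```shell
# Succeed only if $1 is an integer in the range [1,255].
valid_vrid() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 255 ]
}

# Usage:
#   valid_vrid 126 && echo "virtualRouterID is in range"
```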
Verification
These verification commands may need to be run multiple times to ensure the condition is met.
Verify the DPF System with:
## Ensure the provisioning and DPUService controller manager deployments are available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-provisioning-controller-manager dpuservice-controller-manager
## Ensure all other deployments in the DPF Operator system are Available.
kubectl rollout status deployment --namespace dpf-operator-system
## Ensure the DPUCluster is ready for nodes to join.
kubectl wait --for=condition=ready --namespace dpu-cplane-tenant1 dpucluster --all
4. Install Components to Enable Accelerated CNI Nodes
OVN Kubernetes will accelerate traffic by attaching a VF to each pod using the primary CNI. This VF is used to offload flows to the DPU. This section details the components needed to connect pods to the offloaded OVN Kubernetes CNI.
Install the OVN Kubernetes resource injection webhook
The OVN Kubernetes resource injection webhook injects a request for a VF and a Network Attachment Definition into each pod scheduled to a worker node. This webhook is part of the same helm chart as the other components of the OVN Kubernetes CNI. It is installed by adjusting the existing helm installation to add the webhook component.
The NodeSRIOVDevicePluginConfig defines which VFs on the DPU physical functions are exposed as SR-IOV device plugin resources on the host node. The DPF Operator's NodeSRIOVDevicePluginController (enabled in the DPFOperatorConfig) manages the SR-IOV device plugin pods based on this configuration.
The NodeSRIOVDevicePluginConfig is linked to DPUs via the noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig annotation on the DPU object. This annotation is set in the DPUDeployment's dpuAnnotations field.
Verification
These verification commands may need to be run multiple times to ensure the condition is met.
Verify that the accelerated CNI is enabled with:
## Ensure all pods in the nvidia-network-operator namespace are ready.
kubectl wait --for=condition=Ready --namespace nvidia-network-operator pods --all
## Expect the Multus Daemonset to be successfully rolled out.
kubectl rollout status daemonset --namespace nvidia-network-operator kube-multus-ds
## Expect the network injector to be successfully rolled out.
kubectl rollout status deployment --namespace ovn-kubernetes ovn-kubernetes-resource-injector
5. DPU Provisioning and Service Installation
This section covers creating the vendor CSI controller credentials, installing the required storage components on the host cluster, and deploying the DPUs together with the services that run on them.
The user is expected to create a DPUDeployment object that reflects a set of DPUServices that should run on a set of DPUs.
host:
  enabled: true
  config:
    targets:
      nodes:
        # name of the target
        - name: spdk-target
          # management address
          rpcURL: http://10.0.110.25:8000
          # type of the target, e.g. nvme-tcp, nvme-rdma
          targetType: nvme-rdma
          # target service IP
          targetAddr: 10.0.124.1
    # required parameter, name of the secret that contains connection
    # details to access the DPU cluster.
    # this secret should be created by the DPUServiceCredentialRequest API.
    dpuClusterSecret: spdk-csi-controller-dpu-cluster-credentials
Apply the DPUDeployment and DPU-side Storage Resources
Storage use-cases set RDMA_SET_NETNS_EXCLUSIVE="no" in the DPUFlavor, putting the DPU in shared RDMA mode. The default SFC NAD (mybrsfc) enables RDMA for SF interfaces, which is not compatible with shared RDMA mode. All services deployed on a DPU provisioned with a storage flavor that use SF interfaces must reference a NAD without RDMA. A custom DPUServiceNAD (mybrsfc-storage) is included in the manifests below for this reason.
Warning: If more than one DPU exists per node, apply the relevant selector in the DPUDeployment to select the appropriate DPU. See DPUDeployment - DPUs Configuration to understand more about the selectors.
OVN DPUServiceConfiguration and DPUServiceTemplate to deploy OVN workloads to the DPUs
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: ovn
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "ovn"
  serviceConfiguration:
    helmChart:
      values:
        k8sAPIServer: https://$TARGETCLUSTER_API_SERVER_HOST:$TARGETCLUSTER_API_SERVER_PORT
        podNetwork: $POD_CIDR/24
        serviceNetwork: $SERVICE_CIDR
        dpuManifests:
          kubernetesSecretName: "ovn-dpu" # user needs to populate based on DPUServiceCredentialRequest
          vtepCIDR: "10.0.120.0/22" # user needs to populate based on DPUServiceIPAM
          hostCIDR: $TARGETCLUSTER_NODE_CIDR # user needs to populate
          ipamPool: "pool1" # user needs to populate based on DPUServiceIPAM
          ipamPoolType: "cidrpool" # user needs to populate based on DPUServiceIPAM
          ipamVTEPIPIndex: 0
          ipamPFIPIndex: 1
SPDK CSI Controller DPU DPUServiceConfiguration and DPUServiceTemplate (DPU Cluster)
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: spdk-csi-controller-dpu
  namespace: dpf-operator-system
spec:
  deploymentServiceName: spdk-csi-controller-dpu
  upgradePolicy:
    applyNodeEffect: false
  serviceConfiguration:
    helmChart:
      values:
        dpu:
          enabled: true
          storageClass:
            # the name of the storage class that will be created for spdk-csi,
            # this StorageClass name should be used in the StorageVendor settings
            name: spdkcsi-sc
            # name of the secret that contains credentials for the remote SPDK target,
            # content of the secret is injected during CreateVolume request
            secretName: spdkcsi-secret
            # namespace of the secret with credentials for the remote SPDK target
            secretNamespace: dpf-operator-system
          rbacRoles:
            spdkCsiController:
              # the name of the service account for spdk-csi-controller
              # this value must be aligned with the value from the DPUServiceCredentialRequest
              serviceAccount: spdk-csi-controller-sa
---
apiVersion: v1
kind: Secret
metadata:
  name: spdkcsi-secret
  namespace: dpf-operator-system
  labels:
    # this label enables replication of the secret from the host to the dpu cluster
    dpu.nvidia.com/image-pull-secret: ""
stringData:
  # name field in the "rpcTokens" list should match name of the
  # spdk target from DPUService.helmChart.values.host.config.targets.nodes
  secret.json: |-
    {
      "rpcTokens": [
        {
          "name": "spdk-target",
          "username": "exampleuser",
          "password": "examplepassword"
        }
      ]
    }
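The name in rpcTokens has to match the SPDK target name configured for the host-side controller, so one way to keep the two aligned is to generate the JSON from a single variable. This is an illustrative sketch, not part of the guide's manifests; `render_rpc_tokens` and `TARGET_NAME` are hypothetical names, and the credentials are the example placeholders from the Secret above.

```shell
# Print the secret.json payload for one SPDK target.
# $1: target name; $2: username; $3: password
# (all three are assumed to contain no quotes or backslashes).
render_rpc_tokens() {
  printf '{\n  "rpcTokens": [\n    {\n      "name": "%s",\n      "username": "%s",\n      "password": "%s"\n    }\n  ]\n}\n' "$1" "$2" "$3"
}

TARGET_NAME=spdk-target   # reuse this same value for the target name in the host values
render_rpc_tokens "$TARGET_NAME" exampleuser examplepassword
```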
OVN DPUServiceCredentialRequest to allow cross cluster communication
These verification commands may need to be run multiple times to ensure the condition is met.
Note that the DPUService name will have a random suffix. For example, hbn-vs6mj. Use the correct name for the verification.
Verify the DPU and Service installation with:
## Ensure the BFB is ready
kubectl wait --for=jsonpath='{.status.phase}'=Ready --namespace dpf-operator-system bfb bf-bundle-$TAG --timeout=600s
## Ensure the DPUServices are created and have been reconciled.
kubectl wait --for=condition=ApplicationsReconciled --namespace dpf-operator-system dpuservices -l svc.dpu.nvidia.com/owned-by-dpudeployment=dpf-operator-system_hbn-ovnk-snap-nvme
## Ensure the DPUServiceIPAMs have been reconciled
kubectl wait --for=condition=DPUIPAMObjectReconciled --namespace dpf-operator-system dpuserviceipam --all
## Ensure the DPUServiceInterfaces have been reconciled
kubectl wait --for=condition=ServiceInterfaceSetReconciled --namespace dpf-operator-system dpuserviceinterface --all
## Ensure the DPUServiceChains have been reconciled
kubectl wait --for=condition=ServiceChainSetReconciled --namespace dpf-operator-system dpuservicechain --all
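Since each DPUService name carries a random suffix, a small filter can recover the full name from its prefix. The sketch below assumes the `resource/name` lines printed by `kubectl get ... -o name` and a suffix made of lowercase letters and digits, as in the hbn-vs6mj example. The function name `find_dpuservice` is hypothetical.

```shell
# Read resource names on stdin and print those of the form <prefix>-<suffix>.
# $1: service name prefix, e.g. hbn
find_dpuservice() {
  # Strip everything up to the last slash, then match the prefix plus one suffix.
  sed 's|.*/||' | grep -E "^$1-[a-z0-9]+\$"
}

# Usage:
#   kubectl get dpuservices --namespace dpf-operator-system -o name | find_dpuservice hbn
```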
6. Test Traffic
Add worker nodes to the cluster
At this point, workers should be added to the cluster. Each worker node should be configured in line with the prerequisites. As workers are added to the cluster, DPUs will be provisioned and DPUServices will begin to start.
You can verify the status of the DPUDeployment and its components with the following command:
kubectl wait --for=condition=ready pod -l app=storage-test-pod-nvme-vf --timeout=300s
Verify the block device is available in the pod:
kubectl exec -it storage-test-pod-nvme-vf-0 -- ls -l /dev/xvda
You can test read/write operations on the block device:
## Write test data to the block device
kubectl exec -it storage-test-pod-nvme-vf-0 -- sh -c "echo 'test data' | dd of=/dev/xvda bs=512 count=1"
## Read back the test data
kubectl exec -it storage-test-pod-nvme-vf-0 -- dd if=/dev/xvda bs=512 count=1 2>/dev/null
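The write and read steps above can be combined into a single comparison. The sketch below runs the same dd round-trip against a scratch file rather than the real block device, so it is safe to try anywhere. The scratch file and the `marker` and `readback` variables are illustrative assumptions; inside the pod the path would be /dev/xvda.

```shell
# Round-trip a marker through a scratch file standing in for the block device.
dev=$(mktemp)            # stand-in for /dev/xvda; safe to overwrite
marker="test data $$"    # include the shell PID so stale data is not mistaken for a pass

# Write the marker into the first 512-byte block without truncating the target.
printf '%s\n' "$marker" | dd of="$dev" bs=512 count=1 conv=notrunc 2>/dev/null

# Read the first block back and keep only the first line.
readback=$(dd if="$dev" bs=512 count=1 2>/dev/null | head -n 1)

if [ "$readback" = "$marker" ]; then
  echo "round-trip OK"
else
  echo "round-trip FAILED"
fi
rm -f "$dev"
```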
Uninstall
This section describes how to clean up the cluster after the DPF setup has been completed. It is important to follow the steps in the correct order to ensure that all components are removed cleanly and that the cluster remains functional.
kubectl delete -f manifests/04-enable-accelerated-cni --wait --ignore-not-found=true
helm uninstall -n nvidia-network-operator network-operator --wait
## Note: Uninstalling OVN Kubernetes as primary CNI is not supported but this command must be run to remove the webhook and restore a functioning cluster.
helm uninstall -n ovn-kubernetes ovn-kubernetes-resource-injector --wait
Delete the DPF Operator system and DPF Operator
First we have to delete some DPUServiceInterfaces. This is necessary because of a known issue during uninstallation.
Note: there can be a race condition with deleting the underlying Kamaji cluster which runs the DPU cluster control plane in this guide. If that happens it may be necessary to remove finalizers manually from DPUCluster and Datastore objects.
Limitations of DPF Setup
Host network pod services
The kubelet process on each Kubernetes node uses the OOB interface IP address to register with Kubernetes, so the nodes have OOB IP addresses as their node IP addresses. As a result, pods using host networking have the host's OOB IP address as their pod IP address. That interface is not accelerated, so any component that uses the addresses of host-networked pods does not benefit from hardware acceleration or the high-speed ports.
For example, when a Kubernetes NodePort service selects pods using host networking, traffic is not accelerated even if the user connects to the high-speed IP of the host. To solve this, create dedicated EndpointSlices that contain the hosts' high-speed port IP addresses instead of their OOB port IP addresses. The entire path to the pods is then accelerated, provided the user connects to the host's high-speed IP address with the NodePort port. The workload running in the host-networked pod must also listen on the high-speed port IP address.
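A dedicated EndpointSlice of the kind described above could be generated as follows. This is a minimal sketch using the standard discovery.k8s.io/v1 API. It assumes the Service has no selector, so that Kubernetes does not manage its EndpointSlices automatically, and that the Service port is named http. The function name, slice name suffix, and all values in the usage line are placeholders.

```shell
# Print an EndpointSlice manifest for one host-networked backend.
# $1: Service name; $2: host high-speed IP; $3: port the workload listens on.
render_endpointslice() {
  cat <<EOF
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: $1-highspeed
  labels:
    # Associates this slice with the Service of the same name.
    kubernetes.io/service-name: $1
addressType: IPv4
ports:
  # The port name must match the port name used in the Service.
  - name: http
    protocol: TCP
    port: $3
endpoints:
  - addresses:
      - "$2"
EOF
}

# Usage (placeholder values):
#   render_endpointslice my-service 192.0.2.10 8080 | kubectl apply -f -
```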