This patch release of the DOCA Platform Framework (DPF) includes bug fixes and improvements to enhance the provisioning and orchestration of NVIDIA BlueField DPUs in Kubernetes environments.
Features
-
Enabled etcd compaction and added an etcd-defrag CronJob when using Kamaji clusters to optimize database performance.
-
Introduced a new CLI tool, dpfctl, for debugging and troubleshooting DPF components.
Fixed Issues from previous release
Detailed information about the fixed issues can be found in the release notes for v24.10.
-
Avoid race condition when multiple DPUDeployments depend on same objects
-
DPFOperatorConfig CR deletion hangs on uninstallation process
-
Changes to a DPU Object will trigger reprovisioning
-
DPU object in pending phase without clear reason
-
DPU Cluster CNI not deployed when hitting docker rate limit
-
DPUServiceInterface is not Immutable
-
[OVN-Kubernetes DPUService] Fragmented packets are silently dropped when using custom MTU
-
[OVN-Kubernetes DPUService] traffic between workloads stops working after 5 minutes
-
[OVN-Kubernetes DPUService] DPU re-provisioning or VTEP IP change may result in lost traffic between cluster components
Supported DPU Services:
-
OVN-Kubernetes with OVS offload to the DPU
-
DOCA Host-Based Networking (HBN)
-
DOCA Telemetry Service (DTS)
-
DOCA BlueMan
Dependencies
Hardware and Software Requirements
Refer to Prerequisites for detailed requirements.
-
DPU Hardware: NVIDIA BlueField-3 DPUs
-
Minimal DOCA BFB Image: DOCA v2.5 or higher (must be pre-installed on DPUs to support DPF provisioning)
-
Provisioned DOCA BFB Image: bf-bundle-2.9.1-40
Known Issues and Limitations
-
On initial deployment, DPF CRs may be stuck in their initial state (Pending, Initializing, etc.) and not progress
-
If DPF CRs are created before the DPF components are running, they may remain stuck in their initial state. Create DPF CRs only after the DPF components have been deployed.
-
Internal Ref #4241297
-
Workaround: Delete any CRs that were created before the DPF system components were deployed, then recreate them.
-
-
DPUService stuck in its Deleting phase
-
A DPUService can get stuck in its Deleting phase when a pod that was created on the DPU as part of the DPUService cannot be deleted.
-
Internal Ref #4213229
-
Workaround: Reprovision the DPU where the DPUService is deployed.
-
-
Incompatible DPUFlavor can cause DPU to get into an unstable state
-
Using an incompatible DPUFlavor can cause the DPU device to get into an error state that requires manual intervention, for example, allocating 14GB of hugepages on a DPU with 16GB of memory.
-
Internal Ref #4200717
-
Workaround: Manually provision the DPU, or follow the DOCA troubleshooting documentation to return the DPU to an operational state: https://docs.nvidia.com/networking/display/bfswtroubleshooting
-
-
Traffic loss after reconfiguration of DPUServices with a chain between them
-
Reconfiguring DPUServices that have a service chain between them may cause traffic loss due to outdated service chains.
-
Internal Ref #4178445
-
Workaround: Recreate the SFC object between the services.
-
-
Stale ports after DPU reboot
-
When a DPU is rebooted, the old DPU service ports are not deleted from the DPU's OVS and remain stale.
-
Internal Ref #4174183
-
Workaround: No workaround. This is a known issue and should not affect performance.
-
-
BFB filename must be unique
-
If the bfb.spec.filename of BFB CR#1 is the same as the bfb.spec.filename of BFB CR#2, but the two CRs reference different URLs (the actual BFB file to download), then BFB CR#1 would reference the wrong BFB file.
-
Internal Ref #4143309
-
Workaround: Use a unique bfb.spec.filename when creating new BFB CRs.
-
-
DPU Cluster control-plane connectivity is lost when physical port P0 is down on the worker node
-
A link-down event on the P0 port of the DPU causes DPU components to lose DPU control plane connectivity.
-
Internal Ref #3751863
-
Workaround: Make sure the P0 link is up on the DPU. If it is down, either restart the DPU or refer to the DOCA troubleshooting documentation: https://docs.nvidia.com/networking/display/bfswtroubleshooting
-
-
DPU provisioning operations are not retried
-
DPU provisioning operations are not retried. As a result, a small environmental glitch can leave a DPU object in an error phase even though the operation would have succeeded if retried.
-
Internal Ref #4202272
-
Workaround: Delete the DPU object that is in an error phase. The object is then recreated and the operation starts from scratch.
-
-
Cluster MTU value cannot be dynamically changed
-
A DPF cluster can be deployed with a custom MTU value. However, once the cluster is deployed, the MTU value, which is applied across multiple distributed components, cannot be modified.
-
Internal Ref #3917006
-
Workaround: Uninstall DPF and re-install from scratch using the new MTU value.
-
-
nvidia-k8s-ipam and servicechainset-controller DPF system DPUServices are in “Pending” phase
-
As long as there are no provisioned DPUs in the system, the nvidia-k8s-ipam and servicechainset-controller will appear as not ready / pending when querying dpuservices.
This has no impact on performance or functionality since DPF system components are only relevant when there are DPUs to provision services on.
-
Internal Ref #4241324
-
Workaround: No workaround, known issue.
-
-
[OVN-Kubernetes DPUService] Nodes marked as NotReady
-
When OVN-Kubernetes is installed as a CNI on a node running containerd version 1.7.0 or above, the node never becomes ready.
-
Internal Ref #4178221
-
Workaround:
Option 1: Use containerd version below v1.7.0 when using OVN-Kubernetes as a primary CNI.
Option 2: Manually restart containerd on the host.
-
-
[OVN-Kubernetes DPUService] control plane node is not functional after reboot or network restart
-
During OVN-Kubernetes CNI installation on the control plane nodes, the management interface is moved with its IP into a newly created OVS bridge. Since this network configuration is not persistent, it is lost on a node or network restart.
-
Internal Ref #4241306
-
Workaround:
-
Pre-define the OVS bridge on each control plane node with the OOB port MAC and IP address, and ensure it gets a persistent IP.
```yaml
# Ubuntu example for netplan persistent network configuration:
network:
  ethernets:
    oob:
      match:
        # the mac address of the oob
        macaddress: xx:xx:xx:xx:xx:xx
      set-name: oob
  bridges:
    br0:
      # netplan expects a list of addresses here
      addresses: [x.x.x.x/x]
      interfaces: [oob]
      # the mac address of the oob
      macaddress: xx:xx:xx:xx:xx:xx
      openvswitch: {}
  version: 2
```
-
Set the OVS bridge "bridge-uplink" in the OVS metadata.
```bash
ovs-vsctl br-set-external-id br0 bridge-id br0 -- br-set-external-id br0 bridge-uplink oob
```
-
-
-
[OVN-Kubernetes DPUService] Only a single OVN-Kubernetes DPU service version can be deployed across the cluster
-
The OVN-Kubernetes service does not fully support customization using Helm parameters, so only a single OVN-Kubernetes DPU service is supported across the entire cluster.
-
Internal Ref #4209524
-
Workaround: No workaround, known limitation.
-
-
[OVN-Kubernetes DPUService] Lost traffic from workloads to control plane components or K8S services after DPU reboot, port flapping, OVS restart or manual network configuration
-
Connectivity issues between workload pods and control plane components or K8S services may occur after any of the following events: a DPU reboot without a host reboot, high-speed port flapping (link down/up), an OVS restart, or a DPU network configuration change (for example, using the "netplan apply" command on the DPU).
In these cases, the network configuration that was applied by the OVN CNI components on the DPUs may be lost and is not reapplied automatically.
-
Internal Ref #4202272
-
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
-
-
[OVN-Kubernetes DPUService] host network configuration may result in lost traffic from host workloads (on overlay)
-
When the host network configuration is changed (for example, with netplan apply), custom network configuration that was applied by the host CNI components may be lost and is not reapplied automatically.
-
Internal Ref #4188044
-
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
-
-
[OVN-Kubernetes DPUService] No network connectivity for SR-IOV accelerated workload pods after DPU reboot
-
An SR-IOV accelerated workload pod loses its VF interface upon DPU reboot. The VF is available on the host but is not injected back into the pod.
-
Internal Ref #4236521
-
Workaround: Recreate the SR-IOV accelerated workload pods.
-
-
[HBN DPUService] HBN DPUService cannot dynamically reload configurations
-
When the HBN configuration is updated through a ConfigMap, the running HBN container does not reload it and needs to be restarted with the new configuration.
-
Internal Ref #4290426
-
Workaround: Recreate the HBN DPU service after changing the configuration.
-
-
[HBN DPUService] Invalid HBN configuration is not reflected to user in case it is syntactically valid
-
If the HBN YAML configuration is syntactically valid but contains values that are illegal from an NVUE perspective, the HBN service starts with the last known valid configuration and the problem is not reported to the end user.
-
Workaround: No workaround, known issue.
-
-
[HBN + OVN-Kubernetes DPUServices] HBN service restart on DPU causes worker to lose traffic
-
If the HBN pod on the DPU resets, the workloads on the host (any traffic on the OVN overlay) stop receiving traffic.
-
Internal Ref #4220185, #4223176
-
Workaround: Wait 15 minutes for the system to recover or restart the OVN-Kubernetes Pod on that particular DPU.
-
-
[DTS DPUService] DTS appears as OutOfSync
-
When creating a DPUDeployment for the DTS DPU service, the DPUService object can be marked as OutOfSync although the pods are running on the DPUs.
-
Internal Ref #4182929
-
Workaround: No workaround, known issue.
-
-
[OVN-Kubernetes DPUService] SR-IOV test Pod cannot reach Kubernetes API Service
-
When running an SR-IOV test Pod, the Pod cannot reach the Kubernetes API Service. The cause is that the related conntrack entries are sometimes missing the un-NAT.
-
Internal Ref #4313629
-
Workaround: Run the following commands on the DPUs:
ovs-appctl revalidator/purge
ovs-appctl dpctl/flush-conntrack
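To illustrate the "BFB filename must be unique" limitation listed above, here is a minimal Python sketch, not part of DPF, that flags BFB definitions which share a filename but point at different URLs. The dictionary layout (name, spec.filename, spec.url), the function name find_filename_conflicts, the sample CR names, and the example.invalid URLs are all assumptions made for illustration; the real CR schema may differ.

```python
from collections import defaultdict


def find_filename_conflicts(bfbs):
    """Map each conflicting filename to the sorted names of the CRs using it.

    A filename is treated as conflicting when at least two entries use it
    while referencing different URLs. The input layout is hypothetical.
    """
    urls_by_filename = defaultdict(dict)  # filename -> {url: [CR names]}
    for bfb in bfbs:
        spec = bfb["spec"]
        urls_by_filename[spec["filename"]].setdefault(spec["url"], []).append(bfb["name"])

    conflicts = {}
    for filename, names_by_url in urls_by_filename.items():
        if len(names_by_url) > 1:  # same filename, more than one distinct URL
            conflicts[filename] = sorted(
                name for names in names_by_url.values() for name in names
            )
    return conflicts


# Hypothetical sample data: bfb-a and bfb-b reuse one filename for different URLs.
bfbs = [
    {"name": "bfb-a", "spec": {"filename": "bundle.bfb", "url": "http://example.invalid/a.bfb"}},
    {"name": "bfb-b", "spec": {"filename": "bundle.bfb", "url": "http://example.invalid/b.bfb"}},
    {"name": "bfb-c", "spec": {"filename": "other.bfb", "url": "http://example.invalid/c.bfb"}},
]

print(find_filename_conflicts(bfbs))  # {'bundle.bfb': ['bfb-a', 'bfb-b']}
```

A check of this kind could be run against exported BFB definitions before creating a new BFB CR, so that a reused filename is caught before the wrong BFB file is referenced.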