This is the GA release of the DOCA Platform Framework (DPF). It includes bug fixes and improvements to enhance the provisioning and orchestration of NVIDIA BlueField DPUs in Kubernetes environments.
Revision History
| Date | Description |
|---|---|
| May 2026 | General Availability (GA) release of DOCA Platform Framework v26.4.0 |
Features
-
DPU operational health conditions
-
Description: DPU resources now expose runtime health via `status.operationalConditions`, separate from provisioning lifecycle conditions. Conditions include `OperationalReady`, `NodeProblemsReady`, `DPUServiceCriticalPodsReady`, `DPUServiceNonCriticalPodsReady`, `DPUServiceInterfacesReady`, and `DPUServiceChainsReady`. The `OPERATIONAL` column is shown in `kubectl get dpu` output.
-
For details, see DPU Operational Readiness.
-
-
DPU Agent based provisioning
-
Description: DPF now uses the DPU Agent for DPU-side provisioning operations and status reporting. The agent reports provisioning progress through DPU status and supports secure communication with the control plane.
-
-
BFB LTS to LTS upgrade support
-
DPF now supports upgrading DPUs running DOCA BFB LTS releases across LTS boundaries (e.g., from BFB based on DOCA 25.10 LTS to BFB based on DOCA 26.10 LTS). Previously, DPF enforced a strict N-1 version policy that required sequential upgrades through every intermediate release. With this change, DPUs provisioned with an older LTS BFB can be managed by a newer DPF release without being blocked by version validation. See the BlueField BFB Support Matrix for supported version combinations.
-
As part of this feature, DPF now enforces Kubernetes version skew policy validation during upgrades. The DPU Agent reports the kubelet version running on each DPU, and the operator validates that all DPU kubelet versions are within the supported skew relative to the DPU cluster's kube-apiserver version before proceeding with an upgrade.
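The skew validation described above can be illustrated with a small shell sketch. This is not DPF code: the function names `minor_of` and `kubelet_skew_ok` are hypothetical, and the default limit of three minor versions is an assumption borrowed from the upstream Kubernetes kubelet skew policy. The limit that DPF enforces may differ, and the sketch assumes both versions share the same major version.

```shell
# Hypothetical helpers for illustration only; not part of DPF.

# Print the minor version of a string such as "v1.33.2".
minor_of() {
  v="${1#v}"      # drop a leading "v"
  v="${v#*.}"     # drop the major version and its dot
  printf '%s\n' "${v%%.*}"
}

# Succeed when the kubelet is not newer than the API server and is at most
# max_skew minor versions older. max_skew defaults to 3 (assumption).
# Usage: kubelet_skew_ok <apiserver-version> <kubelet-version> [max_skew]
kubelet_skew_ok() (
  apiserver_minor="$(minor_of "$1")"
  kubelet_minor="$(minor_of "$2")"
  max_skew="${3:-3}"
  [ "$kubelet_minor" -le "$apiserver_minor" ] &&
    [ $((apiserver_minor - kubelet_minor)) -le "$max_skew" ]
)
```

For example, `kubelet_skew_ok v1.33.2 v1.31.0 && echo "within skew"` compares a DPU kubelet version against the DPU cluster's kube-apiserver version.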
-
-
Optional BFB PVC for provisioning
-
Description: `DPFOperatorConfig.spec.provisioningController.bfbPVCName` is no longer required. If it is not set, the provisioning controller uses node-local storage for BFB downloads by default.
-
-
Multi DPUCluster support
-
Description: DPF now supports more than one DPUCluster, enabling larger-scale environments. Bin-packing allocation is applied when multiple DPUClusters are used. DPF APIs now support selecting target DPU clusters with `dpuClusterSelector`. This allows DPU services, networking resources, and DPU deployments to target specific DPU clusters by label.
-
-
DPF-managed SR-IOV device plugin for host VF resources
-
Description: DPF can now manage SR-IOV device plugin pods for host VF resources through `NodeSRIOVDevicePluginConfig` objects. This allows DPF to expose DPU-backed VF resources on host nodes without relying on the SR-IOV Network Operator for this function.
-
-
Secure Boot configuration for Zero Trust provisioning
-
Description: DPF can now configure UEFI Secure Boot during Zero Trust provisioning through the `secureBoot` field in the DPU template.
-
-
DPF-operator-managed observability components on DPU clusters
-
Description: The DPF operator now manages observability components on each DPU cluster via `DPFOperatorConfig.spec.monitoring`:
-
Kube-State-Metrics (enabled by default): Exposes metrics on the host cluster for DPF custom resources (ServiceChain, ServiceInterface, IPPool, CIDRPool, etc.) deployed on each DPU cluster.
-
Node-Problem-Detector (enabled by default): Runs DPU-specific health checks (OVS, SR-IOV, uplink, DPU mode, MTU) and reports conditions to the DPU object.
-
OpenTelemetry Collector: Collects and forwards logs from DPU cluster pods to a configurable endpoint via `spec.monitoring.openTelemetryCollector.logging.endpoint`.
-
-
Kube-State-Metrics and Node-Problem-Detector are enabled by default. OpenTelemetry Collector requires explicit endpoint configuration.
-
For details, see DPF-Operator-Managed Components.
-
-
New Grafana dashboards for DPU fleet monitoring
-
Description: Two new Grafana dashboards are deployed by default:
-
DPU Fleet Health: Fleet-wide overview of DPU operational status, condition breakdown, and health trends
-
DPU Detail: Per-DPU drill-down with operational conditions, agent status, and provisioning history
-
-
-
dpfctl SOS Report Collection:
-
New `dpfctl sosreport` command for collecting system diagnostics from host and DPU cluster nodes
-
Subcommands: `start`, `status`, `download`, `collect` (one-step workflow), `cleanup`
-
Support for targeting specific environments (`--target host/dpu/all`), individual nodes (`--nodes`), and specific DPU clusters (`--dpu-cluster`)
-
NFS output mode for writing reports directly to a shared mount
-
`--archive` flag to create a single `.tar.gz` for ticket attachment
-
Watch mode (`status -w`) for real-time job monitoring
-
See dpfctl sosreport for usage details
-
Fixed Issues from Previous Release
-
DPUFlavor `nvconfig` now supports targeting specific devices instead of applying the same configuration to all devices
-
Updating `DPUNode.spec.nodeRebootMethod` during DPU provisioning is now rejected unless the DPU is in a valid phase
-
DPUVolumeAttachment for emulated NVMe devices on hot-plugged PFs no longer reports an invalid `00:00.0` PCI address
Supported DPU Services:
-
OVN-Kubernetes
Dependencies
Hardware and Software Requirements
For Host Trusted mode: Refer to Host Trusted Prerequisites for detailed requirements.
For Zero Trust mode: Refer to Zero Trust Prerequisites for detailed requirements. In addition, note the additional requirements for deploying via Redfish.
-
DPU Hardware: NVIDIA BlueField-3 DPUs
-
Minimum DOCA BFB Image: DOCA v2.5 or higher (must be pre-installed on DPUs to support DPF provisioning)
-
Supported DOCA BFB Image:
bf-bundle-X.y.z
Installation & Upgrade Notes
Installation Notes
None.
Upgrade Notes
DPF supports upgrades from the immediate previous GA release. For more information, refer to Upgrade Procedures.
-
[IMPORTANT] DPUs provisioned with BFB LTS 3.2 via DPF v25.10 must be reprovisioned after upgrade
-
Description: Users who want to keep running DOCA BFB LTS 3.2 on their DPUs after upgrading to DPF v26.4 must reprovision all affected DPUs. This is required because DPUs provisioned under DPF v25.10 do not have the DPU Agent installed and do not report their kubelet version, which is now required for Kubernetes version skew validation during future upgrades.
-
Impact: If affected DPUs are not reprovisioned, upgrades to subsequent DPF releases may be blocked by version skew validation failures.
-
Required action: After upgrading to DPF v26.4 and verifying that the `DPFOperatorConfig` status is healthy, reprovision each affected DPU by deleting its DPU CR. The DPU will be automatically reprovisioned with the current DPF version:

```bash
# List DPUs in the DPF namespace
kubectl get dpus -n $DPF_NAMESPACE
# Delete each DPU CR to trigger reprovisioning
kubectl delete dpu $DPU_NAME -n $DPF_NAMESPACE
# Validate DPUs have the kubelet version reported in their status after reprovisioning
kubectl get dpu $DPU_NAME -n $DPF_NAMESPACE -o jsonpath='{.status.agentStatus.kubeletVersion}'
```
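To confirm that every DPU reports a kubelet version after reprovisioning, the per-DPU validation can be applied to the whole namespace. The following is a minimal sketch, assuming `kubectl` access to the host cluster. The function name `dpus_missing_kubelet_version` is hypothetical; the status path is the one used in the validation command of this note.

```shell
# Hypothetical helper: print the names of DPUs in namespace $1 whose status
# does not yet report a kubelet version.
dpus_missing_kubelet_version() {
  kubectl get dpus -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.agentStatus.kubeletVersion}{"\n"}{end}' |
    awk -F '\t' '$2 == "" { print $1 }'
}
```

Running `dpus_missing_kubelet_version "$DPF_NAMESPACE"` should print nothing once all DPUs have been reprovisioned and their agents report status.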
-
-
Lease names changed for DPF controllers
-
Description: The names of the Kubernetes lease objects used for controller leader election have changed to better reflect the DPF component that uses each lease. The new lease name has the format `<component-name>.dpu.nvidia.com`, e.g. `dpf-operator.dpu.nvidia.com`. As a result, the old lease objects are no longer used.
-
Impact: No functional impact. Stale lease objects remain in the cluster after the upgrade; they do not affect functionality but can be removed for cleanliness.
-
Note: If desired, remove the stale lease objects from the `dpf-operator-system` and `ovn-kubernetes` namespaces after the upgrade. The following lease objects can be removed:
-
`kubectl -n dpf-operator-system delete lease --ignore-not-found 19f9f38b.nvidia.com 204fbe18.dpu.nvidia.com 507jei28.dpu.nvidia.com 8a3114c5.dpu.nvidia.com e361afcf.nvidia.com snap-host-controller.nvidia.com snap-node-driver.nvidia.com`
-
`kubectl -n ovn-kubernetes delete lease --ignore-not-found 8a3114c5.dpu.nvidia.com`
-
-
-
ArgoCD prerequisite upgrade from v2 to v3
-
Description: DPF is tested with ArgoCD v3.3.0 (chart version v9.4.1) as a prerequisite. ArgoCD is a user-managed prerequisite component that must be upgraded independently before upgrading DPF.
ArgoCD v3 changes the default resource tracking method from labels to annotations. Applications require a sync operation after the ArgoCD upgrade to migrate tracking information and prevent resource management issues.
For complete details on the tracking method change and migration requirements, refer to the ArgoCD v3.0 Upgrade Guide.
-
Impact: No impact if ArgoCD Applications were synced after upgrading ArgoCD and before upgrading DPF. Otherwise, resources related to components such as the servicechain controller may be leaked.
-
Note: For users manually upgrading ArgoCD, trigger a sync operation on all Applications after upgrading ArgoCD and before upgrading DPF:

```bash
# Trigger sync for all applications in dpf-operator-system namespace
for app in $(kubectl get application -n dpf-operator-system -o name); do
  kubectl patch $app -n dpf-operator-system --type=merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"syncStrategy":{"hook":{}}}}}'
done
# Wait for the sync operation to succeed for all applications
kubectl wait application --all -n dpf-operator-system --for=jsonpath='{.status.operationState.phase}'=Succeeded --timeout=300s
```
-
-
`revisionHistoryLimit` no longer accepts values less than 1
-
Description: `revisionHistoryLimit` values less than 1 are now rejected. Setting a value less than 1 would result in the object being treated as non-disruptive for in-place replacement, which could lead to unexpected behavior.
-
Impact: Users who previously set `revisionHistoryLimit` to 0 or a negative value will need to update their configuration to use a value of 1 or greater.
-
-
Zero Trust mode requires `kubernetesAPIServerVIP` and `kubernetesAPIServerPort` in DPFOperatorConfig
-
Description: The `kubernetesAPIServerVIP` and `kubernetesAPIServerPort` fields under `.spec.overrides` in the `DPFOperatorConfig` are now required when using Zero Trust mode (Redfish-based provisioning). In previous releases, these fields were not enforced for Zero Trust mode.
-
Impact: If these fields are not set, DPU provisioning in Zero Trust mode will get stuck and the DPU condition will report the message: `KubernetesAPIServerVIP and KubernetesAPIServerPort must be set in DPFOperatorConfig for zero-trust mode`. Existing Zero Trust deployments upgrading to this release must ensure these fields are configured before provisioning or re-provisioning DPUs.
-
Note: Update your `DPFOperatorConfig` to include the API server VIP and port:

```yaml
spec:
  overrides:
    kubernetesAPIServerVIP: "<api-server-vip>"
    kubernetesAPIServerPort: <api-server-port>
```
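Before provisioning DPUs in Zero Trust mode, it may help to confirm that both override fields are populated. The sketch below is a hedged example rather than an official tool: the function name `zero_trust_overrides_set` is hypothetical, and the object name `dpfoperatorconfig` and namespace `dpf-operator-system` are assumptions taken from the patch command used elsewhere in these notes.

```shell
# Hypothetical pre-flight check: succeed only when both override fields are set.
# Assumes the DPFOperatorConfig is named "dpfoperatorconfig" in "dpf-operator-system".
zero_trust_overrides_set() {
  vip="$(kubectl get dpfoperatorconfig dpfoperatorconfig -n dpf-operator-system \
    -o jsonpath='{.spec.overrides.kubernetesAPIServerVIP}')"
  port="$(kubectl get dpfoperatorconfig dpfoperatorconfig -n dpf-operator-system \
    -o jsonpath='{.spec.overrides.kubernetesAPIServerPort}')"
  if [ -z "$vip" ] || [ -z "$port" ]; then
    echo "missing: kubernetesAPIServerVIP='$vip' kubernetesAPIServerPort='$port'" >&2
    return 1
  fi
  echo "ok: $vip:$port"
}
```

The function returns a non-zero status when either field is empty, so it can gate a provisioning script.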
-
-
Required field changes in DPUSet and DPUDeployment:
-
Description: The following API fields have changed from optional pointers to required value types:
-
`DPUSet.spec.strategy`
-
`DPUSet.spec.dpuTemplate.spec.nodeEffect`
-
`DPUDeployment.spec.dpus.dpuSetStrategy`
-
`DPUDeployment.spec.dpus.nodeEffect`
-
-
Impact: After upgrading from v25.10, the DPUDeployment controller will fail to reconcile DPUSets for existing `DPUDeployment` objects that are missing the new required fields. The `DPUSetsReconciled` condition on affected DPUDeployments will report an error. Existing DPUSets and DPUs continue running during this window; only reconciliation of new changes is blocked until the fields are set.
-
Required action: After upgrading to v26.4, patch each existing `DPUDeployment` to add the new required fields. For example:

```bash
# Host Trusted:
kubectl patch dpudeployment <name> -n <namespace> --type merge -p '{"spec":{"dpus":{"dpuSetStrategy":{"type":"RollingUpdate"},"nodeEffect":{"drain":true}}}}'
# Zero Trust:
kubectl patch dpudeployment <name> -n <namespace> --type merge -p '{"spec":{"dpus":{"dpuSetStrategy":{"type":"OnDelete"},"nodeEffect":{"hold":true}}}}'
```

Replace the example values with the `dpuSetStrategy` and `nodeEffect` that match your deployment. See the DPUDeployment and DPUSet API references for available options.
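To find which DPUDeployments still need the patch, one option is to list the objects where either field is empty. This is a sketch under assumptions: the function name `dpudeployments_missing_required_fields` is hypothetical, and it treats an empty JSONPath result for `spec.dpus.dpuSetStrategy` or `spec.dpus.nodeEffect` as "missing".

```shell
# Hypothetical helper: list DPUDeployments (all namespaces) as <namespace>/<name>
# when spec.dpus.dpuSetStrategy or spec.dpus.nodeEffect is not set.
dpudeployments_missing_required_fields() {
  kubectl get dpudeployment -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.dpus.dpuSetStrategy}{"\t"}{.spec.dpus.nodeEffect}{"\n"}{end}' |
    awk -F '\t' '$3 == "" || $4 == "" { print $1 "/" $2 }'
}
```

Each printed entry identifies a DPUDeployment to patch with values appropriate for that deployment.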
-
-
Migration to `NodeSRIOVDevicePluginController`
-
Disclaimer: This migration note applies only to Host Trusted use-cases that relied on the SR-IOV Network Operator to configure VF resources on the host.
-
Description: Starting in v26.4, DPF manages host VF resources exposed from DPUs through the `NodeSRIOVDevicePluginController`. This removes the dependency on the SR-IOV Network Operator for managing the SR-IOV device plugin used for host VF resources. The DPF-managed device plugin is configured through `NodeSRIOVDevicePluginConfig` objects and linked to the relevant DPUs through the `noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig` annotation.
-
Impact: After upgrading from v25.10, existing `SriovNetworkNodePolicy` objects are not migrated automatically. Host VF resources will not be managed by DPF until the new controller is enabled, the `NodeSRIOVDevicePluginConfig` objects are created, and the relevant `DPU` objects are annotated. Updating `DPUDeployment.spec.dpus.dpuSets[*].dpuAnnotations` after the upgrade does not automatically propagate the annotation to already existing `DPU` objects, so an explicit annotation step is required to apply the configuration without reprovisioning.

During migration there can be a temporary gap where existing workloads continue to run, but new workloads that request VF resources may fail to schedule. This gap starts after the SR-IOV Network Operator removes or reconfigures the device plugin it manages and ends after the affected `DPU` objects are annotated and the DPF-managed device plugin is running on the nodes.
-
Required action:
-
Enable `nodeSRIOVDevicePluginController` in `DPFOperatorConfig`.

```bash
kubectl patch dpfoperatorconfig dpfoperatorconfig -n dpf-operator-system --type merge -p '{"spec":{"nodeSRIOVDevicePluginController":{"disable":false}}}'
```
-
Create the required `NodeSRIOVDevicePluginConfig` objects for the VF resources that should be exposed on each host, and make sure the configuration matches the PFs and VF ranges you intend to manage.
-
Remove the `SriovNetworkNodePolicy` objects that manage the same host VF resources, and wait until the SR-IOV Network Operator processes the change and restarts or removes the device plugin it manages on the affected nodes.
-
Update each `DPUDeployment` so the relevant `dpuSets[*].dpuAnnotations` point to the correct `NodeSRIOVDevicePluginConfig`.

```yaml
spec:
  dpus:
    dpuSets:
    - nameSuffix: dpuset1
      dpuAnnotations:
        noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig: <config-name>
```
-
Explicitly annotate all existing `DPU` objects that should use the new configuration. This step is required to apply the new configuration without reprovisioning the DPUs. If all affected DPUs in the namespace should use the same config, you can annotate them with:

```bash
kubectl -n dpf-operator-system annotate dpus.provisioning.dpu.nvidia.com --all 'noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig=<config-name>' --overwrite
```
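After the annotation step, a quick check can show whether any DPU was left without the annotation. The following is a minimal sketch, assuming the DPUs live in `dpf-operator-system` as in the commands of this note; the function name `dpus_missing_sriov_annotation` is hypothetical.

```shell
# Hypothetical helper: print the names of DPUs that do not carry the
# noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig annotation.
dpus_missing_sriov_annotation() {
  kubectl -n dpf-operator-system get dpus.provisioning.dpu.nvidia.com \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.noderesources\.dpu\.nvidia\.com/nodesriovdevicepluginconfig}{"\n"}{end}' |
    awk -F '\t' '$2 == "" { print $1 }'
}
```

An empty result suggests every DPU in the namespace references a `NodeSRIOVDevicePluginConfig`; it does not verify that the referenced config exists.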
-
-
Deprecated API Fields
The following API fields have been deprecated and will be removed in future releases. Users should migrate to the recommended alternatives.
-
DPFOperatorConfig Fields:
-
Description: Multiple fields in `DPFOperatorConfig` remain deprecated in v26.4.
-
Top-level `image` fields have been replaced with container-specific image override fields for more granular control.
-
Legacy BFB registry fields under `installViaRedfish` have been replaced by the top-level `registry` field.
-
The legacy gNOI installation interface has been replaced by the host agent installation interface.
-
-
Deprecated Fields:
-
Image fields:
-
`spec.provisioningController.image` -> Use `spec.provisioningController.controller.image` instead
-
`spec.dpuServiceController.image` -> Use `spec.dpuServiceController.controller.image` instead
-
`spec.dpuDetector.image` -> Use `spec.dpuDetector.daemon.image` instead
-
`spec.kamajiClusterManager.image` -> Use `spec.kamajiClusterManager.controller.image` instead
-
`spec.staticClusterManager.image` -> Use `spec.staticClusterManager.controller.image` instead
-
`spec.serviceSetController.image` -> Use `spec.serviceSetController.controller.image` instead
-
`spec.flannel.image` -> Use `spec.flannel.cni` and `spec.flannel.daemon` instead
-
`spec.multus.image` -> Use `spec.multus.cni.image` instead
-
`spec.nvipam.image` -> Use `spec.nvipam.controller.image` instead
-
`spec.sriovDevicePlugin.image` -> Use `spec.sriovDevicePlugin.deviceplugin.image` instead
-
`spec.ovsCNI.image` -> Use `spec.ovsCNI.cni.image` instead
-
`spec.sfcController.image` -> Use `spec.sfcController.controller.image` instead
-
Provisioning controller fields:
-
`spec.provisioningController.bfCFGTemplateConfigMap` -> Use `spec.provisioningController.enableDynamicBFCFGTemplates` instead
-
`spec.provisioningController.installInterface.installViaGNOI` -> Use `spec.provisioningController.installInterface.installViaHostAgent` instead
-
`spec.provisioningController.installInterface.installViaRedfish.bfbRegistryAddress` -> Use `spec.provisioningController.registry` instead
-
`spec.provisioningController.installInterface.installViaRedfish.bfbRegistry` -> Use `spec.provisioningController.registry` instead
-
`spec.provisioningController.registry.address` -> Leave unset to use the controller's default logic for determining the bfb-registry address. Use `spec.provisioningController.registry.loadBalancerAddress` when the registry must be reachable through an external endpoint
-
`spec.provisioningController.registry.port` -> Leave unset to use the controller's default logic for determining the bfb-registry port. Use `spec.provisioningController.registry.loadBalancerAddress` when the registry must be reachable through an external endpoint
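To see which of the deprecated top-level image fields are still set on a live configuration, the fields listed above can be queried one by one. This is a sketch, not an official migration tool: the function name `deprecated_image_fields_in_use` is hypothetical, and the object name `dpfoperatorconfig` and namespace `dpf-operator-system` are assumptions based on the commands used elsewhere in these notes.

```shell
# Hypothetical helper: print each deprecated spec.<component>.image field that
# currently has a value in the DPFOperatorConfig.
deprecated_image_fields_in_use() {
  for component in provisioningController dpuServiceController dpuDetector \
    kamajiClusterManager staticClusterManager serviceSetController flannel \
    multus nvipam sriovDevicePlugin ovsCNI sfcController; do
    value="$(kubectl get dpfoperatorconfig dpfoperatorconfig -n dpf-operator-system \
      -o jsonpath="{.spec.${component}.image}")"
    if [ -n "$value" ]; then
      echo "spec.${component}.image=${value}"
    fi
  done
}
```

Each printed line names a field to move to its container-specific replacement from the list above.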
-
-
-
DPUDevice Fields:
-
Description: The following fields in `DPUDevice.spec` have been deprecated in favor of their corresponding status fields, which are automatically populated by the system.
-
Deprecated Fields:
-
`spec.psid` -> Use `status.psid` instead
-
`spec.opn` -> Use `status.opn` instead
-
`spec.pf0Name` -> Use `status.pf0Name` instead
-
-
-
DPUNode Fields:
-
Description: The `gNOI` reboot method and `nodeDMSAddress` field have been deprecated.
-
Deprecated Fields:
-
`spec.nodeRebootMethod.gNOI` -> Use `spec.nodeRebootMethod.hostAgent` instead
-
`spec.nodeDMSAddress` -> This field is no longer used
-
-
-
DPUSet and DPUDeployment Fields:
-
Description: Legacy DPU selection fields have been deprecated in favor of explicit DPU node and DPU device selectors.
-
Deprecated Fields:
-
`DPUSet.spec.strategy.rollingUpdate.maxUnavailable` -> Use `DPFOperatorConfig.spec.provisioningController.maxUnavailableDPUNodes` instead
-
`DPUSet.spec.dpuSelector` -> Use `DPUSet.spec.dpuDeviceSelector` instead
-
`DPUDeployment.spec.dpus.dpuSets[*].nodeSelector` -> Use `DPUDeployment.spec.dpus.dpuSets[*].dpuNodeSelector` instead
-
`DPUDeployment.spec.dpus.dpuSets[*].dpuSelector` -> Use `DPUDeployment.spec.dpus.dpuSets[*].dpuDeviceSelector` instead
-
-
-
DPU Fields:
-
Description: The `bmcIP` field in the DPU spec is deprecated; obtain the BMC IP from the associated DPUDevice instead.
-
Deprecated Fields:
-
`spec.bmcIP` -> Use `DPUDevice.spec.bmcIp` or `DPUDevice.status.bmcIp` instead
-
-
-
DPUService DPUCluster Selector Fields:
-
Description: Legacy `clusterSelector` fields have been deprecated in favor of `dpuClusterSelector`.
-
Deprecated Fields:
-
`DPUServiceChain.spec.clusterSelector` -> Use `DPUServiceChain.spec.dpuClusterSelector` instead
-
`DPUServiceIPAM.spec.clusterSelector` -> Use `DPUServiceIPAM.spec.dpuClusterSelector` instead
-
`DPUServiceInterface.spec.clusterSelector` -> Use `DPUServiceInterface.spec.dpuClusterSelector` instead
-
-
-
DPUServiceIPAM Fields:
-
Description: The legacy single-IP exclusion list has been deprecated in favor of explicit exclusion ranges.
-
Deprecated Fields:
-
`spec.ipv4Subnet.exclusions` -> Use `spec.ipv4Subnet.excludeRanges` instead
-
-
-
ServiceInterface Fields:
-
Description: The legacy OVN interface definition has been deprecated in favor of patch interface configuration.
-
Deprecated Fields:
-
`spec.ovn` -> Use `interfaceType="patch"` with `spec.patch.peerBridge` and `spec.patch.peerPatchName` instead
-
-
Known Issues and Limitations
-
Socket Direct environments are not supported
-
DPF does not currently support environments where NVIDIA Mellanox Socket Direct adapters are used. Socket Direct is a network adapter architecture that provides direct PCIe access from multiple CPU sockets to a single NIC, bypassing the inter-processor bus. DPUs in Socket Direct configurations are not tested or validated with DPF.
-
For more details, see Platform Support - Limitations.
-
-
Stale ports after DPU reboot
-
When a DPU is rebooted, the old DPU service ports are not deleted from the DPU's OVS and remain as stale ports.
-
Internal Ref #4174183
-
Workaround: None. This is a known issue and should not affect performance.
-
-
DPU Cluster control-plane connectivity is lost when physical port P0 is down on the worker node
-
A link-down event on the P0 port of the DPU results in loss of control-plane connectivity for DPU components.
-
Internal Ref #3751863
-
Workaround: Make sure the P0 link is up on the DPU. If it is down, either restart the DPU or refer to the DOCA troubleshooting guide: https://docs.nvidia.com/networking/display/bfswtroubleshooting. Note: This issue is relevant for Host Trusted deployments only.
-
-
System doesn't recover after DPU Reset
-
When the user triggers a reset of a DPU in any way other than using DPF APIs (e.g. recreation of a DPU CR), the system may not recover.
-
Internal Ref #4521178, #4188044, #4732664
-
Workaround: Power cycle the host. This operation is risky: it can cause file system corruption, which requires reprovisioning the DPU. If the system does not recover, reprovision the DPU by deleting the DPU CR; the CR is recreated and DPU provisioning starts again. Note: This issue is relevant for Host Trusted deployments where OVN-Kubernetes is used as a primary CNI.
-
-
Leftover CRs if worker is removed from the cluster permanently
-
When a worker is added to a cluster, optionally has a DPU provisioned, and is later removed from the host cluster permanently, DPF-related CRs may be left over in both the host cluster and the DPU cluster.
-
Internal Ref #4403130, #4571788
-
Workaround: No workaround, known issue.
-
-
A DPUDeployment or DPUSet created in a namespace other than dpf-operator-system does not trigger DPU provisioning
-
When a DPUDeployment or a DPUSet is created in a namespace other than dpf-operator-system, no DPU CRs are created because the DPUNode CRs reside in the dpf-operator-system namespace.
-
Internal Ref #4427091
-
Workaround: Create the DPUDeployment or DPUSet in the dpf-operator-system namespace.
-
-
DPUService stops reconciling when DPUCluster is unavailable for a long time
-
When the DPUCluster is unavailable for a long time (more than 5 minutes), changes made to DPUServices during that time (including DPUServices generated via DPUDeployment or DPFOperatorConfig) might not be reflected in the DPUCluster.
-
Internal Ref #4359857
-
Workaround: Recreate DPUServices that are stuck.
-
-
[OVN-Kubernetes DPUService] Lost traffic from workloads to control plane components or Kubernetes services after DPU reboot, port flapping, OVS restart or manual network configuration
-
Connectivity issues between workload pods and control plane components or Kubernetes services may occur after the following events: DPU reboot without host reboot, high-speed port flapping (link down/up), OVS restart, or a DPU network configuration change (for example, running `netplan apply` on the DPU). The issues are caused by network configuration that was applied by the OVN CNI on the DPUs and is not reapplied automatically.
-
Internal Ref #4188044, #4521178
-
Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.
-
-
[OVN-Kubernetes DPUService] No network connectivity for SR-IOV accelerated workload pods after DPU reboot
-
An SR-IOV accelerated workload pod loses its VF interface upon DPU reboot. The VF is available on the host but is not injected back into the pod.
-
Internal Ref #4236521
-
Workaround: Recreate the SR-IOV accelerated workload pods.
-
-
[OVN-Kubernetes DPUService] Node annotations not updated when PF IP allocation changes on DPU-enabled hosts
-
On DPU-enabled workers, the uplink IP allocated to the physical function (PF) can change. In some cases ovn-kubernetes does not refresh node annotations to reflect the new address, which can lead to inconsistent gateway and host addressing metadata on the node.
-
Internal Ref #5016628
-
How to check: Compare `k8s.ovn.org/primary-dpu-host-addr` with the addresses recorded in `k8s.ovn.org/node-primary-ifaddr` and in `k8s.ovn.org/l3-gateway-config` (the `ip-address`/`ip-addresses` fields for the default gateway entry). If `k8s.ovn.org/primary-dpu-host-addr` disagrees with those values while `node-primary-ifaddr` and `l3-gateway-config` match each other, the `primary-dpu-host-addr` annotation is likely stale relative to the PF.
-
Workaround: Restart the ovn-kubernetes pod on the DPU for the affected host or worker (in the `ovn-kubernetes` namespace) so annotations and related state are reconciled.
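The comparison described under "How to check" can be made easier by printing the three annotations side by side. This is a minimal sketch, assuming `kubectl` access to the host cluster; the function name `show_ovn_addr_annotations` is hypothetical, and the annotation values are printed as-is for manual comparison rather than parsed.

```shell
# Hypothetical helper: print the three OVN-Kubernetes address annotations of
# node $1, one per line, so they can be compared by eye.
show_ovn_addr_annotations() {
  for key in primary-dpu-host-addr node-primary-ifaddr l3-gateway-config; do
    value="$(kubectl get node "$1" \
      -o jsonpath="{.metadata.annotations.k8s\.ovn\.org/${key}}")"
    printf '%s: %s\n' "k8s.ovn.org/${key}" "$value"
  done
}
```

For example, `show_ovn_addr_annotations <node-name>` prints each annotation key followed by its raw value.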
-
-
[Firefly DPUService] Deletion of the Firefly DPUService leaves stale flows in OVS
-
When the Firefly DPUService is deleted after successful deployment or the labels of the serviceDaemonSet are modified, flows are not cleaned up from OVS.
-
Internal Ref #4382535
-
Workaround: These flows are very unlikely to cause an issue. If needed, reprovisioning the DPU or power cycling the host restores OVS to a good state. Note that power cycling the host is risky: it can cause file system corruption, which requires reprovisioning the DPU.
-
-
[DPUDeployment] Pods of service referenced in spec.serviceChain are restarted when serviceChain changes, causing disruption
-
When updating DPUDeployment.spec.serviceChain with a disruptive upgrade, the affected pods are restarted but DPUDeployment removes the node effect from the corev1.Node in the host cluster proactively. This causes the node to be marked as ready and accept workloads before DPU services are fully operational.
-
Internal Ref: #4686635
-
Workaround: No workaround, known issue.
-
-
Long DPU provisioning time when multiple DPUs are provisioned on the same node
-
In some cases, provisioning multiple DPUs on the same node may take a long time.
-
Internal Ref: #4757191
-
Workaround: If the DPU is stuck in the `DPU Cluster Config` phase after a long installation, reprovision the DPU and make sure no other DPUs on the same node are being provisioned at the same time.
-
-
[DPUDeployment] DPU does not recover to Ready after Node Effect Removal timeout, even when service configuration is reverted
-
When a DPUDeployment is deployed and becomes ready, and then a service is updated with an invalid configuration (e.g. overriding the image with a wrong one in DPUServiceConfiguration), the DPU enters the Node Effect Removal state. If the Node Effect Removal timeout is exceeded, the DPU transitions to Error state. Even after reverting the DPUServiceConfiguration to the previous valid configuration, the DPU does not move back to Ready. The `nodeEffectRemovalTimeout` in `DPFOperatorConfig.spec.provisioningController` defaults to `0s` (disabled), meaning this issue will not occur with the default configuration. If the timeout is explicitly set to a non-zero value (e.g. `30m`), ensure the configured duration allows sufficient time for configuration corrections before the DPU transitions to Error state.
-
Internal Ref: #4952676
-
Workaround: Revert to a working configuration and delete the DPU CR to trigger reprovisioning.
-
-
Leader switch of the provisioning controller pod while the DPU is in the middle of deploying may cause OS installation failure
-
When deploying a DPU without `bfbPVCName` provided in the DPFOperatorConfig, a leader switch in the DPU provisioning controller (for any reason) during the provisioning process may cause the DPU to transition to the `Error` phase due to download failures of the BFB or bfb.cfg during OS installation. This issue produces a `BFBTransferred` condition (`FailToInstall` reason) in Zero Trust environments and an `OSInstalled` condition (`InstallationTerminated` reason) in Host Trusted environments.
-
Internal Ref: #4957820, #4930198
-
Workaround: Delete the DPU CR to trigger reprovisioning.
-
-
DOCA SNAP may fail to reattach a DPUVolumeAttachment that uses VirtioFS or emulated NVMe devices on hot-plugged PFs after a host power cycle
-
After a host power cycle, a DPUVolumeAttachment that uses VirtioFS or emulated NVMe devices on hot-plugged PFs can become not ready and might not be restored automatically.
-
Internal Ref: #4734537
-
Workaround:
-
Step 1: Delete failed DPUVolumeAttachment.
-
Step 2: Create a new DPUVolumeAttachment with the same configuration.
-
-
-
[Zero Trust] Manual change of DPU mode from DPU to NIC after DPU discovery causes provisioning to stall in Rebooting
-
In a Zero Trust deployment, if a user manually changes the DPU mode from DPU to NIC after DPU discovery, DPU provisioning can remain stuck in the `Rebooting` phase.
-
Internal Ref: #4984034
-
Workaround: Manually set the DPU mode back to DPU, or delete the `DPUDevice` CR to trigger `DPUDevice` recreation.
-
-
Upgrade from v25.10 may leave stale `dpudevice-protection` finalizers on non-selected DPUDevices
-
After upgrading from v25.10 to v26.4, some non-selected `DPUDevice` objects can retain the legacy finalizer `provisioning.dpu.nvidia.com/dpudevice-protection`. This can block deletion of those `DPUDevice` objects during cleanup flows.
-
Internal Ref: #5048585
-
Workaround: After the v26.4 provisioning controller is running, review `DPUDevice` objects that are not referenced by any active `DPU`. If such a `DPUDevice` still has `provisioning.dpu.nvidia.com/dpudevice-protection`, remove only that finalizer entry and keep the `DPUDevice` object.
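As a starting point for the review described in the workaround, the DPUDevices that still carry the legacy finalizer can be listed. This is a sketch under assumptions: the function name `dpudevices_with_legacy_finalizer` is hypothetical, and `dpudevices` is assumed to be the resource name accepted by `kubectl`. The sketch only lists candidates; it does not check whether a DPUDevice is referenced by an active DPU, and it does not modify any object.

```shell
# Hypothetical helper: print the names of DPUDevices in namespace $1 whose
# finalizers include provisioning.dpu.nvidia.com/dpudevice-protection.
dpudevices_with_legacy_finalizer() {
  kubectl get dpudevices -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.finalizers}{"\n"}{end}' |
    awk -F '\t' 'index($2, "provisioning.dpu.nvidia.com/dpudevice-protection") { print $1 }'
}
```

Each listed DPUDevice still needs to be checked against the active DPUs before its finalizer entry is removed.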
-