DOCA Platform Framework (DPF) Documentation

DOCA Platform Framework v26.4.0

This is the GA release of the DOCA Platform Framework (DPF). It includes bug fixes and improvements to enhance the provisioning and orchestration of NVIDIA BlueField DPUs in Kubernetes environments.

Revision History

Date        Description

May 2026    General Availability (GA) release of DOCA Platform Framework v26.4.0

Features

  • DPU operational health conditions

    • Description: DPU resources now expose runtime health via status.operationalConditions, separate from provisioning lifecycle conditions. Conditions include OperationalReady, NodeProblemsReady, DPUServiceCriticalPodsReady, DPUServiceNonCriticalPodsReady, DPUServiceInterfacesReady, and DPUServiceChainsReady. The OPERATIONAL column is shown in kubectl get dpu output.

    • For details, see DPU Operational Readiness.

  • DPU Agent based provisioning

    • Description: DPF now uses the DPU Agent for DPU-side provisioning operations and status reporting. The agent reports provisioning progress through DPU status and supports secure communication with the control plane.

  • BFB LTS to LTS upgrade support

    • DPF now supports upgrading DPUs running DOCA BFB LTS releases across LTS boundaries (e.g., from BFB based on DOCA 25.10 LTS to BFB based on DOCA 26.10 LTS). Previously, DPF enforced a strict N-1 version policy that required sequential upgrades through every intermediate release. With this change, DPUs provisioned with an older LTS BFB can be managed by a newer DPF release without being blocked by version validation. See the BlueField BFB Support Matrix for supported version combinations.

    • As part of this feature, DPF now enforces Kubernetes version skew policy validation during upgrades. The DPU Agent reports the kubelet version running on each DPU, and the operator validates that all DPU kubelet versions are within the supported skew relative to the DPU cluster's kube-apiserver version before proceeding with an upgrade.
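
The skew validation described above can be approximated offline before an upgrade. The following is a minimal sketch, assuming the upstream Kubernetes rule that a kubelet must not be newer than kube-apiserver and may lag it by at most three minor versions in recent Kubernetes releases. The exact window DPF enforces is not stated in these notes, so the default max_skew is an assumption, and both function names are hypothetical.

```shell
# Hypothetical helpers, not part of DPF or dpfctl.

# Print the minor component of a version such as "v1.33.2".
minor_of() {
  local version="${1#v}"    # drop an optional leading "v"
  echo "$version" | cut -d. -f2
}

# Succeed (exit 0) when the kubelet minor version is not newer than the
# kube-apiserver minor version and lags it by at most max_skew minors.
# Assumes both versions share the same major version.
kubelet_within_skew() {
  local kubelet_minor apiserver_minor max_skew
  kubelet_minor="$(minor_of "$1")"
  apiserver_minor="$(minor_of "$2")"
  max_skew="${3:-3}"        # assumed default, taken from upstream policy
  [ "$kubelet_minor" -le "$apiserver_minor" ] || return 1
  [ $((apiserver_minor - kubelet_minor)) -le "$max_skew" ]
}
```

For example, kubelet_within_skew v1.31.4 v1.33.1 succeeds, while kubelet_within_skew v1.34.0 v1.33.1 fails because the kubelet is newer than the API server. The kubelet version of a DPU can be read from .status.agentStatus.kubeletVersion on the DPU object.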

  • Optional BFB PVC for provisioning

    • Description: DPFOperatorConfig.spec.provisioningController.bfbPVCName is no longer required. If it is not set, the provisioning controller uses node-local storage for BFB downloads by default.

  • Multi DPUCluster support

    • Description: DPF now supports more than one DPUCluster, enabling larger-scale environments. Bin-packing allocation is applied when multiple DPUClusters are used. DPF APIs now support selecting target DPU clusters with dpuClusterSelector, which allows DPU services, networking resources, and DPU deployments to target specific DPU clusters by label.

  • DPF-managed SR-IOV device plugin for host VF resources

    • Description: DPF can now manage SR-IOV device plugin pods for host VF resources through NodeSRIOVDevicePluginConfig objects. This allows DPF to expose DPU-backed VF resources on host nodes without relying on the SR-IOV Network Operator for this function.

  • Secure Boot configuration for Zero Trust provisioning

    • Description: DPF can now configure UEFI Secure Boot during Zero Trust provisioning through the secureBoot field in the DPU template.

  • DPF-operator-managed observability components on DPU clusters

    • Description: The DPF operator now manages observability components on each DPU cluster via DPFOperatorConfig.spec.monitoring:

      • Kube-State-Metrics (enabled by default): Exposes metrics on the host cluster for DPF Custom Resources (ServiceChain, ServiceInterface, IPPool, CIDRPool, etc.) deployed on each DPU cluster.

      • Node-Problem-Detector (enabled by default): Runs DPU-specific health checks (OVS, SR-IOV, uplink, DPU mode, MTU) and reports conditions to the DPU object.

      • OpenTelemetry Collector: Collects and forwards logs from DPU cluster pods to a configurable endpoint via spec.monitoring.openTelemetryCollector.logging.endpoint.

    • Kube-State-Metrics and Node-Problem-Detector are enabled by default. OpenTelemetry Collector requires explicit endpoint configuration.

    • For details, see DPF-Operator-Managed Components.
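
As an illustration of the OpenTelemetry Collector setting, the sketch below builds a merge patch for the spec.monitoring.openTelemetryCollector.logging.endpoint field named above. The helper names and the endpoint value are hypothetical, the expected endpoint format is not specified in these notes, and the DPFOperatorConfig name and namespace are assumed to be dpfoperatorconfig in dpf-operator-system, as in the patch commands in the Upgrade Notes.

```shell
# Hypothetical helper: print a JSON merge patch that sets the log endpoint.
# The endpoint must not contain double quotes or backslashes.
otel_logging_patch() {
  printf '{"spec":{"monitoring":{"openTelemetryCollector":{"logging":{"endpoint":"%s"}}}}}\n' "$1"
}

# Apply the patch. Name and namespace are assumptions; adjust to your cluster.
set_otel_logging_endpoint() {
  kubectl patch dpfoperatorconfig dpfoperatorconfig -n dpf-operator-system \
    --type merge -p "$(otel_logging_patch "$1")"
}
```

Printing the patch with otel_logging_patch first makes it easy to review the change before set_otel_logging_endpoint applies it.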

  • New Grafana dashboards for DPU fleet monitoring

    • Description: Two new Grafana dashboards are deployed by default:

      • DPU Fleet Health: Fleet-wide overview of DPU operational status, condition breakdown, and health trends

      • DPU Detail: Per-DPU drill-down with operational conditions, agent status, and provisioning history

  • dpfctl SOS report collection

    • New dpfctl sosreport command for collecting system diagnostics from host and DPU cluster nodes

    • Subcommands: start, status, download, collect (one-step workflow), cleanup

    • Support for targeting specific environments (--target host/dpu/all), individual nodes (--nodes), and specific DPU clusters (--dpu-cluster)

    • NFS output mode for writing reports directly to a shared mount

    • --archive flag to create a single .tar.gz for ticket attachment

    • Watch mode (status -w) for real-time job monitoring

    • See dpfctl sosreport for usage details

Fixed Issues from Previous Release

  • DPUFlavor nvconfig now supports targeting specific devices instead of applying the same configuration to all devices

  • Updating DPUNode.spec.nodeRebootMethod during DPU provisioning is now rejected unless the DPU is in a valid phase

  • DPUVolumeAttachment for emulated NVMe devices on hot-plugged PFs no longer reports an invalid 00:00.0 PCI address

Supported DPU Services:

Dependencies

Hardware and Software Requirements

For Host Trusted mode: Refer to Host Trusted Prerequisites for detailed requirements.

For Zero Trust mode: Refer to Zero Trust Prerequisites for detailed requirements. Also note the additional requirements for deploying via Redfish.

  • DPU Hardware: NVIDIA BlueField-3 DPUs

  • Minimum DOCA BFB Image: DOCA v2.5 or higher (must be pre-installed on DPUs to support DPF provisioning)

  • Supported DOCA BFB Image: bf-bundle-X.y.z

Installation & Upgrade Notes

Installation Notes

None.

Upgrade Notes

DPF supports upgrades from the immediately preceding GA release. For more information, refer to Upgrade Procedures.

  • [IMPORTANT] DPUs provisioned with BFB LTS 3.2 via DPF v25.10 must be reprovisioned after upgrade

    • Description: Users who want to keep running DOCA BFB LTS 3.2 on their DPUs after upgrading to DPF v26.4 must reprovision all affected DPUs. This is required because DPUs provisioned under DPF v25.10 do not have the DPU Agent installed and do not report their kubelet version, which is now required for Kubernetes version skew validation during future upgrades.

    • Impact: If affected DPUs are not reprovisioned, upgrades to subsequent DPF releases may be blocked by version skew validation failures.

    • Required action: After upgrading to DPF v26.4 and verifying that the DPFOperatorConfig status is healthy, reprovision each affected DPU by deleting its DPU CR. The DPU will be automatically reprovisioned with the current DPF version:

      Bash
      # List DPUs in the DPF namespace
      kubectl get dpus -n $DPF_NAMESPACE
      
      # Delete each DPU CR to trigger reprovisioning
      kubectl delete dpu $DPU_NAME -n $DPF_NAMESPACE
      
      # Validate DPUs have the kubelet version reported in their status after reprovisioning
      kubectl get dpu $DPU_NAME -n $DPF_NAMESPACE -o jsonpath='{.status.agentStatus.kubeletVersion}'
      
  • Lease names changed for DPF controllers

    • Description: Kubernetes lease object names have been changed to better reflect the DPF component that uses each lease for controller leader election. The new lease name format is <component-name>.dpu.nvidia.com, e.g., dpf-operator.dpu.nvidia.com. As a result, the old lease objects are no longer used.

    • Impact: None. Stale lease objects remain in the cluster after the upgrade; they do not affect functionality but can be removed for cleanliness.

    • Note: If desired, remove the stale lease objects from the dpf-operator-system and ovn-kubernetes namespaces after the upgrade. The following commands remove them:

      • kubectl -n dpf-operator-system delete lease --ignore-not-found 19f9f38b.nvidia.com 204fbe18.dpu.nvidia.com 507jei28.dpu.nvidia.com 8a3114c5.dpu.nvidia.com e361afcf.nvidia.com snap-host-controller.nvidia.com snap-node-driver.nvidia.com

      • kubectl -n ovn-kubernetes delete lease --ignore-not-found 8a3114c5.dpu.nvidia.com

  • ArgoCD prerequisite upgrade from v2 to v3

    • Description: DPF is tested with ArgoCD v3.3.0 (chart version v9.4.1) as a prerequisite. ArgoCD is a user-managed prerequisite component that must be upgraded independently before upgrading DPF.

      ArgoCD v3 changes the default resource tracking method from labels to annotations. Applications require a sync operation after the ArgoCD upgrade to migrate tracking information and prevent resource management issues.

      For complete details on the tracking method change and migration requirements, refer to the ArgoCD v3.0 Upgrade Guide.

    • Impact: No impact if ArgoCD Applications were synced after upgrading ArgoCD and before upgrading DPF. Otherwise, resources related to components such as the servicechain controller may be leaked.

    • Note: For users manually upgrading ArgoCD, trigger a sync operation on all Applications after upgrading ArgoCD and before upgrading DPF:

      Bash
      # Trigger sync for all applications in dpf-operator-system namespace
      for app in $(kubectl get application -n dpf-operator-system -o name); do
        kubectl patch $app -n dpf-operator-system --type=merge -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"syncStrategy":{"hook":{}}}}}'
      done
      
      # Wait for the sync operation to succeed for all applications
      kubectl wait application --all -n dpf-operator-system --for=jsonpath='{.status.operationState.phase}'=Succeeded --timeout=300s
      
  • revisionHistoryLimit no longer accepts values less than 1

    • Description: revisionHistoryLimit values less than 1 are now rejected. Setting a value less than 1 would result in the object being treated as non-disruptive for in-place replacement, which could lead to unexpected behavior.

    • Impact: Users who previously set revisionHistoryLimit to 0 or a negative value will need to update their configuration to use a value of 1 or greater.

  • Zero Trust mode requires kubernetesAPIServerVIP and kubernetesAPIServerPort in DPFOperatorConfig

    • Description: The kubernetesAPIServerVIP and kubernetesAPIServerPort fields under .spec.overrides in the DPFOperatorConfig are now required when using Zero Trust mode (Redfish-based provisioning). In previous releases, these fields were not enforced for Zero Trust mode.

    • Impact: If these fields are not set, DPU provisioning in Zero Trust mode will get stuck and the DPU condition will report the message: KubernetesAPIServerVIP and KubernetesAPIServerPort must be set in DPFOperatorConfig for zero-trust mode. Existing Zero Trust deployments upgrading to this release must ensure these fields are configured before provisioning or re-provisioning DPUs.

    • Note: Update your DPFOperatorConfig to include the API server VIP and port:

      YAML
      spec:
        overrides:
          kubernetesAPIServerVIP: "<api-server-vip>"
          kubernetesAPIServerPort: <api-server-port>
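
Before provisioning or reprovisioning DPUs in Zero Trust mode, the two override fields can be checked with a small pre-flight script. This is a minimal sketch, not a DPF command: the function name is hypothetical, and the default DPFOperatorConfig name and namespace are assumptions taken from the other examples in these notes.

```shell
# Hypothetical pre-flight check, not a DPF command.
# Usage: check_zero_trust_overrides [namespace] [dpfoperatorconfig-name]
check_zero_trust_overrides() {
  local ns="${1:-dpf-operator-system}" name="${2:-dpfoperatorconfig}"
  local vip port
  # Read both override fields; a missing field yields an empty string.
  vip="$(kubectl get dpfoperatorconfig "$name" -n "$ns" \
    -o jsonpath='{.spec.overrides.kubernetesAPIServerVIP}')"
  port="$(kubectl get dpfoperatorconfig "$name" -n "$ns" \
    -o jsonpath='{.spec.overrides.kubernetesAPIServerPort}')"
  if [ -z "$vip" ] || [ -z "$port" ]; then
    echo "kubernetesAPIServerVIP and/or kubernetesAPIServerPort is not set" >&2
    return 1
  fi
  echo "API server override: ${vip}:${port}"
}
```

The function returns a non-zero status when either field is empty, so it can gate a provisioning script.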
      
  • Required field changes in DPUSet and DPUDeployment:

    • Description: The following API fields have changed from optional pointers to required value types:

      • DPUSet.spec.strategy

      • DPUSet.spec.dpuTemplate.spec.nodeEffect

      • DPUDeployment.spec.dpus.dpuSetStrategy

      • DPUDeployment.spec.dpus.nodeEffect

    • Impact: After upgrading from v25.10, the DPUDeployment controller will fail to reconcile DPUSets for existing DPUDeployment objects that are missing the new required fields. The DPUSetsReconciled condition on affected DPUDeployments will report an error. Existing DPUSets and DPUs continue running during this window; only reconciliation of new changes is blocked until the fields are set.

    • Required action: After upgrading to v26.4, patch each existing DPUDeployment to add the new required fields. For example:

      Bash
      # Host Trusted:
      kubectl patch dpudeployment <name> -n <namespace> --type merge -p '{"spec":{"dpus":{"dpuSetStrategy":{"type":"RollingUpdate"},"nodeEffect":{"drain":true}}}}'
      # Zero Trust:
      kubectl patch dpudeployment <name> -n <namespace> --type merge -p '{"spec":{"dpus":{"dpuSetStrategy":{"type":"OnDelete"},"nodeEffect":{"hold":true}}}}'
      

      Replace the example values with the dpuSetStrategy and nodeEffect that match your deployment. See the DPUDeployment and DPUSet API references for available options.
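
To find which DPUDeployment objects still need the patch, the fields can be listed across all namespaces. The sketch below is a hedged illustration: the function name is hypothetical, and it assumes that a missing field is rendered as an empty string by kubectl's JSONPath output.

```shell
# Hypothetical helper: list DPUDeployments that lack either required field.
# Prints one "namespace/name" per line; prints nothing when all are set.
list_dpudeployments_missing_fields() {
  kubectl get dpudeployment -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.dpus.dpuSetStrategy}{"\t"}{.spec.dpus.nodeEffect}{"\n"}{end}' |
    awk -F '\t' '$2 == "" || $3 == "" { print $1 }'
}
```

Each printed entry identifies a DPUDeployment to patch with the dpuSetStrategy and nodeEffect values appropriate for that deployment.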

  • Migration to NodeSRIOVDevicePluginController

    • Disclaimer: This migration note applies only to Host Trusted use-cases that relied on the SR-IOV Network Operator to configure VF resources on the host.

    • Description: Starting in v26.4, DPF manages host VF resources exposed from DPUs through the NodeSRIOVDevicePluginController. This removes the dependency on the SR-IOV Network Operator for managing the SR-IOV device plugin used for host VF resources. The DPF-managed device plugin is configured through NodeSRIOVDevicePluginConfig objects and linked to the relevant DPUs through the noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig annotation.

    • Impact: After upgrading from v25.10, existing SriovNetworkNodePolicy objects are not migrated automatically. Host VF resources will not be managed by DPF until the new controller is enabled, the NodeSRIOVDevicePluginConfig objects are created, and the relevant DPU objects are annotated. Updating DPUDeployment.spec.dpus.dpuSets[*].dpuAnnotations after the upgrade does not automatically propagate the annotation to already existing DPU objects, so an explicit annotation step is required to apply the configuration without reprovisioning.

      During migration there can be a temporary gap where existing workloads continue to run, but new workloads that request VF resources may fail to schedule. This gap starts after the SR-IOV Network Operator removes or reconfigures the device plugin it manages and ends after the affected DPU objects are annotated and the DPF-managed device plugin is running on the nodes.

    • Required action:

      • Enable nodeSRIOVDevicePluginController in DPFOperatorConfig.

        Bash
        kubectl patch dpfoperatorconfig dpfoperatorconfig -n dpf-operator-system --type merge -p '{"spec":{"nodeSRIOVDevicePluginController":{"disable":false}}}'
        
      • Create the required NodeSRIOVDevicePluginConfig objects for the VF resources that should be exposed on each host, and make sure the configuration matches the PFs and VF ranges you intend to manage.

      • Remove the SriovNetworkNodePolicy objects that manage the same host VF resources, and wait until the SR-IOV Network Operator processes the change and restarts or removes the device plugin it manages on the affected nodes.

      • Update each DPUDeployment so the relevant dpuSets[*].dpuAnnotations point to the correct NodeSRIOVDevicePluginConfig.

        YAML
        spec:
          dpus:
            dpuSets:
              - nameSuffix: dpuset1
                dpuAnnotations:
                  noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig: <config-name>
        
      • Explicitly annotate all existing DPU objects that should use the new configuration. This step is required to apply the new configuration without reprovisioning the DPUs. If all affected DPUs in the namespace should use the same config, you can annotate them with:

        Bash
        kubectl -n dpf-operator-system annotate dpus.provisioning.dpu.nvidia.com --all 'noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig=<config-name>' --overwrite
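
After the annotation step, it can be useful to confirm that no existing DPU object was missed. The following sketch lists DPUs without the annotation; the function name is hypothetical and the default namespace is an assumption based on the command above.

```shell
# Hypothetical helper: list DPUs in a namespace that do not carry the
# noderesources.dpu.nvidia.com/nodesriovdevicepluginconfig annotation.
list_dpus_missing_sriov_annotation() {
  local ns="${1:-dpf-operator-system}"
  # Dots in the annotation key are escaped for kubectl's JSONPath parser.
  kubectl get dpus.provisioning.dpu.nvidia.com -n "$ns" -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.noderesources\.dpu\.nvidia\.com/nodesriovdevicepluginconfig}{"\n"}{end}' |
    awk -F '\t' '$2 == "" { print $1 }'
}
```

An empty result means every DPU in the namespace carries the annotation.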
        
Deprecated API Fields

The following API fields have been deprecated and will be removed in future releases. Users should migrate to the recommended alternatives.

  • DPFOperatorConfig Fields:

    • Description: Multiple fields in DPFOperatorConfig remain deprecated in v26.4.

      • Top-level image fields have been replaced with container-specific image override fields for more granular control.

      • Legacy BFB registry fields under installViaRedfish have been replaced by the top-level registry field.

      • The legacy gNOI installation interface has been replaced by the host agent installation interface.

    • Deprecated Fields:

      • Image fields:

      • spec.provisioningController.image -> Use spec.provisioningController.controller.image instead

      • spec.dpuServiceController.image -> Use spec.dpuServiceController.controller.image instead

      • spec.dpuDetector.image -> Use spec.dpuDetector.daemon.image instead

      • spec.kamajiClusterManager.image -> Use spec.kamajiClusterManager.controller.image instead

      • spec.staticClusterManager.image -> Use spec.staticClusterManager.controller.image instead

      • spec.serviceSetController.image -> Use spec.serviceSetController.controller.image instead

      • spec.flannel.image -> Use spec.flannel.cni and spec.flannel.daemon instead

      • spec.multus.image -> Use spec.multus.cni.image instead

      • spec.nvipam.image -> Use spec.nvipam.controller.image instead

      • spec.sriovDevicePlugin.image -> Use spec.sriovDevicePlugin.deviceplugin.image instead

      • spec.ovsCNI.image -> Use spec.ovsCNI.cni.image instead

      • spec.sfcController.image -> Use spec.sfcController.controller.image instead

      • Provisioning controller fields:

      • spec.provisioningController.bfCFGTemplateConfigMap -> Use spec.provisioningController.enableDynamicBFCFGTemplates instead

      • spec.provisioningController.installInterface.installViaGNOI -> Use spec.provisioningController.installInterface.installViaHostAgent instead

      • spec.provisioningController.installInterface.installViaRedfish.bfbRegistryAddress -> Use spec.provisioningController.registry instead

      • spec.provisioningController.installInterface.installViaRedfish.bfbRegistry -> Use spec.provisioningController.registry instead

      • spec.provisioningController.registry.address -> Leave unset to use the controller's default logic for determining the bfb-registry address. Use spec.provisioningController.registry.loadBalancerAddress when the registry must be reachable through an external endpoint

      • spec.provisioningController.registry.port -> Leave unset to use the controller's default logic for determining the bfb-registry port. Use spec.provisioningController.registry.loadBalancerAddress when the registry must be reachable through an external endpoint

  • DPUDevice Fields:

    • Description: The following fields in DPUDevice.spec have been deprecated in favor of their corresponding status fields, which are automatically populated by the system.

    • Deprecated Fields:

      • spec.psid -> Use status.psid instead

      • spec.opn -> Use status.opn instead

      • spec.pf0Name -> Use status.pf0Name instead

  • DPUNode Fields:

    • Description: The gNOI reboot method and nodeDMSAddress field have been deprecated.

    • Deprecated Fields:

      • spec.nodeRebootMethod.gNOI -> Use spec.nodeRebootMethod.hostAgent instead

      • spec.nodeDMSAddress -> This field is no longer used

  • DPUSet and DPUDeployment Fields:

    • Description: Legacy DPU selection fields have been deprecated in favor of explicit DPU node and DPU device selectors.

    • Deprecated Fields:

      • DPUSet.spec.strategy.rollingUpdate.maxUnavailable -> Use DPFOperatorConfig.spec.provisioningController.maxUnavailableDPUNodes instead

      • DPUSet.spec.dpuSelector -> Use DPUSet.spec.dpuDeviceSelector instead

      • DPUDeployment.spec.dpus.dpuSets[*].nodeSelector -> Use DPUDeployment.spec.dpus.dpuSets[*].dpuNodeSelector instead

      • DPUDeployment.spec.dpus.dpuSets[*].dpuSelector -> Use DPUDeployment.spec.dpus.dpuSets[*].dpuDeviceSelector instead

  • DPU Fields:

    • Description: The bmcIP field in DPU spec should be obtained from the associated DPUDevice instead.

    • Deprecated Fields:

      • spec.bmcIP -> Use DPUDevice.spec.bmcIp or DPUDevice.status.bmcIp instead

  • DPUService DPUCluster Selector Fields:

    • Description: Legacy clusterSelector fields have been deprecated in favor of dpuClusterSelector.

    • Deprecated Fields:

      • DPUServiceChain.spec.clusterSelector -> Use DPUServiceChain.spec.dpuClusterSelector instead

      • DPUServiceIPAM.spec.clusterSelector -> Use DPUServiceIPAM.spec.dpuClusterSelector instead

      • DPUServiceInterface.spec.clusterSelector -> Use DPUServiceInterface.spec.dpuClusterSelector instead

  • DPUServiceIPAM Fields:

    • Description: The legacy single-IP exclusion list has been deprecated in favor of explicit exclusion ranges.

    • Deprecated Fields:

      • spec.ipv4Subnet.exclusions -> Use spec.ipv4Subnet.excludeRanges instead

  • ServiceInterface Fields:

    • Description: The legacy OVN interface definition has been deprecated in favor of patch interface configuration.

    • Deprecated Fields:

      • spec.ovn -> Use interfaceType="patch" with spec.patch.peerBridge and spec.patch.peerPatchName instead

Known Issues and Limitations

  • Socket Direct environments are not supported

    • DPF does not currently support environments where NVIDIA Mellanox Socket Direct adapters are used. Socket Direct is a network adapter architecture that provides direct PCIe access from multiple CPU sockets to a single NIC, bypassing the inter-processor bus. DPUs in Socket Direct configurations are not tested or validated with DPF.

    • For more details, see Platform Support - Limitations.

  • Stale ports after DPU reboot

    • When a DPU is rebooted, the old DPU service ports are not deleted from the DPU's OVS and remain as stale ports.

    • Internal Ref #4174183

    • Workaround: None. This is a known issue and is not expected to affect performance.

  • DPU Cluster control-plane connectivity is lost when physical port P0 is down on the worker node

    • If the P0 port link on the DPU goes down, DPU components lose connectivity to the DPU control plane.

    • Internal Ref #3751863

    • Workaround: Make sure the P0 link is up on the DPU. If it is down, either restart the DPU or refer to DOCA troubleshooting: https://docs.nvidia.com/networking/display/bfswtroubleshooting. Note: This issue is relevant for Host Trusted deployments only.

  • System doesn't recover after DPU Reset

    • When the user triggers a reset of a DPU in any way other than using DPF APIs (e.g. recreation of a DPU CR), the system may not recover.

    • Internal Ref #4521178, #4188044, #4732664

    • Workaround: Power cycle the host. This operation is risky: it can corrupt the file system, in which case the DPU must be reprovisioned. If the system does not recover, reprovision the DPU by deleting the DPU CR (it is recreated automatically and provisioning starts again). Note: This issue is relevant for Host Trusted deployments where OVN-Kubernetes is used as the primary CNI.

  • Leftover CRs if worker is removed from the cluster permanently

    • When a worker is added to a cluster, optionally has a DPU provisioned, and is later removed from the host cluster permanently, DPF-related CRs may be left over in both the host cluster and the DPU cluster.

    • Internal Ref #4403130, #4571788

    • Workaround: No workaround, known issue.

  • A DPUDeployment or DPUSet created in a namespace other than dpf-operator-system does not trigger DPU provisioning

    • When a DPUDeployment or a DPUSet is created in a namespace other than dpf-operator-system, no DPU CRs are created because the DPUNode CRs reside in the dpf-operator-system namespace.

    • Internal Ref #4427091

    • Workaround: Create the DPUDeployment or DPUSet in the dpf-operator-system namespace.

  • DPUService stops reconciling when the DPUCluster is unavailable for a long time

    • When the DPUCluster is unavailable for a long time (more than 5 minutes), changes made to DPUServices during that time (including DPUServices generated via DPUDeployment or DPFOperatorConfig) might not be propagated to the DPUCluster.

    • Internal Ref #4359857

    • Workaround: Recreate DPUServices that are stuck.

  • [OVN-Kubernetes DPUService] Lost traffic from workloads to control plane components or Kubernetes services after DPU reboot, port flapping, OVS restart, or manual network configuration

    • Connectivity issues between workload pods and control plane components or Kubernetes services may occur after any of the following events: DPU reboot without host reboot, high-speed port flapping (link down/up), OVS restart, or a DPU network configuration change (for example, running "netplan apply" on the DPU). The issues occur because the network configuration applied by the OVN CNI on the DPUs is not reapplied automatically.

    • Internal Ref #4188044, #4521178

    • Workaround: Recreate the OVN-Kubernetes node pod on the host to reapply the configuration.

  • [OVN-Kubernetes DPUService] No network connectivity for SR-IOV accelerated workload pods after DPU reboot

    • An SR-IOV accelerated workload pod loses its VF interface when the DPU reboots. The VF remains available on the host but is not injected back into the pod.

    • Internal Ref #4236521

    • Workaround: Recreate the SR-IOV accelerated workload pods.

  • [OVN-Kubernetes DPUService] Node annotations not updated when PF IP allocation changes on DPU-enabled hosts

    • On DPU-enabled workers, the uplink IP allocated to the physical function (PF) can change. In some cases ovn-kubernetes does not refresh node annotations to reflect the new address, which can lead to inconsistent gateway and host addressing metadata on the node.

    • Internal Ref #5016628

    • How to check: Compare k8s.ovn.org/primary-dpu-host-addr with the addresses recorded in k8s.ovn.org/node-primary-ifaddr and in k8s.ovn.org/l3-gateway-config (the ip-address / ip-addresses fields for the default gateway entry). If k8s.ovn.org/primary-dpu-host-addr disagrees with those values while node-primary-ifaddr and l3-gateway-config match each other, the primary-dpu-host-addr annotation is likely stale relative to the PF.

    • Workaround: Restart the ovn-kubernetes pod on the DPU for the affected host or worker (in the ovn-kubernetes namespace) so annotations and related state are reconciled.
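
The comparison in the "How to check" step can be made easier by printing the three annotations together. The sketch below only prints the raw annotation values for manual review and does not parse them, since their exact formats are not described in these notes. The function name is hypothetical, and the kubectl context is assumed to point at the cluster that holds the affected node object.

```shell
# Hypothetical helper: print the three annotations for side-by-side review.
# Usage: show_ovn_node_addr_annotations <node-name>
show_ovn_node_addr_annotations() {
  local node="$1" key
  for key in primary-dpu-host-addr node-primary-ifaddr l3-gateway-config; do
    printf '%s: ' "k8s.ovn.org/${key}"
    # Dots in the annotation key are escaped for kubectl's JSONPath parser.
    kubectl get node "$node" -o "jsonpath={.metadata.annotations.k8s\.ovn\.org/${key}}"
    printf '\n'
  done
}
```

If the first value disagrees with the addresses recorded in the other two, the primary-dpu-host-addr annotation is likely stale, as described above.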

  • [Firefly DPUService] Deletion of the Firefly DPUService leaves stale flows in OVS

    • When the Firefly DPUService is deleted after successful deployment or the labels of the serviceDaemonSet are modified, flows are not cleaned up from OVS.

    • Internal Ref #4382535

    • Workaround: Although these flows are very unlikely to cause an issue, reprovisioning the DPU or power cycling the host restores OVS to a clean state. Note that power cycling the host is risky: it can corrupt the file system, in which case the DPU must be reprovisioned.

  • [DPUDeployment] Pods of service referenced in spec.serviceChain are restarted when serviceChain changes, causing disruption

    • When updating DPUDeployment.spec.serviceChain with a disruptive upgrade, the affected pods are restarted but DPUDeployment removes the node effect from the corev1.Node in the host cluster proactively. This causes the node to be marked as ready and accept workloads before DPU services are fully operational.

    • Internal Ref: #4686635

    • Workaround: No workaround, known issue.

  • Long DPU provisioning time when multiple DPUs are provisioned on the same node

    • In some cases, provisioning multiple DPUs on the same node may take a long time.

    • Internal Ref: #4757191

    • Workaround: If the DPU is stuck in the DPU Cluster Config phase after a long installation, reprovision the DPU and make sure no other DPUs on the same node are being provisioned at the same time.

  • [DPUDeployment] DPU does not recover to Ready after Node Effect Removal timeout, even when service configuration is reverted

    • When a DPUDeployment is deployed and becomes ready, and then a service is updated with an invalid configuration (e.g. overriding the image with a wrong one in DPUServiceConfiguration), the DPU enters the Node Effect Removal state. If the Node Effect Removal timeout is exceeded, the DPU transitions to Error state. Even after reverting the DPUServiceConfiguration to the previous valid configuration, the DPU does not move back to Ready. The nodeEffectRemovalTimeout in DPFOperatorConfig.spec.provisioningController defaults to 0s (disabled), meaning this issue will not occur with the default configuration. If the timeout is explicitly set to a non-zero value (e.g. 30m), ensure the configured duration allows sufficient time for configuration corrections before the DPU transitions to Error state.

    • Internal Ref: #4952676

    • Workaround: Revert to a working configuration and delete the DPU CR to trigger reprovisioning.

  • Leader switch of the provisioning controller pod while the DPU is in the middle of deploying may cause OS installation failure

    • When deploying a DPU without bfbPVCName provided in the DPFOperatorConfig, a leader switch in the DPU provisioning controller (for any reason) during the provisioning process may cause the DPU to transition to the Error phase due to download failures of the BFB or bfb.cfg during OS installation. This issue produces a BFBTransferred condition (FailToInstall reason) in Zero Trust environments and an OSInstalled condition (InstallationTerminated reason) in Host Trusted environments.

    • Internal Ref: #4957820, #4930198

    • Workaround: Delete the DPU CR to trigger reprovisioning.

  • DOCA SNAP may fail to reattach a DPUVolumeAttachment that uses VirtioFS or emulated NVMe devices on hot-plugged PFs after a host power cycle

    • After a host power cycle, a DPUVolumeAttachment that uses VirtioFS or emulated NVMe devices on hot-plugged PFs can become not ready and might not be restored automatically.

    • Internal Ref: #4734537

    • Workaround:

      • Step 1: Delete the failed DPUVolumeAttachment.

      • Step 2: Create a new DPUVolumeAttachment with the same configuration.

  • [Zero Trust] Manual change of DPU mode from DPU to NIC after DPU discovery causes provisioning to stall in Rebooting

    • In a Zero Trust deployment, if a user manually changes the DPU mode from DPU to NIC after DPU discovery, DPU provisioning can remain stuck in the Rebooting phase.

    • Internal Ref: #4984034

    • Workaround: Manually set the DPU mode back to DPU, or delete the DPUDevice CR to trigger DPUDevice recreation.

  • Upgrade from v25.10 may leave stale dpudevice-protection finalizers on non-selected DPUDevices

    • After upgrading from v25.10 to v26.4, some non-selected DPUDevice objects can retain the legacy finalizer provisioning.dpu.nvidia.com/dpudevice-protection. This can block deletion of those DPUDevice objects during cleanup flows.

    • Internal Ref: #5048585

    • Workaround: After the v26.4 provisioning controller is running, review DPUDevice objects that are not referenced by any active DPU. If such a DPUDevice still has provisioning.dpu.nvidia.com/dpudevice-protection, remove only that finalizer entry and keep the DPUDevice object.
