DOCA Platform Framework

DPUSet

The DPUSet is a Kubernetes CRD that manages the DPU CRs in DPF.

Updating the DPUSet

A DPUSet can be updated to upgrade the BFB or to modify provisioning parameters.

This operation causes a network disruption and a host reboot.
A rolling update can be configured to limit the number of nodes that are out of service at the same time (see the DPUSet YAML example below).
The cluster can also be divided into several DPUSets; see the section "Using several DPU Sets".

These are the required steps for upgrading the BFB on a set of DPUs (the BFB is specified as part of the DPUSet):

  1. Create a BFB YAML that references the required BFB file and gives the object a distinct name (different from the BFB objects currently in use). After the YAML is applied, the BFB is pulled from the specified URL to the shared storage:

YAML
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: BFB
metadata:
  name: bf-bundle-new
  namespace: dpf-operator-system
spec:
  url: https://content.mellanox.com/BlueField/BFBs/Ubuntu22.04/bf-bundle-3.0.0-135_25.04_ubuntu-22.04_prod.bfb

  2. Update the DPUSet YAML to point to the new BFB object:

YAML
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUSet
metadata:
  name: dpuset
  namespace: dpf-operator-system
spec:
  dpuNodeSelector:
    matchLabels:
      feature.node.kubernetes.io/dpu-enabled: "true"
  strategy:
    rollingUpdate:
      maxUnavailable: "10%"
    type: RollingUpdate
  dpuTemplate:
    spec:
      dpuFlavor: dpf-provisioning-hbn-ovn
      bfb:
        name: bf-bundle-new
      nodeEffect:
        taint:
          key: "dpu"
          value: "provisioning"
          effect: NoSchedule

  3. Delete the DPU objects of the relevant DPUs. This initiates a provisioning cycle for those DPUs using the new BFB image:

kubectl delete dpu -n dpf-operator-system worker1-0000-2b-00 worker2-0000-2b-00

  4. The previous BFB object can be deleted later, once it is no longer in use:

kubectl delete bfb -n dpf-operator-system bf-bundle
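
Steps 1 to 3 above can be sketched as a single shell helper. This is a minimal sketch, not an official DPF tool: the function name and the manifest file names are hypothetical, and the files are assumed to contain the BFB and DPUSet YAML shown above. The sketch does not wait for the BFB download to finish before deleting the DPU objects, so checking the BFB object first is a reasonable precaution.

```shell
#!/usr/bin/env bash
# Hypothetical helper covering steps 1-3 of the BFB upgrade.
NAMESPACE="dpf-operator-system"

reprovision_dpus() {
  # Require two manifest files and at least one DPU object name.
  [ "$#" -ge 3 ] || { echo "usage: reprovision_dpus BFB_YAML DPUSET_YAML DPU..." >&2; return 2; }
  local bfb_manifest="$1" dpuset_manifest="$2"
  shift 2   # the remaining arguments are DPU object names

  kubectl apply -f "$bfb_manifest" || return 1          # step 1: create the new BFB object
  kubectl apply -f "$dpuset_manifest" || return 1       # step 2: point the DPUSet at it
  kubectl delete dpu -n "$NAMESPACE" "$@" || return 1   # step 3: trigger re-provisioning
  kubectl get dpu -n "$NAMESPACE"                       # list DPU objects to follow progress
}

# Example (file names are assumptions):
#   reprovision_dpus bfb-new.yaml dpuset.yaml worker1-0000-2b-00 worker2-0000-2b-00
```

Step 4, deleting the previous BFB object, is left as a manual action so that the old image stays available until the new one is confirmed to work.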

Using several DPU Sets

Several DPUSet objects can be created and assigned to different groups of worker nodes by using different labels in the node selector of each DPUSet YAML. Each DPUSet can use a different BFB object, DPU flavor, rolling update strategy, and so on.

For example:

YAML
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUSet
metadata:
  name: dpuset-dk
  namespace: dpf-operator-system
spec:
  dpuNodeSelector:
    matchLabels:
      e2e.servers/dk: "true"
  strategy:
    rollingUpdate:
      maxUnavailable: "10%"
    type: RollingUpdate
  dpuTemplate:
    spec:
      dpuFlavor: dpf-provisioning-hbn-ovn
      bfb:
        name: bf-bundle-dk-ga
      nodeEffect:
        taint:
          key: "dpu"
          value: "provisioning"
          effect: NoSchedule
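
For a selector such as e2e.servers/dk: "true" to match, the corresponding objects need to carry that label. The helper below is a hedged sketch of attaching it with kubectl label. The function name is hypothetical. This section does not state whether the selector is evaluated against Kubernetes Node objects or DPUNode objects, so the object kind is left configurable through the KIND variable, and the default of node is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: attach a DPUSet selector label to a group of objects.
# KIND defaults to "node" (Kubernetes worker nodes); that default is an assumption.
label_for_dpuset() {
  [ "$#" -ge 2 ] || { echo "usage: label_for_dpuset KEY=VALUE NAME..." >&2; return 2; }
  local label="$1"
  shift   # the remaining arguments are object names
  local name
  for name in "$@"; do
    kubectl label "${KIND:-node}" "$name" "$label" --overwrite || return 1
  done
}

# Example (node names are assumptions):
#   label_for_dpuset e2e.servers/dk=true worker1 worker2
```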

Host Power-cycle in DPU provisioning

If the BFB version running before DPU provisioning is lower than 2.7, the BlueField firmware upgrade and mlxconfig parameter changes require a host power-cycle. Once the BFB has been updated to version 2.7 or later, a regular reboot is sufficient.

To enable this, the DPUSet provides one annotation in dpuTemplate: provisioning.dpu.nvidia.com/host-power-cycle-required. It triggers a host power-cycle (cold boot) instead of a warm reboot after DPU provisioning. Note that once the power-cycle command has completed, the annotation is removed from the DPU and DPUSet objects.

The following example enables host power-cycle:

YAML
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUSet
metadata:
  name: dpuset
  namespace: dpf-operator-system
spec:
  dpuNodeSelector:
    matchLabels:
      feature.node.kubernetes.io/dpu-enabled: "true"
  strategy:
    rollingUpdate:
      maxUnavailable: "10%"
    type: RollingUpdate
  dpuTemplate:
    annotations:
      provisioning.dpu.nvidia.com/host-power-cycle-required: "true"
    spec:
      dpuFlavor: dpf-provisioning-hbn-ovn
      bfb:
        name: bf-bundle-new
      nodeEffect:
        taint:
          key: "dpu"
          value: "provisioning"
          effect: NoSchedule
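
Because the annotation is removed once the power-cycle command has completed, inspecting the annotations is one way to see whether a power-cycle is still pending. The helpers below are a sketch with hypothetical function names. They assume the template annotation sits under spec.dpuTemplate.annotations, as in the YAML above, and that on a DPU object it appears under metadata.annotations.

```shell
#!/usr/bin/env bash
NAMESPACE="dpf-operator-system"

# Print the annotations currently set in a DPUSet's dpuTemplate.
dpuset_template_annotations() {
  kubectl get dpuset "$1" -n "$NAMESPACE" -o jsonpath='{.spec.dpuTemplate.annotations}'
}

# Print the annotations currently set on a DPU object
# (assumed to live under metadata.annotations).
dpu_annotations() {
  kubectl get dpu "$1" -n "$NAMESPACE" -o jsonpath='{.metadata.annotations}'
}

# Example:
#   dpuset_template_annotations dpuset
#   dpu_annotations worker1-0000-2b-00
```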

IPMI Command Annotation for Kubernetes Worker Node

The provisioning controller issues an IPMI command to the host to perform a host power-cycle (cold boot) or a warm reboot after DPU provisioning. The default host power-cycle command is ipmitool chassis power cycle, and the default warm reboot command is ipmitool chassis power reset.

Some servers use the ipmitool chassis power reset command for a cold power-cycle instead of ipmitool chassis power cycle. DPF supports changing the host power-cycle and warm reboot commands by setting the following annotations on such worker nodes:

YAML
provisioning.dpu.nvidia.com/powercycle-command: reset
provisioning.dpu.nvidia.com/reboot-command: cycle
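
One way to set these annotations is with kubectl annotate on the affected worker nodes. The helper below is a minimal sketch with a hypothetical function name; the annotation keys and values are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: swap the IPMI commands on worker nodes that use
# "chassis power reset" for the cold power-cycle.
swap_ipmi_commands() {
  [ "$#" -ge 1 ] || { echo "usage: swap_ipmi_commands NODE..." >&2; return 2; }
  local node
  for node in "$@"; do
    kubectl annotate node "$node" --overwrite \
      provisioning.dpu.nvidia.com/powercycle-command=reset \
      provisioning.dpu.nvidia.com/reboot-command=cycle || return 1
  done
}

# Example (node names are assumptions):
#   swap_ipmi_commands worker1 worker2
```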

Node effect

Node effect specifies how changes to the DPU should affect the Kubernetes Node the DPU belongs to. Only the following options can be specified:

  • noEffect (bool) - no effect on the node at all

  • customLabel (object) - adds the provided label to the Kubernetes Node and the DPUNode. Relevant ONLY in a Kubernetes environment, when a DPUNode matches a Kubernetes Node

  • taint (object) - marks the node as tainted. Relevant ONLY in a Kubernetes environment, when a DPUNode matches a Kubernetes Node

  • drain (bool) - drains the node and waits until draining is finished. Relevant ONLY in a Kubernetes environment, when a DPUNode matches a Kubernetes Node. This is the default behaviour in a Kubernetes environment

  • customAction (string) - name of a ConfigMap that contains a Pod definition, in YAML, which is run to apply the node effect. The Pod is expected to exit when the node effect is done; if the Pod terminates with an error, the DPU moves to the Error phase. First, create a ConfigMap with the Pod definition:

YAML
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-node-effect
  namespace: dpf-operator-system
data:
  pod.yaml: |
    apiVersion: v1
    kind: Pod
    metadata:
      name: custom-node-effect-pod
      namespace: dpf-operator-system
    spec:
      containers:
      - name: node-effect
        image: ubuntu:20.04
        command: ["/bin/bash"]
        args:
        - -c
        - |
          # Example custom node effect script
          echo "Applying custom node effect..."
          # Add your custom logic here
          # For example: network configuration, system checks, reboot.
          sleep 10  # Simulating some work
          exit 0    # Exit successfully when done
      restartPolicy: Never

Then, create the DPUSet that uses this custom action:

YAML
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUSet
metadata:
  name: dpuset-custom-effect
  namespace: dpf-operator-system
spec:
  dpuNodeSelector:
    matchLabels:
      feature.node.kubernetes.io/dpu-enabled: "true"
  strategy:
    rollingUpdate:
      maxUnavailable: "10%"
    type: RollingUpdate
  dpuTemplate:
    spec:
      dpuFlavor: dpf-provisioning-hbn-ovn
      bfb:
        name: bf-bundle-new
      nodeEffect:
        customAction: custom-node-effect

  • hold (bool) - places an annotation with the key wait-for-external-nodeeffect on the DPU object and waits for it to be removed. This is the default behaviour in a non-Kubernetes environment
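
With the hold option, provisioning continues once the annotation is removed from the DPU object. The helper below is a hedged sketch of removing it with kubectl annotate. The function name and the HOLD_KEY variable are hypothetical, and the key is copied as written above; if the annotation on your DPU objects carries a prefix, set HOLD_KEY to the full key.

```shell
#!/usr/bin/env bash
NAMESPACE="dpf-operator-system"
# Annotation key as written in the text; override HOLD_KEY if the key on the
# DPU object is prefixed (an assumption to verify against the object itself).
HOLD_KEY="${HOLD_KEY:-wait-for-external-nodeeffect}"

release_hold() {
  [ "$#" -ge 1 ] || { echo "usage: release_hold DPU..." >&2; return 2; }
  local dpu
  for dpu in "$@"; do
    # A trailing "-" after the key tells kubectl annotate to remove the annotation.
    kubectl annotate dpu "$dpu" -n "$NAMESPACE" "${HOLD_KEY}-" || return 1
  done
}

# Example:
#   release_hold worker1-0000-2b-00
```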
