This section includes advanced configuration and additional information for the Zero Trust use case.
DPU Discovery and DPUNode and DPUDevice Object Creation
DPF provides two approaches for discovering and creating DPU resources:
- Automated Discovery: Using DPUDiscovery to automatically scan for DPUs and create DPUDevice and DPUNode resources.
- Manual Creation: Manually creating DPUDevice and DPUNode resources for each DPU.
You can choose either approach based on your deployment requirements. Automated discovery is recommended for larger deployments, while manual creation provides more control for smaller or specific configurations.
Automated DPU Discovery
DPUDiscovery enables automatic discovery of DPU devices and nodes by scanning specified IP ranges. This approach automatically creates DPUDevice and DPUNode resources for any discovered DPUs.
1. First, create a YAML file for the DPUDiscovery resource. Let's call it dpudiscovery.yaml:
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUDiscovery
metadata:
  name: dpu-discovery-192.168.1-10
  namespace: dpf-operator-system
spec:
  # Define the IP range to scan
  ipRangeSpec:
    ipRange:
      startIP: "10.0.110.120" # Replace with your start IP
      endIP: "10.0.110.125" # Replace with your end IP
  # Optional: Set scan interval
  scanInterval: "3m"
  # Optional: Set number of workers (default is 1 per 255 IPs)
  workers: 1
2. Apply the resource using kubectl:
kubectl apply -f dpudiscovery.yaml
3. Check the status of the discovery scan:
kubectl get dpudiscovery dpu-discovery-192.168.1-10 -n dpf-operator-system -o yaml
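When choosing the ipRange and workers values, it can help to check how many addresses the range covers. The following is a minimal shell sketch and not part of DPF: the helper names ip_to_int, range_size, and default_workers are hypothetical, and it assumes that "1 per 255 IPs" means the range size divided by 255, rounded up.

```shell
# Hypothetical helpers (not part of DPF) for sizing a DPUDiscovery IP range.

# Convert a dotted-quad IPv4 address to an integer.
# The function body runs in a subshell so the IFS change stays local.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

# Number of addresses between startIP ($1) and endIP ($2), inclusive.
range_size() {
  echo $(( $(ip_to_int "$2") - $(ip_to_int "$1") + 1 ))
}

# Assumption: "1 per 255 IPs" is read as ceil(range size / 255).
default_workers() {
  echo $(( ($(range_size "$1" "$2") + 254) / 255 ))
}

range_size "10.0.110.120" "10.0.110.125"      # prints 6
default_workers "10.0.110.120" "10.0.110.125" # prints 1
```

For the example range, a single worker is consistent with the workers: 1 value in the manifest.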
The DPU discovery will:
- Start scanning the specified IP range
- Create DPUDevice and DPUNode* resources for any discovered DPUs
- Continue scanning at the specified interval
- Update its status with the last scan time and found DPUs
You can monitor the discovered DPUs with:
# List discovered DPU devices
kubectl get dpudevices
# List discovered DPU nodes
kubectl get dpunodes
* DPUDiscovery skips the creation of a DPUNode if an existing DPUNode already has a spec.dpus field containing the DPUDevice's serial number.
Limitations
- When autodiscovery creates DPUNodes, they are named dpunode-<DPU_SERIAL_NUMBER>. If the HBN DPUService is used in conjunction with this DPU provisioning mode, the HBN configuration must be adjusted to match the names of the discovered nodes.
Manual DPU Resource Creation
If you prefer to manually create DPU resources or need more control over the creation process, you can create DPUDevice and DPUNode resources manually.
Creating DPUDevice manually
Create a DPUDevice resource for each DPU:
The DPUDevice is immutable, and creating a DPUDevice will not trigger DPU provisioning.
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUDevice
metadata:
  name: dpu-device-1
  namespace: dpf-operator-system
spec:
  bmcIp: 10.0.110.122
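Because one DPUDevice is needed per DPU, writing the manifests by hand can become repetitive. The sketch below is one possible way to generate them from a name and a BMC IP. The function dpudevice_manifest is a hypothetical helper, not part of DPF, and it emits only the fields shown in the example above; the second device name and address are placeholders.

```shell
# Hypothetical helper (not part of DPF): print a DPUDevice manifest for one DPU.
# $1 is the DPUDevice name, $2 is the BMC IP address of the DPU.
dpudevice_manifest() {
  name=$1
  bmc_ip=$2
  cat <<EOF
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUDevice
metadata:
  name: ${name}
  namespace: dpf-operator-system
spec:
  bmcIp: ${bmc_ip}
EOF
}

# Emit manifests for two DPUs (placeholder names and addresses).
dpudevice_manifest dpu-device-1 10.0.110.122
dpudevice_manifest dpu-device-2 10.0.110.123
```

The output is a multi-document YAML stream, so it can be redirected to a file for review or piped to kubectl apply -f -.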
Creating a DPUNode manually
Create a DPUNode resource for each host that has a DPU:
The .spec.dpus field contains the names of each DPUDevice attached to the node.
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUNode
metadata:
  labels:
    feature.node.kubernetes.io/dpu-enabled: "true"
  name: worker1
  namespace: dpf-operator-system
spec:
  dpus:
    - name: dpu-device-1
  nodeRebootMethod:
    external: {}
Secure Boot
DPF supports configuring UEFI Secure Boot on DPUs during Zero Trust provisioning. When secureBoot is set in the DPUDeployment (or DPUSet), the controller detects the current hardware state via the BMC and configures it automatically, performing the required ARM force restarts.
For configuration details, mode-specific behavior, and the impact of changing this setting on existing DPUs, see Secure Boot.
For more information on BlueField Secure Boot, see Secure Boot in the NVIDIA documentation.
External Host Reboot
In the Zero Trust scenario, DPF cannot manage the DPU's host machine. During DPU provisioning, when the DPU CR reaches the rebooting phase, the user must power-cycle the host manually. The power cycle must be completed within two hours; otherwise, the secret that the DPU uses to join the cluster expires, and the DPU CR remains pending in the DPU Cluster Config phase. After the worker node boots up, the provisioning.dpu.nvidia.com/dpunode-external-reboot-required annotation on the DPUNode must be removed manually.
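The annotation can be removed with kubectl annotate, which deletes a key when a trailing "-" is appended to it. The following is a minimal sketch rather than an official procedure: the function name remove_reboot_annotation and the KUBECTL override are hypothetical, and it assumes the DPUNode is named worker1 and lives in the dpf-operator-system namespace, as in the manual creation example.

```shell
# Hypothetical wrapper (not part of DPF). KUBECTL can be overridden, for
# example with "echo", to preview the command without touching a cluster.
ANNOTATION="provisioning.dpu.nvidia.com/dpunode-external-reboot-required"

# $1 is the DPUNode name, $2 is an optional namespace.
remove_reboot_annotation() {
  node=$1
  namespace=${2:-dpf-operator-system}
  # A trailing "-" after the key tells kubectl annotate to remove it.
  "${KUBECTL:-kubectl}" annotate dpunode "$node" -n "$namespace" "${ANNOTATION}-"
}

# Preview the command for the DPUNode named worker1 (example name).
( KUBECTL=echo; remove_reboot_annotation worker1 )
```

Calling remove_reboot_annotation worker1 without the override runs the command against the current cluster context.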
If you use script-based host reboot (nodeRebootMethod.script on the DPUNode) instead of external power cycle, see DPUNode: Script reboot job failures and recovery for how Jobs, DPU phase DPURebooting, and recovery interact.