DPUDevice is a Kubernetes custom resource, defined by a CRD, that represents a discovered physical DPU (Data Processing Unit). A DPUDevice contains all the information the DPU Controller needs to identify and provision the DPU.
Overview
The DPUDevice resource serves as an inventory and management interface for physical DPU devices. It contains device-specific information such as serial numbers, product identifiers, BMC (Baseboard Management Controller) details, and PCI addresses. A DPUDevice can be created automatically through discovery processes or manually by administrators.
DPUDevice Specification
DPUDeviceSpec
The spec section defines the desired configuration for the DPU device:
| Field | Type | Required | Description |
|---|---|---|---|
| serialNumber | string | Yes | Serial number of the device, used for inventory management |
| psid | string | No | Product Serial ID (deprecated; use status.psid) |
| opn | string | No | Ordering Part Number (deprecated; use status.opn) |
| bmcIp | string | No | IP address of the BMC used for remote management |
|  | uint32 | No | Port number for BMC communication (default: 443) |
| numberOfPFs | int | No | Number of Physical Functions on the device (default: 1) |
| pf0Name | string | No | Name of the first Physical Function |
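To illustrate the required field and the documented defaults, the spec can be modeled on the client side as a small Python dataclass. This is a hypothetical sketch, not the DPF Go API types: the class name is invented, the deprecated psid and opn fields are omitted, and the key bmcPort is an assumed name for the BMC port field, whose name is not given in the table above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

UINT32_MAX = 2**32 - 1


@dataclass
class DPUDeviceSpecSketch:
    """Hypothetical client-side model of DPUDevice.spec (not the DPF Go types)."""

    serialNumber: str              # required
    bmcIp: Optional[str] = None    # optional BMC IP address
    bmcPort: int = 443             # ASSUMED key name; documented default is 443
    numberOfPFs: int = 1           # documented default is 1
    pf0Name: Optional[str] = None  # optional name of the first Physical Function

    def __post_init__(self) -> None:
        if not self.serialNumber:
            raise ValueError("serialNumber is required")
        # The port is documented as uint32, so reject values outside that range.
        if not 0 <= self.bmcPort <= UINT32_MAX:
            raise ValueError("bmcPort must fit in a uint32")

    def to_dict(self) -> Dict[str, Any]:
        """Return the spec as a plain dict, dropping optional fields left unset."""
        return {key: value for key, value in vars(self).items() if value is not None}


if __name__ == "__main__":
    spec = DPUDeviceSpecSketch(serialNumber="MT25066004C7", bmcIp="10.1.2.3", pf0Name="eth0")
    print(spec.to_dict())
```

Writing the defaults out explicitly is harmless but not required; in a real cluster, defaulting is applied server-side by the CRD, so omitting bmcPort and numberOfPFs from a manifest should produce the same values.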
DPUDeviceStatus
The status section contains the observed state of the DPU device:
| Field | Type | Description |
|---|---|---|
| psid | string | Product Serial ID discovered from the device |
|  | string | Serial number discovered from the device |
| opn | string | Ordering Part Number discovered from the device |
|  | string | BMC IP address discovered from the device |
|  | uint32 | BMC port discovered from the device |
|  | string | PCI address of the device in the host system |
|  | string | MAC address of the first Physical Function |
| conditions | array | Array of condition objects describing the device state |
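Because spec.psid and spec.opn are deprecated in favor of the values discovered into status, client code that reads these identifiers should prefer status. The following is a minimal sketch, assuming the DPUDevice object is available as a parsed JSON/YAML dict; the helper name discovered_value and the placeholder values are hypothetical.

```python
from typing import Any, Dict, Optional


def discovered_value(device: Dict[str, Any], field: str) -> Optional[str]:
    """Prefer status.<field>; fall back to the deprecated spec.<field>."""
    status = device.get("status") or {}
    if status.get(field):
        return status[field]
    spec = device.get("spec") or {}
    return spec.get(field)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    device = {
        "spec": {"serialNumber": "MT25066004C7", "psid": "PSID-FROM-SPEC"},
        "status": {"psid": "PSID-FROM-STATUS"},
    }
    print(discovered_value(device, "psid"))  # PSID-FROM-STATUS
    print(discovered_value(device, "opn"))   # None
```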
Conditions
The DPUDevice resource uses several condition types to track its state:
- DpuDeviceDiscovered: Indicates that the DPU has been discovered
- DpuDeviceNodeAttached: Indicates that the DPU is attached to a node
- DpuDeviceInitialized: Indicates that the DPU interface has been initialized
- DpuDeviceError: Indicates that the DPUDevice has an error
- DpuDeviceReady: Indicates that the DPUDevice is ready for use
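One way to collapse these conditions into a single label for dashboards or scripts is sketched below. It assumes each condition is a dict with type and status keys (status being "True", "False", or "Unknown"), as the kubectl jsonpath queries later on this page suggest. The ordering of the stages is an assumption made for this sketch: the page lists the conditions but does not state that they are reached in sequence. The helper name coarse_phase is hypothetical.

```python
from typing import Any, Dict, List

# ASSUMPTION: the conditions are treated as successive lifecycle stages,
# in the order they are listed above.
STAGES = [
    "DpuDeviceDiscovered",
    "DpuDeviceNodeAttached",
    "DpuDeviceInitialized",
    "DpuDeviceReady",
]


def coarse_phase(conditions: List[Dict[str, Any]]) -> str:
    """Summarize a conditions list into one label (hypothetical helper)."""
    true_types = {c.get("type") for c in conditions if c.get("status") == "True"}
    # An active error condition overrides every other stage.
    if "DpuDeviceError" in true_types:
        return "Error"
    phase = "Pending"
    # Advance through the stages until the first one that is not True.
    for stage in STAGES:
        if stage not in true_types:
            break
        phase = stage
    return phase


if __name__ == "__main__":
    print(coarse_phase([{"type": "DpuDeviceDiscovered", "status": "True"}]))
```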
Example Usage
Basic DPUDevice Creation
First, determine the serial number of the DPU. In zero-trust mode, the serial number is discovered from the BMC. In host-trusted mode, run `lspci -vvs ${pci_address} | grep "SN"` on the host.
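In host-trusted mode, the serial number appears in the PCI Vital Product Data (VPD) section of the lspci output, on a line of the form "[SN] Serial number: ...". The sketch below extracts it with a regular expression instead of grep. The function names are hypothetical, the sample text is illustrative rather than captured from a real device, and reading VPD may require root privileges on some systems.

```python
import re
import subprocess
from typing import Optional

# Matches a VPD line such as "[SN] Serial number: MT25066004C7".
_SN_LINE = re.compile(r"\[SN\]\s*Serial number:\s*(\S+)")


def serial_from_lspci(output: str) -> Optional[str]:
    """Return the first VPD serial number found in `lspci -vv` text, or None."""
    match = _SN_LINE.search(output)
    return match.group(1) if match else None


def read_serial(pci_address: str) -> Optional[str]:
    """Run lspci for one PCI address and parse its serial number."""
    result = subprocess.run(
        ["lspci", "-vvs", pci_address], capture_output=True, text=True, check=True
    )
    return serial_from_lspci(result.stdout)


if __name__ == "__main__":
    # Illustrative excerpt of the VPD section of `lspci -vv` output.
    sample = (
        "\tCapabilities: [48] Vital Product Data\n"
        "\t\tRead-only fields:\n"
        "\t\t\t[SN] Serial number: MT25066004C7\n"
    )
    print(serial_from_lspci(sample))  # MT25066004C7
```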
Then create the DPUDevice, for example:

```yaml
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUDevice
metadata:
  name: MT25066004C7
  namespace: dpf-operator-system
spec:
  serialNumber: "MT25066004C7"
  bmcIp: "10.1.2.3"
  numberOfPFs: 1
  pf0Name: "eth0"
```
Lifecycle Management
Creation
DPUDevice resources are typically created in one of the following ways:
- Automatic discovery:
  - Zero-trust: the DPUDiscovery controller scans IP ranges.
  - Host-trusted: the dpudetector daemon runs on host nodes.
- Manual creation: administrators create the resource with known device details.
Firmware update: in zero-trust mode, the BMC firmware is updated to the latest version.
Updates
Most fields in DPUDevice are immutable once set. Only the following can be updated:
- Labels and annotations
- Status fields (managed by controllers)
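The immutability rule can be checked client-side before submitting an update, for example in a script that edits labels. This is a hypothetical sketch of the rule as stated above (any spec change is rejected), not the actual validation DPF performs; the function names are invented, and it does not distinguish between changing a field and setting a previously unset one.

```python
from typing import Any, Dict, List


def changed_spec_fields(old: Dict[str, Any], new: Dict[str, Any]) -> List[str]:
    """List spec keys whose values differ between two DPUDevice objects."""
    old_spec = old.get("spec") or {}
    new_spec = new.get("spec") or {}
    keys = set(old_spec) | set(new_spec)
    return sorted(key for key in keys if old_spec.get(key) != new_spec.get(key))


def is_allowed_update(old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """Allow the update only if spec is unchanged (labels/annotations may differ)."""
    return not changed_spec_fields(old, new)


if __name__ == "__main__":
    old = {"spec": {"serialNumber": "MT25066004C7", "bmcIp": "10.1.2.3"}}
    new = {"metadata": {"labels": {"rack": "r1"}},
           "spec": {"serialNumber": "MT25066004C7", "bmcIp": "10.1.2.4"}}
    print(changed_spec_fields(old, new))  # ['bmcIp']
```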
Deletion
DPUDevice resources are protected by a finalizer (provisioning.dpu.nvidia.com/dpudevice-protection) to prevent accidental deletion while the device is in use.
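In Kubernetes, deleting an object that still carries a finalizer only sets metadata.deletionTimestamp; the object stays in the API until the finalizer is removed. The sketch below classifies a DPUDevice object accordingly, which can help explain why a device appears stuck after kubectl delete. The finalizer string comes from the text above; the function name and the returned labels are hypothetical.

```python
from typing import Any, Dict

# Finalizer name as documented above.
FINALIZER = "provisioning.dpu.nvidia.com/dpudevice-protection"


def deletion_state(device: Dict[str, Any]) -> str:
    """Classify a DPUDevice object by its deletion and finalizer state."""
    meta = device.get("metadata") or {}
    being_deleted = bool(meta.get("deletionTimestamp"))
    protected = FINALIZER in (meta.get("finalizers") or [])
    if being_deleted and protected:
        # Deletion was requested, but the object is kept until the finalizer is removed.
        return "deletion-blocked"
    if being_deleted:
        return "deleting"
    return "protected" if protected else "unprotected"


if __name__ == "__main__":
    print(deletion_state({"metadata": {"finalizers": [FINALIZER]}}))  # protected
```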
Integration with Other Resources
DPUNode
DPUNode resources reference DPUDevice resources by name through the dpus field. In this example, each DPUDevice is named after its serial number:
```yaml
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUNode
metadata:
  name: dpu-node-001
spec:
  dpus:
    - name: MT25066004C7
    - name: MT25066004C8
```
DPU
DPU resources reference DPUDevice resources through the dpuDeviceName field:
```yaml
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPU
metadata:
  name: dpu-001
spec:
  dpuDeviceName: MT25066004C7
  dpuNodeName: dpu-node-001
  # ... other fields
```
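Since both DPUNode.spec.dpus and DPU.spec.dpuDeviceName refer to DPUDevices by name, a quick consistency check can flag references to devices that do not exist. This is a hypothetical helper operating on parsed objects (for example, the items from kubectl get ... -o json); for simplicity it ignores namespaces, which a real check would need to account for.

```python
from typing import Any, Dict, List


def dangling_device_refs(
    devices: List[Dict[str, Any]],
    nodes: List[Dict[str, Any]],
    dpus: List[Dict[str, Any]],
) -> List[str]:
    """Report DPUNode and DPU references to DPUDevice names that do not exist."""
    known = {device["metadata"]["name"] for device in devices}
    problems = []
    for node in nodes:
        for ref in (node.get("spec") or {}).get("dpus") or []:
            if ref.get("name") not in known:
                problems.append(f"DPUNode {node['metadata']['name']} -> {ref.get('name')}")
    for dpu in dpus:
        name = (dpu.get("spec") or {}).get("dpuDeviceName")
        if name not in known:
            problems.append(f"DPU {dpu['metadata']['name']} -> {name}")
    return problems


if __name__ == "__main__":
    devices = [{"metadata": {"name": "MT25066004C7"}}]
    nodes = [{"metadata": {"name": "dpu-node-001"},
              "spec": {"dpus": [{"name": "MT25066004C7"}, {"name": "MT25066004C8"}]}}]
    dpus = [{"metadata": {"name": "dpu-001"}, "spec": {"dpuDeviceName": "MT25066004C7"}}]
    print(dangling_device_refs(devices, nodes, dpus))
```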
Monitoring and Troubleshooting
Checking Device Status
```shell
# Get all DPUDevice resources
kubectl get dpudevices -n dpf-operator-system

# Get detailed information about a specific device
kubectl describe dpudevice MT25066004C7 -n dpf-operator-system

# Check device conditions
kubectl get dpudevice MT25066004C7 -n dpf-operator-system -o jsonpath='{.status.conditions}'
```
Common Issues
- Device not discovered (zero-trust setup): check whether the device is reachable via its BMC IP.
- Invalid serial number: ensure the serial number matches the required pattern.
- BMC connection issues: verify the BMC IP and port configuration.
- PCI address not found: check that the device is properly installed in the host.
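For the BMC-related issues, a first step is confirming that the BMC port accepts TCP connections from where the check runs. The sketch below does only that, using the documented default port 443; it does not authenticate or speak to the BMC's management API, so a successful result does not prove that discovery will succeed. The function name is hypothetical.

```python
import socket


def bmc_port_open(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, DNS errors, and timeouts.
        return False


if __name__ == "__main__":
    print(bmc_port_open("10.1.2.3"))
```

Run it from the same network location as the component that contacts the BMC; a firewall between an administrator's workstation and the BMC can give a different answer.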
Status Conditions
Monitor the following conditions for device health:
```shell
# Check if device is ready
kubectl get dpudevice MT25066004C7 -n dpf-operator-system -o jsonpath='{.status.conditions[?(@.type=="DpuDeviceReady")].status}'

# Check for errors
kubectl get dpudevice MT25066004C7 -n dpf-operator-system -o jsonpath='{.status.conditions[?(@.type=="DpuDeviceError")]}'
```
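To check readiness across many devices at once, the output of kubectl get dpudevices -o json can be summarized in a script. The sketch below reimplements the jsonpath filter used above (find the condition with a given type, read its status) for every item in the list. The file name devices.json and the function names are hypothetical.

```python
import json
import sys
from typing import Any, Dict, List, Optional


def condition_status(obj: Dict[str, Any], cond_type: str) -> Optional[str]:
    """Equivalent of {.status.conditions[?(@.type=="X")].status} for one object."""
    for cond in (obj.get("status") or {}).get("conditions") or []:
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None


def summarize(device_list: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return name, Ready status, and Error status for each DPUDevice in a List."""
    return [
        {
            "name": item["metadata"]["name"],
            "ready": condition_status(item, "DpuDeviceReady"),
            "error": condition_status(item, "DpuDeviceError"),
        }
        for item in device_list.get("items") or []
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: kubectl get dpudevices -n dpf-operator-system -o json > devices.json
    #        python3 summarize.py devices.json
    with open(sys.argv[1]) as f:
        for row in summarize(json.load(f)):
            print(f"{row['name']}\tready={row['ready']}\terror={row['error']}")
```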
Related Resources
- DPUNode - Node-level DPU management
- DPUDiscovery - Automatic DPU discovery
- DPU - DPU provisioning and deployment
- DPUSet - Bulk DPU management