Introduction
DPU resources expose health information across three condition sets that track different lifecycle phases.
Condition Lifecycle
- Provisioning Phase (status.conditions): High-level provisioning lifecycle (Initializing → Installing → Rebooting → Ready). Indicates the overall progress of the DPU through provisioning stages.
- DPU Agent Phase (status.agentStatus.conditions): Tracks individual DPU agent operations during provisioning (network configuration, NVConfig, kernel modules, kubelet setup, etc.). This is the place to look when provisioning is stuck: the agent retries each operation until it succeeds, and conditions report the current state of each step.
- Operational Phase (status.operationalConditions): Tracks runtime health after provisioning completes. Aggregates signals from Node-Problem-Detector, DPUService pods, DPUServiceInterfaces, and DPUServiceChains into a single view of DPU operational readiness.
A DPU can be in the Ready provisioning phase and still report OperationalReady: False if runtime issues occur (for example, the OVS service is down or pods are failing).
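The split between the condition sets can be read as a simple triage rule: while provisioning is unfinished, look at the agent conditions; once the phase is Ready, look at the operational conditions. The Python sketch below illustrates that rule only. It is not part of DPF: `where_to_look` is a hypothetical helper, and it assumes the caller has already extracted the provisioning phase and the `status.operationalConditions` list (for example from `kubectl get dpu <name> -o json`), with each condition carrying the usual `type` and `status` fields.

```python
from typing import Mapping, Sequence


def where_to_look(phase: str,
                  operational_conditions: Sequence[Mapping[str, str]]) -> str:
    """Suggest which condition set to inspect for a DPU.

    `phase` is the provisioning phase (e.g. "Installing", "Ready").
    `operational_conditions` is the list under status.operationalConditions.
    """
    if phase != "Ready":
        # Provisioning has not finished: the per-step agent conditions
        # show which operation is still being retried.
        return "status.agentStatus.conditions"

    status_by_type = {c["type"]: c["status"] for c in operational_conditions}
    if status_by_type.get("OperationalReady") != "True":
        # Provisioned, but a runtime check is failing or not yet reported.
        return "status.operationalConditions"

    return "healthy"
```

A missing OperationalReady condition is treated as not ready here, which is a conservative choice made for this sketch rather than documented operator behavior.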
Operational Condition Types
| Condition Type | Description |
|---|---|
| OperationalReady | Overall operational health (aggregate of all conditions) |
| NodeProblemsReady | Node-level health from Node-Problem-Detector |
| DPUServiceCriticalPodsReady | Critical DPU service pods are running |
| DPUServiceNonCriticalPodsReady | Non-critical DPU service pods are running |
| DPUServiceInterfacesReady | Service interfaces are configured |
| DPUServiceChainsReady | Service chains are configured |
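Because OperationalReady is described as an aggregate of the other conditions, one way to reason about it is as a logical AND over the component conditions. The sketch below assumes exactly that; the operator's actual aggregation logic is not shown in this document and may differ (for instance in how it treats an Unknown status). `aggregate_operational_ready` and `COMPONENT_TYPES` are hypothetical names used for illustration.

```python
from typing import Mapping, Sequence

# Component condition types reported under status.operationalConditions
# (every type except the OperationalReady aggregate itself).
COMPONENT_TYPES = (
    "NodeProblemsReady",
    "DPUServiceCriticalPodsReady",
    "DPUServiceNonCriticalPodsReady",
    "DPUServiceInterfacesReady",
    "DPUServiceChainsReady",
)


def aggregate_operational_ready(
        conditions: Sequence[Mapping[str, str]]) -> bool:
    """Return True only if every component condition is present and "True".

    Assumption: a missing or non-"True" component counts as not ready.
    """
    status_by_type = {c["type"]: c["status"] for c in conditions}
    return all(status_by_type.get(t) == "True" for t in COMPONENT_TYPES)
```

Comparing this result with the reported OperationalReady value can help spot which component is holding the aggregate at False.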
Viewing Operational Conditions
Check DPU operational health:
$ kubectl -n dpf-operator-system get dpu
NAME                   READY   OPERATIONAL   PHASE   AGE
worker1-mt2413xz0b67   True    True          Ready   73d
worker2-mt2413xz0b6w   True    True          Ready   73d
Or inspect the status.operationalConditions field directly:
$ kubectl -n dpf-operator-system get dpu worker1-mt2413xz0b67 -oyaml | yq -P .status.operationalConditions
- lastTransitionTime: "2026-04-16T21:01:42Z"
  message: All node health checks passing (9 Conditions)
  reason: NoProblemsDetected
  status: "True"
  type: NodeProblemsReady
- lastTransitionTime: "2026-03-26T19:12:03Z"
  message: All critical Pods are ready (0)
  reason: AllPodsReady
  status: "True"
  type: DPUServiceCriticalPodsReady
- lastTransitionTime: "2026-04-16T21:01:42Z"
  message: All Pods are ready (4)
  reason: AllPodsReady
  status: "True"
  type: DPUServiceNonCriticalPodsReady
- lastTransitionTime: "2026-03-26T19:13:50Z"
  message: All ServiceInterfaces are ready (3)
  reason: AllServiceInterfacesReady
  status: "True"
  type: DPUServiceInterfacesReady
- lastTransitionTime: "2026-04-14T15:12:52Z"
  message: All ServiceChains are ready (1)
  reason: AllServiceChainsReady
  status: "True"
  type: DPUServiceChainsReady
- lastTransitionTime: "2026-04-16T21:01:42Z"
  message: All operational conditions are ready
  reason: AllReady
  status: "True"
  type: OperationalReady
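When many DPUs need checking, the same data can be filtered programmatically instead of read by eye. The following Python sketch assumes the DPU object was exported as JSON (for example with `kubectl -n dpf-operator-system get dpu <name> -o json`) and reads only fields visible in the output above (`type`, `status`, `reason`). `not_ready_conditions` is a hypothetical helper, and the sample object, including the reason `ExampleReason`, is made up for illustration.

```python
import json
from typing import Dict, List


def not_ready_conditions(dpu_json: str) -> List[Dict[str, str]]:
    """Return the operational conditions whose status is not "True"."""
    dpu = json.loads(dpu_json)
    conditions = dpu.get("status", {}).get("operationalConditions", [])
    return [c for c in conditions if c.get("status") != "True"]


# Trimmed, invented object in the same shape as the kubectl output.
sample = json.dumps({
    "status": {
        "operationalConditions": [
            {"type": "NodeProblemsReady", "status": "True",
             "reason": "NoProblemsDetected"},
            {"type": "DPUServiceChainsReady", "status": "False",
             "reason": "ExampleReason"},  # placeholder reason, not a real one
        ]
    }
})

for cond in not_ready_conditions(sample):
    print(cond["type"], cond["reason"])
```

An object with no `status.operationalConditions` yields an empty list, so a caller that needs to distinguish "all ready" from "not reported yet" would have to check for the field separately.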
Alerting Example
Use operational conditions for alerting in Prometheus:
- alert: DPUOperationalNotReady
  expr: |
    dpu_operational_conditions{type="OperationalReady",status="True"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "DPU {{ $labels.name }} is not operationally ready"