Introduction
The components Kube-State-Metrics and Node-Problem-Detector are deployed by default and configured by the DPF operator via the DPFOperatorConfig.spec.monitoring field. OpenTelemetry Collector requires additional explicit endpoint configuration.
To disable all DPF-operator-managed monitoring components:
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  monitoring:
    disable: true
Kube-State-Metrics
Kube-State-Metrics (KSM) exposes metrics about Kubernetes object states. This DPF-operator-managed KSM instance monitors DPU cluster resources only.
For Host Cluster Kubernetes resource metrics, a separate KSM instance must be deployed by the user — see User-Managed Components.
Deployment Architecture
KSM is deployed with a split architecture:
- Host Cluster Deployment: A single Deployment that connects remotely to each DPU cluster's API server to collect metrics
- DPU Cluster RBAC: RBAC-only resources on each DPU cluster grant permissions for the Host Cluster KSM
Monitored DPU Resources
KSM collects metrics for the following DPU custom resources:
IPAM Resources:
- IPPool: IP address pool status and allocation metrics
- CIDRPool: CIDR pool status and allocation metrics
Service Function Chaining:
- ServiceChain: Service chain status and configuration
- ServiceChainSet: Service chain set status
- ServiceInterface: Service interface status and health
- ServiceInterfaceSet: Service interface set status
Kubernetes Resources:
- Pods, Deployments, DaemonSets, and more. See Kube-State-Metrics documentation for a complete list.
Configuration
KSM is enabled by default. To disable KSM while keeping other monitoring components enabled:
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  monitoring:
    kubeStateMetrics:
      disable: true
To customize KSM image and resources:
KSM metrics are automatically scraped by Prometheus via ServiceMonitor.
Node-Problem-Detector
Node-Problem-Detector (NPD) monitors DPU node health and reports problems as Node conditions. It runs as a DaemonSet on each DPU cluster node.
Health Checks
NPD includes DPU-specific health checks that run every 30 seconds:
| Condition Type | Check Description |
|---|---|
| | Verifies ovs-vswitchd service is running |
| | Verifies ovsdb-server service is running |
| | Checks for OVS process OOM kills |
| | Verifies SR-IOV VF representors are present |
| | Checks physical uplink is operational |
| | Verifies DPU is in embedded mode |
| | Validates network MTU configuration |
Additionally, NPD monitors standard Kubernetes node problems (kernel deadlocks, read-only filesystems, disk pressure, OOM events).
Integration with DPU Status
Node conditions from NPD are aggregated into the DPU resource's operationalConditions field via the NodeProblemsReady condition, providing centralized visibility into node health.
Configuration
NPD is enabled by default. To disable NPD while keeping other monitoring components enabled:
spec:
  monitoring:
    nodeProblemDetector:
      disable: true
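The three disable switches shown in this document (spec.monitoring.disable, spec.monitoring.kubeStateMetrics.disable, and spec.monitoring.nodeProblemDetector.disable) can also be expressed as a JSON merge patch instead of editing the full manifest. The following is a minimal sketch, not an official DPF tool: monitoring_disable_patch is a hypothetical helper name, and the commented kubectl invocation assumes that the lowercase resource name dpfoperatorconfig resolves to the DPFOperatorConfig kind on your cluster.

```shell
# Sketch: build a JSON merge patch that sets a monitoring "disable" flag.
# With no argument it targets spec.monitoring.disable (all components).
# With a component field name (kubeStateMetrics or nodeProblemDetector)
# it targets only that component.
# monitoring_disable_patch is a hypothetical helper name, not part of DPF.
monitoring_disable_patch() {
  component="${1:-}"
  if [ -z "$component" ]; then
    printf '{"spec":{"monitoring":{"disable":true}}}\n'
  else
    printf '{"spec":{"monitoring":{"%s":{"disable":true}}}}\n' "$component"
  fi
}

# Example (needs cluster access; the resource name is an assumption):
# kubectl -n dpf-operator-system patch dpfoperatorconfig dpfoperatorconfig \
#   --type merge -p "$(monitoring_disable_patch nodeProblemDetector)"
```

A merge patch changes only the listed field, so other settings under spec.monitoring are left as they are.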
To customize NPD:
Monitoring DPU Health
Node-Problem-Detector health checks are aggregated into the DPU's operational status. To monitor DPU health:
$ kubectl -n dpf-operator-system get dpu
NAME                   READY   OPERATIONAL   PHASE   AGE
worker1-mt2413xz0b67   True    True          Ready   73d
worker2-mt2413xz0b6w   True    True          Ready   73d
See DPU Operational Readiness for more details on operational conditions and alerting.
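For a quick check across many DPUs, the table printed by the command above can be filtered for rows whose OPERATIONAL column is not True. This is a minimal sketch under stated assumptions: non_operational_dpus is a hypothetical helper name, and the column lookup relies on the header names NAME and OPERATIONAL exactly as they appear in the sample output.

```shell
# Sketch: list DPUs whose OPERATIONAL column is not "True".
# Reads the table printed by `kubectl -n dpf-operator-system get dpu` on stdin.
# Columns are located by header name, so the helper assumes the headers
# NAME and OPERATIONAL from the sample output, and no empty cells.
non_operational_dpus() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name_col = i
        if ($i == "OPERATIONAL") op_col = i
      }
      next
    }
    name_col && op_col && $op_col != "True" { print $name_col }
  '
}

# Example (needs cluster access):
# kubectl -n dpf-operator-system get dpu | non_operational_dpus
```

An empty result means every listed DPU reports OPERATIONAL as True. Because the parsing is whitespace-based, a row with an empty cell would shift the columns; structured output from kubectl is a safer basis for anything beyond an ad hoc check.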
NPD also exposes Prometheus metrics on port 20257.
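The metrics endpoint can be inspected by hand when Prometheus is not available. The sketch below filters Prometheus text output for active problems. It rests on assumptions that this document does not confirm: that metrics are served at the /metrics path, and that the series name problem_gauge with a type label follows the upstream Node-Problem-Detector exporter. The helper name active_npd_problems and the placeholder <dpu-node-ip> are hypothetical.

```shell
# Sketch: print the "type" label of every problem_gauge series whose value is 1.
# Reads Prometheus text exposition format on stdin.
# Assumption: the metric name problem_gauge and its "type" label follow the
# upstream Node-Problem-Detector exporter; verify against your deployment.
active_npd_problems() {
  awk '
    index($0, "problem_gauge{") == 1 && $NF == 1 {
      if (match($0, /type="[^"]*"/)) {
        # Skip the 6 characters of type=" and drop the closing quote.
        print substr($0, RSTART + 6, RLENGTH - 7)
      }
    }
  '
}

# Example (needs network access to a DPU node; the /metrics path is an assumption):
# curl -s "http://<dpu-node-ip>:20257/metrics" | active_npd_problems
```

The helper treats the last field of each line as the sample value, so it assumes the exporter does not append timestamps.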
OpenTelemetry Collector
OpenTelemetry Collector (OTEL) provides centralized log collection from DPU clusters to a user-specified endpoint.
Architecture
- OTEL Collector DaemonSet: Collects logs from DPU cluster pods and forwards to the configured endpoint, tagged with cluster name
- OTEL Collector Endpoint: Receives logs from DPU clusters via OTLP and exports to a backend
Configuration
OTEL Collector is disabled by default and requires a logging endpoint configuration:
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  monitoring:
    openTelemetryCollector:
      logging:
        endpoint: "http://<host-node-ip>:30318"
The endpoint can be any OTLP-compatible receiver (OpenTelemetry Collector, observability gateway, cloud service, etc.).
The OpenTelemetry Collector deployed via Helm values (default configuration) is exposed on NodePort 30318:
# Get Host Cluster node IP
kubectl get nodes -o wide
# Use format: http://<NODE_IP>:30318
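The two manual steps above (look up a node IP, then format the URL) can be combined in a small script. This is a sketch only: first_internal_ip and otlp_endpoint are hypothetical helper names, the first node in the listing is chosen arbitrarily, and the parsing assumes the INTERNAL-IP header that kubectl prints with -o wide.

```shell
# Sketch: derive the OTLP logging endpoint from `kubectl get nodes -o wide`.
# first_internal_ip and otlp_endpoint are hypothetical helper names.

# Print the INTERNAL-IP of the first node in the table read from stdin.
first_internal_ip() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "INTERNAL-IP") col = i; next }
    col { print $col; exit }
  '
}

# Format the endpoint URL; the port defaults to NodePort 30318.
otlp_endpoint() {
  printf 'http://%s:%s\n' "$1" "${2:-30318}"
}

# Example (needs cluster access):
# otlp_endpoint "$(kubectl get nodes -o wide | first_internal_ip)"
```

The printed URL is the value to place in spec.monitoring.openTelemetryCollector.logging.endpoint. Pass a second argument to otlp_endpoint if the collector is exposed on a different port.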