Troubleshooting DOCA Platform Framework
This section provides comprehensive troubleshooting guidance for common issues you may encounter while deploying, configuring, or operating the DOCA Platform Framework (DPF).
Quick Diagnostic Tools
DPF CLI (dpfctl)
Command-line tool for visualizing, debugging, and troubleshooting DPU resources in Kubernetes. Essential for real-time visibility into resource states and conditions.
Use when:
- DPU provisioning is failing
- You need to understand resource dependencies
- You are debugging component readiness issues
SOS Report Collection (dpfctl sosreport)
Collect system diagnostics from host and DPU cluster nodes for support cases.
Use when:
- You need detailed system information for a support case
- You are investigating complex infrastructure issues
- You are preparing diagnostic data for NVIDIA support
DPU Cluster
Accessing the Kamaji DPU Cluster
How to retrieve the admin kubeconfig for a Kamaji-backed DPUCluster when direct cluster access is needed for advanced troubleshooting.
Use when:
- You need to inspect workloads or nodes running inside the DPU cluster directly
- DPF-level status fields do not provide enough detail for a specific investigation
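As a rough illustration of what retrieving the admin kubeconfig involves, the sketch below decodes a kubeconfig from a Kubernetes Secret exported as JSON. It assumes the Kamaji convention of storing the admin kubeconfig under the `admin.conf` key of a Secret. The key name, the function name `extract_admin_kubeconfig`, and the sample data are assumptions made for illustration, so confirm the actual Secret name and key in your deployment and in the linked guide.

```python
import base64
import json


def extract_admin_kubeconfig(secret_json, key="admin.conf"):
    """Return the decoded kubeconfig stored in a Secret manifest (JSON text).

    `key` defaults to "admin.conf", which is an assumption based on the
    usual Kamaji convention; adjust it if your Secret uses another key.
    """
    secret = json.loads(secret_json)
    data = secret.get("data", {})
    if key not in data:
        raise KeyError(
            "key %r not found; available keys: %s" % (key, sorted(data))
        )
    # Secret values are base64-encoded by the Kubernetes API.
    return base64.b64decode(data[key]).decode("utf-8")


# Demo with a synthetic Secret (not real cluster data).
sample_kubeconfig = "apiVersion: v1\nkind: Config\n"
sample_secret = json.dumps(
    {
        "kind": "Secret",
        "data": {
            "admin.conf": base64.b64encode(sample_kubeconfig.encode()).decode()
        },
    }
)
print(extract_admin_kubeconfig(sample_secret))
```

With a real cluster, the JSON text would come from `kubectl get secret <secret-name> -n <namespace> -o json`, where the Secret name and namespace depend on your DPUCluster. Treat the decoded file as a credential: it grants admin access to the DPU cluster.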
Common Issues
Service Function Chaining (SFC)
If a ServiceChain or ServiceChainSet is stuck at Ready=False or flapping between Ready and Pending, the most common cause is a ServiceInterface uniqueness conflict. See the DPUServiceChain Constraints section for detailed error messages, root causes, and resolution steps.
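To see which condition is holding a resource at Ready=False, it can help to list every condition that is not currently `True`. The sketch below assumes the resource follows the standard Kubernetes `status.conditions` layout (`type`, `status`, `reason`, `message`, `lastTransitionTime`). The function name `not_ready_conditions` and the sample reason and message are hypothetical placeholders, not actual DPF output.

```python
def not_ready_conditions(resource):
    """Return the conditions of a resource whose status is not "True".

    `resource` is a dict parsed from `kubectl get ... -o json` output and
    is assumed to use the standard `status.conditions` list.
    """
    conditions = resource.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("status") != "True"]


# Synthetic example; the reason and message are illustrative placeholders.
sample_resource = {
    "kind": "ServiceChain",
    "metadata": {"name": "example-chain"},
    "status": {
        "conditions": [
            {
                "type": "Ready",
                "status": "False",
                "reason": "ExampleReason",
                "message": "placeholder message",
                "lastTransitionTime": "2024-01-01T00:00:00Z",
            },
            {"type": "ExampleSubCondition", "status": "True"},
        ]
    },
}

for cond in not_ready_conditions(sample_resource):
    print(
        cond.get("type"),
        cond.get("status"),
        cond.get("reason", ""),
        cond.get("message", ""),
        cond.get("lastTransitionTime", ""),
    )
```

For a real resource, load the output of `kubectl get <kind> <name> -n <namespace> -o json` with `json.load` and pass the result to the function. A `lastTransitionTime` that keeps changing between runs suggests the condition is flapping rather than stuck.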
Escalation Path
If you cannot resolve the issue using the guides above:
- Collect Diagnostic Information
  * Collect a sosreport for your environment
- Check Known Issues
  * Review Release Notes for known issues
  * Search the GitHub repository for similar problems
- Contact Support
  * Open an issue on the GitHub repository
  * Include diagnostic information and steps to reproduce
  * For enterprise customers, contact NVIDIA support with your diagnostic package
Additional Resources
- User Guides - Operational procedures and best practices
- Architecture - Understanding system design for better troubleshooting
- API Reference - Complete API documentation for debugging configurations