BlueField Troubleshooting Guide

DOCA Services


Preface

DOCA Services are primarily containerized DOCA programs designed for deployment on the DPU and, in some cases, on the host, using a Kubernetes-based deployment approach.

Before troubleshooting any issues related to DOCA Services, ensure that the deployment process is followed according to the steps outlined in the "Review Container Deployment" section of the NVIDIA BlueField Container Deployment Guide.

Refer to the NVIDIA BlueField Container Deployment Guide for more details about the container ecosystem for the DPU.

Command Cheat Sheet

Command                                   Description
crictl pods                               Displays currently active K8S pods and their IDs (it might take up to 20-30 seconds for the pod to start)
crictl ps                                 Displays currently active containers and their IDs
crictl ps -a                              Displays all containers, including containers that recently finished their execution
crictl logs <container-id>                Examines the logs of a given container
crictl exec -it <container-id> /bin/bash  Attaches a shell to a running container
journalctl -u kubelet                     Examines the Kubelet logs. Useful when a pod/container fails to spawn.
crictl stopp <pod-id>                     Stops a running K8S pod
crictl stop <container-id>                Stops a running container
crictl rmi <image-id>                     Removes a container image from the local K8S registry

Logging and Counters

Logging commands are provided in the "Command Cheat Sheet" section.

Debug Info Packages

Not relevant. DOCA Services are primarily containerized DOCA programs, so they are deployed as containers rather than as packages.

Scenarios

YAML Syntax Error #1

When deploying the container using the respective YAML file, the pod fails to start.

Error

The error may happen after modifying a service's YAML file, or after copying an example YAML file from one of the guides.

$ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
$ journalctl -u kubelet
...
Oct 06 12:10:08 dpu-name kubelet[3260]: E1006 12:10:08.552306    3260 file.go:108] "Unable to process watch event" err="can't process config file \"/etc/kubelet.d/file_name.yaml\": invalid pod: [metadata.name: Invalid value: \"-dpu-name\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*') spec.containers: Required value]"
...

This indicates that some of the fields in the YAML file fail to comply with RFC 1123.

Solution

Both the pod name and the container name are subject to strict RFC 1123 character restrictions. Names may use lowercase alphanumeric characters and dash ("-"), and must start and end with an alphanumeric character, which the name "-dpu-name" in the log above violates. Underscore ("_") is an illegal character and cannot be used in the pod/container name. However, for the container's image name, use underscore ("_") instead of dash ("-") to help differentiate the two.
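A name can be checked before deployment against the same regular expression that the kubelet prints in the error above. The following is a minimal sketch, not part of the DOCA tooling: the function name is_rfc1123_subdomain is hypothetical, and the length limit that Kubernetes also applies to names is not checked.

```shell
# Hypothetical helper: succeeds (exit status 0) only if the name matches the
# RFC 1123 subdomain pattern quoted in the kubelet error message.
# LC_ALL=C keeps the [a-z] range limited to ASCII lowercase letters.
is_rfc1123_subdomain() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
}

# Try one valid name and two invalid ones (leading dash, underscore).
for name in doca-telemetry-service-my-dpu -dpu-name doca_service; do
    if is_rfc1123_subdomain "$name"; then
        echo "valid:   $name"
    else
        echo "invalid: $name"
    fi
done
```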

YAML Syntax Error #2

When deploying the container using the respective YAML file, the pod fails to start.

Error

The error may happen after modifying a service's YAML file, or after copying an example YAML file from one of the guides.

A common cause is a whitespace problem introduced when the YAML file is copied from one of the guides: the copied text may contain whitespace characters that look like spaces but are not. Make sure that the characters used for indentation are plain spaces (" ") and not some other whitespace character.


$ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
$ journalctl -u kubelet
...
Oct 04 12:35:58 dpu-name kubelet[3046]: E1004 12:35:58.744406    3046 file.go:187] "Could not process manifest file" err="/etc/kubelet.d/file_name.yaml: couldn't parse as pod(yaml: line 48: did not find expected '-' indicator), please check config file" path="/etc/kubelet.d/file_name.yaml"
...

This indicates a probable indentation issue on line 48 or on the line above it.
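To inspect the lines named in the message, the reported line and the one above it can be printed together with their line numbers. This is a minimal sketch; show_suspect_lines is a hypothetical helper name, and the path and line number in the usage comment are the ones from the log above.

```shell
# Hypothetical helper: print line N and the line above it, prefixed with
# their line numbers, so the indentation can be compared side by side.
show_suspect_lines() {
    awk -v n="$2" 'NR == n - 1 || NR == n { printf "%d: %s\n", NR, $0 }' "$1"
}

# Usage on the DPU, with the values from the log above:
#   show_suspect_lines /etc/kubelet.d/file_name.yaml 48
```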

Solution

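Go over the file and make sure that it uses only space characters (" ") for indentation, two per indentation level. Tab characters are not valid YAML indentation, and indentation that is inconsistent between sibling entries leads to parse errors such as the one above.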
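Whitespace that looks like a space but is not one is hard to spot by eye. The sketch below flags lines that contain a tab or a UTF-8 non-breaking space. Choosing these two characters is an assumption about what typically sneaks in when copying from rendered documentation; other look-alike characters exist and are not covered. find_bad_whitespace is a hypothetical helper name.

```shell
# Hypothetical helper: print "line-number:content" for every line of FILE that
# contains a tab or a UTF-8 non-breaking space (bytes 0xC2 0xA0).
# Exit status is 0 if such a line was found and 1 if the file is clean.
find_bad_whitespace() {
    pattern=$(printf '\t|\302\240')
    LC_ALL=C grep -nE "$pattern" "$1"
}

# Usage on the DPU:
#   find_bad_whitespace /etc/kubelet.d/file_name.yaml
```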

Missing Huge Pages

When deploying the container using the respective YAML file, the pod fails to start.

Error

$ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT             RUNTIME
$ journalctl -u kubelet
...
Oct 04 12:39:41 dpu-name kubelet[3046]: I1004 12:39:41.643621    3046 predicate.go:103] "Failed to admit pod, unexpected error while attempting to recover from admission failure" pod="default/file_name" err="preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: hugepages-2Mi, q: 1021313024), ]"
...

This error indicates that the service expected approximately 1GB of huge pages (the log reports 1021313024 bytes, which corresponds to 487 pages of 2MB each) and could not find them.
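The byte quantity in the log can be converted to a page count with integer arithmetic. The sketch below assumes that 2MB means 2 * 1024 * 1024 bytes, as the hugepages-2Mi resource name suggests; hugepages_needed is a hypothetical helper name.

```shell
# Hypothetical helper: number of huge pages needed to cover BYTES, given the
# page size in bytes (rounded up to a whole page).
hugepages_needed() {
    bytes=$1
    page_bytes=$2
    echo $(( ($bytes + $page_bytes - 1) / $page_bytes ))
}

# Quantity from the log above, with 2MB (2 * 1024 * 1024 bytes) pages.
hugepages_needed 1021313024 $((2 * 1024 * 1024))   # prints 487
```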

Solution

  1. Remove the YAML file of the service from the deployment directory (/etc/kubelet.d).

  2. Allocate huge pages as described in the service's prerequisite steps. Make sure that the huge pages are allocated as required by the desired container. Both the number and the size of the pages are important and must match precisely.

  3. Restart the container infrastructure daemons:

    sudo systemctl restart kubelet.service 
    sudo systemctl restart containerd.service
    


  4. Once the above operations complete successfully, the container can be deployed (copy the YAML file back to /etc/kubelet.d).
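Before redeploying, it can help to confirm that the allocation took effect. The sketch below reads the HugePages_Free and Hugepagesize fields that Linux exposes in /proc/meminfo and compares them with the values the service needs. It is an assumption-laden check rather than part of the DOCA tooling: hugepages_available is a hypothetical helper name, and only the default huge page size reported in /proc/meminfo is considered.

```shell
# Hypothetical helper: succeed if the default huge page size equals SIZE_KB
# and at least WANTED huge pages are free.
# MEMINFO defaults to /proc/meminfo; another file can be passed for testing.
hugepages_available() {
    wanted=$1
    size_kb=$2
    meminfo=${3:-/proc/meminfo}
    free=$(awk '/^HugePages_Free:/ { print $2 }' "$meminfo")
    page_kb=$(awk '/^Hugepagesize:/ { print $2 }' "$meminfo")
    [ "${page_kb:-0}" -eq "$size_kb" ] && [ "${free:-0}" -ge "$wanted" ]
}

# Usage on the DPU, for 487 pages of 2MB (2048 kB) each:
#   hugepages_available 487 2048 && echo "enough huge pages are free"
```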

Failed to Reserve Sandbox Name

After rebooting the DPU, the respective pods start. However, the containers repeatedly fail to spawn and their "attempt" counter does not increment.

Error

$ crictl pods
POD ID              CREATED                  STATE               NAME                                      NAMESPACE           ATTEMPT             RUNTIME
bee147792a85b       Less than a second ago   Ready               doca-hbn-service-my-dpu                   default             0                   (default)
ea66ee46e75a5       Less than a second ago   Ready               doca-telemetry-service-my-dpu             default             0                   (default)

$ crictl ps -a
CONTAINER           IMAGE               CREATED                  STATE               NAME                       ATTEMPT             POD ID              POD
6a35c025a3590       ce4c0cafd583e       Less than a second ago   Exited              init-sfs                   0                   bee147792a85b       doca-hbn-service-my-dpu
9048f4c7b8f3c       095a5833a3f80       Less than a second ago   Running             doca-telemetry-service     0                   ea66ee46e75a5       doca-telemetry-service-my-dpu
059d0aa8a3199       095a5833a3f80       Less than a second ago   Exited              init-telemetry-service     0                   ea66ee46e75a5       doca-telemetry-service-my-dpu
bcfbe536271ea       ce4c0cafd583e       33 seconds ago           Running             init-sfs                   1                   bee147792a85b       doca-hbn-service-my-dpu

$ journalctl -u containerd
...
"2023-11-28T08:43:42.408173348+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:doca-hbn-service-my-dpu,Uid:823b1ad0e241a10475edde26e905856b,Namespace:default,Attempt:0,} failed, error" error="failed to reserve sandbox name \"doca-hbn-service-my-dpu_default_823b1ad0e241a10475edde26e905856b_0\": name \"doca-hbn-service-my-dpu_default_823b1ad0e241a10475edde26e905856b_0\" is reserved for \"bee147792a85bc23a3629a9fcd0a5f388794f6b67ef552c959d4d5e49d04f5b2\""
...

This error indicates that there has been some collision with prior instances of the doca-hbn-service container, probably pre-reboot.

This issue indicates irregularities in the machine's clock, usually that the DPU's time before the reboot was later than its time after the reboot. Such a backwards jump leads to bugs in the recovery of the container infrastructure daemons, so it is of utmost importance that the system time never jumps backwards.
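One way to notice a backwards jump is to record the time before a planned reboot and compare it afterwards. This is a minimal sketch under that assumption, not a mechanism provided by DOCA; the helper name time_went_backwards and the file path in the usage comment are hypothetical.

```shell
# Hypothetical helper: succeed if NOW (seconds since the epoch, defaults to
# the current time) is earlier than SAVED, i.e. the clock moved backwards.
time_went_backwards() {
    saved=$1
    now=${2:-$(date +%s)}
    [ "$now" -lt "$saved" ]
}

# Usage:
#   date +%s > /var/tmp/last_known_time                 # before the reboot
#   time_went_backwards "$(cat /var/tmp/last_known_time)" \
#       && echo "clock is behind the pre-reboot time"   # after the reboot
```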

Solution

  1. Remove all YAML files from the deployment directory (/etc/kubelet.d).

  2. Stop all pods:

    sudo crictl stopp $(sudo crictl pods | tail -n +2 | awk '{ print $1 }')
    


  3. Clear all containers:

    sudo ctr -n k8s.io container rm $(sudo ctr -n k8s.io container ls | tail -n +2 | awk '{ print $1 }')
    


  4. Make sure the system's time is correct, and adjust it if needed:

    date
    


  5. Restart the container infrastructure daemons:

    sudo systemctl restart kubelet.service 
    sudo systemctl restart containerd.service
    


  6. Once the above operations complete successfully, the container can be deployed (copy the YAML file back to /etc/kubelet.d).
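Steps 2 and 3 rely on the same pipeline: tail -n +2 drops the header row of the listing, and awk '{ print $1 }' keeps the first column, which holds the ID. The sketch below runs that pipeline on sample crictl pods output copied from the error listing above, so the result can be seen without touching a live system; first_column_ids is a hypothetical helper name.

```shell
# Hypothetical helper: read a crictl/ctr listing on stdin, skip the header
# row, and print the first column (the ID) of every remaining line.
first_column_ids() {
    tail -n +2 | awk '{ print $1 }'
}

# Sample listing taken from the error output above; prints the two pod IDs.
first_column_ids <<'EOF'
POD ID              CREATED                  STATE               NAME                                      NAMESPACE           ATTEMPT             RUNTIME
bee147792a85b       Less than a second ago   Ready               doca-hbn-service-my-dpu                   default             0                   (default)
ea66ee46e75a5       Less than a second ago   Ready               doca-telemetry-service-my-dpu             default             0                   (default)
EOF
```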
