BlueField Troubleshooting Guide

Firefly


Preface

The DOCA Firefly Service delivers Precision Time Protocol (PTP) based time synchronization for the BlueField DPU.

PTP is a protocol designed to synchronize clocks within a network. When paired with hardware support, PTP can achieve sub-microsecond accuracy, significantly surpassing the typical precision of Network Time Protocol (NTP). PTP functionality is managed across both kernel and user space, with the ptp4l program handling PTP boundary and ordinary clocks. With hardware time stamping, ptp4l ensures synchronization of the PTP hardware clock to the master clock.

For further details about DOCA Firefly, please consult the relevant NVIDIA DOCA Firefly Service Guide.

Command Cheat Sheet

Firmware Settings

Query status of the Real Time Clock (RTC)

To check the status of the RTC on the DPU, use the following command:

Bash
$ sudo mlxconfig -d 03:00.0 q | grep REAL_TIME_CLOCK_ENABLE
# Example output
        REAL_TIME_CLOCK_ENABLE                      True(1) 
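
For scripted checks, the query above can be wrapped in a small helper. The following is a minimal sketch and not part of the official tooling: the function name rtc_enabled is hypothetical, and it assumes the output layout shown in the example (the parameter name followed by True(1) or False(0)).

```shell
# Hypothetical helper: reads `mlxconfig ... q` output on stdin and succeeds
# only if REAL_TIME_CLOCK_ENABLE is reported as True(1).
# Assumes the "NAME   True(1)" layout shown in the example output above.
rtc_enabled() {
    awk '$1 == "REAL_TIME_CLOCK_ENABLE" { found = 1; ok = ($2 ~ /^True/) }
         END { if (found && ok) exit 0; exit 1 }'
}

# Usage on the DPU (device address taken from the example above):
#   sudo mlxconfig -d 03:00.0 q | rtc_enabled && echo "RTC enabled"
```

A non-zero exit status means the parameter is either disabled or missing from the output, so the helper can gate later steps in a setup script.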

Enabling the Real Time Clock (RTC)

To enable RTC, run:

Bash
$ sudo mlxconfig -d 03:00.0 set REAL_TIME_CLOCK_ENABLE=1

A graceful shutdown and power cycle of the DPU are required for the changes to take effect.

Open vSwitch (OVS) Configuration

Check Hardware Offload Support

To verify whether hardware offload is enabled, run:

Bash
$ sudo ovs-vsctl get Open_vSwitch . other_config | grep hw-offload
# Example output
         {hw-offload="true"}

Enable Hardware Offload Support

  1. To activate hardware offloading, run:

    Bash
    $ sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
    


  2. Restart the OVS service:

    Bash
    $ sudo /etc/init.d/openvswitch-switch restart
    


Examine Switch Settings

To examine the current switch settings, run:

Bash
$ sudo ovs-vsctl show
    Bridge uplink
        Port pf0hpf
            Interface pf0hpf
        Port en3f0pf0sf4
            Interface en3f0pf0sf4
        Port p0
            Interface p0
        Port uplink
            Interface uplink
                type: internal

Add New Bridge / Port

To add a new bridge or port, run:

Bash
$ sudo ovs-vsctl add-br <bridge name>
$ sudo ovs-vsctl add-port <bridge name> <port name>

Example for DOCA Firefly Deployment

Bash
$ sudo ovs-vsctl add-br uplink
$ sudo ovs-vsctl add-port uplink p0
$ sudo ovs-vsctl add-port uplink en3f0pf0sf4
# This port is needed to ensure we have traffic host<->network as well
$ sudo ovs-vsctl add-port uplink pf0hpf
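
When this setup is scripted, re-running the commands fails if the bridge or a port already exists. The sketch below shows one way to make the example repeatable using the --may-exist option of ovs-vsctl. The function name setup_firefly_bridge and the OVS_VSCTL override (used here only to allow a dry run) are hypothetical; the bridge and port names are those from the example above.

```shell
# Command used to talk to OVS; override with OVS_VSCTL="echo ovs-vsctl"
# for a dry run that only prints the commands (hypothetical convenience).
OVS_VSCTL="${OVS_VSCTL:-sudo ovs-vsctl}"

# Create the bridge and attach each port; --may-exist makes every step a
# no-op when the bridge or port is already present.
setup_firefly_bridge() {
    bridge="$1"
    shift
    $OVS_VSCTL --may-exist add-br "$bridge" || return 1
    for port in "$@"; do
        $OVS_VSCTL --may-exist add-port "$bridge" "$port" || return 1
    done
}

# Usage with the names from the example above:
#   setup_firefly_bridge uplink p0 en3f0pf0sf4 pf0hpf
```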

Network Interface Configuration

Enable Hardware Tx Port Timestamping

Bash
$ sudo ethtool --set-priv-flags enp3s0f0s4 tx_port_ts on
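
To confirm that the flag was applied, the private flags can be read back with ethtool --show-priv-flags. The helper below is a minimal sketch: the function name tx_port_ts_on is hypothetical, and it assumes the usual "flag_name : on|off" line layout of that command's output.

```shell
# Hypothetical helper: reads `ethtool --show-priv-flags <iface>` output on
# stdin and succeeds only if the tx_port_ts flag is reported as "on".
# Assumes the usual "flag_name : on|off" line layout.
tx_port_ts_on() {
    awk -F':' '{ name = $1; gsub(/[ \t]/, "", name) }
               name == "tx_port_ts" { state = $2; gsub(/[ \t]/, "", state) }
               END { if (state == "on") exit 0; exit 1 }'
}

# Usage on the DPU (interface name taken from the example above):
#   sudo ethtool --show-priv-flags enp3s0f0s4 | tx_port_ts_on && echo "enabled"
```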

Configure IP Address for Interface

Bash
$ sudo ifconfig enp3s0f0s4 <ip-addr> up

Container Runtime Commands

When deploying a new container, it is recommended to follow this procedure and confirm that each step of the deployment succeeds before moving on to the next:

View Currently Active Pods and their IDs

$ sudo crictl pods


It may take up to 20 seconds for the pod to start.

When deploying a new container, look for a corresponding entry line in the command's output:

POD ID              CREATED             STATE               NAME                                     NAMESPACE           ATTEMPT             RUNTIME
06bd84c07537e       4 seconds ago       Ready               doca-firefly-my-dpu                      default             0                   (default)
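
Because the pod can take several seconds to appear, a script may prefer to poll rather than check once. The following is a minimal sketch under stated assumptions: the function names pod_state and wait_for_pod are hypothetical, and the parsing relies on the column order shown above (STATE immediately before NAME).

```shell
# Hypothetical helper: reads `crictl pods` output on stdin and prints the
# STATE of the pod called $1. The CREATED column has a variable number of
# words, so the state is taken as the field just before the pod name.
pod_state() {
    awk -v name="$1" '{ for (i = 2; i <= NF; i++)
                            if ($i == name) { print $(i - 1); exit } }'
}

# Poll once per second until the pod is Ready or the number of tries
# (default: 20, matching the startup time noted above) is exhausted.
wait_for_pod() {
    name="$1"
    tries="${2:-20}"
    while [ "$tries" -gt 0 ]; do
        [ "$(sudo crictl pods | pod_state "$name")" = "Ready" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage: wait_for_pod doca-firefly-my-dpu || echo "pod did not become Ready"
```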

Review Kubelet Logs

If no matching line appears, it is recommended to check the Kubelet logs for more details about the error:

$ sudo journalctl -u kubelet --since -5m

Once the issue is resolved, proceed to the next steps. 

Verify Download of Container Image from NGC

Verify that the container image is successfully downloaded from NGC into the DPU's container registry (download time may vary based on the size of the container image):

$ sudo crictl images

Example output:

IMAGE                              TAG                 IMAGE ID            SIZE
k8s.gcr.io/pause                   3.2                 2a060e2e7101d       251kB
nvcr.io/nvidia/doca/doca_firefly   1.1.0-doca2.0.2     134cb22f34611       87.4MB


View Currently Active Containers

View currently active containers and their IDs: 

$ sudo crictl ps

Once again, find a corresponding entry line for the deployed container (boot time may vary depending on the container's image size):

CONTAINER           IMAGE               CREATED             STATE               NAME                     ATTEMPT             POD ID              POD
b505a05b7dc23       134cb22f34611       4 minutes ago       Running             doca-firefly             0                   06bd84c07537e       doca-firefly-my-dpu

If no matching container is found, review the list of all recent container deployments:

$ sudo crictl ps -a

It is possible that the container encountered an error during boot and exited right away:

CONTAINER           IMAGE               CREATED             STATE               NAME                     ATTEMPT             POD ID              POD
de2361ec15b61       134cb22f34611       1 second ago        Exited              doca-firefly             1                   4aea5f5adc91d       doca-firefly-my-dpu
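
Since the logs of an exited container are only available for a short time, it can help to collect the IDs of exited containers automatically. The helper below is a minimal sketch: the function name exited_containers is hypothetical, and it assumes the column layout shown above, with the container ID as the first field.

```shell
# Hypothetical helper: reads `crictl ps -a` output on stdin and prints the
# ID of every container whose STATE column is "Exited".
# Container IDs are the first field; STATE is matched as a whole field.
exited_containers() {
    awk 'NR > 1 { for (i = 2; i <= NF; i++)
                      if ($i == "Exited") { print $1; break } }'
}

# Usage: print the logs of each exited container while they are still available.
#   for id in $(sudo crictl ps -a | exited_containers); do
#       sudo crictl logs "$id"
#   done
```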

Review Logs of a Container

During the container's runtime, and for a short timespan after it exits, you can view the container's logs that were printed to the standard output:

$ sudo crictl logs <container-id>

In this example, the log shows that the wrong configuration was passed to the container (the requested PTP interface does not exist):

$ sudo crictl logs de2361ec15b61
Starting DOCA Firefly - Version 1.1.0
...
Requested the following PTP interface: p10
Failed to find interface "p10". Aborting


For additional information and guides on using crictl, refer to the Kubernetes guide "Debugging Kubernetes Nodes with crictl".

Stop a Running Container

The recommended way to stop a pod and its containers is as follows:

  1. Delete the .yaml configuration file for Kubelet to stop the pod: 

    $ rm /etc/kubelet.d/<file name>.yaml
    


  2. Stop the pod directly (only if it still shows "Ready"):

    $ sudo crictl stopp <pod-id>
    


  3. Once the pod stops, it may also be necessary to stop the container itself: 

    $ sudo crictl stop <container-id>
    


Logging and Counters

DOCA Firefly generates multiple log files, each corresponding to a specific module:

Runtime (Administrator) Logs

  • Main container log: /var/log/doca/firefly/firefly.log

  • ptp4l log: /var/log/doca/firefly/ptp4l.log

  • phc2sys log: /var/log/doca/firefly/phc2sys.log

  • SyncE log: /var/log/doca/firefly/synced.log

Developer Logs

  • Firefly (PTP) Monitor log: /var/log/doca/firefly/firefly_monitor_dev.log
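
A quick way to triage the main log is to extract only its warning and error lines. The following is a minimal sketch: the function name firefly_problems is hypothetical, and the pattern assumes the "<date> <time> - Firefly - <module> - <LEVEL> - <message>" layout seen in the Firefly log excerpts in this guide. Logs written by other tools, such as ptp4l, use a different format.

```shell
# Hypothetical helper: prints WARNING and ERROR lines from a Firefly log.
# Assumes the " - LEVEL - " field layout used by firefly.log; logs written
# by other tools (e.g. ptp4l) use a different format and will not match.
firefly_problems() {
    grep -E ' - (WARNING|ERROR) +- ' "${1:-/var/log/doca/firefly/firefly.log}"
}

# Usage: firefly_problems
```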

Debug Info Package

DOCA Firefly operates as a containerized DOCA Service and does not require separate packages for installation.

Nonetheless, the service offers enhanced debugging capabilities for the finalized configuration file. Detailed instructions on how to utilize these debugging features are provided in the relevant section of the NVIDIA DOCA Firefly Service Guide.

Scenarios

When troubleshooting container deployment issues, it is highly recommended to follow the deployment steps and tips in the "Review Container Deployment" section of the NVIDIA DOCA Container Deployment Guide.

Debugging config File

To debug the finalized configuration file used by Firefly, users can connect to the container as follows:

  1. Open a shell session on the running container using the container ID:

    Bash
    $ sudo crictl exec -it <container-id> /bin/bash
    


  2. Once connected to the container, the finalized configuration file can be found under the /tmp directory, using the same filename as the original configuration file.

    More information regarding the configuration files can be found under section "Ensuring and Debugging Correctness of Config File" in the service guide.


Pod is Marked as "Ready" and No Container is Listed

Error

When deploying the container, the pod's STATE is marked as Ready and an image is listed, but no container can be seen running:

Bash
$ sudo crictl pods
POD ID              CREATED             STATE               NAME                                     NAMESPACE           ATTEMPT             RUNTIME
06bd84c07537e       4 seconds ago       Ready               doca-firefly-my-dpu                      default             0                   (default)

$ sudo crictl images
IMAGE                              TAG                 IMAGE ID            SIZE
k8s.gcr.io/pause                   3.2                 2a060e2e7101d       251kB
nvcr.io/nvidia/doca/doca_firefly   1.1.0-doca2.0.2     134cb22f34611       87.4MB

$ sudo crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                     ATTEMPT             POD ID              POD

Solution

In most cases, the container did start but exited immediately. This can be checked using the following command:

Bash
$ sudo crictl ps -a
CONTAINER           IMAGE               CREATED             STATE               NAME                     ATTEMPT             POD ID              POD
556bb78281e1d       134cb22f34611       7 seconds ago       Exited              doca-firefly             1                   06bd84c07537e       doca-firefly-my-dpu

Should the container fail (i.e., its state is Exited), it is recommended to examine Firefly's main log at /var/log/doca/firefly/firefly.log.

In addition, for a short period of time after termination, the container logs can also be viewed using the container's ID:

Bash
$ sudo crictl logs 556bb78281e1d
Starting DOCA Firefly - Version 1.1.0
...
Requested the following PTP interface: p10
Failed to find interface "p10". Aborting

Custom Config File is Not Found

Error

When DOCA Firefly is deployed using a custom configuration file, a deployment error occurs and the following log message appears:

Bash
...
2023-09-07 14:04:23 - Firefly - Init    - ERROR    - Custom config file not found: my_file.conf. Aborting
...

Solution

Check the custom file name written in the YAML file and make sure that you properly placed the file with that name under the /etc/firefly/ directory of the DPU.
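
This check can be scripted before deployment. The helper below is a minimal sketch: the function name custom_config_present and its optional directory argument are hypothetical, while the default directory /etc/firefly is the one named above.

```shell
# Hypothetical helper: succeeds if the custom config file named in the YAML
# exists in the config directory (default: /etc/firefly on the DPU).
custom_config_present() {
    file_name="$1"
    config_dir="${2:-/etc/firefly}"
    if [ -f "$config_dir/$file_name" ]; then
        echo "found: $config_dir/$file_name"
    else
        echo "missing: $config_dir/$file_name" >&2
        return 1
    fi
}

# Usage (file name taken from the log message above):
#   custom_config_present my_file.conf
```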

Profile is Not Supported

Error

When DOCA Firefly is deployed, a deployment error occurs and the following log message appears:

Bash
...
2023-09-07 14:04:23 - Firefly - Init    - ERROR    - profile <name> is not supported. Aborting
...

Solution

Verify that the profile selected in the YAML file matches one of the supported profiles as listed in the profiles table.

The profile name is case sensitive. The name must be specified in lower-case letters.

PPS Capability is Missing

Error

When DOCA Firefly is deployed and configured to use the PPS module, a deployment error occurs and the following log message appears:

Bash
...
2023-09-07 14:04:23 - Firefly - Init    - INFO     - Starting PPS configuration
2023-09-07 14:04:23 - Firefly - Init    - WARNING  - [-] PPS capability is missing, seems that the card doesn't support PPS
2023-09-07 14:04:23 - Firefly - Init    - INFO     - capabilities:
2023-09-07 14:04:23 - Firefly - Init    - INFO     -   50000000 maximum frequency adjustment (ppb)
2023-09-07 14:04:23 - Firefly - Init    - INFO     -   0 programmable alarms
2023-09-07 14:04:23 - Firefly - Init    - INFO     -   0 external time stamp channels
2023-09-07 14:04:23 - Firefly - Init    - INFO     -   0 programmable periodic signals
2023-09-07 14:04:23 - Firefly - Init    - INFO     -   0 pulse per second
2023-09-07 14:04:23 - Firefly - Init    - INFO     -   0 programmable pins
2023-09-07 14:04:23 - Firefly - Init    - INFO     -   0 cross timestamping
...

Solution

This log indicates that the DPU hardware does not support PPS. However, PTP can still run on this hardware, and the line "Running ptp4l" should appear in the container log, indicating that PTP is running successfully.

Timed Out While Polling for Tx Timestamp

Error

When the BlueField is operating in DPU mode, DOCA Firefly gets stuck in a fault loop while waiting to receive the Tx timestamp events:

ptp4l[2912.797]: timed out while polling for tx timestamp
ptp4l[2912.797]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[2912.797]: port 1 (enp3s0f0s4): send sync failed
ptp4l[2923.528]: timed out while polling for tx timestamp
ptp4l[2923.528]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[2923.528]: port 1 (enp3s0f0s4): send sync failed


DOCA Firefly has a known gap leading to this error appearing once, after which ptp4l recovers from it. This section only covers the case in which there is a fault loop and no recovery occurs.
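
To tell a single, recovered occurrence apart from a fault loop, the timeout messages in the ptp4l log can be counted. The helper below is a minimal sketch and only a heuristic: the function name tx_timeout_loop is hypothetical, and it treats more than one timeout in the file as a suspected loop, which may also be triggered by entries left over from earlier runs.

```shell
# Hypothetical helper: counts Tx timestamp timeouts in a ptp4l log and
# succeeds (exit 0) when more than one is present, i.e. when the error
# repeated instead of appearing once and recovering.
tx_timeout_loop() {
    count="$(grep -c 'timed out while polling for tx timestamp' "$1" || true)"
    echo "tx timestamp timeouts: ${count:-0}"
    [ "${count:-0}" -gt 1 ]
}

# Usage: tx_timeout_loop /var/log/doca/firefly/ptp4l.log && echo "fault loop suspected"
```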

Solution

DOCA Firefly's configuration has already been adjusted to accommodate Tx port timestamping. For more information about the cause of this error and the recovery mechanism designed for it, refer to the "Tx Timestamping Support on DPU Mode" section in the NVIDIA DOCA Firefly Service Guide.

Warning - Time Jumped Backwards

Error

When using Firefly's Servo module, the following warning log message is encountered on start:

 2024-01-01 14:04:23 - Firefly - SERVO   - WARNING  - Clock is going to jump backwards in time - this might have a system-wide impact

Solution

This warning message indicates that the system's time jumped backwards by at least one minute. Firefly logs this event because such jumps might have system-wide implications. For more information, refer to the NVIDIA DOCA Troubleshooting Guide.

Such jumps can only happen during Firefly's boot, before the Servo has achieved an initial time synchronization with the reference clock.
