BlueField Troubleshooting Guide

Virtio-blk


Preface

This page helps SNAP users and developers troubleshoot and resolve common issues when working with SNAP containers or source packages.

Consult this page only after reviewing the latest BlueField SNAP for NVMe and Virtio-blk Documentation.

Command Cheat Sheet

Verbosity Level

Examine the logs of the SNAP container:

crictl logs -f $(crictl ps -s running -q --name snap)

SNAP RPC Help

Use snap_rpc.py --help to review the supported RPCs and learn the parameters per command.

SNAP State Query

List Emulation Devices

Use the following command to list emulation devices:

snap_rpc.py emulation_function_list

For detailed information, refer to emulation_function_list.

List Virtio-Blk Controllers and Their Configurations

To list Virtio-Blk controllers and view the configurations and state for each one, use the following command:

snap_rpc.py virtio_blk_controller_list

For detailed information, refer to virtio_blk_controller_list.

List NVMe Subsystems, Controllers and Namespaces

To list NVMe subsystems, controllers, and namespaces under each subsystem, including their configurations and state, use the following command: 

snap_rpc.py nvme_subsystem_list

For detailed information, refer to nvme_subsystem_list.

List NVMe Controllers Including Configurations and State

To list NVMe controllers along with their configurations and state, use the following command: 

snap_rpc.py nvme_controller_list

For detailed information, refer to nvme_controller_list.

List NVMe Namespaces

To list NVMe namespaces, use the following command: 

snap_rpc.py nvme_namespace_list

For detailed information, refer to nvme_namespace_list.

List SPDK BDevs

To list SPDK BDevs, use the following command: 

spdk_rpc.py bdev_get_bdevs

For detailed information, refer to SPDK JSON-RPC Documentation.
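The list commands in this section can be captured in one pass when a snapshot of the SNAP state is needed. The following is a minimal sketch, assuming snap_rpc.py and spdk_rpc.py are on the PATH of the shell where it runs. The helper names capture and snap_state_snapshot and the default output directory snap_state are hypothetical and not part of SNAP.

```shell
# Hypothetical helper: run one command and save its stdout/stderr to
# <dir>/<name>.txt. A failing or missing command is noted in the file
# instead of aborting the whole snapshot.
capture() {
    cap_dir=$1
    cap_name=$2
    shift 2
    mkdir -p "$cap_dir" || return 1
    if ! "$@" > "$cap_dir/$cap_name.txt" 2>&1; then
        echo "command failed: $*" >> "$cap_dir/$cap_name.txt"
    fi
    return 0
}

# Hypothetical helper: save the output of every list command from this
# section, one file per command. Assumes both RPC scripts are on PATH.
snap_state_snapshot() {
    snap_dir=${1:-snap_state}
    capture "$snap_dir" emulation_function_list    snap_rpc.py emulation_function_list
    capture "$snap_dir" virtio_blk_controller_list snap_rpc.py virtio_blk_controller_list
    capture "$snap_dir" nvme_subsystem_list        snap_rpc.py nvme_subsystem_list
    capture "$snap_dir" nvme_controller_list       snap_rpc.py nvme_controller_list
    capture "$snap_dir" nvme_namespace_list        snap_rpc.py nvme_namespace_list
    capture "$snap_dir" bdev_get_bdevs             spdk_rpc.py bdev_get_bdevs
}

# Example: snap_state_snapshot /tmp/snap_state
```

Writing one file per command keeps each output separate, so a single failing RPC does not hide the others.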

Logging and Counters

RPC Log History

SNAP IO Level Statistics

Debug counters provide I/O statistics for each controller, offering insights into the distribution of I/O across different queues and the total I/O received by the controller.

Debug Info Package

N/A

Scenarios

For details on known bugs and limitations, please refer to the SNAP Known Issues.

SPDK and SNAP Compatibility Issues

Each SNAP container release is bundled with the latest available NVIDIA SPDK. If you need to replace the SPDK version with a custom one, follow the instructions provided here:

If a build failure occurs due to compatibility issues between SNAP and SPDK, the /service/compat/spdk folder contains the necessary infrastructure to address these compatibility issues.

SNAP Service Fails to Load Due to an Incorrect Firmware Configuration

If the firmware is not configured for NVMe or Virtio-blk, an error log message appears when attempting to load SNAP.

To enable the storage emulation based on your desired configuration, follow the instructions provided in Firmware Configuration.

Fewer Queues Created than Configured

By default, SNAP attempts to create the maximum number of queues within the MSIX limitation (up to 63).

If more queues are requested during controller creation, the number actually created is still capped by the MSIX limitation.
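As a rough illustration of the cap, the number of queues to expect is the smaller of the requested count and the limit. This sketch treats the MSIX limitation as a direct cap on the queue count, which is an assumption; this page does not spell out the exact mapping between MSIX entries and queues. The function name effective_queues is hypothetical.

```shell
# Hypothetical helper: print the smaller of the requested queue count
# and the MSIX-imposed limit (treated here as a direct queue cap).
effective_queues() {
    requested=$1
    msix_limit=$2
    if [ "$requested" -lt "$msix_limit" ]; then
        echo "$requested"
    else
        echo "$msix_limit"
    fi
}

effective_queues 128 63   # prints 63
effective_queues 16 63    # prints 16
```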

For configuring the number of MSIX entries, refer to: DPU Firmware Configuration

To dynamically manage MSIX, refer to: SR-IOV Dynamic MSIX Management

IO Failures During High Throughput with NVMeTCP Using XLIO

The default XLIO TCP configuration required for NVMeTCP is included in the SNAP container or source package. However, IO failures may still occur with NVMeTCP when scaling up tests. In that case, consult the Monitoring, Debugging, and Troubleshooting section of the NVIDIA Accelerated IO (XLIO) Documentation.

Deploying Container on Setups Without Internet

To deploy a container in environments without internet access, refer to Deploying Container on Setups Without Internet Connectivity.

Managing SNAP Service Memory Consumption

DPU memory is shared among all DPU services, and scaling the SNAP configuration may lead to memory shortages.

To understand SNAP memory configuration and usage, please refer to SNAP Memory Consumption.

Container Image Corruption

If the container image becomes corrupted and the container status shows as exited with the error message /usr/bin/supervisord: exec format error, follow these steps:

  • Remove the YAML from kubelet.

  • Use crictl images to list the images and crictl rmi <image-id> to remove the image.

  • Restart the containerd and kubelet services with systemctl restart containerd and systemctl restart kubelet, respectively.

  • Reapply/copy the YAML file to kubelet.
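The steps above can be sketched as a small script that prints each command and only executes it when asked to. The sketch makes several assumptions: the path of the SNAP YAML in the kubelet manifest directory is supplied by the caller (it is not specified on this page), "removing" the YAML is done by moving it to a backup location, and the image ID is taken from the output of crictl images. The names run, recover_snap_image, and DRY_RUN are hypothetical.

```shell
# Hypothetical helper: print a command, and execute it only when DRY_RUN=0.
# The default is a dry run, so nothing is changed by accident.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Hypothetical helper following the four steps above.
#   $1: path of the SNAP YAML as deployed to kubelet (caller-supplied)
#   $2: image ID, as shown by `crictl images`
#   $3: optional backup location for the YAML
recover_snap_image() {
    yaml=$1
    image_id=$2
    backup=${3:-/tmp/$(basename "$yaml")}
    run mv "$yaml" "$backup"            # remove the YAML from kubelet
    run crictl rmi "$image_id"          # remove the corrupted image
    run systemctl restart containerd
    run systemctl restart kubelet
    run cp "$backup" "$yaml"            # copy the YAML back to kubelet
}

# Example (prints the commands only):
#   recover_snap_image <path-to-snap-yaml> <image-id>
# Example (executes them):
#   DRY_RUN=0; recover_snap_image <path-to-snap-yaml> <image-id>
```

Reviewing the dry-run output before setting DRY_RUN=0 is a cheap way to confirm the YAML path and image ID are the intended ones.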

For additional information on container deployment and debugging, refer to SNAP Container Deployment.

Troubleshooting Issues When Enabling SR-IOV

When enabling SNAP virtual functions, ensure the host is configured according to the guidelines outlined in Host OS Configuration.

Additionally, follow the DPU firmware configuration instructions provided in SR-IOV Firmware Configuration.

Tuning the Host to Avoid Poor Performance Issues

To optimize host performance, follow the Intel configuration guidelines provided in Intel Host OS Configuration.

For AMD configurations, refer to AMD Host OS Configuration.

Recovering from a Service Crash

To recover from a service crash, review the recovery procedures and instructions for enabling recovery mode for NVMe and Virtio-Blk in Recovery.

Essential Debug Information for SNAP Service/Container Issues

To effectively debug issues encountered during the deployment and operation of the SNAP service/container, gather the following information:

From the Host Machine

  1. Host OS/kernel version, use: uname -a and cat /etc/os-release

  2. Host CPU model, use: lscpu

  3. Host commands:

    • Driver load/unload commands

    • Functions management (VF assignment, SR-IOV commands, FLR events)

    • Storage application commands (e.g., fio testing app command)

  4. Host dmesg output
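Items 1, 2, and 4 of the host list can be gathered with a short script such as the sketch below. The function name host_debug_info and the output file name are hypothetical. The host commands in item 3 (driver load/unload, function management, storage application commands) depend on the test being run and still need to be recorded by hand.

```shell
# Hypothetical helper: print the output of each host command under its
# own header. A command that is missing or fails (for example, dmesg
# without sufficient privileges) is noted and the script continues.
host_debug_info() {
    for cmd in "uname -a" "cat /etc/os-release" "lscpu" "dmesg"; do
        echo "===== $cmd ====="
        $cmd 2>&1 || echo "(command failed: $cmd)"
    done
}

# Example: host_debug_info > host_debug_info.txt
```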

From the BlueField Platform

  1. HW model, use: /hpc/local/bin/lshca

  2. BFB version, use: cat /etc/os-release

  3. SPDK version (if a non-default version is used)

  4. XLIO version (if a non-default version is used)

  5. FW version, use: mlxfwmanager

  6. FW configuration (refer to SR-IOV Firmware Configuration)

  7. Hugepage memory configuration (Refer to Allocate Hugepages under /etc/sysctl.conf)

  8. SNAP container configuration (Refer to YAML file)

  9. SNAP state, using the output of the following RPCs:

    • snap_rpc.py emulation_function_list (refer to emulation_function_list)

    • snap_rpc.py virtio_blk_controller_list (in case of Virtio-Blk, refer to virtio_blk_controller_list)

    • snap_rpc.py nvme_subsystem_list (in case of NVMe, refer to nvme_subsystem_list)

    • spdk_rpc.py bdev_get_bdevs (see SPDK JSON-RPC Documentation)

  10. The SNAP and SPDK initialization RPCs (defined in etc/nvda_snap/spdk_init.conf and etc/nvda_snap/snap_init.conf)

  11. SNAP RPC logs (check the SNAP RPC log history, refer to RPC Log History)

  12. SNAP logs

    • To export container logs if the container crashes, use ls /var/log/containers/. Logs are prefixed with the container ID (e.g., 94cdeb21b031b), visible under crictl ps. If the container restarts, you may find multiple logs.


  13. Additional logs:

    • Kubernetes logs, use: journalctl -u kubelet > <log_file_name>

    • DPU logs, check /var/log/messages and /var/log/dmesg
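For the SNAP logs item above, the log files belonging to a given container can be located by its ID. The following is a minimal sketch: the function name snap_log_files is hypothetical, and it matches the ID anywhere in the file name (not only as a prefix) in case the naming differs between setups.

```shell
# Hypothetical helper: list the files in a log directory whose names
# contain the given container ID (as shown by `crictl ps`).
#   $1: container ID, e.g. 94cdeb21b031b
#   $2: optional log directory, defaults to /var/log/containers
snap_log_files() {
    id=$1
    log_dir=${2:-/var/log/containers}
    for f in "$log_dir"/*"$id"*; do
        # When nothing matches, the pattern stays unexpanded; skip it.
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

# Example: snap_log_files 94cdeb21b031b
```

If the container has restarted, the function may print several files; collect all of them.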
