BlueField Troubleshooting Guide

SoC Management Interface


Preface

RShim, the SoC management interface in the BlueField System-on-Chip (SoC), enables management, monitoring, and debugging of the device. It offers key functions like firmware updates, system status checks, Arm console access, and network communication through device files (e.g., boot, misc, console) and the RShim network interface (i.e., tmfifo_net0). This guide focuses on practical usage and troubleshooting from the user's side.

Command Cheat Sheet

Command                                                         Description

rshim --version                                                 Check version

echo 'DISPLAY_LEVEL 2' > /dev/rshim0/misc                       Check RShim log
cat /dev/rshim0/misc

journalctl -u rshim > rshim_logs.txt                            Check RShim system log

journalctl > all_logs.txt                                       Check all system logs

minicom -D /dev/rshim0/console -C rshim_console.txt             Access RShim console
screen /dev/rshim0/console 115200

cat new_firmware.bfb > /dev/rshim0/boot                         Update BlueField firmware (BFB) locally
dd if=new_firmware.bfb of=/dev/rshim0/boot bs=1M
bfb-install -b /tmp/new_firmware.bfb -r /dev/rshim0

scp new_firmware.bfb root@<bf-bmc-hostname>:/dev/rshim0/boot    Update BlueField firmware (BFB) remotely
bfb-install -b new_firmware.bfb -r 15.22.111.63:rshim0

ifconfig tmfifo_net0 192.168.100.2 netmask 255.255.255.252 up   Configure the RShim network interface

Logging and Counters

RShim logging uses an internal 1KB hardware buffer to track booting progress and record important messages. It is written by the NVIDIA BlueField Arm cores and is displayed by the RShim driver from the USB/PCIe host machine.

The RShim log messages can be displayed as follows:

  1. Check the DISPLAY_LEVEL setting in the /dev/rshim0/misc file:

    # cat /dev/rshim0/misc
    DISPLAY_LEVEL   0 (0:basic, 1:advanced, 2:log)
    …
    


  2. Set DISPLAY_LEVEL to 2:

    # echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
    


  3. Log messages are displayed in the misc file. The following is an example output from BlueField-2:

    # cat /dev/rshim0/misc
    ...
    ---------------------------------------
    	Log Messages
    ---------------------------------------
     INFO[BL2]: start
     INFO[BL2]: no DDR on MSS0
     INFO[BL2]: calc DDR freq (clk_ref 53836948)
     INFO[BL2]: DDR POST passed
     INFO[BL2]: UEFI loaded
     INFO[BL31]: start
     INFO[BL31]: runtime
     INFO[UEFI]: eMMC init
     INFO[UEFI]: eMMC probed
     INFO[UEFI]: PCIe enum start
     INFO[UEFI]: PCIe enum end
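
Once DISPLAY_LEVEL is set to 2, the log section can be filtered with standard text tools. The following is a minimal sketch and not part of the RShim tooling: rshim_log_filter is a hypothetical helper name, and it assumes that each log line has the SEVERITY[STAGE]: message layout shown in the example output above.

```shell
#!/bin/sh
# rshim_log_filter (hypothetical helper): read the contents of the misc file
# on stdin and print only the log lines for one boot stage, e.g. BL2 or UEFI.
# Assumes the "SEVERITY[STAGE]: message" layout shown in the sample output.
rshim_log_filter() {
    stage="$1"
    # Keep lines made of optional indentation, an upper-case severity word,
    # and the exact [STAGE]: tag; then strip the leading indentation.
    grep -E "^[[:space:]]*[A-Z]+\[${stage}]:" | sed 's/^[[:space:]]*//'
}

# Typical use on a machine that owns the RShim device (requires root):
#   echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
#   rshim_log_filter UEFI < /dev/rshim0/misc
```

Because the stage tag is matched together with its closing bracket, filtering on BL2 does not also return BL31 lines.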
    


The BFB installation flow can be traced using the following interfaces:

  • From the host – 

    • RShim console (/dev/rshim0/console)

    • RShim log buffer (/dev/rshim0/misc); also included in bfb-install's output

    • UART console (/dev/ttyUSB0)

  • From the BMC console – 

    • SSH to the BMC and run obmc-console-client
      Additional information about BMC interfaces is available in BMC software documentation.

  • From the BlueField – 

    • /root/<OS>.installation.log available on the DPU OS after installation

Debug Info Package

Non-secure BlueField devices support GDB debugging through OpenOCD. BlueField RShim support has been upstreamed to the OpenOCD project, which implements a GDB server for BlueField debugging. OpenOCD can use the RShim driver to access the Arm debug access port (DAP) on the BlueField SoC directly. For more information, refer to the documentation in /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-26/build/install/Documentation/HOWTO-openocd, which also describes how to use GDB to debug the Linux kernel.

  1. To get started, boot the BlueField with the EFI stub debug image to reproduce the crash and halt the system when the Synchronous Exception occurs. It is also possible to add an infinite loop to the code where attaching the debugger is desired, and to then manually set the program counter to jump past the loop.

  2. Run GDB and OpenOCD on the host server machine connected to the BlueField. It is best practice to copy the OpenOCD binary and config files to a separate directory so the config can be edited as needed:

    # Create writeable OpenOCD copy. Edit target/bluefield.cfg to specify which rshim device to use.
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp openocd ~/james/openocd/
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp interface/rshim.cfg ~/james/openocd/interface/
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp target/bluefield.cfg ~/james/openocd/target/
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd# cp board/bluefield.cfg ~/james/openocd/board/
    
    # Run OpenOCD (GDB server communicating with BF through rshim) in one window
    root@bu-lab102:~/james/openocd# ./openocd -f board/bluefield.cfg
    
    # In another window source toolchain and run GDB client
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# ./poky-glibc-x86_64-core-image-initramfs-aarch64-bluefield-toolchain-BlueField-4.7.0.13127.2.7.4.sh
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# . /opt/poky/2.7.4/environment-setup-aarch64-poky-linux
    root@bu-lab102:/auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/dist# aarch64-poky-linux-gdb
    


  3. In the GDB client window, run the following:

    # Connect to GDB server and set remote timeout (seconds)
    (gdb) target extended-remote :3333
    (gdb) set remotetimeout 60
    
    # Source helpful debug functions
    (gdb) source /auto/sw_soc_dev/bluefield-rel-4.7.0/last/build/install/lib/openocd/scripts/bfdbg.py
    
    # Available commands
    (gdb) bf-help
     bf-edk2 symbol [all]                 -- Load symbols
     bf-info                              -- Display info
     bf-mmu <virt2phys | lookup> <vaddr>  -- MMU operation
     bf-reg [<reg-name [value]> | all]    -- Show/Set registers
    
    # Verify the BF is in EDK2 mode (may need to reboot and restart if not)
    (gdb) bf-info
    PC = 0x45fee7294, EL = 2
    EDK2
    
    # Load all EDK2/UEFI symbols (this can take a while)
    (gdb) bf-edk2 symbol all
    
    # Now can look at the backtrace with symbol information
    (gdb) bt
    #0  0x000000045fee6680 in CpuDeadLoop () at /home/scratch/james/build/edk2/edk2/MdePkg/Library/BaseLib/CpuDeadLoop.c:31
    #1  0x000000045fee6a14 in DefaultExceptionHandler (ExceptionType=<optimized out>, SystemContext=...) at /home/scratch/james/build/edk2/edk2/MlxPlatformPkg/Library/DefaultExceptionHandlerLib/AArch64/DefaultExceptionHandler.c:336
    #2  0x000000045fee7340 in ExceptionHandlersEnd ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)
    

    The backtrace above does not provide much helpful information in this case (it shows the device is halted in the EDK2 exception handler), but may be useful depending on the issue.

  4. The RShim log can provide the PC address:

    Synchronous Exception at 0x459B89420
    
     ERR[UEFI]: PC=0x459B89420(B900003F D5033F9F 94000076 34000960)
     ERR[UEFI]: PC=0x459B88F48
     ERR[UEFI]: PC=0x459B84998
     ERR[UEFI]: PC=0x98D05A68 (0x13A68) [ 1] DxeCore.dll
     ERR[UEFI]: PC=0x45A7973A8 (0x103A8) [ 2] BdsDxe.dll
     ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568
     ERR[UEFI]: X4=0x99FFF570 X5=0x82000000 X6=0x45F2363C0 X7=0x11A18F858A986D85
     ERR[UEFI]: X8=0x4A3823DC9042A9DE X9=0x4D54D42AC44A6076 X10=0x1 X11=0x99FFF3F7
     ERR[UEFI]: X12=0x45F2AC018 X13=0x99FFF3F8 X14=0x1 X15=0x88000C40
    


  5. Disassemble the instructions starting at that address (x /32i displays 32 instructions):

    (gdb) x /32i 0x459B89420
       0x459b89420: str     wzr, [x1]
       0x459b89424: dsb     sy
       0x459b89428: bl      0x459b89600
       0x459b8942c: cbz     w0, 0x459b89558
       0x459b89430: bl      0x459b89610
       0x459b89434: cbz     w0, 0x459b89544
       0x459b89438: and     x1, x19, #0xffffffffffe00000
       0x459b8943c: adrp    x26, 0x45a1e2000
    ...
    

    The first instruction, str wzr, [x1], stores zero (wzr is the zero register) to the memory address held in register x1 (the brackets denote a dereference). This suggests that x1 holds an address that cannot be written to. The RShim log prints this register, and its value is 0x400000 (secure RAM that the executing code cannot write to, which causes the synchronous exception):

    ERR[UEFI]: X0=0x45FFE0018 X1=0x400000 X2=0x99FFF548 X3=0x99FFF568
    

    Note that this X1 value is part of the EDK2 EFI_SYSTEM_CONTEXT_AARCH64 structure, and the assembly code can be read to determine which register it is stored in for further debugging if needed.

  6. Various system registers can also be inspected with GDB (refer to /auto/sw_soc_dev/bluefield-rel-4.7.0/2024-04-30/build/install/lib/openocd/scripts/aarch64.py or Arm spec for list of relevant register names to use):

    (gdb) bf-reg ttbr0_el2
    ttbr0_el2 = 0x99feb000
    
    (gdb) info reg
    ...
    


    Some registers may not be accessible, depending on the current exception level (refer to the Arm specifications for more information).
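
When several registers need to be checked against the disassembly, the lookup in the RShim log can be scripted. The following is a minimal sketch under stated assumptions: rshim_reg_value is a hypothetical helper name, and it assumes the ERR[UEFI]: X0=0x... X1=0x... layout shown in step 4.

```shell
#!/bin/sh
# rshim_reg_value (hypothetical helper): read RShim log text on stdin and
# print the value logged for one general-purpose register, e.g. X1.
# Assumes the "ERR[UEFI]: X0=0x... X1=0x..." layout shown in step 4.
rshim_reg_value() {
    reg="$1"
    # Put every space-separated token on its own line, then keep the first
    # token whose name is exactly the requested register, so that X1 does
    # not also match X10 through X15.
    tr ' ' '\n' | awk -F= -v r="$reg" '$1 == r { print $2; exit }'
}

# Typical use on a machine that owns the RShim device (requires root):
#   rshim_reg_value X1 < /dev/rshim0/misc
```

For the log excerpt in step 4, requesting X1 prints 0x400000, which is the address that the faulting store instruction dereferences.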


Using Breakpoints

With BlueField, use hardware breakpoints (hbreak) rather than software breakpoints, because inserting software breakpoints can cause issues. To demonstrate breakpoint usage, the following example adds an infinite loop to the code before the point where the crash occurs, so that the debugger can be attached and breakpoints can be added. The following diff has been applied to the test/crash image:

Diff
--- a/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
+++ b/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c
@@ -1790,6 +1790,8 @@ EfiBootManagerBoot (
     return;
   }

+  __asm__ volatile("b .");
+
...

OpenOCD SMP support also has to be disabled for hardware breakpoints to avoid halting all cores. Make the following change to your target/bluefield.cfg:

Diff
 # Configure SMP
 if { $_cores > 1 } {
-    eval $_smp_command
+#    eval $_smp_command
 }

Load the new test image and follow the previous instructions for attaching OpenOCD and GDB and loading EDK2 symbols. Make sure to attach to the port for a specific core.

Users may have better luck installing a preboot-install.bfb with the infinite loop and booting from flash rather than RShim because the code would not continue executing after jumping past the loop if the RShim installation times out. To reproduce the issue above, this would also mean installing the Linux image to flash.

Verify the system has stopped at the expected location:

(gdb) where
#0  0x000000045a796d60 in EfiBootManagerBoot (BootOption=BootOption@entry=0x99fff968) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Library/UefiBootManagerLib/BmBoot.c:1793
...
(gdb) stepi
0x000000045a796d60      1793      __asm__ volatile("b .");

At this point, hardware breakpoints can be added using symbol names:

# Adding breakpoint to a spot close to crash
(gdb) hbreak /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
Hardware assisted breakpoint 1 at 0x98d05a58: file /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c, line 1654.

# Use 'delete <n>' to delete breakpoint number n
(gdb) info b
Num     Type           Disp Enb Address            What
1       hw breakpoint  keep y   0x0000000098d05a58 in CoreStartImage at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654

After breakpoints have been added, the following can be done to move the program counter past the infinite loop (a single 4-byte instruction) and continue execution:

(gdb) set $pc+=4
(gdb) c
Continuing.

Breakpoint 1, CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
1654        Image->Status = Image->EntryPoint (ImageHandle, Image->Info.SystemTable);

(gdb) where
#0  CoreStartImage (ImageHandle=0x45e205c98, ExitDataSize=0x45e205068, ExitData=0x45e205060) at /home/scratch/james/build/edk2/edk2/MdeModulePkg/Core/Dxe/Image/Image.c:1654
...

Many of the normal GDB commands are supported. 

Sometimes adding breakpoints can cause boot issues, and if the breakpoints cannot be deleted with GDB a hard reboot may be needed to recover.


OpenOCD logs how many hardware breakpoints are available:

Info : bluefield.cpu0: hardware has 6 breakpoints, 4 watchpoints
Info : bluefield.cpu1: hardware has 6 breakpoints, 4 watchpoints
...
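
Because the number of hardware breakpoints is limited, it can be useful to check the per-core limits before adding several of them. The following is a minimal sketch: openocd_hw_breakpoints is a hypothetical helper name, and it assumes the Info : line layout shown above.

```shell
#!/bin/sh
# openocd_hw_breakpoints (hypothetical helper): read OpenOCD log text on stdin
# and print "<target> <breakpoints> <watchpoints>" for every core that reports
# its hardware debug resources. Assumes the "Info : ..." layout shown above.
openocd_hw_breakpoints() {
    sed -n 's/^Info : \([^:]*\): hardware has \([0-9][0-9]*\) breakpoints, \([0-9][0-9]*\) watchpoints.*$/\1 \2 \3/p'
}

# Typical use, assuming the OpenOCD output was saved to a file named
# openocd.log (hypothetical name):
#   openocd_hw_breakpoints < openocd.log
```

Lines that do not report breakpoint resources are ignored, so the full OpenOCD output can be passed in unmodified.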


Scenarios

Another Backend Already Attached

BlueField devices are equipped with a USB interface through which RShim can be routed, via USB cable, to an external host running Linux and the RShim driver. In this case, typically following a system reboot, RShim over USB takes precedence and the BlueField host reports the RShim status as another backend already attached. This is correct behavior, since only one RShim back end can be active at any given time. However, it means that the BlueField host does not own RShim access. To debug an issue, the user may need to access RShim from the BlueField BMC or host while RShim is attached to the other side (host or BMC, respectively).

The user is able to reclaim RShim ownership safely without logging into the other side:

  1. Stop the RShim driver on the remote Linux host. Run:

    systemctl stop rshim
    systemctl disable rshim
    


  2. Restart RShim on the BlueField host. Run:

    systemctl enable rshim
    systemctl start rshim
    


The another backend already attached error can also occur when the RShim back end is owned by the BMC on BlueField devices with an integrated BMC. This case is covered in the sections below.

RShim Driver Not Loading

Verify whether your BlueField features an integrated BMC or not. Run:

# sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"

Example output for a BlueField with an integrated BMC:

Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL

If your BlueField has an integrated BMC, refer to RShim Driver Not Loading on DPU with Integrated BMC.

If your BlueField does not have an integrated BMC, refer to RShim Driver Not Loading on Host on DPU Without Integrated BMC.
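
The check above can be wrapped in a small helper so that scripts can branch on the result. The following is a minimal sketch: has_integrated_bmc is a hypothetical helper name, and it assumes that the Product Name line contains the wording integrated BMC, as in the example output.

```shell
#!/bin/sh
# has_integrated_bmc (hypothetical helper): read lspci -vvv output on stdin and
# succeed (exit status 0) when the Product Name line mentions an integrated BMC.
# Assumes the wording "integrated BMC" used in the example output above.
has_integrated_bmc() {
    grep "Product Name" | grep -qi "integrated BMC"
}

# Typical use on the host:
#   dev=$(sudo lspci -d 15b3: | head -1 | awk '{print $1}')
#   if sudo lspci -s "$dev" -vvv | has_integrated_bmc; then
#       echo "integrated BMC present"
#   fi
```

Only lines containing Product Name are inspected, so other occurrences of BMC in the lspci output do not affect the result.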

RShim Driver Not Loading on DPU with Integrated BMC

RShim Driver Not Loading on Host

  1. Access the BMC via the RJ45 management port of the BlueField.

  2. Disable RShim on the BMC:

    systemctl stop rshim
    systemctl disable rshim
    


  3. Enable RShim on the host: 

    systemctl enable rshim
    systemctl start rshim
    


  4. Restart the RShim service. Run:

    sudo systemctl restart rshim
    

    To verify that the RShim service has launched, run:

    sudo systemctl status rshim
    

    This command is expected to display active (running).

  5. Display the current setting. Run: 

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME        pcie-04:00.2 (ro)
    

    This output indicates that the RShim service is ready to use.

RShim Driver Not Loading on BMC

  1. Verify that the RShim service is not running on host. Run: 

    systemctl status rshim
    

    If the output is active, then it may be presumed that the host has ownership of the RShim.

  2. Disable RShim on the host. Run:

    systemctl stop rshim
    systemctl disable rshim
    


  3. Enable RShim on the BMC. Run:

    systemctl enable rshim
    systemctl start rshim
    


  4. Display the current setting. Run: 

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME        usb-1.0
    

    This output indicates that the RShim service is ready to use.
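
Both procedures above end by reading DEV_NAME, whose prefix shows which transport the local RShim driver is using (pcie on the host, usb on the BMC in the examples). The following is a minimal sketch: rshim_backend is a hypothetical helper name, and it assumes the DEV_NAME layout shown in the example outputs.

```shell
#!/bin/sh
# rshim_backend (hypothetical helper): read the contents of the misc file on
# stdin and print the transport named in DEV_NAME ("pcie" or "usb" in the
# examples in this guide).
rshim_backend() {
    # DEV_NAME is followed by whitespace and a name such as pcie-04:00.2 or
    # usb-1.0; the transport is the part before the first "-".
    awk '$1 == "DEV_NAME" { split($2, parts, "-"); print parts[1]; exit }'
}

# Typical use (replace <N> with the RShim device index):
#   rshim_backend < /dev/rshim<N>/misc
```

An empty result means that no DEV_NAME line was found in the input.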

RShim Driver Not Loading on Host on DPU Without Integrated BMC

  1. Download the suitable DEB/RPM package for the RShim driver (the management interface for the DPU from the host).

  2. Reinstall RShim package on the host.

    • For Ubuntu/Debian, run:

      sudo dpkg --force-all -i rshim-<version>.deb
      


    • For RHEL/CentOS, run: 

      sudo rpm -Uhv rshim-<version>.rpm
      


  3. Restart the RShim service. Run:

    sudo systemctl restart rshim
    

    To verify that the RShim service has launched, run:

    sudo systemctl status rshim
    

    This command is expected to display active (running).

  4. Display the current setting. Run: 

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME        pcie-04:00.2 (ro)
    

    This output indicates that the RShim service is ready to use.

RShim Failed to Set Up CUSE RShim Error

Symptom

When starting the rshim service, the systemd journal may show an error similar to:

$ sudo systemctl status rshim
...
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com systemd[1]: Starting rshim driver for BlueField SoC...
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com systemd[1]: Started rshim driver for BlueField SoC.
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: Created PID file: /var/run/rshim.pid
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: Probing pcie-0000:b1:00.2(uio)
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: Create rshim pcie-0000:b1:00.2
Apr 30 14:08:20 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: pcie-0000:b1:00.2 enable
Apr 30 14:08:21 bu-lab105.wes-a.nbulabs.nvidia.com rshim[13899]: rshim1 failed to setup CUSE rshim
...

Cause

The rshim driver depends on the cuse.ko kernel module, which is typically provided by the kernel-modules-extra package. This package is usually installed as a dependency during the RShim RPM or DEB installation.

However, on some RHEL- or Rocky Linux-based systems, this dependency may not be enforced, resulting in a missing cuse.ko module and a failure during RShim initialization.

Solution

Installing kernel-modules-extra may trigger a kernel upgrade if your current kernel version is not available in the configured repositories. For example, installing this package may update the kernel from 5.14.0-570.4.1 to 5.14.0-570.12.1. This may also pull in related packages such as kernel, kernel-core, and kernel-modules.

  1. Install kernel-modules-extra. For RHEL/Rocky Linux systems, install the package using:

    sudo dnf install kernel-modules-extra

  2. Load the cuse module. If the installed kernel-modules-extra matches the currently running kernel, you can load the cuse.ko module:

    sudo modprobe cuse

    If no errors are reported, the cuse module is now available for RShim.

  3. Restart the RShim service. Once cuse is loaded, restart the RShim service:

    sudo systemctl restart rshim

    You should no longer see the failed to setup CUSE rshim error.

Additional Notes

  • If modprobe cuse fails with a message about a missing module, it likely means the newly installed kernel-modules-extra version does not match the currently running kernel.

  • In this case, reboot the system to use the updated kernel: 

    sudo reboot
    
  • After reboot, verify the running kernel version: 

    uname -r

  • Ensure it matches the version of kernel-modules-extra that was installed.

  • In rare cases, you may need to adjust the GRUB configuration to ensure the system boots into the new kernel automatically: 

    sudo grub2-set-default 0
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
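
The version comparison described in the notes above can be scripted. The following is a minimal sketch: modules_extra_matches_kernel is a hypothetical helper name, and it assumes the usual RPM naming in which rpm -q kernel-modules-extra prints kernel-modules-extra-<release>, where <release> has the same form as the output of uname -r.

```shell
#!/bin/sh
# modules_extra_matches_kernel (hypothetical helper): succeed when an installed
# kernel-modules-extra package was built for the given kernel release.
#   $1: kernel release, as printed by `uname -r`
#   $2: one or more package names, as printed by `rpm -q kernel-modules-extra`
# Assumes the usual RPM naming, kernel-modules-extra-<release>.
modules_extra_matches_kernel() {
    running="$1"
    for pkg in $2; do
        # Strip the package-name prefix; the remainder is the kernel release.
        if [ "${pkg#kernel-modules-extra-}" = "$running" ]; then
            return 0
        fi
    done
    return 1
}

# Typical use on a RHEL/Rocky Linux host:
#   modules_extra_matches_kernel "$(uname -r)" "$(rpm -q kernel-modules-extra)" \
#       || echo "kernel-modules-extra does not match the running kernel"
```

If the helper fails, rebooting into the kernel that matches the installed package, as described above, is the expected fix.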
    


Failed to Read IOMMU Link Error

The following is an informational message printed by the RShim driver when it tries to access the device via IOMMU:

rshim service: /sys/bus/pci/devices/0000:01:00.2/iommu_group: failed to read iommu link

The RShim driver probes the device using the following mechanisms, in order: IOMMU, UIO, and direct map. It continues probing until one mechanism succeeds, so the failure of a single mechanism does not mean that the RShim driver has failed, unless that mechanism is strictly required (for example, IOMMU when Linux kernel lockdown is enabled).
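
To see which mechanism the driver attempted, the probe lines in the journal can be extracted. The following is a minimal sketch: rshim_probe_mechanisms is a hypothetical helper name, and it assumes that the parenthesized suffix in lines such as Probing pcie-0000:b1:00.2(uio), as seen in the journal sample earlier in this guide, names the access mechanism.

```shell
#!/bin/sh
# rshim_probe_mechanisms (hypothetical helper): read rshim journal text on
# stdin and print "<device> <mechanism>" for every probe attempt.
# Assumes lines of the form "Probing pcie-0000:b1:00.2(uio)", where the
# parenthesized suffix is taken to name the access mechanism.
rshim_probe_mechanisms() {
    sed -n 's/^.*Probing \([^ (]*\)(\([^)]*\)).*$/\1 \2/p'
}

# Typical use:
#   journalctl -u rshim | rshim_probe_mechanisms
```

If a later probe line appears for the same device, the driver moved on to the next mechanism after the earlier one failed, which is consistent with the informational nature of the IOMMU message.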

Change Ownership of RShim from NIC BMC to Host

  1. Verify that your BlueField has an integrated BMC. Run the following on the host: 

    # sudo lspci -s $(sudo lspci -d 15b3: | head -1 | awk '{print $1}') -vvv | grep "Product Name"
    Product Name: BlueField-2 DPU 25GbE Dual-Port SFP56, integrated BMC, Crypto and Secure Boot Enabled, 16GB on-board DDR, 1GbE OOB management, Tall Bracket, FHHL
    

    The product name should include integrated BMC.

  2. Access the BMC via the RJ45 management port of the BlueField.

  3. Disable RShim on the BMC:

    systemctl stop rshim
    systemctl disable rshim
    


  4. Enable RShim on the host: 

    systemctl enable rshim
    systemctl start rshim
    


  5. Restart the RShim service. Run:

    sudo systemctl restart rshim
    

    To verify that the RShim service has launched, run:

    sudo systemctl status rshim
    

    This command is expected to display active (running).

  6. Display the current setting. Run: 

    # cat /dev/rshim<N>/misc | grep DEV_NAME
    DEV_NAME        pcie-04:00.2 (ro)
    

    This output indicates that the RShim service is ready to use.
