BlueField Troubleshooting Guide

Security

Preface

This guide outlines how to resolve boot failures on BlueField DPUs caused by a corrupted or missing Microsoft UEFI certificate when Secure Boot is enabled. It includes preparation steps to proactively craft a recovery image and resolution steps to restore the certificate without disabling Secure Boot.

Command Cheat Sheet

Command                                          Description
dpu_golden_image                                 BMC utility used to retrieve and store a copy of the current BlueField Arm BFB image
mlx-mkbfb                                        BlueField Arm utility used to inject EFI capsules into BFB images
cat new_arm_golden_image.bfb > /dev/rshimX/boot  Writes a BFB image from the DPU BMC to the BlueField Arm over RShim, where X is the appropriate RShim device

Logging and Counters

N/A

Debug Info Package

N/A

Scenario

Secure Boot is enabled and the Microsoft certificate in the UEFI db is corrupted or deleted

If the Microsoft UEFI certificate is missing or corrupted, the BlueField Arm will fail to boot with Secure Boot enabled. The output will resemble:

3  seconds remain...
2  seconds remain...
1  seconds remain...
0  seconds remain...
Failed to boot 'ubuntu0'           <<=== Fails here
Failed to boot 'NET-NIC_P0-IPV4'
Failed to boot 'NET-NIC_P0-IPV6'

>>Start PXE over IPv4

The UEFI database (db) contains a list of trusted X.509 certificates and hashes used to validate binaries during boot. In this case, the SHIM EFI binary (shim.efi or shimaa64.efi) is signed with Microsoft's certificate, so it cannot be authenticated once that certificate is missing from the database or corrupted.

Example output from mokutil showing a typical database:

root@dpu-arm:~# mokutil --db | grep "Subject:"

        Subject: C=US, ST=MA, L=Westborough, O=NVIDIA Corporation, OU=BlueField Secure Boot, CN=NVIDIA BlueField Secure Boot UEFI db Signing 2021
        Subject: C=US, ST=CA, L=Santa Clara, O=NVIDIA Corporation, OU=NBU, CN=NVIDIA BlueField Secure Boot EFI Signing 2022-A
        Subject: C=US, ST=Washington, L=Redmond, O=Microsoft Corporation, CN=Microsoft Corporation UEFI CA 2011                   <<<==== This is corrupted or missing
        Subject: C=US, ST=California, L=Palo Alto, O=VMware, Inc., CN=VMware Secure Boot Signing
        Subject: C=GB, ST=Isle of Man, L=Douglas, O=Canonical Ltd., CN=Canonical Ltd. Master Certificate Authority
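
On a DPU that still boots, the same listing can be searched for the Microsoft subject line, for example as a periodic audit or as a check after recovery. The following is a minimal sketch: the function name has_ms_uefi_ca is hypothetical, and the CN string is the one shown in the listing above. It only detects a missing entry; a certificate that is present but damaged may still print a subject line.

```shell
# Sketch: succeed if the Microsoft UEFI CA subject appears in a db listing.
# Reads the text of `mokutil --db` on stdin, so it also works on saved output.
# has_ms_uefi_ca is a hypothetical name used only for this example.
has_ms_uefi_ca() {
    grep -q "CN=Microsoft Corporation UEFI CA 2011"
}

# Example use on a booted BlueField Arm (not executed here):
#   mokutil --db | has_ms_uefi_ca && echo "present" || echo "MISSING"
```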

The SHIM binary is typically signed by Microsoft:

shimaa64.efi (Microsoft)
signature 1
image signature issuers:
 - /C=US/ST=Washington/L=Redmond/O=Microsoft Corporation/CN=Microsoft Corporation UEFI CA 2011   <<<======== Signed by this certificate
image signature certificates:
 - subject: /C=US/ST=Washington/L=Redmond/O=Microsoft Corporation/CN=Microsoft Windows UEFI Driver Publisher
   issuer:  /C=US/ST=Washington/L=Redmond/O=Microsoft Corporation/CN=Microsoft Corporation UEFI CA 2011
 - subject: /C=US/ST=Washington/L=Redmond/O=Microsoft Corporation/CN=Microsoft Corporation UEFI CA 2011
   issuer:  /C=US/ST=Washington/L=Redmond/O=Microsoft Corporation/CN=Microsoft Corporation Third Party Marketplace Root

To preserve Secure Boot integrity and enable large-scale recovery, a customer solution was developed to restore the missing Microsoft certificate using the DPU BMC, without disabling Secure Boot.

Solution

To recover from a missing Microsoft certificate, the BlueField Arm BFB image must be updated with the appropriate EFI capsule (efi_sbkeysync.cap), which includes the required certificate.

This process assumes that a recovery image is prepared before the problem occurs. A single BFB image may be reused across all affected DPUs if they share the same configuration.

Prerequisites

  • Python 3 must be installed on the BlueField Arm.

  • The EFI capsule is available at:
    /usr/lib/firmware/mellanox/boot/capsule/efi_sbkeysync.cap

  • The mlx-mkbfb tool is installed:

    root@dpu-arm:~# which mlx-mkbfb
    /usr/bin/mlx-mkbfb
    
    root@dpu-arm:~# python3 --version
    Python 3.10.12
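
The prerequisites above can be checked in one pass before starting. This is a minimal sketch, assuming a POSIX shell on the BlueField Arm; missing_prereqs is a hypothetical helper, and the capsule path and command names in the usage comment are the ones listed above.

```shell
# Sketch: list anything missing before crafting the recovery image.
# missing_prereqs is a hypothetical helper. It prints one line per missing
# item and prints nothing when everything is in place.
#   $1      path to the EFI capsule
#   $2...   commands that must be found on PATH
missing_prereqs() {
    capsule=$1
    shift
    [ -r "$capsule" ] || echo "capsule not readable: $capsule"
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "command not found: $tool"
    done
}

# Example use on the BlueField Arm (not executed here):
#   missing_prereqs /usr/lib/firmware/mellanox/boot/capsule/efi_sbkeysync.cap python3 mlx-mkbfb
```

Empty output means all three prerequisites were found.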
    

Preparation Steps

Perform these steps in advance, and store the resulting recovery image on the DPU BMC:

  1. Create a golden image on the DPU BMC (if one does not already exist):

    root@dpu-bmc:~# dpu_golden_image golden_image_arm -r /tmp/arm_golden_image.bfb

  2. Verify the image was created:

    root@dpu-bmc:~# ls -l /tmp/arm_golden_image.bfb
    -rw-r--r--    1 root     root      14713136 Jul  7 13:55 /tmp/arm_golden_image.bfb

  3. Copy the golden image from the BMC to the BlueField Arm:

    root@dpu-arm:~# scp root@<bmc_ip>:/tmp/arm_golden_image.bfb ~/.

  4. Craft a new BFB image using the Microsoft EFI capsule:

    root@dpu-arm:~# /usr/bin/mlx-mkbfb --capsule /usr/lib/firmware/mellanox/boot/capsule/efi_sbkeysync.cap arm_golden_image.bfb new_arm_golden_image.bfb

    This injects a valid Microsoft UEFI certificate into the new BFB.

  5. Verify the new image was created:

    root@dpu-arm:~# ls -l new_arm_golden_image.bfb
    -rw-r--r-- 1 root root 7366872 Jul  7 14:54 new_arm_golden_image.bfb

  6. Copy the new image back to the BMC (or to other BMCs as needed):

    root@dpu-arm:~# scp new_arm_golden_image.bfb root@<bmc_ip>:/tmp/.
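
Steps 4 and 5 can be combined into a small wrapper that refuses to run without its inputs and confirms that an output image was written. This is a sketch under stated assumptions, not a supported tool: craft_recovery_bfb and the CAPSULE and MKBFB variables are hypothetical names, and the mlx-mkbfb invocation is the one shown in step 4.

```shell
# Sketch: wrap steps 4 and 5 with basic sanity checks.
# CAPSULE and MKBFB are hypothetical overrides; the defaults are the paths
# given in the prerequisites.
CAPSULE=${CAPSULE:-/usr/lib/firmware/mellanox/boot/capsule/efi_sbkeysync.cap}
MKBFB=${MKBFB:-/usr/bin/mlx-mkbfb}

#   $1  golden image copied from the BMC
#   $2  recovery image to create
craft_recovery_bfb() {
    src=$1
    dst=$2
    [ -s "$src" ] || { echo "golden image missing or empty: $src" >&2; return 1; }
    [ -r "$CAPSULE" ] || { echo "capsule not readable: $CAPSULE" >&2; return 1; }
    # Same command line as step 4.
    "$MKBFB" --capsule "$CAPSULE" "$src" "$dst" || return 1
    # Step 5: make sure a non-empty image was produced.
    [ -s "$dst" ] || { echo "no output image produced: $dst" >&2; return 1; }
}

# Example use on the BlueField Arm (not executed here):
#   craft_recovery_bfb arm_golden_image.bfb new_arm_golden_image.bfb
```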

Resolution Steps

Perform the following steps after the Microsoft certificate is lost or corrupted:

  1. Stop RShim on the x86 host (to allow BMC access):

    [root@x86-host]# systemctl stop rshim

  2. Confirm RShim is inactive:

    [root@x86-host]# systemctl status rshim | grep -i Active
       Active: inactive (dead)

  3. Start RShim on the DPU BMC (if not already running):

    root@dpu-bmc:~# systemctl restart rshim

  4. Confirm RShim is active on the BMC:

    root@dpu-bmc:~# systemctl status rshim | grep -i Active
         Active: active (running)

  5. Write the new recovery image to the BlueField Arm (replace rshim0 with the appropriate RShim device):

    root@dpu-bmc:~# cat /tmp/new_arm_golden_image.bfb > /dev/rshim0/boot

  6. Observe the console output during boot:

    FmpDxe: EFI Capsule Authentication Successful, Status: Success.
    [PMI] DB update started.
    Enable Custom Mode, Status: Success
    Enroll key, Status: Success
    ...
    [PMI] Total number of updates: 6
    [PMI] Errors during updates  : 0
    CapsuleRuntimeDxe: ProcessCapsuleImage 0, Status: Success
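
When the write in step 5 is scripted across many BMCs, it may help to guard it so that an empty image or a missing RShim device is caught before anything is sent. A minimal sketch follows; push_bfb is a hypothetical name, and the default device /dev/rshim0 is simply the one used in step 5.

```shell
# Sketch: guarded version of the step 5 write.
# push_bfb is a hypothetical helper.
#   $1  BFB image to send
#   $2  RShim device directory (defaults to /dev/rshim0)
push_bfb() {
    image=$1
    dev=${2:-/dev/rshim0}
    [ -s "$image" ] || { echo "image missing or empty: $image" >&2; return 1; }
    [ -e "$dev/boot" ] || { echo "no boot node under $dev; is rshim running?" >&2; return 1; }
    # Same operation as step 5.
    cat "$image" > "$dev/boot"
}

# Example use on the DPU BMC (not executed here):
#   push_bfb /tmp/new_arm_golden_image.bfb /dev/rshim0
```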

The Microsoft UEFI certificate is restored, and the BlueField Arm should now boot successfully with Secure Boot enabled.
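
If the console output from step 6 is captured to a file, the success markers can be checked mechanically. The sketch below assumes the log contains the same strings as the sample output in step 6; other firmware versions may word these lines differently. capsule_update_ok is a hypothetical name.

```shell
# Sketch: succeed only if a captured console log shows a clean capsule update.
# Reads the log text on stdin. The matched strings come from the sample
# output in step 6. capsule_update_ok is a hypothetical helper.
capsule_update_ok() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q "EFI Capsule Authentication Successful" || return 1
    # Require an error count of exactly 0; trailing whitespace such as a
    # carriage return from a serial console is tolerated.
    printf '%s\n' "$log" | grep -Eq "Errors during updates[[:space:]]*:[[:space:]]*0[[:space:]]*$" || return 1
    printf '%s\n' "$log" | grep -q "ProcessCapsuleImage 0, Status: Success" || return 1
}

# Example use with a saved console capture (not executed here):
#   capsule_update_ok < console.log && echo "capsule applied cleanly"
```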
