BlueField Troubleshooting Guide

DDR Error Reporting and Handling


Preface

RAS stands for reliability, availability, and serviceability.

  • Reliability = the continuity of correct service

  • Availability = the readiness for correct service

  • Serviceability = the ability to undergo modifications and repairs

RAS reduces and avoids unplanned outages because:

  • Errors can be detected and corrected by the hardware before they cause system failures.

  • Errors can be detected by hardware and reported to software. Software can take action by identifying and replacing failing components.

  • Errors can be predicted ahead-of-time to allow replacement.

The BlueField-3 has RAS support for DRAM errors which can be either:

  • Single-bit errors aka correctable errors; or

  • Double-bit errors, which can be:

    • Non-fatal errors (recoverable)

    • Fatal errors (non-recoverable)

In DPU mode, the OS generates an error report and prints it to the console.

Command Cheat Sheet

Command

Description

dmesg

Displays the kernel message buffer, which includes the full DRAM error report

modprobe einj

Load the error injection driver

cat /sys/kernel/debug/apei/einj/available_error_type

Lists available errors to inject

echo <address> > /sys/kernel/debug/apei/einj/param1

Inject error at physical address <address>

echo <mask> > /sys/kernel/debug/apei/einj/param2

Physical address mask

echo <error_type> > /sys/kernel/debug/apei/einj/error_type

Error type to inject, taken from the available_error_type list

echo 1 > /sys/kernel/debug/apei/einj/error_inject

Trigger the error after configuring all parameters above

echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc

cat /dev/rshim0/misc

Enable debug level in the RShim log and dump the RShim log

Logging and Counters

N/A

Debug Info Package

N/A

Scenarios

DRAM-related Issues

Although rare, DRAM errors can occur. These are handled according to their severity.

What to Do If a DRAM Error Occurs

Correctable Errors

Correctable errors (CE), also known as single-bit ECC errors, are non-fatal and are automatically corrected by hardware. No user action is required.

Expected OS report for a CE event:

ERROR:   BL31: MSS1 C0 Single bit ECC error detected. IRQ 91

[ 234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 234.638588] {1}[Hardware Error]: event severity: corrected
[ 234.638590] {1}[Hardware Error]:  Error 0, type: corrected
[ 234.638591] {1}[Hardware Error]:  section_type: memory error
[ 234.638592] {1}[Hardware Error]:  error_status: 0x0000000000010400
[ 234.638594] {1}[Hardware Error]:  physical_address: 0x0000000000000080
[ 234.638595] {1}[Hardware Error]:  physical_address_mask: 0x0000ffffffffffff
[ 234.638598] {1}[Hardware Error]:  module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480
[ 234.638599] {1}[Hardware Error]:  error_type: 2, single-bit ECC
[ 234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-2814749776710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)
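
When many such reports accumulate, it can help to pull a single field out of the kernel log instead of reading each report by eye. The following is a minimal sketch, assuming the report lines keep the "[Hardware Error]:  name: value" layout shown above; ghes_field is a hypothetical helper name, not part of any NVIDIA or kernel tool.

```shell
# Hypothetical helper: print the value of one report field (for example
# physical_address or error_type) for every "[Hardware Error]" line on stdin.
ghes_field() {
    field=$1
    # Match "<field>: <value>" on Hardware Error lines and print only the value.
    # The colon directly after the name keeps physical_address from also
    # matching physical_address_mask.
    sed -n "s/.*\[Hardware Error\]: *${field}: *\(.*\)\$/\1/p"
}
```

For example, running `dmesg | ghes_field physical_address` against the CE report above would print 0x0000000000000080.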

Uncorrectable Fatal Errors

Uncorrectable errors (UE) that are fatal are double-bit ECC errors that may result in a system abort or reboot. Some Linux distributions reboot automatically after the resulting kernel panic, while others remain halted. In the latter case, manually reboot or power-cycle the system.

Expected Ubuntu report (PSB) for a UE fatal event:

ERROR:   BL31: MSS1 C0 Double bit ECC error detected. IRQ 93

[ 313.874148] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 313.885102] {1}[Hardware Error]: event severity: fatal
[ 313.890229] {1}[Hardware Error]:  Error 0, type: fatal
[ 313.895352] {1}[Hardware Error]:  section_type: memory error
[ 313.901080] {1}[Hardware Error]:  error_status: 0x0000000000010400
[ 313.907330] {1}[Hardware Error]:  physical_address: 0x0000000000000080
[ 313.913925] {1}[Hardware Error]:  physical_address_mask: 0x0000ffffffffffff
[ 313.920956] {1}[Hardware Error]:  module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 313.930761] {1}[Hardware Error]:  error_type: 3, multi-bit ECC
[ 313.936667] Kernel panic - not syncing: Fatal hardware error!
...
[ 315.904541] Rebooting in 10 seconds..

Uncorrectable Non-fatal Errors

Uncorrectable non-fatal errors, also known as uncorrectable recoverable errors, are double-bit ECC errors that do not interrupt services. The OS will handle the error by retiring the faulty page, and no user action is required.

Expected Linux report for a UE non-fatal event:

ERROR:   BL31: MSS1 C0 Double bit ECC error detected. IRQ XX

[ 219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 219.795253] {1}[Hardware Error]: event severity: recoverable
[ 219.800897] {1}[Hardware Error]:  Error 0, type: recoverable
[ 219.806540] {1}[Hardware Error]:  section_type: memory error
[ 219.812269] {1}[Hardware Error]:  error_status: 0x0000000000010400
[ 219.818518] {1}[Hardware Error]:  physical_address: 0x0000000000000080
[ 219.825114] {1}[Hardware Error]:  physical_address_mask: 0x0000ffffffffffff
[ 219.832146] {1}[Hardware Error]:  module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 219.841952] {1}[Hardware Error]:  error_type: 3, multi-bit ECC
[ 219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory ...
[ 219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80
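
The three report types differ in their "event severity" line, so a quick tally of that line gives an overview of what a system has seen. This is a minimal sketch, assuming the log format shown in the reports above; ghes_severity_summary is a hypothetical helper name. Note that a fatal event normally has to be read from a saved console capture, since the kernel log buffer does not survive the reboot.

```shell
# Hypothetical helper: count "event severity" lines per class from a
# kernel or console log supplied on stdin.
ghes_severity_summary() {
    awk '
        /\[Hardware Error\]: event severity:/ {
            sev = $NF            # last word: corrected, recoverable, or fatal
            count[sev]++
        }
        END {
            printf "corrected=%d recoverable=%d fatal=%d\n",
                   count["corrected"], count["recoverable"], count["fatal"]
        }
    '
}
```

Typical use would be `dmesg | ghes_severity_summary`, or the same function fed from a saved console log file.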

Issues with Error Injection

Error Injection Does Not Work

Ensure you follow these steps when injecting errors:

  1. Load the einj driver:

    modprobe einj
    cd /sys/kernel/debug/apei/einj

  2. Verify the available parameters:

    ls
    Example output:
    available_error_type  error_type  notrigger  param2  param4
    error_inject          flags       param1     param3

  3. Configure parameters:

    • Target physical address (preferably near the end of available memory):

      echo 0x400000000 > param1

    • Address mask:

      echo 0xfffffffffffff000 > param2

    • Error type:

      echo 0x8 > error_type

  4. Check the available error types (the value written to error_type in step 3 must be one of these):

    [root@localhost einj]# cat available_error_type 
    0x00000008      Memory Correctable
    0x00000010      Memory Uncorrectable non-fatal
    0x00000020      Memory Uncorrectable fatal

  5. Trigger the error: 

    echo 1 > error_inject

    If /sys/kernel/debug/apei/einj/notrigger is set to 1, the error is injected but not triggered automatically; you must trigger it manually (e.g., by accessing the affected memory).

    To test manually using user-space memory:

    git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools
    cd ras-tools
    make -j 8
    ./victim -d
    # Use printed physical address in your injection steps
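
The injection steps above can be collected into one small function, which makes it harder to skip a parameter or to write an error type the platform does not advertise. This is only a sketch under stated assumptions: einj_inject is a hypothetical helper name, the EINJ directory is passed as an argument (on a real system it would be /sys/kernel/debug/apei/einj after modprobe einj, accessed as root), and available_error_type is assumed to use the "hex value, description" layout shown in step 4.

```shell
# Hypothetical helper: inject one memory error through the EINJ debugfs files.
# Usage: einj_inject <einj_dir> <phys_addr> <addr_mask> <error_type>
einj_inject() {
    einj_dir=$1
    addr=$2
    mask=$3
    etype=$4

    # Refuse error types that are not listed in available_error_type.
    # The file uses zero-padded hex (0x00000008), so compare numerically.
    supported=no
    while read -r value rest; do
        if [ $((value)) -eq $((etype)) ]; then
            supported=yes
        fi
    done < "$einj_dir/available_error_type"

    if [ "$supported" != yes ]; then
        echo "error type $etype is not listed in available_error_type" >&2
        return 1
    fi

    echo "$addr"  > "$einj_dir/param1"        # target physical address
    echo "$mask"  > "$einj_dir/param2"        # physical address mask
    echo "$etype" > "$einj_dir/error_type"    # value from available_error_type
    echo 1        > "$einj_dir/error_inject"  # trigger the injection
}
```

With the values from step 3, the call would be `einj_inject /sys/kernel/debug/apei/einj 0x400000000 0xfffffffffffff000 0x8`.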

CE Error Injection Does Not Work

By default, the CE threshold is set to 5000. You must inject 5000 errors before they are reported by the OS unless you change the threshold.

  • Change the threshold to 0 via UEFI or Redfish to report CE errors immediately.
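
If the threshold cannot be changed, the injection has to be repeated until the threshold is reached. The loop below is a minimal sketch of that, assuming param1, param2, and error_type are already configured and that every write to error_inject counts as one correctable error toward the threshold; einj_repeat is a hypothetical helper name.

```shell
# Hypothetical helper: re-trigger an already configured injection <count> times.
# Usage: einj_repeat <einj_dir> <count>
einj_repeat() {
    einj_dir=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop at the first failed write so a broken setup is noticed early.
        echo 1 > "$einj_dir/error_inject" || return 1
        i=$((i + 1))
    done
    echo "triggered $i injections"
}
```

With the default threshold, the call would be `einj_repeat /sys/kernel/debug/apei/einj 5000`.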

BMC Does Not Receive Error Report

If the BMC does not report the injected error:

  1. Verify BMC firmware is up to date.

  2. Check that the BlueField Arm cores reported the error:

    • Review the BlueField console logs.

    • Check for messages such as "ERROR: BL31: MSS1 C0 Double bit ECC error detected".

    • Run: 

      dmesg | tail
      
    • Enable and check RShim logs: 

      echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
      cat /dev/rshim0/misc

  3. If no errors appear, update the BFB and enable BMC software updates.
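
The console check in step 2 can be scripted when a saved console or RShim log is available. This is a minimal sketch, assuming the BL31 messages keep the wording shown in the reports earlier in this guide; bl31_ecc_status is a hypothetical helper name.

```shell
# Hypothetical helper: report which BL31 ECC messages appear in a log on stdin.
# Prints "double", "single", or "none"; "double" takes precedence if both occur.
bl31_ecc_status() {
    awk '
        /BL31: .*Double bit ECC error detected/ { dbl = 1 }
        /BL31: .*Single bit ECC error detected/ { sgl = 1 }
        END {
            if (dbl)      print "double"
            else if (sgl) print "single"
            else          print "none"
        }
    '
}
```

For example, `cat /dev/rshim0/misc | bl31_ecc_status` prints "none" when the firmware logged no ECC event, which suggests the problem lies before the BMC reporting path.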
