Preface
RAS stands for reliability, availability, and serviceability.
- Reliability = the continuity of correct service
- Availability = the readiness for correct service
- Serviceability = the ability to undergo modifications and repairs
RAS reduces and avoids unplanned outages because:
- Errors can be detected and corrected by the hardware before they cause system failures.
- Errors can be detected by hardware and reported to software. Software can then act by identifying failing components so that they can be replaced.
- Errors can be predicted ahead of time to allow replacement.
The BlueField-3 has RAS support for DRAM errors, which can be either:
- Single-bit errors, also known as correctable errors; or
- Double-bit errors, which can be either:
  - Non-fatal errors (recoverable)
  - Fatal errors (non-recoverable)
In DPU mode, the OS generates an error report and prints it to the console.
Command Cheat Sheet
| Command | Description |
|---|---|
| dmesg | The dmesg command in Linux displays the kernel message buffer and will show the full DRAM error report |
| modprobe einj | Load the error injection driver |
| /sys/kernel/debug/apei/einj/available_error_type | Lists available errors to inject |
| | Inject error at physical address |
| | Physical address mask |
| | From the list of |
| echo <address> > /sys/kernel/debug/apei/einj/error_inject | Trigger error after configuring all parameters above |
| | Enable debug level in the RShim log and dump the RShim log |
Logging and Counters
N/A
Debug Info Package
N/A
Scenarios
DRAM-related Issues
Although rare, DRAM errors can occur. These are handled according to their severity.
What to Do If a DRAM Error Occurs
Correctable Errors
Correctable errors (CE), also known as single-bit ECC errors, are non-fatal and are automatically corrected by hardware. No user action is required.
Expected OS report for a CE event:
ERROR: BL31: MSS1 C0 Single bit ECC error detected. IRQ 91
[ 234.638586] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 234.638588] {1}[Hardware Error]: event severity: corrected
[ 234.638590] {1}[Hardware Error]: Error 0, type: corrected
[ 234.638591] {1}[Hardware Error]: section_type: memory error
[ 234.638592] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 234.638594] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 234.638595] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 234.638598] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 480
[ 234.638599] {1}[Hardware Error]: error_type: 2, single-bit ECC
[ 234.638617] EDAC MC0: 1 CE Single-bit ECC on unknown memory (module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 page:0x0 offset:0x80 grain:-2814749776710655 syndrome:0x0 - APEI location: module:0 rank:0 bank:0 bank_group:0 row:0 col:0 bit_pos:480 status(0x0000000000010400): Storage error in DRAM memory)
Uncorrectable Fatal Errors
Uncorrectable errors (UE) that are fatal are double-bit ECC errors that may result in a system abort or reboot. Some Linux distributions reboot automatically, while others may panic. In the latter case, manually reboot or power-cycle the system.
Expected Ubuntu report (PSB) for a UE fatal event:
ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ 93
[ 313.874148] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 313.885102] {1}[Hardware Error]: event severity: fatal
[ 313.890229] {1}[Hardware Error]: Error 0, type: fatal
[ 313.895352] {1}[Hardware Error]: section_type: memory error
[ 313.901080] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 313.907330] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 313.913925] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 313.920956] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 313.930761] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 313.936667] Kernel panic - not syncing: Fatal hardware error!
...
[ 315.904541] Rebooting in 10 seconds..
Uncorrectable Non-fatal Errors
Uncorrectable non-fatal errors, also known as uncorrectable recoverable errors, are double-bit ECC errors that do not interrupt services. The OS will handle the error by retiring the faulty page, and no user action is required.
Expected Linux report for a UE non-fatal event:
ERROR: BL31: MSS1 C0 Double bit ECC error detected. IRQ XX
[ 219.784317] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 219.795253] {1}[Hardware Error]: event severity: recoverable
[ 219.800897] {1}[Hardware Error]: Error 0, type: recoverable
[ 219.806540] {1}[Hardware Error]: section_type: memory error
[ 219.812269] {1}[Hardware Error]: error_status: 0x0000000000010400
[ 219.818518] {1}[Hardware Error]: physical_address: 0x0000000000000080
[ 219.825114] {1}[Hardware Error]: physical_address_mask: 0x0000ffffffffffff
[ 219.832146] {1}[Hardware Error]: module: 0 rank: 0 bank: 0 bank_group: 0 row: 0 column: 0 bit_position: 0
[ 219.841952] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 219.847870] EDAC MC0: 1 UE Multi-bit ECC on unknown memory ...
[ 219.847875] [Firmware Warn]: GHES: Invalid address in generic error data: 0x80
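
As a rough aid for triaging the three reports above, the following Python sketch extracts a few fields from a captured dmesg excerpt and maps the event severity to the guidance given on this page. The function names, the ACTIONS table, and the regular expression are illustrative assumptions based only on the sample output shown here; they are not part of any NVIDIA tool, and real reports may be laid out differently.

```python
import re

# Matches "key: value" lines of a report, for example:
#   [  234.638588] {1}[Hardware Error]: event severity: corrected
# Keys are assumed to be lowercase, as in the samples on this page.
_FIELD_RE = re.compile(r"\[Hardware Error\]:\s*(?P<key>[a-z_ ]+?):\s*(?P<value>.+)$")

# Guidance paraphrased from this page, keyed by the "event severity" value.
ACTIONS = {
    "corrected": "no action required (corrected by hardware)",
    "recoverable": "no action required (the OS retires the faulty page)",
    "fatal": "reboot or power-cycle if the system does not reboot by itself",
}


def parse_report(dmesg_text):
    """Collect the key/value fields of one hardware error report."""
    fields = {}
    for line in dmesg_text.splitlines():
        match = _FIELD_RE.search(line)
        if match:
            # Keep the first occurrence of each key.
            fields.setdefault(match.group("key").strip(), match.group("value").strip())
    return fields


def summarize(dmesg_text):
    """Return severity, address, error type, and the suggested action."""
    fields = parse_report(dmesg_text)
    severity = fields.get("event severity", "unknown")
    address = fields.get("physical_address")
    return {
        "severity": severity,
        "physical_address": int(address, 16) if address is not None else None,
        "error_type": fields.get("error_type"),
        "action": ACTIONS.get(severity, "unrecognized severity; inspect the full report"),
    }
```

Passing the text of one report (for instance, copied from dmesg into a string) to summarize returns a small dictionary; lines that do not follow the key: value layout, such as the EDAC summary line, are ignored.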
Issues with Error Injection
Error Injection Does Not Work
Ensure you follow these steps when injecting errors:
- Load the einj driver:
    modprobe einj
    cd /sys/kernel/debug/apei/einj
- Verify the available parameters:
    ls
  Example output:
    available_error_type error_type notrigger param2 param4 error_inject flags param1 param3
- Configure parameters:
  - Target physical address (preferably near the end of available memory):
      echo 0x400000000 > param1
  - Address mask:
      echo 0xfffffffffffff000 > param2
  - Error type:
      echo 0x8 > error_type
- Available error types:
    [root@localhost einj]# cat available_error_type
    0x00000008 Memory Correctable
    0x00000010 Memory Uncorrectable non-fatal
    0x00000020 Memory Uncorrectable fatal
- Trigger the error:
    echo 1 > error_inject
If /sys/kernel/debug/apei/einj/notrigger is set to 1, you must trigger the error manually (e.g., memory access).

To test manually using user-space memory:
    git clone https://kernel.googlesource.com/pub/scm/linux/kernel/git/aegl/ras-tools
    cd ras-tools
    make -j 8
    ./victim -d
    # Use printed physical address in your injection steps
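
The manual injection steps above could be scripted roughly as follows. This is a minimal Python sketch, assuming debugfs is mounted, the einj module is already loaded, and the script runs as root. The helper names parse_error_types and inject are hypothetical; only the file names under /sys/kernel/debug/apei/einj come from the steps above.

```python
from pathlib import Path

# Directory used in the steps above; needs root and a mounted debugfs.
EINJ_DIR = Path("/sys/kernel/debug/apei/einj")


def parse_error_types(text):
    """Parse the contents of available_error_type into {code: description}."""
    types = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # e.g. "0x00000008", "Memory Correctable"
        if len(parts) == 2:
            types[int(parts[0], 16)] = parts[1].strip()
    return types


def inject(address, mask, error_type, einj_dir=EINJ_DIR):
    """Write param1, param2, and error_type, then trigger the injection."""
    einj_dir = Path(einj_dir)
    available = parse_error_types((einj_dir / "available_error_type").read_text())
    if error_type not in available:
        raise ValueError(f"error type {error_type:#x} is not listed in available_error_type")
    (einj_dir / "param1").write_text(f"{address:#x}\n")         # target physical address
    (einj_dir / "param2").write_text(f"{mask:#x}\n")            # address mask
    (einj_dir / "error_type").write_text(f"{error_type:#x}\n")  # e.g. 0x8
    (einj_dir / "error_inject").write_text("1\n")               # trigger
    return available[error_type]
```

For example, inject(0x400000000, 0xfffffffffffff000, 0x8) mirrors the values used in the steps above. The einj_dir parameter exists only so that the sketch can be pointed at a scratch directory for a dry run; checking the requested type against available_error_type first avoids writing a value the platform does not list.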
CE Error Injection Does Not Work
By default, the CE threshold is set to 5000. You must inject 5000 errors before they are reported by the OS unless you change the threshold.
- Change the threshold to 0 via UEFI or Redfish to report CE errors immediately.
BMC Does Not Receive Error Report
If the BMC does not report the injected error:
- Verify BMC firmware is up to date.
- Check that Arm reported the error:
  - Review the BlueField console logs.
  - Check for messages like ERROR: BL31: MSS1 C0 Double bit.
  - Run:
      dmesg | tail
  - Enable and check RShim logs:
      echo "DISPLAY_LEVEL 2" > /dev/rshim0/misc
      cat /dev/rshim0/misc
- If no errors appear, update the BFB and enable BMC software updates.
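
When reviewing a saved console or RShim log for the BL31 messages mentioned above, a small filter can help. The Python sketch below is an assumption-laden illustration: the regular expression is inferred only from the sample lines on this page (ERROR: BL31: MSS1 C0 Single/Double bit ECC error detected), and the function name find_ecc_events is hypothetical. The MSS and C tokens are returned verbatim without interpreting them.

```python
import re

# Pattern inferred from the sample BL31 lines on this page; other firmware
# versions may word the message differently.
BL31_ECC_RE = re.compile(
    r"ERROR:\s+BL31:\s+(?P<mss>MSS\d+)\s+(?P<c>C\d+)\s+"
    r"(?P<kind>Single|Double) bit ECC error detected"
)


def find_ecc_events(log_text):
    """Return (mss, c, kind) tuples for every BL31 ECC line in the log text."""
    events = []
    for line in log_text.splitlines():
        match = BL31_ECC_RE.search(line)
        if match:
            events.append((match.group("mss"), match.group("c"), match.group("kind").lower()))
    return events
```

An empty result for a log captured during an injection suggests that the Arm side did not report the error, which is the case the last step above addresses.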