NVIDIA NVOS User Manual for InfiniBand Switches

Appendix—NVLink Switch SSD Firmware Update Customer Guide

This document describes a rare issue in which the system SSD (NVMe device) becomes unresponsive during normal operation while traffic continues to run. It outlines the symptoms and affected components, and provides a detailed procedure for installing a firmware fix through the regular NVOS update mechanism.

Issue Description

During normal operation, the system SSD (NVMe device) may become unresponsive even while traffic continues to flow. When this happens, the operating system (NVOS) error logs will resemble the following output:

WARNING kernel: [2729682.680668] nvme nvme0: I/O 132 QID 3 timeout, aborting
WARNING kernel: [2729682.682007] nvme nvme0: Abort status: 0x0
WARNING kernel: [2729713.399278] nvme nvme0: I/O 132 QID 3 timeout, reset controller
WARNING kernel: [2729774.836562] nvme nvme0: I/O 21 QID 0 timeout, reset controller
ERR kernel: [2729795.339632] nvme nvme0: Device not ready; aborting reset, CSTS=0x1
WARNING kernel: [2729815.909129] nvme nvme0: Removing after probe failure status: -19
...
EXT4-fs error (device nvme0n1p3): ext4_get_inode_loc:4513: inode #531990: block 2097665: comm healthd: unable to read itable block
EXT4-fs error (device nvme0n1p3): ext4_get_inode_loc:4513: inode #534476: block 2097820: comm python3: unable to read itable block
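
The symptom lines above can be searched for in a saved log. The following is a minimal sketch, assuming the log has been exported as plain text. The patterns are derived only from the sample output shown here, and the function name find_ssd_hang_lines is hypothetical; real logs may word these messages differently.

```python
import re

# Patterns derived from the sample log in this guide (an assumption:
# other kernel versions may phrase these messages differently).
SYMPTOM_PATTERNS = [
    re.compile(r"nvme nvme\d+: I/O \d+ QID \d+ timeout"),
    re.compile(r"nvme nvme\d+: Device not ready; aborting reset"),
    re.compile(r"nvme nvme\d+: Removing after probe failure"),
    re.compile(r"EXT4-fs error \(device nvme\w+\)"),
]


def find_ssd_hang_lines(log_lines):
    """Return the log lines that match any of the symptom patterns."""
    return [
        line for line in log_lines
        if any(pattern.search(line) for pattern in SYMPTOM_PATTERNS)
    ]


sample = [
    "WARNING kernel: [2729682.680668] nvme nvme0: I/O 132 QID 3 timeout, aborting",
    "WARNING kernel: [2729682.682007] nvme nvme0: Abort status: 0x0",
    "ERR kernel: [2729795.339632] nvme nvme0: Device not ready; aborting reset, CSTS=0x1",
    "INFO kernel: unrelated message",
]

for line in find_ssd_hang_lines(sample):
    print(line)
```

A match is only a hint that the system may be hitting this issue; the part number and firmware version still need to be checked.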

Affected Components

Affected Part Number: VTPM24CEXI080-BM110006 (MEM000490)

Affected Firmware: CE00A450, CE00A400

Solution: The issue is fixed in SSD firmware version CE00A474.
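
The affected combination can be expressed as a small check. This is an illustrative sketch only: the function name is_affected is hypothetical, and the substring match on the part number assumes the vendor-prefixed form ("Virtium VTPM24CEXI080-BM110006") that appears in the command output later in this guide.

```python
# Values taken from the "Affected Components" list in this guide.
AFFECTED_PART_NUMBER = "VTPM24CEXI080-BM110006"
AFFECTED_FIRMWARE = {"CE00A450", "CE00A400"}


def is_affected(part_number, firmware):
    """Return True if the SSD part number and firmware match the affected set.

    A substring match is used because the reported part number may carry
    a vendor prefix (an assumption based on the sample output).
    """
    return AFFECTED_PART_NUMBER in part_number and firmware in AFFECTED_FIRMWARE


print(is_affected("Virtium VTPM24CEXI080-BM110006", "CE00A400"))
print(is_affected("Virtium VTPM24CEXI080-BM110006", "CE00A474"))
```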

SSD Firmware Update Methods

NVOS supports two options for updating SSD firmware:

  1. Automatic installation of the SSD firmware as part of the NVOS upgrade (available from version 25.02.8008 onward).

  2. Manual installation using the NVUE CLI commands (available from version 25.02.8008 onward).

Auto SSD Firmware Installation 

In NVOS systems, SSD (NVMe) firmware is installed automatically during the standard NVOS software upgrade process. No special user action is required.
The updated SSD firmware becomes active after the system completes the power cycle/cold reboot that occurs as part of the NVOS upgrade flow.

Verification

After the NVOS update and power cycle/cold reboot of the system, do the following:

  1. Log in to the system.

  2. Run the following command to check the SSD firmware version:

    admin@nvos:~$ nv show platform firmware SSD
    


    Expected output:

                     operational
    part-number      Virtium VTPM24CEXI080-BM110006
    actual-firmware  CE00A474
    fw-source        N/A
    
  • actual-firmware should show CE00A474 (indicating the new firmware).

If the version matches CE00A474, the update was successful.
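
The check above can also be scripted against captured command output. The sketch below assumes the two-column layout shown in the expected output and that the text has been saved or pasted by hand; the function names parse_firmware_table and update_succeeded are hypothetical, and the layout on other NVOS releases may differ.

```python
FIXED_FIRMWARE = "CE00A474"  # fixed version named in this guide


def parse_firmware_table(text):
    """Parse the two-column 'nv show platform firmware SSD' table into a dict."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split at the first run of whitespace
        if len(parts) == 2:          # skips the header row and blank lines
            fields[parts[0]] = parts[1].strip()
    return fields


def update_succeeded(text):
    """Return True if actual-firmware reports the fixed version."""
    return parse_firmware_table(text).get("actual-firmware") == FIXED_FIRMWARE


sample_output = """\
                 operational
part-number      Virtium VTPM24CEXI080-BM110006
actual-firmware  CE00A474
fw-source        N/A
"""

print(update_succeeded(sample_output))
```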

Manual Installation Procedure on NVOS 

Prerequisites

  • SSH access to the system

Steps for Upgrade

  1. (Optional) Check the current firmware version to see whether it differs from CE00A474:

    admin@nvos:~$ nv show platform firmware SSD
    

    Example output: 

                     operational 
    part-number      Virtium VTPM24CEXI080-BM110006
    actual-firmware  CE00A400
    fw-source        N/A
    


  2. Fetch and install SSD firmware package:

    admin@nvos:~$ nv action fetch platform firmware SSD scp://username[:password]@hostname/path/ssd_fw.pkg
    admin@nvos:~$ nv action install platform firmware SSD files ssd_fw.pkg
    


  • Upgrading or downgrading to SSD firmware version CE00A474 or later is supported on-the-fly; no power cycle is required or performed automatically.

  • Upgrading or downgrading to SSD firmware version CE00A400 or CE00A450 will automatically trigger a system power cycle after the firmware update.
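
The two notes above amount to a rule keyed on the target firmware version. A minimal sketch of that rule follows, useful for planning a maintenance window. The function name expects_power_cycle is hypothetical, and treating every version not listed as "CE00A474 or later" is an assumption, since this guide does not define how versions are ordered.

```python
# Target versions that, per the notes above, trigger an automatic power cycle.
POWER_CYCLE_TARGETS = {"CE00A400", "CE00A450"}


def expects_power_cycle(target_version):
    """Return True if installing target_version is expected to power cycle the system.

    Assumption: any version not listed is CE00A474 or later and is
    applied on the fly without a power cycle.
    """
    return target_version in POWER_CYCLE_TARGETS


for version in ("CE00A400", "CE00A450", "CE00A474"):
    print(version, expects_power_cycle(version))
```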

Expected Behavior After Installation

  • The system will automatically perform a power cycle/cold reboot (if needed) after the firmware update. 

  • Once the system is back online, the SSD should operate normally.

  • No additional steps are required from the customer after the update completes.

Verification

After the firmware update and system power cycle/cold reboot (if needed):

  1. Log in to the system.

  2. Run the following command to check the SSD firmware version:

    admin@nvos:~$ nv show platform firmware SSD
    


    Expected output:

                     operational 
    part-number      Virtium VTPM24CEXI080-BM110006
    actual-firmware  CE00A474
    fw-source        N/A
    

  • actual-firmware should show CE00A474 (indicating the new firmware).

If the version matches CE00A474, the update was successful.
