How to update Mellanox ConnectX-7 NICs Firmware on OSS Servers

1. Purpose

This article describes the procedure to upgrade the Mellanox ConnectX-7 network adapter firmware on the affected OSS servers to version 28.45.1200 in order to ensure compatibility, stability, and optimal performance.

2. Scope

This procedure applies to all OSS servers equipped with Mellanox ConnectX-7 network interface cards that currently run an earlier firmware version.

3. Prerequisites

  1. Firmware package: Obtain the firmware image file (e.g., fw-ConnectX7-rel-28_45_1200-MCX755106AS-xxx.bin) from the official NVIDIA/Mellanox site or the internal repository (a quick pre-flight check sketch follows this list).

  2. Backup current firmware and configuration.

  3. Maintenance window approved.

  4. Ensure server console or iDRAC/iLO access is available.

  5. Network impact: Firmware update requires NIC reset; plan downtime.

  6. Root/sudo privileges.
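
The following is a minimal pre-flight sketch for the items above, assuming the NVIDIA MFT tools (mst, mstflint, mlxfwmanager) are installed and the firmware image has been copied to /root; the filename below is the placeholder from item 1 and should be adjusted to the actual image:

ls -l /root/fw-ConnectX7-rel-28_45_1200-*.bin   # confirm the firmware image is in place
mst start                                       # load the mst modules and create the /dev/mst device nodes used in later steps
mst status -v                                   # the ConnectX-7 adapters should be listed with their /dev/mst paths
sudo -v                                         # confirm root/sudo privileges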

4. Procedure

Step 1: Verify Cluster State

pcs status

If the cluster is healthy, both nodes will be shown as Online.

Check the current firmware version with:

ibstat
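
If only the firmware version is of interest, the relevant line can be filtered from the output. This is a hedged sketch: the grep patterns assume the usual label text in the ibstat and mstflint output, and the device path is the one used in Step 2.

ibstat | grep -i "firmware version"
mstflint -d /dev/mst/mt4125_pciconf0 query | grep -i "fw version"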


Step 2: Back Up Existing Firmware

mstflint -d /dev/mst/mt4125_pciconf0 query > /root/mellanox_fw_backup.txt
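
In addition to the firmware query, the current NIC configuration can be saved as well. A minimal sketch, assuming the same device path and that mstconfig (or mlxconfig, depending on which tools package is installed) is available:

mstconfig -d /dev/mst/mt4125_pciconf0 query > /root/mellanox_nic_config_backup.txt   # save current NIC configuration settings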


Step 3: Load the Mellanox Firmware Update Tool

Make sure the firmware file is accessible, e.g.:

ls /root/fw-ConnectX7-rel-28_45_1200-MCX755106AS-xxx.bin
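
If a checksum is published alongside the firmware image, verifying it before burning is a sensible extra check; the filename below is the same placeholder used above.

sha256sum /root/fw-ConnectX7-rel-28_45_1200-MCX755106AS-xxx.bin
# compare the output against the value published by NVIDIA/Mellanox or in the internal repository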


Step 4: Change Cluster status to Standby

Place the node targeted for the upgrade into standby mode to drain all BeeGFS services:

pcs node standby <HOSTNAME>

Eg: pcs node standby T10PHGXSTOSSER01

Note: This might take a few minutes


Step 5: Verify Status

Verify that the node's services have drained by running:

pcs status

Once the node is shown as standby and no resources remain on it, proceed to the next step.
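
A quick way to confirm nothing is still running on the node is to filter the status output for its hostname; the hostname below is the example from Step 4, and the exact output layout depends on the pcs version.

pcs status nodes                        # the target node should be listed under Standby
pcs status | grep -i T10PHGXSTOSSER01   # no resources should be reported as Started on this node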


Step 6: Updating Firmware

Update the firmware using the 'mlxfwmanager' utility:

mlxfwmanager -i </path/to/firmware_file>.bin -u

Eg: mlxfwmanager -i /root/fw-ConnectX7-rel-28_45_1200-SN37B06010_SN37B06011_AX-UEFI-14.38.16-FlexBoot-3.7.500.signed.bin -u
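
mlxfwmanager can also be used to query the adapters before and after the update; a small sketch using its standard query mode:

mlxfwmanager --query   # show the firmware version currently installed on each detected adapter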


Step 7: Disable force_ib_speed

Before rebooting, disable and stop the force_ib_speed service by running:

systemctl disable force_ib_speed
systemctl stop force_ib_speed
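
To confirm the unit is stopped and will not start on the next boot (force_ib_speed is the site-specific service referenced by this article):

systemctl is-enabled force_ib_speed   # expected: disabled
systemctl is-active force_ib_speed    # expected: inactive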


Step 8: Reboot the Server

reboot

Note: This might take some time (approximately 10 to 15 minutes).


Step 9: Verify Storage Connections

Verify that the storage connections reconnected automatically:

systemctl status eseries_nvme_ib
nvme list-subsys

Note: If you do not see the number of expected connections for your cluster, restart the 'eseries_nvme_ib' service and wait for all connections to be established.
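
A minimal sketch of the recovery path described in the note above, assuming the service and command names used earlier in this step; the wait time is an assumption and may need adjusting:

systemctl restart eseries_nvme_ib
sleep 60            # allow time for the NVMe-oF connections to re-establish
nvme list-subsys    # re-check that the expected number of connections is present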


Step 10: Make node Unstandby

To bring the node out of standby, run:

pcs node unstandby <HOSTNAME>

Check cluster status

pcs status


Step 11: Verify FW Version

ibstat

If the command does not work, check that Pacemaker is running and start it if necessary:

systemctl status pacemaker
systemctl start pacemaker   (if not already started)

Then repeat Step 10.
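
To confirm the adapter is now on 28.45.1200, the same queries from the earlier steps can be reused; the device path is the example from Step 2.

ibstat | grep -i "firmware version"
mstflint -d /dev/mst/mt4125_pciconf0 query | grep -i "fw version"
# both should now report 28.45.1200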


Step 12: Resource relocation

To relocate all BeeGFS services back to their preferred nodes, run:

pcs resource relocate run

Then check the cluster status:

pcs status
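
To confirm the resources have moved back, the resource view can be checked as well; a small sketch using standard pcs subcommands:

pcs status resources   # all BeeGFS resources should be Started on their preferred nodes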

Repeat the above steps for each remaining node until all nodes in the cluster are updated.

