How to Update Mellanox ConnectX-7 NIC Firmware on OSS Servers

1. Purpose

This article describes the procedure to upgrade the Mellanox ConnectX-7 network adapter firmware on the affected OSS servers to version 28.45.1200 in order to ensure compatibility, stability, and optimal performance.

2. Scope

This procedure applies to all OSS servers equipped with Mellanox ConnectX-7 network interface cards that currently run an earlier firmware version.

3. Prerequisites

  1. Firmware package: Obtain the firmware image file (e.g., fw-ConnectX7-rel-28_45_1200-MCX755106AS-xxx.bin) from the official NVIDIA/Mellanox site or an internal repository.

  2. Backup: Record the current firmware details and configuration before making changes.

  3. Maintenance window: Confirm that a maintenance window has been approved.

  4. Console access: Ensure server console or iDRAC/iLO access is available.

  5. Network impact: The firmware update requires a NIC reset, so plan for downtime.

  6. Privileges: Root or sudo privileges are required.

4. Procedure

Step 1: Verify Cluster State

pcs status

If the cluster is healthy, the nodes are reported as online.

Check the current firmware version with:

ibstat
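As a rough sketch, the version check can be scripted so that it is obvious whether a node still needs the update. The helper names below (fw_needs_update, current_fw_versions) are hypothetical and not part of any Mellanox tool, and the parser assumes that ibstat prints a "Firmware version:" line for each adapter, which should be confirmed on your servers.

```shell
# Target version taken from this article; adjust if your target differs.
TARGET_FW="28.45.1200"

# fw_needs_update CURRENT TARGET
# Returns 0 (true) when CURRENT is older than TARGET.
fw_needs_update() {
    local current="$1" target="$2"
    [ "$current" = "$target" ] && return 1
    # sort -V (GNU coreutils) orders dotted version strings field by field.
    local oldest
    oldest=$(printf '%s\n%s\n' "$current" "$target" | sort -V | head -n 1)
    [ "$oldest" = "$current" ]
}

# current_fw_versions
# Reads ibstat-style text on stdin and prints the value of every
# "Firmware version:" line (assumed output format).
current_fw_versions() {
    awk -F': *' '/Firmware version:/ {print $2}'
}
```

One possible use is to pipe ibstat into current_fw_versions and pass each reported value to fw_needs_update together with TARGET_FW.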


Step 2: Back Up Existing Firmware

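mstflint -d /dev/mst/mt4125_pciconf0 query > /root/mellanox_fw_backup.txt

Note: This saves the firmware query output (including the firmware version and PSID) to a file for reference; it does not copy the firmware image itself. The device path shown is an example. Use the device name reported on your server (for example, by running mst status).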


Step 3: Confirm the Firmware File Is Available

Make sure the firmware file is accessible on the node, e.g.:

ls /root/fw-ConnectX7-rel-28_45_1200-MCX755106AS-xxx.bin
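A slightly stricter check than ls can catch an empty or unreadable file before the node is taken out of service. The following is a minimal sketch; check_fw_file is a hypothetical helper name, and the path you pass to it is whatever location you copied the image to.

```shell
# check_fw_file PATH
# Returns 0 when PATH is a regular, readable, non-empty file.
check_fw_file() {
    local path="$1"
    if [ ! -f "$path" ]; then
        echo "missing: $path" >&2
        return 1
    fi
    if [ ! -r "$path" ]; then
        echo "not readable: $path" >&2
        return 1
    fi
    # -s is true only when the file exists and has a size greater than zero.
    if [ ! -s "$path" ]; then
        echo "empty: $path" >&2
        return 1
    fi
    echo "ok: $path"
}
```

This only confirms that the file is present and non-empty; it does not validate that the image matches the adapter.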


Step 4: Place the Node in Standby

Place the node targeted for the upgrade into standby mode to drain all BeeGFS services from it:

pcs node standby <HOSTNAME>

Example: pcs node standby T10PHGXSTOSSER01

Note: This might take a few minutes.


Step 5: Verify Status

Verify that the node's services have drained by running:

pcs status

Proceed to the next step only after the node is reported as standby.
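The standby check can also be scripted rather than read by eye. The sketch below assumes the output of pcs status nodes, which on the pcs versions I am aware of lists node names after labels such as "Online:" and "Standby:"; this format is an assumption and should be verified against your pcs version. node_in_standby is a hypothetical helper name.

```shell
# node_in_standby NODE
# Reads `pcs status nodes`-style text on stdin and returns 0 when NODE
# appears on the plain "Standby:" line (assumed output format).
# Lines such as "Standby with resource(s) running:" are deliberately
# ignored, because their first field is "Standby" without the colon.
node_in_standby() {
    awk -v node="$1" '
        $1 == "Standby:" {
            for (i = 2; i <= NF; i++) if ($i == node) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

For example, the output of pcs status nodes could be piped into node_in_standby with the hostname as its argument, and the update continued only when it returns success.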


Step 6: Updating Firmware

Update the firmware using the 'mlxfwmanager' utility:

mlxfwmanager -i </path/to/firmware_file>.bin -u

Example: mlxfwmanager -i /root/fw-ConnectX7-rel-28_45_1200-SN37B06010_SN37B06011_AX-UEFI-14.38.16-FlexBoot-3.7.500.signed.bin -u


Step 7: Disable force_ib_speed

Before rebooting, disable and stop the force_ib_speed service by running:

systemctl disable force_ib_speed
systemctl stop force_ib_speed


Step 8: Reboot the Server

reboot

Note: The reboot might take approximately 10 to 15 minutes.


Step 9: Verify Storage Connections

After the reboot, verify that the storage connections have been re-established automatically:

systemctl status eseries_nvme_ib
nvme list-subsys

Note: If you do not see the number of expected connections for your cluster, restart the 'eseries_nvme_ib' service and wait for all connections to be established.
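Waiting for the connections to come back can be wrapped in a simple polling loop. The sketch below is generic: retry_until is a hypothetical helper, and the command it polls (how you decide that the expected number of connections is present) is cluster-specific and left to the operator.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Returns 0 as soon as COMMAND succeeds, or 1 if every attempt fails.
retry_until() {
    local attempts="$1" delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep between attempts, but not after the final one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For instance, retry_until 30 10 systemctl is-active --quiet eseries_nvme_ib would poll the service state for up to about five minutes; a check of the actual connection count would need a site-specific command in its place.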


Step 10: Take the Node out of Standby

To bring the node out of standby, run:

pcs node unstandby <HOSTNAME>

Then check the cluster status:

pcs status


Step 11: Verify FW Version

ibstat

If this does not work, check whether Pacemaker is running and start it if it is not:

systemctl status pacemaker
systemctl start pacemaker

Then repeat Step 10 and check again.


Step 12: Resource relocation

To relocate all BeeGFS services back to their preferred nodes, run:

pcs resource relocate run

Then check the cluster status:

pcs status

Repeat the above steps for each node until all nodes in the cluster are updated.

