This article describes the procedure to upgrade the Mellanox ConnectX-7 network adapter firmware on the affected OSS servers to version 28.45.1200 in order to ensure compatibility, stability, and optimal performance.
This procedure applies to all OSS servers equipped with Mellanox ConnectX-7 network interface cards that currently run an earlier firmware version.
Firmware package: Obtain the firmware image file (e.g., fw-ConnectX7-rel-28_45_1200-MCX755106AS-xxx.bin) from the official NVIDIA/Mellanox site or the internal repository.
Back up the current firmware and configuration.
Maintenance window approved.
Ensure server console or iDRAC/iLO access is available.
Network impact: Firmware update requires NIC reset; plan downtime.
Root/sudo privileges.
Check the cluster status:
pcs status
If the cluster is healthy, all nodes are shown as online.
Check the current firmware version with:
ibstat
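The version check above can be scripted. The following is a minimal sketch, assuming ibstat prints lines of the form "Firmware version: 28.39.1002" and that GNU sort (with -V) is available. The helper names fw_versions and fw_older_than are hypothetical and only used for illustration; the target version is the one stated in this article.

```shell
#!/usr/bin/env bash
# Target firmware version taken from this article.
TARGET_FW="28.45.1200"

# Print every "Firmware version" value found in ibstat output read from stdin.
# Assumption: ibstat emits lines like "Firmware version: 28.39.1002".
fw_versions() {
    awk -F': *' '/Firmware version/ {print $2}'
}

# Return 0 if version $1 is strictly older than version $2.
# Uses GNU sort -V for dotted version ordering.
fw_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Example use on a node (requires ibstat):
#   ibstat | fw_versions | while read -r v; do
#       if fw_older_than "$v" "$TARGET_FW"; then echo "upgrade needed: $v"; fi
#   done
```

Adapters that already report the target version (or a newer one) produce no output in the example loop, so they can be skipped.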
Save the current firmware details for reference:
mstflint -d /dev/mst/mt4125_pciconf0 query > /root/mellanox_fw_backup.txt
Make sure the firmware file is accessible, e.g.:
ls /root/fw-ConnectX7-rel-28_45_1200-MCX755106AS-xxx.bin
On the node targeted for upgrade, place the node into standby mode to drain all BeeGFS services:
pcs node standby <HOSTNAME>
E.g.: pcs node standby T10PHGXSTOSSER01
Note: This might take a few minutes.
Verify that the node's services have drained by running:
pcs status
Once the node is shown as standby, proceed to the next step.
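Waiting for the drain can be automated with a small polling helper. This is only a sketch: it assumes the node appears in the pcs status output as a line like "* Node <HOSTNAME>: standby", and that a node still draining may carry a suffix such as "(with active resources)", as some Pacemaker versions print. The function name node_is_standby is hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 if the pcs status text on stdin lists node $1 as plain "standby".
# Assumption: the node line ends with "Node <name>: standby". A line with a
# trailing suffix such as "(with active resources)" is deliberately not
# matched, so the check only passes once the node has drained.
node_is_standby() {
    grep -Eq "Node $1: standby[[:space:]]*$"
}

# Example polling loop on a cluster node (requires pcs):
#   until pcs status | node_is_standby "T10PHGXSTOSSER01"; do
#       sleep 10
#   done
```

The exact wording of the node list differs between pcs and Pacemaker releases, so compare the pattern against the real output on the cluster before relying on it.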
Update the firmware using the 'mlxfwmanager' utility:
mlxfwmanager -i </path/to/firmware_file>.bin -u
E.g.: mlxfwmanager -i /root/fw-ConnectX7-rel-28_45_1200-SN37B06010_SN37B06011_AX-UEFI-14.38.16-FlexBoot-3.7.500.signed.bin -u
Before rebooting, disable and stop the force_ib_speed service, then reboot the node:
systemctl disable force_ib_speed
systemctl stop force_ib_speed
reboot
Note: The reboot might take some time (approximately 10 to 15 minutes).
After the reboot, verify that the storage connections reconnected automatically:
systemctl status eseries_nvme_ib
nvme list-subsys
Note: If you do not see the expected number of connections for your cluster, restart the 'eseries_nvme_ib' service and wait for all connections to be established.
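The connection count can be checked with a short helper instead of counting by eye. This sketch assumes that each established path appears in the nvme list-subsys output on its own line containing the word "live"; the function name count_live_paths and the EXPECTED value are hypothetical, and the expected number is specific to each cluster.

```shell
#!/usr/bin/env bash
# Count the paths reported as "live" in nvme list-subsys output on stdin.
# Assumption: every connected path is printed on its own line and that
# line contains the whole word "live".
count_live_paths() {
    grep -cw 'live'
}

# Example use on a node (requires nvme-cli); EXPECTED is a placeholder
# for the number of connections expected on this cluster:
#   live="$(nvme list-subsys | count_live_paths)"
#   if [ "$live" -lt "$EXPECTED" ]; then
#       systemctl restart eseries_nvme_ib
#   fi
```

After a restart of the service, rerun the count until it reaches the expected value.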
To bring the node out of standby, run:
pcs node unstandby <HOSTNAME>
Check the cluster status:
pcs status
ibstat
If this does not work, make sure Pacemaker is running:
systemctl status pacemaker
systemctl start pacemaker (if not started)
Then try step 10 again.
To relocate all the BeeGFS services back to their preferred nodes, run:
pcs resource relocate run
Then check the cluster status:
pcs status
Repeat the above steps for each node until all nodes in the cluster are updated.