GPU vBIOS Verification Across HGX Nodes

1. Purpose

This document provides a procedure to verify GPU vBIOS versions across all HGX nodes in the cluster using an automated script.

The script connects to each node from the login node using passwordless SSH and retrieves the GPU vBIOS version using nvidia-smi.
It then compares the retrieved version with the expected target vBIOS and categorizes nodes accordingly.

This helps identify nodes that:

  • Have already been updated

  • Still have the older vBIOS

  • Are unreachable


2. Scope

This procedure applies to:

  • HGX cluster nodes

  • Nodes accessible via passwordless SSH

  • Systems running NVIDIA GPUs with nvidia-smi available


3. Environment Details



Parameter          Value
---------          -----
Cluster Type       HGX
Total Nodes        55
Node IP Range      10.152.241.101 – 10.152.241.155
Access Node        Login Node
Access Method      Passwordless SSH

4. vBIOS Version Reference



Description        Version
-----------        -------
Current vBIOS      96.00.A5.00.01
Target vBIOS       96.00.D0.00.02

5. Prerequisites

Before executing the script, ensure the following:

  1. Passwordless SSH access is configured

    ssh <node_ip>
  2. NVIDIA drivers are installed

    nvidia-smi
  3. The command below works on the nodes:

      nvidia-smi -q | grep -i "VBIOS Version"
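
The grep/awk extraction used later in the verification script can be sanity-checked locally without cluster access. The sample line below is illustrative only; it mimics the shape of a typical `nvidia-smi -q` output line:

```shell
# Illustrative sample line in the shape produced by `nvidia-smi -q`
SAMPLE='    VBIOS Version                         : 96.00.D0.00.02'

# Same extraction pipeline the verification script runs on each node
VBIOS=$(echo "$SAMPLE" | grep -i 'VBIOS Version' | head -n1 | awk -F': ' '{print $2}')
echo "$VBIOS"
```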

6. Script for Cluster-wide vBIOS Verification

Create a script named: check_vbios_versions.sh

check_vbios_versions.sh
#!/bin/bash

# Node range and expected vBIOS versions
START=101
END=155
BASE_IP="10.152.241"

CURRENT="96.00.A5.00.01"   # previous vBIOS
TARGET="96.00.D0.00.02"    # target vBIOS

printf "%-18s %-20s %-15s\n" "NODE IP" "VBIOS VERSION" "STATUS"
printf "%-18s %-20s %-15s\n" "-------" "-------------" "------"

for i in $(seq "$START" "$END")
do
    NODE="$BASE_IP.$i"

    # Query the first GPU's vBIOS version over SSH.
    # BatchMode prevents a password prompt from hanging the loop
    # if passwordless SSH is misconfigured on a node.
    VBIOS=$(ssh -o ConnectTimeout=3 -o BatchMode=yes "$NODE" \
        "nvidia-smi -q | grep -i 'VBIOS Version' | head -n1 | awk -F': ' '{print \$2}'" 2>/dev/null)

    if [ -z "$VBIOS" ]; then
        STATUS="UNREACHABLE"
        VBIOS="N/A"
    elif [ "$VBIOS" = "$TARGET" ]; then
        STATUS="UPDATED"
    elif [ "$VBIOS" = "$CURRENT" ]; then
        STATUS="OLD"
    else
        STATUS="UNKNOWN"
    fi

    printf "%-18s %-20s %-15s\n" "$NODE" "$VBIOS" "$STATUS"
done
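
On a 55-node cluster the serial loop above can take a while if several nodes time out. One option (a sketch, not part of the standard procedure) is to fan the checks out with `xargs -P`. Here `check_one` is a hypothetical wrapper: in practice its body would contain the same `ssh`/`nvidia-smi` pipeline used in the script above, and `echo` stands in so the sketch runs without cluster access:

```shell
#!/bin/bash
# Hypothetical parallel variant of the per-node check.
# check_one is a stand-in: replace the echo with the ssh/nvidia-smi
# pipeline from check_vbios_versions.sh to use it for real.
check_one() {
    NODE="$1"
    echo "$NODE checked"    # stand-in for the real SSH query
}
export -f check_one

# Build the node list and run up to 8 checks in parallel
seq 101 155 | sed 's/^/10.152.241./' | xargs -P 8 -n 1 bash -c 'check_one "$0"'
```

Note that with parallel execution the output lines arrive in completion order, not IP order, so sort the report afterwards if ordering matters.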

7. Execution Steps

Step 1: Create the Script

vi check_vbios_versions.sh

Paste the script content and save the file.


Step 2: Grant Execute Permission

chmod +x check_vbios_versions.sh

Step 3: Execute the Script

./check_vbios_versions.sh
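
Once a run has been captured to a file, a short awk pass gives per-status counts. The snippet below builds a small sample report in a hypothetical file (`/tmp/vbios_report.txt`) for illustration; in practice you would redirect the script's output there instead:

```shell
# Sample report (in practice: ./check_vbios_versions.sh > /tmp/vbios_report.txt)
cat > /tmp/vbios_report.txt <<'EOF'
NODE IP            VBIOS VERSION        STATUS
-------            -------------        ------
10.152.241.101     96.00.A5.00.01       OLD
10.152.241.102     96.00.D0.00.02       UPDATED
10.152.241.103     96.00.A5.00.01       OLD
EOF

# Skip the two header lines, then count each STATUS value
awk 'NR>2 {count[$3]++} END {for (s in count) print s": "count[s]}' /tmp/vbios_report.txt
```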

8. Sample Output

NODE IP            VBIOS VERSION        STATUS
-------            -------------        ------
10.152.241.101     96.00.A5.00.01       OLD
10.152.241.102     96.00.D0.00.02       UPDATED
10.152.241.103     96.00.A5.00.01       OLD
10.152.241.104     96.00.D0.00.02       UPDATED
10.152.241.105     96.00.BC.00.04       UNKNOWN


9. Status Definitions

Status         Description
------         -----------
UPDATED        Node has the target vBIOS version
OLD            Node is running the previous vBIOS
UNKNOWN        Detected vBIOS does not match either expected version
UNREACHABLE    Node did not respond over SSH; vBIOS could not be queried

10. Validation

To manually verify a node:

ssh <node_ip>

Then run:

nvidia-smi -q | grep -i vbios

Expected output example:

VBIOS Version: 96.00.D0.00.02
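
To build a follow-up list of nodes that still need attention, filter the report on the STATUS column. The snippet below uses a small sample report and hypothetical file paths (`/tmp/vbios_report.txt`, `nodes_to_update.txt`) for illustration:

```shell
# Sample report (in practice: ./check_vbios_versions.sh > /tmp/vbios_report.txt)
cat > /tmp/vbios_report.txt <<'EOF'
NODE IP            VBIOS VERSION        STATUS
-------            -------------        ------
10.152.241.101     96.00.A5.00.01       OLD
10.152.241.102     96.00.D0.00.02       UPDATED
10.152.241.105     N/A                  UNREACHABLE
EOF

# Keep every node that is not yet on the target vBIOS
# (OLD, UNKNOWN, and UNREACHABLE all need follow-up)
awk 'NR>2 && $3!="UPDATED" {print $1}' /tmp/vbios_report.txt > nodes_to_update.txt
cat nodes_to_update.txt
```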
