How to Collect InfiniBand Transceiver Temperature (Non-Root Method)

1. Purpose

This document explains how to collect InfiniBand (IB) transceiver (QSFP module) temperature from servers using the Linux sysfs hwmon interface without requiring root privileges.

This method applies to systems using Mellanox / NVIDIA mlx5-based adapters (e.g., ConnectX series).


2. Scope

  • HGX Nodes

  • Management Nodes

  • Any server with mlx5_* IB devices

This method is appropriate when:

  • Root access is not available

  • Only temperature verification is required


3. Prerequisites

  • SSH access to the server

  • IB driver loaded (mlx5_core)

  • /sys/class/infiniband/ available

Verify IB devices:

ls /sys/class/infiniband/

Expected output:

mlx5_0 mlx5_1 mlx5_2 ...
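Whether a given mlx5 device exposes hwmon temperature files can depend on the kernel and driver version, so it may be worth confirming this before collecting data. The following is a minimal sketch under that assumption; the function name check_ib_hwmon and its optional root-directory argument are hypothetical helpers introduced here for illustration (the argument defaults to the real sysfs path).

```shell
# Report, for each mlx5 device, whether hwmon temperature inputs are present.
# The optional first argument (sysfs root) is a hypothetical override for testing.
check_ib_hwmon() {
    root="${1:-/sys/class/infiniband}"
    for dev in "$root"/mlx5_*; do
        [ -d "$dev" ] || continue              # glob matched nothing
        found=no
        for f in "$dev"/device/hwmon/hwmon*/temp*_input; do
            if [ -e "$f" ]; then
                found=yes
                break
            fi
        done
        echo "$(basename "$dev"): hwmon=$found"
    done
}

check_ib_hwmon
```

Devices reported with hwmon=no cannot be read with the method in this document.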


4. Collect Transceiver Temperature (Single Device)

Example for mlx5_0:

cat /sys/class/infiniband/mlx5_0/device/hwmon/hwmon*/temp*_label
cat /sys/class/infiniband/mlx5_0/device/hwmon/hwmon*/temp*_input

Sample Output (all labels are printed first, followed by the values in the same order)

asic
Module0
65000
62000

Interpretation

  • asic → HCA chip temperature

  • Module0 / Module1 → Transceiver temperature

  • Values are in millidegrees Celsius (divide by 1000 to get °C)

Convert by dividing the raw value by 1000:

62000 ÷ 1000 = 62°C
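The label/value pairing and the conversion can be combined so that each sensor is printed on one line in °C. This is a sketch, not a definitive tool: the function names mc_to_c and show_ib_temps are hypothetical, and the second argument of show_ib_temps is an assumed override of the sysfs root that exists only to make the function testable.

```shell
# Convert a millidegree Celsius reading to degrees with one decimal place.
mc_to_c() {
    awk -v mc="$1" 'BEGIN { printf "%.1f", mc / 1000 }'
}

# Print "label<TAB>temperature" for every sensor of one device.
# The second argument (sysfs root) is a hypothetical override for testing.
show_ib_temps() {
    dev="$1"
    root="${2:-/sys/class/infiniband}"
    for label in "$root/$dev"/device/hwmon/hwmon*/temp*_label; do
        [ -r "$label" ] || continue            # glob matched nothing
        input="${label%_label}_input"          # tempN_label -> tempN_input
        if raw=$(cat "$input" 2>/dev/null); then
            printf '%s\t%s°C\n' "$(cat "$label")" "$(mc_to_c "$raw")"
        else
            printf '%s\tunreadable\n' "$(cat "$label")"
        fi
    done
}

show_ib_temps mlx5_0
```

Reading each tempN_input next to its own tempN_label keeps the name and value together even if one sensor cannot be read.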


5. Collect From All IB Interfaces with Timestamp

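echo "===== IB Temperature Snapshot ====="
date
for dev in /sys/class/infiniband/mlx5_*; do
    echo "===== $dev ====="
    # Pair each label with its own reading so an unreadable sensor cannot misalign the columns
    for label in "$dev"/device/hwmon/hwmon*/temp*_label; do
        [ -r "$label" ] || continue
        printf '%s\t%s\n' "$(cat "$label")" "$(cat "${label%_label}_input" 2>/dev/null || echo N/A)"
    done
    echo ""
done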

To save to file:

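(
echo "===== IB Temperature Snapshot ====="
date
for dev in /sys/class/infiniband/mlx5_*; do
    echo "===== $dev ====="
    # Pair each label with its own reading so an unreadable sensor cannot misalign the columns
    for label in "$dev"/device/hwmon/hwmon*/temp*_label; do
        [ -r "$label" ] || continue
        printf '%s\t%s\n' "$(cat "$label")" "$(cat "${label%_label}_input" 2>/dev/null || echo N/A)"
    done
    echo ""
done
) > "$(hostname)_ib_temp_$(date +%F_%H-%M-%S).log"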
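If snapshots from several hosts need to be compared later, a one-row-per-sensor layout may be easier to parse than the banner format. The sketch below assumes a CSV layout of timestamp, host, device, label, and raw millidegree value; the function name ib_temp_csv, the column order, and the optional sysfs-root argument are illustrative choices, not part of the original procedure.

```shell
# Emit one CSV row per sensor: timestamp,host,device,label,millidegrees.
# The optional first argument (sysfs root) is a hypothetical override for testing.
ib_temp_csv() {
    root="${1:-/sys/class/infiniband}"
    ts=$(date +%FT%T)
    host=$(uname -n)
    for label in "$root"/mlx5_*/device/hwmon/hwmon*/temp*_label; do
        [ -r "$label" ] || continue            # glob matched nothing
        raw=$(cat "${label%_label}_input" 2>/dev/null) || raw=NA
        dev=${label#"$root"/}                  # strip the root prefix
        dev=${dev%%/*}                         # keep only the device name
        echo "$ts,$host,$dev,$(cat "$label"),$raw"
    done
}

ib_temp_csv
```

The output can be redirected to a file in the same way as the snapshot, for example by appending `>> ib_temp.csv` to the call.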


7. Temperature Threshold Reference

Component     Normal     Warning   Critical
ASIC          45–75°C    >85°C     >95°C
Transceiver   30–70°C    >75°C     >80–85°C

If transceiver temperature exceeds 75°C:

  • Check airflow

  • Check fan speed

  • Verify BMC sensor status

  • Escalate if persistent
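The thresholds in the reference table can be applied to a raw reading with a small helper. This is a sketch under stated assumptions: the function name classify_temp is hypothetical, the transceiver critical threshold is taken as the lower bound (80°C) of the 80–85°C range, and readings between the normal range and the warning threshold are reported as OK. Adjust the values to local policy.

```shell
# Classify a raw reading (millidegrees Celsius) against the reference table.
# Assumption: 80°C, the lower bound of the 80–85°C range, is treated as the
# critical threshold for transceivers.
classify_temp() {
    kind="$1"    # "asic" or "module"
    mc="$2"      # raw value from tempN_input
    case "$kind" in
        asic) warn=85000; crit=95000 ;;
        *)    warn=75000; crit=80000 ;;
    esac
    if [ "$mc" -gt "$crit" ]; then
        echo CRITICAL
    elif [ "$mc" -gt "$warn" ]; then
        echo WARNING
    else
        echo OK
    fi
}

classify_temp module 62000    # prints OK
```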

8. Validation

Always cross-check:

  • BMC sensor readings

  • System event logs

  • dmesg for PCIe or thermal warnings
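For the dmesg check, a simple filter can narrow the kernel log to lines that are likely to be relevant. The pattern below is an illustrative assumption rather than an exhaustive list of messages, and the function name filter_thermal_pcie is hypothetical. On systems where kernel.dmesg_restrict is enabled, dmesg may not be readable without root, in which case the fallback message is printed.

```shell
# Keep only log lines that mention thermal events, PCIe, or the mlx5 driver.
# The pattern is an assumption for illustration; extend it as needed.
filter_thermal_pcie() {
    grep -iE 'thermal|temperature|pcie|mlx5'
}

dmesg 2>/dev/null | filter_thermal_pcie || echo "No matching lines (or dmesg is restricted for this user)"
```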
