This document explains how to collect InfiniBand (IB) transceiver (QSFP module) temperature from servers using the Linux sysfs hwmon interface without requiring root privileges.
This method applies to systems using Mellanox / NVIDIA mlx5-based adapters (e.g., ConnectX series), such as:
HGX Nodes
Management Nodes
Any server with mlx5_* IB devices
Works when:
Root access is not available
Only temperature verification is required
Prerequisites:
SSH access to the server
IB driver loaded (mlx5_core)
/sys/class/infiniband/ available
Verify IB devices:
ls /sys/class/infiniband/
Expected output:
mlx5_0 mlx5_1 mlx5_2 ...
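The prerequisite checks above can be combined into one small helper. This is a minimal sketch, not an official tool: the function name check_ib_prereqs is hypothetical, and the optional sysfs root argument is an assumption added only so the function can be tried against a test directory.

```shell
# Sketch: verify the prerequisites for reading IB temperatures via sysfs.
# check_ib_prereqs is a hypothetical helper name. The optional argument
# (sysfs root, default /sys) exists only for testing against a fake tree.
check_ib_prereqs() {
  sys_root=${1:-/sys}
  # A loaded kernel module has a directory under /sys/module
  if [ ! -d "$sys_root/module/mlx5_core" ]; then
    echo "mlx5_core driver not loaded"
    return 1
  fi
  # At least one mlx5_* device must be present under class/infiniband
  for dev in "$sys_root"/class/infiniband/mlx5_*; do
    if [ -e "$dev" ]; then
      echo "OK: found ${dev##*/}"
      return 0
    fi
  done
  echo "no mlx5_* devices under $sys_root/class/infiniband"
  return 1
}
```

Running check_ib_prereqs with no argument checks the live system and returns a non-zero status if either prerequisite is missing.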
Example for mlx5_0:
cat /sys/class/infiniband/mlx5_0/device/hwmon/hwmon*/temp*_label
cat /sys/class/infiniband/mlx5_0/device/hwmon/hwmon*/temp*_input
Expected output (all labels first, then the readings in the same order):
asic
Module0
65000
62000
asic → HCA chip temperature
Module0 / Module1 → Transceiver temperature
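When only the transceiver readings matter, the labels can be used as a filter. The following is a sketch under the assumption that transceiver sensors are labeled with a "Module" prefix, as in the example above; module_temps is a hypothetical helper name.

```shell
# Sketch: print "label<TAB>millidegrees" for the transceiver sensors of one
# IB device. module_temps is a hypothetical name; the "Module" label prefix
# is taken from the example output and may differ on other hardware.
module_temps() {
  dev_dir=$1   # e.g. /sys/class/infiniband/mlx5_0
  for label_file in "$dev_dir"/device/hwmon/hwmon*/temp*_label; do
    [ -r "$label_file" ] || continue
    label=$(cat "$label_file")
    case $label in
      Module*)
        # tempN_label pairs with tempN_input in the same hwmon directory
        value=$(cat "${label_file%_label}_input" 2>/dev/null) || value="n/a"
        printf '%s\t%s\n' "$label" "$value"
        ;;
    esac
  done
}
```

For example, module_temps /sys/class/infiniband/mlx5_0 lists only the Module* sensors of that adapter and skips the asic entry.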
Values are in millidegrees Celsius.
To convert, divide by 1000:
62000 / 1000 = 62°C
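The conversion can be done in the shell with integer arithmetic. This is a small sketch: millideg_to_c is a hypothetical helper name, it assumes a non-negative reading, and it truncates (does not round) to one decimal place.

```shell
# Sketch: convert a hwmon reading in millidegrees Celsius to degrees Celsius.
# millideg_to_c is a hypothetical name. Assumes a non-negative integer input
# and truncates to one decimal place.
millideg_to_c() {
  m=$1
  printf '%d.%d\n' "$(( $m / 1000 ))" "$(( ($m % 1000) / 100 ))"
}
```

For example, millideg_to_c 62000 prints 62.0.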
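To print a snapshot for all IB devices:
echo "===== IB Temperature Snapshot ====="
date
for i in /sys/class/infiniband/mlx5_*; do
  echo "===== $i ====="
  # Pair each label with its own reading so an unreadable sensor cannot shift the columns
  for l in "$i"/device/hwmon/hwmon*/temp*_label; do
    [ -r "$l" ] || continue
    printf '%s\t%s\n' "$(cat "$l")" "$(cat "${l%_label}_input" 2>/dev/null)"
  done
  echo ""
done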
To save to file:
(echo "===== IB Temperature Snapshot ====="
date
for i in /sys/class/infiniband/mlx5_*; do
  echo "===== $i ====="
  # Read each label together with its matching input file
  for l in "$i"/device/hwmon/hwmon*/temp*_label; do
    [ -r "$l" ] || continue
    printf '%s\t%s\n' "$(cat "$l")" "$(cat "${l%_label}_input" 2>/dev/null)"
  done
  echo ""
done
) > "$(hostname)_ib_temp_$(date +%F_%H-%M-%S).log"
| Component | Normal | Warning | Critical |
|---|---|---|---|
| ASIC | 45–75°C | >85°C | >95°C |
| Transceiver | 30–70°C | >75°C | >80–85°C |
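The transceiver row of the table can be turned into a simple check. This is a sketch only: classify_transceiver is a hypothetical name, the critical cutoff of 80°C is an assumption (the lower end of the 80–85°C range), and readings between 70°C and 75°C, which the table does not classify, are reported as normal here.

```shell
# Sketch: classify a transceiver reading against the table's thresholds.
# classify_transceiver is a hypothetical name. Assumptions: critical starts
# above 80000 (lower end of the table's 80-85 range); anything at or below
# 75000 is reported as normal, including the unclassified 70-75 band.
classify_transceiver() {
  m=$1   # millidegrees Celsius, as read from temp*_input
  if [ "$m" -gt 80000 ]; then
    echo critical
  elif [ "$m" -gt 75000 ]; then
    echo warning
  else
    echo normal
  fi
}
```

For example, classify_transceiver 62000 prints normal, and classify_transceiver 76000 prints warning. Module datasheets may specify different limits, so the cutoffs should be adjusted to the hardware in use.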
If transceiver temperature exceeds 75°C:
Check airflow
Check fan speed
Verify BMC sensor status
Escalate if persistent
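One way to decide whether a high reading is persistent is to sample the sensor several times before escalating. This is a sketch under assumptions: is_persistent is a hypothetical name, and the defaults of 5 samples taken 60 seconds apart are illustrative values, not a site policy.

```shell
# Sketch: return 0 only if every sample exceeds the threshold.
# Usage: is_persistent THRESHOLD_MILLIDEG INPUT_FILE [SAMPLES] [INTERVAL_SECONDS]
# is_persistent is a hypothetical name; the defaults (5 samples, 60 s apart)
# are assumptions chosen for illustration.
is_persistent() {
  threshold=$1
  input_file=$2
  samples=${3:-5}
  interval=${4:-60}
  n=0
  while [ "$n" -lt "$samples" ]; do
    value=$(cat "$input_file" 2>/dev/null) || return 2  # sensor unreadable
    [ "$value" -gt "$threshold" ] || return 1           # dropped back: not persistent
    n=$((n + 1))
    [ "$n" -lt "$samples" ] && sleep "$interval"
  done
  return 0
}
```

The input file is one temp*_input path of the affected sensor. A return status of 0 means all samples were above the threshold, 1 means at least one sample was at or below it, and 2 means the sensor could not be read.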
Always cross-check:
BMC sensor readings
System event logs
dmesg for PCIe or thermal warnings
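For the dmesg check, a filter over the kernel log can narrow the output. This is a sketch: thermal_lines is a hypothetical name and the search pattern is an assumption about which keywords are relevant, not an exhaustive list. Note that on systems where kernel.dmesg_restrict is set to 1, dmesg itself may be denied to unprivileged users.

```shell
# Sketch: keep only kernel log lines that mention thermal or PCIe keywords.
# thermal_lines is a hypothetical name; the pattern is an assumption and can
# be extended. Reads log text on stdin, matches case-insensitively.
thermal_lines() {
  grep -iE 'thermal|temperature|pcie|mlx5'
}
```

Typical usage is dmesg | thermal_lines. No output means no matching lines were found, or that the log could not be read.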