Troubleshooting steps if a GPU is not detected
Steps to check when a GPU is not detected by the system
System configuration used to demonstrate the troubleshooting
- System model: ASRock TRX40 Creator (AMD sTRX4 socket, ATX)
- GPU: PNY RTX A6000 x 4
Verify the GPUs on the system using the steps below
- Run the following command to list the GPUs.
- # nvidia-smi
- The output lists all GPUs the driver has detected. If any installed GPU is missing from the list, generate the diagnostic logs.
- # nvidia-bug-report.sh
- This script generates the file nvidia-bug-report.log.gz and stores it in the current working directory.
- You can share this file with the vendor so they can analyze the issue.
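The check above can be scripted. The following Python sketch queries `nvidia-smi` in CSV form and compares the number of detected GPUs against the number installed. The expected count of 4 is taken from the example system, and the function names (`parse_gpu_list`, `missing_count`, `main`) are hypothetical helpers, not part of any NVIDIA tool.

```python
import subprocess

EXPECTED_GPUS = 4  # assumption: the four RTX A6000 cards of the example system

QUERY = ["nvidia-smi", "--query-gpu=index,pci.bus_id,name", "--format=csv,noheader"]


def parse_gpu_list(csv_text):
    """Parse lines such as '0, 00000000:01:00.0, NVIDIA RTX A6000'."""
    gpus = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",", 2)]
        if len(fields) != 3:
            continue  # skip anything that is not a GPU row
        gpus.append({"index": int(fields[0]), "bus_id": fields[1], "name": fields[2]})
    return gpus


def missing_count(gpus, expected=EXPECTED_GPUS):
    """Return how many expected GPUs were not reported."""
    return max(0, expected - len(gpus))


def main():
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"nvidia-smi could not be run: {err}")
        return
    gpus = parse_gpu_list(result.stdout)
    for gpu in gpus:
        print(f"GPU {gpu['index']}: {gpu['name']} at {gpu['bus_id']}")
    missing = missing_count(gpus)
    if missing:
        print(f"{missing} GPU(s) not detected; run nvidia-bug-report.sh as root.")
    else:
        print("All expected GPUs detected.")


if __name__ == "__main__":
    main()
```

The script only reports the count; collecting the log with nvidia-bug-report.sh remains a manual step run as root.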
Level 1 Troubleshooting steps
- Power off the workstation/PC, open it, and locate the PCIe slots.
- Remove all GPUs from the system.
- Clean the dust from the system and the PCIe slots.
- Test the GPUs one at a time, checking that each one is detected and produces display output, to determine which GPU has the issue.
- If you find a GPU that has the issue, proceed with the RMA process; if not, proceed with Level 2 troubleshooting.
Level 2 Troubleshooting steps
- Step 1. Verify the GPU driver: check for version mismatches between the components of the currently installed display driver. To rule these out, cleanly uninstall the current driver along with any leftover traces of earlier drivers, then install the latest driver version and reboot the system.
- Step 2. If nothing changes, the issue is most likely hardware related. The missing VGA device is probably in one of the PCIe slots farthest from the CPU, i.e. slot #3 or #4 in the photo below, assuming the system BIOS assigns resources in the usual way.
The configuration may have NVLink bridges connecting the RTX A6000s in pairs, so a faulty bridge may affect the GPUs it connects. With two bridges installed, remove the second one first and check the system: does the fourth VGA device reappear? If not, repeat the procedure with the first NVLink bridge.
- Step 3. If the issue persists, the cause is most likely a failed GPU card, unless the PCIe slot itself is the culprit.
- Remove the VGA device in PCIe slot #3 (PCI:75:0:0) from the system; note that the affected device could instead be the one in PCIe slot #4.
- If the fault follows the removed card, that unit is defective; if it stays with the slot, the motherboard may have an issue with that PCIe slot, which needs further investigation.
- The diagnosis can be double-checked by installing the seemingly faulty RTX A6000 alone in a second workstation.
- If all of the above points to a defective GPU card (or NVLink bridge), the next step is to proceed with an RMA procedure.
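To narrow down which slot is affected before pulling cards, it can help to compare the NVIDIA devices visible on the PCIe bus (`lspci`) with those the driver reports (`nvidia-smi`). The Python sketch below does this and prints each missing device in the decimal PCI:bus:device:function form. It assumes that the PCI:75:0:0 notation used in Step 3 is that decimal form (hex bus 4b is decimal 75); the helper names are hypothetical.

```python
import subprocess

LSPCI_CMD = ["lspci", "-D", "-n", "-d", "10de:"]  # 10de is the NVIDIA PCI vendor ID
SMI_CMD = ["nvidia-smi", "--query-gpu=pci.bus_id", "--format=csv,noheader"]


def normalize_bus_id(bus_id):
    """Turn '0000:4b:00.0' or '00000000:4B:00.0' into a tuple of integers."""
    domain, bus, dev_fn = bus_id.strip().split(":")
    device, function = dev_fn.split(".")
    return (int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))


def parse_lspci(text):
    """Keep display-class devices (class code 03xx), skipping audio functions."""
    ids = set()
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].startswith("03"):
            ids.add(normalize_bus_id(parts[0]))
    return ids


def parse_smi(text):
    """Parse one bus ID per line as printed by nvidia-smi."""
    return {normalize_bus_id(line) for line in text.splitlines() if line.strip()}


def find_missing(on_bus, in_driver):
    """Devices present on the PCIe bus but not reported by the driver."""
    return sorted(on_bus - in_driver)


def to_xorg_bus_id(bus_id):
    """Format as decimal PCI:bus:device:function (the domain is omitted)."""
    _, bus, device, function = bus_id
    return f"PCI:{bus}:{device}:{function}"


def main():
    try:
        lspci_out = subprocess.run(LSPCI_CMD, capture_output=True, text=True, check=True).stdout
        smi_out = subprocess.run(SMI_CMD, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"Could not query devices: {err}")
        return
    for bus_id in find_missing(parse_lspci(lspci_out), parse_smi(smi_out)):
        print(f"On the PCIe bus but not reported by the driver: {to_xorg_bus_id(bus_id)}")


if __name__ == "__main__":
    main()
```

A card that appears in neither list is not visible on the PCIe bus at all, so this comparison cannot locate it; in that case the physical checks in the steps above still apply.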