Troubleshooting steps if a GPU is not detected

Steps to follow when a GPU is not detected by the system

Reference system used to demonstrate the troubleshooting
  1. System model: ASRock TRX40 Creator (AMD sTRX4 socket, ATX)
  2. GPU: 4 x PNY RTX A6000

Verify the GPUs on the system using the steps below

  1. Run the following command to list the GPUs:
     # nvidia-smi
  2. The output should show every installed GPU. If any GPU is missing, generate the NVIDIA bug report log:
     # nvidia-bug-report.sh
  3. The script generates the file nvidia-bug-report.log.gz and stores it in the current working directory.
  4. Share this file with the vendor so they can analyze the issue.
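
The verification above can also be scripted. The following is a minimal sketch, assuming a Linux system with lspci (from pciutils) and the NVIDIA driver utilities installed. The function names and the EXPECTED_GPUS variable are illustrative and not part of any NVIDIA tool; the default of 4 simply matches the reference system. It compares the number of NVIDIA display devices visible on the PCI bus with the number of GPUs the driver reports.

```shell
#!/bin/sh
# Expected GPU count (assumption: 4, as in the reference system). Override via environment.
EXPECTED_GPUS=${EXPECTED_GPUS:-4}

# Count GPUs in "nvidia-smi -L" output read from stdin (one "GPU N: ..." line per GPU).
count_driver_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Count NVIDIA display devices in "lspci -nn" output read from stdin.
# Class 0300 is a VGA controller, 0302 a 3D controller; 10de is NVIDIA's PCI vendor ID.
count_pci_gpus() {
    grep -E '\[03(00|02)\]' | grep -c '\[10de:'
}

# Only query the hardware when both tools are available.
if command -v nvidia-smi >/dev/null 2>&1 && command -v lspci >/dev/null 2>&1; then
    driver=$(nvidia-smi -L 2>/dev/null | count_driver_gpus)
    pci=$(lspci -nn | count_pci_gpus)
    echo "Expected: $EXPECTED_GPUS, on PCI bus: $pci, seen by driver: $driver"
    if [ "$driver" -lt "$EXPECTED_GPUS" ]; then
        echo "GPU(s) missing: run nvidia-bug-report.sh as root and send the log to the vendor."
    fi
fi
```

As a rough guide, a PCI count that matches the expected count while the driver count is lower points toward the driver, whereas a low PCI count points toward the card, an NVLink bridge, or the slot.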

Level 1 Troubleshooting steps

  1. Open the workstation/PC and locate the PCIe slots.
  2. Remove all GPUs from the system.
  3. Clean the dust from the system and the PCIe slots.
  4. Reinstall the GPUs one at a time, checking the display output of each, to determine which GPU has the issue.
  5. If you identify a faulty GPU, proceed with the RMA process; if not, proceed with Level 2 troubleshooting.
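
When the GPUs are tested one at a time, it helps to record which physical card was in the system for each test. The following is a minimal sketch, assuming a Linux system with the NVIDIA driver installed and exactly one card fitted per test run. The function names are illustrative; the query fields (index, name, serial, uuid, pci.bus_id) are standard nvidia-smi query properties.

```shell
#!/bin/sh
# Print index, name, serial number, UUID, and PCI bus ID for each GPU the driver sees.
gpu_identity() {
    nvidia-smi --query-gpu=index,name,serial,uuid,pci.bus_id --format=csv,noheader
}

# Succeed only if stdin contains exactly one line carrying a GPU UUID ("GPU-...").
exactly_one_gpu() {
    [ "$(grep -c 'GPU-')" -eq 1 ]
}

# Only query the hardware when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    if out=$(gpu_identity 2>/dev/null) && printf '%s\n' "$out" | exactly_one_gpu; then
        echo "PASS: $out"
    else
        echo "FAIL: expected exactly one detected GPU, got: ${out:-none}"
    fi
fi
```

Keeping the PASS and FAIL lines together with each serial number gives the vendor a clear record if an RMA is needed.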

Level 2 Troubleshooting steps

  1. Step 1. Verify the GPU driver: check for mismatches between the components of the currently installed display driver. To rule these out, fully uninstall the current driver along with any remaining traces of earlier drivers, then install the latest driver version and reboot the system.
  2. Step 2. If nothing changes, the issue is most likely hardware related. The missing VGA device is probably in one of the PCIe slots farthest from the CPU (slot # 3 or # 4 on the photo below), assuming the system BIOS assigns resources in the usual way.

     The configuration may have NVLink bridges connecting the RTX A6000s in pairs, so a faulty bridge can affect the GPUs it connects. With two bridges installed, remove the second one and check the system: does the fourth VGA device reappear? If not, repeat the procedure with the first NVLink bridge.
  3. Step 3. Otherwise, the source of the issue is one of the GPU cards, which has stopped working, or possibly the PCIe slot itself.
     1. Remove the VGA device in PCIe slot # 3 (PCI:75:0:0) from the system, although the faulty device could also be the one in PCIe slot # 4.
     2. In the first case, the removed unit is likely defective; otherwise, the motherboard may have a problem with that PCIe slot (which would need further investigation).
     3. The diagnosis can be double-checked by installing the seemingly faulty RTX A6000 on its own in a second workstation.
  4. If all of the above points to a defective GPU card (or NVLink bridge), the usual next step is to proceed with an RMA procedure.
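
The Level 2 checks can be partly automated. The following is a minimal sketch, assuming a Linux system with lspci and the NVIDIA driver utilities installed, and assuming that PCI:75:0:0 is written in the decimal X.Org BusID notation, so that bus 75 corresponds to 4b in the hexadecimal notation lspci uses. The function names are illustrative and not part of any NVIDIA tool.

```shell
#!/bin/sh
# Convert an X.Org style bus ID ("PCI:75:0:0", decimal) to lspci notation ("4b:00.0", hex).
xorg_to_lspci() {
    echo "$1" | awk -F: '{ printf "%02x:%02x.%x\n", $2, $3, $4 }'
}

# Succeed if a device is enumerated at the given lspci address.
slot_present() {
    [ -n "$(lspci -s "$1" 2>/dev/null)" ]
}

# Succeed if stdin contains the message nvidia-smi prints when the kernel
# module and the user-space driver library have different versions.
has_driver_mismatch() {
    grep -q 'Driver/library version mismatch'
}

# Only query the hardware when both tools are available.
if command -v nvidia-smi >/dev/null 2>&1 && command -v lspci >/dev/null 2>&1; then
    addr=$(xorg_to_lspci "PCI:75:0:0")
    if nvidia-smi 2>&1 | has_driver_mismatch; then
        echo "Driver components disagree: reinstall the driver and reboot (Step 1)."
    fi
    if slot_present "$addr"; then
        echo "Device present at $addr"
    else
        echo "No device at $addr: suspect the card, its NVLink bridge, or the slot."
    fi
    # Show the per-GPU NVLink link state to help judge the bridges (Step 2).
    nvidia-smi nvlink --status
fi
```

Running the script before and after removing a bridge or a card shows whether the device at the suspect address reappears.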