Using dmesg and Kernel Module Checks to Troubleshoot NVIDIA GPU Issues

Overview

This article outlines how to use dmesg logs and kernel module commands to diagnose cases where the operating system fails to detect NVIDIA GPUs even though they appear in the system's BMC (Baseboard Management Controller).

When to Use This

You may need this guide if:
  • nvidia-smi finds no devices, or the kernel log shows NVRM: No NVIDIA GPU found
  • GPUs are visible in BMC or IPMI but not in the OS
  • You're seeing driver or hardware initialization errors during boot

Step 1: Check Kernel Ring Buffer (dmesg) for Errors

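The dmesg command prints the kernel ring buffer, which records driver and hardware messages that are useful for low-level diagnostics. The command below keeps lines that mention the GPU stack and also contain a failure keyword:

    sudo dmesg | grep -iE "nvidia|gpu|pci|acpi|firmware" | grep -iE "fail|error|not found|no nvidia gpu|not enabled|disable"
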
What to Look For:
  • "module verification failed": The NVIDIA module is unsigned or signed with a key the kernel does not trust. With Secure Boot enabled, the kernel refuses to load it (see Step 3).
  • "No NVIDIA GPU found": The NVIDIA driver was loaded but no GPUs were detected on the PCI bus.
  • "firmware: failed to load": A required firmware file is missing.
  • "ACPI: device not enabled": BIOS/UEFI settings may have disabled PCIe devices.
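The mapping above can be sketched as a small shell filter. This is a minimal sketch rather than a complete classifier: the function name classify_gpu_dmesg and the hint labels are hypothetical, and the patterns cover only the four message fragments listed above, so real kernel logs may word these differently.

```shell
# Hypothetical helper: read kernel log lines on stdin and print a hint
# for each of the four message fragments listed above.
classify_gpu_dmesg() {
    while IFS= read -r line; do
        # Lowercase the line so matching is case-insensitive.
        lower=$(printf '%s' "$line" | tr '[:upper:]' '[:lower:]')
        case "$lower" in
            *"module verification failed"*)
                echo "SIGNATURE: module is unsigned or its key is not trusted" ;;
            *"no nvidia gpu found"*)
                echo "NO_GPU: driver loaded but found no GPU on the PCI bus" ;;
            *"firmware"*"failed to load"*)
                echo "FIRMWARE: a required firmware file is missing" ;;
            *"acpi"*"not enabled"*)
                echo "ACPI: device may be disabled in BIOS/UEFI" ;;
        esac
    done
}

# Usage (reading dmesg may require root):
#   sudo dmesg | classify_gpu_dmesg
```

Lines that match none of the patterns produce no output, so an empty result only means that none of these four messages appeared.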

Step 2: Check if the NVIDIA Kernel Module is Loaded

Linux uses kernel modules (drivers) to interface with hardware like NVIDIA GPUs.

    lsmod | grep nvidia

If you get output like:

    nvidia_uvm 860160 0
    nvidia_drm 57344 2
    nvidia_modeset 1114112 4 nvidia_drm
    nvidia 23474176 260 nvidia_uvm,nvidia_modeset

→ The driver is loaded and likely functioning.

If there is no output:

The module didn’t load. Check dmesg again or try loading it manually:

    sudo modprobe nvidia
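To make this check scriptable, the following sketch reads lsmod output on stdin and lists which of the four modules from the sample output are absent. The function name missing_nvidia_modules is hypothetical. Not every installation loads all four modules at boot (some setups load nvidia_uvm only on demand), so a missing entry is a hint rather than proof of a fault.

```shell
# Hypothetical helper: print each expected NVIDIA module that does not
# appear in the first column of lsmod output (read from stdin).
missing_nvidia_modules() {
    # The first column of lsmod output is the module name.
    loaded=$(awk '{print $1}')
    for mod in nvidia nvidia_modeset nvidia_drm nvidia_uvm; do
        # -x matches the whole line, so "nvidia" does not match "nvidia_drm".
        printf '%s\n' "$loaded" | grep -qx "$mod" || echo "$mod"
    done
}

# Usage:
#   lsmod | missing_nvidia_modules
```

No output means all four modules are loaded.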

Step 3: Check for Secure Boot Blocking the Driver

Secure Boot can prevent unsigned kernel modules like NVIDIA's from loading. Check its state with:

    mokutil --sb-state

  • If the output is SecureBoot enabled, Secure Boot is active and may be rejecting the module.
  • To fix this, disable Secure Boot in the BIOS/UEFI setup or sign the NVIDIA module with a MOK (Machine Owner Key).
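The interpretation of the mokutil output can be sketched as follows. The function name secure_boot_hint and the wording of the hints are hypothetical. The sketch assumes mokutil prints either "SecureBoot enabled" or "SecureBoot disabled" and treats anything else, such as an error on a system without EFI variables, as unknown.

```shell
# Hypothetical helper: read `mokutil --sb-state` output on stdin and
# print a one-line hint about whether Secure Boot could be the cause.
secure_boot_hint() {
    state=$(cat)
    case "$state" in
        *"SecureBoot enabled"*)
            echo "enabled: sign the NVIDIA module or disable Secure Boot" ;;
        *"SecureBoot disabled"*)
            echo "disabled: Secure Boot is not blocking the module" ;;
        *)
            echo "unknown: could not determine Secure Boot state" ;;
    esac
}

# Usage:
#   mokutil --sb-state | secure_boot_hint
```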

Step 4: Check PCI Bus for NVIDIA Devices

This confirms whether the GPU is physically present and enumerated at the PCIe level.

    lspci | grep -i nvidia

If no output appears, the GPU was not enumerated on the PCIe bus. Potential causes:

  • Hardware issue

  • GPU seated improperly

  • PCIe slot/bifurcation misconfigured in BIOS

  • Faulty riser or cable
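When the BMC reports a known number of GPUs, the PCIe check can be turned into a count comparison. The sketch below is an illustration under assumptions: the function name check_gpu_count is hypothetical, and it assumes GPUs appear in lspci output as "VGA compatible controller" or "3D controller" entries, which excludes the audio function that many NVIDIA cards also expose.

```shell
# Hypothetical helper: compare the number of NVIDIA GPU entries in lspci
# output (stdin) with the GPU count reported by the BMC (first argument).
check_gpu_count() {
    expected=$1
    # Keep NVIDIA lines, then count only GPU device classes.
    # grep -c exits non-zero on a count of 0, hence the "|| true".
    found=$(grep -i 'nvidia' | grep -cE 'VGA compatible controller|3D controller' || true)
    if [ "$found" -lt "$expected" ]; then
        echo "MISSING: OS sees $found of $expected GPUs"
        return 1
    fi
    echo "OK: OS sees $found GPUs"
}

# Usage, for a system whose BMC lists 4 GPUs:
#   lspci | check_gpu_count 4
```

A MISSING result points to the hardware and BIOS causes listed above rather than to the driver.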

Check whether the nvidia kernel module is installed:

    modinfo nvidia

If modinfo reports that the module is not found, the driver installation may be incomplete or corrupted, or the module was not built for the running kernel.
