LLMs on Strix Halo: Three Days Chasing the MES Firmware 0x83 Bug


After getting ComfyUI running in the previous post, the next step was to run actual LLM inference on the k3s cluster. With 96GB of unified memory available, even decently large models should fit without trouble.

I spun up a lemonade Pod and sent an inference request.

Memory access fault by GPU node-1 (Agent 1, Process 5, Thread 5) on address 0x7f0000000000. Reason: Page not present
Aborted (core dumped)

“Something’s blowing up on a memory access” was about all the error told me. The cause was not obvious, and tracking it down took a roundabout path.

First Suspect: ROCm Version

Searching the error led me to a known bug in ROCm 7.2.1 specific to gfx1151 (Strix Halo).

[gfx1151] Page Fault on hipMemcpy() in ROCm 7.2.1 - Even official samples fail · Issue #6146 · ROCm/ROCm (github.com)

My environment was on 7.2.1, so I upgraded to 7.2.3.

Terminal window
apt-mark showhold # confirm amdgpu-dkms is held
sudo apt update && sudo apt install --only-upgrade rocm -y
dpkg -l rocm-core # verify 7.2.3

Rebooted and tried again.

Same error.

Not the ROCm version.

Checking MES Firmware

AMD GPUs include a component called MES (Micro Engine Scheduler) with its own firmware. While browsing Reddit, I came across a thread pointing to a reverted commit as the likely culprit:

Revert "amdgpu: update GC 11.5.0 firmware" (3d5c8135) · Commits · kernel-firmware / Linux Firmware · GitLab (gitlab.com)

The 0x83 MES SCH firmware causes problems with ROCm on GC 11.5.0. This reverts commit 1c5716794ac6bb25c20852f7cbb2d56aae43f301. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4751 Signed-off-by: Mario Limonciello (AMD)

I checked my MES firmware version to see if this was the issue.

Terminal window
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MES
MES ... version: 0x83

0x83. That was the culprit.

The amdgpu-dkms-firmware package installs a file called gc_11_5_1_mes_2.bin (gc_11_5_1 is GC 11.5.1, i.e. gfx1151, Strix Halo), and the 0x83 build of that file is the buggy one. I’d briefly convinced myself I’d broken the hardware somehow. It turned out to be a pure software bug, which was a relief.
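If you want to script this check rather than eyeball it, here is a minimal sketch. It assumes each MES line in amdgpu_firmware_info ends in a "version: 0x…" field, as in the output above; the exact line format may differ between kernels. The function names mes_versions and has_bad_mes are my own, not part of any tool.

```shell
# Read amdgpu_firmware_info text on stdin and print one normalized hex
# version per MES line (e.g. 0x00000083 becomes 0x83).
# Assumption: the version is the last "version: 0x..." field on the line.
mes_versions() {
  sed -n '/MES/s/.*version: *\(0x[0-9a-fA-F][0-9a-fA-F]*\).*/\1/p' |
    while read -r v; do
      printf '0x%x\n' "$v"
    done
}

# Succeed (exit 0) if any MES line reports the problematic 0x83 version.
has_bad_mes() {
  mes_versions | grep -qx '0x83'
}
```

Usage on the host would look like: sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | has_bad_mes && echo "MES 0x83 detected"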

Replacing the Firmware

A working version of the firmware is available in the linux-firmware git repo: commit a54ce0ff carries the 0x5d version of gc_11_5_1_mes_2.bin.

One note about the commit from the Reddit thread (3d5c8135): it reverts gc_11_5_0_mes_2.bin (GC 11.5.0), not gc_11_5_1_mes_2.bin (GC 11.5.1 / gfx1151). If you’re on Strix Halo, you need a54ce0ff.

Terminal window
cd ~
git clone --filter=blob:none --no-checkout \
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git lf-git
cd lf-git
git checkout a54ce0ff -- amdgpu/gc_11_5_1_mes_2.bin

Next, find the exact install path:

Terminal window
dpkg -L amdgpu-dkms-firmware | grep gc_11_5_1_mes_2
/lib/firmware/updates/amdgpu/gc_11_5_1_mes_2.bin

That’s /lib/firmware/updates/amdgpu/, not /lib/firmware/amdgpu/. The kernel’s firmware loader searches updates/ first, so replacing the file under /lib/firmware/amdgpu/ has no effect while the packaged copy sits in updates/. I wasted time on this before running dpkg -L and seeing where the package actually installs.
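To make the search order and the replacement step concrete, here is a rough sketch. fw_resolve walks the kernel’s default firmware directories in order (updates/<kernel release>, updates, <kernel release>, then /lib/firmware itself) and prints the first match; it ignores compressed variants such as .zst and any custom firmware_class.path, so treat it as a simplification. fw_replace keeps a one-time backup next to the original before installing the new file. Both function names and the .orig suffix are my own choices.

```shell
# Print the first existing candidate for a firmware name, following the
# kernel's default directory order. Simplified: no compressed variants.
# $1 = firmware name, e.g. amdgpu/gc_11_5_1_mes_2.bin
# $2 = optional root prefix (handy for trying this in a scratch directory)
fw_resolve() {
  local name=$1 root=${2:-} rel dir path
  rel=$(uname -r)
  for dir in "updates/$rel" updates "$rel" ""; do
    path="$root/lib/firmware/${dir:+$dir/}$name"
    if [ -e "$path" ]; then
      printf '%s\n' "$path"
      return 0
    fi
  done
  return 1
}

# Back up the existing file once (as <dst>.orig), then install the new one.
# $1 = replacement file, $2 = destination path. Needs root for real paths.
fw_replace() {
  local src=$1 dst=$2
  [ -f "$src" ] || { echo "missing source: $src" >&2; return 1; }
  if [ -f "$dst" ] && [ ! -e "$dst.orig" ]; then
    cp -p "$dst" "$dst.orig"
  fi
  install -m 0644 "$src" "$dst"
}
```

On the host, fw_resolve amdgpu/gc_11_5_1_mes_2.bin should print the path under updates/, and fw_replace (run as root) would be given ~/lf-git/amdgpu/gc_11_5_1_mes_2.bin and that path. If your amdgpu module loads from the initramfs, you may also need to regenerate it (for example with update-initramfs -u on Debian or Ubuntu) before rebooting; I have not verified whether that applies to every setup.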

Verification

After copying the checked-out file over /lib/firmware/updates/amdgpu/gc_11_5_1_mes_2.bin and rebooting:

Terminal window
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MES
MES(GFX) version: 0x5d
MES(COMP) version: 0x5d

0x5d. I redeployed the lemonade Pod, sent an inference request, and it worked. Three days of debugging, resolved by swapping a single file.
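Because amdgpu-dkms-firmware owns this file, a later package upgrade or reinstall may put a different build back. A small sketch for checking that the installed file still matches the copy from linux-firmware, assuming you kept the checkout around; fw_matches and fw_report are hypothetical helper names:

```shell
# Succeed if the installed firmware is byte-identical to the reference copy.
# $1 = reference file (e.g. the one checked out from linux-firmware)
# $2 = installed file
fw_matches() {
  cmp -s "$1" "$2"
}

# Print a one-line status; returns non-zero when the files differ.
fw_report() {
  if fw_matches "$1" "$2"; then
    echo "ok: $2 matches $1"
  else
    echo "changed: $2 differs from $1"
    return 1
  fi
}
```

For example: fw_report ~/lf-git/amdgpu/gc_11_5_1_mes_2.bin /lib/firmware/updates/amdgpu/gc_11_5_1_mes_2.bin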

Summary

If you’re hitting Memory access faults with LLM inference on gfx1151 (Strix Halo), here’s a checklist:

  1. Check your ROCm version. 7.2.1 has a known gfx1151 bug, so upgrade to 7.2.3 (in my case the upgrade alone did not fix the fault)
  2. Check the MES firmware version: sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MES
  3. If you see 0x83, amdgpu-dkms-firmware has installed the broken firmware file
  4. The file lives in /lib/firmware/updates/amdgpu/, not /lib/firmware/amdgpu/. Confirm with dpkg -L amdgpu-dkms-firmware
  5. Get gc_11_5_1_mes_2.bin from linux-firmware commit a54ce0ff (not 3d5c8135, which only covers gc_11_5_0), replace the installed file, reboot, and confirm the version now reads 0x5d

It’s also worth trying a plain apt upgrade first, since the package itself may have been fixed by now.

One more thing: after a reboot I ran into the GPU staying busy, which led me to switch from running k3s inside Incus to running it on bare metal. That’s a story for another post.


References

[gfx1151] Page Fault on hipMemcpy() in ROCm 7.2.1 - Even official samples fail · Issue #6146 · ROCm/ROCm (github.com)

Revert "amdgpu: update GC 11.5.0 firmware" (3d5c8135) · Commits · kernel-firmware / Linux Firmware · GitLab (gitlab.com). Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4751