After getting ComfyUI running in the previous post, the next step was to run actual LLM inference on the k3s cluster. With 96GB of unified memory available, even decently large models should fit without trouble.
I spun up a lemonade Pod and sent an inference request.

    Memory access fault by GPU node-1 (Agent 1, Process 5, Thread 5) on address 0x7f0000000000. Reason: Page not present
    Aborted (core dumped)

“Something’s blowing up on a memory access.” That much was clear. The cause wasn’t. What followed was a roundabout path to getting it fixed.
First Suspect: ROCm Version
Searching the error led me to a known bug in ROCm 7.2.1 specific to gfx1151 (Strix Halo).
[gfx1151] Page Fault on hipMemcpy() in ROCm 7.2.1
My environment was on 7.2.1, so I upgraded to 7.2.3.

    apt-mark showhold   # confirm amdgpu-dkms is held
    sudo apt update && sudo apt install --only-upgrade rocm -y
    dpkg -l rocm-core   # verify 7.2.3

Rebooted and tried again.
Same error.
Not the ROCm version.
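The version check in this step can also be scripted. Below is a minimal sketch; `version_lt` is a hypothetical helper name, and it assumes the package version string begins with the ROCm release number (for example `7.2.3.` followed by a build suffix), which may not hold for every packaging.

```shell
#!/bin/sh
# version_lt A B: exit 0 if A is older than B, comparing only the first
# three dot-separated numeric fields (an assumption about the version format).
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "."); split(b, y, ".")
        for (i = 1; i <= 3; i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1   # equal versions do not count as older
    }'
}

# Example use on a Debian/Ubuntu host with rocm-core installed:
#   installed=$(dpkg-query -W -f='${Version}' rocm-core)
#   version_lt "$installed" 7.2.3 && echo "rocm-core $installed predates 7.2.3"
```

As this post shows, passing the check only rules out one suspect; the fault can persist on 7.2.3.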
Checking MES Firmware
AMD GPUs include a component called MES (Micro Engine Scheduler) with its own firmware. While browsing Reddit, I came across a thread pointing to a reverted commit as the likely culprit:

    The 0x83 MES SCH firmware causes problems with ROCm on GC 11.5.0.
    This reverts commit 1c5716794ac6bb25c20852f7cbb2d56aae43f301.
    Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4751
    Signed-off-by: Mario Limonciello (AMD)

I checked my MES firmware version to see if this was the issue.

    sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MES
    MES ... version: 0x83

0x83. That was the culprit.
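The same check can be wrapped in a small filter so it is easy to rerun after each firmware change. This is a sketch under assumptions: `check_mes` is a hypothetical name, it treats the last field of each MES line as the version, and it skips `MES_KIQ` lines on the assumption that they describe a separate firmware.

```shell
#!/bin/sh
# check_mes: reads amdgpu_firmware_info text on stdin and prints BAD if any
# MES line reports version 0x83 (the version this post found broken), else OK.
check_mes() {
    status=OK
    while IFS= read -r line; do
        case $line in
            MES_KIQ*) continue ;;   # assumed to be a separate firmware
            MES*) ;;
            *) continue ;;
        esac
        ver=${line##* }             # last whitespace-separated field
        hex=${ver#0x}
        case $hex in
            ''|*[!0-9a-fA-F]*) continue ;;   # not a plain hex number
        esac
        # Numeric comparison so 0x83 and 0x00000083 are treated alike
        if [ "$((0x$hex))" -eq "$((0x83))" ]; then
            status=BAD
        fi
    done
    echo "$status"
}
```

On the host it would be fed the real file, for example `sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | check_mes`.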
The amdgpu-dkms-firmware package unpacks a file called gc_11_5_1_mes_2.bin — gc_11_5_1 being gfx1151, i.e., Strix Halo — and that version contains a bug. I’d briefly convinced myself I’d broken the hardware somehow. Turned out to be a pure software bug, which was a relief.
Replacing the Firmware
The fixed firmware is available from the linux-firmware git repo. Commit a54ce0ff has the 0x5d version of the file.
One note about the commit from the Reddit thread (3d5c8135): it fixes gc_11_5_0_mes_2.bin (GC 11.5.0), not gc_11_5_1_mes_2.bin (GC 11.5.1 / gfx1151). If you’re on Strix Halo, you need a54ce0ff.

    cd ~
    git clone --filter=blob:none --no-checkout \
      https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git lf-git
    cd lf-git
    git checkout a54ce0ff -- amdgpu/gc_11_5_1_mes_2.bin

Next, find the exact install path:

    dpkg -L amdgpu-dkms-firmware | grep gc_11_5_1_mes_2
    /lib/firmware/updates/amdgpu/gc_11_5_1_mes_2.bin

That’s /lib/firmware/updates/amdgpu/, not /lib/firmware/amdgpu/. The kernel’s firmware loader gives updates/ higher priority, so replacing the file under /lib/firmware/amdgpu/ does nothing. I wasted time on this before running dpkg -L and realizing where the package actually installs.
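To avoid replacing the wrong copy, the lookup and the swap can be scripted together. The following is a rough sketch, not the exact commands used for this post: `resolve_fw`, `replace_fw`, and the `FW_ROOT` override are hypothetical names, and it only considers uncompressed files, ignoring `.zst`/`.xz` variants and any custom `firmware_class.path`.

```shell
#!/bin/sh
# FW_ROOT is a hypothetical override for trying this against a scratch
# directory; leave it unset to target the real /lib/firmware tree.

# Print the first existing copy of a firmware file, walking the directories
# in the order the kernel firmware loader searches them.
resolve_fw() {
    name=$1
    root=${FW_ROOT:-}
    rel=$(uname -r)
    for dir in \
        "$root/lib/firmware/updates/$rel" \
        "$root/lib/firmware/updates" \
        "$root/lib/firmware/$rel" \
        "$root/lib/firmware"
    do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Replace the copy the loader would pick with a new file, keeping a backup.
replace_fw() {
    src=$1
    name=$2
    target=$(resolve_fw "$name") || {
        echo "no installed copy of $name found" >&2
        return 1
    }
    if cmp -s "$src" "$target"; then
        echo "already identical: $target"
        return 0
    fi
    cp -p "$target" "$target.bak" && cp "$src" "$target" && echo "replaced: $target"
}
```

Run as root, a call would look like `replace_fw "$HOME/lf-git/amdgpu/gc_11_5_1_mes_2.bin" amdgpu/gc_11_5_1_mes_2.bin`. If amdgpu is loaded from the initramfs on your system, regenerating it (for example `sudo update-initramfs -u` on Debian/Ubuntu) may also be needed before rebooting; that is an assumption, not something this post's steps required.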
Verification
After replacing the file and rebooting:

    sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MES
    MES(GFX) version: 0x5d
    MES(COMP) version: 0x5d

0x5d. Redeployed the lemonade Pod, sent an inference request, and it worked. Three days of debugging, gone in an instant.
Summary
If you’re hitting Memory access faults with LLM inference on gfx1151 (Strix Halo), here’s a checklist:
- Check your ROCm version — 7.2.1 has a known gfx1151 bug; upgrade to 7.2.3
- Check MES firmware version: sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info | grep MES
- If you see 0x83, amdgpu-dkms-firmware is deploying a broken firmware file
- The install path is /lib/firmware/updates/amdgpu/, not /lib/firmware/amdgpu/; confirm with dpkg -L amdgpu-dkms-firmware
- Get gc_11_5_1_mes_2.bin from linux-firmware commit a54ce0ff (not 3d5c8135, which only covers gc_11_5_0)
It’s also worth trying a plain apt upgrade first — the package may have been fixed upstream by now.
One more thing: after a reboot I ran into the GPU staying busy, which led me to switch from running k3s inside Incus to running it on bare metal. That’s a story for another post.
References
[gfx1151] Page Fault on hipMemcpy() in ROCm 7.2.1
The 0x83 MES SCH firmware causes problems with ROCm on GC 11.5.0. This reverts commit 1c5716794ac6bb25c20852f7cbb2d56aae43f301. Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4751 Signed-off-by: Mario Limonciello (AMD)









