I Tried GPU Passthrough in an Incus VM and Ended Up on Bare Metal

In the previous post, I fixed the MES firmware bug and got lemonade running. But after using it for a while, a different problem surfaced. That’s what this post is about.

What Happened

lemonade crashed mid-inference. Processes dying from OOM or during model loading is pretty common with inference servers, so that part wasn’t surprising. What happened next was.

The Pod restarted, but the GPU was unusable. The VM’s dmesg showed it stalling with this error:

[   40.4] amdgpu: PSP tmr init failed!

Restarting the VM didn’t help. Rebooting the host didn’t help. Powering off and back on (short of a full discharge) didn’t help either.

That last one was unexpected. I knew a software reset wouldn’t clear firmware state, but I didn’t expect an OS-level power cycle to leave the GPU in the same dirty state.

Why Recovery Isn’t Possible

Strix Halo is an APU, so the GPU-side PSP (Platform Security Processor) is tightly coupled with the CPU-side ASP (AMD Security Processor / CCP). When lemonade exits without cleanly releasing the GPU, the PSP leaves TMR (Trusted Memory Region) in a dirty state.

On the next boot, PSP tries to initialize against that dirty state, hits UNKNOWN CMD(0xFFFFFFFF) (a timeout), and fails.

In a VFIO environment, the ASP is passed through “as a VM device,” so it can’t directly access the host’s physical resources. This appears to be why PSP resets don’t work correctly across VFIO. On bare metal, the same crash happens but the Pod just restarts and recovers — so this was a VFIO-specific problem.

Most of this is pieced together from Reddit threads, so take it with a grain of salt — but it does seem to fit the symptoms.

What I Tried

FLR (Function Level Reset)

echo 1 > /sys/bus/pci/devices/0000:c5:00.0/reset

No effect. FLR doesn’t clear PSP firmware state.

vendor-reset module

The usual fix for AMD GPU reset issues, but it fails to build on kernel 7.x. Strix Halo isn’t supported anyway.

Neither approach worked, and I ran out of options.

Switching to Bare Metal

The obvious conclusion: skip Incus VM entirely, install Ubuntu directly on the host, and run k3s there.

Not being able to match the architecture of the other nodes was a minor annoyance, but a setup where every GPU crash leaves you reaching for the power cable isn’t operationally viable.

Final Configuration

Bare-metal Ubuntu 26.04 (no Incus)
Kernel built-in amdgpu + ROCm 7.2.3
k3s agent installed directly
AMD GPU Operator + DRA driver for Kubernetes GPU access

Now when lemonade crashes, the Pod restarts and everything just works again. The stability compared to the VFIO setup isn’t even close.

Takeaway

VFIO passthrough works well when things are running smoothly, but on an APU like Strix Halo, once the GPU enters a dirty state there’s no software path back. For an inference server that you can’t fully protect from crashes, a VFIO-based setup isn’t sustainable.

Bare metal felt like admitting defeat at first, but stable beats clever.

Hope this helps anyone hitting the same wall.

I Tried GPU Passthrough in an Incus VM and Ended Up on Bare Metal

What Happened

Why Recovery Isn’t Possible

What I Tried

Switching to Bare Metal

Final Configuration

Takeaway

Related Posts

EVO-X2 as a K8s Inference Node — VFIO Passthrough Gotchas

Sharing One AMD GPU Across Pods with DRA: Strix Halo Patches

LLMs on Strix Halo: Three Days Chasing the MES Firmware 0x83 Bug

Local LLMs and ChatGPT on One Endpoint with LiteLLM

Autonomous k8s Debugging and Tuning with Claude Code Skills

Reaching Homelab Services from GitHub Actions via ARC

Interact with a Private Homelab Server via Discord BOT

Disposable Remote Dev on Homelab K8s with Coder

VM to LXC Migration: Super-Lightweight K3s on Incus

Authentik to PocketID: Embracing Passkeys & Simplicity in HomeLab