LLMs on Strix Halo: Three Days Chasing the MES Firmware 0x83 Bug
Running llama.cpp on my k3s + AMD GPU cluster kept hitting memory access faults. The culprit: a bug in MES firmware 0x83 shipped with amdgpu-dkms-firmware.
Technical blog about Infrastructure, Kubernetes, AI, and more.
I set up Strix Halo as a k3s worker via Incus VM + VFIO, then hit a wall: once the GPU enters a dirty state, recovery is impossible without bare metal.
The device plugin gives the GPU to one Pod at a time. Here's why I switched to DRA on k3s, and three Strix Halo-specific issues I had to patch around.
How I joined GMKtec EVO-X2 (Ryzen AI MAX+ 395) to my k3s cluster as a GPU node via Incus VFIO, covering APU-specific passthrough gotchas.
Migrating from MicroK8s to K3s. Real-world insights on infrastructure rebuilding, from an Ubuntu 26.04 twist to Kubeconfig traps and safe TLS switching.
How I built a Discord bot in Go to securely interact with my private homelab server without exposing it to the internet.
How I built a self-healing, automated refactoring pipeline using Codex's subscription capacity and Temporal on a home Kubernetes cluster.
Implement popular post rankings on Astro static sites using GA4 data. Covers build-time fetching, content integrity checks, and CI fallback strategies.
Breaking free from 'just add more resources.' From Amdahl's Law, USL, and queuing theory equations to practical capacity planning based on empirical analysis.
Stop tuning by guesswork. A cheat sheet on the three legendary Linux observability tools (perf, Ftrace, BPF) for safe and surgical performance analysis.
Moving away from guesswork tuning. From crisis tool prep to /proc secrets, kprobes/uprobes, and PMCs: here's how performance tools really work.
Don't be fooled by IOPS or average latency! A guide to virtual disk traps, bimodal distributions, and hunting mysterious I/O with biostacks.