No Cloud Required: My Local AI Stack on AMD Hardware

2026-06-11T00:00:00+00:00

TL;DR: I run a fully self-hosted AI stack — inference, memory, agent orchestration, image generation, and speech-to-text — entirely on local AMD hardware. This is a follow-up to Down the Agentic AI Rabbit Hole</a>, where I covered the earlier version of this setup. A lot has changed since then, and I have a second GPU arriving any day now that I am completely calm about.

Section</th> Summary</th></tr></thead>

Then vs. now</a></td> What changed since the last post</td></tr>

How I got here</a></td> Lemonade, Gaia, LiteLLM, and why none of them stuck</td></tr>

The service map</a></td> What’s running and what it does</td></tr>

Hardware split</a></td> dGPU, iGPU, and an incoming OcuLink eGPU</td></tr>

Tuning the models</a></td> YAML configs, KV cache quantization, Ministral, and 1-bit models</td></tr>

The memory layer</a></td> LocalAGI, LocalRecall, and why Cognee didn’t last</td></tr>

OpenCode and the sandbox</a></td> Bubblewrap isolation for a tool that can run shell commands</td></tr>

Zed integration</a></td> OpenCode as a Zed agent server</td></tr>

Skills and Codegraph</a></td> Lazy-loaded domain knowledge and a live code intelligence graph</td></tr>

Open Questions</a></td>

Things I’m still chewing on</td></tr> </tbody></table>

Then vs. now

The previous post</a> covered the first serious version of this stack: Lemonade Server for inference, Gaia for a web-style agent interface, a vaguely-described personal knowledge base connected via MCP, and LiteLLM sitting in front of everything translating Claude API names to local models. It worked. It was also a collection of moving parts that each had their own opinions about how things should run.

Here’s where things stood then versus now:

Component</th> Then (May 2026)</th> Now (June 2026)</th></tr></thead>

Inference</td> Lemonade Server</a></td> LocalAI</a></td></tr>

Model proxy</td> LiteLLM</a></td> Removed — LocalAI speaks OpenAI API natively</td></tr>

Agent interface</td> Gaia</a></td> LocalAGI</a></td></tr>

Chat UI</td> Gaia web interface</td> LocalAI built-in chat</td></tr>

Memory/RAG</td> Vague MCP-connected knowledge base</td> LocalRecall</a> (replaced Cognee)</td></tr>

Image gen</td> LocalAI Vulkan backend (blocked by glibc)</td> sd-rocm</code> — stable-diffusion.cpp</code> built with ROCm</td></tr>

STT</td> None</td> Whisper large-v3-turbo via LocalAI</td></tr>

TTS</td> None</td> Piper via LocalAI</td></tr>

Code sandbox</td> None</td> Bubblewrap isolation on every OpenCode invocation</td></tr>

Code intelligence</td> None</td> Codegraph MCP server</td></tr>

Agent skills</td>

Ad-hoc</td>

Formalized lazy-loaded SKILL.md system</td></tr> </tbody></table>

The short version: Lemonade did one thing well. LocalAI does everything.

How I got here

Lemonade Server</a> was my inference backend for a long stretch — it was the foundation of the previous setup. It’s clean, the API is well-behaved, and if you’re on Windows it’s one of the easier options. But it’s focused on chat and completion. Once I wanted embeddings, image generation, and speech-to-text all coming out of the same backend, Lemonade started showing its ceiling. It ain’t designed for that, and bending it toward it felt like the wrong direction.

Gaia</a> I spent time with too — also covered in the previous post. The concept is interesting — a decentralized network of AI nodes, each running local models, addressable from anywhere. In practice, for a self-hosted setup where everything lives on one machine, the decentralized architecture is more in the way than useful. It makes assumptions about how you want to run things that don’t match how I want to run things.

LiteLLM</a> came in as a translation layer — a proxy that sat in front of everything and mapped Claude API model names (claude-sonnet-4-6</code>, claude-haiku-4-5</code>) to whatever local model was actually serving. That worked, but it was a layer of indirection that mostly just added a place for things to go wrong. Every request had to go through it, and when something misfired it wasn’t always obvious which layer was at fault.

LocalAI</a> is where I landed. It handles inference, embeddings, image generation via Stable Diffusion backends, speech-to-text via Whisper, and TTS via Piper — all under a single OpenAI-compatible API. It has real ROCm</a> support for AMD GPUs, which matters a lot when your hardware isn’t Nvidia. Once it was running, LiteLLM came out and hasn’t been missed.

The service map#</a> </h2> Everything runs as a systemd user service</a> — no root, starts on login, logs to the journal like anything else. Service</th> Role</th></tr></thead> localai.service</code></td> Primary inference — dGPU (ROCm), large models, embeddings, STT, TTS</td></tr> localai-igpu.service</code></td> Secondary inference — iGPU, small/fast models</td></tr> localrecall.service</code></td> Vector RAG server — chromem backend, fed by LocalAI embeddings</td></tr> localagi.service</code></td> Agent orchestration — routes to LocalAI for LLM, LocalRecall for RAG</td></tr> sd-rocm.service</code></td> Image generation — stable-diffusion.cpp</code> built with ROCm/HIPblas</td></tr> </tbody></table> LocalAI’s built-in chat interface runs alongside inference — no separate UI service needed. The dependency ordering is explicit in the systemd units: [Unit] Description=LocalAGI agent orchestration server After=localai.service localrecall.service</code></pre> LocalAGI won’t start until both LocalAI and LocalRecall are up. If inference restarts, LocalAGI follows. Not glamorous, but it works. Hardware split#</a> </h2> The RX 9070 XT is the primary GPU — 16GB VRAM, ROCm, handles the heavy models. The iGPU handles a second LocalAI instance running smaller, faster models. The two instances are isolated from each other via HIP_VISIBLE_DEVICES</code>: Instance</th> Device</th> Use</th></tr></thead> localai</code></td> HIP_VISIBLE_DEVICES=0</code> (dGPU)</td> Large models, embeddings, image gen, STT</td></tr> localai-igpu</code></td> HIP_VISIBLE_DEVICES=-1</code> (iGPU/CPU)</td> Small models, fast responses</td></tr> </tbody></table> Setting HIP_VISIBLE_DEVICES=-1</code> on the iGPU instance hides the discrete card entirely. The two can’t step on each other’s VRAM budget. What’s coming: I’ve got OcuLink</a> hardware arriving shortly with an RX 6800 XT. OcuLink is a PCIe tunnel for external GPUs — faster than Thunderbolt, no meaningful overhead compared to a slot-mounted card. The plan is a third LocalAI instance pinned to that GPU (HIP_VISIBLE_DEVICES=1</code>), which gives three tiers: Tier</th> Hardware</th> Models</th></tr></thead> Heavy</td> RX 9070 XT (16GB)</td> 14B–30B models, embeddings, image gen</td></tr> Mid</td> RX 6800 XT (16GB, eGPU)</td> 7B–14B models, coding agents</td></tr> Light</td> iGPU</td> 1B–4B models, fast completions</td></tr> </tbody></table> Right now the 9070 XT is doing work that’ll be more comfortable spread across two cards. Image generation#</a> </h3> sd-rocm</code></a> is a separate service, built from source with ROCm/HIPblas targeting gfx1201</code> (9070 XT) and gfx1036</code> (iGPU). It shares the same GGUF model files as LocalAI — no duplication on disk. VRAM layout is intentional: Diffusion model + VAE → GPU</li> Text encoders → CPU RAM (--clip-on-cpu</code>)</li> </ul> The text encoders (~3.6GB) on CPU leaves the GPU free to share headroom with whatever LLM is loaded. Running both concurrently works; running both at full tilt at the same time is a recipe for the mullygrubs. Tuning the models#</a> </h2> Each model in LocalAI gets a YAML config. The knobs that matter: name: qwen3.5-9b backend: llama-cpp context_size: 32768 gpu_layers: 99 # full offload — all layers on GPU flash_attention: "on" # faster attention math, same output cache_type_k: q8_0 # KV cache at q8 precision — saves VRAM, negligible quality loss cache_type_v: q8_0 threads: 16 reasoning_effort: none # disable think-chain for this model by default temperature: 0.6 top_k: 20 top_p: 0.95 parameters: f16: true mmap: true model: Qwen3.5-9B-UD-Q4_K_XL.gguf</code></pre> gpu_layers: 99</code> — offloads all transformer layers to the GPU. With enough VRAM this is the right call; partial offload (split between GPU and RAM) is slower than full offload to either. KV cache quantization — the attention cache is quantized separately from the model weights. q8_0</code> keeps it dense enough that quality stays intact while cutting the VRAM hit for long contexts. For most models at 32K context, the KV cache is significant. reasoning_effort: none</code> — models like Qwen3.5 have a built-in chain-of-thought mode where they reason through a problem before answering. That’s useful sometimes. Burning thinking tokens on “what day is it” is not. This can be set per model in the YAML and overridden per-request. Quantization on the model files is the other main lever. Q4_K_XL</code> is smaller and faster than Q8_0</code>, takes less VRAM, and for most tasks — especially coding and instruction-following — the quality difference is hard to find. Q8</code> is worth the cost for models where I care about reasoning fidelity; Q4_K_M</code> or Q4_K_XL</code> for everything else. I’ve also got a soft spot for the Ministral</a> family specifically — the 3B, 8B, and 14B variants. The reason is pretty specific: DevOps tooling. Kubernetes manifests, Helm charts, Terraform, shell scripts, Ansible tasks — Ministral handles all of it well and doesn’t hallucinate its way through YAML the way some models do. Tool-calling works reliably, instruction-following is tight, and the context windows are solid for the size. The 8B in particular earns its keep as an always-on ops model: fast enough that you don’t notice the wait, capable enough that the output is actually usable. 1-bit models: Bonsai and MiniCPM5#</a> </h3> Separate from the production stack, I’ve been playing with 1-bit quantized models — Bonsai</a> and MiniCPM5-1B</a> specifically. These aren’t “lower quality versions of normal models” in the usual sense — they’re trained from scratch with 1-bit weights, where each parameter is essentially a single bit rather than a float. The result is models that are absurdly small and fast. Bonsai comes in 1.7B, 4B, and 8B variants at Q1_0. MiniCPM5-1B is exactly what it sounds like. Neither is going to replace a 14B model for anything serious. The interesting part is how much they can do — coherent responses, basic tool calling, useful summaries — at token rates that make normal quantized models look sluggish. The Bonsai 4B sits at 130–149 tok/s on my hardware. That’s fast enough to use as a dispatch model, a quick triage pass, or anything where you want a near-instant answer and the stakes aren’t high. It’s interesting work from a research standpoint. The fact that a 1-bit model can produce anything useful at all still surprises me a little every time. The memory layer#</a> </h2> LocalAGI</a> is the agent orchestration layer. When a message comes in, it doesn’t fire it straight at the LLM — it queries LocalRecall for relevant context first, injects it into the prompt, then hands everything to LocalAI. The model answers with relevant stored knowledge already in hand. User message │ ▼ LocalAGI ──► LocalRecall (vector search) │ │ │ relevant chunks │◄─────────────┘ │ ▼ LocalAI (LLM inference) │ ▼ Response</code></pre> LocalRecall</a> is the vector RAG server. It stores documents as embeddings — generated by LocalAI’s embedding model — and retrieves by semantic similarity rather than keyword match. Ask about “GPU memory management” and it’ll surface something filed under “VRAM budgeting” without you having to know the exact term you used when you stored it. It runs three backends: chromem</code> (file-based, default), postgres</code> (hybrid BM25 + vector), and localai</code> (delegates embedding entirely to LocalAI). The chromem</code> backend is what I use — no external database dependency, persists to disk, fast enough. I started with Cognee</a>. It’s a knowledge graph tool that can build entity relationships across documents — more structured than a flat vector store. It worked, but for the way I actually use the memory layer (feed in notes, retrieve relevant chunks, don’t overthink it), Cognee was more machinery than the job needed. LocalRecall does the same thing with less surface area. Cognee’s still referenced in the OpenCode config as a leftover, but LocalRecall is what the stack actually uses. OpenCode and the sandbox#</a> </h2> OpenCode</a> is my terminal-based AI coding assistant — think Cursor but in the terminal, pointed at local models. It’s configured with a fleet of named agents, each assigned a specific model and role: Agent</th> Model</th> Purpose</th></tr></thead> build</code></td> Codestral-RAG-19B</td> General coding, default agent</td></tr> plan</code></td> Qwen3.5-9B</td> Architecture, system design, visible think chain</td></tr> deep</code></td> Devstral-Small-2507</td> Complex multi-step problems, large refactors</td></tr> ops</code></td> Ministral-8B</td> Shell, K8s, Terraform, incident triage</td></tr> think</code></td> Qwen3.5-9B</td> Debug analysis, root-cause tracing</td></tr> vision</code></td> Gemma-4-E4B</td> Multimodal — screenshots, diagrams</td></tr> fast</code></td> Bonsai-4B</td> Trivial lookups, quick drafts, 130+ tok/s</td></tr> long</code></td> SmolLM3-3B</td> 128K context window — full codebases, large logs</td></tr> image</code></td> Z-Image Turbo</td> Text-to-image via LocalAI</td></tr> audio</code></td> Whisper large-v3</td> Speech-to-text via LocalAI</td></tr> </tbody></table> The problem with a tool that can read files, write files, and run arbitrary shell commands is obvious. The solution is bubblewrap</a>. OpenCode runs inside a bubblewrap sandbox. It only sees the project directory it’s handed. It can’t traverse the filesystem, can’t reach services it has no business reaching, and can’t make network calls outside a defined allowlist. The sandbox is transparent during normal use — OpenCode doesn’t know it’s there — but a misbehaving model output or a prompt injection that tries to do something clever lands in a box it can’t get out of. The wrapper is a shell script installed as opencode</code> in ~/.local/bin</code>, taking precedence over the system binary. Every invocation goes through it. # simplified — actual wrapper handles bind mounts and seccomp filter exec bwrap \ --ro-bind /usr/bin /usr/bin \ --bind "${PROJECT_DIR}" "${PROJECT_DIR}" \ --unshare-net \ opencode-real "$@"</code></pre> The seccomp filter is generated separately — a BPF program that allowlists the syscalls OpenCode legitimately needs and denies everything else. It’s fiddly to get right the first time and then you don’t touch it again. Zed integration#</a> </h2> Zed</a> has a native agent server</a> integration. OpenCode is registered there as the provider: "agent_servers": { "opencode": { "type": "registry" } }</code></pre> When I’m in the middle of editing and want an inline suggestion, a refactor, or a quick explanation without leaving the editor, Zed routes it through OpenCode. Same local models, same bubblewrap sandbox, same stack. The editor integration and the terminal session share the same backend — there’s no separate “Zed model” to configure. Skills and Codegraph#</a> </h2> Skills#</a> </h3> I maintain a skills system for the AI agents — a directory of domain-specific Markdown files, each covering a specific tool or area: Kubernetes, Ansible, Rust, the blog stack, hardware specifics, etc. The key design constraint is lazy loading. At session start, the agent only loads skill names and their trigger keywords. The actual content — which can be several hundred lines of reference material — doesn’t enter context until a trigger fires. This keeps the context window clean and means the agent isn’t dragging around Kubernetes docs when you’re writing a blog post. A skill file looks like this: --- name: localrecall triggers: - LocalRecall - localrecall - RAG server - chromem engine --- ## Always - Pure HTTP API server; all config via env vars; base path `/api` - Three backends: chromem (default), postgres (hybrid BM25+vector), localai - Indexable file types: PDF, TXT, MD ## Branch | Context | Load | |---|---| | Creating or managing collections | Read `references/collections.md` | | Uploading files or external sources | Read `references/ingest.md` | | Searching a collection | Read `references/search.md` |</code></pre> The ## Always</code> block is ≤5 bullets — only the things that cause immediately wrong behavior if unknown. Everything else lives in branch reference files, loaded one at a time when the context matches. I’ve got skills covering around 50 domains at this point. New ones are easy to add. Codegraph#</a> </h3> Codegraph</a> runs as an MCP server inside both Claude Code and OpenCode. It indexes the codebase into a SQLite knowledge graph — every symbol, every call edge, every file relationship — built from a full AST parse. When the AI needs to understand how something works, where a function is called from, or what a change would break, it queries Codegraph directly instead of grepping around hoping to find the right file. The index lags writes by about a second. Queries come back in under a millisecond. For practical purposes it’s live. The difference in how an agent navigates a codebase with and without it is hard to overstate. Grep finds strings. Codegraph finds meaning — call paths across files, dynamic dispatch hops, symbol definitions across a whole repo. Once you’ve worked with it you notice immediately when it’s absent. Open Questions#</a> </h2> The stack is stable, but there’s plenty I’m still chewing on. LocalRecall’s chromem</code> backend is file-based and fast, but I’m curious whether the postgres</code> hybrid BM25 + vector backend would improve retrieval quality on longer documents — or whether that’s a problem I don’t actually have yet. The bubblewrap seccomp filter works, but “works” is doing some load-bearing there — it passes the syscalls I tested and blocks ones I didn’t. I’d like a cleaner way to profile what OpenCode actually needs rather than assembling the allowlist by running it into walls. The OcuLink 6800 XT situation: eGPU bandwidth over OcuLink is theoretically fine for inference, but inference has different access patterns than gaming workloads, which is what most OcuLink benchmarks cover. I allow it’ll be fine. I will find out shortly. And the big one I keep circling back to: is the skills system the right abstraction at scale, or am I building toward a point where 80 skill files becomes its own kind of mess? Right now it’s manageable. The lazy-loading discipline keeps it clean. But there’s a version of this that turns into a laurel hell of Markdown I can’t find my way through.

fosstog - localai

No Cloud Required: My Local AI Stack on AMD Hardware