No Cloud Required: My Local AI Stack on AMD Hardware

TL;DR: I run a fully self-hosted AI stack — inference, memory, agent orchestration, image generation, and speech-to-text — entirely on local AMD hardware. This is a follow-up to Down the Agentic AI Rabbit Hole, where I covered the earlier version of this setup. A lot has changed since then, and I have a second GPU arriving any day now that I am completely calm about.

SectionSummary
Then vs. nowWhat changed since the last post
How I got hereLemonade, Gaia, LiteLLM, and why none of them stuck
The service mapWhat’s running and what it does
Hardware splitdGPU, iGPU, and an incoming OcuLink eGPU
Tuning the modelsYAML configs, KV cache quantization, Ministral, and 1-bit models
The memory layerLocalAGI, LocalRecall, and why Cognee didn’t last
OpenCode and the sandboxBubblewrap isolation for a tool that can run shell commands
Zed integrationOpenCode as a Zed agent server
Skills and CodegraphLazy-loaded domain knowledge and a live code intelligence graph
Open QuestionsThings I’m still chewing on

Then vs. now

The previous post covered the first serious version of this stack: Lemonade Server for inference, Gaia for a web-style agent interface, a vaguely-described personal knowledge base connected via MCP, and LiteLLM sitting in front of everything translating Claude API names to local models. It worked. It was also a collection of moving parts that each had their own opinions about how things should run.

Here’s where things stood then versus now:

ComponentThen (May 2026)Now (June 2026)
InferenceLemonade ServerLocalAI
Model proxyLiteLLMRemoved — LocalAI speaks OpenAI API natively
Agent interfaceGaiaLocalAGI
Chat UIGaia web interfaceLocalAI built-in chat
Memory/RAGVague MCP-connected knowledge baseLocalRecall (replaced Cognee)
Image genLocalAI Vulkan backend (blocked by glibc)sd-rocmstable-diffusion.cpp built with ROCm
STTNoneWhisper large-v3-turbo via LocalAI
TTSNonePiper via LocalAI
Code sandboxNoneBubblewrap isolation on every OpenCode invocation
Code intelligenceNoneCodegraph MCP server
Agent skillsAd-hocFormalized lazy-loaded SKILL.md system

The short version: Lemonade did one thing well. LocalAI does everything.


How I got here

I didn’t sit down one day and design this stack from scratch. I arrived at it the usual way — by running through several things that didn’t quite fit until something did.

Lemonade Server was my inference backend for a long stretch — it was the foundation of the previous setup. It’s clean, the API is well-behaved, and if you’re on Windows it’s one of the easier options. But it’s focused on chat and completion. Once I wanted embeddings, image generation, and speech-to-text all coming out of the same backend, Lemonade started showing its ceiling. It ain’t designed for that, and bending it toward it felt like the wrong direction.

Gaia I spent time with too — also covered in the previous post. The concept is interesting — a decentralized network of AI nodes, each running local models, addressable from anywhere. In practice, for a self-hosted setup where everything lives on one machine, the decentralized architecture is more in the way than useful. It makes assumptions about how you want to run things that don’t match how I want to run things.

LiteLLM came in as a translation layer — a proxy that sat in front of everything and mapped Claude API model names (claude-sonnet-4-6, claude-haiku-4-5) to whatever local model was actually serving. That worked, but it was a layer of indirection that mostly just added a place for things to go wrong. Every request had to go through it, and when something misfired it wasn’t always obvious which layer was at fault.

LocalAI is where I landed. It handles inference, embeddings, image generation via Stable Diffusion backends, speech-to-text via Whisper, and TTS via Piper — all under a single OpenAI-compatible API. It has real ROCm support for AMD GPUs, which matters a lot when your hardware isn’t Nvidia. Once it was running, LiteLLM came out and hasn’t been missed.


The service map

Everything runs as a systemd user service — no root, starts on login, logs to the journal like anything else.

ServiceRole
localai.servicePrimary inference — dGPU (ROCm), large models, embeddings, STT, TTS
localai-igpu.serviceSecondary inference — iGPU, small/fast models
localrecall.serviceVector RAG server — chromem backend, fed by LocalAI embeddings
localagi.serviceAgent orchestration — routes to LocalAI for LLM, LocalRecall for RAG
sd-rocm.serviceImage generation — stable-diffusion.cpp built with ROCm/HIPblas

LocalAI’s built-in chat interface runs alongside inference — no separate UI service needed.

The dependency ordering is explicit in the systemd units:

[Unit]
Description=LocalAGI agent orchestration server
After=localai.service localrecall.service

LocalAGI won’t start until both LocalAI and LocalRecall are up. If inference restarts, LocalAGI follows. Not glamorous, but it works.


Hardware split

The RX 9070 XT is the primary GPU — 16GB VRAM, ROCm, handles the heavy models. The iGPU handles a second LocalAI instance running smaller, faster models. The two instances are isolated from each other via HIP_VISIBLE_DEVICES:

InstanceDeviceUse
localaiHIP_VISIBLE_DEVICES=0 (dGPU)Large models, embeddings, image gen, STT
localai-igpuHIP_VISIBLE_DEVICES=-1 (iGPU/CPU)Small models, fast responses

Setting HIP_VISIBLE_DEVICES=-1 on the iGPU instance hides the discrete card entirely. The two can’t step on each other’s VRAM budget.

What’s coming: I’ve got OcuLink hardware arriving shortly with an RX 6800 XT. OcuLink is a PCIe tunnel for external GPUs — faster than Thunderbolt, no meaningful overhead compared to a slot-mounted card. The plan is a third LocalAI instance pinned to that GPU (HIP_VISIBLE_DEVICES=1), which gives three tiers:

TierHardwareModels
HeavyRX 9070 XT (16GB)14B–30B models, embeddings, image gen
MidRX 6800 XT (16GB, eGPU)7B–14B models, coding agents
LightiGPU1B–4B models, fast completions

Right now the 9070 XT is doing work that’ll be more comfortable spread across two cards.

Image generation

sd-rocm is a separate service, built from source with ROCm/HIPblas targeting gfx1201 (9070 XT) and gfx1036 (iGPU). It shares the same GGUF model files as LocalAI — no duplication on disk. VRAM layout is intentional:

The text encoders (~3.6GB) on CPU leaves the GPU free to share headroom with whatever LLM is loaded. Running both concurrently works; running both at full tilt at the same time is a recipe for the mullygrubs.


Tuning the models

Each model in LocalAI gets a YAML config. The knobs that matter:

name: qwen3.5-9b
backend: llama-cpp
context_size: 32768
gpu_layers: 99        # full offload — all layers on GPU
flash_attention: "on" # faster attention math, same output
cache_type_k: q8_0    # KV cache at q8 precision — saves VRAM, negligible quality loss
cache_type_v: q8_0
threads: 16
reasoning_effort: none  # disable think-chain for this model by default
temperature: 0.6
top_k: 20
top_p: 0.95
parameters:
  f16: true
  mmap: true
  model: Qwen3.5-9B-UD-Q4_K_XL.gguf

gpu_layers: 99 — offloads all transformer layers to the GPU. With enough VRAM this is the right call; partial offload (split between GPU and RAM) is slower than full offload to either.

KV cache quantization — the attention cache is quantized separately from the model weights. q8_0 keeps it dense enough that quality stays intact while cutting the VRAM hit for long contexts. For most models at 32K context, the KV cache is significant.

reasoning_effort: none — models like Qwen3.5 have a built-in chain-of-thought mode where they reason through a problem before answering. That’s useful sometimes. Burning thinking tokens on “what day is it” is not. This can be set per model in the YAML and overridden per-request.

Quantization on the model files is the other main lever. Q4_K_XL is smaller and faster than Q8_0, takes less VRAM, and for most tasks — especially coding and instruction-following — the quality difference is hard to find. Q8 is worth the cost for models where I care about reasoning fidelity; Q4_K_M or Q4_K_XL for everything else.

I’ve also got a soft spot for the Ministral family specifically — the 3B, 8B, and 14B variants. The reason is pretty specific: DevOps tooling. Kubernetes manifests, Helm charts, Terraform, shell scripts, Ansible tasks — Ministral handles all of it well and doesn’t hallucinate its way through YAML the way some models do. Tool-calling works reliably, instruction-following is tight, and the context windows are solid for the size. The 8B in particular earns its keep as an always-on ops model: fast enough that you don’t notice the wait, capable enough that the output is actually usable.

1-bit models: Bonsai and MiniCPM5

Separate from the production stack, I’ve been playing with 1-bit quantized modelsBonsai and MiniCPM5-1B specifically. These aren’t “lower quality versions of normal models” in the usual sense — they’re trained from scratch with 1-bit weights, where each parameter is essentially a single bit rather than a float. The result is models that are absurdly small and fast.

Bonsai comes in 1.7B, 4B, and 8B variants at Q1_0. MiniCPM5-1B is exactly what it sounds like. Neither is going to replace a 14B model for anything serious. The interesting part is how much they can do — coherent responses, basic tool calling, useful summaries — at token rates that make normal quantized models look sluggish. The Bonsai 4B sits at 130–149 tok/s on my hardware. That’s fast enough to use as a dispatch model, a quick triage pass, or anything where you want a near-instant answer and the stakes aren’t high.

It’s interesting work from a research standpoint. The fact that a 1-bit model can produce anything useful at all still surprises me a little every time.


The memory layer

LocalAGI is the agent orchestration layer. When a message comes in, it doesn’t fire it straight at the LLM — it queries LocalRecall for relevant context first, injects it into the prompt, then hands everything to LocalAI. The model answers with relevant stored knowledge already in hand.

User message


LocalAGI ──► LocalRecall (vector search)
    │              │
    │         relevant chunks
    │◄─────────────┘


LocalAI (LLM inference)


Response

LocalRecall is the vector RAG server. It stores documents as embeddings — generated by LocalAI’s embedding model — and retrieves by semantic similarity rather than keyword match. Ask about “GPU memory management” and it’ll surface something filed under “VRAM budgeting” without you having to know the exact term you used when you stored it.

It runs three backends: chromem (file-based, default), postgres (hybrid BM25 + vector), and localai (delegates embedding entirely to LocalAI). The chromem backend is what I use — no external database dependency, persists to disk, fast enough.

I started with Cognee. It’s a knowledge graph tool that can build entity relationships across documents — more structured than a flat vector store. It worked, but for the way I actually use the memory layer (feed in notes, retrieve relevant chunks, don’t overthink it), Cognee was more machinery than the job needed. LocalRecall does the same thing with less surface area. Cognee’s still referenced in the OpenCode config as a leftover, but LocalRecall is what the stack actually uses.


OpenCode and the sandbox

OpenCode is my terminal-based AI coding assistant — think Cursor but in the terminal, pointed at local models. It’s configured with a fleet of named agents, each assigned a specific model and role:

AgentModelPurpose
buildCodestral-RAG-19BGeneral coding, default agent
planQwen3.5-9BArchitecture, system design, visible think chain
deepDevstral-Small-2507Complex multi-step problems, large refactors
opsMinistral-8BShell, K8s, Terraform, incident triage
thinkQwen3.5-9BDebug analysis, root-cause tracing
visionGemma-4-E4BMultimodal — screenshots, diagrams
fastBonsai-4BTrivial lookups, quick drafts, 130+ tok/s
longSmolLM3-3B128K context window — full codebases, large logs
imageZ-Image TurboText-to-image via LocalAI
audioWhisper large-v3Speech-to-text via LocalAI

The problem with a tool that can read files, write files, and run arbitrary shell commands is obvious. The solution is bubblewrap.

OpenCode runs inside a bubblewrap sandbox. It only sees the project directory it’s handed. It can’t traverse the filesystem, can’t reach services it has no business reaching, and can’t make network calls outside a defined allowlist. The sandbox is transparent during normal use — OpenCode doesn’t know it’s there — but a misbehaving model output or a prompt injection that tries to do something clever lands in a box it can’t get out of.

The wrapper is a shell script installed as opencode in ~/.local/bin, taking precedence over the system binary. Every invocation goes through it.

# simplified — actual wrapper handles bind mounts and seccomp filter
exec bwrap \
  --ro-bind /usr/bin /usr/bin \
  --bind "${PROJECT_DIR}" "${PROJECT_DIR}" \
  --unshare-net \
  opencode-real "$@"

The seccomp filter is generated separately — a BPF program that allowlists the syscalls OpenCode legitimately needs and denies everything else. It’s fiddly to get right the first time and then you don’t touch it again.


Zed integration

Zed has a native agent server integration. OpenCode is registered there as the provider:

"agent_servers": {
  "opencode": {
    "type": "registry"
  }
}

When I’m in the middle of editing and want an inline suggestion, a refactor, or a quick explanation without leaving the editor, Zed routes it through OpenCode. Same local models, same bubblewrap sandbox, same stack. The editor integration and the terminal session share the same backend — there’s no separate “Zed model” to configure.


Skills and Codegraph

Skills

I maintain a skills system for the AI agents — a directory of domain-specific Markdown files, each covering a specific tool or area: Kubernetes, Ansible, Rust, the blog stack, hardware specifics, etc.

The key design constraint is lazy loading. At session start, the agent only loads skill names and their trigger keywords. The actual content — which can be several hundred lines of reference material — doesn’t enter context until a trigger fires. This keeps the context window clean and means the agent isn’t dragging around Kubernetes docs when you’re writing a blog post.

A skill file looks like this:

---
name: localrecall
triggers:
  - LocalRecall
  - localrecall
  - RAG server
  - chromem engine
---

## Always

- Pure HTTP API server; all config via env vars; base path `/api`
- Three backends: chromem (default), postgres (hybrid BM25+vector), localai
- Indexable file types: PDF, TXT, MD

## Branch

| Context | Load |
|---|---|
| Creating or managing collections | Read `references/collections.md` |
| Uploading files or external sources | Read `references/ingest.md` |
| Searching a collection | Read `references/search.md` |

The ## Always block is ≤5 bullets — only the things that cause immediately wrong behavior if unknown. Everything else lives in branch reference files, loaded one at a time when the context matches. I’ve got skills covering around 50 domains at this point. New ones are easy to add.

Codegraph

Codegraph runs as an MCP server inside both Claude Code and OpenCode. It indexes the codebase into a SQLite knowledge graph — every symbol, every call edge, every file relationship — built from a full AST parse.

When the AI needs to understand how something works, where a function is called from, or what a change would break, it queries Codegraph directly instead of grepping around hoping to find the right file. The index lags writes by about a second. Queries come back in under a millisecond. For practical purposes it’s live.

The difference in how an agent navigates a codebase with and without it is hard to overstate. Grep finds strings. Codegraph finds meaning — call paths across files, dynamic dispatch hops, symbol definitions across a whole repo. Once you’ve worked with it you notice immediately when it’s absent.


Open Questions

The stack is stable, but there’s plenty I’m still chewing on. LocalRecall’s chromem backend is file-based and fast, but I’m curious whether the postgres hybrid BM25 + vector backend would improve retrieval quality on longer documents — or whether that’s a problem I don’t actually have yet.

The bubblewrap seccomp filter works, but “works” is doing some load-bearing there — it passes the syscalls I tested and blocks ones I didn’t. I’d like a cleaner way to profile what OpenCode actually needs rather than assembling the allowlist by running it into walls.

The OcuLink 6800 XT situation: eGPU bandwidth over OcuLink is theoretically fine for inference, but inference has different access patterns than gaming workloads, which is what most OcuLink benchmarks cover. I allow it’ll be fine. I will find out shortly.

And the big one I keep circling back to: is the skills system the right abstraction at scale, or am I building toward a point where 80 skill files becomes its own kind of mess? Right now it’s manageable. The lazy-loading discipline keeps it clean. But there’s a version of this that turns into a laurel hell of Markdown I can’t find my way through.