<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>fosstog - localai</title>
    <subtitle>Free &amp; Open Source Photography</subtitle>
    <link rel="self" type="application/atom+xml" href="https://fosstog.com/tags/localai/atom.xml"/>
    <link rel="alternate" type="text/html" href="https://fosstog.com"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2026-06-11T00:00:00+00:00</updated>
    <id>https://fosstog.com/tags/localai/atom.xml</id>
    <entry xml:lang="en">
        <title>No Cloud Required: My Local AI Stack on AMD Hardware</title>
        <published>2026-06-11T00:00:00+00:00</published>
        <updated>2026-06-11T00:00:00+00:00</updated>
        
        <author>
          <name>ganthore</name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://fosstog.com/blog/local-ai-stack/"/>
        <id>https://fosstog.com/blog/local-ai-stack/</id>
        
        <content type="html" xml:base="https://fosstog.com/blog/local-ai-stack/">&lt;p&gt;&lt;strong&gt;TL;DR&lt;&#x2F;strong&gt;: I run a fully self-hosted AI stack — inference, memory, agent orchestration, image generation, and speech-to-text — entirely on local AMD hardware. This is a follow-up to &lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;agentic-ai-rabbit-hole&#x2F;&quot;&gt;Down the Agentic AI Rabbit Hole&lt;&#x2F;a&gt;, where I covered the earlier version of this setup. A lot has changed since then, and I have a second GPU arriving any day now that I am completely calm about.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Section&lt;&#x2F;th&gt;&lt;th&gt;Summary&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#then-vs-now&quot;&gt;Then vs. now&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;What changed since the last post&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#how-i-got-here&quot;&gt;How I got here&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;Lemonade, Gaia, LiteLLM, and why none of them stuck&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#the-service-map&quot;&gt;The service map&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;What’s running and what it does&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#hardware-split&quot;&gt;Hardware split&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;dGPU, iGPU, and an incoming OcuLink eGPU&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#tuning-the-models&quot;&gt;Tuning the models&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;YAML configs, KV cache quantization, Ministral, and 1-bit models&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#the-memory-layer&quot;&gt;The memory layer&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;LocalAGI, LocalRecall, and why Cognee didn’t last&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#opencode-and-the-sandbox&quot;&gt;OpenCode and the sandbox&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;Bubblewrap isolation for a tool that can run shell commands&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#zed-integration&quot;&gt;Zed integration&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;OpenCode as a Zed agent server&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#skills-and-codegraph&quot;&gt;Skills and Codegraph&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;Lazy-loaded domain knowledge and a live code intelligence graph&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;local-ai-stack&#x2F;#open-questions&quot;&gt;Open Questions&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;Things I’m still chewing on&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;then-vs-now&quot;&gt;Then vs. now&lt;a class=&quot;post-anchor&quot; href=&quot;#then-vs-now&quot; aria-label=&quot;Anchor link for: then-vs-now&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;p&gt;The &lt;a href=&quot;https:&#x2F;&#x2F;fosstog.com&#x2F;blog&#x2F;agentic-ai-rabbit-hole&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt; covered the first serious version of this stack: Lemonade Server for inference, Gaia for a web-style agent interface, a vaguely described personal knowledge base connected via MCP, and LiteLLM sitting in front of everything translating Claude API names to local models. It worked. It was also a collection of moving parts that each had its own opinions about how things should run.&lt;&#x2F;p&gt;
&lt;p&gt;Here’s where things stood then versus now:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Component&lt;&#x2F;th&gt;&lt;th&gt;Then (May 2026)&lt;&#x2F;th&gt;&lt;th&gt;Now (June 2026)&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Inference&lt;&#x2F;td&gt;&lt;td&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lemonade-server.ai&#x2F;&quot;&gt;Lemonade Server&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;localai.io&#x2F;&quot;&gt;LocalAI&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Model proxy&lt;&#x2F;td&gt;&lt;td&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;BerriAI&#x2F;litellm&quot;&gt;LiteLLM&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;Removed — LocalAI speaks OpenAI API natively&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Agent interface&lt;&#x2F;td&gt;&lt;td&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;amd&#x2F;gaia&quot;&gt;Gaia&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mudler&#x2F;LocalAGI&quot;&gt;LocalAGI&lt;&#x2F;a&gt;&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Chat UI&lt;&#x2F;td&gt;&lt;td&gt;Gaia web interface&lt;&#x2F;td&gt;&lt;td&gt;LocalAI built-in chat&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Memory&#x2F;RAG&lt;&#x2F;td&gt;&lt;td&gt;Vague MCP-connected knowledge base&lt;&#x2F;td&gt;&lt;td&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mudler&#x2F;localrecall&quot;&gt;LocalRecall&lt;&#x2F;a&gt; (replaced Cognee)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Image gen&lt;&#x2F;td&gt;&lt;td&gt;LocalAI Vulkan backend (blocked by glibc)&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;sd-rocm&lt;&#x2F;code&gt; — &lt;code&gt;stable-diffusion.cpp&lt;&#x2F;code&gt; built with ROCm&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;STT&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;Whisper large-v3-turbo via LocalAI&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;TTS&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;Piper via LocalAI&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Code sandbox&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;Bubblewrap isolation on every OpenCode invocation&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Code intelligence&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;Codegraph MCP server&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Agent skills&lt;&#x2F;td&gt;&lt;td&gt;Ad-hoc&lt;&#x2F;td&gt;&lt;td&gt;Formalized lazy-loaded SKILL.md system&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The short version: Lemonade did one thing well. LocalAI does everything.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;how-i-got-here&quot;&gt;How I got here&lt;a class=&quot;post-anchor&quot; href=&quot;#how-i-got-here&quot; aria-label=&quot;Anchor link for: how-i-got-here&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;p&gt;I didn’t sit down one day and design this stack from scratch. I arrived at it the usual way — by running through several things that didn’t quite fit until something did.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;lemonade-server.ai&#x2F;&quot;&gt;Lemonade Server&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; was my inference backend for a long stretch and the foundation of the previous setup. It’s clean, the API is well-behaved, and if you’re on Windows it’s one of the easier options. But it’s focused on chat and completion. Once I wanted embeddings, image generation, and speech-to-text all coming out of the same backend, Lemonade started showing its ceiling. It ain’t designed for that, and bending it that way felt like the wrong direction.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;gaianet.ai&#x2F;&quot;&gt;Gaia&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; I spent time with too — also covered in the previous post. The concept is interesting — a decentralized network of AI nodes, each running local models, addressable from anywhere. In practice, for a self-hosted setup where everything lives on one machine, the decentralized architecture is more in the way than useful. It makes assumptions about how you want to run things that don’t match how I want to run things.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;BerriAI&#x2F;litellm&quot;&gt;LiteLLM&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; came in as a translation layer — a proxy that sat in front of everything and mapped Claude API model names (&lt;code&gt;claude-sonnet-4-6&lt;&#x2F;code&gt;, &lt;code&gt;claude-haiku-4-5&lt;&#x2F;code&gt;) to whatever local model was actually serving. That worked, but it was a layer of indirection that mostly just added a place for things to go wrong. Every request had to go through it, and when something misfired it wasn’t always obvious which layer was at fault.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;localai.io&#x2F;&quot;&gt;LocalAI&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; is where I landed. It handles inference, embeddings, image generation via Stable Diffusion backends, speech-to-text via Whisper, and TTS via Piper — all under a single OpenAI-compatible API. It has real &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;rocm.docs.amd.com&#x2F;&quot;&gt;ROCm&lt;&#x2F;a&gt; support for AMD GPUs, which matters a lot when your hardware isn’t Nvidia. Once it was running, LiteLLM came out and hasn’t been missed.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-service-map&quot;&gt;The service map&lt;a class=&quot;post-anchor&quot; href=&quot;#the-service-map&quot; aria-label=&quot;Anchor link for: the-service-map&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;p&gt;Everything runs as a &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;wiki.archlinux.org&#x2F;title&#x2F;Systemd&#x2F;User&quot;&gt;systemd user service&lt;&#x2F;a&gt; — no root, starts on login, logs to the journal like anything else.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Service&lt;&#x2F;th&gt;&lt;th&gt;Role&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;localai.service&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Primary inference — dGPU (ROCm), large models, embeddings, STT, TTS&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;localai-igpu.service&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Secondary inference — iGPU, small&#x2F;fast models&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;localrecall.service&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Vector RAG server — chromem backend, fed by LocalAI embeddings&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;localagi.service&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Agent orchestration — routes to LocalAI for LLM, LocalRecall for RAG&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;sd-rocm.service&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Image generation — &lt;code&gt;stable-diffusion.cpp&lt;&#x2F;code&gt; built with ROCm&#x2F;HIPblas&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;LocalAI’s built-in chat interface runs alongside inference — no separate UI service needed.&lt;&#x2F;p&gt;
&lt;p&gt;The dependency ordering is explicit in the systemd units:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #F8F8F2; background-color: #272822;&quot;&gt;&lt;code data-lang=&quot;ini&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;[Unit]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;Description&lt;&#x2F;span&gt;&lt;span&gt;=LocalAGI agent orchestration server&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;After&lt;&#x2F;span&gt;&lt;span&gt;=localai.service localrecall.service&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
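&lt;p&gt;With &lt;code&gt;After=&lt;&#x2F;code&gt;, systemd orders LocalAGI’s startup behind LocalAI and LocalRecall whenever they are started together. &lt;code&gt;After=&lt;&#x2F;code&gt; only controls ordering, though: for LocalAGI to be pulled in with those services, or to follow an inference restart, the unit also needs a dependency directive such as &lt;code&gt;Requires=&lt;&#x2F;code&gt; or &lt;code&gt;PartOf=&lt;&#x2F;code&gt;. Not glamorous, but it works.&lt;&#x2F;p&gt;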
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;hardware-split&quot;&gt;Hardware split&lt;a class=&quot;post-anchor&quot; href=&quot;#hardware-split&quot; aria-label=&quot;Anchor link for: hardware-split&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;p&gt;The RX 9070 XT is the primary GPU — 16GB VRAM, ROCm, handles the heavy models. The iGPU handles a second LocalAI instance running smaller, faster models. The two instances are isolated from each other via &lt;code&gt;HIP_VISIBLE_DEVICES&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Instance&lt;&#x2F;th&gt;&lt;th&gt;Device&lt;&#x2F;th&gt;&lt;th&gt;Use&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;localai&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;HIP_VISIBLE_DEVICES=0&lt;&#x2F;code&gt; (dGPU)&lt;&#x2F;td&gt;&lt;td&gt;Large models, embeddings, image gen, STT&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;localai-igpu&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;&lt;code&gt;HIP_VISIBLE_DEVICES=-1&lt;&#x2F;code&gt; (iGPU&#x2F;CPU)&lt;&#x2F;td&gt;&lt;td&gt;Small models, fast responses&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Setting &lt;code&gt;HIP_VISIBLE_DEVICES=-1&lt;&#x2F;code&gt; on the iGPU instance hides HIP devices from that process, the discrete card included, so nothing it loads can land on the dGPU through ROCm. The two can’t step on each other’s VRAM budget.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;What’s coming&lt;&#x2F;strong&gt;: I’ve got &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;OCuLink&quot;&gt;OcuLink&lt;&#x2F;a&gt; hardware arriving shortly with an RX 6800 XT. OcuLink is a cabled PCIe link for external GPUs. It carries native PCIe lanes instead of tunneling them as Thunderbolt does, so it is generally faster for GPU traffic and adds little overhead compared to a slot-mounted card, though the link is typically four lanes wide rather than sixteen. The plan is a third LocalAI instance pinned to that GPU (&lt;code&gt;HIP_VISIBLE_DEVICES=1&lt;&#x2F;code&gt;), which gives three tiers:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Tier&lt;&#x2F;th&gt;&lt;th&gt;Hardware&lt;&#x2F;th&gt;&lt;th&gt;Models&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Heavy&lt;&#x2F;td&gt;&lt;td&gt;RX 9070 XT (16GB)&lt;&#x2F;td&gt;&lt;td&gt;14B–30B models, embeddings, image gen&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Mid&lt;&#x2F;td&gt;&lt;td&gt;RX 6800 XT (16GB, eGPU)&lt;&#x2F;td&gt;&lt;td&gt;7B–14B models, coding agents&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Light&lt;&#x2F;td&gt;&lt;td&gt;iGPU&lt;&#x2F;td&gt;&lt;td&gt;1B–4B models, fast completions&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;Right now the 9070 XT is doing work that’ll be more comfortable spread across two cards.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;image-generation&quot;&gt;Image generation&lt;a class=&quot;post-anchor&quot; href=&quot;#image-generation&quot; aria-label=&quot;Anchor link for: image-generation&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;leejet&#x2F;stable-diffusion.cpp&quot;&gt;&lt;code&gt;sd-rocm&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; is a separate service, built from source with ROCm&#x2F;HIPblas targeting &lt;code&gt;gfx1201&lt;&#x2F;code&gt; (9070 XT) and &lt;code&gt;gfx1036&lt;&#x2F;code&gt; (iGPU). It shares the same GGUF model files as LocalAI — no duplication on disk. VRAM layout is intentional:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Diffusion model + VAE → GPU&lt;&#x2F;li&gt;
&lt;li&gt;Text encoders → CPU RAM (&lt;code&gt;--clip-on-cpu&lt;&#x2F;code&gt;)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Keeping the text encoders (~3.6GB) on the CPU leaves the GPU headroom to share with whatever LLM is loaded. Running both concurrently works; running both at full tilt at the same time is a recipe for the mullygrubs.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;tuning-the-models&quot;&gt;Tuning the models&lt;a class=&quot;post-anchor&quot; href=&quot;#tuning-the-models&quot; aria-label=&quot;Anchor link for: tuning-the-models&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;p&gt;Each model in LocalAI gets a YAML config. The knobs that matter:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #F8F8F2; background-color: #272822;&quot;&gt;&lt;code data-lang=&quot;yaml&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;name&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; qwen3.5-9b&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;backend&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; llama-cpp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;context_size&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; 32768&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;gpu_layers&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; 99&lt;&#x2F;span&gt;&lt;span style=&quot;color: #88846F;&quot;&gt;        # full offload — all layers on GPU&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;flash_attention&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; &amp;quot;on&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #88846F;&quot;&gt; # faster attention math, same output&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;cache_type_k&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; q8_0&lt;&#x2F;span&gt;&lt;span style=&quot;color: #88846F;&quot;&gt;    # KV cache at q8 precision — saves VRAM, negligible quality loss&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;cache_type_v&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; q8_0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;threads&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; 16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;reasoning_effort&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; none&lt;&#x2F;span&gt;&lt;span style=&quot;color: #88846F;&quot;&gt;  # disable think-chain for this model by default&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;temperature&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; 0.6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;top_k&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; 20&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;top_p&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; 0.95&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;parameters&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;  f16&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;  mmap&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;  model&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; Qwen3.5-9B-UD-Q4_K_XL.gguf&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;gpu_layers: 99&lt;&#x2F;code&gt;&lt;&#x2F;strong&gt; — offloads all transformer layers to the GPU. With enough VRAM this is the right call; partial offload (layers split between GPU and system RAM) is noticeably slower than keeping the whole model on the GPU.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;KV cache quantization&lt;&#x2F;strong&gt; — the attention cache is quantized separately from the model weights. &lt;code&gt;q8_0&lt;&#x2F;code&gt; roughly halves the cache’s VRAM footprint relative to f16 while staying precise enough that quality holds up. At 32K context the KV cache can run to several gigabytes on its own, so the saving is worth having.&lt;&#x2F;p&gt;
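&lt;p&gt;To put a rough number on that, here is a minimal Python sketch of the usual KV cache size arithmetic. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical stand-in rather than the real shape of any model in this stack, and &lt;code&gt;kv_cache_bytes&lt;&#x2F;code&gt; is a name made up for this example. Models with hybrid or sliding-window attention will not follow this formula exactly.&lt;&#x2F;p&gt;

```python
# Rough KV-cache size estimate for a llama.cpp-style decoder.
# The model shape used below is hypothetical, not read from any real GGUF file.

# Bytes per cached element: f16 stores 2 bytes per value; q8_0 packs 32 values
# into a 34-byte block (32 int8 values plus one f16 scale).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, cache_type):
    """Return the bytes needed to hold keys and values for a full context."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # keys and values
    return elements_per_token * context * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, context=32768)
    for cache_type in ("f16", "q8_0"):
        gib = kv_cache_bytes(cache_type=cache_type, **shape) / 2**30
        print(f"{cache_type}: {gib:.2f} GiB")
```

&lt;p&gt;For that assumed shape, a 32K context needs 4 GiB of cache at f16 and a little over 2 GiB at &lt;code&gt;q8_0&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;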
&lt;p&gt;&lt;strong&gt;&lt;code&gt;reasoning_effort: none&lt;&#x2F;code&gt;&lt;&#x2F;strong&gt; — models like Qwen3.5 have a built-in chain-of-thought mode where they reason through a problem before answering. That’s useful sometimes. Burning thinking tokens on “what day is it” is not. This can be set per model in the YAML and overridden per-request.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Quantization on the model files&lt;&#x2F;strong&gt; is the other main lever. &lt;code&gt;Q4_K_XL&lt;&#x2F;code&gt; is smaller and faster than &lt;code&gt;Q8_0&lt;&#x2F;code&gt;, takes less VRAM, and for most tasks — especially coding and instruction-following — the quality difference is hard to find. &lt;code&gt;Q8&lt;&#x2F;code&gt; is worth the cost for models where I care about reasoning fidelity; &lt;code&gt;Q4_K_M&lt;&#x2F;code&gt; or &lt;code&gt;Q4_K_XL&lt;&#x2F;code&gt; for everything else.&lt;&#x2F;p&gt;
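&lt;p&gt;The VRAM side of that trade-off is simple arithmetic. The sketch below estimates weight size from parameter count and bits per weight. It assumes a model with exactly 9 billion parameters, and the 4.8 bits-per-weight figure for a Q4_K-class file is an approximation (K-quants mix block types, so real files vary); &lt;code&gt;weights_gb&lt;&#x2F;code&gt; is a name invented for this example.&lt;&#x2F;p&gt;

```python
# Back-of-the-envelope GGUF weight size from parameter count and bits per weight.
# Q8_0 is 8.5 bits per weight (a 34-byte block holds 32 weights).
# The Q4_K figure is an approximation; actual files vary by quant recipe.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K (approx.)": 4.8}


def weights_gb(n_params, bits_per_weight):
    """Return the approximate weight size in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 9e9  # hypothetical 9B-parameter model
    for name, bpw in BITS_PER_WEIGHT.items():
        print(f"{name}: {weights_gb(n_params, bpw):.1f} GB")
```

&lt;p&gt;Under those assumptions the Q8_0 weights land around 9.6 GB and the Q4_K-class weights around 5.4 GB, before any KV cache or compute buffers are added on top.&lt;&#x2F;p&gt;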
&lt;p&gt;I’ve also got a soft spot for the &lt;strong&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;mistral.ai&#x2F;news&#x2F;ministraux&#x2F;&quot;&gt;Ministral&lt;&#x2F;a&gt;&lt;&#x2F;strong&gt; family specifically — the 3B, 8B, and 14B variants. The reason is pretty specific: DevOps tooling. Kubernetes manifests, Helm charts, Terraform, shell scripts, Ansible tasks — Ministral handles all of it well and doesn’t hallucinate its way through YAML the way some models do. Tool-calling works reliably, instruction-following is tight, and the context windows are solid for the size. The 8B in particular earns its keep as an always-on ops model: fast enough that you don’t notice the wait, capable enough that the output is actually usable.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;1-bit-models-bonsai-and-minicpm5&quot;&gt;1-bit models: Bonsai and MiniCPM5&lt;a class=&quot;post-anchor&quot; href=&quot;#1-bit-models-bonsai-and-minicpm5&quot; aria-label=&quot;Anchor link for: 1-bit-models-bonsai-and-minicpm5&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h3&gt;
&lt;p&gt;Separate from the production stack, I’ve been playing with &lt;strong&gt;1-bit quantized models&lt;&#x2F;strong&gt; — &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;huggingface.co&#x2F;mobiuslabsgmbh&quot;&gt;Bonsai&lt;&#x2F;a&gt; and &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;huggingface.co&#x2F;openbmb&#x2F;MiniCPM5-1B&quot;&gt;MiniCPM5-1B&lt;&#x2F;a&gt; specifically. These aren’t “lower quality versions of normal models” in the usual sense — they’re trained from scratch with 1-bit weights, where each parameter is essentially a single bit rather than a float. The result is models that are absurdly small and fast.&lt;&#x2F;p&gt;
&lt;p&gt;Bonsai comes in 1.7B, 4B, and 8B variants at Q1_0. MiniCPM5-1B is exactly what it sounds like. Neither is going to replace a 14B model for anything serious. The interesting part is how much they &lt;em&gt;can&lt;&#x2F;em&gt; do — coherent responses, basic tool calling, useful summaries — at token rates that make normal quantized models look sluggish. The Bonsai 4B sits at 130–149 tok&#x2F;s on my hardware. That’s fast enough to use as a dispatch model, a quick triage pass, or anything where you want a near-instant answer and the stakes aren’t high.&lt;&#x2F;p&gt;
&lt;p&gt;It’s interesting work from a research standpoint. The fact that a 1-bit model can produce anything useful at all still surprises me a little every time.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-memory-layer&quot;&gt;The memory layer&lt;a class=&quot;post-anchor&quot; href=&quot;#the-memory-layer&quot; aria-label=&quot;Anchor link for: the-memory-layer&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mudler&#x2F;LocalAGI&quot;&gt;LocalAGI&lt;&#x2F;a&gt; is the agent orchestration layer. When a message comes in, it doesn’t fire it straight at the LLM — it queries LocalRecall for relevant context first, injects it into the prompt, then hands everything to LocalAI. The model answers with relevant stored knowledge already in hand.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #F8F8F2; background-color: #272822;&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;User message&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ▼&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;LocalAGI ──► LocalRecall (vector search)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │              │&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │         relevant chunks&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │◄─────────────┘&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ▼&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;LocalAI (LLM inference)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    │&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ▼&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Response&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;mudler&#x2F;localrecall&quot;&gt;&lt;strong&gt;LocalRecall&lt;&#x2F;strong&gt;&lt;&#x2F;a&gt; is the vector RAG server. It stores documents as embeddings — generated by LocalAI’s embedding model — and retrieves by semantic similarity rather than keyword match. Ask about “GPU memory management” and it’ll surface something filed under “VRAM budgeting” without you having to know the exact term you used when you stored it.&lt;&#x2F;p&gt;
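&lt;p&gt;As a toy illustration of retrieval by semantic similarity, the Python sketch below ranks notes by cosine similarity and injects the best match into a prompt. It does not call LocalRecall or LocalAI: the three-dimensional vectors are hand-made stand-ins for real embeddings, and &lt;code&gt;retrieve&lt;&#x2F;code&gt; and &lt;code&gt;build_prompt&lt;&#x2F;code&gt; are hypothetical helper names, not part of either project’s API.&lt;&#x2F;p&gt;

```python
import math

# Toy corpus: note title mapped to a hand-made 3-d "embedding".
# Real embeddings come from an embedding model and have far more dimensions.
NOTES = {
    "VRAM budgeting": [0.9, 0.1, 0.0],
    "Sourdough starter": [0.0, 0.2, 0.9],
    "Systemd user units": [0.1, 0.9, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec, notes, top_k=1):
    """Return the titles of the top_k notes most similar to the query vector."""
    ranked = sorted(notes, key=lambda title: cosine(query_vec, notes[title]), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks):
    """Prepend retrieved chunks to the question, as a RAG layer would."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Pretend this vector is the embedding of "GPU memory management".
    query_vec = [0.8, 0.2, 0.1]
    hits = retrieve(query_vec, NOTES)
    print(build_prompt("How do I manage GPU memory?", hits))
```

&lt;p&gt;The query shares no words with the “VRAM budgeting” title, yet that note ranks first because its vector points in nearly the same direction.&lt;&#x2F;p&gt;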
&lt;p&gt;It supports three backends: &lt;code&gt;chromem&lt;&#x2F;code&gt; (file-based, default), &lt;code&gt;postgres&lt;&#x2F;code&gt; (hybrid BM25 + vector), and &lt;code&gt;localai&lt;&#x2F;code&gt; (delegates embedding entirely to LocalAI). The &lt;code&gt;chromem&lt;&#x2F;code&gt; backend is what I use: no external database dependency, persists to disk, fast enough.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;I started with &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;topoteretes&#x2F;cognee&quot;&gt;Cognee&lt;&#x2F;a&gt;.&lt;&#x2F;strong&gt; It’s a knowledge graph tool that can build entity relationships across documents, which makes it more structured than a flat vector store. It worked, but for the way I actually use the memory layer (feed in notes, retrieve relevant chunks, don’t overthink it), Cognee was more machinery than the job needed. LocalRecall covers that workflow with less surface area. Cognee’s still referenced in the OpenCode config as a leftover, but LocalRecall is what the stack actually uses.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;opencode-and-the-sandbox&quot;&gt;OpenCode and the sandbox&lt;a class=&quot;post-anchor&quot; href=&quot;#opencode-and-the-sandbox&quot; aria-label=&quot;Anchor link for: opencode-and-the-sandbox&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;opencode.ai&#x2F;&quot;&gt;OpenCode&lt;&#x2F;a&gt; is my terminal-based AI coding assistant — think Cursor but in the terminal, pointed at local models. It’s configured with a fleet of named agents, each assigned a specific model and role:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Agent&lt;&#x2F;th&gt;&lt;th&gt;Model&lt;&#x2F;th&gt;&lt;th&gt;Purpose&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;build&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Codestral-RAG-19B&lt;&#x2F;td&gt;&lt;td&gt;General coding, default agent&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;plan&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Qwen3.5-9B&lt;&#x2F;td&gt;&lt;td&gt;Architecture, system design, visible think chain&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;deep&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Devstral-Small-2507&lt;&#x2F;td&gt;&lt;td&gt;Complex multi-step problems, large refactors&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;ops&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Ministral-8B&lt;&#x2F;td&gt;&lt;td&gt;Shell, K8s, Terraform, incident triage&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;think&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Qwen3.5-9B&lt;&#x2F;td&gt;&lt;td&gt;Debug analysis, root-cause tracing&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;vision&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Gemma-4-E4B&lt;&#x2F;td&gt;&lt;td&gt;Multimodal — screenshots, diagrams&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;fast&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Bonsai-4B&lt;&#x2F;td&gt;&lt;td&gt;Trivial lookups, quick drafts, 130+ tok&#x2F;s&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;long&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;SmolLM3-3B&lt;&#x2F;td&gt;&lt;td&gt;128K context window — full codebases, large logs&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;image&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Z-Image Turbo&lt;&#x2F;td&gt;&lt;td&gt;Text-to-image via LocalAI&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;audio&lt;&#x2F;code&gt;&lt;&#x2F;td&gt;&lt;td&gt;Whisper large-v3&lt;&#x2F;td&gt;&lt;td&gt;Speech-to-text via LocalAI&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;The problem with a tool that can read files, write files, and run arbitrary shell commands is obvious. The solution is &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;containers&#x2F;bubblewrap&quot;&gt;bubblewrap&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;OpenCode runs inside a bubblewrap sandbox. It only sees the project directory it’s handed. It can’t traverse the filesystem, can’t reach services it has no business reaching, and can’t make network calls outside a defined allowlist. The sandbox is transparent during normal use — OpenCode doesn’t know it’s there — but a misbehaving model output or a prompt injection that tries to do something clever lands in a box it can’t get out of.&lt;&#x2F;p&gt;
&lt;p&gt;The wrapper is a shell script installed as &lt;code&gt;opencode&lt;&#x2F;code&gt; in &lt;code&gt;~&#x2F;.local&#x2F;bin&lt;&#x2F;code&gt;, which comes ahead of the system binary on &lt;code&gt;PATH&lt;&#x2F;code&gt; and so takes precedence over it. Every invocation goes through it.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #F8F8F2; background-color: #272822;&quot;&gt;&lt;code data-lang=&quot;shellscript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #88846F;&quot;&gt;# simplified — actual wrapper handles bind mounts and seccomp filter&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #66D9EF;&quot;&gt;exec&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; bwrap&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; \&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt;  --ro-bind&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; &#x2F;usr&#x2F;bin &#x2F;usr&#x2F;bin&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; \&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt;  --bind&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; &amp;quot;${&lt;&#x2F;span&gt;&lt;span&gt;PROJECT_DIR&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt;}&amp;quot; &amp;quot;${&lt;&#x2F;span&gt;&lt;span&gt;PROJECT_DIR&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt;}&amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt; \&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #AE81FF;&quot;&gt;  --unshare-net \&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt;  opencode-real &amp;quot;&lt;&#x2F;span&gt;&lt;span style=&quot;color: #FD971F;font-style: italic;&quot;&gt;$@&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt;&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The seccomp filter is generated separately — a BPF program that allowlists the syscalls OpenCode legitimately needs and denies everything else. It’s fiddly to get right the first time and then you don’t touch it again.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;zed-integration&quot;&gt;Zed integration&lt;a class=&quot;post-anchor&quot; href=&quot;#zed-integration&quot; aria-label=&quot;Anchor link for: zed-integration&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zed.dev&#x2F;&quot;&gt;Zed&lt;&#x2F;a&gt; has a native &lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;zed.dev&#x2F;docs&#x2F;assistant&quot;&gt;agent server&lt;&#x2F;a&gt; integration. OpenCode is registered there as the provider:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #F8F8F2; background-color: #272822;&quot;&gt;&lt;code data-lang=&quot;json&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt;&amp;quot;agent_servers&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;: {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #66D9EF;font-style: italic;&quot;&gt;  &amp;quot;opencode&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;: {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #66D9EF;font-style: italic;&quot;&gt;    &amp;quot;type&amp;quot;&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #CFCFC2;&quot;&gt; &amp;quot;registry&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When I’m in the middle of editing and want an inline suggestion, a refactor, or a quick explanation without leaving the editor, Zed routes it through OpenCode. Same local models, same bubblewrap sandbox, same stack. The editor integration and the terminal session share the same backend — there’s no separate “Zed model” to configure.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;skills-and-codegraph&quot;&gt;Skills and Codegraph&lt;a class=&quot;post-anchor&quot; href=&quot;#skills-and-codegraph&quot; aria-label=&quot;Anchor link for: skills-and-codegraph&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;skills&quot;&gt;Skills&lt;a class=&quot;post-anchor&quot; href=&quot;#skills&quot; aria-label=&quot;Anchor link for: skills&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h3&gt;
&lt;p&gt;I maintain a skills system for the AI agents — a directory of domain-specific Markdown files, each covering a specific tool or area: Kubernetes, Ansible, Rust, the blog stack, hardware specifics, etc.&lt;&#x2F;p&gt;
&lt;p&gt;The key design constraint is lazy loading. At session start, the agent only loads skill &lt;em&gt;names&lt;&#x2F;em&gt; and their trigger keywords. The actual content — which can be several hundred lines of reference material — doesn’t enter context until a trigger fires. This keeps the context window clean and means the agent isn’t dragging around Kubernetes docs when you’re writing a blog post.&lt;&#x2F;p&gt;
&lt;p&gt;A skill file looks like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo&quot; style=&quot;color: #F8F8F2; background-color: #272822;&quot;&gt;&lt;code data-lang=&quot;markdown&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;name&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; localrecall&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #F92672;&quot;&gt;triggers&lt;&#x2F;span&gt;&lt;span&gt;:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; LocalRecall&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; localrecall&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; RAG server&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  -&lt;&#x2F;span&gt;&lt;span style=&quot;color: #E6DB74;&quot;&gt; chromem engine&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #A6E22E;font-weight: bold;&quot;&gt;## Always&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #A6E22E;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span&gt; Pure HTTP API server; all config via env vars; base path &lt;&#x2F;span&gt;&lt;span style=&quot;color: #FD971F;&quot;&gt;`&#x2F;api`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #A6E22E;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span&gt; Three backends: chromem (default), postgres (hybrid BM25+vector), localai&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #A6E22E;&quot;&gt;-&lt;&#x2F;span&gt;&lt;span&gt; Indexable file types: PDF, TXT, MD&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span style=&quot;color: #A6E22E;font-weight: bold;&quot;&gt;## Branch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;| Context | Load |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;|---|---|&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;| Creating or managing collections | Read &lt;&#x2F;span&gt;&lt;span style=&quot;color: #FD971F;&quot;&gt;`references&#x2F;collections.md`&lt;&#x2F;span&gt;&lt;span&gt; |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;| Uploading files or external sources | Read &lt;&#x2F;span&gt;&lt;span style=&quot;color: #FD971F;&quot;&gt;`references&#x2F;ingest.md`&lt;&#x2F;span&gt;&lt;span&gt; |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;| Searching a collection | Read &lt;&#x2F;span&gt;&lt;span style=&quot;color: #FD971F;&quot;&gt;`references&#x2F;search.md`&lt;&#x2F;span&gt;&lt;span&gt; |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;## Always&lt;&#x2F;code&gt; block is ≤5 bullets — only the things that cause immediately wrong behavior if unknown. Everything else lives in branch reference files, loaded one at a time when the context matches. I’ve got skills covering around 50 domains at this point. New ones are easy to add.&lt;&#x2F;p&gt;
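&lt;p&gt;The lazy-loading idea can be sketched in a few lines. This is an illustration under assumptions, not the loader my agents actually run: &lt;code&gt;SkillRegistry&lt;&#x2F;code&gt;, &lt;code&gt;read_body&lt;&#x2F;code&gt;, and &lt;code&gt;FAKE_DISK&lt;&#x2F;code&gt; are hypothetical names, trigger matching is a plain case-insensitive substring check, and an in-memory dict stands in for the skill files on disk.&lt;&#x2F;p&gt;

```python
class SkillRegistry:
    """Holds only names and triggers until a trigger actually fires."""

    def __init__(self, index, read_body):
        # index: {skill name: [trigger phrases]}, all that is loaded at session start
        # read_body: callable that fetches a skill's full text on demand
        self.index = index
        self.read_body = read_body
        self.loaded = {}

    def on_message(self, text):
        """Load the body of every skill whose trigger appears in the message."""
        lowered = text.lower()
        for name, triggers in self.index.items():
            if name in self.loaded:
                continue  # already in context, do not read it twice
            if any(trigger.lower() in lowered for trigger in triggers):
                self.loaded[name] = self.read_body(name)
        return sorted(self.loaded)


# Hypothetical in-memory "disk" so the sketch runs without real files.
FAKE_DISK = {
    "localrecall": "(localrecall skill body)",
    "kubernetes": "(kubernetes skill body)",
}
reads = []


def read_body(name):
    reads.append(name)  # record each read so the lazy behaviour is visible
    return FAKE_DISK[name]


registry = SkillRegistry(
    {
        "localrecall": ["LocalRecall", "RAG server", "chromem engine"],
        "kubernetes": ["kubectl", "K8s"],
    },
    read_body,
)

registry.on_message("Help me draft a blog post")    # no trigger, nothing read
registry.on_message("Why is the RAG server slow?")  # loads localrecall only
print(reads)
```

&lt;p&gt;Only the skill whose trigger appeared was read. The Kubernetes body stays on disk until a message mentions &lt;code&gt;kubectl&lt;&#x2F;code&gt; or K8s.&lt;&#x2F;p&gt;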
&lt;h3 id=&quot;codegraph&quot;&gt;Codegraph&lt;a class=&quot;post-anchor&quot; href=&quot;#codegraph&quot; aria-label=&quot;Anchor link for: codegraph&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;a rel=&quot;noopener external&quot; target=&quot;_blank&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ntegrals&#x2F;codegraph&quot;&gt;Codegraph&lt;&#x2F;a&gt; runs as an MCP server inside both Claude Code and OpenCode. It indexes the codebase into a SQLite knowledge graph — every symbol, every call edge, every file relationship — built from a full AST parse.&lt;&#x2F;p&gt;
&lt;p&gt;When the AI needs to understand how something works, where a function is called from, or what a change would break, it queries Codegraph directly instead of grepping around hoping to find the right file. The index lags writes by about a second. Queries come back in under a millisecond. For practical purposes it’s live.&lt;&#x2F;p&gt;
&lt;p&gt;The difference in how an agent navigates a codebase with and without it is hard to overstate. Grep finds strings. Codegraph finds meaning — call paths across files, dynamic dispatch hops, symbol definitions across a whole repo. Once you’ve worked with it you notice immediately when it’s absent.&lt;&#x2F;p&gt;
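&lt;p&gt;As a rough picture of why a graph beats grep for “what would this change break”, here is a minimal sketch using Python’s built-in &lt;code&gt;sqlite3&lt;&#x2F;code&gt;. The two-table schema, the sample symbols, and &lt;code&gt;transitive_callers&lt;&#x2F;code&gt; are all made up for illustration. I am not claiming this is Codegraph’s actual schema or query interface.&lt;&#x2F;p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE symbols (id INTEGER PRIMARY KEY, name TEXT, file TEXT);
    CREATE TABLE calls (caller INTEGER, callee INTEGER);
""")
conn.executemany("INSERT INTO symbols VALUES (?, ?, ?)", [
    (1, "main", "cli.py"),
    (2, "load_config", "config.py"),
    (3, "parse_toml", "config.py"),
    (4, "render", "site.py"),
])
conn.executemany("INSERT INTO calls VALUES (?, ?)", [
    (1, 2),  # main calls load_config
    (2, 3),  # load_config calls parse_toml
    (1, 4),  # main calls render
])


def transitive_callers(name):
    """Every symbol that reaches `name` through one or more call edges."""
    rows = conn.execute("""
        WITH RECURSIVE up(id) AS (
            SELECT id FROM symbols WHERE name = ?
            UNION
            SELECT calls.caller FROM calls JOIN up ON calls.callee = up.id
        )
        SELECT symbols.name FROM symbols JOIN up ON symbols.id = up.id
        WHERE symbols.name != ?
        ORDER BY symbols.name
    """, (name, name)).fetchall()
    return [row[0] for row in rows]


print(transitive_callers("parse_toml"))
```

&lt;p&gt;Grepping for &lt;code&gt;parse_toml&lt;&#x2F;code&gt; would turn up &lt;code&gt;load_config&lt;&#x2F;code&gt; but not &lt;code&gt;main&lt;&#x2F;code&gt;, which never mentions it by name. The recursive query walks the call edges upward and finds both.&lt;&#x2F;p&gt;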
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;open-questions&quot;&gt;Open Questions&lt;a class=&quot;post-anchor&quot; href=&quot;#open-questions&quot; aria-label=&quot;Anchor link for: open-questions&quot;&gt;&lt;span aria-hidden=&quot;true&quot;&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;a&gt;
&lt;&#x2F;h2&gt;
&lt;p&gt;The stack is stable, but there’s plenty I’m still chewing on. LocalRecall’s &lt;code&gt;chromem&lt;&#x2F;code&gt; backend is file-based and fast, but I’m curious whether the &lt;code&gt;postgres&lt;&#x2F;code&gt; hybrid BM25 + vector backend would improve retrieval quality on longer documents — or whether that’s a problem I don’t actually have yet.&lt;&#x2F;p&gt;
&lt;p&gt;The bubblewrap seccomp filter works, but “works” is doing some load-bearing work there: it allows the syscalls I tested and blocks the ones I didn’t. I’d like a cleaner way to profile what OpenCode actually needs rather than assembling the allowlist by running it into walls.&lt;&#x2F;p&gt;
&lt;p&gt;The OcuLink 6800 XT situation: eGPU bandwidth over OcuLink is theoretically fine for inference, but inference has different access patterns than gaming workloads, which is what most OcuLink benchmarks cover. I allow it’ll be fine. I will find out shortly.&lt;&#x2F;p&gt;
&lt;p&gt;And the big one I keep circling back to: is the skills system the right abstraction at scale, or am I building toward a point where 80 skill files becomes its own kind of mess? Right now it’s manageable. The lazy-loading discipline keeps it clean. But there’s a version of this that turns into a laurel hell of Markdown I can’t find my way through.&lt;&#x2F;p&gt;
</content>
        
    </entry>
</feed>
