Alex Finn put out a piece a week or so ago making the case for running OpenClaw entirely on local hardware. No API fees, no rate limits, your data stays on your machine, and execution is where 90% of token usage lives so the savings compound fast.
He’s right. I’ve been wanting to test this for a while and his post was the push.
His specific recommendation, though, doesn’t quite work on the hardware most people who land on his post will actually have.
# What Alex recommends
Download LM Studio, search for Qwen 3.5-35B-A3B, grab the 4-bit MLX version (about 20 GB on disk), load it, point OpenClaw at it. He notes it needs 32 GB+ of unified memory.
Most Mac Minis people are buying right now ship with 16 GB.
# What actually fits in 16 GB
I run a Mac Mini M4 with 16 GB. After the OS and a few apps, you’re working with around 10-12 GB usable. A 20 GB model isn’t loading. A 9 GB model is the realistic ceiling, and even then you want to be careful with what else is running.
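The arithmetic above is easy to sanity-check. Here is a minimal sketch, assuming the sizes quoted in this post and a 1 GB allowance for context and runtime overhead. That allowance is my own placeholder, not a measured figure, and `fits_in_memory` is a hypothetical helper.

```python
def fits_in_memory(model_gb: float, usable_gb: float, headroom_gb: float = 1.0) -> bool:
    """Return True if the model plus a safety margin fits in usable memory.

    headroom_gb is an assumed allowance for context and runtime overhead,
    not a measured value.
    """
    return model_gb + headroom_gb <= usable_gb


# Sizes as quoted in this post; usable memory taken at the low end of 10-12 GB.
USABLE_GB = 10.0
MODELS = [
    ("Qwen 3.5-35B-A3B (4-bit MLX)", 20.0),
    ("Qwen 2.5 Coder 14B", 9.0),
    ("Llama 3.2", 2.0),
]

for name, size_gb in MODELS:
    verdict = "fits" if fits_in_memory(size_gb, USABLE_GB) else "does not fit"
    print(f"{name}: {size_gb:.0f} GB {verdict} in {USABLE_GB:.0f} GB usable")
```

Plug in your own usable-memory number; what is left after the OS and your apps will vary from machine to machine.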
The model the agent community keeps recommending for this exact configuration is Qwen 2.5 Coder 14B. It’s roughly 9 GB on Ollama, fits with a little headroom, and is a code-specialized model whose strengths (code reasoning, tool use, structured output) line up well with agent workloads.
Pull it with one command:
ollama pull qwen2.5-coder:14b
For the lighter triage work (heartbeats, classification, “is this email urgent” decisions), grab Llama 3.2 too. The default tag is about 2 GB. On my machine it loads in 17 seconds the first time and answers in under 2 seconds on subsequent calls.
ollama pull llama3.2
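Once both models are pulled, a quick smoke test confirms they actually respond before you wire anything into OpenClaw. This is a minimal sketch that talks to Ollama's local HTTP API directly, assuming the server is running on its default port (11434). The helper names `build_request` and `ask` are mine, and the prompt is only an illustration.

```python
import json
import urllib.request

# Ollama's default local endpoint; change it if you run the server elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally pulled model."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example, with the Ollama server running:
#   print(ask("llama3.2", "One word, urgent or routine: 'The server is down.'"))
```

The first call includes model load time, so give it a generous timeout; later calls should come back much faster.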
# How I actually wired it
In OpenClaw’s config:
- Default chat model for the main agent: not local. I delegate to Claude through the sanctioned claude-cli backend, which means OpenClaw shells out to my logged-in Claude Code CLI and the call is debited against my Max subscription. Cost: zero extra per conversation.
- Sub-agent model for the heavier code-and-tool work: Qwen 2.5 Coder 14B local. Free.
- Heartbeat model every 5 minutes: Llama 3.2 local. Free.
That hybrid is the actual unlock. Frontier model on the hot path where reasoning quality matters. Local models everywhere else, including the loops you’d never run on the API because the budget wouldn’t justify it.
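To make the split concrete, here is the same three-way routing written out as a plain Python table. This is an illustration of the idea only, not OpenClaw's config format. The role names, the `Route` type, and the fallback rule are all hypothetical, and the Claude model is left unspecified because it comes from whatever the CLI is logged in with.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    backend: str          # "claude-cli" (subscription) or "ollama" (local)
    model: Optional[str]  # None means "whatever the backend defaults to"


# Hypothetical role names; the model tags are the ones pulled earlier.
ROUTES = {
    "chat": Route(backend="claude-cli", model=None),
    "subagent": Route(backend="ollama", model="qwen2.5-coder:14b"),
    "heartbeat": Route(backend="ollama", model="llama3.2"),
}


def route_for(role: str) -> Route:
    """Pick a route; unknown roles fall back to the local sub-agent model.

    The fallback is an assumption of this sketch: it keeps unexpected
    loops off the frontier path unless you opt in explicitly.
    """
    return ROUTES.get(role, ROUTES["subagent"])


for role in ("chat", "subagent", "heartbeat"):
    r = route_for(role)
    print(f"{role}: {r.backend} / {r.model or 'backend default'}")
```

The design choice worth copying is the default: anything you have not explicitly put on the frontier path stays local.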
Alex’s broader point stands. Local execution changes which loops are worth running. I have my agent re-reading her own memory files every five minutes and writing back a state summary. That’s around 5,000 tokens of input every cycle. On the API that’s a budget item I’d have to defend. Locally it’s noise.
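For a sense of scale, here is the daily volume that loop adds up to. It is a small sketch using the numbers above (about 5,000 input tokens every five minutes). The price constant is a made-up placeholder to show the math, not a quoted rate from any provider, so substitute your own.

```python
def tokens_per_day(tokens_per_cycle: int, interval_minutes: int) -> int:
    """Input tokens consumed per day by a loop that fires every interval_minutes."""
    cycles_per_day = (24 * 60) // interval_minutes
    return tokens_per_cycle * cycles_per_day


def api_cost_per_day(tokens: int, usd_per_million_input_tokens: float) -> float:
    """Daily input cost at a given per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_input_tokens


# Placeholder price, for illustration only. Use your provider's actual rate.
PLACEHOLDER_USD_PER_MILLION = 1.0

daily_tokens = tokens_per_day(tokens_per_cycle=5_000, interval_minutes=5)
print(f"{daily_tokens:,} input tokens per day")  # 1,440,000
print(f"${api_cost_per_day(daily_tokens, PLACEHOLDER_USD_PER_MILLION):.2f} per day at the placeholder rate")
```

This counts input only; the state summary the agent writes back adds output tokens on top.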
# The pragmatic version of his thesis
You don’t need 32 GB of RAM to start. You need 16 GB, the right model for the size, and a clear split between what runs frontier and what runs local. That gets you the cost-savings argument without the hardware upgrade.
If you have the headroom for Qwen 3.5-35B-A3B, take it. If you don’t, Qwen 2.5 Coder 14B + Llama 3.2 is the setup that actually ships.
Let’s go!