Building a Local Coding Stack (And Why You Still Need the Cloud)

I tested every major coding harness with local models: Claude Code, Cline, OpenCode, Aider, and more. Here's what actually works for agentic coding without the cloud.

Tags: ai, local-llm, coding, tools

You’ve got a local model running. Now what?

The model doesn’t do anything by itself. You need a harness: something that connects it to your codebase, handles tool calls, manages context, and lets the model read, write, and run commands. The harness is the difference between a chatbot and a coding agent.

A year ago there was basically one option. Now there are six. I tested all of them with local models via Ollama.

The Harnesses

Claude Code + Ollama (Terminal)

Ollama v0.14 added native Anthropic API compatibility. The setup is one command:

```bash
ollama launch claude-code --model qwen3.5:27b
```

You get Claude Code’s tool calling infrastructure, file handling, git integration, and terminal access. The /commands work. The agent loop works. It’s the most polished harness available.
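If your Ollama build predates the `launch` subcommand, you can usually point Claude Code at any Anthropic-compatible endpoint by hand. A minimal sketch, assuming Ollama v0.14+ is serving its Anthropic-compatible API on the default port; the model tag is the one from above, so substitute whatever you've pulled:

```bash
# Manual route: aim Claude Code at a local Anthropic-compatible server.
# Assumes Ollama v0.14+ listening on its default port, 11434.
export ANTHROPIC_BASE_URL="http://localhost:11434"
export ANTHROPIC_AUTH_TOKEN="ollama"   # placeholder; local servers don't check it
export ANTHROPIC_MODEL="qwen3.5:27b"   # any model tag you've pulled
claude
```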

The catch: Claude Code’s system prompt eats 14-18K tokens of context on every request. For Opus with 1M context, that’s nothing. For a local model at 32K or 64K context, it’s a real problem. Long sessions hit compaction early. There’s also a per-request hash in the system prompt that defeats prefix caching on some inference servers.

Claude Code was built for Opus. It works with local models, but the overhead matters.

Cline with YOLO Mode (VS Code)

Cline’s default is approval at every step. Read a file? Approve. Write? Approve. For agentic work, that’s a non-starter.

But YOLO mode changes everything. Check a box in settings and Cline auto-approves all actions: file edits, terminal commands, browser use, everything. No confirmation dialogs. The model runs autonomously until the task is done.

Cline supports Ollama, LM Studio, and any OpenAI-compatible endpoint. With YOLO mode enabled and a capable local model (MiniMax or Qwen3.5 from my benchmarks), you get fully autonomous agentic coding inside VS Code. You see file changes in real time in your editor.
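One gotcha worth flagging: Ollama's default context window is small, and Cline's tool-calling prompts are long, so raise the context limit on the server side before connecting. A rough sketch; the model tag is illustrative, and older Ollama releases need a Modelfile with `num_ctx` instead of the env var:

```bash
# Serve Ollama with a larger context window before connecting Cline.
# OLLAMA_CONTEXT_LENGTH is honored by recent Ollama releases.
export OLLAMA_CONTEXT_LENGTH=32768
ollama serve &
ollama pull qwen3.5:27b
# Then in VS Code: Cline settings -> API Provider: Ollama,
# Base URL: http://localhost:11434, and tick the auto-approve (YOLO) box.
```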

Lower context overhead than Claude Code. Solid tool calling support. If you’re a VS Code person, this might be the best option.

OpenCode (Terminal)

95K+ GitHub stars. Open source. The terminal-based alternative to Claude Code that supports any model backend.

OpenCode connects to Ollama, LM Studio, or any OpenAI-compatible API. It handles file reading, writing, command execution, and multi-step tool calling. The interface is similar to Claude Code but without the Anthropic-specific overhead.

I haven’t benchmarked it as thoroughly as the others, but the community consensus is that it’s the closest open-source equivalent to Claude Code for agentic workflows. Model-agnostic from day one.
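Wiring it to Ollama is a small JSON config. A sketch, assuming the provider schema in OpenCode's current docs; field names may drift between releases, so verify against your installed version:

```bash
# Minimal OpenCode provider config for a local Ollama endpoint.
mkdir -p ~/.config/opencode
cat > ~/.config/opencode/opencode.json <<'EOF'
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ollama": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Ollama (local)",
      "options": { "baseURL": "http://localhost:11434/v1" },
      "models": { "qwen3.5:27b": { "name": "Qwen3.5 27B (local)" } }
    }
  }
}
EOF
opencode
```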

Kilo Code / Roo Code (VS Code)

Kilo Code forked Cline and Roo Code, raised $8M, and has 1.5M users. It’s trying to be an all-in-one platform: orchestration, cloud execution, code review, voice commands.

Roo Code adds multi-agent role-based execution on top of the Cline foundation. Different “modes” for different tasks (architect mode, code mode, review mode).

Both support local models. Both have auto-approve options. For most developers, the question is whether you want the simplicity of Cline or the extra features these forks add. I’ve been using Cline directly and haven’t felt the need to switch, but your mileage may vary.

Aider (Terminal)

The most mature local-model coding tool, but it's not fully agentic. Aider works through diffs: it generates unified diffs and applies them. It won't roam your repo deciding which files to read, and while it can re-run a test command you configure, it doesn't plan multi-step work the way a full agent does. You tell it which files to work on.

What it does well: minimal context overhead (roughly 4x fewer tokens than Claude Code for equivalent tasks), git-native workflow (every edit is a commit), and solid multi-file editing. For straightforward tasks (bug fixes, test writing, refactors where you know the scope), it’s excellent.
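A typical scoped session looks like this. The `ollama_chat/` prefix and `OLLAMA_API_BASE` come from Aider's docs; the model tag and file paths are illustrative:

```bash
# A scoped Aider session against a local Ollama model.
export OLLAMA_API_BASE=http://127.0.0.1:11434
aider --model ollama_chat/qwen3.5:27b src/parser.py tests/test_parser.py
# Every accepted edit lands as its own git commit; /undo reverts the last one.
```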

Think of Aider as pair programming. The others are autonomous agents. Different tools for different jobs.

OpenHands (Docker/Web)

Full sandbox environment. Docker container, web UI, the agent installs packages, runs commands, iterates. When it works with a frontier cloud model, it’s impressive.

With local models, it’s unreliable. The agent framework expects deep reasoning and long context. Local models via Ollama struggle with the multi-step workflows OpenHands demands.

The Comparison

| Tool | Interface | Full Agentic? | Context Overhead | Best For |
|------|-----------|---------------|------------------|----------|
| Claude Code + Ollama | Terminal | Yes | High (14-18K) | Polished experience, short tasks |
| Cline (YOLO) | VS Code | Yes | Moderate | IDE integration, visual feedback |
| OpenCode | Terminal | Yes | Low | Open-source, model-agnostic |
| Kilo/Roo Code | VS Code | Yes | Moderate | Extra features, team workflows |
| Aider | Terminal | Partial | Very low | Scoped edits, git-native |
| OpenHands | Docker/Web | Yes (unreliable) | High | Sandbox isolation |

The surprise: there’s no shortage of agentic harnesses anymore. A year ago, Claude Code was the only real option. Now you have at least four that support fully autonomous local model coding. The bottleneck isn’t the harness. It’s the model.

What I’m Actually Using

Aider + Ollama for 60% of my work. Scoped tasks where I know which files need to change. Bug fixes, test writing, refactors. Qwen3.5-27B on my current hardware. No API cost. Works offline.

Claude Code + Anthropic API for 30%. The hard stuff. Complex multi-file changes, architectural decisions, anything that needs deep reasoning or huge context. I cancelled Max. API-only at $5/$25 per million tokens. Cheaper if I’m disciplined about when I reach for it.

Continue.dev in VS Code for the remaining 10%. Small autocomplete model (Qwen2.5-Coder 1.5B) for inline completions. Local Copilot. Fast, free, good at boilerplate.
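The setup is one block in Continue's config. A minimal sketch using the legacy `config.json` shape, which recent releases still accept; note this replaces the file wholesale, so merge by hand if you already have chat models configured:

```bash
# Point Continue's tab-autocomplete at a small local model via Ollama.
ollama pull qwen2.5-coder:1.5b
cat > ~/.continue/config.json <<'EOF'
{
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  }
}
EOF
```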

I haven’t moved to Cline YOLO yet. I probably should. The agentic capability with local models in a VS Code environment is compelling, and the context overhead is lower than Claude Code. That’s my next experiment.

Where This Is Going

Right now my setup is about 40% local, 60% cloud. Not where I want to be.

When the M5 Ultra ships, MiniMax M2.5 at 45 tok/s changes the equation. It scored 22/30 on my agentic benchmark and 80.2% on SWE-bench. On 256GB you get 128-200K tokens of context after loading the model, which is enough for real working sessions with a full codebase loaded. Running that through Cline YOLO or OpenCode could push the ratio to 70-80% local.

The end state: local models handle everything except the hardest 10-15% of tasks. For those, I keep an Anthropic API key and pay per token. No subscription. No rate limits. No one throttles my workflow because their servers are busy.

The harnesses are ready. The models are almost ready. The hardware ships in June.

I tested that last claim: I pointed Claude Code at four local models and ran them through eight real coding tasks. The results surprised me.


Previous: The Thinking Token Tax

Next: I Pointed Claude Code at Local Models | Start from the beginning: Cloud AI Coding Is Getting Worse