I Pointed Claude Code at Local Models. Here's What They Wrote.

Four local models went 24/24 on real coding tasks through Claude Code. This is what the code actually looks like, model by model, task by task.

ai · local-llm · coding · claude-code

Throughout this series, I’ve shown that local models can handle real coding tasks through Claude Code. Four models, eight tasks, 24/24 across the board.

But pass/fail doesn’t tell you much. The interesting question is what the code looks like. Did the models write clean code or spaghetti? Did they make the same choices or take wildly different approaches? Would you actually ship what they wrote?

I captured the full code output from every task. Here’s what I found.

The Tasks

Not synthetic benchmarks. Real files, real pytest tests, real pass/fail:

  1. Bug fix: find and fix a missing remainder in merge_sorted_lists
  2. TDD: read 8 test cases, write a Cache class from scratch
  3. Refactor: rename UserManager to AccountService across 3 files
  4. Feature addition: add a search_users method from test specs
  5. Security fix: convert 3 SQL injection vulnerabilities to parameterized queries
  6. Debug: find and fix an off-by-one error in a paginator
  7. Implement: create email and password validators from 13 test cases
  8. Optimize: rewrite an O(n²) algorithm to O(n)

Each task starts with failing tests. The model, through Claude Code, has to read the code, understand the problem, make the fix, and get all tests green. No hand-holding.
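To make that concrete, here's the shape of the task-1 bug. This is an illustrative sketch, not the repository's actual file: a merge loop that walks both lists but forgets the leftover tail.

```python
# Illustrative sketch of the task-1 bug shape (not the repo's exact code):
# the loop merges while both lists have elements, and the original version
# stopped there, dropping whatever remained in the longer list.
def merge_sorted_lists(a, b):
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # The fix: append the remainder of whichever list wasn't exhausted.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged
```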

The Results

| Model | Score | Avg Time | Backend |
| --- | --- | --- | --- |
| Gemma 4 31B | 24/24 | ~65s | Ollama |
| Qwen3.5-27B | 24/24 | ~42s | Ollama |
| Coder-Next 80B | 24/24 | ~45s | Ollama |
| MiniMax M2.5 | 24/24 | ~48s | vLLM |
| Devstral 2 123B | 0/24 | N/A | Both failed |

Four models. Four perfect scores. Every task, every test, green.

Coder-Next, the model that scored worst on raw tool calling (13/30), went perfect. Qwen3.5, the model crippled by thinking overhead in raw benchmarks, averaged 42 seconds per task, down from the 172 seconds the same tasks took through raw API calls.

Devstral couldn’t make edits through either Ollama or vLLM due to Mistral tool-format incompatibility with Claude Code’s expected interface. The model writes great code in isolation but can’t drive the harness. Infrastructure matters.

What the Code Actually Looks Like

Pass/fail is the headline. The code is the story.

The Cache class (TDD task) showed four distinct architectures:

MiniMax added thread safety with threading.Lock without being asked. Nobody else did. The tests didn’t require it. MiniMax just decided a production Cache should be thread-safe. That’s the 80.2% SWE-bench model showing judgment, not just competence.
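Here's a minimal sketch of that pattern, assuming a set/get API with an optional TTL; the real test suite's interface may differ, and this is not MiniMax's verbatim output.

```python
import threading
import time

class Cache:
    """Illustrative lock-guarded cache (set/get with optional TTL is an assumed API)."""

    def __init__(self):
        self._data = {}              # key -> (value, expiry or None)
        self._lock = threading.Lock()

    def set(self, key, value, ttl=None):
        expiry = time.monotonic() + ttl if ttl is not None else None
        with self._lock:             # the unprompted thread-safety choice
            self._data[key] = (value, expiry)

    def get(self, key, default=None):
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return default
            value, expiry = entry
            if expiry is not None and time.monotonic() > expiry:
                del self._data[key]  # evict expired entries on read
                return default
            return value
```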

Coder-Next wrote the most documented version: class docstring, method docstrings, inline comments. If you’re onboarding a junior developer, this is the code you’d want them reading.

Gemma 4 went compact: (value, expiry) tuples in a single dict. Fewer lines, same functionality.

The optimize task (O(n²) to O(n)) revealed algorithmic depth:

Gemma 4 and Qwen3.5 both used a two-pass approach: count occurrences first, then collect duplicates with a seen set. Truly O(n). Qwen3.5 even added comments explaining the complexity.
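A sketch of that two-pass shape (the function name and details are illustrative, not the models' verbatim output):

```python
from collections import Counter

def find_duplicates_two_pass(items):
    counts = Counter(items)          # pass 1: O(n) occurrence count
    seen = set()
    duplicates = []
    for item in items:               # pass 2: collect in first-seen order
        if counts[item] > 1 and item not in seen:
            duplicates.append(item)
            seen.add(item)
    return duplicates
```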

Coder-Next and MiniMax used single-pass approaches that are slightly less optimal: `item not in duplicates` is a linear scan of the duplicates list, so each check is O(d). But they're more concise and still pass the performance test.
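And the single-pass shape, again illustrative rather than verbatim:

```python
def find_duplicates_single_pass(items):
    seen = set()
    duplicates = []
    for item in items:
        # `item not in duplicates` scans a list: O(d) per check,
        # so the whole pass is O(n * d) rather than strictly O(n).
        if item in seen and item not in duplicates:
            duplicates.append(item)
        seen.add(item)
    return duplicates
```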

The security fix was identical across all four models: f-string SQL replaced with parameterized queries, and `LIKE '%{name}%'` correctly handled as `LIKE ?` with `(f"%{name}%",)` as the parameter. SQL injection prevention is a solved problem for local models.
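A representative before/after, with the table and column names assumed for illustration:

```python
import sqlite3

def search_users_unsafe(conn: sqlite3.Connection, name: str):
    # Vulnerable: user input interpolated straight into the SQL string.
    return conn.execute(
        f"SELECT * FROM users WHERE name LIKE '%{name}%'"
    ).fetchall()

def search_users_safe(conn: sqlite3.Connection, name: str):
    # Fixed: placeholder in the SQL, wildcard wrapping moved into the parameter.
    return conn.execute(
        "SELECT * FROM users WHERE name LIKE ?", (f"%{name}%",)
    ).fetchall()
```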

One flaw: Gemma 4 leaked internal reasoning into the validators task. Comments like # Wait, the tests don't explicitly check... appeared in the production code. The model’s chain-of-thought bled into its output. It still passed, but you’d want to catch that in review.

Why the Harness Matters

The raw benchmark sends one request and judges the response. The model gets one shot. If it calls the wrong tool or writes a subtly wrong fix, it fails.

Claude Code doesn’t work that way. It reads the test output. If tests fail, it reads the error. It tries a different approach. It iterates. The model doesn’t need to be perfect on the first try. It needs to be good enough to converge on the right answer with help from the harness.

That’s why a model that scored 13/30 on raw tool calling goes 24/24 through Claude Code. The harness provides the feedback loop that the raw benchmark doesn’t.

This has a practical implication: stop evaluating local models by how they perform in isolation. Evaluate them by how they perform in the tool you’ll actually use.

What Didn’t Work

Devstral 2 123B failed on both Ollama (tool format incompatibility) and vLLM (Anthropic API translator crashes on Mistral tool calls). The model writes the best code in raw benchmarks (4.7/5 quality score) but can’t drive Claude Code. You’d need to use it through Aider or another harness that supports Mistral’s format.

MiniMax isn’t available on Ollama’s local registry (only a :cloud API proxy tag). It works through vLLM with the QuantTrio/MiniMax-M2.5-AWQ weights. One extra step, but it works.

Not every model works through every harness. That’s a real constraint readers should know about.

The Revised Answer

The series started with “are local models ready for agentic coding?” The raw benchmarks said “almost.” The harness tests say “yes, today, for the tasks I tested.”

Four local models, running on rented GPUs, driven by Claude Code, autonomously fixed bugs, implemented features from tests, patched security vulnerabilities, optimized algorithms, and refactored across files. All tests passing. No human intervention.

The qualifier: these are well-defined tasks with clear test suites. Ask a local model to architect a new system or debug a race condition with no reproduction steps, and you’ll still want Opus. The gap between “fix this bug” and “design this system” is where cloud models still win.

But for the 70-80% of coding work that is fixing, implementing, and refactoring against a test suite? You can do that locally right now. On a $2,899 Mac Studio with Gemma 4. Or on a $5,999 Mac with MiniMax for better code quality. No subscription. No rate limits. No one throttling your workflow.

That’s closer to “there” than I expected when I started this series.


Previous: Building a Local Coding Stack

Start from the beginning: Cloud AI Coding Is Getting Worse. Can Local Models Save Us?