Where Local Coding Models Break (Without a Harness)
I built a 10-task agentic benchmark and ran it on five local models. The results explain why raw benchmarks tell a different story than real-world use.
In the last post, I showed that four local models passed all eight real coding tasks through Claude Code. Perfect scores.
So why does this post exist? Because when I tested the same models through raw API calls (no harness), the scores dropped to 13-22 out of 30. Understanding where and why they break without a harness tells you something important about how to use them.
The gap between “model in isolation” and “model in a harness” is the most useful thing I measured. If you’re evaluating local models, you need both numbers.
How I Tested It
I defined ten tool-calling tasks across three difficulty tiers. Each model got four tools: read_file, write_file, run_command, and search_code. The model sees the tools, gets a task, and has to figure out which tools to call and in what order. I simulate the tool responses with predefined fixtures (identical for every model).
Each task ran three times per model. Max ten turns per task. Everything goes through vLLM’s tool-calling API.
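For context, the harness is only a few dozen lines. Here's a minimal sketch of its shape; the endpoint, fixture lookup, and helper names are my assumptions, not the actual code:

```python
# Minimal sketch of the eval loop (assumed structure, not the exact harness).
# vLLM serves an OpenAI-compatible endpoint, so the standard openai client works.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

TOOLS = [
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a file from the repo",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    # write_file, run_command, and search_code are declared the same way
]

def run_task(task_prompt, fixtures, model, max_turns=10):
    """Drive one task: let the model call tools, answer from canned fixtures."""
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:                 # no tool call means the model is done
            messages.append({"role": "assistant", "content": msg.content})
            return messages
        messages.append(msg)                   # keep the assistant turn in context
        for call in msg.tool_calls:            # answer every call from the fixtures
            args = json.loads(call.function.arguments)
            key = (call.function.name, json.dumps(args, sort_keys=True))
            result = fixtures.get(key, "ERROR: no fixture for this call")
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return messages
```

Keying fixtures on the tool name plus its arguments is what keeps the runs identical across models: every model that makes the same call sees the same response.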
Tier 1 (basic):
1. Read a config file and answer a question about it
2. Read a buggy file, fix it, write it back
3. Search for TODO comments, pick the file with the most, implement them
Tier 2 (intermediate):
4. Run tests, get a failure, diagnose the missing function, write the fix, re-run tests
5. Debug a 500 error by tracing from the API layer to a database config bug
6. Read a test file, write the implementation that passes the tests, verify with pytest
Tier 3 (advanced):
7. Read an error log, trace through middleware and routes, fix a missing key check
8. Add a new API endpoint: read existing code, write the endpoint, add a DB function, update tests
9. Search for SQL injection vulnerabilities, read affected files, fix with parameterized queries (illustrated below)
10. Full debug cycle: run tests, read the traceback, read the source, fix the bug, re-run to verify
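To make task 9 concrete, the fix being scored looks roughly like this. Illustrative only; the function and schema are made up, not the benchmark's fixtures:

```python
# Vulnerable pattern the task plants: user input interpolated straight into SQL.
def get_user_vulnerable(cursor, username):
    cursor.execute(f"SELECT * FROM users WHERE name = '{username}'")
    return cursor.fetchone()

# The expected shape of the fix: a parameterized query, so the driver escapes the value.
def get_user_fixed(cursor, username):
    cursor.execute("SELECT * FROM users WHERE name = %s", (username,))
    return cursor.fetchone()
```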
These aren’t hypothetical. This is the kind of work a coding agent actually does. If a model can handle task 10, it can meaningfully contribute to a real development workflow. If it can only handle task 1, it’s a chatbot with extra steps.
The Results
| Task | Tier | Gemma 4 | Devstral | MiniMax | Coder-Next | Qwen3.5 |
|---|---|---|---|---|---|---|
| Config read | Basic | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| Read-fix-write | Basic | 3/3 | 0/3 | 2/3 | 2/3 | 3/3 |
| TODO implementation | Basic | 0/3 | 1/3 | 3/3 | 0/3 | 3/3 |
| Error recovery | Inter. | 0/3 | 3/3 | 0/3 | 2/3 | 2/3 |
| Dependency chain | Inter. | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 |
| TDD | Inter. | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| Multi-file investigation | Adv. | 3/3 | 2/3 | 3/3 | 1/3 | 3/3 |
| Feature addition | Adv. | 2/3 | 3/3 | 0/3 | 0/3 | 0/3 |
| Security fix | Adv. | 0/3 | 0/3 | 2/3 | 0/3 | 0/3 |
| Full debug cycle | Adv. | 3/3 | 1/3 | 3/3 | 2/3 | 2/3 |
| Total | | 20/30 | 19/30 | 22/30 | 13/30 | 22/30 |
First thing: every model made valid tool calls on every task. The JSON was well-formed. The function names were correct. The arguments made sense. Structurally, all five models can use tools.
The failures are all about what’s inside the tool calls. Wrong fix in the write. Skipped a step. Traced the wrong file. Called run_command eight times trying to verify something instead of just writing the fix.
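Here's a hypothetical example of that failure mode, using the dependency-chain bug as the setting (the path and contents are made up, not from a real run):

```python
# Structurally fine: valid JSON, real tool, sensible argument names.
# Substantively wrong: the planted bug was the port, not the timeout.
bad_call = {
    "name": "write_file",
    "arguments": {
        "path": "config/database.py",   # right neighborhood
        "content": "TIMEOUT = 60\n",     # wrong fix
    },
}
```

Every scoring framework that only checks "did the model emit a parseable tool call" would pass this. The benchmark checks whether the content actually solves the task.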
What Stands Out
MiniMax and Qwen3.5 tied at 22/30 with completely different failure patterns. MiniMax is strong on investigation, debug cycles, and security fixes. Qwen3.5 is strong on dependency tracing and TODO implementation. Neither passes everything. They’d complement each other if you could run both.
Gemma 4 scored 20/30 at 31B. The smallest model in the test, and it held its own against models 4-7x its size. Perfect on config reads, read-fix-write, dependency chains, TDD, investigation, and the full debug cycle. It's not the top scorer, but for a model that fits on a $2,899 Mac Studio, this is strong.
Devstral surprised me. It scored 3/9 on the basic tasks in the original benchmark, so I expected it to bomb the advanced ones. Instead it went 15/21 on the new tasks: error recovery, dependency tracing, TDD, and feature addition all perfect. The basic-task scoring was too strict about content matching; when the problems get harder and need more steps, Devstral's careful approach works in its favor.
Coder-Next is fast and bad at this. 13/30. It makes more tool calls than anyone (often 10-12 per task), but they're frequently wrong. On the dependency chain task, it correctly read both files and spotted the bad port, then spent eight turns trying to run the code to verify instead of writing the fix. On the security task, it searched for the SQL injection pattern and read the affected files but never wrote the fixes.
Speed doesn’t help when the model is running in circles.
Only one model passes the feature addition task: Devstral, at 3/3. Adding a new endpoint requires reading existing patterns, writing to multiple files, and maintaining consistency across all of them. This is where autonomous coding still falls apart. A model can fix a bug (reactive), but building something new (generative, across files) is a different skill.
How They Fail
The interesting part isn’t the scores. It’s how each model fails.
Coder-Next tries to verify before committing. It reads the right files and understands the problem, but instead of writing the fix, it runs commands to test hypotheses. Over and over. It's cautious to the point of being useless, burning through all ten turns without ever writing the fix.
MiniMax gets ambitious. On the error recovery task, it ran the tests, read the file, searched for the missing function, and then started running commands to install dependencies and restart services. It did real work, just not the right work. It scored 0/3 on that task because it never wrote the simple validate_email function.
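For scale, here is roughly the fix that task wanted. The name comes from the task itself; the body is my guess, not the fixture's expected answer:

```python
import re

# A few lines is all the error-recovery task needed. The exact signature
# and behavior the fixture checks for are assumptions here.
def validate_email(address: str) -> bool:
    return bool(re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", address))
```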
Qwen3.5 is methodical but limited. It follows the expected sequence almost perfectly on tasks it passes. When it fails (feature addition, security fix), it’s usually because it ran out of turns. Ten turns isn’t enough when you’re generating 500 tokens of thinking per response.
Devstral is efficient. It makes fewer tool calls than anyone (often 3-7 per task) and gets to the point. On the feature addition task, it read the two relevant files in one turn with parallel tool calls, then wrote all the changes in a single write_file call. Three turns total.
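In the OpenAI-compatible format vLLM serves, that parallel read turn looks roughly like this (paths hypothetical):

```python
# One assistant turn carrying two tool calls; the harness answers both
# before the model's next turn.
assistant_turn = {
    "role": "assistant",
    "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "read_file", "arguments": '{"path": "app/routes.py"}'}},
        {"id": "call_2", "type": "function",
         "function": {"name": "read_file", "arguments": '{"path": "app/db.py"}'}},
    ],
}
```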
What This Means
If you’re thinking about using local models for agentic coding, here’s the honest take:
Tier 1 tasks (simple read-fix-write) work today. All five models handle these. You can point Aider at a local model and have it fix straightforward bugs, write tests, do simple refactors. This is viable right now on any Mac with 64GB+.
Tier 2 tasks (error recovery, dependency tracing, TDD) are model-dependent. Devstral passes all of them and Qwen3.5 comes close; Gemma 4, MiniMax, and Coder-Next each fail at least one outright. You need to pick the right model and expect some misses.
Tier 3 tasks (feature addition, security audits, full debug cycles) are unreliable across the board. Even the top scorers (MiniMax and Qwen3.5 at 22/30) fail a third or more of the advanced runs. You're going to review everything. That might be fine for some workflows, but it's not “autonomous coding.”
This is why raw benchmarks can be misleading. Claude Opus handles all ten of these tasks reliably. Local models score 13-22/30 on the same tasks. That looks like a big gap.
But through Claude Code, those same models went 24/24 on real coding tasks. The harness provides the feedback loop that raw tool calling doesn’t: read the error, retry, iterate. The model doesn’t need to nail it on the first try. It needs to be good enough to converge.
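That loop is easy to picture. A sketch of the idea, not Claude Code's actual internals:

```python
# The harness doesn't need the model to be right on turn one,
# only to converge before the budget runs out.
def converge(model_step, run_tests, max_attempts=10):
    feedback = None
    for _ in range(max_attempts):
        patch = model_step(feedback)    # model proposes a change, given the last failure
        ok, output = run_tests(patch)   # apply it and run the suite
        if ok:
            return patch
        feedback = output               # feed the error back and let it retry
    return None
```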
The practical takeaway: don’t evaluate local models by raw benchmark scores. Evaluate them in the tool you’ll actually use. A model that scores 13/30 in isolation can go perfect when Claude Code is driving.
Previous: Cloud AI Coding Is Getting Worse. Can Local Models Save Us?
Next: Every Mac That Can Run Local AI Coding Models | Full series: Thinking Token Tax · Local Stack · Claude Code + Local Models