The Thinking Token Tax

Some local models burn 85% of their output on chain-of-thought you never see. If you're comparing by tok/s, you might be measuring the wrong thing.

Tags: ai · local-llm · coding

Qwen3.5-27B generates at 25 tokens per second. Sounds decent.

Except 85% of those tokens are internal chain-of-thought that never appears in the output. The model argues with itself, outlines numbered reasoning steps, second-guesses its approach, and then finally writes the code. The actual code arrives at about 3 tokens per second.

I found this while benchmarking local coding models. Same task, five models:

| Model | Tokens Generated | Useful Code | Time | Effective Speed |
|---|---|---|---|---|
| Coder-Next 80B | 192 | 192 (100%) | 1.6s | 120 tok/s |
| Gemma 4 31B | 314 | 314 (100%) | 15.2s | 21 tok/s |
| Devstral 123B | 157 | 157 (100%) | 9.4s | 17 tok/s |
| MiniMax M2.5 | 319 | 220 (~69%) | 3.3s | 67 tok/s |
| Qwen3.5-27B | 3,986 | 600 (~15%) | 172s | 3.5 tok/s |

The task was implementing a RateLimiter class. All five models wrote correct, working code. Coder-Next did it in 192 tokens. Qwen3.5 used 3,986. Same result, 20x the tokens.
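The "Effective Speed" column is nothing fancy: useful tokens divided by wall-clock time. Checking the two extremes from the table:

```python
# Effective speed = useful output tokens / wall-clock time,
# using the two extremes from the benchmark table above.
runs = {
    "Coder-Next 80B": (192, 1.6),    # (useful tokens, seconds)
    "Qwen3.5-27B": (600, 172.0),
}

for model, (useful, seconds) in runs.items():
    print(f"{model}: {useful / seconds:.1f} useful tok/s")

# Coder-Next 80B: 120.0 useful tok/s
# Qwen3.5-27B: 3.5 useful tok/s
```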

What’s Happening

Qwen3.5-27B has a thinking problem. Despite explicit system prompts telling it not to reason aloud, despite Qwen’s /no_think flag, the model dumps hundreds of tokens of chain-of-thought before writing any code. Numbered analysis steps. Self-corrections. Internal debates about approach.
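You can measure the tax directly. Qwen-family models wrap their reasoning in `<think>...</think>` tags in the raw completion, so a one-regex sketch gets you the thinking share (the tag format is an assumption that holds for Qwen; other model families differ):

```python
import re

def thinking_share(completion: str) -> float:
    """Approximate fraction of a completion spent inside <think>...</think> blocks.

    Counts characters, not tokens -- close enough to spot an 85% tax.
    """
    thinking = "".join(
        re.findall(r"<think>.*?</think>", completion, flags=re.DOTALL)
    )
    return len(thinking) / max(len(completion), 1)

# A Qwen3.5 response that is 85% chain-of-thought returns ~0.85 here.
```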

On two of my benchmark tasks, the thinking consumed the entire 4,096-token budget. The model hit the ceiling and got cut off before finishing the actual code. It literally thought so hard it forgot to answer.
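If you're scripting benchmarks, this failure is detectable: OpenAI-compatible servers (llama.cpp, vLLM, LM Studio all expose one) set `finish_reason` to `"length"` when generation is cut off at `max_tokens`. A minimal sketch, assuming a local server on port 8080 and a made-up model name:

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible endpoint; adjust base_url to your server.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

resp = client.chat.completions.create(
    model="qwen3.5-27b",  # hypothetical local model name
    messages=[{"role": "user", "content": "Implement a RateLimiter class."}],
    max_tokens=4096,
)

if resp.choices[0].finish_reason == "length":
    print("Hit the token ceiling -- the answer was truncated mid-thought.")
```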

MiniMax has a milder version of the same issue. About 30% of its tokens are thinking. That’s manageable. The model still finishes every task within budget and the quality is high. But if you’re comparing MiniMax’s “95 tok/s” to Coder-Next’s “114 tok/s,” the real comparison is 65 vs 114 useful tokens per second. MiniMax is slower than it looks.
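The correction is just raw speed times useful fraction, using the table's MiniMax numbers:

```python
raw_tps = 95                 # MiniMax's reported raw speed
useful_fraction = 220 / 319  # useful tokens / total tokens, from the table

print(f"{raw_tps * useful_fraction:.1f} useful tok/s")  # -> 65.5
```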

Devstral, Gemma, and Coder-Next produce zero thinking overhead. Every token is output.

The Other Failure: Repetition Loops

Coder-Next has its own problem. On the CLI tool task, it started building a hardcoded set of text file extensions, then got stuck:

```
'.properties', '.properties', '.properties', '.properties',
'.properties', '.properties', '.properties', '.properties',
```

Hundreds of times, until it hit the token limit. This happened on 2 of 3 runs.

This is a known MoE failure mode. With only 3 billion active parameters per token, the model sometimes locks into a repetition pattern it can’t escape. It’s rare (only showed up on one task), but when it happens, the entire response is wasted.
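If you're streaming tokens yourself, a cheap guard catches this: watch the tail of the stream for a short cycle repeating and abort instead of burning the rest of the budget. A sketch with arbitrary thresholds (tune for your stack):

```python
def is_looping(tokens: list[str], max_cycle: int = 8, repeats: int = 6) -> bool:
    """True when the tail of `tokens` is one short cycle repeated `repeats` times."""
    for cycle in range(1, max_cycle + 1):
        window = tokens[-cycle * repeats:]
        if len(window) < cycle * repeats:
            break  # not enough history for this (or any longer) cycle
        # Compare every position against the first occurrence of the cycle.
        if all(window[i] == window[i % cycle] for i in range(len(window))):
            return True
    return False
```

Run it over the streamed tokens every few chunks and cancel the request when it fires, rather than watching `'.properties'` repeat until the limit.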

What This Means

If you’re evaluating local models by tok/s, you need two numbers:

Raw tok/s: How fast the hardware generates tokens. This is what benchmarks report. It maps to memory bandwidth and tells you about the hardware.

Useful tok/s: How fast actual code appears. This is what you experience. It factors in thinking waste and is the number that matters for “does this feel responsive.”
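Both numbers fall out of a single streamed run: time it, count everything, strip the reasoning, count again. A rough sketch, assuming `<think>`-style tags and a crude four-characters-per-token estimate (swap in a real tokenizer for serious measurements):

```python
import re
import time

THINK = re.compile(r"<think>.*?</think>", re.DOTALL)

def measure(stream) -> tuple[float, float]:
    """Return (raw tok/s, useful tok/s) for an iterable of text chunks."""
    start = time.monotonic()
    text = "".join(stream)  # consuming the stream does the waiting
    elapsed = time.monotonic() - start

    visible = THINK.sub("", text)
    # ~4 chars per token is a crude heuristic, not a tokenizer.
    return (len(text) / 4) / elapsed, (len(visible) / 4) / elapsed
```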

For Qwen3.5, the gap between these numbers is 7x. Every model comparison that reports it at “25 tok/s” is lying to you. Not on purpose, but the metric doesn’t capture what’s actually happening.

The broader point: speed benchmarks without output analysis are incomplete. A model that generates 100 tok/s of garbage is slower than a model that generates 20 tok/s of working code.

The Harness Fixes This

Here’s the twist. When I ran Qwen3.5 through Claude Code on real coding tasks, the thinking tax largely disappeared. Tasks that took 172 seconds in raw benchmarks completed in 42 seconds through Claude Code, and the model went 24/24 on the task suite.

Claude Code manages context differently than raw API calls. It structures the conversation, handles tool results, and doesn’t give the model room to spiral into chain-of-thought monologues. The thinking waste that crippled Qwen3.5 in raw benchmarks simply didn’t show up through the harness.

This doesn’t mean the thinking tax isn’t real. If you’re using Aider or a lighter harness that gives the model more freedom, you’ll see it. And for MiniMax’s 30% overhead, the harness helps but doesn’t eliminate it entirely. But for the worst case (Qwen3.5’s 85% waste), the right tool makes the problem go away.

The lesson: the thinking token tax is a property of how you use the model, not just the model itself.

