The Benchmark Trap: When Better Scores Don't Make a Better Model
Qwen 3.5 wins almost every benchmark. Gemma 4 ranks #6 on Arena ELO — where humans judge blind. I kept reaching for the one that lost the tests. Here's why that makes sense, and why on 128 GB you don't have to choose.
I ran these tests in April 2026. Since then, Qwen 3.6 has landed and the version race has moved on. The architectural observations and the core paradox I'm writing about still hold — but treat the specific numbers as a snapshot, not a verdict.
Every time I benchmark two models, I expect the winner to feel like the winner. With gemma-4-26b-a4b and qwen3.5-35b-a3b, that didn't happen.
Qwen 3.5 won. Clearly, on paper. Eight benchmarks out of nine, sometimes by a lot. And then I started actually using both of them — for writing, for German, for just thinking through a problem out loud — and kept reaching for Gemma.
That dissonance is worth pulling apart. Because it's not a fluke, and it's not just vibes.
The numbers: Qwen wins, and it's not close
Let me be straight about the data before I push back on it.
On almost every standard academic benchmark, qwen3.5-35b-a3b has a real lead:
| Benchmark | Gemma 4 26B | Qwen 3.5 35B |
|---|---|---|
| MMLU-Pro | 82.6 | 85.3 |
| GPQA Diamond | 82.3 | 84.2 |
| HLE (with tools) | 17.2 | 47.4 |
| TAU2-Bench (agentic) | 68.2 | 81.2 |
| Codeforces ELO | 1718 | 2028 |
That HLE gap is the one that sticks with me. Humanity's Last Exam, a benchmark deliberately designed to be hard for frontier models, shows Qwen 3.5 at nearly triple Gemma's score (47.4 vs. 17.2) when search tools are in play. The TAU2 agentic gap is 13 points. These aren't rounding errors.
In my own benchmark session on the M4 Max — seven standardized tasks, identical conditions — Qwen averaged 87.5% overall. Gemma hit 85.2%.
On paper, Qwen is the better model. So why do I keep opening the other one?
But humans don't prefer it
Arena ELO is a different kind of benchmark. It's not an exam. It's a blind preference test: real people compare two model outputs side by side, without knowing which is which, and pick the one they'd rather read. Thousands of comparisons, aggregated over time.
Gemma 4 sits at roughly ELO 1441 — ranked #6 among all open models globally. Qwen 3.5 is around 1400.
Let that land: the model that loses on almost every objective capability metric wins when humans judge blind. Not marginally. By 41 ELO points, which under the standard ELO model means people pick Gemma's answer roughly 56% of the time in a blind head-to-head. Small per comparison, unmistakable at scale.
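That 56% falls straight out of the standard ELO expected-score formula; here's the two-line version if you want to check the arithmetic yourself:

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Expected score of A against B under the standard ELO model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

# Gemma 4 (~1441) vs. Qwen 3.5 (~1400): a 41-point gap
print(f"{elo_win_probability(1441, 1400):.1%}")  # 55.9%
```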
This is what I've started calling the Arena/HLE paradox — and I think it matters more than most benchmark discussions acknowledge.
Here's my read on why it happens: HLE and MMLU test whether a model gets the right answer. Arena tests whether a model produces an answer you'd want to read. Those are genuinely different things. A model can be technically correct and still be slightly off in register — over-structured, a touch verbose, weirdly formal for a casual question. Gemma seems to have found a register that feels natural. Less formula, more actual voice.
Several community reviewers described it as "the model with the most human touch" among current open-weight options. I wouldn't have believed that from a press release. I believe it from spending a few weeks with both.
The architecture explains some of it
Both models are Mixture-of-Experts — only a fraction of their parameters activate per token. But they make very different design choices, and those choices show up in how they feel.
qwen3.5-35b-a3b is architecturally experimental. It uses Gated DeltaNet — a linear attention mechanism — in 30 of its 40 layers, only falling back to full quadratic attention every fourth layer. It also runs 256 tiny experts per MoE layer with an intermediate dimension of just 512, hitting 97% expert sparsity. This is a model built around throughput and efficiency. The architecture is optimized for tasks with clear right answers — where the path to the output is structured and the goal is well-defined.
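That 97% figure checks out with simple arithmetic. A minimal sketch, assuming a hypothetical top-8 routing over the 256 experts (I haven't verified Qwen's actual top-k; the active-expert count here is my assumption):

```python
total_experts = 256      # experts per MoE layer, from the spec above
active_experts = 8       # ASSUMPTION: hypothetical top-k, not confirmed
sparsity = 1 - active_experts / total_experts
print(f"{sparsity:.1%}")  # 96.9%, matching the ~97% sparsity figure
```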
gemma-4-26b-a4b is more conventional. Alternating sliding-window and global attention, 128 routed experts, a standard Transformer backbone. It's not trying to be novel. It's trying to be reliably good across a wide range of situations, including ones that are harder to define.
The MMMLU multilingual benchmark points in the same direction: Gemma leads on European language performance (86.3 vs. 85.2), and community testing on German, French, and Arabic consistently puts it in a different tier. Qwen's training skews heavily toward Chinese and English. For me — writing a lot in German — that's not a footnote, that's a deciding factor.
Running both on M4 Max: what you actually need to know
On M4 Max with 128 GB unified memory, Qwen 3.5 is significantly faster — around 130 tokens/s via native MLX, versus an estimated 75–85 for Gemma 4. That's a real difference for long-form generation.
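To make that concrete: for a long answer, the throughput gap is the difference between waiting and not. Rough arithmetic with the numbers above (the 2,000-token answer length is illustrative, not measured):

```python
answer_tokens = 2000  # a long-form draft, roughly 1,500 words

for name, tok_per_s in [
    ("Qwen 3.5 via MLX", 130),
    ("Gemma 4 via MLX (estimated)", 80),
    ("Gemma 4 via GGUF/llama.cpp", 25),
]:
    print(f"{name}: {answer_tokens / tok_per_s:.0f} s")
# Qwen 3.5 via MLX: 15 s
# Gemma 4 via MLX (estimated): 25 s
# Gemma 4 via GGUF/llama.cpp: 80 s
```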
The catch: LM Studio's MLX backend doesn't support Gemma 4 yet (Bug #1741, as of April 2026). If you're running Gemma through LM Studio, you're stuck on the GGUF/llama.cpp path at roughly 20–30 tok/s. For conversational use, fine. For bulk tasks, limiting.
Memory at Q8_0:
- gemma-4-26b-a4b: 26.9 GB
- qwen3.5-35b-a3b: 36.9 GB
- Combined: ~64 GB, less than half your available memory.
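Those footprints match the back-of-envelope math. GGUF's Q8_0 format stores blocks of 32 int8 weights plus one fp16 scale, about 8.5 bits per weight; plugging in the nominal parameter counts (and ignoring the small tensors that stay unquantized) lands close to the measured sizes:

```python
def q8_0_gigabytes(params_billions: float) -> float:
    # GGUF Q8_0: blocks of 32 int8 weights + one fp16 scale per block
    bits_per_weight = (32 * 8 + 16) / 32   # 8.5 bits
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"{q8_0_gigabytes(26):.1f} GB")  # ~27.6 GB, close to Gemma's 26.9
print(f"{q8_0_gigabytes(35):.1f} GB")  # ~37.2 GB, close to Qwen's 36.9
```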
Quantization behavior also differs. Gemma 4 degrades gracefully down to Q3_K_M with surprisingly intact output. Qwen 3.5's 256-tiny-expert architecture is more sensitive to precision loss; below Q5 you start seeing compilation failures and degraded reasoning. At Q8_0 on a 128 GB system, this distinction mostly disappears — which is why that's the right quantization to run for both.
The actual answer: don't choose
Here's the take I've landed on.
The "which model wins" framing is wrong for this hardware class. On 128 GB, you don't have to choose. Load both, switch in LM Studio in seconds, use the right tool for the actual task.
- qwen3.5-35b-a3b at Q8_0 as your primary workhorse: code, agentic tasks, tool calling, anything where speed and raw capability matter.
- gemma-4-26b-a4b at Q8_0 for writing, conversation, German, anything where the output needs to feel right, not just be right.
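If you drive both from scripts, the split is easy to encode. A minimal sketch against LM Studio's OpenAI-compatible local server (it defaults to http://localhost:1234/v1); the ask() helper and the task-to-model mapping are mine, and the model identifiers assume both models are loaded under exactly these names:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# My routing rule: structured tasks to Qwen, human-facing text to Gemma.
MODEL_FOR_TASK = {
    "code": "qwen3.5-35b-a3b",
    "agentic": "qwen3.5-35b-a3b",
    "writing": "gemma-4-26b-a4b",
    "german": "gemma-4-26b-a4b",
}

def ask(task: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL_FOR_TASK[task],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("german", "Formuliere diese E-Mail höflicher: ..."))
```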
Together they use ~64 GB. You have 128. The math works out almost suspiciously well.
This isn't a compromise or a hedge. The models have genuinely complementary profiles — one optimized for structured problem-solving, one for producing output that humans actually prefer to read. Running both isn't indecision. It's just using the right tool.
The Arena/HLE paradox isn't really a paradox once you accept that capability and quality aren't synonyms. Benchmarks measure whether a model can do hard things. They're much worse at measuring whether the output of an easier thing feels like it came from someone who understood what you actually asked.
Qwen 3.5 is the more capable model by almost every objective measure. Gemma 4 is the one I reach for when it matters. I'm comfortable holding both of those as true — and a little curious whether Qwen 3.6 changes the equation.