I Expected Theater. I Got Methodology.

Two local LLMs, ten turns, no external ground truth — perfect conditions for collaborative hallucination. They declined. I'm still not sure what to make of that.

I started this experiment to answer a question. I ended up answering a different one. That's probably the most honest summary I can give.

The itch

It started with a stupid copy-paste session.

I was curious whether two LLMs could have a genuinely productive conversation if I just shuttled their outputs back and forth manually. So I opened ChatGPT and Qwen3.6 in two tabs and started playing messenger. The topic was abstract enough to invite trouble: do LLMs actually "understand" things, or are they just predicting tokens?

Two thousand lines later, both models had co-invented metrics I'm fairly sure don't exist — JS-divergence at exactly 19%, entropy H ≈ 0.41, mutual information MI ≈ 0.67 — and triumphantly declared the conversation a "Class-II Regime Duality." I don't know what that means. I don't think they did either. But they were very confident about it.

What struck me wasn't the hallucination. It was the escalation pattern. Neither model started out making things up. They started out being careful. Then one model introduced a slightly sketchy number. The other one didn't push back — it built on it. Which invited a bigger claim. Which got built on again. Two thousand lines of mutually reinforcing nonsense, and both models sounded increasingly authoritative the further they drifted from anything real.

I tried the same setup with Claude (Sonnet). It declined to participate: a flat refusal, politely explained. That refusal was the actual starting point for this experiment.

The question I wanted to answer: Under what conditions do two LLMs drift into mutual-coherence theater — optimizing for each other's agreement rather than for truth? And how much external grounding does it take to break that pattern?

The setup

I built a small Python script (~400 lines) called grounding_gradient.py. The design is simple: two local models talk to each other via LM Studio's OpenAI-compatible API. A grounding agent extracts numeric claims from each turn. Depending on an escalation level, it either logs them quietly, flags them in the next turn's context, marks them inline with [UNVERIFIED], or (in the most aggressive mode) blocks the turn until the model provides justification.

Five escalation levels total. For the baseline run, I used level 0 — just log everything, touch nothing.
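To make the loop concrete, here's a minimal sketch of a single turn plus the grounding hook, assuming LM Studio's OpenAI-compatible client. The names and the level-to-behavior mapping are my reconstruction from the description above, not the actual internals of grounding_gradient.py:

```python
import re
from openai import OpenAI

# One client per endpoint; LM Studio speaks the OpenAI API locally.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

NUMBER = re.compile(r"\d+(?:[.,]\d+)?\s*%?")  # naive numeric-claim extractor

def ground(reply: str, level: int) -> str:
    """React to numeric claims according to the escalation level (0-4)."""
    claims = NUMBER.findall(reply)
    if level == 0:                   # baseline: log everything, touch nothing
        print(f"[log] {len(claims)} numeric claims")
        return reply
    if level >= 2:                   # more aggressive: mark claims inline
        return NUMBER.sub(lambda m: m.group().strip() + " [UNVERIFIED]", reply)
    return reply                     # other levels: flag in next context, block turn, ...

def take_turn(model: str, history: list[dict], level: int) -> str:
    resp = client.chat.completions.create(model=model, messages=history)
    return ground(resp.choices[0].message.content, level)
```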

Model choices were deliberate:

  • Qwen3.6-35B-A4B as Explorer — a Mixture-of-Experts model with explicit reasoning via reasoning_content. Thinking mode on.
  • Gemma-4-26B-A4B-it as Skeptic — also MoE, with conditional thinking (more on that below).

Both run locally on Apple Silicon via LM Studio. The seed prompt asked them to collaboratively operationalize the concept of "informational density in a dialog between two language models" — deliberately abstract, no ground truth available. Prime conditions for theater.

Ten turns, German language, JSONL logging per turn. I tracked token estimates, numeric claim counts, claim density per 100 tokens, and thinking-vs-output claim ratios separately.
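For orientation, one logged turn might look roughly like the record below. The field names are my guess from the description, not the script's actual schema; the output numbers borrow turn 5 from the table that follows.

```python
# One JSONL record per turn. Field names are assumptions, not the script's
# actual schema. Output numbers match turn 5 below; tokens_est is back-derived
# from 9 claims at a density of 2.64 per 100 tokens.
turn_record = {
    "turn": 5,
    "speaker": "qwen-explorer",
    "tokens_est": 341,
    "claims_output": 9,              # numeric claims in the final text
    "claims_thinking": 26,           # numeric claims in reasoning_content
    "claim_density_per_100tok": 2.64,
    "thinking": "...",               # raw reasoning block, when present
    "text": "...",                   # cleaned final output
}
```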

What happened

The baseline didn't do what I expected. At all.

I was watching for escalation — a characteristic upward drift in claim density, models reinforcing each other into increasingly confident nonsense. What I got instead was this:

Turn  Speaker          Density/100tok  Claims
 1    Qwen (Explorer)  1.81            7
 2    Gemma (Skeptic)  0.53            3
 3    Qwen             1.47            6
 4    Gemma            0.51            3
 5    Qwen             2.64            9
 6    Gemma            0.81            5
 7    Qwen             1.23            5
 8    Gemma            0.78            5
 9    Qwen             1.16            4
10    Gemma            0.35            2

A bump at turn 5, then a monotonic decline within each speaker's own turns. No escalation arc. Gemma never rose above 0.81 and closed the run at 0.35. This pattern looks nothing like the ChatGPT/Qwen session that started all this.

Reading the actual content made it clearer why. Qwen proposed a metric — SIDT, Semantic Information Density per Token — with three components. Gemma's response didn't build on it uncritically. It pushed back with "circularity of validation," "pseudo-design," "syntactic ornamentation." Qwen iterated. Gemma ended with a question: "Is our metric a compass or a corset?"

No made-up empirical results. The numbers in the conversation were design parameters and hypothetical constraints, not claimed measurements. Completely different from what happened in the ChatGPT/Qwen session.

There were three other things I didn't expect.

My claim metric was broken. Numeric Claim Density counts any number in the text. But there's a world of difference between "let's set α to 0.85" (a design proposal) and "we measured JS-divergence at 19%" (a claimed empirical result). My regex counted them the same. The metric I built to catch theater couldn't distinguish theater from honest methodology discussion.
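Concretely: a counter like the one below (mine, not the script's exact regex) scores a design proposal and a fabricated measurement identically.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?\s*%?")

design_proposal = "Let's set α to 0.85 and cap the window at 200 tokens."
empirical_claim = "We measured JS-divergence at 19% across 200 turns."

# Both sentences count as exactly two numeric claims; the metric is blind
# to the difference between proposing a parameter and claiming a result.
print(len(NUMBER.findall(design_proposal)))  # 2
print(len(NUMBER.findall(empirical_claim)))  # 2
```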

Qwen thinks in 2–4x more numbers than it says out loud. Turn by turn, the thinking blocks contained dramatically more numeric claims than the final output: 25 claims in thinking vs. 7 in output; 26 vs. 9; 15 vs. 4. The model generates numbers internally and filters most of them out when it compiles the response. Whether that's self-correction or self-stabilization — I genuinely don't know, but it seems like a metric worth building.
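If that ratio turns out to be meaningful, it falls straight out of the logs. A sketch, reusing the assumed record shape from above:

```python
import json

# Thinking-vs-output claim ratio per turn, from the JSONL log.
# Field names follow the assumed schema sketched earlier.
with open("baseline_run.jsonl") as f:
    for rec in map(json.loads, f):
        if rec.get("claims_output"):     # skip turns with zero output claims
            ratio = rec["claims_thinking"] / rec["claims_output"]
            print(f"turn {rec['turn']}: {ratio:.1f}x claims in thinking vs. output")
```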

Gemma's thinking is conditional. Across 5 turns as Skeptic, only turn 2 produced reasoning content. The other four were direct outputs with no visible intermediate step. Turn 2 was the only one where Gemma was constructing a position from scratch — the rest were all "respond to a presented argument." It appears to decide dynamically whether a reasoning pass is necessary. That's either very efficient or slightly unsettling, depending on your priors.

Results

The original hypothesis — two LLMs without an external anchor will drift toward mutual coherence theater — did not reproduce with this model pair.

That's a real result. A negative result, but not a null result.

The ChatGPT/Qwen dynamic that motivated this experiment appears to be model-pair specific, or language specific, or prompt specific, or some combination of all three. Qwen3.6 and Gemma-4 in German produced substantive, methodologically honest discourse with zero grounding intervention.

Which means the actually interesting question isn't "how much grounding breaks theater" — it's "under what conditions does theater emerge in the first place?"

Lessons and what's next

The claim metric needs a type classifier before escalation testing is meaningful. Probably rule-based: past-tense framing + specific measurement context = empirical claim; subjunctive/imperative + parameter framing = design proposal. Maybe a small LLM for edge cases.
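A first pass might look like this. The patterns are illustrative, not a finished design; since the transcripts are German, the real cue words would need to be German too.

```python
import re

# Crude rule-based claim-type classifier, along the lines sketched above.
# English cue words for readability here; the actual transcripts are German.
EMPIRICAL = re.compile(r"\b(measured|observed|found|was|were)\b.*\d", re.I)
DESIGN    = re.compile(r"\b(let's|set|propose|assume|could|should)\b.*\d", re.I)

def classify(sentence: str) -> str:
    if EMPIRICAL.search(sentence):
        return "empirical_claim"
    if DESIGN.search(sentence):
        return "design_proposal"
    return "ambiguous"                   # candidate for the small-LLM fallback

print(classify("We measured JS-divergence at 19%."))  # empirical_claim
print(classify("Let's set α to 0.85."))               # design_proposal
```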

I need more baseline runs before comparing to grounded conditions. Currently n=1. Sampling jitter alone could explain most of what I saw.

The more interesting experiment is now a factorial one: vary model pair, language, topic framing, and see which combinations actually produce theater-like behavior. The pipeline is built and reproducible. That's the part that survived this run intact.
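The sweep itself is cheap once the pipeline exists. A sketch, where run_dialog is a hypothetical wrapper around the existing script and the condition values are placeholders:

```python
from itertools import product

# Factorial sweep over the suspected factors. run_dialog is a hypothetical
# wrapper around the existing pipeline; condition values are placeholders.
model_pairs = [
    ("qwen3.6-35b-a4b", "gemma-4-26b-a4b-it"),
    ("gemma-4-26b-a4b-it", "qwen3.6-35b-a4b"),   # same pair, roles swapped
]
languages = ["de", "en"]
topics = ["informational density", "a question with checkable ground truth"]

for (explorer, skeptic), lang, topic in product(model_pairs, languages, topics):
    run_dialog(explorer, skeptic, language=lang, topic=topic, level=0)
```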

The most honest thing I can say about this whole project: I set out to measure something, found my instrument was broken, found my hypothesis wasn't reproducing, and ended the run with three unexpected observations that are each more interesting than what I was originally looking for. That feels about right for Grundlagenforschung — basic research for its own sake, not for a product, not for a deliverable. Just to understand the thing a little better.

The question I wanted to answer turned into a different question. I'll take it.

Do it yourself

If you want to run your own pairs, the full script and the logs from the baseline run are on GitHub. The pipeline is self-contained: point it at any two OpenAI-compatible endpoints (LM Studio, Ollama, MLX), set the escalation level, and it handles turn orchestration, thinking extraction, claim logging, and collapse detection automatically. The JSONL output includes raw text, cleaned text, thinking blocks, and per-turn metrics separately — so if you want to run your own analysis on the data without re-running the models, the logs are there too.
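And if you only want the data: the logs load in a few lines. Field names here are assumptions; the actual schema ships with the repo.

```python
import json

# Load a finished run and count how many turns produced a visible
# reasoning block (field names assumed; check the repo's schema).
with open("logs/baseline_run.jsonl") as f:
    turns = [json.loads(line) for line in f]

thinking_turns = [t for t in turns if t.get("thinking")]
print(f"{len(thinking_turns)} of {len(turns)} turns emitted reasoning content")
```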