More Tokens Isn’t More Intelligence

By Brianne Lee

There’s a leaderboard I’ve been watching lately that tracks how many tokens different versions of Claude consume on the same prompts (https://tokens.billchambers.me/leaderboard). Opus 4.6 versus Opus 4.7: same input, different token counts. The community submits prompts, the numbers get aggregated, and you can see, in real time, how token consumption has drifted between model versions.
It’s a useful tool. It’s also measuring the wrong thing — or more precisely, only one side of an equation whose other side nobody seems to be measuring.
Here is the question nobody is asking clearly: when a newer model consumes more tokens for the same prompt, is that because it’s thinking harder, or because it’s just running up the bill?
We’ve collectively decided that “scaling” is good. More parameters, more training data, more compute, more tokens at inference. The word has become a synonym for progress. But scaling isn’t one thing. It’s at least four things, and they behave differently — sometimes opposite — from each other. The one getting the most attention right now, inference-time compute, is also the one with the weakest theoretical foundation.
I want to argue that the current AI scaling conversation is missing a principle that biology figured out hundreds of millions of years ago: scaling only works when it preserves a balance between competing objectives. Without that balance, you don’t get more intelligence. You just get more overhead.

Four Scalings, Often Confused
When people say “AI is scaling,” they usually mean one of these:
Model scaling. More parameters. GPT-3 had 175 billion. GPT-4 is larger. The parameter count grows.
Data scaling. More training tokens. The Chinchilla paper in 2022 argued that the field had been undertraining models relative to their size, and that optimal performance required scaling data in proportion to parameters.
Compute scaling. More FLOPs during training. Larger models trained on more data require more compute, and the relationship follows a power law.
Inference-time scaling. More tokens consumed per query at runtime. Chain-of-thought, extended thinking, o1-style deliberation, agentic scaffolding. The model “thinks longer” before answering.
The first three have empirical scaling laws behind them — the Kaplan and Chinchilla curves — and even those empirical laws are poorly understood at a first-principles level. The fourth has no comparable theoretical foundation at all. And yet it’s the one driving most of the current excitement about “reasoning models.”
The conflation matters. When researchers say “scaling keeps working,” they often mean model/data/compute scaling during training. When product teams say “scaling keeps working,” they often mean test-time compute during inference. These are different regimes governed by different dynamics, and the evidence that one works is not evidence that the other does.

What Inference-Time Scaling Actually Does
Let me be precise about what happens when a model “thinks longer.”
Sometimes, more tokens produce better answers. On mathematics problems, multi-step logic, and certain planning tasks, chain-of-thought reasoning genuinely helps. There are benchmarks where extended inference demonstrably improves accuracy.
But sometimes, more tokens produce the same answer with more words around it. The model restates the problem. It hedges. It generates plausible-sounding intermediate reasoning that doesn’t actually change the output. Token count rises. Accuracy is flat.
Sometimes, more tokens produce worse answers. Extended reasoning can cause a model to second-guess a correct initial response, drift from the original question, or hallucinate intermediate steps that corrupt the final answer. Researchers have started calling this “overthinking.”
And sometimes, the token increase isn’t reasoning at all. It’s formatting, safety disclaimers, meta-commentary, or internal scaffolding introduced in newer model versions. The tokens go up; the cognition doesn’t.
 
The leaderboard I mentioned at the start measures total token consumption. It doesn’t measure which category each query falls into. That’s not a criticism of the leaderboard — it’s doing what it set out to do. The criticism is of the broader conversation, which has decided that rising token consumption is either (a) evidence of progress or (b) evidence of inefficiency, without noticing that the same number can mean either, depending on what’s actually happening inside the model.

This isn’t just an engineering hurdle; it’s a thermodynamic one. As Laughlin et al. (2007) demonstrated, the metabolic cost of information grows faster than the information itself.
 
We don’t have good tools for distinguishing these cases. And without them, “inference-time scaling” is a claim we can’t verify.

What Biology Did Differently
In 2007, I published a paper with Charles Stevens at the Salk Institute on how vertebrate retinas scale. The question we were asking was similar in structure to the one AI researchers are asking now: when a neural circuit grows, what governs the growth?
We had three candidate design principles to test. The first was maximum spatial resolution — keep each retinal ganglion cell’s dendritic arbor small, so each cell covers a tiny patch of the visual field. More cells, finer detail. The second was maximum signal accuracy — grow each arbor large, so each cell averages more light and produces a less noisy signal. Fewer cells, more certainty per cell. The third was a compromise — scale the arbor size at a specific sub-linear rate that preserved an optimal ratio between resolution and accuracy.
We measured 70 cells across 26 fish spanning an order of magnitude in eye size. The third principle won. Arbor area scaled with retina area at roughly the 1/2 power — not flat, not linear, but balanced. Evolution had selected for the specific exponent that kept resolution and accuracy in optimal tension.
The deeper point of the paper wasn’t about retinas. It was about what makes scaling work at all. Scaling succeeds when a system has identified the competing objectives that matter and preserves their ratio during growth. Scaling fails when you pick one objective and maximize it, because at large enough scales the other objective catches up and breaks the system.
A retina that maximized only resolution would produce beautifully detailed but impossibly noisy images. A retina that maximized only accuracy would produce clean signals of a visual world reduced to a handful of pixels. Neither works. The balanced regime works, and the mathematical signature of that balance is a specific power-law exponent.
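The signature of the balanced regime is something you can read directly off data: plot the two quantities log-log and fit a line, and the slope is the scaling exponent. Here is a minimal sketch of that measurement in pure Python. The numbers are invented for illustration (they are not the measurements from the paper); they are constructed to follow an arbor-area ∝ retina-area^0.5 relationship, so the fitted exponent comes out near 1/2.

```python
import math

# Hypothetical (retina_area, arbor_area) pairs in arbitrary units,
# constructed to roughly follow arbor_area ~ retina_area ** 0.5.
# Illustrative numbers only -- not data from the 2007 paper.
measurements = [(1.0, 1.0), (4.0, 2.1), (16.0, 3.9), (64.0, 8.2), (256.0, 15.8)]

def fit_power_law_exponent(pairs):
    """Least-squares slope of log(y) vs log(x), i.e. the exponent b in y ~ x**b."""
    xs = [math.log(x) for x, _ in pairs]
    ys = [math.log(y) for _, y in pairs]
    n = len(pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

exponent = fit_power_law_exponent(measurements)
print(f"fitted exponent: {exponent:.2f}")  # close to 0.5 for these numbers
```

An exponent near 0 would mean arbors that don’t grow at all (pure resolution); an exponent near 1 would mean arbors that track retina size exactly (pure accuracy); 1/2 is the balance point between them.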
This is not a metaphor. It is a principle that recurs across biological scaling problems: the brain-body mass relationship, the cortical surface area to thickness relationship, the density of synapses per neuron across species. Biology scales by preserving ratios.

The Missing Principle in Inference-Time Scaling
Now look at inference-time compute in AI.
What competing objectives are being balanced as token consumption grows? What ratio is being preserved?
I don’t think anyone knows. I’m not being rhetorical — I mean literally, the field does not have a clearly articulated answer to this question. There are candidate objectives, and they might be worth naming:
Reasoning depth vs. coherence. Longer reasoning chains can explore more of the problem space but are also more likely to drift from the original question.
Exploration vs. commitment. More tokens allow the model to consider more alternatives, but they also dilute the signal of the model’s initial best guess.
Accuracy vs. computational cost. Each additional token costs real energy and dollars; the marginal accuracy gain has to justify that cost.
Breadth vs. specificity. Longer outputs can cover more ground, but they can also muddy a sharp answer.
If inference-time scaling were following a biological-style design principle, we’d expect it to preserve the ratio between some pair of these competing objectives as token counts grow. We’d expect there to be a specific exponent — not a flat line, not linear growth, but a sub-linear power law that reflects the balance.
We don’t see that. What we see instead is more tokens, assumed to be more thinking, without a framework for distinguishing the cases where that’s true from the cases where it isn’t.
This is what I mean by saying the leaderboard measures only one side of a two-sided equation. The token count is the cost. The accuracy gain is the benefit. The ratio between them is where the real story lives, and we don’t have good public tools for tracking it.

What This Would Look Like If We Got It Right
Imagine a version of the token leaderboard that tracked not just how many tokens a model consumed on a prompt, but how accuracy (or some quality metric) changed as a function of those tokens. The slope of that relationship, plotted log-log, would tell you whether the model was in a productive scaling regime or a wasteful one.
A flat slope would mean the extra tokens aren’t doing anything. A steep positive slope would mean extended reasoning genuinely helps. A negative slope would mean overthinking is actively hurting performance. A sub-linear positive slope — the biological regime — would mean the model is preserving a useful balance between competing objectives as it scales its deliberation.
Different task types would probably show different scaling regimes. A hard math problem might show steep positive scaling (real reasoning helps). A simple factual question might show a flat line (extra tokens don’t help, they just add cost). Certain classes of questions might even show negative scaling — cases where the model’s first instinct is correct and extended deliberation introduces errors.
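The diagnostic described above is straightforward to sketch. Given (token count, quality score) pairs for a task, fit the log-log slope and name the regime. Everything here is illustrative: the task curves, the quality scores, and the 0.05 flatness threshold are invented for the sake of the example, not measured from any real model.

```python
import math

def loglog_slope(points):
    """Least-squares slope of log(quality) vs log(tokens)."""
    xs = [math.log(t) for t, _ in points]
    ys = [math.log(q) for _, q in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def classify_regime(points, eps=0.05):
    """Name the scaling regime from the fitted exponent (eps is an arbitrary flatness threshold)."""
    slope = loglog_slope(points)
    if slope < -eps:
        return "negative (overthinking: more tokens, worse answers)"
    if slope <= eps:
        return "flat (tokens are overhead, not reasoning)"
    if slope < 1.0:
        return "sub-linear (the balanced, biology-like regime)"
    return "super-linear (steep gains from extended reasoning)"

# Hypothetical (tokens, quality) curves for three task types.
math_task     = [(100, 0.30), (400, 0.55), (1600, 0.80)]  # deliberation helps
factual_task  = [(100, 0.90), (400, 0.90), (1600, 0.91)]  # extra tokens are cost
instinct_task = [(100, 0.85), (400, 0.70), (1600, 0.55)]  # first instinct was right

for name, curve in [("math", math_task), ("factual", factual_task), ("instinct", instinct_task)]:
    print(f"{name}: {classify_regime(curve)}")
```

A real version would need a defensible quality metric per task, many prompts per bucket, and error bars on the slope — but even this toy version makes the point that a single token count, without the quality axis, cannot distinguish the four regimes.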
This is the kind of diagnostic that’s currently missing from the public conversation. And I suspect that if we built it, we’d find that a lot of what’s currently being sold as “reasoning” is actually falling into the flat or negative regimes more often than anyone wants to admit.

Why I’m Writing This
I spent part of my research career studying how biological neural systems scale. The central finding of that work wasn’t a specific number — it was a structural insight: scaling is not the same as growing. Scaling is growth that preserves a principled relationship between competing quantities. Without that principle, you’re not scaling. You’re just accumulating.
The current AI scaling conversation is mostly about accumulation. More parameters. More data. More tokens. The word “scaling” has drifted to mean any form of getting bigger, without the disciplining constraint that biology places on the concept.
My bet is that the next real breakthrough in AI scaling won’t come from more — it will come from identifying the competing objectives that matter and designing systems that preserve their ratio as they grow. That’s what biology has been doing for 500 million years. It’s a framework the field can borrow from, if it’s willing to.
Until then, a leaderboard rising is not necessarily a model improving. It might be. It might not be. Without measuring the other side of the equation, we don’t know. And “we don’t know” should be a more uncomfortable answer than the field is currently willing to give.

Brianne Lee is a former researcher at the Salk Institute and co-author, with Charles F. Stevens, of the 2007 PNAS paper “General design principle for scalable neural circuits in a vertebrate retina.” This is the second in a series on biological and artificial scaling. Read Article 1 here [https://www.pnas.org/doi/epdf/10.1073/pnas.0705469104].

Laughlin et al. (2007): https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.0050116

