Rebooting Our Thinking on AI GPU Demand
- Rebellionaire Staff
- Oct 20
- 3 min read
I went back and re-read Trask’s essay, and one section keeps nagging at me: what if we avoided retraining everything from scratch and instead focused on a smaller, rock-solid reasoning core with a toolset that can handle changes in facts? That alone could wipe out the bulk of the pretraining spend and change the whole conversation around the so-called "GPU shortages." The author says as much himself:
“The previous version was about 90% waste.”
What the argument boils down to, in plain English
A huge chunk of the "GPU demand" we're seeing is really waste - not hardware being pushed to its limits, but software that is just plain inefficient. If we make some smart changes to that software - sparsity, right-sizing, and not retraining the whole model all the time - we can squeeze far more work out of the same hardware. When that happens you get a step change in demand, not a smooth curve, which is why market models built on the assumption of smooth growth keep getting surprised.
Where does all that waste show up?
Dense inference vs sparsity
Dense transformers activate far more of the network than any single token actually needs. Mixture-of-Experts (MoE) helps by turning on only a subset of the model for each token, which is a good start. Trask's point is that there is more headroom still: if compute is routed more precisely to the parts of the model that actually need to run, the waste drops further. (The estimates are rough - it's the direction of travel that matters, not the exact numbers.)
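To make the sparsity point concrete, here is a toy sketch of top-k expert routing in the MoE style (plain NumPy, invented dimensions, all function names ours): only k of the experts run per token, so the active expert compute is roughly k divided by the number of experts, compared with a dense layer.

```python
import numpy as np

def moe_layer(x, gates_w, experts, k=2):
    """Toy top-k MoE layer: each token runs only k of len(experts) expert MLPs.

    x       : (tokens, d_model) activations
    gates_w : (d_model, n_experts) router weights
    experts : list of (W1, W2) weight pairs, one small ReLU MLP per expert
    Active expert FLOPs are roughly k / n_experts of the dense equivalent.
    """
    logits = x @ gates_w                              # (tokens, n_experts)
    topk = np.argsort(-logits, axis=1)[:, :k]         # k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                  # softmax over the selected experts only
        for weight, e in zip(w, topk[t]):
            W1, W2 = experts[e]
            out[t] += weight * (np.maximum(x[t] @ W1, 0.0) @ W2)
    return out

# Toy usage: 8 experts, 2 active per token -> about a quarter of the dense expert compute.
rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 8, 4
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))) for _ in range(n_experts)]
y = moe_layer(rng.normal(size=(tokens, d)), rng.normal(size=(d, n_experts)), experts, k=2)
print(y.shape)  # (4, 16)
```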
Right-sizing after training
LLMs don't just learn a pile of facts; they also learn how to organise those facts, which is a big part of why they end up so overparameterized in the first place. After training, pruning or distillation can leave you with a much leaner system, which tells you the original footprint was carrying a lot of slack.
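As a toy illustration of the right-sizing idea (a standard technique, not something specific from Trask's essay), magnitude pruning drops the smallest weights after training; the sparsity level below is invented.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.7):
    """Zero out the smallest-magnitude entries so only (1 - sparsity) of them survive.

    A crude stand-in for post-training right-sizing: if the pruned matrix still
    does most of the work, the dense original was carrying that much slack.
    """
    threshold = np.quantile(np.abs(weights), sparsity)   # cutoff below which weights are dropped
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(1)
W = rng.normal(size=(512, 512))
W_pruned, mask = magnitude_prune(W, sparsity=0.7)
print(f"kept {mask.mean():.0%} of the weights")          # roughly 30% of the original footprint
```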
Retraining habits (the big elephant in the room)
Instead of retraining the whole huge model every time knowledge changes, why not go with a small, ultra-durable cognitive core (the logic and reasoning bits) that is retrained infrequently, and use tools or retrieval-augmented generation (RAG) to bring in the fresh facts at runtime? Trask figures that if we did this, most of the pretraining demand would "disappear" - roughly 70% of global GPU spend, based on one lab's split.
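A minimal sketch of that "stable core plus retrieval" pattern, assuming a placeholder core_generate() for the frozen reasoning model and a toy keyword retriever; a real system would call an actual LLM and a vector store, but the point is that the knowledge base updates without touching the model's weights.

```python
# Sketch of "small frozen reasoning core + retrieval for fresh facts".
# core_generate() stands in for a call to a small, rarely retrained model.

KNOWLEDGE_BASE = {
    "gpu pricing": "Spot GPU pricing fell again this quarter (made-up example fact).",
    "model release": "Vendor X shipped a new frontier model last week (made-up example fact).",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Toy retriever: rank documents by keyword overlap with the query."""
    words = set(query.lower().replace("?", "").split())
    scored = sorted(KNOWLEDGE_BASE.items(), key=lambda kv: -len(words & set(kv[0].split())))
    return [doc for _, doc in scored[:k]]

def core_generate(prompt: str) -> str:
    """Placeholder for the frozen reasoning core (a real system would call an LLM here)."""
    return f"[reasoning over] {prompt}"

def answer(question: str) -> str:
    facts = retrieve(question)
    # Fresh facts arrive through the prompt, so knowledge updates need no retraining.
    return core_generate(f"Facts: {facts}\nQuestion: {question}")

print(answer("What happened to gpu pricing?"))
```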
Why forecasts can be so far off
Efficiency gets written into code, not built into the walls. A simple kernel update or scheduler change can suddenly give you a whole lot more oomph out of your hardware - overnight. And that’s the kind of discontinuity that all those smooth "up and to the right" demand curves just can’t see coming.
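To see why that produces discontinuities, here is a back-of-the-envelope calculation with invented numbers: if a kernel or scheduler update doubles tokens per GPU-second, the GPUs needed to serve a fixed workload halve overnight.

```python
# Illustrative only: every number here is made up to show the shape of the effect.
workload_tokens_per_s = 2_000_000        # fixed demand for generated tokens
tokens_per_gpu_s_before = 400            # per-GPU throughput before the software update
tokens_per_gpu_s_after = 800             # a kernel/scheduler fix doubles throughput

gpus_before = workload_tokens_per_s / tokens_per_gpu_s_before
gpus_after = workload_tokens_per_s / tokens_per_gpu_s_after
print(gpus_before, gpus_after)           # 5000.0 2500.0 -> a step down, not a smooth curve
```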
What counteracts the drop?
New workloads - longer contexts, heavier tool use - can soak up some of the efficiency gains. Whether they scale faster than the efficiency improvements is a genuinely open question, and nothing here is a prediction that efficiency will always win.
Next things to watch out for
Sparsity beyond MoE showing up in mainstream stacks (some signs of deeper routing happening).
Training footprints for new frontier models where capability keeps rising while training compute falls. A visible step-down would be a clear signal the thesis is playing out.
Retrain cadence shifting toward that slim, stable core plus retrieval and tooling for knowledge updates.
Editor's note (why this matters to our readers)
For investors - be careful to separate out the waste-driven demand (which is likely to decline) from the capability-driven demand (which is all about new use cases). It's a different risk profile.
For builders - keep an eye on GPU-minutes per token and job completion time under real I/O - those metrics will give you a much clearer view of the efficiency gains than just looking at raw GPU counts.
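For that builder metric, a quick sketch of what "GPU-minutes per token" means in practice, assuming you can pull GPU count, wall-clock time, and tokens served from your own job logs (the field names and numbers here are invented).

```python
def gpu_minutes_per_token(num_gpus: int, wall_clock_minutes: float, tokens_served: int) -> float:
    """Total GPU-minutes spent per token produced - lower is better."""
    return (num_gpus * wall_clock_minutes) / tokens_served

# Example with invented numbers: 64 GPUs running a 30-minute batch job that serves 12M tokens.
print(f"{gpu_minutes_per_token(64, 30.0, 12_000_000):.6f} GPU-minutes per token")  # 0.000160
```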
Resource
Trask, Andrew. GPU demand is (~1Mx) distorted by efficiency problems which are being solved. (Oct 19, 2025).