Rebooting Our Thinking on AI GPU Demand
- Rebellionaire Staff
- Oct 20
- 3 min read
I went back and re-read Trask’s essay, and one section keeps nagging at me: what if we avoided retraining everything from scratch and instead focused on a smaller, rock-solid reasoning core with a toolset that can handle changes in facts? That alone could wipe out the bulk of the pretraining spend and change the whole conversation around the so-called "GPU shortages." The author says as much himself:
“The previous version was about 90% waste.”
What the argument boils down to, in plain English
A huge chunk of the "GPU demand" we're seeing is really waste - not hardware being pushed to its limits, but software that is just plain inefficient. If we make some smart changes to that software - sparsity, right-sizing, and not retraining the whole model all the time - we can squeeze far more work out of the same hardware. When that happens you get a step change in demand, not a smooth curve, which is why market models built on the assumption of smooth growth keep getting surprised.
Where does all that waste show up?
Dense inference vs sparsity
Dense transformers activate far more of the network than any single token actually needs. Mixture-of-Experts (MoE) helps by turning on only a subset of the model for each token, which is a good start. Trask's point is that there is more headroom still: if compute is routed more precisely to the parts of the model that actually need to run, the waste drops further. (The estimates are rough - it's the direction of travel that matters, not the exact numbers.)
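To make the sparsity point concrete, here is a toy sketch of top-k expert routing in the MoE style (plain NumPy, invented dimensions, all function names ours): only k of the experts run per token, so the active expert compute is roughly k divided by the number of experts, compared with a dense layer.

```python
import numpy as np

def moe_layer(x, gates_w, experts, k=2):
    """Toy top-k MoE layer: each token runs only k of len(experts) expert MLPs.

    x       : (tokens, d_model) activations
    gates_w : (d_model, n_experts) router weights
    experts : list of (W1, W2) weight pairs, one small ReLU MLP per expert
    Active expert FLOPs are roughly k / n_experts of the dense equivalent.
    """
    logits = x @ gates_w                              # (tokens, n_experts)
    topk = np.argsort(-logits, axis=1)[:, :k]         # k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                  # softmax over the selected experts only
        for weight, e in zip(w, topk[t]):
            W1, W2 = experts[e]
            out[t] += weight * (np.maximum(x[t] @ W1, 0.0) @ W2)
    return out

# Toy usage: 8 experts, 2 active per token -> about a quarter of the dense expert compute.
rng = np.random.default_rng(0)
d, n_experts, tokens = 16, 8, 4
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))) for _ in range(n_experts)]
y = moe_layer(rng.normal(size=(tokens, d)), rng.normal(size=(d, n_experts)), experts, k=2)
print(y.shape)  # (4, 16)
```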
Right-sizing after training
LLMs don't just learn a pile of facts; they also learn how to organise those facts, which is a big part of why they end up so overparameterized in the first place. After training, pruning or distillation can leave you with a much leaner system, which tells you the original footprint was carrying a lot of slack.
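As a toy illustration of the right-sizing idea (a standard technique, not something specific from Trask's essay), magnitude pruning drops the smallest weights after training; the sparsity level below is invented.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.7):
    """Zero out the smallest-magnitude entries so only (1 - sparsity) of them survive.

    A crude stand-in for post-training right-sizing: if the pruned matrix still
    does most of the work, the dense original was carrying that much slack.
    """
    threshold = np.quantile(np.abs(weights), sparsity)   # cutoff below which weights are dropped
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(1)
W = rng.normal(size=(512, 512))
W_pruned, mask = magnitude_prune(W, sparsity=0.7)
print(f"kept {mask.mean():.0%} of the weights")          # roughly 30% of the original footprint
```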
Retraining habits (the big elephant in the room)
Instead of retraining the whole huge model every time knowledge changes, why not go with a small, ultra-durable cognitive core (the logic and reasoning bits) that is retrained infrequently, and use tools or retrieval-augmented generation (RAG) to bring in the fresh facts at runtime? Trask figures that if we did this, most of the pretraining demand would "disappear" - roughly 70% of global GPU spend, based on one lab's split.
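A minimal sketch of that "stable core plus retrieval" pattern, assuming a placeholder core_generate() for the frozen reasoning model and a toy keyword retriever; a real system would call an actual LLM and a vector store, but the point is that the knowledge base updates without touching the model's weights.

```python
# Sketch of "small frozen reasoning core + retrieval for fresh facts".
# core_generate() stands in for a call to a small, rarely retrained model.

KNOWLEDGE_BASE = {
    "gpu pricing": "Spot GPU pricing fell again this quarter (made-up example fact).",
    "model release": "Vendor X shipped a new frontier model last week (made-up example fact).",
}

def retrieve(query: str, k: int = 1) -> list[str]:
    """Toy retriever: rank documents by keyword overlap with the query."""
    words = set(query.lower().replace("?", "").split())
    scored = sorted(KNOWLEDGE_BASE.items(), key=lambda kv: -len(words & set(kv[0].split())))
    return [doc for _, doc in scored[:k]]

def core_generate(prompt: str) -> str:
    """Placeholder for the frozen reasoning core (a real system would call an LLM here)."""
    return f"[reasoning over] {prompt}"

def answer(question: str) -> str:
    facts = retrieve(question)
    # Fresh facts arrive through the prompt, so knowledge updates need no retraining.
    return core_generate(f"Facts: {facts}\nQuestion: {question}")

print(answer("What happened to gpu pricing?"))
```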
Why forecasts can be so far off
Efficiency gets written into code, not built into the walls. A simple kernel update or scheduler change can suddenly give you a whole lot more oomph out of your hardware - overnight. And that’s the kind of discontinuity that all those smooth "up and to the right" demand curves just can’t see coming.
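To see why that produces discontinuities, here is a back-of-the-envelope calculation with invented numbers: if a kernel or scheduler update doubles tokens per GPU-second, the GPUs needed to serve a fixed workload halve overnight.

```python
# Illustrative only: every number here is made up to show the shape of the effect.
workload_tokens_per_s = 2_000_000        # fixed demand for generated tokens
tokens_per_gpu_s_before = 400            # per-GPU throughput before the software update
tokens_per_gpu_s_after = 800             # a kernel/scheduler fix doubles throughput

gpus_before = workload_tokens_per_s / tokens_per_gpu_s_before
gpus_after = workload_tokens_per_s / tokens_per_gpu_s_after
print(gpus_before, gpus_after)           # 5000.0 2500.0 -> a step down, not a smooth curve
```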
What counteracts the drop?
New workloads - longer contexts, heavier tool use - can soak up some of the efficiency gains. Whether they scale faster than the efficiency improvements is a genuinely open question, and nothing here is a prediction that efficiency will always win.
Next things to watch out for
Sparsity beyond MoE showing up in mainstream stacks (some signs of deeper routing happening).
Training footprints for new frontier models where capability keeps rising while training compute falls. A visible step-down would be a clear signal the thesis is playing out.
Retrain cadence shifting toward that slim, stable core plus retrieval and tooling for knowledge updates.
Editor's note (why this matters to our readers)
For investors - be careful to separate out the waste-driven demand (which is likely to decline) from the capability-driven demand (which is all about new use cases). It's a different risk profile.
For builders - keep an eye on GPU-minutes per token and job completion time under real I/O - those metrics will give you a much clearer view of the efficiency gains than just looking at raw GPU counts.
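For that builder metric, a quick sketch of what "GPU-minutes per token" means in practice, assuming you can pull GPU count, wall-clock time, and tokens served from your own job logs (the field names and numbers here are invented).

```python
def gpu_minutes_per_token(num_gpus: int, wall_clock_minutes: float, tokens_served: int) -> float:
    """Total GPU-minutes spent per token produced - lower is better."""
    return (num_gpus * wall_clock_minutes) / tokens_served

# Example with invented numbers: 64 GPUs running a 30-minute batch job that serves 12M tokens.
print(f"{gpu_minutes_per_token(64, 30.0, 12_000_000):.6f} GPU-minutes per token")  # 0.000160
```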
Resource
Trask, Andrew. GPU demand is (~1Mx) distorted by efficiency problems which are being solved. (Oct 19, 2025).