Open-weight models are catching the closed frontier. They are also getting dramatically smaller. And consumer silicon is getting dramatically faster at exactly the work they do. Three exponential curves, one destination — the capability that needs a datacenter today runs in your hand tomorrow.
Below: the evidence for each curve, the single equation that decides whether a model fits on a device, and a dated — deliberately caveated — estimate of when the lines cross.
// THE GOVERNING CONSTRAINT
A model runs at usable speed on a device only when its decode rate, roughly memory bandwidth divided by the bytes of weights read per token (parameters × bits per weight ÷ 8), clears an interactive threshold. The numerator is silicon. The denominator is set by how big the model is and how few bits each weight needs. Every trend below pushes one of these terms the right way, at the same time. That is the whole argument; the rest is measuring the slopes.
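In symbols, the constraint can be reconstructed from the terms the article names. This is a sketch rather than the author's exact formula: the symbols $B_{\text{mem}}$ (memory bandwidth in bytes per second), $N$ (parameters read per token), $b$ (bits per weight) and $r_{\min}$ (the interactive threshold in tokens per second) are labels chosen here, and KV-cache traffic is ignored.

```latex
\[
  \text{tokens/s} \;\approx\; \frac{B_{\text{mem}}}{N \cdot b / 8} \;\ge\; r_{\min}
\]
```

Raising $B_{\text{mem}}$ is the hardware stream; lowering $N$ and $b$ is the model stream.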
For three years the default mental model has been simple: the good AI lives in someone else's datacenter, you rent it by the token, your phone is a thin client. That model is quietly expiring — not because phones are about to out-muscle datacenters (they aren't), but because the amount of muscle required to deliver a given level of intelligence is collapsing far faster than most people have priced in.
There is a precedent. From 1946 to 2009, computing efficiency (performance per watt) doubled roughly every 1.5 years [7]. That single trend, not raw peak performance, moved computing out of the mainframe room and onto the desk, then the lap, then into the pocket. PCs never beat mainframes head-to-head. They became capable enough inside a consumer power budget. AI inference is on the same threshold, and you can watch it in three measurable streams.
The gap everyone assumed was structural turned out to be a delay. End of 2023: the best closed model scored ~88% on the MMLU knowledge benchmark; the best open-weight model managed ~70.5% [6]. A real moat. By early 2026 that gap is effectively gone on knowledge tasks and into single digits on most reasoning tasks.
Epoch AI measures this rigorously with a composite capability index. The headline is stark: across 2026 the best open-weight models trail the closed state-of-the-art by an average of about four months, up slightly from three months over 2023–2025, but down from roughly a year in late 2024 [1]. A four-month lag that holds steady while the frontier accelerates is not weakness. It is lockstep.
What broke the moat was efficiency, not spend. DeepSeek's V3 base model trained on ~2.6M GPU-hours against Llama-3-405B's 30.8M, an order-of-magnitude training-efficiency gain that any lab could copy [6]. The result was a cascade: between January and May 2026, a string of frontier-class open models shipped (Kimi K2.6, DeepSeek V4, GLM-5, Qwen3, MiniMax, gpt-oss), several within weeks of each other [1].
This stream does the heavy lifting, and it has three independent engines — better models per parameter, fewer bits per weight, cheaper attention. They compound.
The cleanest result in the field is the Densing Law. Define capability density as performance per parameter. Across open models it has been doubling roughly every 3.3–3.5 months [4]. Plainly: every quarter, the same quality at half the parameters.
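The doubling claim translates directly into a parameter budget over time. The sketch below assumes the trend holds as a clean exponential; the function name and the 3.4-month default (the midpoint of the quoted range) are choices made here for illustration.

```python
def params_for_same_quality(params_now_b: float, months: float,
                            doubling_months: float = 3.4) -> float:
    """Parameters (in billions) needed `months` from now for the same quality,
    assuming capability density doubles every `doubling_months`."""
    return params_now_b / 2 ** (months / doubling_months)


# Illustrative only: a 70B-class quality level, one year out.
print(f"{params_for_same_quality(70, 12):.1f}B")
```

Under these assumptions a 70B-quality model shrinks to roughly 6B parameters in a year, which is why density is treated as the largest single lever.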
The same physics shows in cost: Epoch finds the price of running an LLM at a fixed performance level falling ~2 orders of magnitude per year (halving roughly every two months), and pre-training compute efficiency improving ~3×/year [3]. Different instruments, same trend.
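The step from "two orders of magnitude per year" to "halving roughly every two months" is a one-line conversion. A minimal check, assuming a steady exponential decline (the function name is hypothetical):

```python
import math


def halving_period_months(factor_per_year: float) -> float:
    """Months per halving (or doubling) for a quantity that changes by
    `factor_per_year` every 12 months, assuming a steady exponential."""
    return 12 / math.log2(factor_per_year)


print(f"100x/year -> halves every {halving_period_months(100):.1f} months")
print(f"3x/year   -> doubles every {halving_period_months(3):.1f} months")
```

A 100× annual drop works out to a halving about every 1.8 months, consistent with the rounded figure in the text.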
On-device footprint is parameters × bits-per-weight. Density attacks the first term; quantization attacks the second. Four-bit weights are routine now: AWQ/GPTQ cut a model to a quarter of its 16-bit size with little measurable loss, inside mainstream serving stacks [5]. Below four bits the frontier is ternary: BitNet b1.58 stores each weight as {−1, 0, +1} (~1.58 bits), and from 3B up matches full-precision LLaMA quality at ~7× memory savings, ~4× faster decode, ~70× better energy [8].
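The footprint product is easy to tabulate. The sketch below counts weight storage only, using 1 GB = 10^9 bytes, and ignores KV cache, activations and quantization metadata such as per-group scales; those omissions are simplifications made here. The 1.58 figure is log2(3), the information content of a three-valued weight.

```python
import math


def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: parameters x bits per weight / 8."""
    return params_billions * bits_per_weight / 8


TERNARY_BITS = math.log2(3)  # bits needed to encode one of {-1, 0, +1}

for label, bits in [("fp16", 16), ("int4", 4), ("ternary", TERNARY_BITS)]:
    print(f"7B @ {label}: {weight_footprint_gb(7, bits):.2f} GB")
```

The 7B model here is just an example size; the same function applies to the active-parameter count of a mixture-of-experts model when estimating per-token traffic rather than storage.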
Standard attention scales with the square of context length; at long contexts it can eat 70–80% of decode latency [9]. The fix arriving across open models: a lightweight indexer selects the top-k relevant tokens, dropping cost toward linear. DeepSeek V3.2 ships exactly this: near-linear O(kL) attention, ~50% lower long-context cost, quality on par with dense [5]. Mixture-of-experts adds a fourth multiplier: only a fraction of parameters fire per token, so “frontier-level intelligence” now ships with ~11–20B active parameters even when the full model is far larger [10].
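The top-k idea can be shown for a single decode step in plain Python. This is a toy sketch of the general technique, not DeepSeek's implementation: the separate low-cost `index_query` / `index_keys` vectors stand in for whatever a real indexer computes, and every name below is hypothetical. Note that the indexer still scores every cached token; the saving is that the expensive softmax attention touches only `k` of them.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(query, keys, values, index_query, index_keys, k):
    """One decode step of attention restricted to the k tokens the indexer ranks highest."""
    # 1. Lightweight indexer: one cheap relevance score per cached token.
    index_scores = [dot(index_query, ik) for ik in index_keys]
    # 2. Keep the k best-scoring positions (ties keep the earlier token).
    ranked = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    selected = ranked[:k]
    # 3. Ordinary scaled softmax attention, but only over the selected tokens.
    scale = 1 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]
    return output, sorted(selected)


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
out, used = topk_sparse_attention([1.0, 0.0], keys, values, [1.0, 0.0], keys, k=2)
print(out, used)
```

Setting `k` to the full context length recovers dense attention, which makes the trade-off explicit: quality depends on how well the indexer's ranking matches what dense attention would have weighted heavily.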
Density and MoE shrink active params. Quantization shrinks bits. Sparse attention removes the quadratic blow-up that wrecks the equation at long context. Three of the four levers in the denominator are being pulled down hard, at once. A Stanford study finds the net “intelligence per watt” of local inference improved 5.3× from 2023 to 2025, with the share of real queries a local model handles well rising from 23% to 71% [7].
Now the numerator. Consumer SoCs stopped treating AI as a side feature and started designing around it. Qualcomm's Snapdragon 8 Elite Gen 5 pushes on-device AI throughput to ~100 TOPS, up from the 40–45 TOPS class a generation earlier, and quotes ~220 tokens/second on-device, up from ~70 the prior year [10]. Apple took a different route to the same place: the M5 puts a neural accelerator in every GPU core, for up to 3.5× the AI compute of the M4 [11].
But TOPS is the wrong hero metric — and it's worth being honest about why. Token generation is memory-bound, not compute-bound: to emit each token, the chip streams the model's weights through memory once. That is why the equation has memory bandwidth in the numerator, not TOPS. It is the term that actually gates speed.
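A back-of-envelope comparison shows how lopsided the two costs are. The sketch below assumes batch size 1, about 2 operations per parameter per generated token (a common rule of thumb, not a figure from the source), and that the chip actually reaches its quoted peak TOPS; the function name is hypothetical. The 76 GB/s and 100 TOPS inputs are the phone-class figures quoted elsewhere in this article.

```python
def per_token_times_ms(params_billions, bits_per_weight,
                       bandwidth_gb_s, tera_ops_per_s, ops_per_param=2):
    """Return (memory_ms, compute_ms) for generating one token.

    memory_ms:  time to read every weight from memory once.
    compute_ms: time to do the arithmetic at peak throughput.
    """
    weight_gb = params_billions * bits_per_weight / 8
    memory_ms = weight_gb / bandwidth_gb_s * 1000
    total_ops = params_billions * 1e9 * ops_per_param
    compute_ms = total_ops / (tera_ops_per_s * 1e12) * 1000
    return memory_ms, compute_ms


memory_ms, compute_ms = per_token_times_ms(7, 4, bandwidth_gb_s=76, tera_ops_per_s=100)
print(f"memory: {memory_ms:.1f} ms, compute: {compute_ms:.2f} ms")
```

For a 7B model at 4-bit this gives roughly 46 ms of memory traffic against about 0.14 ms of arithmetic per token, so doubling TOPS would barely move decode speed while doubling bandwidth would nearly double it.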
// Memory bandwidth by device class [11]
Plug it in: a 7B model at 4-bit (~3.5 GB) on a 600 GB/s laptop clears ~170 tok/s in principle; the same model on a 76 GB/s phone clears ~20, already past reading speed.
Put the streams back into the equation. Fix a capability target — say, mid-2025 frontier quality (GPT-5 / Gemini-3 class). Then ask one question over time: how much on-device memory does it take to hold the smallest open model that reaches that target, at 4-bit? Density (halving param count every ~3–4 months) and quantization drive the requirement down. Device ceilings sit still. They cross.
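The crossing point follows from the halving rate alone. The sketch below assumes the memory requirement halves on a fixed schedule and the device ceiling stays flat, as the paragraph above describes; the 128 GB and 8 GB inputs are hypothetical placeholders, not estimates from the source, and the check covers capacity only, not bandwidth.

```python
import math


def months_until_fit(required_gb_now: float, device_gb: float,
                     halving_months: float = 3.5) -> float:
    """Months until a requirement that halves every `halving_months`
    drops to a fixed device memory ceiling."""
    if required_gb_now <= device_gb:
        return 0.0  # already fits
    return halving_months * math.log2(required_gb_now / device_gb)


# Hypothetical numbers for illustration only.
print(months_until_fit(required_gb_now=128, device_gb=8))
```

With those placeholder inputs the requirement needs four halvings, or 14 months. The result is sensitive to `halving_months`, which is the reason any single date deserves the caveats that follow.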
The crossover is not speculative in kind, only in date. It already happened for earlier tiers: GPT-3.5-class quality needed a 175B model and a server in 2022; it now runs comfortably on a laptop, and as a small quantized model on a phone. Epoch's separate measurement finds frontier capability becomes runnable on a single consumer GPU on a 6–12 month average lag, and that consumer-class open models improve faster than the frontier itself (+125 vs +80 Elo/year) [2]. The pocket is the next ceiling down.
The frontier is a moving target. The honest version is not “your phone runs the frontier.” It is “your phone runs last year's frontier” — and that delay stays roughly constant as both lines advance. The interesting claim is that the gap is now a fixed lag of months, not a permanent tier difference.
Bandwidth is the real wall. Capacity (do the weights fit?) and bandwidth (can they stream fast enough?) are different constraints. Fitting in memory is necessary, not sufficient. Phone bandwidth climbs slower than capacity, so the phone crossover is the later, harder one.
Benchmarks overstate. Lab scores and deployed performance diverge — agentic and real-world tasks routinely show large gaps. “Matches on benchmark X” is not “matches in your product.” And quantization, near-lossless at 4-bit, does degrade below it.
Even so, the direction is not in dispute. Three independently measured exponentials — density, precision, bandwidth — push the same inequality the same way. You needn't believe any single date in Fig. 05 to accept the structure: the cost of a unit of intelligence is falling toward the device.
Stanford's telemetry point is the quiet bombshell: ~77% of real AI requests are routine (writing, summarizing, looking things up), tasks that don't need frontier capability and could already run on the device [7]. We are shipping “write me an email” to a building full of accelerators. The convergence above means we increasingly won't have to.
The second-order effects rhyme with mainframe-to-PC. When a capable model is local, three things stop being true at once: inference stops being a per-token meter, your data stops leaving the device, and capability stops requiring connectivity. Intelligence becomes a property of hardware you own rather than a subscription to someone else's — the same shift that turned computing from a service you queued for into a thing you simply had.
That's the thesis, with the math attached. The lines are real, the slopes are measured, and they bend toward the same point. The only open question is the date stamped on the intersection — and on current evidence, it is closer than the rented-token model assumes.