FP-RD-014 // THE POCKET FRONTIER

For three years the default mental model has been simple: the good AI lives in someone else's datacenter, you rent it by the token, your phone is a thin client. That model is quietly expiring — not because phones are about to out-muscle datacenters (they aren't), but because the amount of muscle required to deliver a given level of intelligence is collapsing far faster than most people have priced in.

There is a precedent. From 1946 to 2009, computing efficiency — performance per watt — doubled roughly every 1.5 years.⁷ That single trend, not raw peak performance, moved computing out of the mainframe room and onto the desk, then the lap, then into the pocket. PCs never beat mainframes head-to-head. They became capable enough inside a consumer power budget. AI inference is on the same threshold — and you can watch it in three measurable streams.

01 // Capability

STREAM 01 / REF-001

The open frontier is months behind, not years

The gap everyone assumed was structural turned out to be a delay. End of 2023: the best closed model scored ~88% on the MMLU knowledge benchmark; the best open-weight model managed ~70.5%.⁶ A real moat. By early 2026 that gap is effectively gone on knowledge tasks and into single digits on most reasoning tasks.

Epoch AI measures this rigorously with a composite capability index. The headline is stark: across 2026 the best open-weight models trail the closed state-of-the-art by an average of about four months. That is slightly above the three-month average over 2023–2025, but well below the roughly one-year gap seen in late 2024.¹ A lag of a few months that holds while the frontier accelerates is not weakness. It is lockstep.

FIG. 01 / REF-001 · Knowledge benchmark (%)

The gap closed from ~18 points to ~1

Series: closed frontier · best open-weight

The lines converge. MMLU-class trajectory, closed vs. best open-weight. Independent check: Epoch AI's capability index puts the open lag at ~3–4 months in 2026, down from ~12 months in late 2024.¹ Values indicative, from reported figures.

What broke the moat was efficiency, not spend. DeepSeek's V3 base model trained on ~2.6M GPU-hours against Llama-3-405B's 30.8M — an order-of-magnitude training-efficiency gain that any lab could copy.⁶ The result was a cascade: between January and May 2026, a string of frontier-class open models shipped — Kimi K2.6, DeepSeek V4, GLM-5, Qwen3, MiniMax, gpt-oss — several within weeks of each other.¹

// the real question

It is no longer “can open models reach the frontier?” They do, on a short delay. It is “how small can a frontier-grade model become?”

02 // Efficiency

STREAM 02 / REF-002

The same intelligence keeps getting smaller

This stream does the heavy lifting, and it has three independent engines — better models per parameter, fewer bits per weight, cheaper attention. They compound.

Engine A — capability density

The cleanest result in the field is the Densing Law. Define capability density as performance per parameter. Across open models it has been doubling roughly every 3.3–3.5 months.⁴ Plainly: every quarter, the same quality at half the parameters.
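The halving claim can be made concrete with a few lines of arithmetic. The sketch below is only an illustration of the exponential: the function name `params_for_same_quality` and the 8B starting size are hypothetical choices, the 3.3-month default is the lower end of the range quoted above, and the result is meaningful only for as long as the trend actually holds.

```python
def params_for_same_quality(initial_params_b: float,
                            months_elapsed: float,
                            doubling_months: float = 3.3) -> float:
    """Parameters (in billions) needed to hold quality fixed after
    `months_elapsed`, if capability density doubles every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    # Density doubling means the parameter count for fixed quality halves.
    return initial_params_b / 2 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # Hypothetical 8B starting point, chosen only for illustration.
    for months in (0, 3.3, 6.6, 12):
        needed = params_for_same_quality(8.0, months)
        print(f"after {months:>4} months: {needed:.2f}B parameters")
```

Swapping in 3.5 for `doubling_months` gives the slower end of the reported range.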

FIG. 02 / REF-002 · Params for GPT-3.5-class (log scale)

Same quality, an order of magnitude fewer parameters

A straight line on a log axis is an exponential collapse. Smallest model reaching GPT-3.5-class quality, by release period — 175B (late 2022) toward the low billions (2026). Slope consistent with the Densing Law's ~3.3-month halving.⁴ Points indicative.

The same physics shows in cost: Epoch finds the price of running an LLM at a fixed performance level falling ~2 orders of magnitude per year — halving roughly every two months — and pre-training compute efficiency improving ~3×/year.³ Different instruments, same trend.
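The two ways of stating a rate ("halving every N months" and "×K per year") are interchangeable, and converting between them is a useful sanity check. A minimal sketch, with hypothetical function names; it only restates the figures quoted above and adds no new data.

```python
import math


def annual_factor(halving_months: float) -> float:
    """Improvement factor per year implied by a halving (or doubling) time."""
    return 2 ** (12 / halving_months)


def orders_of_magnitude_per_year(halving_months: float) -> float:
    """The same rate expressed as powers of ten per year."""
    return math.log10(annual_factor(halving_months))


def halving_months_for_annual_factor(factor: float) -> float:
    """Halving (or doubling) time in months implied by an annual factor."""
    return 12 * math.log(2) / math.log(factor)


if __name__ == "__main__":
    # Price halving every two months, as quoted for inference cost.
    print(annual_factor(2), orders_of_magnitude_per_year(2))
    # ~3x per year pre-training efficiency, as a doubling time in months.
    print(halving_months_for_annual_factor(3))
```

A two-month halving works out to 64× per year, a little under two full orders of magnitude, so the two phrasings agree to within rounding.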

Engine B — quantization

On-device footprint is parameters × bits-per-weight. Density attacks the first term; quantization attacks the second. Four-bit weights are routine now: AWQ and GPTQ cut a model to a quarter of its 16-bit size with little measurable loss, and both run inside mainstream serving stacks.⁵ Below four bits the frontier is ternary. BitNet b1.58 stores each weight as one of {−1, 0, +1}, which is log₂3 ≈ 1.58 bits, and from 3B parameters up it matches full-precision LLaMA quality, with reported gains of up to ~7× in memory, ~4× in decode speed, and ~70× in energy.⁸
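To show what "ternary" means mechanically, here is a minimal sketch of absmean-style rounding, the scheme the BitNet b1.58 paper describes: scale the weights by their mean absolute value, round, and clip to {−1, 0, +1}. The function names and the sample weights are hypothetical, and this is not the authors' code. BitNet trains with quantization in the loop, so applying this rounding to an already-trained model would not reproduce the reported quality.

```python
import math


def absmean_ternary(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Quantize weights to {-1, 0, +1} with one shared scale (absmean)."""
    if not weights:
        raise ValueError("weights must be non-empty")
    scale = sum(abs(w) for w in weights) / len(weights)
    # Round to the nearest integer, then clip into the ternary range.
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Approximate reconstruction: every weight becomes -scale, 0, or +scale."""
    return [q * scale for q in quantized]


# Three possible values carry log2(3) bits of information per weight.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

if __name__ == "__main__":
    q, s = absmean_ternary([0.9, -0.05, 0.4, -1.2])  # illustrative values
    print(q, round(s, 4), round(BITS_PER_TERNARY_WEIGHT, 3))
```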

FIG. 03 / REF-002 · Memory for an 8B model (GB)

Precision is the difference between “datacenter” and “pocket”

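Each step down in precision is a step toward the device. Memory to store an 8B model's weights at decreasing bit-widths. Dashed line = a typical flagship-phone RAM budget (~12 GB). At 16-bit the weights alone (16 GB) overflow it; at 8-bit (8 GB) they fit only with little room left for the OS, apps, and KV cache; at 4-bit (4 GB) and below, an 8B model lives inside it comfortably.⁸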
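The figure's numbers follow directly from parameters × bits-per-weight. A minimal sketch, assuming decimal gigabytes and counting weights only (no KV cache, activations, or OS overhead); the function names are hypothetical and the 12 GB budget is the one given in the caption.

```python
PHONE_RAM_GB = 12  # typical flagship-phone budget, per the caption above


def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in decimal GB."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         budget_gb: float = PHONE_RAM_GB) -> bool:
    """Weights-only check; real headroom requirements are larger."""
    return weight_memory_gb(params_billions, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    for bits in (16, 8, 4, 1.58):
        size = weight_memory_gb(8, bits)
        print(f"8B @ {bits:>5} bits: {size:5.2f} GB  fits={fits(8, bits)}")
```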

Engine C — sparse attention

Standard attention scales with the square of context length; at long contexts it can eat 70–80% of decode latency.⁹ The fix arriving across open models: a lightweight indexer selects the top-k most relevant tokens for each query, so cost grows near-linearly as O(kL) for context length L and a fixed k, instead of O(L²). DeepSeek V3.2 ships exactly this, with ~50% lower long-context cost and quality on par with dense attention.⁵

Mixture-of-experts adds a fourth multiplier: only a fraction of parameters fire per token, so “frontier-level intelligence” now ships with ~11–20B active parameters even when the full model is far larger.¹⁰
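The top-k idea is easy to see for a single query. The sketch below is a toy, not DeepSeek's implementation: `attend`, `topk_sparse_attend`, and the `index_scores` input are hypothetical names, and the indexer is treated as a black box that hands back one cheap relevance score per cached token. The expensive softmax attention then runs over only k positions instead of all L.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values, positions):
    """Scaled dot-product attention for one query over the given positions."""
    scale = 1 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in positions])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, positions))
            for d in range(dim)]


def topk_sparse_attend(query, keys, values, index_scores, k):
    """Attend only to the k positions the indexer scored highest."""
    k = min(k, len(keys))
    ranked = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    selected = sorted(ranked[:k])  # keep original token order
    return attend(query, keys, values, selected)


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0], [2.0], [3.0]]
    query = [1.0, 0.5]
    scores = [0.3, 0.1, 0.9]  # illustrative indexer output
    print(topk_sparse_attend(query, keys, values, scores, k=2))
```

With k equal to the full context length the function reduces to dense attention, which is a handy correctness check; the saving comes from holding k fixed as the context grows.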

// why this matters for the equation

Density and MoE shrink active params. Quantization shrinks bits. Sparse attention removes the quadratic blow-up that wrecks the equation at long context. Three of the four levers in the denominator are being pulled down hard, at once. A Stanford study finds the net “intelligence per watt” of local inference improved 5.3× from 2023 to 2025, with the share of real queries a local model handles well rising from 23% to 71%.⁷

03 // Silicon

STREAM 03 / REF-003

The chips are racing up to meet it

Now the numerator. Consumer SoCs stopped treating AI as a side feature and started designing around it. Qualcomm's Snapdragon 8 Elite Gen 5 pushes on-device AI throughput to ~100 TOPS — from the 40–45 class a generation earlier — and quotes ~220 tokens/second on-device, up from ~70 the prior year.¹⁰ Apple took a different route to the same place: the M5 puts a neural accelerator in every GPU core, up to 3.5× the AI compute of the M4.¹¹

FIG. 04 / REF-003 · Peak on-device AI (TOPS)

Mobile AI compute is on its own exponential

Series: Apple Neural Engine · Qualcomm Hexagon

The numerator is climbing too. Peak neural-accelerator throughput, flagship mobile SoCs. Apple's Neural Engine TOPS understate its real gain, because in 2025 much of its AI compute moved into the GPU (up to 3.5× the M4).¹¹ Qualcomm's 2025 jump to ~100 TOPS includes native low-bit (INT2) support, tuned for exactly the quantized models in Stream 02.¹⁰

But TOPS is the wrong hero metric, and it's worth being honest about why. Token generation is memory-bound, not compute-bound: to emit each token, the chip must read the model's active weights from memory once, so decode speed is capped at roughly memory bandwidth divided by the size of those weights. That is why the equation has memory bandwidth in the numerator, not TOPS. It is the term that actually gates speed.

~76 GB/s · flagship phone (A19 Pro)
153 GB/s · thin laptop (M5)
~600 GB/s · pro laptop (M5 Max)

// Memory bandwidth by device class.¹¹ Plug it in: a 7B model at 4-bit (~3.5 GB) on a 600 GB/s laptop clears ~170 tok/s in principle; the same model on a 76 GB/s phone clears ~20 — already past reading speed.
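The "plug it in" arithmetic is a one-line bandwidth bound. A minimal sketch, assuming decode is purely memory-bound and every active weight is read once per token; it ignores KV-cache traffic, compute limits, and real-world efficiency losses, so treat the outputs as ceilings. The function name is hypothetical.

```python
def decode_tokens_per_second(bandwidth_gb_s: float,
                             active_params_billions: float,
                             bits_per_weight: float) -> float:
    """Upper bound on decode speed when weight streaming is the bottleneck."""
    weight_gb = active_params_billions * bits_per_weight / 8
    if weight_gb <= 0:
        raise ValueError("model size must be positive")
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # Bandwidth figures from the device-class list above; 7B model at 4-bit.
    for name, bandwidth in (("flagship phone", 76),
                            ("thin laptop", 153),
                            ("pro laptop", 600)):
        tps = decode_tokens_per_second(bandwidth, 7, 4)
        print(f"{name:>14}: ~{tps:.0f} tok/s upper bound")
```

Because the bound depends on active parameters, a mixture-of-experts model is limited by the weights it fires per token rather than by its full size, although the full model still has to fit in memory.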

04 // The Crossover

CONVERGENCE / REF-004

When do the lines actually cross?

Put the streams back into the equation. Fix a capability target — say, mid-2025 frontier quality (GPT-5 / Gemini-3 class). Then ask one question over time: how much on-device memory does it take to hold the smallest open model that reaches that target, at 4-bit? Density (halving param count every ~3–4 months) and quantization drive the requirement down. Device ceilings sit still. They cross.

FIG. 05 / REF-004 · THE MONEY CHART · GB (log scale)

The descending curve meets the standing ceilings

Series: memory required (4-bit) · laptop ceiling · phone ceiling

The thesis, dated. The accent curve is the memory needed to run a fixed mid-2025-frontier capability on-device, falling as density and quantization compound. Once it drops below the laptop band (~64–128 GB), and then the phone band (~12–16 GB), that capability becomes locally runnable. On current rates: pro laptop ~late 2026, flagship phone ~2027–28. This is a trend extrapolation, not a product roadmap; see caveats.
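The spacing between the two crossovers can be checked without knowing the starting footprint, which the text does not state. If the required memory halves every T months, the time to fall from one level to another is T × log₂(ratio). The sketch below assumes a 3.5-month halving (the slow end of the Densing Law range), ignores any further quantization gains, and treats the ceilings as fixed; the function name is hypothetical.

```python
import math


def months_until_fit(required_gb: float, ceiling_gb: float,
                     halving_months: float = 3.5) -> float:
    """Months for an exponentially shrinking footprint to reach a ceiling."""
    if required_gb <= 0 or ceiling_gb <= 0 or halving_months <= 0:
        raise ValueError("all arguments must be positive")
    if required_gb <= ceiling_gb:
        return 0.0  # already fits
    return halving_months * math.log2(required_gb / ceiling_gb)


if __name__ == "__main__":
    # From the top of the laptop band to the top of the phone band.
    print(months_until_fit(128, 16))
    # From the bottom of the laptop band to the bottom of the phone band.
    print(months_until_fit(64, 12))
```

On these assumptions the phone crossover trails the laptop crossover by somewhat under a year, which is broadly in line with the dates in the caption; a slower halving, or the bandwidth limits discussed in the caveats, would push it later.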

The crossover is not speculative in kind, only in date. It already happened for earlier tiers: GPT-3.5-class quality needed a 175B model and a server in 2022; today it runs comfortably on a laptop, and a small quantized model delivers it on a phone. Epoch's separate measurement finds that frontier capability becomes runnable on a single consumer GPU after an average lag of 6–12 months, and that consumer-class open models improve faster than the frontier itself (+125 vs +80 Elo/year).² The pocket is the next ceiling down.

05 // Caveats

READ HONESTLY / REF-005

What this argument does and doesn't claim

The frontier is a moving target. The honest version is not “your phone runs the frontier.” It is “your phone runs last year's frontier” — and that delay stays roughly constant as both lines advance. The interesting claim is that the gap is now a fixed lag of months, not a permanent tier difference.

Bandwidth is the real wall. Capacity (do the weights fit?) and bandwidth (can they stream fast enough?) are different constraints. Fitting in memory is necessary, not sufficient. Phone bandwidth climbs slower than capacity, so the phone crossover is the later, harder one.

Benchmarks overstate. Lab scores and deployed performance diverge — agentic and real-world tasks routinely show large gaps. “Matches on benchmark X” is not “matches in your product.” And quantization, near-lossless at 4-bit, does degrade below it.

Even so, the direction is not in dispute. Three independently measured exponentials — density, precision, bandwidth — push the same inequality the same way. You needn't believe any single date in Fig. 05 to accept the structure: the cost of a unit of intelligence is falling toward the device.

06 // Implications

SO WHAT / REF-006

What changes when intelligence is ambient

Stanford's telemetry point is the quiet bombshell: ~77% of real AI requests are routine — writing, summarizing, looking things up — tasks that don't need frontier capability and could already run on the device.⁷ We are shipping “write me an email” to a building full of accelerators. The convergence above means we increasingly won't have to.

The second-order effects rhyme with mainframe-to-PC. When a capable model is local, three things change at once: inference stops being metered per token, your data stops leaving the device, and capability stops requiring connectivity. Intelligence becomes a property of hardware you own rather than a subscription to someone else's. It is the same shift that turned computing from a service you queued for into a thing you simply had.

// the counter-force

AI's economic gravity has pulled toward the center: bigger clusters, more capex, rented access. These three curves are the counter-force. Build for the device side of the crossover and you build for where the cost curve is going, not where it has been.

That's the thesis, with the math attached. The lines are real, the slopes are measured, and they bend toward the same point. The only open question is the date stamped on the intersection — and on current evidence, it is closer than the rented-token model assumes.

07 // Sources

REFERENCES / REF-007

Epoch AI — “Open models lag SOTA closed models by 4 months” (2026). epoch.ai/data-insights/open-closed-eci-gap
Epoch AI — “Frontier AI performance becomes accessible on consumer hardware within a year” (2025). epoch.ai/data-insights/consumer-gpu-model-gap
Epoch AI — Trends dashboard: inference cost halving ~2 mo; pre-training efficiency ~3×/yr. epoch.ai/trends
Xiao et al. — “Densing Law of LLMs,” Nature Machine Intelligence (2025); arXiv:2412.04315. nature.com/articles/s42256-025-01137-0
DeepSeek-AI — “DeepSeek-V3.2: DeepSeek Sparse Attention” (2025). api-docs.deepseek.com/news/news250929
DeepSeek-AI — “Native Sparse Attention,” arXiv:2502.11089. arxiv.org/abs/2502.11089
D. Friedman — “Closed Source vs Open Source AI” (2026): MMLU 88% vs 70.5%; DeepSeek V3 2.6M vs Llama-3-405B 30.8M GPU-hrs. davefriedman.substack.com
Saad-Falcon, Narayan et al. — “Intelligence per Watt,” arXiv:2511.07885 (Stanford/Hazy): IPW +5.3×; coverage 23%→71%; 77% routine; Koomey 1.5-yr doubling. arxiv.org/abs/2511.07885
Ma et al. — BitNet b1.58 (ternary; ~7× memory, ~4× decode, ~70× energy from 3B+); AWQ/GPTQ 4-bit. 1.58-bit LLM overview
“Step 3.5 Flash: Frontier-Level Intelligence with 11B Active Parameters,” arXiv:2602.10604. arxiv.org/abs/2602.10604
Qualcomm — Snapdragon 8 Elite Gen 5 (2025): ~100 TOPS, ~220 tok/s, native INT2. qualcomm.com
Apple — M5 newsroom (3.5× AI vs M4; 153 GB/s) & A19 Pro (~76 GB/s); bandwidth as bottleneck. apple.com/newsroom