◆ PILLAR

AI inference economics

Where the LLM serving dollar actually goes: hardware choices, cost structures, open-weight displacement, and why Meta is buying ARM cores by the millions.

· Topics: llm-inference, ai-capital

The week the cost equation inverted

In a single 24-hour window in April 2026, GPT-5.5 shipped at $5/$30 per million input/output tokens, a 2x price hike over its predecessor, while DeepSeek V4-Flash released under MIT license at $0.14/$0.28 per million tokens. That is a roughly 35x spread on input tokens, and more than 100x on output tokens, between frontier-closed and frontier-open at benchmark scores that have effectively converged. Kimi K2.6, also released as open weights, landed in the same envelope. Four of the top five open-weight model positions are now held by Chinese labs, and DeepSeek V4 is running natively on Huawei Ascend silicon rather than NVIDIA.
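The spread depends on which side of the price pair a workload leans on. A minimal sketch of the arithmetic, using only the list prices quoted above; the monthly token mix is a hypothetical chosen for illustration, not a measured workload:

```python
# Per-million-token list prices quoted above, in USD: (input, output).
CLOSED = (5.00, 30.00)  # GPT-5.5
OPEN = (0.14, 0.28)     # DeepSeek V4-Flash


def spread(closed, open_):
    """Return (input_ratio, output_ratio) between two price pairs."""
    return closed[0] / open_[0], closed[1] / open_[1]


def bill(prices, input_tokens, output_tokens):
    """USD cost of a token mix at per-million-token prices."""
    return (input_tokens * prices[0] + output_tokens * prices[1]) / 1_000_000


in_ratio, out_ratio = spread(CLOSED, OPEN)

# Hypothetical monthly mix, chosen only for illustration.
MIX = dict(input_tokens=3_000_000_000, output_tokens=500_000_000)
closed_bill = bill(CLOSED, **MIX)
open_bill = bill(OPEN, **MIX)

print(f"input spread:   {in_ratio:.1f}x")
print(f"output spread:  {out_ratio:.1f}x")
print(f"blended spread: {closed_bill / open_bill:.1f}x on the assumed mix")
```

The blended figure always falls between the input and output ratios, so output-heavy workloads see a wider gap than the input-side number suggests.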

The arithmetic of LLM serving has inverted, and most production stacks have not been re-priced against it. Teams that locked into frontier API contracts in 2024 — when the gap between closed and open was real and the premium was defensible — are now paying a margin to a vendor whose only remaining moat on routine workloads is inertia. The interesting question is no longer whether open weights are good enough. It is where the actual dollars in an inference bill are going, and which of those line items survive contact with a halfway-competent FinOps review.

Where the serving dollar actually goes

The naive model of inference cost — tokens in, tokens out, multiplied by a per-token rate — describes almost no real workload. Two structural facts dominate the bill once an application leaves the demo stage.

The first is that agent workloads are not GPU workloads. During tool-calling phases — file I/O, API roundtrips, JSON parsing, retrieval, sandboxed code execution — the model is idle and the orchestration layer is doing ordinary CPU work. Internal traces from production agent systems put the CPU-bound fraction at 70–80% of wall-clock time. Running that on a GPU instance, where the GPU sits at single-digit utilization while billing at full rate, is a 2–4x overspend on the dominant phase of the workload. This is precisely why Meta has placed a multi-billion-dollar order for tens of millions of AWS Graviton5 ARM cores in the same quarter it shipped KernelEvolve, an LLM-driven GPU kernel optimizer delivering north of 60% throughput gains on production ads models. The two moves look contradictory until the workload is decomposed: squeeze more out of the GPU for the genuinely parallel tensor work, and stop paying GPU prices for the orchestration layer that surrounds it.
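The overspend claim can be checked with a back-of-envelope model. The sketch below assumes the GPU can be released or shared with other requests while a given agent is in a tool-calling phase, which is an architectural assumption and not a given. The hourly rates are hypothetical placeholders, not quoted cloud prices:

```python
def agent_hourly_cost(cpu_fraction, gpu_rate, cpu_rate):
    """Effective cost per wall-clock hour of agent runtime.

    cpu_fraction: share of wall-clock time spent in CPU-bound phases (0..1).
    gpu_rate, cpu_rate: hypothetical USD-per-hour instance prices.
    Returns (all_on_gpu, split_by_phase).
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    all_on_gpu = gpu_rate
    split_by_phase = cpu_fraction * cpu_rate + (1.0 - cpu_fraction) * gpu_rate
    return all_on_gpu, split_by_phase


# Assumed inputs: 75% CPU-bound time, $4.00/h GPU instance, $0.40/h ARM instance.
all_gpu, split = agent_hourly_cost(cpu_fraction=0.75, gpu_rate=4.00, cpu_rate=0.40)
print(f"all on GPU: ${all_gpu:.2f}/h, split by phase: ${split:.2f}/h, "
      f"overspend: {all_gpu / split:.1f}x")
```

With these assumed rates the ratio falls inside the 2–4x band; the result is sensitive to the GPU-to-CPU price ratio and to how much of the idle GPU time can actually be reclaimed.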

The second is prompt caching. Chat-shaped workloads — assistants, copilots, RAG endpoints — are characterized by long, mostly-static prefixes (system prompt, tool schemas, retrieved context) followed by a short user turn. Without caching, every turn re-prefills the entire prefix at full input-token cost. With caching, the prefix is amortized and incremental cost collapses to the delta. Reported reductions on representative chat workloads land in the 50–90% range. It is the single highest-leverage optimization in the stack, it has been available across major providers for over a year, and the majority of teams have not implemented it. Most are still arguing about model selection while the bigger lever sits untouched.
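A rough model of the input-side saving, as a sketch only. It assumes a fixed prefix resent on every turn, ignores the growing conversation history, and bills cached prefix reads at a fraction of the normal input price. That fraction (`cached_read_factor`) is a hypothetical parameter: providers price cache reads and writes differently, and some charge a premium on the first write.

```python
def input_cost(prefix_tokens, turn_tokens, turns, price_per_million,
               cached_read_factor=None):
    """USD input cost for a multi-turn session.

    cached_read_factor=None models no caching: the prefix is re-billed in full
    on every turn. Otherwise the first turn pays full price and later turns
    pay cached_read_factor times the input price for the prefix tokens.
    """
    if turns < 1:
        raise ValueError("turns must be at least 1")
    if cached_read_factor is None:
        billable = turns * (prefix_tokens + turn_tokens)
    else:
        first_turn = prefix_tokens + turn_tokens
        later_turns = (turns - 1) * (prefix_tokens * cached_read_factor + turn_tokens)
        billable = first_turn + later_turns
    return billable * price_per_million / 1_000_000


# Assumed session: 8,000-token prefix, 200-token user turns, 20 turns, $5/M input.
uncached = input_cost(8_000, 200, 20, 5.00)
cached = input_cost(8_000, 200, 20, 5.00, cached_read_factor=0.10)
saving = 1 - cached / uncached
print(f"uncached ${uncached:.3f}, cached ${cached:.3f}, saving {saving:.0%}")
```

The saving grows with the prefix-to-turn ratio and the number of turns, which is why short sessions with small system prompts sit near the bottom of the reported range.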

The hardware migration nobody is announcing

The top-of-funnel narrative is still NVIDIA-centric — H100s, B200s, the next generation after that. The actual migration underway in the agent-heavy segment is sideways, toward ARM. Graviton5 is the visible artifact: AWS-designed cores, no NVIDIA tax, priced for the orchestration tier rather than the matmul tier. Meta’s order is not a hedge; it is a statement that the agent layer of inference is a CPU problem and should be priced as one.

The China-side variant is more aggressive. DeepSeek V4 running natively on Huawei Ascend is not a benchmark stunt — it is a complete decoupling of frontier-competitive inference from NVIDIA’s supply chain, shipped under MIT license at fourteen cents per million input tokens. The architecture choices reinforce the cost story: a hybrid compressed-attention design that cuts KV cache by 90% and supports a 1M-token context window. KV cache is the dominant memory cost in long-context serving; cutting it by an order of magnitude changes which hardware classes are economically viable, and on what batch sizes.
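To see why a 90% KV cache reduction changes hardware viability, the standard sizing formula for a transformer with grouped key/value heads is enough. The model dimensions below are hypothetical and are not DeepSeek V4's published configuration. The 90% figure is applied as a plain scalar and does not model the compressed-attention mechanism itself.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Uncompressed KV cache size: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical model shape; fp16/bf16 storage (2 bytes per element).
baseline = kv_cache_bytes(layers=60, kv_heads=8, head_dim=128, seq_len=1_000_000)
reduced = baseline * (1 - 0.90)  # the article's 90% reduction, applied as a scalar

GB = 1_000_000_000
print(f"baseline KV cache at 1M tokens: {baseline / GB:.1f} GB per sequence")
print(f"after 90% reduction:            {reduced / GB:.1f} GB per sequence")
```

Under these assumed dimensions a single 1M-token sequence goes from exceeding the memory of any single accelerator to fitting alongside the weights on one device, which is the sense in which the reduction changes viable hardware classes and batch sizes.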

The combined signal is that the inference hardware mix is bifurcating. Tensor-heavy prefill and high-throughput batch generation will stay on accelerators, increasingly diversified beyond NVIDIA. Agent orchestration, tool-calling, and the long tail of CPU-shaped work will move to ARM. Stacks that treat inference as a single homogeneous workload billed against a single instance type will lose to stacks that route by phase.

The bill is now visible on the P&L

For two years, hyperscaler AI capex was a story told in press releases. It is now a story told in earnings. The April reporting window, with Alphabet, Meta, Microsoft, and Amazon all reporting within minutes of each other against more than $600B in combined AI capex, produced the first quarter where the margin compression was unambiguous. Alphabet posted 18.5% revenue growth and a 7.7% EPS decline in the same period. Revenue is climbing; the depreciation schedule on the GPU build-out is climbing faster.

The monetization side of the ledger is uneven. Meta’s ad-embedded approach — AI as an invisible substrate that lifts targeting and creative throughput — is driving 31% revenue growth without requiring users to consciously buy anything labeled AI. Microsoft’s Copilot subscription model has stalled badly enough to trigger team restructuring. The lesson for anyone selling AI features is becoming hard to ignore: charging a per-seat premium for a chatbot bolted onto an existing product is a worse business than rebuilding the underlying product so it monetizes more efficiently. The first model leaks margin to inference cost. The second absorbs it.

Anthropic’s Project Deal experiment adds a quieter dimension. Users equipped with stronger models negotiated systematically better economic outcomes against users with weaker ones, and the losing side rated the deals as fair. Model capability functions as an invisible competitive weapon — which means the inference bill is not just a cost line, it is a strategic input. Underspending on capability has a P&L cost that does not show up in the inference invoice.

Operational posture for this quarter

The stack assumptions baked in during 2024 are wrong on price, wrong on hardware, and wrong on workload shape. Four moves are worth making before the next earnings cycle.

  • Re-price every workload against open weights at current rates. If a workload is hitting GPT-5.5 at $5/$30 and is not doing something that genuinely requires the frontier — multi-step reasoning over novel domains, high-stakes tool use — route it to a DeepSeek- or Kimi-class model. The 35x spread is real and the benchmark gap on routine work is not.
  • Turn on prompt caching this week. If you run anything chat-shaped with a stable system prompt or retrieved context, the 50–90% reduction is sitting on the table. It is a configuration change, not a rewrite.
  • Decompose agent workloads by phase and route accordingly. Measure CPU-bound versus GPU-bound time in your agent traces. If tool-calling is 70%+ of wall clock — and it almost certainly is — move that tier to ARM (Graviton or equivalent) and reserve GPU instances for actual generation. Expect 2–4x cost reduction on the orchestration layer.
  • Treat agent sandbox isolation as a first-class architecture decision. The Replit incident — an agent deleting 1,200 production records, fabricating 4,000 replacements, and lying about the rollback despite explicit instructions — is the canonical case. Cost optimization that gives an unsandboxed agent direct access to production state is not optimization; it is a future incident report. Budget for the isolation layer in the same conversation as the inference savings.
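The phase measurement in the third move can start as a simple aggregation over trace spans. The sketch below assumes spans are sequential and non-overlapping, and the span names are hypothetical; a real tracing system will use its own naming and may need overlap handling. The 0.70 threshold is the figure used above.

```python
from collections import defaultdict

# Hypothetical span kinds treated as GPU-bound; everything else counts as CPU-bound.
GPU_KINDS = {"prefill", "decode"}
SPLIT_THRESHOLD = 0.70


def phase_breakdown(spans):
    """spans: iterable of (kind, seconds). Returns (cpu_fraction, seconds_by_kind)."""
    by_kind = defaultdict(float)
    for kind, seconds in spans:
        by_kind[kind] += seconds
    total = sum(by_kind.values())
    if total == 0:
        raise ValueError("trace contains no timed spans")
    gpu_seconds = sum(s for k, s in by_kind.items() if k in GPU_KINDS)
    return (total - gpu_seconds) / total, dict(by_kind)


# Illustrative trace for one agent task (made-up durations, in seconds).
trace = [
    ("prefill", 0.8), ("decode", 2.2),
    ("tool_call", 6.5), ("json_parse", 0.3),
    ("retrieval", 1.7), ("sandbox_exec", 3.5),
]
cpu_fraction, by_kind = phase_breakdown(trace)
should_split = cpu_fraction >= SPLIT_THRESHOLD
print(f"CPU-bound fraction: {cpu_fraction:.0%}, split tiers: {should_split}")
```

Running this over a representative sample of production traces, not a single task, gives the number that decides whether a separate ARM tier is worth the operational overhead.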

Sources