◆ PILLAR

AI inference economics

Where the LLM serving dollar actually goes: hardware choices, cost structures, open-weight displacement, and why Meta is buying ARM cores by the millions.

Updated 2026-04-28 · Topics: llm-inference, ai-capital

The inversion

For two years, frontier model APIs and open-weight models lived in separate markets. Frontier meant quality; open-weight meant cost. You picked one or you rolled a hybrid.

In April 2026, that distinction collapsed. GPT-5.5 launched at twice the API price of GPT-4. DeepSeek V4 Flash serves at $0.14/M tokens. Kimi K2.6 matches frontier performance as an open-weight model. The cost of the best model doubled in the same quarter that the cost of a good-enough model fell by an order of magnitude. Every production stack built on the old economics is now misaligned.

The correction doesn’t happen automatically. It requires re-routing — a model abstraction layer, a workload classifier, a cost ceiling per request class. Teams that shipped these primitives in 2024 are spending half as much this quarter as teams that hard-coded a vendor.
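The three primitives named above (abstraction layer, workload classifier, per-class cost ceiling) can be sketched in a few lines. This is a minimal illustration, not a production router: the model names, prices, tiers, and the keyword classifier are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRoute:
    name: str                # hypothetical model identifier
    usd_per_m_tokens: float  # illustrative blended price, not a quote
    quality_tier: int        # higher means more capable


# Illustrative catalogue; swap in real models and prices.
ROUTES = [
    ModelRoute("open-weight-small", 0.14, 1),
    ModelRoute("open-weight-large", 1.00, 2),
    ModelRoute("frontier-api", 10.00, 3),
]

# Per request class: (minimum quality tier, cost ceiling in USD per request).
POLICY = {
    "classification": (1, 0.001),
    "summarization": (2, 0.01),
    "complex_reasoning": (3, 0.25),
}


def classify(prompt: str) -> str:
    """Toy keyword classifier; a real one would be rule-driven or learned."""
    text = prompt.lower()
    if "prove" in text or "plan" in text:
        return "complex_reasoning"
    if "summarize" in text:
        return "summarization"
    return "classification"


def route(prompt: str, est_tokens: int) -> ModelRoute:
    """Return the cheapest model that meets the class tier and fits the ceiling."""
    min_tier, ceiling = POLICY[classify(prompt)]
    candidates = sorted(
        (r for r in ROUTES if r.quality_tier >= min_tier),
        key=lambda r: r.usd_per_m_tokens,
    )
    for r in candidates:
        if est_tokens / 1e6 * r.usd_per_m_tokens <= ceiling:
            return r
    raise ValueError("no model fits the cost ceiling for this request class")
```

Because callers only see route(), a price change or a new model is a one-line edit to ROUTES rather than a change scattered across the codebase.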

Where the money goes

The naive mental model — “inference cost scales with token count” — is wrong at production scale. The real cost structure is:

Hardware amortization. GPU instances cost $2-6 per hour whether you’re running them at 10% or 90% utilization. Agent workloads, the fastest-growing inference class, spend 70-80% of wall time in tool-calling phases where the GPU sits idle. Meta’s multi-billion-dollar Graviton5 order is the industry’s largest public admission that agent inference belongs on CPUs, not GPUs. If you’re running agents on H100s without offloading the tool-calling layer, you are paying GPU rates for time the GPU spends idle.
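A rough sketch of the arithmetic behind that claim: because the instance is billed for wall time, the effective price of useful GPU work is the hourly rate divided by utilization. The function names and the throughput figure used in the usage note are illustrative assumptions.

```python
def cost_per_busy_hour(hourly_rate_usd: float, utilization: float) -> float:
    """Effective price of one hour of actual GPU work on a wall-time-billed instance."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_rate_usd / utilization


def cost_per_million_tokens(
    hourly_rate_usd: float, utilization: float, tokens_per_busy_second: float
) -> float:
    """Cost per million generated tokens, given throughput while the GPU is busy."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = 3600 * utilization * tokens_per_busy_second
    return hourly_rate_usd / tokens_per_hour * 1e6
```

For a $4/hour instance whose agent workload keeps the GPU busy 25% of the time, cost_per_busy_hour(4.0, 0.25) is $16. At an assumed 1,000 tokens per busy second, the same instance costs about 3.6 times more per million tokens at 25% utilization than at 90%.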

Prompt engineering vs. prompt caching. Anthropic’s prompt caching, OpenAI’s equivalent, and every serious open-source inference stack now cache repeated system prompts, cutting the cost of those input tokens by 50-90%. Most production chat applications never take advantage of this. The typical chat application has a 2000-token system prompt that gets re-processed on every message. Marking that prompt as cacheable cuts the cost of processing it by up to 90%, an order of magnitude at the top of the range; the saving applies to the cached input tokens, not to output tokens. It is the single most undervalued optimization in the stack.
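The saving is easy to estimate before touching any API. The sketch below assumes a simple pricing model: the first request pays a cache-write price, and every later request pays a discounted cache-read price for the system prompt. The multipliers are parameters, not any provider's published rates, and cache expiry is ignored.

```python
def system_prompt_cost(
    n_messages: int,
    system_tokens: int,
    usd_per_m_input: float,
    read_multiplier: float = 1.0,   # 1.0 means no caching discount
    write_multiplier: float = 1.0,  # surcharge on the first, cache-filling request
) -> float:
    """Cost of processing the system prompt across n_messages requests."""
    if n_messages <= 0:
        return 0.0
    per_token = usd_per_m_input / 1e6
    first = system_tokens * per_token * write_multiplier
    rest = (n_messages - 1) * system_tokens * per_token * read_multiplier
    return first + rest


def caching_saving(n_messages, system_tokens, usd_per_m_input,
                   read_multiplier, write_multiplier=1.0) -> float:
    """Fraction of system-prompt cost removed by caching (0.0 to 1.0)."""
    base = system_prompt_cost(n_messages, system_tokens, usd_per_m_input)
    cached = system_prompt_cost(
        n_messages, system_tokens, usd_per_m_input, read_multiplier, write_multiplier
    )
    return 1 - cached / base
```

With a 2000-token system prompt, 1,000 messages, an assumed $3 per million input tokens, a 0.1 read multiplier, and a 1.25 write multiplier, the system-prompt cost drops from about $6.00 to about $0.61. The overall bill falls by less, since user messages and output tokens are unaffected.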

Egress and storage for RAG. If your application does retrieval-augmented generation, the vector DB cost is often invisible until it’s half your infrastructure bill. pgvector on managed Postgres is usually the right choice until the corpus crosses 10M embeddings, because it keeps your operational surface tiny. Above that, dedicated vector stores (Qdrant, Milvus) pay off.
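To see when storage starts to matter, a back-of-the-envelope sizing helps. This sketch assumes single-precision vectors at 4 bytes per dimension plus 8 bytes of per-vector overhead (the figure given for pgvector's vector type, treated here as an assumption), and it ignores index size and row overhead. The 10M threshold is the rule of thumb from the paragraph above, not a hard limit.

```python
def raw_vector_bytes(
    n_embeddings: int, dims: int, bytes_per_dim: int = 4, overhead: int = 8
) -> int:
    """Raw storage for the vectors alone, excluding indexes and row overhead."""
    return n_embeddings * (dims * bytes_per_dim + overhead)


def suggest_store(n_embeddings: int, threshold: int = 10_000_000) -> str:
    """Rule-of-thumb store choice based only on corpus size."""
    return "pgvector" if n_embeddings <= threshold else "dedicated vector store"
```

At 10M embeddings of 1,536 dimensions, raw_vector_bytes returns roughly 61.5 GB before any index is built, which is about where managed Postgres storage and memory pricing start to show up on the bill.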

Where the money is moving

Wednesday’s synchronized hyperscaler earnings in April 2026 delivered the sharpest verdict yet on AI monetization. Alphabet reported EPS down 7.7% despite 18.5% revenue growth — the first definitive proof that $600B+ in combined industry AI capex is compressing margins at the hyperscaler level. Meta, by contrast, posted 31% revenue growth on the back of AI-embedded advertising. The split is stark: AI as a product compresses margins; AI as a multiplier on existing revenue expands them.

For investors, the implication is that the source of alpha for capital-light companies below hyperscaler scale has permanently shifted away from the model layer, which is now a commodity bet. The orchestration, security, observability, and application layers, where value capture doesn’t require funding the infrastructure arms race, are where the next cycle of returns lives. Microsoft’s stalled Copilot subscriptions and OpenAI’s pivot to Workspace Agents are the most visible symptoms of this realignment.

The operational posture

Three moves this quarter. First, profile your GPU utilization during agent runs. If it’s below 40%, you are a candidate for the CPU migration pattern. Second, enable prompt caching. If you haven’t, you are leaving a 50-90% cost reduction on the table. Third, build a model abstraction layer before you need it. The cost floor will drop again; the teams that can re-route in days, not quarters, are the ones that capture the savings.
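The first move, profiling GPU utilization during an agent run, can start with nvidia-smi sampling and a few lines of parsing. This is a minimal sketch: the function names are illustrative, the input is assumed to be one utilization percentage per line, and the 40% default is the rule of thumb from the paragraph above.

```python
import statistics


def mean_utilization(nvidia_smi_output: str) -> float:
    """Average GPU utilization (percent) from captured sampler output.

    Expects the text produced by, for example:
      nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -l 1
    On multi-GPU hosts each sample emits one line per GPU; all are averaged.
    """
    samples = [
        float(line.strip())
        for line in nvidia_smi_output.splitlines()
        if line.strip()
    ]
    if not samples:
        raise ValueError("no utilization samples found")
    return statistics.fmean(samples)


def is_cpu_migration_candidate(mean_util_percent: float, threshold: float = 40.0) -> bool:
    """True when average utilization falls below the threshold."""
    return mean_util_percent < threshold
```

Capture the sampler output for the full duration of a representative agent run, including tool-calling phases; sampling only during generation overstates utilization.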

The AI infrastructure arms race is heading toward a clear outcome: the hyperscalers pay for the weights, and application-layer companies capture the margin. Plan accordingly.