Sunday, May 3, 2026 ~4 min

Cache Hit Rate Is the New Eval Metric for Agent Workloads

DeepSeek's hours-long KV cache delivered a 3.2x effective discount this week while open-weight MoE models closed to within six points of frontier. The model you pick matters less than the harness you wrap around it.

A production dashboard made the rounds this week showing $1,051 in API spend against $3,351 in cache savings on DeepSeek V4 Pro. That's a 3.2x effective discount, and it appears nowhere on a per-token rate card. The mechanism is unglamorous: V4 Pro holds its disk-backed KV cache for hours where most providers evict in five minutes. On a multi-step agent loop with a stable prefix — system prompt, tool schemas, prior observations — the prefill bill is paid once instead of on every step.

This is the actual story of the day. Not Grok 4.3's price cut. Not the Pentagon excluding Anthropic. Not Replit at a billion in ARR. Those are real, and we'll get to them. But the through-line is that cache residency, harness design, and prefix stability now move agent unit economics by larger factors than the gap between any two frontier models. The teams winning this quarter figured that out. The teams renewing three-year API contracts on a per-token comparison sheet have not.

The benchmark gap stopped mattering

Three open-weight MoE models shipped in a single week — DeepSeek V4 Pro at 1.6T total / 49B active, Kimi K2.6 at 1T / 32B, MiMo V2.5 Pro at 1T / 42B — all landing 52–54 on the Artificial Analysis Intelligence Index against GPT-5.5 at 60 and Opus 4.7 at 57. A six-point gap, concentrated in HLE, CritPt, TerminalBench Hard, and hallucination-heavy evals. On the workloads that actually generate revenue — coding, tool use, multi-step planning — these models are at functional parity.

The gap is also smaller than the gap between a tuned harness and a sloppy one. GPT-5.5 beats Opus 4.7 on the Intelligence Index but loses on PostTrainBench inside the Claude Code harness. Same weights, different plumbing, opposite ranking. Grok 4.3 takes first on CaseLaw and CorpFin and scores 11% on ProofBench — the within-model domain variance is wider than the between-vendor gap on any single domain. If your eval doesn't log the harness config — prompt format, tool schema, retry policy, context budget — your benchmark is unfalsifiable. Most aren't logging it.

The practical consequence: a single-harness bake-off between an open-weight model and your incumbent API tells you almost nothing. The open-weight model is being tested through scaffolding tuned for the incumbent's tool-call conventions. The fix is per-family harness tuning before you conclude anything about the weights.

Grok's price cut is a trap for the unmonitored

Grok 4.3 set a new floor at $1.25/M input, $2.50/M output — 40–60% under 4.2. xAI also introduced a $0.05 fee per safety-filter-blocked request. At a 2–3% refusal rate on production traffic, that erodes the savings in a line nobody's dashboard tracks. The headline number is the bait. The refusal-rate instrumentation is the work.

The broader signal is that inference economics are now an acquisition-grade capability. Nebius paid $615M for Eigen AI specifically for inference optimization. Cursor's reported -23% gross margins at the same time as Replit's ~$1B ARR with 300% NRR is the cleanest natural experiment the AI app layer has produced — same category, opposite outcomes, driven entirely by who owns enough of the inference stack to hold their own margin. If your portfolio company resells inference and can't show improving margins under the current deflation curve, it's not an independent business. It's an acqui-sale on a clock.

The agent security pattern that just hardened

The Planner/Executor Split is settling in as the default for agents that touch untrusted content. Planner has tools, never sees untrusted text. Executor reads untrusted text, has no tools. Gmail runs this in production. The honest cost is roughly 2x inference on the protected paths. That's the price that turns indirect prompt injection from catastrophic into recoverable.

MCP-versus-Skills isn't a taste decision — it's a security boundary. MCP runs in containerized JSON-RPC with schema validation. Skills execute arbitrary bash, python, and curl in the agent's own environment with zero isolation. A Skill that ingests untrusted content and shells out is a prompt-injection-to-RCE chain. Most teams haven't audited which of their tools sit on which side of that line. Do it before someone with a clever paragraph does it for you.

Voice cloning hit the 120-second threshold this week via xAI's Custom Voices. Two minutes is a voicemail greeting or the opening of an earnings call. Voice as an implicit auth factor — the CFO calling treasury, the CEO calling the helpdesk for an MFA reset — is dead. Out-of-band callback to a directory-listed number plus a rotating challenge phrase is the floor. Brief finance and the helpdesk this week, not next quarter.

What to do this week

Instrument three metrics in your agent runtime before you compare another model: cache hit rate, prefix-reuse ratio, and effective $/1K tokens net of cache discounts. None of these are in standard observability stacks. All three move cost more than the model choice does.

Then replay last month's actual agent traffic — not a synthetic benchmark — through DeepSeek V4 Pro, Kimi K2.6, and your incumbent, with per-family harness tuning. Measure blended cost per successful task, not sticker price per million tokens. The ranking will surprise you, and that surprise is the point. The teams that did this work in April are renegotiating their cloud AI contracts in May with leverage. The teams that didn't are about to discover that the floor moved without them.

◆ Behind the synthesis

Six specialist takes that fed this piece.

The piece above is one stream in my voice. Below are the six lenses my pipeline produced upstream — each tuned for a different reader. Use them when you want the angle that matters most to your role.

Cache Hit Rate Is the New Eval Metric for Agent Workloads

The benchmark gap stopped mattering

Grok's price cut is a trap for the unmonitored

The agent security pattern that just hardened

What to do this week

Six specialist takes that fed this piece.

Your agentic workload cost model is wrong by roughly 3x because it prices tokens, not KV cache residency.

US and Iran are in active kinetic conflict.

Cache economics now dominates agentic model selection, and price-per-token sheets no longer measure the bottleneck.

A banker I spoke with last month pasted the model's draft into a client memo, then spent forty minutes rewriting it anyway.

OpenAI is now on AWS Bedrock, the Microsoft exclusivity is dissolved, and the AGI clause is gone.

Replit disclosed roughly a billion dollars of ARR with three hundred percent net revenue retention, a 350x jump in eighteen months, while Cursor is reportedly selling to SpaceX at sixty billion on negative twenty-three percent gross margins.