Edition 2026-05-25 · read as Data Science
Anthropic Metering Plus Agentic Loops Spike Claude Costs 70%
Sources: 36 · Words: 1,674 · Read: 8 min
Topics: Agentic AI · LLM Inference · AI Capital
◆ The signal
Anthropic killed the flat-rate Claude subsidy and metered all programmatic usage the same week Vercel confirmed 59% of production tokens are agentic multi-turn traces. Your per-task inference cost just jumped 70-90% on Claude workloads precisely when each task burns 5-15x more tokens than your cost model assumes. Re-price before June 15 or absorb a silent overrun that won't surface until the invoice arrives.
◆ INTELLIGENCE MAP
01 Anthropic's Triple Move: Metering, Capacity Recovery, IPO Countdown
act now · Anthropic metered all programmatic Claude usage (Agent SDK, GitHub Actions, third-party harnesses), killing a 70-90% implicit discount. It is simultaneously recovering from an 8x capacity miss by leasing xAI's 220K-GPU Colossus 1, and targeting an October IPO with ARR tripling from $9B to $30B+ in four months. Enterprise share crossed OpenAI at 34.4% vs 32.3%.
- Capacity miss: 8x
- Colossus GPUs: 220K
- ARR growth: $9B to $30B+
- Enterprise share: 34.4% vs 32.3%
- Deadline: June 15
02 Agentic Majority: 59% of Tokens, Eval Harnesses Blind
act now · Vercel's AI Gateway confirms 59% of production tokens are now agentic multi-turn tool-calling traces. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Eval harnesses still measure single-turn accuracy on the minority of traffic. CyberGym shows multi-agent decompose-debate-verify outperforms monolithic models. Cost models built on 3:1 input/output ratios are off by ~5x on agentic traces at 15:1.
- Anthropic spend share: 61%
- Google volume share: 38%
- I/O ratio (agentic): 15:1
- Old I/O assumption: 3:1
03 Lakehouse & Agent Infra Vulnerability Cluster
monitor · Apache Iceberg CVE-2026-42812 (CVSS 9.9) lets attackers redirect table metadata to poisoned S3 prefixes, silently corrupting training data. Apache Polaris credential broadening gives cross-tenant access. Argo CD 3.2/3.3 exposes plaintext K8s Secrets. The PraisonAI agent framework was exploited within 4 hours of disclosure. An 18-year-old NGINX RCE hits every model-serving gateway. The entire ML reference architecture has CVSS 9.0+ coverage.
- Iceberg CVSS: 9.9
- Polaris CVSS: 9.9
- Argo CD CVSS: 9.6
- PraisonAI exploit: 4 hours from disclosure
- NGINX latent age: 18 years
Ranked by CVSS:
- 01 Iceberg/Polaris: 9.9
- 02 n8n SQLi: 9.8
- 03 Argo CD: 9.6
- 04 Ollama GGUF: 9.1
- 05 NGINX RCE: 9.1
04 Training Efficiency: 2-360x Gains Validated at Scale
monitor · Three papers lower the unit cost of model training. Nous TST delivers a 2-3x wall-clock speedup at matched FLOPs with no inference change (validated 270M→10B MoE). Datology's data curation hits +11.7 points on VLM benchmarks at 17x less compute. NVIDIA Star Elastic produces model-size families from one post-training run at 360x lower cost. DuckDB's Quack protocol also opens a path to replacing Spark/Glue for single-node ETL.
- TST speedup: 2-3x
- Datology savings: 17x less compute
- Star Elastic savings: 360x lower cost
- TST scale: 270M→10B MoE
05 Compute Supply Crunch: 4:1 Demand, Silicon Diversification
background · Nebius reports 4+ customers competing per GPU at 684% YoY revenue growth, guiding $3-3.4B for 2026 vs $530M in 2025. Cerebras IPO'd at $56B with a 70% first-day pop, anchored by OpenAI's $20B commitment. Cisco AI orders are jumping from $5B to $9B. The Stratos 9GW datacenter is facing 4,000 complaints and a referendum. A memory hardware shortage is explicitly driving product redesigns at the networking layer.
- Nebius YoY growth: 684%
- Cerebras valuation: $56B
- OpenAI→Cerebras: $20B
- Stratos complaints: 4,000
- Cisco AI orders: $5B→$9B
- Nebius 2025 rev: $530M
- Nebius 2026 guide: $3.2B (midpoint of the $3-3.4B range)
◆ DEEP DIVES
01 Anthropic's Margin Play: Metering + Capacity + IPO Changes Your Cost Model This Month
What Happened
Anthropic moved on pricing, capacity, and IPO posture in the same week. The combined effect resets inference economics for any team running Claude in production.
- Programmatic usage metered: Claude subscriptions now convert to dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and every third-party harness. The implicit 70-90% effective discount on alt-harness usage is gone. We flagged this loophole back when claude-p shipped. It has now closed.
- Capacity recovery: After an 8x miss on their capacity plan (80x actual growth against 10x projected), Anthropic leased xAI's entire 220,000-GPU Colossus 1 cluster (H100, H200, GB200). Rate limits are coming up. Claude Code 5-hour caps double, peak throttling is removed, and Opus API limits go up "substantially," which is a word doing real work until someone publishes a number.
- IPO pricing pressure: An October IPO is the target, with ARR up from $9B to $30B+ in four months. Margin-per-token is now a board metric. Inference cost work that used to be an engineering KPI is now an investor KPI.
Cross-Source Triangulation
Independent sources confirm the pricing impact from different angles; the table below collects five data points:
Source | Data Point | Implication
Ramp AI Index | Anthropic 34.4% vs OpenAI 32.3% of US businesses | Market leader raising prices from position of strength
ServiceNow CDIO | Burned full-year Claude budget by May | Token economics at scale already unpredictable
National Life CIO | 'Not great for companies wanting per-user monitoring' | No native cost attribution = silent overruns
OpenAI | 2-month-free Codex enterprise switch promo (same day) | Competitive counter-offensive targeting alienated devs
Opus 4.7 | 3x image processing cost increase | Per-modality pricing moving unpredictably

ServiceNow blew through its full-year Anthropic budget by May. That is the concrete failure mode when token economics move faster than procurement cycles.
What This Doesn't Tell You
The enterprise share number (34.4% vs 32.3%) is drawn from Ramp card-spend data, which skews SMB and mid-market. Large enterprise contracts are paid by invoice and do not show up in card data. OpenAI is right to flag the gap. A 2.1-point lead on that dataset is directional, not definitive. Combined with the pricing changes, the read is that Anthropic believes it can raise prices without bleeding volume, and ServiceNow's budget overrun is the early confirmation.
The capacity recovery adds complexity: any Claude benchmark run between mid-April and May 7 is stale. The serving fleet now includes heterogeneous hardware (GB200 via Colossus). Expect p95/p99 latency variance during integration. Re-baseline after the new caps land, not before.
Action items
- Reconcile all Claude-backed workloads (Agent SDK, GitHub Actions, third-party IDE) against new credit cap by end of next week
- Deploy LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily token budget alerts before June 15
- Run OpenAI's 2-month-free Codex promo as a controlled head-to-head through your own eval harness
- Avoid signing long-term commits with Anthropic until post-Colossus integration stability is observable (target August evaluation)
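The per-user, per-feature tagging and daily budget alert can be prototyped before a gateway is in place. What follows is a minimal sketch, not any gateway's API: the `Usage` record, the prices in `PRICE_PER_MTOK`, and the default budget are all hypothetical placeholders to be replaced with your own contract rates and log schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical USD prices per million tokens; substitute your contract rates.
PRICE_PER_MTOK = {"input": 5.00, "output": 25.00}


@dataclass(frozen=True)
class Usage:
    user: str
    feature: str
    input_tokens: int
    output_tokens: int


def cost_usd(u: Usage) -> float:
    """Dollar cost of one request at the assumed prices."""
    return (u.input_tokens * PRICE_PER_MTOK["input"]
            + u.output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000


def daily_spend(records):
    """Sum cost per (user, feature) tag for one day's records."""
    totals = defaultdict(float)
    for r in records:
        totals[(r.user, r.feature)] += cost_usd(r)
    return dict(totals)


def over_budget(totals, budgets, default_budget=10.0):
    """Return tags whose spend exceeds their daily budget, sorted for stable output."""
    return sorted(tag for tag, spent in totals.items()
                  if spent > budgets.get(tag, default_budget))


# Illustrative records only; real ones would come from gateway or proxy logs.
records = [
    Usage("ana", "code-review", 2_000_000, 100_000),
    Usage("ben", "summarize", 200_000, 50_000),
]
totals = daily_spend(records)
alerts = over_budget(totals, budgets={})
print(alerts)
```

Keying on the (user, feature) pair is the design choice that matters here: it answers the per-user monitoring complaint and the per-feature attribution question with the same table.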
Sources: Claude just metered your agent SDK calls · Anthropic passes OpenAI in B2B · Claude Code latency on long-context requests · Anthropic shipped without the telemetry hooks · Vercel published a number worth sitting with · Ramp's corporate card data
02 The Agentic Eval Gap: 59% of Your Tokens Are Unmeasured
The Production Reality
Vercel's AI Gateway telemetry, covering 200,000 teams over seven months, puts agentic workloads at 59% of all token volume. That is not a forecast. It describes present-tense production. Six months ago the agentic share sat under 20%. The composition has shifted faster than anything since completion endpoints gave way to chat.
The cost structure is not the same shape. Agentic traces run 15:1 input-to-output against 3:1 for chat, with heavy cache reuse on some providers and none on others. A spend forecast built on last year's ratio understates input tokens per unit of output by roughly 5x (15:1 against 3:1). The spend-versus-volume split confirms the tiered pattern: Anthropic captures 61% of dollars via Opus on planning and reasoning nodes, while Google captures 38% of volume via Flash on high-throughput utility calls.
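One way to read the 5x figure is as a token multiple, which only becomes a dollar multiple once prices are applied. The sketch below works that arithmetic under stated assumptions: the prices are hypothetical (output priced at 5x input), caching discounts are ignored, and output volume is held fixed while the input-to-output ratio moves from 3:1 to 15:1.

```python
def cost_per_output_token(io_ratio: float, price_in: float, price_out: float) -> float:
    """Cost attributable to one output token, given io_ratio input tokens per output token."""
    return io_ratio * price_in + price_out


def ratio_multiples(old_ratio: float, new_ratio: float, price_in: float, price_out: float):
    """Return (input-token multiple, dollar multiple) when the I/O ratio shifts."""
    input_multiple = new_ratio / old_ratio
    cost_multiple = (cost_per_output_token(new_ratio, price_in, price_out)
                     / cost_per_output_token(old_ratio, price_in, price_out))
    return input_multiple, cost_multiple


# Hypothetical prices in arbitrary units: output costs 5x input.
input_multiple, cost_multiple = ratio_multiples(3, 15, price_in=1.0, price_out=5.0)
print(input_multiple, cost_multiple)
```

Under these assumed prices the input-token multiple is 5x while the dollar multiple is 2.5x; the dollar multiple approaches 5x only as output pricing becomes negligible relative to input. Plug in your own rates and cache hit rates before re-forecasting.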
Why Current Evals Are Measuring the Wrong Thing
Most eval harnesses score single-turn responses against reference answers. That was the right call in 2023. With 59% agentic traffic, the harness is optimizing the minority of production. Three sources converge on what the harness is not measuring:
What breaks in production | What the harness measures | The gap
Planner burns 40K tokens arguing with itself | Final-answer accuracy (90%+) | Cost path invisible
Tool-call fails on retry 3, succeeds on retry 5 | End-state correctness | Retry cost hidden
Agent passes by gaming the benchmark | Pass@1 rate | Reward hacking invisible
MCP tool listings balloon context | Task completion | 30% token overhead unmeasured

The CyberGym result adds architectural evidence. Microsoft's MDASH (100+ agents, scan→debate→exploit pipeline) outperformed Anthropic's monolithic Mythos on vulnerability benchmarks. The decompose-debate-verify pattern is consistent with classical ML ensemble priors. No ablation isolates which stage contributes the lift, and no cost comparison is published. With no published numbers to anchor on, a cautious working assumption is about half the reported ensemble lift on production traffic once retries and tool-call failures are accounted for.
If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.
The Reference Architecture That's Winning
Abridge, with 80M+ clinical conversations across 250 health systems, disclosed its production pattern: a constellation of models with fast/slow routing. Cheap triage in front, expensive reasoning behind, LLM judges calibrated against human annotations, memory externalized into event-driven stores. At scale this is the de facto pattern now. Teams without model routing and calibrated judges are paying the gap on the inference bill.
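The fast/slow split can be sketched as a routing table keyed on node type. This is an illustration of the general pattern, not Abridge's implementation: the tier labels, the `UTILITY_NODES` set, and the node-type names are hypothetical and would map to whatever your agent graph and gateway actually expose.

```python
# Hypothetical tier labels; map them to the models your gateway exposes.
CHEAP_TIER = "flash-or-haiku-class"
REASONING_TIER = "opus-class"

# Node types treated as utility calls (assumed labels on your agent graph).
UTILITY_NODES = {"summarization", "json_extraction", "query_rewriting"}


def route(node_type: str) -> str:
    """Send utility nodes to the cheap tier, everything else to the reasoning tier."""
    return CHEAP_TIER if node_type in UTILITY_NODES else REASONING_TIER


def token_share_by_tier(calls):
    """calls: iterable of (node_type, tokens). Returns tier -> share of total tokens."""
    totals = {CHEAP_TIER: 0, REASONING_TIER: 0}
    for node_type, tokens in calls:
        totals[route(node_type)] += tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {tier: 0.0 for tier in totals}
    return {tier: count / grand_total for tier, count in totals.items()}


# Illustrative trace: token counts per node are made up.
calls = [("planning", 4_000), ("summarization", 12_000),
         ("json_extraction", 3_000), ("reasoning", 1_000)]
shares = token_share_by_tier(calls)
print(shares)
```

Running the share calculation over real traces before changing any routing shows how much volume is eligible for the cheap tier, which is the number that decides whether the migration is worth a sprint.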
Glean's sponsored benchmark claims off-the-shelf MCP uses 30% more tokens than a tuned knowledge graph on agentic tasks. Methodology undisclosed, so the right read is hypothesis, not result. The failure mode it points at is real: MCP tool listings balloon context windows, and naive tool outputs return verbose blobs where a reranked snippet would do the job.
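Since the 30% figure is a hypothesis, the useful move is to measure your own overhead. A minimal sketch follows, assuming you can replay the same traces through a tool-calling arm and a retrieval-first baseline and record total tokens for each; the function name and the sample numbers are illustrative, not measurements.

```python
def token_overhead(tool_arm_tokens, baseline_tokens):
    """Relative token overhead of the tool-calling arm over the baseline, on paired traces."""
    if len(tool_arm_tokens) != len(baseline_tokens):
        raise ValueError("arms must cover the same traces, in the same order")
    baseline_total = sum(baseline_tokens)
    if baseline_total == 0:
        raise ValueError("baseline token total must be positive")
    return (sum(tool_arm_tokens) - baseline_total) / baseline_total


# Made-up token counts for three paired traces.
tool_arm = [1_300, 2_600, 3_900]
baseline = [1_000, 2_000, 3_000]
overhead = token_overhead(tool_arm, baseline)
print(f"{overhead:.0%}")
```

Pairing the traces matters: comparing two different traffic samples would mix task difficulty into the overhead number.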
Action items
- Add trajectory-level metrics to eval harness this sprint: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
- Instrument per-node token cost in agent graphs and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models within 2 weeks
- Run a 1-hour spike measuring MCP/tool-calling token overhead vs. retrieval-first baseline on 100 production traces
- Persist full agent trajectories (tool calls, intermediate state, file diffs) and audit a stratified random sample of 'passing' rollouts for reward hacking
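The four trajectory-level metrics in the first action item can be computed from persisted rollouts with very little code. The sketch below is one possible set of definitions, not a standard: the `Trajectory` fields are hypothetical, precision counts calls to tools that a labeled reference solution needs, and cost-per-successful-task spreads the cost of failed rollouts over the successes.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    called_tools: list      # tool names in call order
    required_tools: set     # tools a labeled reference solution needs
    steps: int
    cost_usd: float
    succeeded: bool
    errors_hit: int
    errors_recovered: int


def _ratio(numerator, denominator):
    """Safe division; returns 0.0 when there is nothing to divide by."""
    return numerator / denominator if denominator else 0.0


def trajectory_metrics(trajs):
    calls = sum(len(t.called_tools) for t in trajs)
    relevant = sum(1 for t in trajs for c in t.called_tools if c in t.required_tools)
    required = sum(len(t.required_tools) for t in trajs)
    covered = sum(len(t.required_tools & set(t.called_tools)) for t in trajs)
    wins = [t for t in trajs if t.succeeded]
    return {
        "tool_precision": _ratio(relevant, calls),
        "tool_recall": _ratio(covered, required),
        # Mean steps over successful rollouts only.
        "steps_to_completion": _ratio(sum(t.steps for t in wins), len(wins)),
        # Failed rollouts still cost money, so they stay in the numerator.
        "cost_per_successful_task": _ratio(sum(t.cost_usd for t in trajs), len(wins)),
        "recovery_rate": _ratio(sum(t.errors_recovered for t in trajs),
                                sum(t.errors_hit for t in trajs)),
    }


# Two illustrative rollouts: one success with a retry, one failure.
trajs = [
    Trajectory(["search", "search", "read", "write"], {"search", "read", "write"},
               steps=4, cost_usd=0.40, succeeded=True, errors_hit=1, errors_recovered=1),
    Trajectory(["search", "browse"], {"search", "read"},
               steps=2, cost_usd=0.20, succeeded=False, errors_hit=1, errors_recovered=0),
]
metrics = trajectory_metrics(trajs)
print(metrics)
```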
Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs
03 Lakehouse & Agent Infra: The Entire Reference Architecture Has CVSS 9.0+ Coverage
The New Vulnerability Cluster
This week's CVE disclosures land across every layer of the data/ML stack at once. LiteLLM KEV and Ollama were flagged earlier in the week. Four additional critical vulnerabilities hit components most data teams actually run in production:
Component | CVE / CVSS | Failure Mode | Blast Radius
Apache Iceberg | CVE-2026-42812 / 9.9 | Metadata write redirect to attacker S3 | Poisoned training data, corrupted features
Apache Polaris | CVE-2026-42809-11 / 9.9 | Credential broadening | Cross-tenant S3/GCS access
Argo CD 3.2/3.3 | CVE-2026-42880 / 9.6 | Low-privilege → plaintext K8s Secrets | Model-registry tokens, HF PATs, cloud creds
PraisonAI | CVE-2026-44338 | Auth bypass | Agent secrets, connected DBs, tool-calls
NGINX rewrite | 18-year-old RCE | Unauthenticated code execution | Model-serving ingress, registry creds

Why Iceberg Is the Scariest One
The Iceberg vulnerability is the one to read first. An attacker with table-write permission can redirect metadata pointers to an attacker-controlled S3 prefix. The next query reads poisoned Parquet. The next training run ingests silently corrupted features, and most lakehouse observability is instrumented for row-level changes, not pointer mutations. The thing CVSS doesn't tell you is that the bottleneck here is detection, not patching. Combined with Polaris credential broadening, there is a plausible path from compromised analyst notebook to cross-tenant data theft.
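Because detection is the bottleneck, a pointer-mutation check is worth sketching. The code below assumes you can snapshot a mapping of table name to current metadata location from your catalog on a schedule; how you obtain that mapping is left out, and the bucket names, prefixes, and function names are hypothetical. It flags any table whose metadata pointer changed to a location outside an allowlisted prefix.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; use the storage prefixes your catalog is meant to write to.
ALLOWED_PREFIXES = ["s3://lakehouse-prod/warehouse/"]


def _split(uri):
    """Break a storage URI into (scheme, bucket, path)."""
    parsed = urlparse(uri)
    return parsed.scheme, parsed.netloc, parsed.path


def is_allowed(location, allowed_prefixes=ALLOWED_PREFIXES):
    """True if location sits under an allowlisted scheme, bucket, and path prefix."""
    scheme, bucket, path = _split(location)
    for prefix in allowed_prefixes:
        p_scheme, p_bucket, p_path = _split(prefix)
        if not p_path.endswith("/"):
            p_path += "/"  # stop "warehouse" from matching "warehouse-evil"
        if (scheme, bucket) == (p_scheme, p_bucket) and path.startswith(p_path):
            return True
    return False


def suspicious_pointer_changes(previous, current, allowed_prefixes=ALLOWED_PREFIXES):
    """Tables whose metadata location changed to somewhere outside the allowlist."""
    flagged = []
    for table, location in current.items():
        if previous.get(table) != location and not is_allowed(location, allowed_prefixes):
            flagged.append(table)
    return sorted(flagged)


# Illustrative snapshots of table -> metadata location.
previous = {
    "db.events": "s3://lakehouse-prod/warehouse/db/events/metadata/v1.metadata.json",
    "db.users": "s3://lakehouse-prod/warehouse/db/users/metadata/v1.metadata.json",
}
current = {
    "db.events": "s3://lakehouse-prod/warehouse/db/events/metadata/v2.metadata.json",
    "db.users": "s3://attacker-bucket/warehouse/db/users/metadata/v2.metadata.json",
}
flagged = suspicious_pointer_changes(previous, current)
print(flagged)
```

A normal commit moves the pointer within the same prefix and passes silently; only the cross-bucket move is flagged. This is a detection aid, not a substitute for patching or for catalog-side credential scoping.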
The PraisonAI 4-hour exploitation is the second pattern worth internalizing. Agent frameworks have crossed the adoption threshold where threat actors watch the disclosure feeds. Four hours from CVE to working exploit is not a patching window. It is a containment window. LangChain, CrewAI, and AutoGen sit in the same risk bucket until shown otherwise.
Draw a reference architecture for a modern data team, throw darts at it, and every throw hits a CVSS of 9.0 or higher this week.
The Attacker Economics Have Shifted
Honeypot data shows exposed inference endpoints get indexed by Shodan within 3 hours and absorb 113K+ requests monthly. 23% of observed traffic targets AI-specific paths: /v1/models, /api/tags, /.cursor/rules, /.well-known/mcp.json. LLMjacking is a discipline with mature tooling now. LLM-Scanner shipped a honeypot-detection update mid-experiment, which is the kind of detail that tells you where the attacker investment is going. The attacker's cost is a Shodan query. The defender's cost is whatever the provider bills for stolen tokens, plus whatever the model was allowed to see.
The NGINX 18-year RCE deserves separate attention. Every model-serving gateway behind NGINX is in blast radius, which is most of them. Battle-tested and audited are not the same thing. This is an 18-year existence proof.
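The AI-specific path list from the honeypot data can be turned into a quick check against your own ingress logs. A minimal sketch, assuming you can extract request paths as strings; the function names are hypothetical, and the path list is only the four examples quoted above, not a complete signature set.

```python
# The four AI-specific paths named in the honeypot data.
AI_PATHS = ("/v1/models", "/api/tags", "/.cursor/rules", "/.well-known/mcp.json")


def is_ai_probe(path: str) -> bool:
    """True if the request path, minus query string and trailing slash, is a known AI path."""
    clean = path.split("?", 1)[0].rstrip("/") or "/"
    return clean in AI_PATHS


def ai_probe_share(paths) -> float:
    """Fraction of requests that hit AI-specific paths."""
    paths = list(paths)
    if not paths:
        return 0.0
    return sum(is_ai_probe(p) for p in paths) / len(paths)


# Illustrative request paths, as they might appear in an access log.
sample = ["/v1/models", "/index.html", "/api/tags?x=1", "/wp-login.php"]
share = ai_probe_share(sample)
print(share)
```

Comparing your own share against the 23% honeypot figure gives a rough sense of whether an endpoint is already on scanner lists.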
Action items
- Patch Argo CD to ≥3.2.12/≥3.3.10 and rotate every K8s Secret in reachable namespaces this week
- Audit Iceberg/Polaris catalog configurations: enforce explicit storage credential scoping and add write-path allowlisting for table metadata locations
- Patch NGINX across all inference gateways and audit rewrite-module usage in model routing configs
- Inventory all agent frameworks (PraisonAI, LangChain, CrewAI, AutoGen), pin versions, subscribe to CVE feeds, require same-day patching SLA
Sources: LiteLLM landed in the KEV catalog this week · PraisonAI zero-day was popped in four hours · An Ollama endpoint exposed to the public internet · Agent stacks are now in scope for attackers
04 Training Efficiency Results Worth Spiking: 2-17x Validated This Week
Three Results, One Direction
Three independent research drops landed in the same cycle. Each one moves training unit economics in a way that survives a second read:
Work | Claim | Scale Validated | Inference Impact | Replication Risk
Nous TST | 2-3x wall-clock at matched FLOPs | 270M → 10B-A1B MoE | None; no architecture change | Medium; single-source, clean claim
Datology VLM | +11.7 pts on 20 benchmarks; 17x less compute | 2B and 4B params | Lower response FLOPs; real serving win | Medium; benchmark-selection risk
NVIDIA Star Elastic | 360x cheaper model-size family from one run | Not specified | Produces family of sizes | High; lab-reported headline number

What to Spike First
TST is the cheap experiment to run first. It is a pretraining recipe change with no inference-side coupling, which means the downside is bounded to a failed run. If it replicates at even 1.6x on a 1B continued-pretraining run without a val-loss regression, it pays for itself on the next full run. The claim is narrow and testable: token superposition during training, standard inference.
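The pays-for-itself claim reduces to a short break-even calculation. The sketch below assumes training cost scales with wall-clock time, so a speedup of s saves a fraction 1 - 1/s of the next run; the dollar figures are hypothetical and the function names are illustrative.

```python
def wallclock_saved_fraction(speedup: float) -> float:
    """Fraction of wall-clock time saved at a given speedup, e.g. 2x saves half."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def pays_for_itself(spike_cost: float, full_run_cost: float, speedup: float) -> bool:
    """True if savings on the next full run cover the cost of the spike."""
    return full_run_cost * wallclock_saved_fraction(speedup) >= spike_cost


# Hypothetical budget: a $20K spike ahead of a $500K full run.
saved = wallclock_saved_fraction(1.6)
worth_it = pays_for_itself(spike_cost=20_000, full_run_cost=500_000, speedup=1.6)
print(saved, worth_it)
```

At 1.6x the saved fraction is 37.5%, so under these assumptions the spike breaks even whenever it costs less than 37.5% of the next full run.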
Datology's result is the strongest evidence this year that the marginal dollar in VLM training has moved from compute to curation. They beat InternVL3.5-2B by ~10 points at 17x less training compute, purely via data selection. Their 4B model approaches frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B. The thing this doesn't tell you is how the curation pipeline transfers to a different data distribution, but the inference FLOP reduction is a real serving cost win regardless.
Star Elastic's 360x number is the kind of claim that always shrinks under independent eval. Even a 30x hold would restructure how size tiers get produced for deployment. Worth watching. Not worth spiking yet.
The marginal dollar in VLM training moved from compute to curation this week. Datology's result is the cleanest evidence: +11.7 points at 17x less compute, purely via data selection.
Adjacent: DuckDB Client-Server Opens Single-Node ETL Replacement
DuckDB's Quack HTTP protocol turns embedded DuckDB into a proper client-server engine. Combined with the published ECS Fargate plus Terraform pattern, this is a credible path to replacing Spark-on-Glue for single-node workloads. The immediate action is mechanical: pull the Glue/EMR job catalog, sort by working-set size, and find every job under ~100GB running on a two-node cluster for 20 minutes because someone once wrote it in PySpark. The common outcome after porting is a 50%+ reduction in both $/run and p95 latency.
Caveat on Quack: default config is localhost-bound, no SSL, token auth only. Harden before production.
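The catalog audit is a filter and a sort. A minimal sketch follows, assuming you have already exported job metadata into records; the `Job` fields and sample jobs are hypothetical, and no Glue or EMR API is called here.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    working_set_gb: float
    nodes: int
    runtime_min: float


def single_node_candidates(jobs, max_working_set_gb=100.0):
    """Jobs small enough for one node, smallest working set first."""
    small = [j for j in jobs if j.working_set_gb < max_working_set_gb]
    return sorted(small, key=lambda j: j.working_set_gb)


# Illustrative catalog export.
jobs = [
    Job("daily_sessions", working_set_gb=40.0, nodes=2, runtime_min=20.0),
    Job("clickstream_backfill", working_set_gb=900.0, nodes=12, runtime_min=180.0),
    Job("dim_refresh", working_set_gb=5.0, nodes=2, runtime_min=8.0),
]
candidates = [j.name for j in single_node_candidates(jobs)]
print(candidates)
```

Starting from the smallest working set keeps the first port low-risk; the ~100GB threshold is the article's rule of thumb, not a DuckDB limit.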
Action items
- Spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOPs baseline this quarter
- Audit Glue/EMR job catalog for single-node candidates (<100GB) and port one to the Fargate + DuckDB + Terraform pattern
- If you train VLMs, allocate 20% of next training compute budget to a data curation experiment replicating Datology's approach
- Watch Star Elastic for independent replication before committing to the 360x number
Sources: Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week · TLDR Data
◆ QUICK HITS
Kafka Share Groups report 8x throughput scaling at 32 consumers on I/O-bound workloads — benchmark against your most partition-bound embedding/enrichment consumer group before over-partitioning again
DuckDB shipped a client-server mode this week
Only 15% of organizations have the data foundation for agentic AI per Fivetran — use as base rate when scoping agent projects; half are actually data-platform projects with an agent on top
TLDR Data
TML-Interaction-Small reports 0.40s turn-taking latency vs 1.18s for GPT-Realtime-2.0 — a 3x gap on the metric that determines perceived voice-agent naturalness, but no disclosed p95 or concurrency behavior
TML is reporting 0.40 seconds of full-duplex latency
Duolingo publicly pegs AI-generated content slop at ~20% requiring human QC — use as calibration anchor for your own generation pipeline rejection rate
Duolingo's twenty percent AI slop rate
LLM-as-a-Verifier (decomposed binary verifications at token granularity) eliminates tie-rate problems plaguing LLM-as-a-Judge — rewrite one pairwise judge as a test this sprint
An Ollama endpoint exposed to the public internet
Affirm claims transformer-based underwriting outperforms GBMs at scale (27M consumers) — new COSO/PCAOB guidance makes non-deterministic ML non-compliant in regulated finance; audit seed management
The transformer underwriting models are outperforming
Update: Gemini reproducibly emits real individuals' phone numbers from training data — add PII extraction eval (canary insertion + divergence attacks) to LLM CI before next release
Gemini is the latest model to surface PII
Persona drift in LLM agents occurs within 8 conversational turns per COLM 2024 — add a verbal-tic canary and measure turn-index of first drop as a free drift signal
AI personas drift within eight turns
x402 batched settlement (sub-cent machine-to-machine payments) is now built into AWS AgentCore Bedrock under Linux Foundation governance — spike if building agents that consume metered tools
x402 + Bedrock AgentCore
◆ Bottom line
The take.
Anthropic killed the flat-rate Claude subsidy and metered all programmatic usage the same week production data confirmed 59% of inference tokens are agentic multi-turn traces running at 5x the cost of your model's assumptions. Meanwhile, Apache Iceberg, Polaris, Argo CD, and PraisonAI all shipped CVSS 9.0+ vulnerabilities that cover the entire data/ML reference architecture. The three things your stack needs before June 15: a gateway with per-workload cost attribution, trajectory-level eval metrics replacing single-turn accuracy, and credential rotation on every component that touches your lakehouse catalog.
Frequently asked
- What's the hard deadline to re-price Claude workloads before silent overruns hit?
- June 15 is the cutoff. Anthropic has converted Claude subscriptions into dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and third-party harnesses, ending the implicit 70-90% discount on alt-harness usage. Any workload not reconciled against the new credit cap by that date will silently overrun and only surface on the invoice.
- Why are single-turn eval harnesses inadequate when 59% of tokens are agentic?
- They optimize for the minority of production traffic. Agentic traces run roughly 15:1 input-to-output versus 3:1 for chat, and final-answer accuracy hides cost-path explosions, retry burn, MCP context bloat, and reward hacking. Trajectory-level metrics — tool-call precision/recall, steps-to-completion, cost-per-successful-task, and recovery-from-error rate — are needed to represent actual production behavior.
- Which CVE this week poses the biggest silent data-poisoning risk to ML pipelines?
- Apache Iceberg CVE-2026-42812 (CVSS 9.9). An attacker with table-write permission can redirect metadata pointers to an attacker-controlled S3 prefix, causing the next query to read poisoned Parquet and the next training run to ingest corrupted features. Most lakehouse observability watches row-level changes, not pointer mutations, so detection — not patching — is the bottleneck.
- Which training efficiency result is the safest to spike first, and why?
- Nous Token Superposition Training. It's a pretraining recipe change with no inference-side coupling, so the downside is bounded to a failed run. If it replicates at even 1.6x wall-clock on a 1B continued-pretraining run without a validation-loss regression, it pays back on the next full run. Datology's VLM curation result is higher-impact but harder to transfer; Star Elastic's 360x claim needs independent replication.
- What's the quickest win for cutting agentic inference cost without quality loss?
- Per-node model routing in agent graphs. Vercel telemetry shows Anthropic capturing 61% of dollars via Opus on planning while Google captures 38% of volume via Flash on utility calls — the tiered pattern is already winning in production. Routing utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models typically yields 20-40% cost reduction at constant trajectory completion rate.