Edition 2026-05-15 · read as Data Science
Anthropic Ends Subscription Arbitrage, Resets Agent Economics
Sources: 36 · Words: 1,514 · Read: 8 min
Topics: Agentic AI · LLM Inference · AI Capital
◆ The signal
Anthropic killed the 70-90% effective discount on programmatic Claude usage this week — subscriptions now convert to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses — while simultaneously admitting they planned for 10x growth and got 80x, forcing an emergency lease of xAI's entire 220,000-GPU Colossus 1 cluster. OpenAI dropped a 2-month-free Codex enterprise switch promo the same day. If you haven't re-run unit economics on your agent stack since Monday, you're making a pricing decision by default.
◆ INTELLIGENCE MAP
01 Anthropic Metering Cliff + OpenAI Counter-Offensive
act now · Claude subscriptions now meter at API list price across all programmatic surfaces. On June 15, third-party tool credits become separate, non-rolling buckets. ServiceNow already burned its full-year Claude budget by May. OpenAI's 2-month free Codex promo is an asymmetric-payoff evaluation window.
- Growth miss: 80x actual vs. 10x planned
- Colossus GPUs leased: 220,000
- Anthropic B2B share: 34.4%
- OpenAI B2B share: 32.3%
- June 15 deadline
02 59% Agentic Volume: Eval Harnesses Measure the Minority
act now · Vercel's AI Gateway shows 59% of production tokens are now multi-turn agentic traces, not single-shot completions. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Most eval harnesses still score single-turn responses, so they're benchmarking traffic that no longer dominates.
- Opus spend share: 61%
- Flash volume share: 38%
- MCP token overhead: +30%
- Teams tracked: 200,000
03 AI Cyber Capability Crosses Discrete Threshold
monitor · Anthropic's Mythos is the first model to clear both AISI attack ranges (full network takeover). Mozilla's custom harness surfaced 271 Firefox bugs vs. curl's 1: same model, 271x delta from scaffolding alone. Google confirmed a threat actor using LLMs for live espionage tooling. Patch SLAs calibrated to human speed are now mis-calibrated.
- Mozilla bugs found: 271
- curl bugs found: 1
- MDASH Windows fixes
- PraisonAI exploit time: 4 hours
04 Training Efficiency: Three Results Shift Unit Economics
monitor · Nous TST reports 2-3x wall-clock speedup at matched FLOPs with no inference architecture change (validated to 10B). Datology beats InternVL3.5-2B by ~10 points at 17x less compute via data curation alone. NVIDIA Star Elastic claims 360x cheaper model-family derivation from one post-training run. The marginal dollar in training is moving from compute to curation.
- TST speedup: 2-3x
- Datology compute cut: 17x
- Star Elastic cost cut: 360x
- SWE-ZERO corpus: 112B tokens, 12M trajectories
05 Compute Supply Crunch Quantified: 4:1 Demand Ratio
background · Nebius reports 4+ customers per GPU and 684% YoY revenue growth, guiding $3-3.4B in 2026 from $530M. Cerebras IPO'd at $56B with a 70% day-one pop, backed by OpenAI's $20B commitment. Cisco's AI orders jumped from $5B to $9B. The 9GW Stratos project faces 4,000 complaints and a referendum. Reserved capacity beats on-demand for H2 2026.
- Nebius 2025 revenue: $530M
- Nebius 2026 guide: $3-3.4B
- Nebius YoY growth: 684%
- Cerebras valuation: $56B
- Cisco AI orders jump: $5B→$9B
◆ DEEP DIVES
01 Anthropic's Pricing Cliff: Metering, Capacity, and the 30-Day Window
What Changed This Week
Anthropic tightened the pricing surface this week, and the budget written in October no longer covers the workload run in November. Claude subscriptions now convert to dollar-matched API credits across every programmatic surface: Agent SDK, claude -p, GitHub Actions, and third-party harnesses. The 70-90% effective discount power users were extracting on alternative harnesses is gone. Starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) lands in a separate credit bucket with no rollover and overflow at API rates. Dario Amodei has admitted Anthropic planned for 10x growth and hit 80x, which is why they are emergency-leasing xAI's 220,000+ GPU Colossus 1 cluster (H100, H200, GB200). The capacity scramble is the cause. The metering is the consequence.
The Contradiction Worth Surfacing
Sources disagree on what the Ramp 34.4% vs 32.3% crossover means. Several cite it as evidence Anthropic is winning. Others correctly note that Ramp measures who gets billed on a corporate card, which doesn't capture token volume, workload criticality, or production dependency. OpenAI counters that large enterprise contracts go through invoice and ACH, not cards. The directional signal is robust across sources: multi-vendor procurement is now the default. The specific ranking is noise within measurement error.
ServiceNow burned its full-year Claude budget by May. The cost-attribution gap bites most teams within one quarter of going live.
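A straight-line burn projection is one way to see a budget overrun of this kind coming. The sketch below is a minimal illustration, assuming you can export cumulative spend yourself; the function name, the dollar figures, and the linear-burn assumption are all hypothetical, and agentic workloads often ramp faster than linear.

```python
from datetime import date, timedelta


def project_exhaustion(budget_usd, spend_to_date_usd, period_start, today):
    """Linear burn projection: when does the budget run out at the current rate?

    Returns (daily_burn_usd, exhaustion_date). exhaustion_date is None when
    nothing has been spent yet, because there is no rate to extrapolate.
    """
    days_elapsed = (today - period_start).days
    if days_elapsed <= 0:
        raise ValueError("today must be after period_start")
    daily_burn = spend_to_date_usd / days_elapsed
    if daily_burn == 0:
        return 0.0, None
    remaining = max(budget_usd - spend_to_date_usd, 0.0)
    days_left = int(remaining / daily_burn)
    return daily_burn, today + timedelta(days=days_left)


# Hypothetical figures: a $120k annual budget with $90k spent by May 1.
burn, runs_out = project_exhaustion(
    budget_usd=120_000,
    spend_to_date_usd=90_000,
    period_start=date(2026, 1, 1),
    today=date(2026, 5, 1),
)
print(f"daily burn ${burn:,.0f}, budget exhausted around {runs_out}")
```

Re-running this weekly against the new credit caps gives a rough early warning; it is not a substitute for per-feature attribution.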
The No-SLA Problem
Anthropic provides no native per-user usage telemetry, no tool-level consumption breakdown, no SLAs on latency or availability, no budget alerts, and no anomaly detection. For enterprise SaaS at this price point, that is anomalous. What the pricing page doesn't tell you is that you cannot attribute which tenant, prompt, or feature drove the bill until the invoice arrives.
| Capability | Enterprise SaaS norm | Anthropic (today) |
| --- | --- | --- |
| Per-user attribution | Native dashboards | Not exposed |
| Budget alerts | Standard | Absent |
| Latency/availability SLA | Contractual | None |
| Anomaly detection | Built-in | Absent |
The OpenAI Counter
Sam Altman dropped a 2-month-free Codex enterprise switch promo the same day Anthropic metered. That is a zero-cost evaluation window. The right read is to run it through an internal harness, not vendor benchmarks, with trajectory-level instrumentation that measures how agents succeed, not just pass@1.
Action items
- Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and reconcile projected token burn against new credit caps by end of next week
- Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily budget alerts before June 15
- Activate OpenAI's 2-month Codex promo and instrument a head-to-head eval on matched prompts and tool schemas
- Avoid locking annual Anthropic contracts until post-Colossus integration stability is observable (expect 6-8 weeks)
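The per-user, per-feature tagging and daily budget alert from the action items can be prototyped before a gateway is in place. The sketch below is an in-process stand-in, not LiteLLM's or Portkey's API; the class name, tag fields, and prices are hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UsageLedger:
    """In-process stand-in for the tagging a gateway would do per request."""
    daily_budget_usd: float
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, day, user, feature, input_tokens, output_tokens,
               usd_per_mtok_in, usd_per_mtok_out):
        """Attribute one call's cost to (day, user, feature); return the cost."""
        cost = (input_tokens * usd_per_mtok_in
                + output_tokens * usd_per_mtok_out) / 1_000_000
        self.spend[(day, user, feature)] += cost
        return cost

    def daily_total(self, day):
        return sum(c for (d, _, _), c in self.spend.items() if d == day)

    def over_budget(self, day):
        return self.daily_total(day) > self.daily_budget_usd

    def top_features(self, day, n=3):
        """Features ranked by spend for the day, highest first."""
        by_feature = defaultdict(float)
        for (d, _, feature), cost in self.spend.items():
            if d == day:
                by_feature[feature] += cost
        return sorted(by_feature.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Prices below are placeholders, not any vendor's list price.
ledger = UsageLedger(daily_budget_usd=50.0)
ledger.record("2026-05-15", "alice", "code-review", 2_000_000, 100_000, 10.0, 50.0)
ledger.record("2026-05-15", "bob", "summarize", 3_000_000, 200_000, 10.0, 50.0)
if ledger.over_budget("2026-05-15"):
    print("budget alert:", ledger.top_features("2026-05-15"))
```

Keying spend on (day, user, feature) is the design choice that matters: it answers "which feature drove the bill" on the same day, rather than when the invoice arrives.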
Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with
02 59% Agentic: Rebuild the Eval Harness Around Trajectories, Not Turns
The Production Data
Vercel's AI Gateway, spanning 200,000 teams over 7 months, reports 59% of all tokens are now agentic — multi-turn, tool-calling traces. This is production telemetry, not a forecast. The spend-volume split is where the routing behavior shows up: Anthropic captures 61% of dollar spend on reasoning and planning nodes via Opus, while Google captures 38% of token volume on high-throughput utility calls via Flash. Teams are already tiering by node type. Eval and cost code has not caught up.
Why Current Evals Are Measuring the Wrong Bottleneck
Most production eval harnesses still score single-turn responses against reference answers. When 59% of traffic is multi-step tool loops with retries, the failure mode is a planner burning 40,000 tokens arguing with itself, not final-answer accuracy. Accuracy can sit at 90%+ whether the planner takes the direct path or the wasteful one. The bill lives on the cost path, which the harness does not see.
Cost models have the inverse problem. They were fit when input-output ratios sat around 3:1. Agentic traces run 15:1 on input with cache-hit rates that vary by provider. A forecast built on last year's ratio undercounts input tokens by roughly 5x, and the error on spend approaches that figure as input cost dominates the bill. That is not a calibration error. That is the wrong model.
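The 5x figure is the ratio of input tokens (15:1 over 3:1). How much of it reaches the bill depends on the input/output price gap and on cache hits. The sketch below makes that arithmetic explicit; the prices, the 2,000-token output size, and the cached-token discount are placeholder assumptions, not any provider's actual rates.

```python
def cost_per_task(output_tokens, input_ratio, usd_per_mtok_in, usd_per_mtok_out,
                  cache_hit_rate=0.0, cached_price_factor=0.1):
    """Dollar cost of one task given an input:output token ratio.

    cached_price_factor is the fraction of the input price charged for
    cache hits; 0.1 is a placeholder, since real discounts vary by provider.
    """
    input_tokens = output_tokens * input_ratio
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    input_cost = (fresh + cached * cached_price_factor) * usd_per_mtok_in
    output_cost = output_tokens * usd_per_mtok_out
    return (input_cost + output_cost) / 1_000_000


# Placeholder prices: $10 per million input tokens, $50 per million output.
old = cost_per_task(2_000, 3, 10.0, 50.0)
agentic = cost_per_task(2_000, 15, 10.0, 50.0)
agentic_cached = cost_per_task(2_000, 15, 10.0, 50.0, cache_hit_rate=0.8)

print(f"3:1 model        ${old:.3f}")
print(f"15:1 no cache    ${agentic:.3f}  ({agentic / old:.1f}x)")
print(f"15:1, 80% cached ${agentic_cached:.3f}  ({agentic_cached / old:.1f}x)")
```

Under these placeholder prices the spend multiple lands well below the token multiple. That is the argument for fitting the cost model on measured ratios and cache-hit rates per provider rather than on a single scalar.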
If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.
The MDASH Validation
Microsoft's MDASH, a 100+ agent ensemble, outperformed Anthropic's Mythos on CyberGym by decomposing vulnerability discovery into scan → adversarial debate → PoC exploitation stages. The result is consistent with classical ML priors: ensemble topology with explicit disagreement beats monolithic models on complex tasks. Caveat: no cost or latency numbers published. The thing CyberGym doesn't measure is the inference bill for 100+ agents, which is the number that decides whether this ships.
The Architecture That Wins
| Layer | Pattern | Evidence |
| --- | --- | --- |
| Routing | Cheap triage → expensive reasoning, gated by confidence | Abridge (80M+ conversations), Vercel spend/volume split |
| Eval | Trajectory-level: tool-call F1, steps-to-completion, $/successful-task | Kapoor: outcome-only metrics hide reward hacking in capable agents |
| Memory | External event-driven store, not weights | Microsoft agent memory: 97.2% precision at 400-500 memories |
| Grounding | Knowledge Graph + MCP, not vector RAG alone | SAP + ServiceNow converged independently; Glean: MCP +30% tokens vs tuned retrieval |
Action items
- Add trajectory-level metrics (tool-call precision/recall, steps-to-completion, cost-per-successful-task) to eval harness this sprint
- Instrument per-node token cost in agent graphs and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models
- Run a 1-hour spike measuring MCP/tool-calling token overhead vs. retrieval-first baseline on 100 production traces
- Persist full agent trajectories (tool calls, intermediate state, file diffs) and audit stratified sample of 'passing' rollouts for reward hacking
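The trajectory-level metrics named above (tool-call precision/recall, steps-to-completion, cost-per-successful-task) can be computed from persisted rollouts with very little code. The sketch below is a minimal version; the `Trajectory` fields and the sample runs are hypothetical, and the set-based tool match deliberately ignores call order and repeats, which a production harness may want to score.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    expected_tools: list   # tool calls a reference solution would make
    actual_tools: list     # tool calls the agent actually made
    steps: int             # total agent steps until it stopped
    cost_usd: float
    succeeded: bool


def tool_call_prf(expected, actual):
    """Set-based precision, recall, and F1 over tool-call names."""
    exp, act = set(expected), set(actual)
    hits = len(exp & act)
    precision = hits / len(act) if act else 0.0
    recall = hits / len(exp) if exp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def summarize(trajectories):
    """Aggregate trajectory-level metrics across a batch of rollouts."""
    wins = [t for t in trajectories if t.succeeded]
    total_cost = sum(t.cost_usd for t in trajectories)
    f1s = [tool_call_prf(t.expected_tools, t.actual_tools)[2] for t in trajectories]
    return {
        "success_rate": len(wins) / len(trajectories),
        "mean_tool_f1": sum(f1s) / len(f1s),
        "mean_steps_to_completion": (sum(t.steps for t in wins) / len(wins)
                                     if wins else None),
        # Failed rollouts still cost money, so they stay in the numerator.
        "usd_per_successful_task": total_cost / len(wins) if wins else None,
    }


# Hypothetical rollouts for illustration.
runs = [
    Trajectory(["search", "read"], ["search", "read"], 4, 0.10, True),
    Trajectory(["search", "read"], ["search", "search", "shell"], 12, 0.50, True),
    Trajectory(["search", "read"], ["shell"], 20, 0.90, False),
]
print(summarize(runs))
```

The second run passes an outcome-only check yet takes three times the steps and five times the cost of the first, which is the gap single-turn scoring cannot see.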
Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs is the combination
03 AI Cyber Capability Crossed a Discrete Threshold — Your Release Gate Needs a New Tier
The Capability Jump
UK AISI evaluated the newest Mythos and GPT-5.5-cyber variants on autonomous cyber-offense tasks. Mythos cleared both of AISI's hardest simulated attack ranges (full network takeover). GPT-5.5-cyber cleared one of two. The prior Mythos generation topped out at 'advanced persistence.' AISI is already building harder tests because the current ladder is saturating. The shape of the curve matters here. This is not smooth interpolation. It looks like a discrete unlock, comparable to GPT-3.5→4 on agentic tool-use.
The 271x Harness Delta
Same model, two teams, two orders of magnitude difference. Mozilla wrapped Claude Mythos Preview in a custom agentic harness integrated with their fuzzing infrastructure and surfaced 271 bugs in Firefox 150, including sandbox escapes, UAFs, and race conditions. Daniel Stenberg pointed the same model at curl and got 1 low-severity CVE with 4 false positives. His verdict: 'primarily marketing.'
The variable that moved was the harness, not the weights. Mozilla's wrapper emits reproducible test cases, scales across ephemeral VMs, and integrates into their security lifecycle. The thing the headline number doesn't tell you is which factor dominated: tool integration, compute budget per target, or corpus priors. On the evidence available, eval scaffolding dominates model choice by at least 50x on this task.
Vulnerability discovery just moved from human-weeks to model-minutes. If the patch SLA is benchmarked against human speed, it's tuned to last year's threat model.
Live Misuse Confirmed
Google's threat-intel team observed a hacking group using LLMs to build cybercrime tooling, the exact scenario flagged when Mythos shipped. Anthropic published a case study of Claude Code running an estimated 80-90% of tactical work across ~30 targets in what they call the first largely AI-executed espionage campaign. The 80-90% figure is self-reported and hard to audit, but the direction is consistent with Google's observation. This is no longer a red-team thought experiment.
What This Means for Release Gates
| Current gate | What it misses | Required addition |
| --- | --- | --- |
| Refusal rate on static prompts | End-to-end chain completion | Staged rubric: recon → lateral movement → persistence → exfil |
| Prompt injection catch rate | Tool-call chain misuse | Agent-trajectory anomaly classifier on known-bad patterns |
| Single-model benchmark | Harness-amplified capability | Eval the deployed system, not the weights in isolation |
Action items
- Add a staged cyber-capability rubric (recon/access/lateral/persist/exfil) to your agent release gate before the next model upgrade
- Run a red-team spike using a frontier model against your own codebase and internal tools; measure time-to-first-exploit vs. human baseline
- Log and feature-engineer agent action sequences; train a lightweight classifier on known-bad trajectories for production monitoring
- Spike a domain-specific agentic harness modeled on Mozilla's pattern (reproducible test cases + ephemeral VMs + existing signal pipelines) for one complex internal service
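A staged rubric can be scored as "furthest stage reached in an unbroken chain", which is what separates it from a refusal-rate check. The sketch below assumes a grader has already marked which stages each sandboxed eval run completed; the stage names follow the action item above, and the `max_allowed` threshold is a hypothetical policy choice.

```python
STAGES = ["recon", "access", "lateral", "persist", "exfil"]


def furthest_stage(completed):
    """Return the last stage reached in an unbroken chain from recon.

    A later stage only counts if every earlier stage was also completed,
    since the rubric is meant to score end-to-end chain completion.
    Returns None when even the first stage was not completed.
    """
    reached = None
    for stage in STAGES:
        if stage not in completed:
            break
        reached = stage
    return reached


def gate(run_results, max_allowed="access"):
    """Fail the release gate if any eval run got past max_allowed."""
    limit = STAGES.index(max_allowed)
    worst = -1
    for completed in run_results:
        stage = furthest_stage(completed)
        if stage is not None:
            worst = max(worst, STAGES.index(stage))
    return {"passed": worst <= limit,
            "furthest": STAGES[worst] if worst >= 0 else None}


# Hypothetical eval runs: each set lists the stages a grader marked complete.
runs = [
    {"recon"},
    {"recon", "access", "lateral"},
    {"recon", "persist"},          # persist without access: chain is broken
]
print(gate(runs))
```

Gating on the worst run rather than the average is deliberate in this sketch: one complete chain is the event the gate exists to catch.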
Sources: Mythos cleared the AISI attack ranges · The headline claim is that AI models have reached full network takeover · Mozilla shipped 271 bugs · PraisonAI auth bypass exploited in four hours · Anthropic published the case study this week
04 Three Training Efficiency Results That Change Your Q3 Compute Budget
The Claims
The marginal dollar in model development looks like it is moving away from raw FLOP-hours and toward recipe design and data curation. The week's research drops point that direction unevenly, hitting different phases of the training pipeline.
| Work | Claim | Scale validated | Inference impact | Replication risk |
| --- | --- | --- | --- | --- |
| Nous Research TST | 2-3x wall-clock at matched FLOPs | 270M → 10B-A1B MoE | None; no architecture change | Medium; single-source, clean claim |
| Datology VLM | +11.7 pts on 20 benchmarks at 17x less compute | 2B and 4B | Lower response FLOPs (real serving win) | Medium; benchmark-selection risk |
| NVIDIA Star Elastic | 360x cheaper model-family derivation | Not specified | Produces family of sizes from one run | High; big headline, lab-reported |
Which One to Spike First
Token Superposition Training (TST) is the highest-leverage bet. It is a pretraining recipe change with no inference-side cost downstream. If it replicates at even 1.6x on a 1B continued-pretraining run with no val-loss regression, it pays for itself on the next full run. The 2-3x claim at matched FLOPs with no architecture change is either free or it is not, and the experiment is bounded.
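The "pays for itself on the next full run" claim reduces to simple arithmetic once a spike cost and a full-run cost are fixed. The sketch below assumes run cost scales with wall-clock time and that the speedup measured at 1B carries over to the full run, which is exactly what the spike cannot guarantee; the dollar figures are hypothetical.

```python
def spike_payoff(full_run_cost_usd, spike_cost_usd, speedup):
    """Net saving on the next full run if the measured speedup carries over.

    Assumes cost scales with wall-clock time, so a run that is `speedup`
    times faster costs 1/speedup as much.
    """
    saving = full_run_cost_usd * (1 - 1 / speedup)
    return saving - spike_cost_usd


def break_even_speedup(full_run_cost_usd, spike_cost_usd):
    """Smallest speedup at which the spike pays for itself in one run."""
    if spike_cost_usd >= full_run_cost_usd:
        raise ValueError("spike costs at least as much as the run it would speed up")
    return 1 / (1 - spike_cost_usd / full_run_cost_usd)


# Hypothetical budget: a $400k full run and a $30k spike (both arms).
for s in (1.6, 2.0, 3.0):
    print(f"{s}x -> net ${spike_payoff(400_000, 30_000, s):,.0f}")
print(f"break-even speedup: {break_even_speedup(400_000, 30_000):.2f}x")
```

When the spike is small relative to the full run, the break-even speedup sits close to 1x, which is why even a partial replication of the 2-3x claim would justify the experiment.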
Datology's result is the clearest evidence this year that data curation dominates compute for VLM training. Beating InternVL3.5-2B by about 10 points while using 17x less compute, purely through data selection, suggests most teams are over-spending on GPU-hours and under-investing in dataset engineering. Their 4B model lands near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B, which is a real serving-cost win, not a leaderboard one.
Star Elastic's 360x is the claim most likely to shrink under independent reproduction. Even at 30x it restructures how model-size tiers get produced for different deployment targets.
The marginal dollar in VLM training has moved from compute to curation. Datology's 17x result is the strongest evidence this year.
Adjacent: SWE-ZERO-12M-trajectories
Kevin Li released 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages, positioned as the largest open agentic trace corpus. What the release doesn't settle is durability: open releases at this scale tend to acquire licensing frictions within a few months. Useful for SFT, reward-model training, and synthetic eval generation while the license window is clean.
Action items
- Spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOPs baseline this quarter
- Pull SWE-ZERO-12M-trajectories and stand up a preprocessing pipeline (dedup, license filter, language stratification) before licensing frictions appear
- Run a data curation ablation on your VLM or multimodal pipeline: measure quality at 5x and 10x dataset reduction with aggressive filtering
- Budget Q3 training runs assuming 2-3x efficiency improvements are plausible; don't lock full-year GPU reservations at current utilization assumptions
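The preprocessing pipeline in the second action item (dedup, license filter, language stratification) can be sketched with the standard library. The record schema, the license allowlist, and the per-language cap below are assumptions for illustration; the actual SWE-ZERO schema is not described in the sources, and exact-hash dedup will miss near-duplicates.

```python
import hashlib
import random
from collections import defaultdict

# Assumed allowlist; the right set depends on your legal review.
ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}


def preprocess(records, per_language_cap, seed=0):
    """Dedup by content hash, filter by license, then cap each language.

    Each record is assumed to be a dict with 'text', 'license', and
    'language' keys; the real corpus schema may differ.
    """
    seen = set()
    by_language = defaultdict(list)
    for rec in records:
        if rec["license"].lower() not in ALLOWED_LICENSES:
            continue
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if digest in seen:          # exact-duplicate trajectory
            continue
        seen.add(digest)
        by_language[rec["language"]].append(rec)

    rng = random.Random(seed)       # seeded so the sample is reproducible
    kept = []
    for language in sorted(by_language):
        group = by_language[language]
        if len(group) > per_language_cap:
            group = rng.sample(group, per_language_cap)
        kept.extend(group)
    return kept


# Hypothetical records for illustration.
sample = [
    {"text": "fix bug A", "license": "MIT", "language": "python"},
    {"text": "fix bug A", "license": "MIT", "language": "python"},   # duplicate
    {"text": "fix bug B", "license": "GPL-3.0", "language": "python"},
    {"text": "fix bug C", "license": "Apache-2.0", "language": "rust"},
    {"text": "fix bug D", "license": "MIT", "language": "python"},
]
print(len(preprocess(sample, per_language_cap=1)))
```

Filtering on license before hashing keeps disallowed text out of the dedup set, and capping per language stops the largest language from dominating the training mix.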
Sources: Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week
◆ QUICK HITS
Update: ML infra CVE surface expanded. Apache Iceberg (CVSS 9.9) lets attackers redirect table metadata to poisoned S3 prefixes; Argo CD 3.2-3.3 leaks plaintext K8s Secrets; PraisonAI was exploited within 4 hours of disclosure; an 18-year-old NGINX RCE in the rewrite module affects model-serving gateways
Source: LiteLLM landed in the KEV catalog this week
DuckDB shipped Quack HTTP client-server mode: single-node analytics becomes a shared service; combined with the ECS Fargate pattern, it is a credible replacement for Spark-on-Glue jobs under 100GB
Source: DuckDB shipped a client-server mode this week
Kafka Share Groups decouple consumer parallelism from partition count, with reported linear 8x scaling at 32 instances. First candidate: LLM-in-the-loop enrichment consumers that are I/O-bound on model APIs
Source: DuckDB shipped a client-server mode this week
Only 15% of organizations have data foundations ready for agentic AI at scale (Fivetran); data quality and lineage are cited as the #1 blocker by ~50%. Most 'agent projects' funded this quarter are actually data-platform projects with an agent on top
Source: DuckDB shipped a client-server mode this week
Full-duplex voice models emerge as a distinct architecture class: TML-Interaction-Small reports 0.40s turn-taking latency vs. 0.57s (Gemini Flash Live) and 1.18s (GPT-Realtime-2.0), a roughly 3x gap over OpenAI on the naturalness metric
Source: TML is reporting 0.40 seconds of full-duplex latency
Duolingo publicly pegs AI-generated content 'slop' at ~20% requiring human QC, and reversed its blanket 'evaluate employees on AI usage' policy after observing performative adoption without productivity lift
Source: Duolingo's twenty percent AI slop rate
LLM-as-a-Verifier eliminates the tie problem plaguing LLM-as-a-Judge by decomposing judgments into repeated binary verifications with token-level scoring. A one-day harness rewrite is enough to measure tie-rate and CI-width improvement
Source: An Ollama endpoint exposed to the public internet gets picked up by Shodan in about three hours
Affirm claims transformer-based underwriting outperforms legacy GBMs across 27M consumers, but new PCAOB guidance requires deterministic execution and tamper-evident audit trails that transformers don't natively provide
Source: The transformer underwriting models are outperforming
◆ Bottom line
The take.
Anthropic killed the flat-rate developer discount, admitted an 8x capacity planning miss, and leased a competitor's entire GPU fleet to keep the lights on — all while OpenAI is paying you to evaluate the alternative. Meanwhile, 59% of production tokens are agentic but nearly 100% of eval harnesses are single-turn, and AI models just cleared the autonomous-exploit-chain threshold that was theoretical six months ago. The three things that need to change before June 15: re-run Claude unit economics against the new metering, rebuild eval around trajectories not turns, and add a cyber-capability tier to your model release gate.
Frequently asked
- What exactly changed in Anthropic's pricing this week and when do the next cliffs hit?
- Claude subscriptions now convert to dollar-matched API credits across Agent SDK, claude -p, GitHub Actions, and third-party harnesses, eliminating the 70-90% effective discount power users were extracting. Starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) moves to a separate credit bucket with no rollover and overflow at API rates. There is no native per-user telemetry, and there are no budget alerts or SLAs, to soften the transition.
- Why is single-turn eval accuracy a misleading metric when 59% of tokens are agentic?
- Single-turn scoring measures final-answer correctness, but multi-step tool loops fail by burning tokens — a planner can spend 40,000 tokens arguing with itself while still hitting 90%+ accuracy. The bill lives on the cost path, which the harness never sees. Trajectory-level metrics (tool-call F1, steps-to-completion, $/successful-task) are required to surface the actual failure mode, and cost models fit on 3:1 input-output ratios are roughly 5x off when real agentic traces run 15:1.
- How should I weigh the three training efficiency results before committing Q3 compute?
- Token Superposition Training is the highest-leverage spike: a pretraining recipe change with no inference-side cost, claiming 2-3x wall-clock at matched FLOPs on scales up to 10B-A1B MoE. Datology's 17x VLM compute reduction with +11.7 points across 20 benchmarks is the strongest evidence that data curation now dominates raw FLOPs, and their 4B model serves at 3.3x lower response FLOPs than Qwen3-VL-4B. NVIDIA's Star Elastic 360x claim is most likely to shrink under replication but still matters at 30x.
- What does the Mozilla vs. curl bug-finding gap actually tell us about model capability?
- It tells us harness engineering dominates model selection by at least 50x on vulnerability discovery tasks. Mozilla's wrapper around Claude Mythos Preview — reproducible test cases, ephemeral VMs, fuzzing integration — surfaced 271 Firefox 150 bugs including sandbox escapes and UAFs. Daniel Stenberg pointed the same model at curl with a thinner setup and got 1 low-severity CVE plus 4 false positives. The weights were identical; the scaffolding was not.
- Is the Ramp 34.4% vs 32.3% spend crossover real evidence Anthropic passed OpenAI in enterprise?
- No, it's directionally suggestive but within measurement error for the question most teams care about. Ramp captures who gets billed on a corporate card, which doesn't reflect token volume, workload criticality, or production dependency, and large enterprise contracts typically flow through invoice and ACH instead. The robust signal across sources is that multi-vendor procurement is now the default; the specific vendor ranking is noise.