What exactly changed in Anthropic's pricing this week and when do the next cliffs hit?

Claude subscriptions now convert to dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and third-party harnesses, eliminating the 70-90% effective discount power users were extracting. Starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) moves to a separate credit bucket with no rollover and overflow at API rates. There are no native per-user telemetry, budget alerts, or SLAs to soften the transition.

Why is single-turn eval accuracy a misleading metric when 59% of tokens are agentic?

Single-turn scoring measures final-answer correctness, but multi-step tool loops fail by burning tokens — a planner can spend 40,000 tokens arguing with itself while still hitting 90%+ accuracy. The bill lives on the cost path, which the harness never sees. Trajectory-level metrics (tool-call F1, steps-to-completion, $/successful-task) are required to surface the actual failure mode, and cost models fit on 3:1 input-output ratios are roughly 5x off when real agentic traces run 15:1.

How should I weigh the three training efficiency results before committing Q3 compute?

Token Superposition Training is the highest-leverage spike: a pretraining recipe change with no inference-side cost, claiming 2-3x wall-clock at matched FLOPs on scales up to 10B-A1B MoE. Datology's 17x VLM compute reduction with +11.7 points across 20 benchmarks is the strongest evidence that data curation now dominates raw FLOPs, and their 4B model serves at 3.3x lower response FLOPs than Qwen3-VL-4B. NVIDIA's Star Elastic 360x claim is most likely to shrink under replication but still matters at 30x.

What does the Mozilla vs. curl bug-finding gap actually tell us about model capability?

It tells us harness engineering dominates model selection by at least 50x on vulnerability discovery tasks. Mozilla's wrapper around Claude Mythos Preview — reproducible test cases, ephemeral VMs, fuzzing integration — surfaced 271 Firefox 150 bugs including sandbox escapes and UAFs. Daniel Stenberg pointed the same model at curl with a thinner setup and got 1 low-severity CVE plus 4 false positives. The weights were identical; the scaffolding was not.

Is the Ramp 34.4% vs 32.3% spend crossover real evidence Anthropic passed OpenAI in enterprise?

No, it's directionally suggestive but within measurement error for the question most teams care about. Ramp captures who gets billed on a corporate card, which doesn't reflect token volume, workload criticality, or production dependency, and large enterprise contracts typically flow through invoice and ACH instead. The robust signal across sources is that multi-vendor procurement is now the default; the specific vendor ranking is noise.

Edition 2026-05-15 · read as Data Science

Anthropic Ends Subscription Arbitrage, Resets Agent Economics

Sources: 36
Words: 1,514
Read: 8min

Topics: Agentic AI · LLM Inference · AI Capital

◆ The signal

Anthropic killed the 70-90% effective discount on programmatic Claude usage this week — subscriptions now convert to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses — while simultaneously admitting they planned for 10x growth and got 80x, forcing an emergency lease of xAI's entire 220,000-GPU Colossus 1 cluster. OpenAI dropped a 2-month-free Codex enterprise switch promo the same day. If you haven't re-run unit economics on your agent stack since Monday, you're making a pricing decision by default.

Key facts

Anthropic ended the 70-90% effective discount on programmatic Claude usage, converting subscriptions to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses.
Anthropic planned for 10x growth but hit 80x, triggering an emergency lease of xAI's 220,000+ GPU Colossus 1 cluster.
Vercel's AI Gateway data across 200,000 teams over 7 months shows 59% of tokens are now agentic, with Anthropic capturing 61% of dollar spend and Google 38% of token volume.
Mozilla's custom agentic harness around Claude Mythos Preview surfaced 271 bugs in Firefox 150, while the same model on curl yielded only 1 low-severity CVE — a harness-driven delta of roughly 271x.
UK AISI reports Mythos cleared both of its hardest simulated attack ranges (full network takeover) and GPT-5.5-cyber cleared one of two, indicating a discrete capability jump in autonomous cyber-offense.

◆ INTELLIGENCE MAP

01
Anthropic Metering Cliff + OpenAI Counter-Offensive
act now
Claude subscriptions now meter at API list price across all programmatic surfaces. Starting June 15, third-party tool credits become separate, non-rolling buckets. ServiceNow already burned its full-year Claude budget by May. OpenAI's 2-month free Codex promo is an asymmetric-payoff evaluation window.
Headline: 70-90% effective discount eliminated · 11 sources
Tracked: growth miss · Colossus GPUs leased · Anthropic B2B share · OpenAI B2B share · June 15 deadline
Chart: Anthropic B2B share 34.4% · OpenAI B2B share 32.3%
02
59% Agentic Volume: Eval Harnesses Measure the Minority
act now
Vercel's AI Gateway shows 59% of production tokens are now multi-turn agentic traces, not single-shot completions. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Most eval harnesses still score single-turn responses — they're benchmarking traffic that no longer dominates.
Headline: 59% of tokens now agentic · 5 sources
Tracked: Opus spend share · Flash volume share · MCP token overhead · teams tracked
Chart (share of tokens): agentic 59% · single-turn 41%
03
AI Cyber Capability Crosses Discrete Threshold
monitor
Anthropic's Mythos is the first model to clear both AISI attack ranges (full network takeover). Mozilla's custom harness surfaced 271 Firefox bugs vs. curl's 1 — same model, 271x delta from scaffolding alone. Google confirmed a threat actor using LLMs for live espionage tooling. Patch SLAs calibrated to human speed are now mis-calibrated.
Headline: 271x harness yield delta · 7 sources
Tracked: Mozilla bugs found · curl bugs found · MDASH Windows fixes · PraisonAI exploit time
Chart: Mozilla (custom harness) 271 bugs · curl (generic scan) 1 bug · MDASH (Windows) 16 fixes
04
Training Efficiency: Three Results Shift Unit Economics
monitor
Nous TST reports 2-3x wall-clock speedup at matched FLOPs with no inference architecture change (validated to 10B). Datology beats InternVL3.5-2B by ~10 points at 17x less compute via data curation alone. NVIDIA Star Elastic claims 360x cheaper model-family derivation from one post-training run. The marginal dollar in training is moving from compute to curation.
Headline: 17x compute reduction (VLM) · 2 sources
Tracked: TST speedup · Datology compute cut · Star Elastic cost cut · SWE-ZERO corpus
Chart (claimed multiple): Nous TST 3x · Datology (curation) 17x · NVIDIA Star Elastic 360x
05
Compute Supply Crunch Quantified: 4:1 Demand Ratio
background
Nebius reports 4+ customers per GPU and 684% YoY revenue growth, guiding $3-3.4B in 2026 from $530M. Cerebras IPO'd at $56B with a 70% day-one pop backed by OpenAI's $20B commitment. Cisco's AI orders jumped from $5B to $9B. The 9GW Stratos project faces 4,000 complaints and a referendum. Reserved capacity beats on-demand for H2 2026.
Headline: 4:1 GPU demand-to-supply · 6 sources
Tracked: Nebius 2026 guide · Nebius YoY growth · Cerebras valuation · Cisco AI orders jump
Chart (revenue, $M): Nebius 2025 revenue 530 · Nebius 2026 guide 3,200

◆ DEEP DIVES

01
Anthropic's Pricing Cliff: Metering, Capacity, and the 30-Day Window
What Changed This Week
Anthropic tightened the pricing surface this week, and the budget written in October no longer covers the workload run in November. Claude subscriptions now convert to dollar-matched API credits across every programmatic surface: Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The 70-90% effective discount power users were extracting on alternative harnesses is gone. Starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) lands in a separate credit bucket with no rollover and overflow at API rates. Dario Amodei has admitted Anthropic planned for 10x growth and hit 80x, which is why they are emergency-leasing xAI's 220,000+ GPU Colossus 1 cluster (H100, H200, GB200). The capacity scramble is the cause. The metering is the consequence.
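The bucket mechanics described above reduce to simple arithmetic. What follows is a minimal sketch, assuming a monthly credit equal to the subscription price, no rollover of unused credit, and overflow billed at API rates. The function names and every dollar figure are hypothetical and are not Anthropic's billing logic.

```python
from dataclasses import dataclass


@dataclass
class MonthBill:
    covered_usd: float   # usage absorbed by the subscription credit
    overflow_usd: float  # usage billed at API rates on top of the subscription
    unused_usd: float    # credit that expires because nothing rolls over


def meter_month(usage_usd: float, credit_usd: float) -> MonthBill:
    """Split one month's API-priced usage against a non-rolling credit bucket."""
    covered = min(usage_usd, credit_usd)
    return MonthBill(
        covered_usd=covered,
        overflow_usd=max(0.0, usage_usd - credit_usd),
        unused_usd=credit_usd - covered,
    )


def total_spend(monthly_usage_usd: list[float], credit_usd: float) -> float:
    """Subscription fee (assumed equal to the credit) plus overflow, summed over months."""
    return sum(credit_usd + meter_month(u, credit_usd).overflow_usd
               for u in monthly_usage_usd)


# Hypothetical: a $200 plan and a bursty workload priced at API rates.
usage = [50.0, 900.0, 120.0]
print(total_spend(usage, credit_usd=200.0))  # 200 + (200 + 700) + 200 = 1300.0
```

Under these assumptions the quiet months do not offset the burst: the $230 of unused credit expires, and the $700 overflow is billed in full.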
The Contradiction Worth Surfacing
Sources disagree on what the Ramp 34.4% vs 32.3% crossover means. Several cite it as evidence Anthropic is winning. Others correctly note that Ramp measures who gets billed on a corporate card, which doesn't capture token volume, workload criticality, or production dependency. OpenAI counters that large enterprise contracts go through invoice and ACH, not cards. The directional signal is robust across sources: multi-vendor procurement is now the default. The specific ranking is noise within measurement error.
ServiceNow burned its full-year Claude budget by May. The cost-attribution gap bites most teams within one quarter of going live.
The No-SLA Problem
Anthropic provides no native per-user usage telemetry, no tool-level consumption breakdown, no SLAs on latency or availability, no budget alerts, and no anomaly detection. For enterprise SaaS at this price point, that is anomalous. What the pricing page doesn't tell you is that you cannot attribute which tenant, prompt, or feature drove the bill until the invoice arrives.
Capability	Enterprise SaaS norm	Anthropic (today)
Per-user attribution	Native dashboards	Not exposed
Budget alerts	Standard	Absent
Latency/availability SLA	Contractual	None
Anomaly detection	Built-in	Absent
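The attribution gap in the table can be closed on your own side of the API by tagging every call when it is made. The sketch below is a minimal in-memory illustration, not a gateway product's API: `UsageLedger`, its method names, the tenant and feature labels, and the $10 budget are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UsageLedger:
    """In-memory cost attribution; a real gateway would persist these rows."""
    daily_budget_usd: float
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, tenant: str, feature: str, cost_usd: float) -> None:
        # Tag every call at request time so the bill is attributable before invoicing.
        self.spend[(tenant, feature)] += cost_usd

    def by_tenant(self) -> dict:
        totals = defaultdict(float)
        for (tenant, _feature), cost in self.spend.items():
            totals[tenant] += cost
        return dict(totals)

    def over_budget(self) -> list:
        """Tenants whose recorded spend exceeds the daily budget."""
        return sorted(t for t, cost in self.by_tenant().items()
                      if cost > self.daily_budget_usd)


ledger = UsageLedger(daily_budget_usd=10.0)
ledger.record("acme", "code-review", 7.5)
ledger.record("acme", "batch-eval", 4.0)
ledger.record("globex", "code-review", 2.0)
print(ledger.over_budget())  # ['acme']
```

Keying on (tenant, feature) rather than on tenant alone keeps the per-feature breakdown available for the same rows.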
The OpenAI Counter
Sam Altman dropped a 2-month-free Codex enterprise switch promo the same day Anthropic metered. That is a zero-cost evaluation window. The right read is to run it through an internal harness, not vendor benchmarks, with trajectory-level instrumentation that measures how agents succeed, not just pass@1.
Action items
- Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and reconcile projected token burn against new credit caps by end of next week
- Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily budget alerts before June 15
- Activate OpenAI's 2-month Codex promo and instrument a head-to-head eval on matched prompts and tool schemas
- Avoid locking annual Anthropic contracts until post-Colossus integration stability is observable (expect 6-8 weeks)
Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with

02
59% Agentic: Rebuild the Eval Harness Around Trajectories, Not Turns

The Production Data

Vercel's AI Gateway, spanning 200,000 teams over 7 months, reports 59% of all tokens are now agentic — multi-turn, tool-calling traces. This is production telemetry, not a forecast. The spend-volume split is where the routing behavior shows up: Anthropic captures 61% of dollar spend on reasoning and planning nodes via Opus, while Google captures 38% of token volume on high-throughput utility calls via Flash. Teams are already tiering by node type. Eval and cost code has not caught up.

Why Current Evals Are Measuring the Wrong Bottleneck

Most production eval harnesses still score single-turn responses against reference answers. When 59% of traffic is multi-step tool loops with retries, the failure mode is not a wrong final answer but a planner burning 40,000 tokens arguing with itself. Accuracy can sit at 90%+ either way. The bill lives on the cost path, which the harness does not see.

Cost models have the inverse problem. They were fit when input-output ratios sat around 3:1. Agentic traces run about 15:1 input to output, with cache-hit rates that vary by provider. A forecast built on last year's ratio under-counts input tokens by roughly 5x, and the spend error approaches that figure as input cost dominates the bill. That is not a calibration error. That is the wrong model.
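
The ratio shift is easy to check as worked arithmetic. This is a minimal sketch, assuming output volume stays fixed while the input-output ratio moves from 3:1 to 15:1 and ignoring cache discounts. The per-million-token prices are hypothetical placeholders, not any provider's list prices.

```python
def spend_usd(output_tokens: float, in_out_ratio: float,
              price_in: float, price_out: float) -> float:
    """Spend for a workload given output tokens and an input:output ratio.

    Prices are USD per million tokens; cache discounts are ignored here.
    """
    input_tokens = output_tokens * in_out_ratio
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def forecast_error(old_ratio: float, new_ratio: float,
                   price_in: float, price_out: float) -> float:
    """How many times larger real spend is than a forecast fit on old_ratio."""
    out = 1_000_000  # any positive volume works; it cancels in the ratio
    return (spend_usd(out, new_ratio, price_in, price_out)
            / spend_usd(out, old_ratio, price_in, price_out))


# Input-token multiple alone: 15:1 vs 3:1.
print(15 / 3)  # 5.0
# Hypothetical prices where output costs 5x input.
print(forecast_error(3, 15, price_in=3.0, price_out=15.0))  # 2.5
# Hypothetical prices where output costs the same as input.
print(forecast_error(3, 15, price_in=3.0, price_out=3.0))  # 4.0
```

With these placeholder prices the spend multiple lands between 2.5x and 4x, and it approaches the 5x input-token multiple only as input cost dominates the bill. Either way, the forecast has to be refit on the observed ratio rather than rescaled.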

If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.

The MDASH Validation

Microsoft's MDASH, a 100+ agent ensemble, outperformed Anthropic's Mythos on CyberGym by decomposing vulnerability discovery into scan → adversarial debate → PoC exploitation stages. The result is consistent with classical ML priors: ensemble topology with explicit disagreement beats monolithic models on complex tasks. Caveat: no cost or latency numbers published. The thing CyberGym doesn't measure is the inference bill for 100+ agents, which is the number that decides whether this ships.

The Architecture That Wins

Layer	Pattern	Evidence
Routing	Cheap triage → expensive reasoning, gated by confidence	Abridge (80M+ conversations), Vercel spend/volume split
Eval	Trajectory-level: tool-call F1, steps-to-completion, $/successful-task	Kapoor: outcome-only metrics hide reward hacking in capable agents
Memory	External event-driven store, not weights	Microsoft agent memory: 97.2% precision at 400-500 memories
Grounding	Knowledge Graph + MCP, not vector RAG alone	SAP + ServiceNow converged independently; Glean: MCP +30% tokens vs tuned retrieval
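
The routing row in the table can be sketched as a confidence gate. This is a minimal illustration under stated assumptions: each model is a hypothetical callable returning an answer plus a self-reported confidence in [0, 1], the 0.8 threshold is arbitrary, and the stub functions stand in for real API calls.

```python
from typing import Callable, Tuple

# Hypothetical model callables: each takes a prompt and returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(prompt: str, cheap: Model, expensive: Model,
          threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when its confidence is below threshold.

    Returns (answer, tier) so the caller can log which tier served the request.
    """
    answer, confidence = cheap(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = expensive(prompt)
    return answer, "expensive"


# Stub models standing in for real API calls.
def cheap_stub(prompt: str) -> Tuple[str, float]:
    # Pretend short prompts are easy and long ones are not.
    return "cheap-answer", 0.95 if len(prompt) < 40 else 0.4


def expensive_stub(prompt: str) -> Tuple[str, float]:
    return "expensive-answer", 0.99


print(route("Rewrite this query", cheap_stub, expensive_stub))
print(route("Plan a multi-step migration across three services with rollback",
            cheap_stub, expensive_stub))
```

Returning the tier alongside the answer is what makes per-node cost visible later. How well a model's self-reported confidence is calibrated is a separate question that this sketch does not address.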

Action items

- Add trajectory-level metrics (tool-call precision/recall, steps-to-completion, cost-per-successful-task) to eval harness this sprint
- Instrument per-node token cost in agent graphs and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models
- Run a 1-hour spike measuring MCP/tool-calling token overhead vs. retrieval-first baseline on 100 production traces
- Persist full agent trajectories (tool calls, intermediate state, file diffs) and audit stratified sample of 'passing' rollouts for reward hacking
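
The trajectory-level metrics named in the first action item can be computed from a small record per run. The sketch below is one possible definition, not a standard: the `Trajectory` fields are hypothetical, and tool-call F1 is taken over sets of tool names, which ignores order and repeats.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    tool_calls: List[str]      # tools the agent actually invoked, in order
    expected_tools: List[str]  # reference tool calls for the task
    cost_usd: float
    succeeded: bool


def tool_call_f1(t: Trajectory) -> float:
    """F1 over the sets of tool names called vs. expected (order and repeats ignored)."""
    called, expected = set(t.tool_calls), set(t.expected_tools)
    hits = len(called & expected)
    if hits == 0:
        return 0.0
    # Equivalent to 2PR / (P + R) for sets, written to avoid rounding noise.
    return 2 * hits / (len(called) + len(expected))


def cost_per_successful_task(runs: List[Trajectory]) -> float:
    """Total spend, including failed runs, divided by the number of successes."""
    successes = sum(r.succeeded for r in runs)
    total = sum(r.cost_usd for r in runs)
    return total / successes if successes else float("inf")


def mean_steps_to_completion(runs: List[Trajectory]) -> float:
    done = [len(r.tool_calls) for r in runs if r.succeeded]
    return sum(done) / len(done) if done else float("nan")


runs = [
    Trajectory(["search", "read", "write"], ["search", "write"], 0.25, True),
    Trajectory(["search"] * 9 + ["write"], ["search", "write"], 1.75, True),
    Trajectory(["read"], ["search", "write"], 0.50, False),
]
print(tool_call_f1(runs[0]))           # 0.8
print(cost_per_successful_task(runs))  # 1.25
print(mean_steps_to_completion(runs))  # 6.5
```

The second run shows why the metrics belong together: it succeeds and scores a set-based F1 of 1.0, but its ten steps and $1.75 cost expose the retry loop that accuracy and F1 alone would hide.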

Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs is the combination

03
AI Cyber Capability Crossed a Discrete Threshold: Your Release Gate Needs a New Tier

The Capability Jump

UK AISI evaluated the newest Mythos and GPT-5.5-cyber variants on autonomous cyber-offense tasks. Mythos cleared both of AISI's hardest simulated attack ranges (full network takeover). GPT-5.5-cyber cleared one of two. The prior Mythos generation topped out at 'advanced persistence.' AISI is already building harder tests because the current ladder is saturating. The shape of the curve matters here. This is not smooth interpolation. It looks like a discrete unlock, comparable to GPT-3.5→4 on agentic tool-use.

The 271x Harness Delta

Same model, two teams, two orders of magnitude difference. Mozilla wrapped Claude Mythos Preview in a custom agentic harness integrated with their fuzzing infrastructure and surfaced 271 bugs in Firefox 150, including sandbox escapes, UAFs, and race conditions. Daniel Stenberg pointed the same model at curl and got 1 low-severity CVE with 4 false positives. His verdict: 'primarily marketing.'

The variable that moved was the harness, not the weights. Mozilla's wrapper emits reproducible test cases, scales across ephemeral VMs, and integrates into their security lifecycle. The thing the headline number doesn't tell you is which factor dominated: tool integration, compute budget per target, or corpus priors. On the evidence available, eval scaffolding dominates model choice by at least 50x on this task.

Vulnerability discovery just moved from human-weeks to model-minutes. If the patch SLA is benchmarked against human speed, it's tuned to last year's threat model.

Live Misuse Confirmed

Google's threat-intel team observed a hacking group using LLMs to build cybercrime tooling, the exact scenario flagged when Mythos shipped. Anthropic published a case study of Claude Code running an estimated 80-90% of tactical work across ~30 targets in what they call the first largely AI-executed espionage campaign. The 80-90% figure is self-reported and hard to audit, but the direction is consistent with Google's observation. This is no longer a red-team thought experiment.

What This Means for Release Gates

Current gate	What it misses	Required addition
Refusal rate on static prompts	End-to-end chain completion	Staged rubric: recon → lateral movement → persistence → exfil
Prompt injection catch rate	Tool-call chain misuse	Agent-trajectory anomaly classifier on known-bad patterns
Single-model benchmark	Harness-amplified capability	Eval the deployed system, not the weights in isolation
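
The staged rubric in the first row can be scored as "furthest stage reached without a gap". This is a minimal sketch, assuming an upstream evaluator has already labelled which stages a run completed. The stage identifiers mirror the table, while the `gate` policy and its `max_allowed` default are hypothetical.

```python
from typing import List, Optional

# Stage order taken from the rubric in the table above.
STAGES = ["recon", "lateral_movement", "persistence", "exfil"]


def furthest_stage(completed: List[str]) -> Optional[str]:
    """Return the last stage reached in order, stopping at the first gap.

    A run that logs 'persistence' without 'lateral_movement' is credited
    only through 'recon', since the chain was not completed end to end.
    """
    reached = None
    done = set(completed)
    for stage in STAGES:
        if stage not in done:
            break
        reached = stage
    return reached


def gate(completed: List[str], max_allowed: str = "recon") -> bool:
    """Pass the release gate only if the system got no further than max_allowed."""
    reached = furthest_stage(completed)
    if reached is None:
        return True
    return STAGES.index(reached) <= STAGES.index(max_allowed)


print(furthest_stage(["recon", "persistence"]))            # recon
print(gate(["recon", "lateral_movement", "persistence"]))  # False
print(gate(["recon"]))                                     # True
```

Scoring the chain rather than individual steps is what separates this gate from a refusal-rate check: a system can refuse every static prompt and still complete the staged sequence through tool calls.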

Action items

- Add a staged cyber-capability rubric (recon/access/lateral/persist/exfil) to your agent release gate before the next model upgrade
- Run a red-team spike using a frontier model against your own codebase and internal tools; measure time-to-first-exploit vs. human baseline
- Log and feature-engineer agent action sequences; train a lightweight classifier on known-bad trajectories for production monitoring
- Spike a domain-specific agentic harness modeled on Mozilla's pattern (reproducible test cases + ephemeral VMs + existing signal pipelines) for one complex internal service
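
For the action-sequence item above, a useful first feature is the overlap between a trajectory's consecutive action pairs and those seen in known-bad runs. The sketch below is a baseline score rather than a trained classifier, and the action names are hypothetical; real ones would come from your agent's tool-call log.

```python
from collections import Counter
from typing import List, Sequence


def bigrams(actions: Sequence[str]) -> Counter:
    """Count consecutive action pairs, a simple order-aware feature."""
    return Counter(zip(actions, actions[1:]))


def bad_pattern_score(actions: Sequence[str],
                      known_bad: List[Sequence[str]]) -> float:
    """Fraction of this trajectory's bigrams that also occur in known-bad runs."""
    bad = set()
    for trajectory in known_bad:
        bad.update(bigrams(trajectory))
    grams = bigrams(actions)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    return sum(n for gram, n in grams.items() if gram in bad) / total


# Hypothetical action names; real ones come from your agent's tool-call log.
known_bad = [["list_secrets", "read_secret", "http_post"]]
benign = ["read_file", "run_tests", "write_file"]
suspect = ["read_file", "list_secrets", "read_secret", "http_post"]

print(bad_pattern_score(benign, known_bad))   # 0.0
print(bad_pattern_score(suspect, known_bad))  # 2 of 3 bigrams match
```

The same bigram counts can feed a lightweight classifier once enough labelled trajectories exist. Until then, a threshold on this score gives a reviewable alert.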

Sources: Mythos cleared the AISI attack ranges · The headline claim is that AI models have reached full network takeover · Mozilla shipped 271 bugs · PraisonAI auth bypass exploited in four hours · Anthropic published the case study this week

04
Three Training Efficiency Results That Change Your Q3 Compute Budget

The Claims

The marginal dollar in model development looks like it is moving away from raw FLOP-hours and toward recipe design and data curation. The week's research drops point that direction unevenly, hitting different phases of the training pipeline.

Work	Claim	Scale validated	Inference impact	Replication risk
Nous Research TST	2-3x wall-clock at matched FLOPs	270M → 10B-A1B MoE	None — no architecture change	Medium; single-source, clean claim
Datology VLM	+11.7 pts on 20 benchmarks at 17x less compute	2B and 4B	Lower response FLOPs (real serving win)	Medium; benchmark-selection risk
NVIDIA Star Elastic	360x cheaper model-family derivation	Not specified	Produces family of sizes from one run	High; big headline, lab-reported

Which One to Spike First

Token Superposition Training (TST) is the highest-leverage bet. It is a pretraining recipe change with no inference-side cost downstream. If it replicates at even 1.6x on a 1B continued-pretraining run with no val-loss regression, it pays for itself on the next full run. The 2-3x claim at matched FLOPs with no architecture change is either free or it is not, and the experiment is bounded.
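
The "pays for itself" argument above is a one-line break-even check. This is a minimal sketch, assuming training cost scales linearly with wall-clock time, so a speedup of s saves a fraction 1 - 1/s of the full run. The $20k spike and $400k full-run figures are hypothetical.

```python
def savings_usd(full_run_cost_usd: float, speedup: float) -> float:
    """Wall-clock cost avoided on a full run if training finishes `speedup` times faster."""
    return full_run_cost_usd * (1 - 1 / speedup)


def pays_for_itself(spike_cost_usd: float, full_run_cost_usd: float,
                    speedup: float) -> bool:
    """True when the saving on one full run covers the cost of the spike."""
    return savings_usd(full_run_cost_usd, speedup) >= spike_cost_usd


# Hypothetical budget: a $20k spike ahead of a $400k full pretraining run.
for s in (1.6, 2.0, 3.0):
    print(s, round(savings_usd(400_000, s)))
print(pays_for_itself(20_000, 400_000, speedup=1.6))  # True
```

Under these assumptions even the conservative 1.6x case saves 37.5% of the full run, which is why the spike is bounded: it only fails to pay back if the speedup does not replicate or the validation loss regresses.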

Datology's result is the clearest evidence this year that data curation dominates compute for VLM training. Beating InternVL3.5-2B by about 10 points while using 17x less compute, purely through data selection, suggests most teams are over-spending on GPU-hours and under-investing in dataset engineering. Their 4B model lands near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B, which is a real serving-cost win, not a leaderboard one.

Star Elastic's 360x is the claim most likely to shrink under independent reproduction. Even at 30x it restructures how model-size tiers get produced for different deployment targets.

The marginal dollar in VLM training has moved from compute to curation. Datology's 17x result is the strongest evidence this year.

Adjacent: SWE-ZERO-12M-trajectories

Kevin Li released SWE-ZERO-12M-trajectories: 112B tokens, 12M trajectories, 122K PRs, 3K repos, and 16 languages, positioned as the largest open agentic trace corpus. What the release doesn't tell you is durability: open releases at this scale tend to acquire licensing frictions within a few months. Useful for SFT, reward-model training, and synthetic eval generation while the license window is clean.

Action items

- Spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOPs baseline this quarter
- Pull SWE-ZERO-12M-trajectories and stand up a preprocessing pipeline (dedup, license filter, language stratification) before licensing frictions appear
- Run a data curation ablation on your VLM or multimodal pipeline: measure quality at 5x and 10x dataset reduction with aggressive filtering
- Budget Q3 training runs assuming 2-3x efficiency improvements are plausible; don't lock full-year GPU reservations at current utilization assumptions
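
The three preprocessing steps named in the second action item (dedup, license filter, language stratification) can be sketched over plain dict records. This is an illustration under stated assumptions: the field names `text`, `license`, and `language` are hypothetical and not the corpus schema, and the license allow-list is a placeholder for your own legal review.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical allow-list; set it from your own legal review.
ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}


def dedup(records: Iterable[dict]) -> List[dict]:
    """Drop exact-duplicate trajectories by hashing whitespace-normalized text."""
    seen, kept = set(), []
    for rec in records:
        digest = hashlib.sha256(" ".join(rec["text"].split()).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept


def license_filter(records: Iterable[dict]) -> List[dict]:
    """Keep only records whose license is on the allow-list (case-insensitive)."""
    return [r for r in records if r.get("license", "").lower() in ALLOWED_LICENSES]


def stratify(records: Iterable[dict], per_language: int) -> Dict[str, List[dict]]:
    """Cap each language at per_language records, keeping first-seen order."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        if len(buckets[rec["language"]]) < per_language:
            buckets[rec["language"]].append(rec)
    return dict(buckets)


rows = [
    {"text": "fix  bug", "license": "MIT", "language": "python"},
    {"text": "fix bug", "license": "MIT", "language": "python"},   # duplicate after normalization
    {"text": "add test", "license": "GPL-3.0", "language": "go"},  # filtered by license
    {"text": "refactor", "license": "Apache-2.0", "language": "go"},
]
sample = stratify(license_filter(dedup(rows)), per_language=1)
print({lang: len(recs) for lang, recs in sample.items()})  # {'python': 1, 'go': 1}
```

Exact-hash dedup only catches identical text after whitespace normalization. Near-duplicate detection would need a separate step.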

Sources: Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week

◆ QUICK HITS

- Update: ML infra CVE surface expanded. Apache Iceberg (CVSS 9.9) lets attackers redirect table metadata to poisoned S3 prefixes; Argo CD 3.2-3.3 leaks plaintext K8s Secrets; PraisonAI exploited within 4 hours of disclosure; 18-year NGINX RCE in rewrite module affects model-serving gateways
  Source: LiteLLM landed in the KEV catalog this week
- DuckDB shipped Quack HTTP client-server mode: single-node analytics becomes a shared service; combined with ECS Fargate pattern, credible replacement for Spark-on-Glue jobs under 100GB
  Source: DuckDB shipped a client-server mode this week
- Kafka Share Groups decouple consumer parallelism from partition count with reported linear 8x scaling at 32 instances. First candidate: LLM-in-the-loop enrichment consumers that are I/O-bound on model APIs
  Source: DuckDB shipped a client-server mode this week
- Only 15% of organizations have data foundations ready for agentic AI at scale (Fivetran); data quality and lineage cited as #1 blocker by ~50%. Most 'agent projects' funded this quarter are actually data-platform projects with an agent on top
  Source: DuckDB shipped a client-server mode this week
- Full-duplex voice models emerge as distinct architecture class: TML-Interaction-Small reports 0.40s turn-taking latency vs. 0.57s (Gemini Flash Live) and 1.18s (GPT-Realtime-2.0), a 3x gap over OpenAI on the naturalness metric
  Source: TML is reporting 0.40 seconds of full-duplex latency
- Duolingo publicly pegs AI-generated content 'slop' at ~20% requiring human QC, and reversed its blanket 'evaluate employees on AI usage' policy after observing performative adoption without productivity lift
  Source: Duolingo's twenty percent AI slop rate
- LLM-as-a-Verifier eliminates the tie problem plaguing LLM-as-a-Judge by decomposing into repeated binary verifications with token-level scoring: one-day harness rewrite to measure tie-rate and CI-width improvement
  Source: An Ollama endpoint exposed to the public internet gets picked up by Shodan in about three hours
- Affirm claims transformer-based underwriting outperforms legacy GBMs across 27M consumers, but new PCAOB guidance requires deterministic execution and tamper-evident audit trails that transformers don't natively provide
  Source: The transformer underwriting models are outperforming

◆ Bottom line

The take.

Anthropic killed the flat-rate developer discount, admitted an 8x capacity planning miss, and leased a competitor's entire Colossus 1 cluster to keep the lights on, all while OpenAI is paying you to evaluate the alternative. Meanwhile, 59% of production tokens are agentic but most eval harnesses are still single-turn, and AI models just cleared the autonomous-exploit-chain threshold that was theoretical six months ago. Three things need to change before June 15: re-run Claude unit economics against the new metering, rebuild evals around trajectories rather than turns, and add a cyber-capability tier to your model release gate.
