Edition 2026-05-19 · read as Data Science
Anthropic Ends Claude Flat-Rate, Agent Stacks Face 3-5x Bill
Sources: 36 · Words: 1,722 · Read: 9 min
Topics: Agentic AI · LLM Inference · AI Capital
◆ The signal
Anthropic quietly killed the flat-rate Claude developer subsidy — subscriptions now convert to dollar-matched API credits, metering every Agent SDK, GitHub Action, and batch eval job at list price. This eliminates the 70-90% effective discount power users had been getting. OpenAI dropped a 2-month-free Codex enterprise switch promo the same day, and Vercel's production data shows 59% of all tokens are now agentic. If you haven't re-priced your Claude-dependent agent stack this sprint, you're making a pricing decision by default while the cost model underneath it just changed by 3-5x.
◆ INTELLIGENCE MAP
01 Anthropic's Triple Squeeze: Metered Credits, 80x Capacity Miss, June 15 Cliff
act now: Anthropic metered programmatic Claude usage at list-rate API credits, admitted an 80x-vs-10x capacity planning miss, leased xAI's 220K-GPU Colossus 1 cluster, and will split third-party tool billing on June 15. ServiceNow burned its full-year Claude budget by May. Enterprise share flipped to 34.4% vs OpenAI's 32.3%.
- B2B share lead
- Colossus GPUs leased
- ARR growth (4 mo)
- Opus 4.7 image cost
- IPO target
02 59% Agentic: Eval Harnesses and Cost Models Are Measuring the Minority
act now: Vercel's AI Gateway puts agentic workloads at 59% of all production tokens. Anthropic captures 61% of spend (via Opus for reasoning), Google captures 38% of volume (via Flash for throughput). Single-turn eval harnesses now score the minority of traffic. Cost models built on 3:1 input-output ratios are off by ~5x for 15:1 agentic traces.
- Anthropic spend share
- Google volume share
- Agentic I/O ratio
- MCP token overhead
03 AI Cyber Capability Crossed AISI Threshold — Harness Dominates Model
monitor: Anthropic's Mythos is the first model to clear both AISI simulated attack ranges (full network takeover). GPT-5.5-cyber cleared one of two. Mozilla's agentic harness found 271 Firefox bugs with the same model that found 1 CVE in curl without one, a 271:1 delta proving harness engineering dominates model selection for vulnerability discovery.
- AISI ranges cleared
- Mozilla bugs found
- curl bugs found
- Palo Alto products scanned
- With harness (Mozilla): 271
- Without harness (curl): 1
04 Training Efficiency: Three Papers Move the Unit Economics
monitor: Nous TST reports 2-3x wall-clock speedup at matched FLOPs (validated 270M→10B MoE). Datology hit +11.7 pts on VLM benchmarks at 17x less training compute via pure data curation. NVIDIA Star Elastic claims 360x cheaper model-family derivation from a single post-training run. All three change the $/capability math for teams running their own training.
- TST wall-clock gain
- Star Elastic savings
- Datology VLM lift
- SWE-ZERO corpus
05 GPU Supply: 4:1 Demand Ratio and the Inference Silicon Fork
background: Nebius reports 4+ customers competing for every GPU online, posting 684% YoY revenue growth with $3-3.4B 2026 guidance. Cerebras IPO'd at $56B (+70% day one) backed by OpenAI's $20B commitment. Cisco AI orders jumping $5B→$9B. The inference hardware layer is diversifying away from Nvidia faster than most capacity plans assume.
- Nebius YoY growth
- Cerebras valuation
- OpenAI-Cerebras deal
- Cisco AI order growth
- Nebius 2025 rev: 530
- Nebius 2026 guide: 3200
◆ DEEP DIVES
01 Anthropic's Metering Cliff: Re-Price Your Agent Stack Before June 15
What Happened
Anthropic shipped three changes that interact badly for anyone running Claude in production. First, subscriptions now convert to dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The implicit 70-90% subsidy on programmatic usage is gone. Second, Dario Amodei conceded the company planned for 10x growth and got 80x, which reframes weeks of degraded Claude Code performance as a capacity miss, not a model regression. Third, starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) moves to a separate credit bucket with no rollover and overflow at API rates.
The capacity patch is leasing xAI's entire Colossus 1 cluster — 220,000+ GPUs spanning H100, H200, and GB200 — from a CEO who publicly insulted Anthropic three months ago. Rate limits are being raised in parallel: Claude Code 5-hour caps doubling, peak-hours throttling removed, Opus API limits "substantially raised."
Why Sources Disagree on Market Position
Seven independent sources reported Anthropic overtaking OpenAI in enterprise adoption (34.4% vs 32.3% per Ramp). What the headline leaves out is what the metric measures: credit-card billing share, not token volume, not workload criticality, not large-enterprise invoiced spend. OpenAI's point that $1M+ ACV accounts pay by ACH, not card, is correct and material. Ramp's own economist separately flagged Opus 4.7 tripling image costs and mounting reliability complaints. The crossover is real for bottom-up developer adoption. It is not yet a statement about the Fortune 500.
ServiceNow burned its full-year Claude budget by May. National Life Group's CIO says Claude is 'not great for companies' wanting per-user monitoring. The frontier model most teams build on has no native cost attribution, no SLAs, and no per-user telemetry.
The Compound Effect
Token consumption in agentic workflows is non-linear. A reflection loop or tool-use chain can 10x spend per task with no proportional quality gain, and the per-task variance is wide enough that an average tells you very little. Remove the subscription subsidy on the same date the vendor still ships no per-tenant attribution, and the failure mode is a silent budget overrun that surfaces in the monthly invoice rather than the observability dashboard. Teams with gateway-level logging in place before June 15 absorb the change. Teams discovering it in the invoice do not.
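The point about averages can be made concrete with a small sketch. The per-task costs below are hypothetical (18 ordinary tasks plus two runaway reflection loops), and the nearest-rank percentile is one convention among several; the aim is only to show how far the mean sits from both the typical task and the tail.

```python
import statistics

def spend_summary(task_costs_usd):
    """Summarize per-task spend with tail-aware statistics."""
    ordered = sorted(task_costs_usd)
    # Nearest-rank 95th percentile: ceil(0.95 * n), computed with integer math.
    rank = max(1, -(-95 * len(ordered) // 100))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }

# Hypothetical costs in USD: most tasks are cheap, two loops ran 10x over.
costs = [0.10] * 18 + [1.00, 1.00]
summary = spend_summary(costs)
print(summary)
```

With this input the mean lands at roughly twice the median while the p95 is ten times the median, which is why a budget alert keyed to the average alone would likely fire late.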
| Surface | Before | After (May-June 2026) |
| --- | --- | --- |
| Agent SDK / claude-p | Flat subscription covers heavy use | Dollar-matched API credits, then list rate |
| Third-party tools (Zed, etc.) | Covered by plan | Separate bucket, no rollover (June 15) |
| Claude Code caps | 5-hour limit, peak throttled | Doubled, throttling removed |
| Opus API rate limits | Constrained during crunch | 'Substantially raised' post-Colossus |

Action items
- Audit every Claude-backed workload (Agent SDK, claude-p, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap by end of this sprint
- Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily budget alerts before June 15
- Run OpenAI's 2-month-free Codex enterprise switch promo as a controlled head-to-head on your actual eval harness with matched prompts and tool schemas
- Re-baseline all Claude benchmarks (throughput, p95 latency, rate-limit headroom) after Colossus 1 integration stabilizes — do not commit to workarounds built against the degraded period
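
The tagging and budget-alert item above can be sketched independently of any particular gateway. This is a minimal illustration, not LiteLLM's or Portkey's API: the `Usage` record, the `PRICE_PER_MTOK` rates, and the budget value are all hypothetical placeholders to be replaced with whatever the gateway logs and the contract specifies.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Usage:
    """One logged call, tagged by who made it and for which feature (hypothetical schema)."""
    user: str
    feature: str
    input_tokens: int
    output_tokens: int

# Hypothetical list prices in USD per million tokens; substitute real rates.
PRICE_PER_MTOK = {"input": 5.00, "output": 25.00}

def cost_usd(u: Usage) -> float:
    return (u.input_tokens * PRICE_PER_MTOK["input"]
            + u.output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

def daily_spend_by_tag(records):
    """Aggregate one day's spend per (user, feature) tag."""
    totals = defaultdict(float)
    for u in records:
        totals[(u.user, u.feature)] += cost_usd(u)
    return dict(totals)

def over_budget(totals, daily_budget_usd):
    """Return the tags whose spend exceeds the daily budget."""
    return sorted(tag for tag, spent in totals.items() if spent > daily_budget_usd)

records = [
    Usage("alice", "pr-review", 2_000_000, 100_000),
    Usage("alice", "pr-review", 1_000_000, 50_000),
    Usage("bob", "batch-eval", 200_000, 20_000),
]
totals = daily_spend_by_tag(records)
alerts = over_budget(totals, daily_budget_usd=10.0)
print(alerts)
```

Keying the totals on a (user, feature) pair is the design choice that matters here: it gives the per-tenant attribution the vendor does not provide, at whatever granularity the gateway can tag.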
Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel's AI Gateway production index
02 59% Agentic: Your Eval Harness and Cost Model Just Became Minority Instruments
The Production Reality
One disclaimer up front: Vercel's AI Gateway production index is a single source, albeit the only multi-tenant usage dataset with 200K+ teams and 7 months of data. It puts agentic workloads at 59% of all token volume. Six months ago that number was under 20%. The composition shift is the fastest since the move from completion to chat endpoints, and it happened while most eval harnesses were still scoring single-turn responses against reference answers.
Three structural mismatches follow. First, cost models are off by ~5x: agentic traces run 15:1 input-to-output versus the 3:1 most forecasts assume, with heavy cache reuse on some providers and none on others. The index does not say which providers, so the 5x is a population estimate, not a per-vendor one. Second, eval harnesses measure the minority: single-turn accuracy at 90%+ hides the planner that burns 40K tokens arguing with itself before giving up. Third, the spend/volume split is the routing signal: Anthropic captures 61% of dollars via Opus (reasoning), Google captures 38% of tokens via Flash (throughput), with no vendor loyalty observed.
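
One way to read the ~5x is as the input-token multiple when output tokens are held fixed (15:1 versus 3:1). The sketch below works that arithmetic under stated assumptions: no cache discounts, and two hypothetical price pairs in arbitrary per-token units. Under those assumptions the dollar multiple depends on how output is priced relative to input, so it is worth recomputing with real rates.

```python
def cost_per_task(output_tokens, io_ratio, price_in, price_out):
    """Cost of one task given its output tokens and input:output ratio (per-token prices)."""
    input_tokens = output_tokens * io_ratio
    return input_tokens * price_in + output_tokens * price_out

def underestimate_factor(price_in, price_out, assumed_ratio=3, actual_ratio=15):
    """How far a forecast built on assumed_ratio undershoots a trace at actual_ratio."""
    assumed = cost_per_task(1_000, assumed_ratio, price_in, price_out)
    actual = cost_per_task(1_000, actual_ratio, price_in, price_out)
    return actual / assumed

# Hypothetical price pairs, no caching applied.
input_only = underestimate_factor(price_in=1.0, price_out=0.0)  # counts input tokens only
output_5x = underestimate_factor(price_in=1.0, price_out=5.0)   # output priced 5x input
print(input_only, output_5x)
```

In this sketch the factor is 5.0 when only input tokens are counted and 2.5 when output is priced at five times input; cache reuse would pull the realized number down further on providers that discount cached input.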
The Architecture That Works at Scale
Abridge's disclosed architecture across 80M+ clinical conversations is a second data point on the same pattern: cheap fast model triages, expensive model reasons only when needed, with memory externalized from weights and LLM judges calibrated against human annotators. Two independent sources converging on confidence-gated routing across a constellation of models is enough to call it the de facto production pattern, not a research proposal.
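
A minimal sketch of confidence-gated routing follows. It is not Abridge's implementation: the `route` function, the 0.8 threshold, and the stub models are hypothetical, and how a cheap model produces a usable confidence score is left as an assumption (a calibrated classifier head or a judge score are common choices).

```python
from typing import Callable, Tuple

def route(prompt: str,
          cheap_model: Callable[[str], Tuple[str, float]],
          expensive_model: Callable[[str], str],
          threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, tier); escalate only when cheap-model confidence is below threshold."""
    answer, confidence = cheap_model(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    return expensive_model(prompt), "expensive"

# Stubs standing in for real API calls (hypothetical behavior).
def stub_cheap(prompt):
    # Pretend short prompts are easy and long prompts are uncertain.
    return ("short answer", 0.95 if len(prompt) < 40 else 0.30)

def stub_expensive(prompt):
    return "careful answer"

easy = route("What is 2 + 2?", stub_cheap, stub_expensive)
hard = route("Reconcile these three conflicting clinical notes and explain why.",
             stub_cheap, stub_expensive)
print(easy, hard)
```

Returning the tier alongside the answer makes the escalation rate observable, which is the number that determines whether the routing actually saves money.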
If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out — update the harness before you update the model.
What the MCP Overhead Tells You
Glean reports off-the-shelf MCP burning 30% more tokens than retrieval-tuned knowledge graphs on agentic tasks and losing head-to-head preference comparisons by 2.5x. The numbers are vendor-published with no methodology disclosed, so treat the magnitude as a hypothesis. The directional claim, that naive tool listings balloon context windows while reranked snippets do the same work cheaper, matches what SAP and ServiceNow concluded independently this week. All three converge on Knowledge Graph grounding + MCP-governed execution as the enterprise agent reference architecture.
Multi-agent decomposition adds a dimension worth a separate ablation. Microsoft's MDASH (100+ agents) beat Anthropic's Mythos on CyberGym by decomposing tasks into scan → adversarial debate → PoC exploitation stages. The 100-agent count is a design choice. No ablation isolates the staging from the ensemble size, so it is possible but not established that the staging is doing the work. The decompose-debate-verify pattern is consistent with ensemble priors and is the cheap experiment to run on any workload with auto-verifiable outputs.
Action items
- Add trajectory-level metrics to the eval harness this sprint: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
- Instrument per-node token cost in your agent graph and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models within 2 weeks
- Run a 1-hour spike measuring MCP tool-calling overhead vs. a rerank/KG baseline on 100 replayed production agent traces
- Add LLM-judge-to-human-annotator agreement (Cohen's kappa) as a tracked SLI in the eval pipeline; re-calibrate quarterly
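
Several of the metrics in the list above reduce to a few lines each. The sketch below assumes a hypothetical trajectory format (a list of tool names per run, plus `cost_usd` and `success` fields) and uses the standard two-rater definition of Cohen's kappa; it is a starting point under those assumptions, not a complete harness.

```python
from collections import Counter

def tool_call_precision_recall(predicted, expected):
    """Multiset precision/recall of tool calls made vs. a reference trajectory."""
    matched = sum((Counter(predicted) & Counter(expected)).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(expected) if expected else 0.0
    return precision, recall

def cost_per_successful_task(trajectories):
    """Total spend divided by successes; failed runs still count toward cost."""
    total = sum(t["cost_usd"] for t in trajectories)
    successes = sum(1 for t in trajectories if t["success"])
    return total / successes if successes else float("inf")

def cohens_kappa(judge_labels, human_labels):
    """Agreement between LLM judge and human annotator, corrected for chance."""
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq, human_freq = Counter(judge_labels), Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical data for illustration.
pr = tool_call_precision_recall(["search", "search", "read", "write"],
                                ["search", "read", "deploy"])
cpst = cost_per_successful_task([
    {"cost_usd": 0.5, "success": True},
    {"cost_usd": 1.5, "success": False},
    {"cost_usd": 1.0, "success": True},
])
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(pr, cpst, kappa)
```

Dividing total spend by successes rather than by attempts is deliberate: it charges the cost of failed trajectories to the tasks that did finish, which is closer to what the invoice reflects.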
Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs
03 AISI Range Saturation: Autonomous Cyber Capability Is Now a First-Class Eval Axis
The Threshold Crossing
The UK AI Security Institute evaluated the newest Anthropic Mythos and OpenAI GPT-5.5-cyber on autonomous cyber-offense tasks. Both completed full network takeovers in controlled environments, one capability tier above the prior Mythos generation, which topped out at "advanced persistence." Mythos cleared both of AISI's hardest tests. GPT-5.5-cyber cleared one. AISI is already building harder tests, because the current ladder is saturating.
This is the first time a national evaluator has publicly stated that frontier models can complete an end-to-end attack chain. Neither lab is releasing these variants broadly. Anthropic gates Mythos to select enterprises and government agencies.
The 271:1 Harness Signal
Two teams ran Claude Mythos against C codebases in the same month. Mozilla wrapped a custom agentic harness around existing fuzzing infrastructure and surfaced 271 bugs in Firefox 150, including sandbox escapes, use-after-frees, and race conditions. Daniel Stenberg pointed Mythos at curl with a generic scanner and got exactly 1 low-severity CVE with 4 false positives. His verdict: "primarily marketing."
Same model, two orders of magnitude apart. The variable that moved was the harness. Mozilla's wrapper emits reproducible test cases, scales across ephemeral VMs, and integrates with their security lifecycle. The thing the leaderboard score doesn't tell you is which scaffold the model was wearing. This is the strongest public evidence to date that harness investment dominates model choice by at least 50x on real codebases.
| Dimension | Mozilla + Firefox | Stenberg + curl |
| --- | --- | --- |
| Model | Claude Mythos Preview | Claude Mythos Preview |
| Harness | Custom agentic, fuzzer-integrated | Out-of-box scan |
| True positives | 271 (incl. sandbox escapes) | 1 low-severity CVE |
| False positive rate | Tooling-filtered | ~80% |

Vulnerability discovery just moved from human-weeks to model-minutes. If the patch SLA is not benchmarked against inference time, the defense is tuned to last year's threat model.
Implications for Agent Deployments
Refusal-rate harnesses calibrated on GPT-4-era capability assumptions will produce false negatives against Mythos-class attackers. The miscalibration is structural, not a tuning fix. For anything agentic with tool access, the eval needs a staged rubric covering recon, initial access, lateral movement, persistence, and exfil, not input-side prompt filters scored in isolation. Microsoft's MDASH shipped 16 real Windows fixes off multi-model bug-hunting, which is the load-bearing data point that the capability has crossed the utility threshold for both offense and defense at the same time.
Action items
- Add an autonomous-cyber-capability tier to your model eval harness this quarter — include AISI-style attack-range tasks for any model with tool/shell access
- Spike a domain-specific agentic vulnerability harness on your own codebase modeled on Mozilla's pattern: reproducible test cases + ephemeral VMs + integration into existing signal pipelines
- Instrument agent logs for tool-use patterns matching recon → lateral movement → persistence signatures, not just prompt-injection strings
- Compress critical-patch SLA to model-release cadence rather than CVE-publication cadence
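
The log-instrumentation item above amounts to checking whether an agent's tool calls walk through attack-chain stages in order. A minimal sketch follows; the tool names and the `STAGE_OF` mapping are hypothetical and would need to be built from the tools an agent can actually invoke. Matching is by ordered subsequence, so unrelated calls in between do not break the pattern.

```python
# Hypothetical mapping from tool-call names to attack-chain stages.
STAGE_OF = {
    "nmap_scan": "recon",
    "list_hosts": "recon",
    "ssh_connect": "lateral_movement",
    "smb_copy": "lateral_movement",
    "add_cron_job": "persistence",
    "create_user": "persistence",
}
CHAIN = ("recon", "lateral_movement", "persistence")

def furthest_stage(tool_calls, chain=CHAIN):
    """Count how many chain stages appear, in order, as a subsequence of the log."""
    reached = 0
    for call in tool_calls:
        if reached < len(chain) and STAGE_OF.get(call) == chain[reached]:
            reached += 1
    return reached

def is_suspicious(tool_calls):
    """Flag a session only when the full chain has been walked in order."""
    return furthest_stage(tool_calls) == len(CHAIN)

full_chain = ["read_file", "nmap_scan", "ssh_connect", "read_file", "add_cron_job"]
out_of_order = ["add_cron_job", "nmap_scan"]
print(is_suspicious(full_chain), is_suspicious(out_of_order))
```

Alerting on partial progress (for example, `furthest_stage` reaching 2) is a reasonable variation when earlier warning matters more than a low false-positive rate.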
Sources: CyberScoop: Mythos cleared AISI · The Information AM: full network takeover · Clint Gibler: Mozilla 271 vs curl 1 · The Hacker News: MDASH 16 fixes · Risky.Biz: guardrail bypass economy · Bloomberg Technology: AI cybercrime observed
04 Three Training Efficiency Papers That Change This Quarter's Build Math
Three recipe and curation results worth pricing in
This week's drops share a through-line for teams running their own training or distillation: the marginal dollar in model development has moved from raw compute to recipe and curation. Each result shifts unit economics in a direction that matters, and each carries its own replication risk profile.
1. Nous Token Superposition Training (TST)
Reports 2-3x wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M through 10B-A1B MoE. The thing this measures is pretraining throughput. The thing it doesn't measure is whether the speedup holds at frontier scale under independent replication. Single source, but the mechanistic claim is clean. If it holds, it's a free 2-3x on training runs without touching the serving stack.
2. NVIDIA Star Elastic
Claims one post-training run produces a family of reasoning model sizes at 360x lower cost than pretraining a family, 7x better than SOTA compression. Headline numbers of that size always shrink under independent eval. The question is the floor, not the ceiling. A 30x hold would still restructure how size tiers are produced for routing. A 10x hold probably would not.
3. Datology VLM Curation
Hit +11.7 points on 20 VLM benchmarks at 2B params, beating InternVL3.5-2B by about 10 points at 17x less training compute. Produced a near-frontier 4B model at 3.3x lower response FLOPs than Qwen3-VL-4B. The lever was purely data curation, not architecture or scale. This is the clearest evidence this year that curation dominates compute at the VLM frontier.
| Work | Claim | Validated Scale | Inference Impact | Replication Risk |
| --- | --- | --- | --- | --- |
| Nous TST | 2-3x wall-clock at matched FLOPs | 270M → 10B MoE | None (no arch change) | Medium |
| Star Elastic | 360x cheaper family derivation | Unspecified | Produces size tiers from one run | High |
| Datology VLM | +11.7 pts at 17x less compute | 2B and 4B | 3.3x lower response FLOPs | Medium |

TST is the one to spike first: it's a training recipe with no inference-side tax. If it replicates at even 1.6x, it pays for itself on the next full run.
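
The payback claim can be checked with a short break-even calculation. The sketch assumes training cost scales with wall-clock time (billed GPU-hours), and the $400k run and $50k spike figures are hypothetical placeholders, not numbers from any of the papers.

```python
def savings_usd(full_run_cost_usd, speedup):
    """Spend avoided on one full run if wall-clock shrinks by `speedup`."""
    return full_run_cost_usd * (1 - 1 / speedup)

def pays_for_itself(full_run_cost_usd, spike_cost_usd, speedup):
    """True when one full run's savings cover the cost of the spike."""
    return savings_usd(full_run_cost_usd, speedup) >= spike_cost_usd

def break_even_speedup(full_run_cost_usd, spike_cost_usd):
    """Smallest speedup at which the spike is recovered on a single full run."""
    if spike_cost_usd >= full_run_cost_usd:
        return float("inf")
    return 1 / (1 - spike_cost_usd / full_run_cost_usd)

# Hypothetical budget: $400k full run, $50k spike.
saved = savings_usd(400_000, 1.6)
needed = break_even_speedup(400_000, 50_000)
print(round(saved), round(needed, 3))
```

Under these placeholder numbers a 1.6x replication saves about $150k on the next run, and the spike breaks even at roughly a 1.14x speedup; the conclusion flips only if the spike costs more than about 37.5% of the full run.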
Adjacent: SWE-ZERO-12M-Trajectories
Kevin Li released 112B tokens, 12M trajectories, 122K PRs across 3K repos in 16 languages, positioned as the largest open agentic trace corpus. Useful for SFT, reward-model training, and synthetic eval generation. Open releases at this size tend to acquire licensing friction within months. Worth mirroring before that happens.
Action items
- Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline this month
- Pull SWE-ZERO-12M-trajectories and stand up a preprocessing pipeline (dedup, license filter, language stratification) before licensing frictions accumulate
- Run a data-curation ablation on your VLM pipeline: systematic filtering and reweighting against a compute-matched baseline
- Benchmark Star Elastic's model-family derivation against your current distillation/quantization pipeline once the paper drops
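
The preprocessing item above (dedup, license filter, language stratification) can be sketched as three composable passes. The record fields (`text`, `license`, `language`), the allowed-license set, and the per-language cap are assumptions for illustration; the actual SWE-ZERO schema is not described here and may differ.

```python
import hashlib
from collections import defaultdict

# Assumption: the allow-list would come from legal review, not from this sketch.
ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}

def dedup(records):
    """Drop exact-duplicate texts, keeping the first occurrence."""
    seen, kept = set(), []
    for r in records:
        digest = hashlib.sha256(r["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(r)
    return kept

def license_filter(records, allowed=ALLOWED_LICENSES):
    """Keep only records whose license is on the allow-list; missing licenses are dropped."""
    return [r for r in records if (r.get("license") or "").lower() in allowed]

def stratify_by_language(records, per_language_cap):
    """Cap each language's share so dominant languages do not swamp the sample."""
    buckets = defaultdict(list)
    for r in records:
        if len(buckets[r["language"]]) < per_language_cap:
            buckets[r["language"]].append(r)
    return dict(buckets)

def preprocess(records, per_language_cap=2):
    return stratify_by_language(license_filter(dedup(records)), per_language_cap)

sample = [
    {"text": "a", "license": "MIT", "language": "python"},
    {"text": "a", "license": "MIT", "language": "python"},      # exact duplicate
    {"text": "b", "license": "GPL-3.0", "language": "python"},  # not on allow-list
    {"text": "c", "license": "Apache-2.0", "language": "go"},
    {"text": "d", "license": "mit", "language": "python"},
    {"text": "e", "license": "mit", "language": "python"},      # over the cap
    {"text": "f", "license": None, "language": "rust"},         # unknown license
]
result = preprocess(sample)
print({lang: [r["text"] for r in rs] for lang, rs in result.items()})
```

Exact-hash dedup only catches byte-identical texts; near-duplicate detection (for example MinHash) would be the natural next step at corpus scale.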
Sources: Claude just metered your agent SDK calls · DuckDB shipped a client-server mode
◆ QUICK HITS
DuckDB shipped Quack HTTP client-server mode — single-node Spark/Glue jobs under ~100GB now have a clean migration path to ECS Fargate + DuckDB + Terraform at 50%+ cost reduction
DuckDB shipped a client-server mode this week
Update: Apache Iceberg (CVE-2026-42812, CVSS 9.9) lets attackers redirect table metadata writes to attacker-controlled S3 — next query reads poisoned Parquet, next training run ingests corrupted features
LiteLLM landed in the KEV catalog this week
TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini and 1.18s GPT-Realtime — a 3x spread on the metric that determines perceived voice-agent naturalness
TML is reporting 0.40 seconds of full-duplex latency
Affirm says transformer-based underwriting outperforms legacy GBMs across 27M consumers — first production signal that sequence models beat tabular trees in consumer credit, but PCAOB now requires deterministic execution for regulated ML
The transformer underwriting models are outperforming
Only 15% of organizations have the data foundation for agentic AI at scale (Fivetran); data quality/lineage is the #1 blocker cited by ~50% — scope agent projects as data-platform projects with an agent on top
DuckDB shipped a client-server mode this week
LLM-as-a-Verifier beats LLM-as-a-Judge on tie-rate and decision accuracy by decomposing criteria into repeated binary verifications with token-level scoring — a one-day harness rewrite to test
An Ollama endpoint exposed to the public internet
Duolingo publicly pegs AI-generated content 'slop' at ~20% requiring human QC and reversed its blanket 'evaluate employees on AI usage' policy after observing performative adoption without productivity lift
Duolingo's twenty percent AI slop rate
AI agents bypass legacy bot detection at 81% success rate — any user-facing surface feeding experiments or online learning is ingesting traffic the classifiers cannot see
SAP's Autonomous Enterprise and ServiceNow's Action Fabric
Microsoft agent memory architecture (consolidation + forgetting + delayed maturation) stabilizes at 400-500 memories with 97.2% retention precision — concrete alternative to bloated prompts or flat vector top-k
DuckDB shipped a client-server mode this week
Persona drift in LLM agents is measurable within 8 conversational turns (Li et al., COLM 2024) — add a distinctive verbal tic as a regex-based drift canary before any long-session agent ships
AI personas drift within eight turns
◆ Bottom line
The take.
Anthropic killed the flat-rate Claude subsidy the same week Vercel's production data showed 59% of all tokens are agentic — meaning your cost model is wrong by the subscription change AND by the workload-mix shift simultaneously. The teams that survive this quarter are the ones that meter at the gateway, route by task difficulty, and never let a single vendor's pricing change become a sprint-level emergency because they already have the abstraction layer. If you're still calling anthropic.messages.create directly with no fallback, that's no longer technical debt — it's an unhedged financial position.
Frequently asked
- What exactly changed with Anthropic's pricing, and when does it take effect?
- Subscriptions now convert to dollar-matched API credits across the Agent SDK, claude-p, GitHub Actions, and third-party harnesses, eliminating the prior 70-90% effective subsidy on programmatic usage. The change is already live, and on June 15 third-party tools (Zed, Conductor, OpenCode, T3 Code) move to a separate credit bucket with no rollover and overflow billed at API list rates.
- How should I instrument cost attribution given Claude has no native per-user telemetry?
- Deploy an LLM gateway like LiteLLM or Portkey in front of every Claude-backed call, with per-user, per-feature tagging and daily budget alerts wired in before June 15. Anthropic ships no native per-tenant attribution, so without gateway-level logging, overruns surface in the monthly invoice rather than your observability dashboard.
- Why do single-turn eval scores no longer reflect production reality?
- Vercel's production index shows agentic workloads now account for 59% of all token volume, with 15:1 input-to-output ratios versus the 3:1 most cost models assume. A planner that scores 90%+ on single-turn accuracy can still burn 40K tokens arguing with itself before failing, so trajectory-level metrics — tool-call precision, steps-to-completion, cost-per-successful-task, error recovery — are required to see where the bill actually lives.
- How much does the eval harness matter relative to model choice for vulnerability discovery?
- By at least 50x on real codebases. The same Claude Mythos Preview surfaced 271 true-positive bugs in Firefox 150 under Mozilla's custom agentic harness with reproducible test cases and ephemeral VMs, but produced only 1 low-severity CVE and roughly 80% false positives against curl with an out-of-box scanner. Harness investment dominates model selection on agentic security workloads.
- Which training-efficiency result should I prioritize spiking first?
- Nous Token Superposition Training, because it claims 2-3x pretraining wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M up to 10B-A1B MoE. Even if it replicates at only 1.6x on a 1B-param continued-pretraining run, it pays for itself on the next full run without touching the serving stack — unlike Star Elastic's 360x family-derivation claim, which carries much higher replication risk.