Edition 2026-05-31 · read as Data Science
Anthropic Ends Claude Subscription Discount, Leases xAI GPUs
Sources: 36 · Words: 1,723 · Read: 9 min
Topics: Agentic AI · LLM Inference · AI Capital
◆ The signal
Anthropic quietly killed the 70-90% effective discount on programmatic Claude usage — subscriptions now convert to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses — while simultaneously admitting an 80x capacity miss that forced them to lease xAI's entire 220,000-GPU Colossus 1 cluster. OpenAI dropped a 2-month free Codex enterprise switch promo the same day. If you haven't reconciled your Claude token burn against the new credit cap this week, you're making a pricing decision by default.
◆ INTELLIGENCE MAP
01 Anthropic Credit Reset + Capacity Crisis
Act now: Anthropic metered all programmatic Claude usage at API rates, killing the alt-harness subsidy. ServiceNow burned its full-year budget by May. The 80x capacity miss drove an emergency lease of xAI's 220K-GPU Colossus 1 cluster. OpenAI's 2-month free Codex promo targets the exact developers Anthropic just alienated.
[Figure: planned capacity (10) vs. actual demand (80). Stat tiles: planned growth, actual growth, Colossus GPUs, Anthropic B2B share, OpenAI B2B share.]
02 Agentic Traffic Is Now the Majority — Eval Harnesses Measure the Minority
Monitor: Vercel's AI Gateway puts agentic workloads at 59% of all token volume across 200K teams. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Single-turn eval harnesses are now benchmarking the minority of production traffic; trajectory-level metrics are overdue.
[Figure: stat tiles for agentic token share, Anthropic spend share, Google volume share, teams observed.]
03 AI Cyber Capability Crosses Full-Takeover Threshold
Monitor: Anthropic's Mythos is the first model to clear both UK AISI simulated attack ranges, achieving full network takeover in controlled tests. GPT-5.5-cyber cleared one. Separately, Google confirmed a threat actor using AI to build cybercrime tooling in the wild. Patch SLAs calibrated to CVE cadence are now measuring the wrong clock.
[Figure: AISI range results. 01 Mythos (new): full takeover; 02 GPT-5.5-cyber: partial takeover; 03 Mythos (prior): advanced persistence. Stat tiles: Mythos ranges cleared, GPT-5.5-cyber cleared, Palo Alto products scanned, PraisonAI exploit time.]
04 Training Efficiency Breakthroughs: 2-360x Compute Reductions
Monitor: Three research drops change unit economics for anyone pretraining or distilling: Nous TST delivers 2-3x wall-clock speedup at matched FLOPs (validated 270M→10B MoE). Datology beats InternVL3.5-2B by 10pts at 17x less compute via data curation alone. NVIDIA Star Elastic claims 360x cheaper model-family production from one post-training run.
[Figure: stat tiles for TST speedup, Datology compute savings, Star Elastic savings, Datology benchmark lift.]
05 Compute Supply Crunch: 4:1 Demand Ratio, Siting Backlash
Background: Nebius reports 4+ customers competing per GPU at 684% YoY revenue growth. Cerebras IPO'd at $56B with a $20B OpenAI commitment. Utah's 9GW Stratos project faces 4,000 complaints and a referendum. Cisco's AI order guidance jumps from $5B to $9B with explicit memory hardware shortage. H2 training capacity priced on today's availability is mispriced.
[Figure: Nebius revenue, 2025 (530) vs. 2026 guide (3200). Stat tiles: Nebius YoY growth, demand:supply ratio, Cerebras IPO cap, Stratos complaints.]
◆ DEEP DIVES
01 Anthropic's Double Shock: Credit Metering Kills the Subsidy, 80x Miss Forces Colossus Lease
The Pricing Reset
Anthropic converted every Claude subscription into a dollar-matched API credit bucket. The implicit 70-90% discount teams were getting by running Agent SDK, GitHub Actions, or third-party harness workloads against a $200 Max plan is gone. Starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) draws from a separate credit allocation with no rollover and overflow at API rates. Any cost model built before this date is numerically wrong, not approximately wrong.
ServiceNow's CDIO publicly confirmed they burned their full-year Claude budget by May after the price hikes. The thing this doesn't tell you is how much was preventable: Anthropic ships no native per-user, per-tool usage telemetry, and no SLAs on latency or availability. You cannot attribute spend you cannot measure.
The Capacity Story Behind the Price Story
Dario Amodei at Code with Claude admitted they planned for 10x growth and got 80x in revenue and usage. That 8x forecast error explains the degradation reports from the last several weeks. What users were reading as model regressions was a capacity wall. The emergency fix is leasing xAI's entire Colossus 1 cluster (220,000+ GPUs spanning H100, H200, and GB200) from the CEO who called Anthropic 'misanthropic' three months ago.
| Surface | Before | After (May 7-14) |
| --- | --- | --- |
| Claude Code limits | 5-hour cap | Doubled |
| Peak-hours throttle | Reduced for Pro/Max | Removed |
| Opus API rate limits | Squeezed | 'Substantially raised' |
| Fleet composition | Anthropic-managed | Heterogeneous incl. GB200 |

Any Claude benchmark from before May 7 is stale. Re-baseline after the new caps land, not before; otherwise the delta you attribute to a prompt change is mostly capacity noise.
The OpenAI Counter-Offensive
Hours after the metering announcement, OpenAI dropped a 2-month free Codex enterprise switch promo. Ramp's April data showed the first-ever Anthropic lead at 34.4% vs 32.3%, so OpenAI is pricing a direct assault on the developers Anthropic just alienated. Treat this as an asymmetric-payoff evaluation window: free head-to-head data on workloads you actually run, not on someone else's leaderboard.
What This Means for Your Stack
The combined read across nine sources: single-provider Claude dependency carries unpriced risk, and the forecast-error bound on that risk is now 8x. Anthropic is targeting an October IPO with a CFO hired specifically for margin improvement. The base rate says pricing stays sticky or rises from here.
Action items
- Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) against the new credit cap and flag jobs that will exhaust credits before month-end
- Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily budget alerts in front of all Claude traffic
- Accept OpenAI's 2-month Codex evaluation under the enterprise switch promo; instrument head-to-head on your eval harness with matched prompts
- Re-run Claude Code and Opus API baselines (throughput, p95 latency, rate-limit headroom) post-Colossus integration before shipping any workarounds designed for the degraded period
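The first audit item reduces to simple arithmetic once spend is tagged. A minimal sketch, assuming your gateway exports one cost record per workload per day and that the credit bucket resets monthly with no rollover; `UsageRecord`, `project_credit_burn`, and the workload tags are hypothetical names for illustration, not part of any Anthropic or gateway API:

```python
import calendar
import math
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class UsageRecord:
    day: date
    workload: str    # hypothetical tag from your gateway, e.g. "github-actions"
    cost_usd: float  # cost at API rates for that workload on that day


def project_credit_burn(records, monthly_credit_usd, today):
    """Linearly project month-to-date spend against a monthly credit bucket."""
    spent = defaultdict(float)
    for r in records:
        same_month = (r.day.year, r.day.month) == (today.year, today.month)
        if same_month and r.day <= today:
            spent[r.workload] += r.cost_usd

    total = sum(spent.values())
    daily_burn = total / today.day  # average over elapsed days, including today
    days_in_month = calendar.monthrange(today.year, today.month)[1]

    # Day of the month on which the bucket runs dry at the current rate, if any.
    exhaustion_day = None
    if daily_burn > 0:
        day = math.ceil(monthly_credit_usd / daily_burn)
        if day <= days_in_month:
            exhaustion_day = day

    return {
        "spent_usd": total,
        "projected_usd": daily_burn * days_in_month,
        "exhaustion_day": exhaustion_day,
        # Largest spenders first, so the jobs to throttle are at the top.
        "by_workload": dict(sorted(spent.items(), key=lambda kv: kv[1], reverse=True)),
    }
```

A linear projection is the crudest reasonable model; bursty batch-eval jobs will need a per-workload schedule instead of a flat daily average.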
Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward... · Anthropic shipped without the telemetry hooks... · Vercel published a number worth sitting with: 59%... · Anthropic passes OpenAI in B2B
02 59% of Production Tokens Are Agentic — Your Eval Harness Is Scoring the Minority
The Production Data
Vercel's AI Gateway index, drawn from 200,000 teams over 7 months, puts agentic workloads at 59% of all token volume. That is measurement, not forecast. Anthropic captures 61% of spend through Opus. Google captures 38% of volume through Flash. The data shows no vendor loyalty. Customers route by task.
The spend-versus-volume gap is the structural read: expensive models do the planning and reasoning nodes, cheap models do the high-throughput utility calls like retrieval rewriting, extraction, and classification. Teams paying Opus rates for every agent step are overspending on the 59% of calls that do not need it.
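To make the overspend concrete, here is a rough sketch that prices one agent trajectory two ways: every step on the expensive tier, versus utility steps routed to a cheap tier. The tier names, per-token prices, and node labels are placeholders chosen for illustration, not quoted vendor rates:

```python
from dataclasses import dataclass

# Hypothetical USD prices per million tokens; substitute your contracted rates.
PRICE_PER_MTOK = {"reasoning-tier": 15.00, "utility-tier": 0.30}

# Node types treated as high-throughput utility calls (assumed labels).
UTILITY_NODES = {"retrieval_rewrite", "extraction", "classification"}


@dataclass
class Step:
    node: str    # label of the agent-graph node that made the call
    tokens: int  # total tokens billed for the call


def route_by_node(node):
    """Send utility nodes to the cheap tier, everything else to the expensive one."""
    return "utility-tier" if node in UTILITY_NODES else "reasoning-tier"


def always_reasoning(node):
    """Baseline: every step pays the expensive rate."""
    return "reasoning-tier"


def trajectory_cost(steps, router):
    """Total USD cost of a trajectory under a given routing policy."""
    return sum(s.tokens / 1_000_000 * PRICE_PER_MTOK[router(s.node)] for s in steps)
```

Running `trajectory_cost` over sampled production traces with both routers gives the savings ceiling before any quality check; whether the cheap tier holds quality on those nodes is a separate eval.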
Why Single-Turn Evals Are Now Dangerously Misleading
Most eval harnesses in production still score single-turn responses against a reference answer. That was the right design in 2023. Once the median request is a multi-step tool loop with retries, the metric you want is different:
| Old metric (single-turn) | New metric (agentic) | Why it matters |
| --- | --- | --- |
| Accuracy on final answer | Cost-per-successful-task | A 40K-token argument with itself costs real money |
| MMLU/HumanEval | Tool-call precision & recall | Wrong tool selection cascades through the trajectory |
| Mean latency | Steps-to-completion | p95 trajectory, not p95 request |
| Pass@1 | Recovery-from-error rate | Real agents fail and retry; pass@1 hides this |

Sayash Kapoor's framing is the cleanest version of this: outcome-only metrics systematically underestimate failure modes in capable agents. Stronger agents surface benchmark bugs and reward-hacking paths that weaker agents never reach. The pass@1 curve flattens at exactly the point where real reliability starts to diverge. The thing pass@1 doesn't measure is the long tail you will actually ship into.
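The agentic metrics are cheap to compute once each run is logged as a trajectory. A minimal sketch, assuming a hypothetical `Trajectory` record with a gold list of expected tool calls per task; the field names and the nearest-rank p95 are illustrative choices, not a standard:

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trajectory:
    succeeded: bool
    cost_usd: float
    steps: int
    expected_tools: list[str]  # gold tool calls for the task (assumed to exist)
    called_tools: list[str]    # tool calls the agent actually made
    errors: int = 0            # failed steps encountered
    recovered: int = 0         # failed steps followed by a successful retry


def cost_per_successful_task(trajs):
    """Total spend, including failed runs, divided by the number of successes."""
    successes = sum(t.succeeded for t in trajs)
    return sum(t.cost_usd for t in trajs) / successes if successes else math.inf


def tool_call_precision_recall(trajs):
    """Micro-averaged precision and recall over multiset tool-call overlap."""
    hits = called = expected = 0
    for t in trajs:
        c, e = Counter(t.called_tools), Counter(t.expected_tools)
        hits += sum((c & e).values())
        called += sum(c.values())
        expected += sum(e.values())
    return (hits / called if called else 0.0,
            hits / expected if expected else 0.0)


def steps_p95(trajs):
    """Nearest-rank 95th percentile of steps-to-completion."""
    if not trajs:
        raise ValueError("need at least one trajectory")
    ordered = sorted(t.steps for t in trajs)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def recovery_rate(trajs):
    """Share of errors the agent recovered from; 1.0 when no errors occurred."""
    errors = sum(t.errors for t in trajs)
    return sum(t.recovered for t in trajs) / errors if errors else 1.0
```

Charging failed runs to the successes is the point of cost-per-successful-task: a cheap agent that fails half the time can cost more per delivered result than an expensive one that rarely fails.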
The Production Reference Architecture
Abridge (80M+ clinical conversations, 250 health systems, $5.3B valuation) has disclosed enough to lift the pattern: cheap fast model for triage, expensive model for reasoning, confidence-gated routing, LLM judges calibrated against human annotators quarterly, memory externalized into event stores. Microsoft's MDASH reports the same decomposition on the security side, with scan, debate, and exploit stages beating monolithic approaches on CyberGym.
A router that treats every call as independent is leaving money and latency on the table once a meaningful fraction of traffic is agentic. Session-aware routing and tool-calling reliability matter more than MMLU deltas.
The Glean benchmark claiming MCP uses 30% more tokens than a retrieval-tuned knowledge graph is vendor-published with no methodology, so treat the magnitude as untrusted. The direction matches what the production traces show: naive tool listings balloon context windows. I would expect the real number to land somewhere south of 30% on a clean rerun, but still positive enough to matter.
Action items
- Add trajectory-level metrics to your eval harness this sprint: task success rate, tool-call F1, steps-to-completion, cost-per-successful-task, recovery-from-error rate
- Instrument per-node token cost in your agent graphs and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models
- Add LLM-judge-to-human-annotator agreement as a tracked SLI; re-calibrate quarterly with Cohen's kappa against gold labels
- Run a 1-hour spike measuring token overhead of your MCP/tool-calling setup vs. a retrieval-first baseline on 100 sampled production traces
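For the judge-calibration item, Cohen's kappa corrects raw judge-human agreement for the agreement expected by chance. A small dependency-free sketch; the function name and label values are illustrative, and an existing library implementation would serve equally well:

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between an LLM judge and a human annotator on the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    jc, hc = Counter(judge_labels), Counter(human_labels)
    expected = sum(jc[k] * hc[k] for k in jc.keys() | hc.keys()) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Tracking kappa instead of raw agreement matters when labels are imbalanced: a judge that always says "pass" can post high raw agreement on a mostly-passing set while scoring a kappa near zero.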
Sources: Agentic traffic crossed fifty-nine percent of tokens... · Vercel published a number worth sitting with: 59%... · The CyberGym result is the kind of finding... · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs is the combination...
03 Apache Lakehouse Stack Under Critical Attack: Iceberg, Polaris, Argo CD
The New CVEs Landing on the Data Stack
This week's advisory cycle concentrates on infrastructure data teams actually run in production. The LiteLLM KEV entry was flagged earlier this week. Three new critical CVEs landed that target lakehouse and MLOps infrastructure directly, which is a different class of problem.
| Component | CVE / CVSS | Impact | Blast Radius |
| --- | --- | --- | --- |
| Apache Iceberg | CVE-2026-42812 / 9.9 | Metadata write redirect to attacker-controlled S3 | Poisoned tables, corrupted training data |
| Apache Polaris | CVE-2026-42809/10/11 / 9.9 | Credential broadening | S3/GCS creds, cross-tenant access |
| Argo CD 3.2.x/3.3.x | CVE-2026-42880 / 9.6 | Missing authorization | Plaintext K8s Secret extraction |
| n8n | CVE-2026-42233 / 9.8 | SQL injection + OAuth theft | Workflow DB, OAuth sessions |
| Kestra ≤1.3.3 | CVE-2026-38428 / 9.8 | SQL injection | Pipeline metadata, schedules |

Why Iceberg CVE-2026-42812 Is the One That Matters
An attacker with table-write permission can redirect metadata pointers at an attacker-controlled S3 prefix. The next query reads poisoned Parquet. The next training run ingests silently corrupted features and produces a model that looks fine on the eval set. Standard lakehouse observability tracks row-level changes, not metadata-pointer changes, so most monitoring stacks will not see this.
Combined with the Polaris credential-broadening issue, the plausible path runs from "compromised analyst notebook" to "cross-tenant data theft."
Draw a reference architecture for a modern data team, throw darts at it, and every throw hits a CVSS of 9.0 or higher.
Argo CD: Model Registry Tokens Are Exposed
The missing-authorization flaw lets low-privilege users extract plaintext Kubernetes Secrets in reachable namespaces. For teams running model services or training jobs through Argo CD 3.2 or 3.3, that set includes model-registry tokens, HuggingFace PATs, database passwords, and cloud credentials. Rotation costs more than the patch. Skipping it is not a defensible decision.
The Pattern
These are not obscure memory-corruption bugs in C libraries. They are authorization failures and unsafe input handling in Python, Go, and Java tools that shipped fast. ML infrastructure was built at startup velocity and is now getting the security attention web frameworks got a decade ago. CISA is tracking AI-infra exploits the way it tracks Exchange or Ivanti. The downstream effect, predictable enough to underwrite, is procurement friction on anything LLM-adjacent for the next few quarters.
Action items
- Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every Kubernetes Secret in namespaces it can read — this week
- Audit Iceberg/Polaris catalog configurations: enforce explicit storage credential scoping and add write-path allowlisting for table metadata locations
- Run a dependency scan for n8n, Kestra, Spring Cloud Config, and Redis across your ML orchestration stack; pin to patched versions
- Add metadata-pointer integrity checks to your lakehouse monitoring — alert on catalog location changes, not just row-level changes
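The pointer-integrity check in the last item can be sketched as a diff between two catalog snapshots. This assumes you can already export a `{table: metadata_location}` mapping from your catalog on a schedule; the snapshot shape, function names, and alert labels are hypothetical, and nothing here calls an Iceberg or Polaris API:

```python
from urllib.parse import urlparse


def location_allowed(location, allowed_prefixes):
    """True if the location sits under one of the allowed scheme/bucket/path prefixes."""
    loc = urlparse(location)
    loc_path = loc.path.rstrip("/") + "/"
    for prefix in allowed_prefixes:
        p = urlparse(prefix)
        base = p.path.rstrip("/") + "/"  # trailing slash blocks "warehouse-evil" lookalikes
        if (loc.scheme, loc.netloc) == (p.scheme, p.netloc) and loc_path.startswith(base):
            return True
    return False


def pointer_alerts(previous, current, allowed_prefixes):
    """Compare two {table: metadata_location} snapshots and list suspicious pointers.

    A new metadata file name per commit is normal for Iceberg tables, so only
    locations outside the allowlist or moved to a different bucket are flagged.
    """
    alerts = []
    for table, loc in current.items():
        if not location_allowed(loc, allowed_prefixes):
            alerts.append((table, "outside-allowlist", loc))
        elif table in previous and urlparse(previous[table]).netloc != urlparse(loc).netloc:
            alerts.append((table, "bucket-changed", loc))
    return alerts
```

The allowlist comparison is on parsed components, not raw string prefixes, so a sibling path such as `warehouse-evil` under the same bucket does not pass as `warehouse`.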
Sources: LiteLLM landed in the KEV catalog this week... · An Ollama endpoint exposed to the public internet... · Agent stacks are now in scope for attackers
04 Training Compute Breakthroughs: TST, Datology, and Star Elastic Change Unit Economics
Three Results That Move the Budget
Three research drops landed the same week, each aimed at a different line item in the training compute bill. Read together, the marginal dollar in model development is moving from raw FLOPs toward training recipes and data curation. That is a claim about where to spend, not a claim that compute stopped mattering.
| Work | Claim | Scale Validated | Inference Impact | Replication Risk |
| --- | --- | --- | --- | --- |
| Nous TST | 2-3x wall-clock at matched FLOPs | 270M → 10B-A1B MoE | None; no architecture change | Medium; single-source, clean claim |
| NVIDIA Star Elastic | 360x cheaper model-family production | Not specified | Produces family of sizes from one run | High; big number, lab-reported |
| Datology VLM | +11.7 pts on 20 benchmarks; 17x less compute | 2B and 4B params | Lower response FLOPs; real serving win | Medium; benchmark-selection risk |

What Each Means for Your Roadmap
TST is the one to spike first. Token Superposition Training is a pretraining recipe change with no inference-side architecture change. If it replicates at even 1.6x on a continued-pretraining run with no val-loss regression, it pays for itself on the next full run. The mechanism — superposing multiple token targets per forward pass — is clean, and the authors validated it from 270M up to 10B MoE. The thing this doesn't tell you is how it behaves on your data mix at your context length, which is where most pretraining recipes lose half their reported gain.
Datology is the clearest evidence this year that the marginal VLM dollar has moved from compute to curation. Their 2B model beats InternVL3.5-2B by about 10 points at 17x less training compute, through data selection and mixture optimization alone. At 4B params, they reach near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B. The training cost number is interesting; the response FLOPs number is what shows up on the serving bill.
Star Elastic's 360x number is the kind of claim that always shrinks under independent eval. Given the setup, I expect it to shrink by roughly an order of magnitude on third-party benchmarks. Even a retained 30x would change how teams produce size tiers. One post-training run yielding a family from 1B to 70B is categorically different from training each size independently.
TST requires no inference-time changes. Datology requires no model architecture changes. Both are 'free' efficiency wins conditional on reproduction — and both are cheap enough to spike this quarter.
The Data Curation Thesis
Datology's result sits alongside the Fivetran readiness index, which finds that only 15% of organizations have the data foundation for agentic AI, with about 50% citing data quality and lineage as the top blocker. The correlation is suggestive, not causal: bigger models still help, and curated data is partially a proxy for teams that know what they are doing. The lakehouse stats gap compounds the problem. Iceberg, Delta, and Parquet treat column-level stats as optional, and stale or missing stats produce plans that cost 3x what they should without surfacing a hard error. That last failure mode is the one to instrument first, because it does not show up on any leaderboard.
Action items
- Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline this quarter
- Run ANALYZE/compute-stats coverage audit across your Iceberg/Delta tables; add stats freshness to table-level SLAs
- If running VLM training: replicate Datology's data-curation-first methodology on a 2B base before scaling compute
- Score your next agent project against data readiness dimensions (quality, lineage, governance) before greenlighting compute spend
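The stats-coverage audit can start as a plain report over whatever column-stats metadata you can export. A minimal sketch, assuming one record per column with the last write time and the last stats computation time; `ColumnStats`, the 24-hour lag threshold, and the finding labels are assumptions for illustration, not an Iceberg or Delta API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ColumnStats:
    table: str
    column: str
    last_write: datetime                   # most recent data write to the table
    stats_computed_at: Optional[datetime]  # None when stats were never computed


def audit_stats(columns, max_lag=timedelta(hours=24)):
    """Flag columns whose stats are missing or older than the last write by max_lag."""
    findings = []
    for c in columns:
        if c.stats_computed_at is None:
            findings.append((c.table, c.column, "missing"))
        elif c.last_write - c.stats_computed_at > max_lag:
            findings.append((c.table, c.column, "stale"))
    return findings


def stats_coverage(columns):
    """Fraction of columns that have any stats at all."""
    if not columns:
        return 0.0
    return sum(c.stats_computed_at is not None for c in columns) / len(columns)
```

Coverage and freshness are reported separately because they fail differently: missing stats are a one-time backfill, while stale stats need a recurring job tied to the write path.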
Sources: Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week
◆ QUICK HITS
- DuckDB shipped Quack HTTP client-server mode: Spark-on-Glue jobs under 100GB are now credible migration candidates to ECS Fargate + DuckDB at 50%+ cost reduction.
  Source: DuckDB shipped a client-server mode this week
- Update: Mythos is the first model to achieve full network takeover on both AISI simulated attack ranges; GPT-5.5-cyber cleared one. Add a staged cyber-capability rubric (recon → lateral movement → persistence → exfil) to agent release gates.
  Source: Mythos cleared the AISI attack ranges this week
- Kafka Share Groups decouple consumer parallelism from partition count with ~linear 8x scaling at 32 instances on I/O-bound workloads. Benchmark on embedding/enrichment consumers first.
  Source: DuckDB shipped a client-server mode this week
- SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages. It is the largest open agentic trace corpus, useful for SFT and reward-model training before licensing frictions arrive.
  Source: Claude just metered your agent SDK calls
- TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini-3.1-flash-live and 1.18s GPT-Realtime-2.0. Full-duplex voice is becoming a distinct model class; add turn-taking latency (user-EOS → first audio byte) to voice eval.
  Source: TML is reporting 0.40 seconds of full-duplex latency
- Duolingo CEO publicly pegged AI-generated content slop at ~20% requiring human QC. Use it as a calibration anchor for your own generation acceptance rate before building custom benchmarks.
  Source: Duolingo's twenty percent AI slop rate is the number worth staring at
- Gemini reproducibly leaks real phone numbers (4 independent incidents). Add a PII extraction eval suite (canary insertion, divergence attacks, membership inference probes) to LLM CI before the next release cut.
  Source: Gemini is the latest model to surface PII from its training data
- LLM-as-a-Verifier (decomposed binary verifications with token-level scoring) outperforms LLM-as-a-Judge on tie-rate and decision accuracy. A one-day rewrite of one pairwise judge is the cheapest variance reduction available.
  Source: An Ollama endpoint exposed to the public internet gets picked up by Shodan...
- Only 15% of organizations have data foundations ready for agentic AI (Fivetran); ~50% cite quality/lineage as #1 blocker. Score target domains against readiness dimensions before greenlighting agent compute.
  Source: DuckDB shipped a client-server mode this week
- Opus 4.7 tripled image processing costs. Re-price your multimodal inference budget and run head-to-head vs GPT-4V and Gemini on your actual image workload this sprint.
  Source: Anthropic passes OpenAI in B2B
◆ Bottom line
The take.
Anthropic killed the programmatic Claude discount (70-90% gone overnight), admitted an 80x capacity miss that forced them to rent a competitor's entire GPU fleet, and still has no native cost-attribution telemetry — while Vercel confirmed 59% of production tokens are now agentic traffic that your single-turn eval harness doesn't measure. The three things to ship this sprint: a gateway with per-feature budget caps, trajectory-level eval metrics, and patches for Iceberg/Argo CD before someone poisons your training data through a CVSS 9.9 you didn't know existed.
Frequently asked
- How do I figure out if my Claude workloads will blow through the new credit cap?
- Put an LLM gateway like LiteLLM or Portkey in front of all Claude traffic and tag every call by user, feature, and tool. Anthropic ships no native per-user, per-tool telemetry, so the attribution layer is yours to build. Once tagged, project current daily burn against the dollar-matched credit bucket and flag any job that exhausts before month-end — third-party tool usage (Zed, Conductor, OpenCode, T3 Code) draws from a separate allocation with no rollover starting June 15.
- Are old Claude benchmarks still valid after the Colossus lease and the new rate caps?
- No — re-baseline after the new caps land. The 80x usage miss caused weeks of capacity-driven degradation that users mistook for model regressions, and Anthropic has since doubled Claude Code limits, removed peak-hours throttling for Pro/Max, and substantially raised Opus API rate limits on a heterogeneous fleet that now includes GB200s. Any throughput, p95 latency, or rate-limit-headroom number from before May 7 mixes capacity noise into whatever delta you're trying to measure.
- Why are single-turn evals insufficient now that 59% of tokens are agentic?
- Single-turn accuracy scores the minority of production traffic and hides the failure modes that matter in tool loops. Trajectory-level metrics — cost-per-successful-task, tool-call precision and recall, steps-to-completion, and recovery-from-error rate — capture what actually breaks: wrong tool selection cascading through a trajectory, 40K-token self-arguments, and pass@1 curves that flatten right where real reliability diverges. Outcome-only metrics also systematically underestimate reward-hacking paths that stronger agents reach.
- Which of the new lakehouse CVEs should I patch first, and why?
- Patch Argo CD (CVE-2026-42880) this week and rotate every Kubernetes Secret in reachable namespaces, because the missing-authorization flaw exposes plaintext model-registry tokens, HuggingFace PATs, DB passwords, and cloud credentials. In parallel, harden Iceberg and Polaris: CVE-2026-42812 lets a write-permitted attacker redirect table metadata to an attacker-controlled S3 prefix, poisoning Parquet files that training runs ingest silently. Standard row-level lakehouse monitoring does not catch pointer mutations, so add catalog-location change alerts.
- Is Token Superposition Training worth a spike this quarter compared to Star Elastic or Datology's VLM result?
- TST is the cleanest spike candidate because it changes only the pretraining recipe — no inference-side architecture change — and was validated from 270M up to a 10B-A1B MoE with 2-3x wall-clock at matched FLOPs. Even a 1.6x replication on a continued-pretraining run with no val-loss regression pays for itself on the next full run. Star Elastic's 360x model-family number will likely shrink by an order of magnitude under independent eval, and Datology's gains are VLM-specific and depend on curation pipelines you may not have.