Edition 2026-05-24 · read as Data Science

Anthropic Ends Claude Subscription Discount, OpenAI Counters

Sources: 36
Words: 1,503
Read: 8min

Topics: Agentic AI · LLM Inference · AI Regulation

◆ The signal

Anthropic killed the 70-90% effective discount on programmatic Claude usage overnight — subscriptions now convert to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses. Hours later, OpenAI dropped a 2-month-free Codex enterprise switch promo. If you haven't reconciled your Claude token burn against the new credit cap this week, you're making a pricing decision by default, and the overrun is already accumulating.

Key facts

Anthropic converted programmatic Claude usage from flat-rate subscriptions to dollar-matched API credits, eliminating a 70-90% effective discount for power users.
Anthropic emergency-leased xAI's Colossus 1 cluster of 220,000+ GPUs after revenue and usage grew 80x against a 10x plan, per Dario Amodei.
Vercel's AI Gateway data across 200,000 teams shows 59% of tokens are now agentic, up from under 20% six months earlier, with Anthropic at 61% of spend and Google at 38% of volume.
Anthropic's Mythos became the first model to clear both UK AISI simulated attack ranges, while Mozilla's custom agentic harness using Mythos surfaced 271 bugs in Firefox 150.
Critical CVEs landed this week including Apache Iceberg CVE-2026-42812 (CVSS 9.9) enabling metadata redirection, Argo CD (CVSS 9.6) exposing plaintext K8s Secrets, and an 18-year latent unauthenticated RCE in the NGINX rewrite module.

◆ INTELLIGENCE MAP

01
Anthropic Pricing Reset + Capacity Crisis
act now
Claude programmatic usage now metered at API rates (was 70-90% discounted). Anthropic's 80x growth vs 10x planned forced a Colossus 1 lease (220K GPUs). ServiceNow blew its full-year Claude budget by May. June 15 brings a separate credit split for third-party tools. OpenAI is counter-offering 2 months free Codex.
Key stat: 80x growth vs plan (capacity)
Sources: 11
Metrics charted: growth vs plan, Colossus GPUs leased, Anthropic B2B share, OpenAI B2B share, Opus 4.7 image cost
Chart: Anthropic B2B share 34.4%; OpenAI B2B share 32.3%
02
59% Agentic Traffic vs. Single-Turn Eval Harnesses
act now
Vercel's AI Gateway production index: 59% of tokens are multi-turn agentic workloads. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Cost models built on single-turn ratios are off by ~5x. Duolingo anchors production slop at 20% — a rare honest quality number from a real deployment.
Key stat: 59% of tokens now agentic
Sources: 5
Metrics charted: agentic token share, Anthropic spend share, Google volume share, Duolingo slop rate, cost model error
Chart: agentic tokens 59%; single-turn tokens 41%
03
Training Efficiency: 2-360x Improvements in One Week
monitor
Three independent results compress training economics. Nous TST delivers 2-3x wall-clock speedup at matched FLOPs with no inference-time change (validated to 10B). NVIDIA Star Elastic produces model-size families from one post-training run at 360x lower cost. Datology beats InternVL3.5-2B by 10 points on 20 VLM benchmarks at 17x less compute via pure curation.
Key stat: 17x less VLM training compute
Sources: 2
Metrics charted: TST wall-clock gain, Star Elastic cost cut, Datology compute cut, Datology benchmark gain, TST scale validated
Chart (efficiency multiple): Nous TST 3x; Datology VLM 17x; Star Elastic 360x
04
AI Cyber Offense Crosses Full-Takeover Threshold
monitor
AISI confirmed Mythos is the first model to clear both simulated attack ranges (full network takeover). Google intercepted AI-built cybercrime tooling in the wild — no longer hypothetical. Mozilla's 271 bugs vs curl's 1 CVE using the same model proves the harness, not the weights, determines vulnerability yield. Expect patch SLAs calibrated to quarterly CVE cadence to fail against model-release cadence.
Key stat: 271 bugs (Mozilla harness)
Sources: 7
Metrics charted: AISI ranges cleared, Mozilla bugs found, curl bugs found, products scanned (Palo), PraisonAI exploit time
Chart (bugs found): Mozilla (custom harness) 271; curl (generic scan) 1
05
Compute Supply Crunch: 4:1 Demand-to-Supply Ratio
background
Nebius posted 684% YoY revenue growth with 4+ customers per GPU and guides $3-3.4B for 2026. Cerebras IPO'd at $56B with a 70% first-day pop; OpenAI committed $20B. The 9GW Stratos data center drew 4,000 complaints and a referendum. H2 training capacity priced on today's availability is probably mispriced.
Key stat: 4:1 demand-to-supply ratio
Sources: 5
Metrics charted: Nebius YoY growth, Nebius 2026 guide, Cerebras market cap, Stratos power need, Cisco AI order growth
Chart ($M, inferred from the $3-3.4B guide): Nebius 2025 revenue 530; Nebius 2026 guide 3,200

◆ DEEP DIVES

Anthropic's Triple Shock: Your Claude Cost Model Broke Overnight

What Happened

Anthropic shipped three pricing changes inside a 48-hour window, and they compound on each other. First, all programmatic subscription usage (Agent SDK, claude -p, GitHub Actions, third-party harnesses) converted from flat-rate to dollar-matched API credits. The 70-90% effective discount power users had been quietly running on is gone. Second, starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) lands in a separate credit bucket with no rollover, and overflow bills at API list rates. Third, Opus 4.7 tripled image processing costs.

The capacity numbers explain the timing. Dario Amodei conceded Anthropic planned for 10x growth and got 80x in revenue and usage. The gap forced an emergency lease of xAI's entire Colossus 1 cluster — 220,000+ GPUs across H100, H200, and GB200. That is roughly 45% of xAI's current capacity changing hands.

Why It Matters Now

ServiceNow's CDIO burned through the full-year Claude budget by May. National Life Group's CIO publicly called Claude 'not great for companies' wanting per-user monitoring. Both complaints route to the same root cause. Anthropic provides no native per-user, per-tool usage telemetry and no SLAs on latency or availability. Customers wire their own observability or fly blind.

If the vendor cannot tell you which user burned the token, the problem is not cost. It is observability, and it is yours to fix before the next invoice.

OpenAI's response shipped the same day: a 2-month-free Codex enterprise switch promotion aimed at the developers Anthropic just repriced. Ramp puts Anthropic at 34.4% of paying businesses against OpenAI at 32.3%, the first crossover. What the share figure cannot show is how it moves once the new pricing hits actual invoices, and that is the measurement that matters. The market is contested. Switching costs are the only moat left.

The Capacity Math

Surface	Before	After (May 7-14)
Claude Code limits	5-hour cap	Doubled
Peak-hours throttle	Reduced limits	Removed
Opus API rate limits	Squeezed	'Substantially raised'
Fleet composition	Anthropic-managed	+220K GPUs via Colossus

Any Claude benchmark from before May 7 is stale. Architectural decisions made on April or early-May numbers describe a system that no longer exists; aggressive caching keyed to the old discount structure is the obvious example. Re-baseline after the new caps land.

Action items

Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap before month-end
Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily token budget alerts this sprint
Run a 2-month Codex evaluation under OpenAI's free enterprise switch promo using matched prompts and your own eval harness
Avoid long-term Anthropic commits (annual contracts, dedicated capacity) until post-Colossus integration stability is observable in 6-8 weeks
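The reconciliation in the first two action items can be sketched in a few lines. This is a minimal illustration, assuming you can export per-request usage records from your own gateway logs; the `UsageRecord` fields, the model names, and every price in `LIST_PRICE_PER_MTOK` are hypothetical placeholders, not Anthropic's published rates or any gateway's schema.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical list prices in USD per million tokens. Replace these with the
# published rates for the models you actually call.
LIST_PRICE_PER_MTOK = {
    "planner-model": {"input": 15.00, "output": 75.00},
    "utility-model": {"input": 1.00, "output": 5.00},
}


@dataclass
class UsageRecord:
    user: str
    surface: str  # e.g. "agent-sdk", "github-actions", "zed"
    model: str
    input_tokens: int
    output_tokens: int


def list_rate_cost(rec: UsageRecord) -> float:
    """Price one usage record at list rates, in USD."""
    price = LIST_PRICE_PER_MTOK[rec.model]
    return (rec.input_tokens * price["input"]
            + rec.output_tokens * price["output"]) / 1_000_000


def reconcile(records, monthly_credit_usd):
    """Return (total list-rate spend, projected overrun, spend per user)."""
    by_user = defaultdict(float)
    for rec in records:
        by_user[rec.user] += list_rate_cost(rec)
    total = sum(by_user.values())
    overrun = max(0.0, total - monthly_credit_usd)
    return total, overrun, dict(by_user)


# Illustrative 30-day export; the numbers are made up.
records = [
    UsageRecord("alice", "agent-sdk", "planner-model", 2_000_000, 100_000),
    UsageRecord("bob", "github-actions", "utility-model", 10_000_000, 1_000_000),
]
total, overrun, by_user = reconcile(records, monthly_credit_usd=40.0)
print(f"list-rate spend ${total:.2f}, projected overrun ${overrun:.2f}")
```

Grouping by `user` is what supplies the per-user view the vendor does not; grouping by `surface` instead would show which harness is driving the overrun.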

Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with · Ramp's AI Index shows Anthropic at 34.4%

59% Agentic Traffic: Your Eval Harness Is Scoring the Minority

The Production Reality

Vercel's AI Gateway production index, across 200,000 teams over 7 months, reports 59% of all tokens are now agentic, meaning multi-turn, tool-calling traces. Six months ago that figure was under 20%. Most cost models in use were fit when input-output ratios sat near 3:1. Agentic traces run closer to 15:1 on input, with heavy cache reuse on some providers and none on others. A forecast built on last year's ratio is off by roughly 5x on spend.

The eval harness mismatch is worse. Most harnesses still score single-turn responses against reference answers. That says nothing about what happens when the median request is a multi-step tool loop with retries, where the failure mode is a planner burning 40,000 tokens arguing with itself before giving up. Final-answer accuracy reads 90%+ in both the healthy and the pathological case. The bill, not the accuracy, is where the failure lives.
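To make the gap concrete, here is a minimal sketch of three trajectory-level metrics: tool-call F1, steps-to-completion, and cost-per-successful-task. The `Trajectory` record and the multiset definition of F1 (tool names only, ignoring order and arguments) are assumptions made for illustration; a real harness would likely score arguments as well.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trajectory:
    expected_tools: list  # tool names a reference solution calls
    called_tools: list    # tool names the agent actually called
    steps: int
    cost_usd: float
    succeeded: bool


def tool_call_f1(expected, called):
    """Multiset F1 over tool names (ignores call order and arguments)."""
    exp, got = Counter(expected), Counter(called)
    overlap = sum((exp & got).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(got.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


def summarize(trajs):
    """Aggregate trajectory metrics over a non-empty list of runs."""
    wins = [t for t in trajs if t.succeeded]
    total_cost = sum(t.cost_usd for t in trajs)
    return {
        "mean_tool_f1": sum(
            tool_call_f1(t.expected_tools, t.called_tools) for t in trajs
        ) / len(trajs),
        "mean_steps_to_completion": (
            sum(t.steps for t in wins) / len(wins) if wins else float("inf")
        ),
        "cost_per_successful_task": (
            total_cost / len(wins) if wins else float("inf")
        ),
    }


# Both runs reach the right answer; only the second one loops on a tool.
healthy = Trajectory(["search", "read"], ["search", "read"], 2, 0.05, True)
looping = Trajectory(["search", "read"], ["search"] * 6 + ["read"], 7, 0.60, True)
summary = summarize([healthy, looping])
print(summary)
```

Final-answer accuracy is 100% for both runs here, while tool-call F1, step count, and cost per success all separate them.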

The Spend-Volume Divergence

Provider	Share of Spend	Share of Volume	Primary Model	Implied Role
Anthropic	61%	—	Opus	Reasoning / planning nodes
Google	—	38%	Flash	High-throughput utility calls
Others	~39%	~62%	Mixed	Mixed

That is a textbook tiered-routing signature. Expensive models handle planning and reasoning; cheap models do retrieval rewriting, extraction, and classification. Running Opus on every agent step leaves 20-40% of inference spend on the table at constant trajectory completion rate. The condition on that estimate is a stable router policy and unchanged tool surface; rewrite either and you re-measure.
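A rough way to size the routing opportunity on your own traces is to re-price one agent run under two routing policies. The sketch below assumes blended per-token prices and a per-step token log; the model names, prices, and token split are all hypothetical and were chosen only to illustrate the arithmetic, so the resulting percentage is not evidence for the 20-40% range.

```python
# Hypothetical blended prices in USD per million tokens.
PRICE_PER_MTOK = {"planner-model": 20.0, "utility-model": 1.0}


def run_cost(steps, route):
    """steps: list of (step_kind, tokens); route: step_kind -> model name."""
    return sum(
        tokens * PRICE_PER_MTOK[route[kind]] for kind, tokens in steps
    ) / 1_000_000


# One illustrative agent run: planning dominates, utility calls are the tail.
steps = [("plan", 60_000), ("extract", 20_000), ("rewrite", 10_000), ("plan", 10_000)]

all_planner = {"plan": "planner-model", "extract": "planner-model", "rewrite": "planner-model"}
tiered = {**all_planner, "extract": "utility-model", "rewrite": "utility-model"}

baseline = run_cost(steps, all_planner)
routed = run_cost(steps, tiered)
savings = 1 - routed / baseline
print(f"baseline ${baseline:.2f}, routed ${routed:.2f}, savings {savings:.1%}")
```

The saving only counts if trajectory completion rate holds under the tiered policy, so this number belongs next to a completion-rate measurement, not in place of one.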

The Duolingo Anchor

Duolingo's CEO pegged AI-generated content 'slop' at ~20% requiring human QC. That is a rare production quality number from a deployment at volume, and it is the anchor an internal harness should be calibrated against. The same company reversed its blanket 'evaluate all employees on AI usage' policy after observing performative adoption without productivity lift. Goodhart's Law arrived on schedule.

LLM-as-Verifier vs. LLM-as-Judge

A methodology paper this week argues that decomposing evaluation into repeated binary verifications at token granularity eliminates the tie-rate problem in pairwise judges. The mechanism is mathematically clean: one high-variance categorical judgment replaced by k lower-variance Bernoulli tests. Worth a one-day swap on one eval pipeline to measure the tie-rate reduction before committing further.
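The variance argument can be illustrated without reference to the paper's actual procedure, which is not reproduced here. The sketch below assumes each response is scored by k independent pass/fail checks and that the score is the pass fraction; the function names are hypothetical, and independence between checks is an assumption that real verifier calls may not satisfy.

```python
from fractions import Fraction


def verifier_score(verdicts):
    """Fraction of binary checks passed (verdicts are 0/1 or bools)."""
    return Fraction(sum(verdicts), len(verdicts))


def compare(verdicts_a, verdicts_b):
    """Pairwise preference from two sets of binary verdicts."""
    a, b = verifier_score(verdicts_a), verifier_score(verdicts_b)
    if a == b:
        return "tie"
    return "A" if a > b else "B"


def bernoulli_mean_variance(p, k):
    """Variance of the mean of k independent Bernoulli(p) checks: p(1-p)/k."""
    return p * (1 - p) / k


# A single coarse judgment might call these two responses a tie;
# four binary checks separate them.
print(compare([1, 1, 0, 1], [1, 0, 0, 1]))
print(bernoulli_mean_variance(0.5, 1), bernoulli_mean_variance(0.5, 10))
```

Ties remain possible whenever both responses pass the same number of checks, which is why the tie rate is worth measuring on your own pipeline rather than assuming it goes to zero.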

If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.

Action items

Add trajectory-level metrics (tool-call F1, steps-to-completion, cost-per-successful-task) to your eval harness alongside single-turn benchmarks this sprint
Instrument per-node token cost in agent graphs and route utility calls (JSON extraction, query rewriting) to Flash/Haiku-class models
Benchmark your LLM output acceptance rate against Duolingo's ~20% slop disclosure; adjust HITL budget if delta exceeds 10pp in either direction
Prototype LLM-as-Verifier on one existing eval pipeline; measure tie-rate and inter-run variance against current judge setup

Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · Duolingo's twenty percent AI slop rate · LLM-as-a-Verifier reframes evaluation

Training Efficiency Moved 2-360x — The Unit Economics of Your Next Run Just Changed

Three Results, Same Direction

Training unit economics shifted on three fronts this week, all pointing the same way. Each result matters for any team running its own pretraining, fine-tuning, or distillation.

Work	Claim	Scale Validated	Inference Impact	Replication Risk
Nous TST	2-3x wall-clock at matched FLOPs	270M → 10B-A1B MoE	None — no arch change	Medium; single-source
NVIDIA Star Elastic	360x cheaper model-family derivation	Not specified	Produces family from one run	High; lab-reported
Datology VLM	+11.7 pts, 17x less compute	2B and 4B params	Lower response FLOPs	Medium; benchmark-selection risk

Which One to Spike First

Token Superposition Training (TST) is the cleanest opportunity. It is a pretraining recipe change with no inference-side impact. If it replicates, it is a free 2-3x on wall-clock with no new serving complexity, since the architecture at serve time is unchanged. The validation range, 270M to a 10B-A1B MoE, covers most continued-pretraining workloads teams actually run.

Star Elastic's 360x is the kind of headline number that always shrinks under independent eval. Given the gap between paper conditions and production data, I would plan around a 30x hold and treat anything above that as upside. Even at 30x, one post-training run producing the full edge/mobile/server family eliminates the multi-run coordination overhead teams currently eat to maintain deployment tiers.

Datology's result is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation. They beat InternVL3.5-2B by ~10 points at 17x less training compute, purely via data selection. The training number alone misses the serving side: the near-frontier 4B model achieves 3.3x lower response FLOPs than Qwen3-VL-4B, which is a real inference-cost win on top of the training-cost win.

The DuckDB + Kafka Infrastructure Shift

Two data-infrastructure releases land on the same architectural assumption. DuckDB's Quack HTTP protocol turns embedded analytics into a proper shared service, and is a credible replacement for Spark-on-Glue for single-node ETL under ~100GB. Kafka Share Groups break the partition==max-consumers ceiling with ~linear 8x scaling at 32 instances on I/O-bound workloads. Combined with the ECS Fargate + Terraform pattern, that is a template for deleting the distributed-compute costume a lot of single-node jobs are wearing.

VLM training value moved from compute to curation. Data pipelines are following a parallel move from cluster orchestration to statistics hygiene.

Action items

Spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOPs baseline this quarter
Audit Glue/EMR job catalog for single-node candidates (<100GB working set) and port one onto ECS Fargate + DuckDB + Terraform
Run ANALYZE/compute-stats coverage audit across Iceberg/Delta tables; add stats maintenance to table-level SLAs
Benchmark Kafka Share Groups against your most partition-bound consumer group (embedding/enrichment workloads first)

Sources: DuckDB shipped a client-server mode this week · Claude just metered your agent SDK calls · DuckDB + Quack plus ECS Fargate/DuckDB/Terraform ETL pattern

AI Cyber Offense Hit Full Network Takeover — Your Eval Harness Doesn't Measure This

The Capability Threshold

UK AISI confirmed that Anthropic's Mythos cleared both simulated attack ranges, the first model to do so. The prior generation topped out at 'advanced persistence.' GPT-5.5-cyber cleared one of two. AISI is already building harder tests because the current ones are saturating, which is what you'd expect when the leaderboard catches up to the eval rather than the underlying capability. Separately, Google's threat-intel team observed a hacking group using AI to build cybercrime tooling in the wild. That is the first production-grade confirmation of agentic misuse moving from red-team hypothesis to detected incident.

The Harness Delta That Matters

Mozilla wrapped a custom agentic harness around existing fuzzing infrastructure and surfaced 271 bugs in Firefox 150, including sandbox escapes, use-after-frees, and race conditions. Daniel Stenberg pointed the same model at curl and got 1 low-severity CVE with 4 false positives. Same weights, two orders of magnitude difference in yield. The variable that moved was the harness.

Dimension	Mozilla + Firefox	Stenberg + curl
Model	Mythos Preview	Mythos Preview
Harness	Custom agentic, fuzzer-integrated	Out-of-box scan
Bugs surfaced	271	5 claimed (4 FP)
True positives	Acted upon	1 low-severity CVE

The implication for anyone shipping agents: eval budget dominates model choice by 50x or more. A headline benchmark rarely says which harness produced it. A week of domain-specific scorer and reproducible test-case emitter work looks like a better trade than the next dollar of inference.

New CVEs Targeting Your Stack This Week

Beyond the previously reported LiteLLM KEV, new critical vulnerabilities landed on infrastructure data teams actually run:

Apache Iceberg (CVE-2026-42812, CVSS 9.9): attackers redirect table metadata writes to attacker-chosen S3 prefixes. Next query reads poisoned Parquet. Next training run ingests corrupted features.
Apache Polaris (CVE-2026-42809/10/11, CVSS 9.9): credential-broadening enables cross-tenant cloud access.
Argo CD (3.2.x/3.3.x, CVSS 9.6): read-only users extract plaintext K8s Secrets including model-registry tokens and cloud credentials.
NGINX rewrite module: 18-year latent unauthenticated RCE affecting every model-serving gateway behind NGINX.
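For the Iceberg item, one defense-in-depth check is to validate every table metadata location against an allowlist of storage prefixes before a catalog or job accepts it. The sketch below is a generic illustration, not the official fix and not an Iceberg or Polaris API; `ALLOWED_PREFIXES`, the bucket names, and `is_allowed_location` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical allowlist. Each prefix ends with "/" so that a sibling path
# such as ".../iceberg-evil/" cannot match ".../iceberg/".
ALLOWED_PREFIXES = ["s3://warehouse-prod/iceberg/"]


def is_allowed_location(location, allowed_prefixes=ALLOWED_PREFIXES):
    """Return True only if the metadata location sits under an allowed prefix."""
    parsed = urlparse(location)
    if parsed.scheme not in ("s3", "s3a"):
        return False
    # Reject path-traversal segments before comparing prefixes.
    if ".." in parsed.path.split("/"):
        return False
    normalized = f"s3://{parsed.netloc}{parsed.path}"
    return any(normalized.startswith(prefix) for prefix in allowed_prefixes)


print(is_allowed_location("s3://warehouse-prod/iceberg/db/events/metadata/v3.metadata.json"))
print(is_allowed_location("s3://attacker-bucket/iceberg/db/events/metadata/v3.metadata.json"))
```

A check like this belongs wherever metadata locations are written or registered; it does not replace applying the vendor patch or rotating credentials.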

Vulnerability discovery just moved from human-weeks to model-minutes. If the patch SLA is not benchmarked against inference time, the defense is tuned to last year's threat model.

Action items

Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every K8s Secret in reachable namespaces this week
Audit Iceberg/Polaris catalog configurations: enforce write-path allowlisting for table metadata locations and rotate exposed credentials
Patch NGINX across all inference gateways; audit rewrite-module usage in model routing configs
Add autonomous-cyber-capability tier to model eval harness (AISI-style attack-range tasks) before next model upgrade
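The Argo CD item is easy to automate across clusters once you have each instance's reported version string. A minimal sketch, using only the patch floors stated above (3.2.12 and 3.3.10); the function names are hypothetical, and release lines outside 3.2.x/3.3.x return `None` because the text does not cover them.

```python
# Patch floors taken from the action item above, keyed by release line.
MIN_PATCHED = {(3, 2): (3, 2, 12), (3, 3): (3, 3, 10)}


def parse_version(version):
    """Turn 'v3.2.11+abc123' or '3.3.10-rc1' into a tuple of ints."""
    core = version.lstrip("v").split("+")[0].split("-")[0]
    return tuple(int(part) for part in core.split("."))


def is_patched(version):
    """True/False for covered release lines, None when the line is unknown."""
    ver = parse_version(version)
    floor = MIN_PATCHED.get(ver[:2])
    if floor is None:
        return None  # not covered here; check the upstream advisory
    return ver >= floor


for v in ["v3.2.11", "v3.2.12+abc123", "3.3.10", "3.1.5"]:
    print(v, is_patched(v))
```

Pre-release suffixes are dropped before comparison, so a release candidate of a fixed version would be reported as patched; treat that case manually.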

Sources: LiteLLM landed in the KEV catalog · Mythos cleared the AISI attack ranges · PraisonAI exploited in 4 hours · Mozilla shipped 271 bugs · Google's report of a threat actor using AI · The UK AISI evaluations report full network takeovers

◆ QUICK HITS

Update: Apache Iceberg/Polaris CVSS 9.9 — metadata write redirect enables silent training data poisoning via attacker-chosen S3 prefixes; new this cycle, not previously reported
LiteLLM landed in the KEV catalog this week
SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages — largest open agentic trace corpus; acquire before licensing frictions appear
Claude just metered your agent SDK calls
Fivetran readiness index: only 15% of orgs have data foundation for agentic AI; data quality/lineage cited as #1 blocker by ~50% — use as gating scorecard before agent project approval
DuckDB shipped a client-server mode this week
TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini-flash-live and 1.18s GPT-Realtime-2.0 — 3x gap on the metric that determines voice-agent naturalness
TML is reporting 0.40 seconds of full-duplex latency
Microsoft's agent memory architecture (consolidation + forgetting + delayed maturation) stabilizes at 400-500 memories with 97.2% retention precision — alternative to flat vector top-k
DuckDB shipped a client-server mode this week
Affirm claims transformer-based underwriting beats legacy GBMs at scale (27M consumers) — but new PCAOB/COSO guidance requires deterministic execution and tamper-evident audit trails that transformers don't satisfy by default
The transformer underwriting models are outperforming
Claude Code /goal command: runs unattended against a stated objective with Haiku-as-evaluator, but termination judges only the transcript — wire PostToolUse hooks to pipe test output before trusting it
Anthropic shipped a /goal command in Claude Code
GPT-5.5 Instant claims 52.5% fewer hallucinated claims vs 5.3 Instant — vendor-reported, no methodology disclosed; run against your internal hallucination set before changing defaults
TML is reporting 0.40 seconds of full-duplex latency
Persona drift measurable within 8 dialogue turns (Li et al., COLM 2024) — add a verbal-tic canary and regex detector to agent logs as a zero-cost drift signal
AI personas drift within eight turns
Gemini reproducibly leaks real phone numbers from training data — multiple independent incidents; add PII extraction eval (canary insertion + divergence attacks) to CI before next release cut
Gemini is the latest model to surface PII
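The persona-drift item above suggests a verbal-tic canary with a regex detector. A minimal sketch, assuming the persona prompt instructs the agent to include a fixed phrase in every turn; the phrase "Ahoy" and the function name are hypothetical choices for illustration.

```python
import re

# Hypothetical verbal tic the persona prompt asks the agent to use every turn.
CANARY = re.compile(r"\bahoy\b", re.IGNORECASE)


def first_drift_turn(agent_turns, canary=CANARY):
    """Return the 1-based index of the first agent turn missing the canary.

    Returns None when every turn still contains it.
    """
    for index, text in enumerate(agent_turns, start=1):
        if not canary.search(text):
            return index
    return None


turns = [
    "Ahoy! I can help with that refund.",
    "Ahoy again, your ticket is open.",
    "Sure, here is the answer you asked for.",
]
print(first_drift_turn(turns))
```

Logging the returned turn index per conversation gives a distribution you can compare against the eight-turn figure cited above.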

◆ Bottom line

The take.

Anthropic killed the flat-rate Claude discount overnight while admitting an 8x capacity-planning miss, 59% of production tokens are now agentic traces your single-turn eval harness can't score, and three training-efficiency results (2-360x) mean the unit economics of your next run are materially different from your last one — the teams that re-price, re-instrument, and re-baseline this sprint will be the ones still shipping when the invoice arrives.

Frequently asked

How do I quickly tell if my Claude programmatic spend is overrunning the new credit cap?

Pull the last 30 days of token burn across Agent SDK, GitHub Actions, and any third-party harnesses (Zed, Conductor, OpenCode, T3 Code), price it at API list rates, and compare to your dollar-matched subscription credits. If list-rate spend exceeds credits, the delta is your projected monthly overrun. Anthropic ships no native per-user telemetry, so the reconciliation has to come from your gateway logs or provider invoices.

Why are single-turn eval scores misleading when most production traffic is agentic?

Single-turn harnesses score final-answer accuracy against a reference, which stays high even when a planner burns 40,000 tokens looping before it converges. With 59% of tokens now in multi-turn tool-calling traces, the dominant failure mode is cost and step-count blowup, not wrong answers. Trajectory-level metrics (tool-call F1, steps-to-completion, and cost-per-successful-task) are required to see it.

Which training-efficiency result is safest to pilot first, and why?

Token Superposition Training is the safest pilot because it is a pretraining recipe change with no inference-side impact. The serving architecture is unchanged, so a failed replication costs only the spike run, while a successful one delivers 2–3x wall-clock at matched FLOPs. It has also been validated from 270M up to a 10B-A1B MoE, which covers most continued-pretraining workloads.

Why does the Mozilla vs. curl bug-yield gap matter for agent design?

Same model weights produced 271 actionable Firefox bugs versus one low-severity curl CVE with four false positives; the only variable was the harness. Mozilla wrapped the model in a fuzzer-integrated agentic loop with reproducible test-case emission; the curl run was an out-of-box scan. The lesson is that eval and harness investment dominates model selection by roughly 50x on yield.

What's the immediate blast radius of the Iceberg and Argo CD CVEs for a data team?

The Iceberg CVE (9.9) lets an attacker redirect table metadata writes to an S3 prefix they control, so the next query or training run silently ingests poisoned Parquet, a feature-store and model-training corruption vector. The Argo CD CVE (9.6) lets read-only users extract plaintext Kubernetes Secrets, which typically include model-registry tokens, object-store keys, and cloud credentials. Both require patching and credential rotation this week, not next sprint.
