Edition 2026-05-23 · read as Data Science
AnthropicEndsClaude'sHidden70-90%ProgrammaticDiscount
- Sources
- 36
- Words
- 1,633
- Read
- 8min
Topics Agentic AI LLM Inference AI Regulation
◆ The signal
Anthropic converted Claude subscriptions to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses, which retires the implicit 70-90% programmatic discount that a lot of teams quietly built their unit economics on. OpenAI posted a 2-month-free Codex enterprise switch promo into the same news cycle, which is the playbook we have watched both vendors run before. Workloads not reconciled against the new credit cap will run 3-5x last week's invoice. That is a pricing decision either way.
◆ INTELLIGENCE MAP
01 Anthropic's Triple Squeeze: Metering + 80x Capacity Miss + June 15 Cliff
act nowAnthropic metered programmatic usage, admitted an 8x capacity planning miss (80x growth vs 10x planned), leased xAI's entire 220K-GPU Colossus 1 cluster, and will split third-party tool credits on June 15. ServiceNow already burned its full-year Claude budget. OpenAI is pricing a counter-offensive.
- Growth vs plan
- Colossus GPUs
- Ramp share lead
- June 15 deadline
- Planned growth10
- Actual growth80
02 59% Agentic Tokens: Your Eval Harness Measures the Minority
monitorVercel's AI Gateway production index shows 59% of all tokens are now agentic multi-turn traffic. Anthropic captures 61% of spend (Opus), Google 38% of volume (Flash). Single-turn eval harnesses, cost models built on 3:1 I/O ratios, and per-request routing assumptions are measuring last year's workload shape.
- Agentic token share
- Anthropic spend share
- Google volume share
- MCP token overhead
03 AI Cyber Capability Crosses AISI Threshold: Harness Dominates Model
monitorMythos is the first model to clear both AISI simulated attack ranges. Mozilla's custom harness found 271 Firefox bugs with Claude Mythos vs 1 CVE from an out-of-box scan on curl — same model, 271:1 yield difference. MDASH shipped 16 real Windows patches. Google detected AI-built cybercrime tooling in the wild.
- Mozilla bugs found
- curl bugs found
- MDASH Windows fixes
- AISI ranges cleared
04 Training Efficiency: 2-360x Compute Reductions Ship This Quarter
backgroundNous TST delivers 2-3x wall-clock speedup at matched FLOPs with no inference architecture change (validated 270M→10B). NVIDIA Star Elastic claims 360x cheaper model-size families from one post-training run. Datology beats InternVL3.5-2B by 10 pts at 17x less compute via curation alone.
- TST speedup
- Star Elastic saving
- Datology compute cut
- SWE-ZERO corpus
05 Lakehouse & Orchestrator CVEs: Iceberg/Polaris CVSS 9.9
act nowApache Iceberg (CVE-2026-42812) lets attackers redirect table metadata to attacker-controlled storage — poisoning training data silently. Apache Polaris has three CVSS 9.9 credential-broadening bugs. Argo CD 9.6 exposes plaintext K8s secrets. PraisonAI agent framework was weaponized in 4 hours post-disclosure.
- Iceberg CVSS
- Polaris CVEs
- PraisonAI time-to-exploit
- NGINX RCE age
- 01Iceberg metadata redirect9.9
- 02Polaris cred broadening9.9
- 03Argo CD secret leak9.6
- 04NGINX rewrite RCE9.1
◆ DEEP DIVES
01 Anthropic's Triple Squeeze: Metering, Capacity Crisis, and the June 15 Pricing Cliff
What Changed This Week
Three distinct Anthropic moves converged into a single cost event for any team running Claude programmatically:
- Programmatic usage is now metered. Claude subscriptions convert to dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The implicit 70-90% effective discount that alt-harness users had been getting is dead.
- The 80x capacity admission. Dario Amodei conceded at Code with Claude on May 6 that Anthropic planned for 10x growth and hit 80x. The emergency patch: leasing xAI's entire Colossus 1 cluster (220,000+ GPUs) spanning H100, H200, and GB200.
- June 15 third-party tool split. Starting in 30 days, Claude usage through Conductor, Zed, OpenCode, and T3 Code gets a separate credit bucket. No subsidized tokens, no rollover, overflow bills at API rates.
The Cross-Source Pattern
Multiple independent sources confirm this is a coordinated pricing tightening ahead of Anthropic's October IPO. ServiceNow's CDIO has already burned the full-year Claude budget by May. National Life Group's CIO publicly called Claude 'not great for companies' wanting per-user monitoring. Anthropic provides no native per-user telemetry and no SLAs — unusual for a dependency on the critical path of production features.
The vendor that captures 34.4% of enterprise spend cannot tell customers which user burned the tokens.
Meanwhile, OpenAI dropped a 2-month-free Codex enterprise switch promo the same day Anthropic metered usage. Ramp's April data shows Anthropic edging OpenAI 34.4% vs 32.3% — the first lead change. This is OpenAI pricing a counter-offensive against the exact developers Anthropic just alienated.
What Benchmarks From Before May 7 Are Now Stale
Surface Before After (May 7-14) Claude Code (Pro/Max) 5-hour limit Doubled Peak-hours throttle Reduced limits Removed Opus API rate limits Squeezed during crunch 'Substantially raised' Fleet composition Anthropic-managed Heterogeneous (incl. GB200) Any architectural decision made against April numbers — aggressive caching, prompt compression, provider migration — is calibrated to conditions that no longer exist.
The Practical Implications
Token consumption in agentic workflows is non-linear. A reflection loop or tool-use chain can 10x spend per task without proportional quality gain. With no native cost attribution at the user or prompt level, the overage shows up the way it showed up at ServiceNow — after the money is gone. The workaround is entirely on the customer: gateway-level logging with per-tenant tagging, daily budget alerts, and hard caps per feature.
Action items
- Audit every Claude-backed workload (Agent SDK, claude-p, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap this sprint
- Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily token budget alerts before June 15
- Run a 2-month head-to-head Codex evaluation under OpenAI's enterprise switch promo with matched prompts and tool schemas
- Re-baseline Claude throughput and latency benchmarks post-Colossus integration before shipping any workaround from the April crunch period
Sources:Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with
02 59% Agentic Tokens, 100% Single-Turn Evals: The Measurement Gap That's Costing You 5x
The Production Reality
Vercel's AI Gateway production index — the first multi-tenant usage snapshot this quarter — puts agentic workloads at 59% of all token volume across 200,000 teams. Six months ago it was under 20%. That composition shift is faster than the completion-to-chat transition, and most production stacks were architected before it started.
The spend-versus-volume split is the routing signal. Anthropic captures 61% of dollars through Opus on reasoning and planning nodes. Google captures 38% of tokens through Flash on throughput and utility calls. Expensive models plan, cheap models fan out. Teams without a tiered router are paying Opus rates for work Flash does at 5-10x lower cost.
Why Single-Turn Evals Are Lying
Most agent eval harnesses still score a single response against a reference answer. That was the right call in 2023. With 59% of traffic now multi-step tool loops, three things break:
- Cost models break. Input-to-output ratios on agentic traces moved from roughly 3:1 to roughly 15:1. A forecast built on last year's ratio is off by about 5x on spend.
- Accuracy masks cost paths. A planner that burns 40,000 tokens arguing with itself and then gives up can still post 90%+ final-answer accuracy. The thing this doesn't tell you is where the bill lives.
- Tool-call reliability is invisible. An end-state-only score cannot distinguish a clean first attempt from a 3-retry recovery, and the production bottleneck is the second one.
Abridge's production architecture across 80M+ clinical conversations is the cleanest validation of the alternative: constellation-of-models with fast/slow routing, LLM judges calibrated against human-annotated ground truth, and memory externalized from weights into event-driven stores.
If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.
The Duolingo Anchor
Duolingo disclosed a ~20% unusable rate for AI-generated content at scale. Production quality numbers at that resolution are rare. They also reversed the blanket 'evaluate all employees on AI usage' policy after observing performative adoption with no productivity lift. Both points calibrate expectations. Well-resourced teams ship 1-in-5 outputs that need human rework, and token-usage-as-KPI is a Goodhart trap.
What the Eval Harness Needs
Metric What It Catches Effort Tool-call precision/recall Wrong tool selections, hallucinated args Days Steps-to-completion Wandering planners, retry storms Days Cost-per-successful-task Efficiency at constant quality Hours Recovery-from-error rate Resilience vs. brittleness Week Per-node model attribution Which step needs Opus vs. Flash Week Action items
- Add trajectory-level metrics (tool-call F1, steps-to-completion, cost-per-successful-task) to eval harness alongside existing single-turn benchmarks this sprint
- Instrument per-node token cost in your agent graphs and route utility calls (summarization, extraction, query rewriting) to Flash/Haiku-class models
- Add LLM-judge-to-human-annotator agreement as a tracked metric, re-calibrated quarterly
- Benchmark your LLM output acceptance rate against Duolingo's 20% slop baseline and adjust HITL staffing accordingly
Sources:Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · Abridge runs model routing across 100M conversations · Duolingo's twenty percent AI slop rate · AI Gateway data puts agentic workloads at fifty-nine percent
03 Mythos Cleared Both AISI Ranges — And the 271:1 Harness Result Proves What Actually Matters
The Capability Threshold
Anthropic's Claude Mythos Preview is the first model to clear both UK AISI simulated attack ranges, a discrete ladder running from 'advanced persistence' to 'full network takeover.' The prior Mythos generation topped out one tier below. GPT-5.5-cyber cleared one of the two. AISI is already building harder tests because the current ones are saturating, which is the more informative signal.
This is not a benchmark delta. It is a discrete unlock: a task no prior model could complete end-to-end is now routinely clearable. Congress is pressing Anthropic through closed-door briefings, and the reported center of gravity is shifting from CISA (defensive) to NSA (offensive/intelligence).
The 271:1 Harness Result
Two teams ran Claude Mythos against large C codebases in the same month. The results diverge by two orders of magnitude:
Dimension Mozilla + Firefox Stenberg + curl Model Claude Mythos Preview Claude Mythos Preview Harness Custom agentic, fuzzer-integrated Out-of-box scan Bugs surfaced 271 (UAFs, sandbox escapes) 5 claimed, 1 real CVE CI integration Yes — patches scanned on landing None Daniel Stenberg's verdict on the curl result: 'primarily marketing.' Mozilla's security team is integrating it into CI for all landing patches. Same model, same weights. The harness is the variable that moved.
This lines up with what Microsoft's MDASH showed separately: 16 real Windows vulnerabilities patched in May Patch Tuesday, found by a multi-model ensemble system. The pattern is that value accrues to the Cartesian product of how you probe × how you mutate × how you measure, not to the weights alone.
When a frontier model yields 271 bugs for one team and 1 CVE for another against the same language, the harness is the product, not the model.
Offensive AI Is Now a Detected Incident Class
Google's threat intelligence team identified a hacking group actively using LLMs to build cybercrime tooling, the first production-grade detection behind the post-Mythos misuse concerns. Palo Alto's AI-driven scanning surfaced serious vulnerabilities across 130+ products. The attacker economics have shifted. Inference is cheap, orchestration is cheap, and the expensive line item was always the operator. The model replaced the operator.
For teams shipping coding agents, the implications are concrete:
- Refusal-rate harnesses measure the wrong bottleneck. The thing a refusal rate doesn't tell you is whether the model can execute the chain. A staged rubric (recon, initial access, lateral movement, persistence, exfil) run against every model upgrade is the closer match.
- Agent trajectory telemetry is security telemetry. If a frontier model can execute a takeover in a lab, misuse attempts in production chain tool calls the same way. Log agent action sequences and train a lightweight classifier on known-bad trajectories.
- Patch SLAs need to benchmark against inference time, not human-weeks. Vulnerability discovery that used to take teams months now takes model-minutes.
Action items
- Add a cyber-capability tier to your model eval harness — include AISI-style staged attack tasks for any model with tool/shell access before next model upgrade
- Spike a domain-specific agentic vuln-discovery harness on one internal service, modeled on Mozilla's pattern (reproducible test cases + ephemeral VMs + integration with existing pipelines)
- Instrument agent action sequences in production and alert on tool-use patterns matching recon→lateral-movement→persistence signatures
- Compress critical-patch SLA from quarterly to monthly cadence and add CVE volume monitoring
Sources:Mythos cleared the AISI attack ranges · Mozilla shipped 271 bugs · CyberGym result multi-agent ensembles · The headline claim is that AI models have reached full network takeover · PraisonAI weaponized within four hours · Google's report of a threat actor using AI
04 Lakehouse Trust Boundary Collapsed: Iceberg, Polaris, and Argo CD Create a Data-Poisoning Path
Three CVEs That Chain Into Training Data Corruption
A new batch of critical CVEs lands on the exact infrastructure most ML teams run in production. These are not generic patch advisories. They target the data layer, which is the trust boundary between what a model trains on and what an attacker can manipulate.
The Attack Chain
- Apache Polaris (CVE-2026-42809/10/11, CVSS 9.9) — Credential-broadening bugs let an attacker with limited catalog access escalate to full S3/GCS credential exposure, enabling cross-tenant data access.
- Apache Iceberg (CVE-2026-42812, CVSS 9.9) — An attacker with table-write permission can redirect metadata pointers to an attacker-controlled S3 prefix, so the next query reads poisoned Parquet and the next training run ingests silently corrupted features.
- Argo CD (CVE-2026-42880, CVSS 9.6) — Read-only users can extract plaintext Kubernetes Secrets. For teams running model services via Argo CD, that means model-registry tokens, HuggingFace PATs, and cloud credentials.
The combined path: compromised analyst notebook, Polaris credential escalation, Iceberg metadata redirect, poisoned training data. The failure mode is silent. Default lakehouse logging covers row changes, not pointer changes, so a row-count and schema check passes cleanly while the bytes have moved underneath.
Additional ML-Stack Vectors
Component CVE / CVSS Impact PraisonAI (agent framework) CVE-2026-44338 / high Auth bypass; exploited in 4 hours post-disclosure NGINX rewrite module CVSS 9.1 Unauthenticated RCE; 18 years latent; affects model-serving ingress n8n workflow orchestrator CVE-2026-42233 / 9.8 SQLi + OAuth token theft Kestra orchestrator CVE-2026-38428 / 9.8 SQLi into pipeline metadata The PraisonAI four-hour exploitation window resets the floor. Agent frameworks, model-serving gateways, and workflow orchestrators all shipped fast, and are now receiving the security attention web frameworks got a decade ago. The thing the CVSS score doesn't measure is blast radius inside an ML platform: a read-only Argo CD bug is a 9.6, but for a team that stores HuggingFace PATs as Secrets, the practical impact is the model registry.
If the reference architecture runs Iceberg, Polaris, Argo CD, or any modern orchestrator, there is patching homework before the next experiment and credential-rotation homework before the next sprint.
Action items
- Patch Iceberg/Polaris catalog configurations immediately: enforce explicit storage credential scoping, add write-path allowlisting for table metadata locations, and audit pointer mutations in catalog logs
- Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every Kubernetes Secret in namespaces it can read — including model-registry tokens and HuggingFace PATs
- Patch NGINX across all inference gateways and ingress-nginx controllers; audit rewrite-module usage in model routing configs
- Inventory all agent frameworks (PraisonAI, LangChain, CrewAI, AutoGen), pin versions, subscribe to CVE feeds, and enforce same-day patching cadence
Sources:LiteLLM landed in the KEV catalog this week · An Ollama endpoint exposed to the public internet · PraisonAI an open-source multi-agent framework · The Hacker News NGINX RCE
◆ QUICK HITS
DuckDB shipped Quack HTTP client-server mode — credible Spark-on-Glue replacement for single-node ETL under ~100GB; benchmark your p95 query before migrating
DuckDB shipped a client-server mode this week
Kafka Share Groups report ~linear 8x throughput at 32 consumers by decoupling parallelism from partition count — spike on your most partition-bound embedding/enrichment consumer first
DuckDB shipped a client-server mode this week
Only 15% of organizations have the data foundation for agentic AI (Fivetran); ~50% cite data quality/lineage as #1 blocker — score target domains before greenlighting agent projects
DuckDB shipped a client-server mode this week
TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s (Gemini Flash) and 1.18s (GPT-Realtime-2.0) — a 3x gap on the metric that determines perceived naturalness in voice agents
TML is reporting 0.40 seconds of full-duplex latency
Update: Exposed AI inference endpoints (Ollama, MCP, LangServe) indexed by Shodan within 3 hours; 23% of honeypot traffic now targets AI-specific paths like /.well-known/mcp.json
An Ollama endpoint exposed to the public internet gets picked up by Shodan in about three hours
Nebius GPU demand ratio at 4:1 (customers per GPU), 684% YoY revenue growth, guiding $3-3.4B in 2026 — lock H2 GPU reservations across 2+ providers before quarterly sellouts
The 4:1 ratio is the headline number
Opus 4.7 tripled image processing cost silently — re-price multimodal inference budgets and eval Gemini/GPT-4V on your actual vision workload
Anthropic passes OpenAI in B2B
SAP (€100M partner fund) and ServiceNow (Action Fabric) both converged on Knowledge Graph + MCP as the enterprise agent architecture — RAG-over-docs losing ground to structured KG grounding
MCP plus knowledge graphs is the combination showing up
AI agents bypass legacy bot detection at 81% success rate — retrain abuse models with agent-generated traffic and add behavioral/request-graph features before next model refresh
MCP plus knowledge graphs is the combination showing up
Gemini reproducibly emits real phone numbers from training data (3 independent cases) — add PII extraction eval suite (canary insertion + divergence attacks) to LLM CI before next release
Gemini is the latest model to surface PII from its training data
New COSO/PCAOB guidance requires deterministic execution and tamper-evident audit trails for ML in regulated finance — audit seed management, GPU non-determinism flags, and model-artifact immutability now
The transformer underwriting models are outperforming
◆ Bottom line
The take.
Anthropic metered your Claude subscriptions overnight, admitted an 8x capacity planning miss, and set a June 15 deadline for third-party tool pricing — all while 59% of production tokens shifted to agentic workloads your single-turn eval harness can't measure, and Apache Iceberg/Polaris shipped CVSS 9.9 bugs that create a silent path from compromised notebook to poisoned training data. The week's action list: reconcile Claude spend before the credit cap bites, add trajectory-level metrics to the eval harness, and patch the lakehouse before the next training run ingests something an attacker put there.
Frequently asked
- How do I figure out if my Claude workloads will hit the 3-5x invoice spike?
- Audit every Claude-backed workload running through Agent SDK, claude-p, GitHub Actions, batch evals, and third-party harnesses, then reconcile projected token burn against the new dollar-matched credit cap. The implicit 70-90% programmatic discount is gone, so any team that budgeted flat subscription cost is silently accruing overage. Deploy an LLM gateway like LiteLLM or Portkey with per-user, per-feature tagging and daily token budget alerts before the June 15 third-party tool split creates a separate, non-rolling credit bucket.
- If 59% of tokens are agentic, what should my eval harness actually measure?
- Add trajectory-level metrics alongside single-turn benchmarks: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate, and per-node model attribution. Single-turn final-answer accuracy hides 40,000-token planner loops and can't distinguish a clean first attempt from a 3-retry recovery. Also instrument per-node token cost so utility calls (summarization, extraction, query rewriting) route to Flash/Haiku-class models while Opus stays on planning.
- Why did the same Claude model find 271 bugs for one team and only 1 CVE for another?
- The harness was the variable. Mozilla wrapped Claude Mythos Preview in a custom agentic, fuzzer-integrated pipeline with reproducible test cases, ephemeral VMs, and CI integration on landing patches, surfacing 271 bugs including UAFs and sandbox escapes. Stenberg ran an out-of-box scan against curl and got 5 claims with 1 real CVE. Same weights, two orders of magnitude difference — value accrues to how you probe × mutate × measure, not to the model alone.
- What's the concrete data-poisoning path through the new Iceberg and Polaris CVEs?
- A compromised analyst notebook uses Polaris credential-broadening bugs (CVE-2026-42809/10/11, CVSS 9.9) to escalate to full S3/GCS credentials, then exploits Iceberg (CVE-2026-42812, CVSS 9.9) to redirect table metadata pointers to an attacker-controlled S3 prefix. The next training run ingests silently poisoned Parquet. Default lakehouse logging tracks row changes, not pointer changes, so row counts and schema checks pass cleanly while the underlying bytes have moved.
- Is the OpenAI Codex 2-month-free promo worth taking seriously as a hedge?
- Yes — it's an asymmetric-payoff free evaluation with a 60-day window, dropped into the news cycle the same day Anthropic metered programmatic usage. Run a head-to-head with matched prompts and tool schemas against your current Claude workflows. Ramp's April data shows Anthropic at 34.4% versus OpenAI at 32.3%, the first lead change, so OpenAI is pricing a counter-offensive at exactly the developers Anthropic just alienated. Worst case you confirm Claude parity; best case you find a migration path before the October IPO pricing settles.
◆ Same day, different angle
Read this day as…
◆ Recent in data science
Keep reading.
- Princeton's ICML 2026 audit added GPT 5.5, Gemini 3.5 Flash, and Claude Opus 4.7 and found zero meaningful reliability improvement over pred…
- Hugging Face Transformers has an RCE path that fires from model config files — not pickle weights — across 2.2 billion installs.
- Anthropic ended the flat-rate Claude subsidy this week.
- Anthropic killed the flat-rate Claude subscription this week.
- Anthropic quietly killed the 70-90% effective discount on programmatic Claude usage — subscriptions now convert to dollar-matched API credit…