Data Science daily

Edition 2026-05-23 · read as Data Science

AnthropicEndsClaude'sHidden70-90%ProgrammaticDiscount

Sources
36
Words
1,633
Read
8min

Topics Agentic AI LLM Inference AI Regulation

◆ The signal

Anthropic converted Claude subscriptions to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses, which retires the implicit 70-90% programmatic discount that a lot of teams quietly built their unit economics on. OpenAI posted a 2-month-free Codex enterprise switch promo into the same news cycle, which is the playbook we have watched both vendors run before. Workloads not reconciled against the new credit cap will run 3-5x last week's invoice. That is a pricing decision either way.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic's Triple Squeeze: Metering + 80x Capacity Miss + June 15 Cliff

    act now

    Anthropic metered programmatic usage, admitted an 8x capacity planning miss (80x growth vs 10x planned), leased xAI's entire 220K-GPU Colossus 1 cluster, and will split third-party tool credits on June 15. ServiceNow already burned its full-year Claude budget. OpenAI is pricing a counter-offensive.

    80x
    growth vs capacity plan
    11
    sources
    • Growth vs plan
    • Colossus GPUs
    • Ramp share lead
    • June 15 deadline
    1. Planned growth10
    2. Actual growth80
  2. 02

    59% Agentic Tokens: Your Eval Harness Measures the Minority

    monitor

    Vercel's AI Gateway production index shows 59% of all tokens are now agentic multi-turn traffic. Anthropic captures 61% of spend (Opus), Google 38% of volume (Flash). Single-turn eval harnesses, cost models built on 3:1 I/O ratios, and per-request routing assumptions are measuring last year's workload shape.

    59%
    tokens now agentic
    6
    sources
    • Agentic token share
    • Anthropic spend share
    • Google volume share
    • MCP token overhead
    1. Agentic traffic59
    2. Single-turn traffic41
  3. 03

    AI Cyber Capability Crosses AISI Threshold: Harness Dominates Model

    monitor

    Mythos is the first model to clear both AISI simulated attack ranges. Mozilla's custom harness found 271 Firefox bugs with Claude Mythos vs 1 CVE from an out-of-box scan on curl — same model, 271:1 yield difference. MDASH shipped 16 real Windows patches. Google detected AI-built cybercrime tooling in the wild.

    271:1
    harness yield gap
    7
    sources
    • Mozilla bugs found
    • curl bugs found
    • MDASH Windows fixes
    • AISI ranges cleared
    1. Mozilla (custom harness)271
    2. curl (out-of-box)1
  4. 04

    Training Efficiency: 2-360x Compute Reductions Ship This Quarter

    background

    Nous TST delivers 2-3x wall-clock speedup at matched FLOPs with no inference architecture change (validated 270M→10B). NVIDIA Star Elastic claims 360x cheaper model-size families from one post-training run. Datology beats InternVL3.5-2B by 10 pts at 17x less compute via curation alone.

    2-3x
    TST wall-clock speedup
    1
    sources
    • TST speedup
    • Star Elastic saving
    • Datology compute cut
    • SWE-ZERO corpus
    1. TST pretraining3
    2. Star Elastic family360
    3. Datology VLM curation17
  5. 05

    Lakehouse & Orchestrator CVEs: Iceberg/Polaris CVSS 9.9

    act now

    Apache Iceberg (CVE-2026-42812) lets attackers redirect table metadata to attacker-controlled storage — poisoning training data silently. Apache Polaris has three CVSS 9.9 credential-broadening bugs. Argo CD 9.6 exposes plaintext K8s secrets. PraisonAI agent framework was weaponized in 4 hours post-disclosure.

    9.9
    Iceberg/Polaris CVSS
    3
    sources
    • Iceberg CVSS
    • Polaris CVEs
    • PraisonAI time-to-exploit
    • NGINX RCE age
    1. 01Iceberg metadata redirect9.9
    2. 02Polaris cred broadening9.9
    3. 03Argo CD secret leak9.6
    4. 04NGINX rewrite RCE9.1

◆ DEEP DIVES

  1. 01

    Anthropic's Triple Squeeze: Metering, Capacity Crisis, and the June 15 Pricing Cliff

    What Changed This Week

    Three distinct Anthropic moves converged into a single cost event for any team running Claude programmatically:

    1. Programmatic usage is now metered. Claude subscriptions convert to dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The implicit 70-90% effective discount that alt-harness users had been getting is dead.
    2. The 80x capacity admission. Dario Amodei conceded at Code with Claude on May 6 that Anthropic planned for 10x growth and hit 80x. The emergency patch: leasing xAI's entire Colossus 1 cluster (220,000+ GPUs) spanning H100, H200, and GB200.
    3. June 15 third-party tool split. Starting in 30 days, Claude usage through Conductor, Zed, OpenCode, and T3 Code gets a separate credit bucket. No subsidized tokens, no rollover, overflow bills at API rates.

    The Cross-Source Pattern

    Multiple independent sources confirm this is a coordinated pricing tightening ahead of Anthropic's October IPO. ServiceNow's CDIO has already burned the full-year Claude budget by May. National Life Group's CIO publicly called Claude 'not great for companies' wanting per-user monitoring. Anthropic provides no native per-user telemetry and no SLAs — unusual for a dependency on the critical path of production features.

    The vendor that captures 34.4% of enterprise spend cannot tell customers which user burned the tokens.

    Meanwhile, OpenAI dropped a 2-month-free Codex enterprise switch promo the same day Anthropic metered usage. Ramp's April data shows Anthropic edging OpenAI 34.4% vs 32.3% — the first lead change. This is OpenAI pricing a counter-offensive against the exact developers Anthropic just alienated.

    What Benchmarks From Before May 7 Are Now Stale

    SurfaceBeforeAfter (May 7-14)
    Claude Code (Pro/Max)5-hour limitDoubled
    Peak-hours throttleReduced limitsRemoved
    Opus API rate limitsSqueezed during crunch'Substantially raised'
    Fleet compositionAnthropic-managedHeterogeneous (incl. GB200)

    Any architectural decision made against April numbers — aggressive caching, prompt compression, provider migration — is calibrated to conditions that no longer exist.


    The Practical Implications

    Token consumption in agentic workflows is non-linear. A reflection loop or tool-use chain can 10x spend per task without proportional quality gain. With no native cost attribution at the user or prompt level, the overage shows up the way it showed up at ServiceNow — after the money is gone. The workaround is entirely on the customer: gateway-level logging with per-tenant tagging, daily budget alerts, and hard caps per feature.

    Action items

    • Audit every Claude-backed workload (Agent SDK, claude-p, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap this sprint
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily token budget alerts before June 15
    • Run a 2-month head-to-head Codex evaluation under OpenAI's enterprise switch promo with matched prompts and tool schemas
    • Re-baseline Claude throughput and latency benchmarks post-Colossus integration before shipping any workaround from the April crunch period

    Sources:Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with

  2. 02

    59% Agentic Tokens, 100% Single-Turn Evals: The Measurement Gap That's Costing You 5x

    The Production Reality

    Vercel's AI Gateway production index — the first multi-tenant usage snapshot this quarter — puts agentic workloads at 59% of all token volume across 200,000 teams. Six months ago it was under 20%. That composition shift is faster than the completion-to-chat transition, and most production stacks were architected before it started.

    The spend-versus-volume split is the routing signal. Anthropic captures 61% of dollars through Opus on reasoning and planning nodes. Google captures 38% of tokens through Flash on throughput and utility calls. Expensive models plan, cheap models fan out. Teams without a tiered router are paying Opus rates for work Flash does at 5-10x lower cost.


    Why Single-Turn Evals Are Lying

    Most agent eval harnesses still score a single response against a reference answer. That was the right call in 2023. With 59% of traffic now multi-step tool loops, three things break:

    • Cost models break. Input-to-output ratios on agentic traces moved from roughly 3:1 to roughly 15:1. A forecast built on last year's ratio is off by about 5x on spend.
    • Accuracy masks cost paths. A planner that burns 40,000 tokens arguing with itself and then gives up can still post 90%+ final-answer accuracy. The thing this doesn't tell you is where the bill lives.
    • Tool-call reliability is invisible. An end-state-only score cannot distinguish a clean first attempt from a 3-retry recovery, and the production bottleneck is the second one.

    Abridge's production architecture across 80M+ clinical conversations is the cleanest validation of the alternative: constellation-of-models with fast/slow routing, LLM judges calibrated against human-annotated ground truth, and memory externalized from weights into event-driven stores.

    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.

    The Duolingo Anchor

    Duolingo disclosed a ~20% unusable rate for AI-generated content at scale. Production quality numbers at that resolution are rare. They also reversed the blanket 'evaluate all employees on AI usage' policy after observing performative adoption with no productivity lift. Both points calibrate expectations. Well-resourced teams ship 1-in-5 outputs that need human rework, and token-usage-as-KPI is a Goodhart trap.

    What the Eval Harness Needs

    MetricWhat It CatchesEffort
    Tool-call precision/recallWrong tool selections, hallucinated argsDays
    Steps-to-completionWandering planners, retry stormsDays
    Cost-per-successful-taskEfficiency at constant qualityHours
    Recovery-from-error rateResilience vs. brittlenessWeek
    Per-node model attributionWhich step needs Opus vs. FlashWeek

    Action items

    • Add trajectory-level metrics (tool-call F1, steps-to-completion, cost-per-successful-task) to eval harness alongside existing single-turn benchmarks this sprint
    • Instrument per-node token cost in your agent graphs and route utility calls (summarization, extraction, query rewriting) to Flash/Haiku-class models
    • Add LLM-judge-to-human-annotator agreement as a tracked metric, re-calibrated quarterly
    • Benchmark your LLM output acceptance rate against Duolingo's 20% slop baseline and adjust HITL staffing accordingly

    Sources:Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · Abridge runs model routing across 100M conversations · Duolingo's twenty percent AI slop rate · AI Gateway data puts agentic workloads at fifty-nine percent

  3. 03

    Mythos Cleared Both AISI Ranges — And the 271:1 Harness Result Proves What Actually Matters

    The Capability Threshold

    Anthropic's Claude Mythos Preview is the first model to clear both UK AISI simulated attack ranges, a discrete ladder running from 'advanced persistence' to 'full network takeover.' The prior Mythos generation topped out one tier below. GPT-5.5-cyber cleared one of the two. AISI is already building harder tests because the current ones are saturating, which is the more informative signal.

    This is not a benchmark delta. It is a discrete unlock: a task no prior model could complete end-to-end is now routinely clearable. Congress is pressing Anthropic through closed-door briefings, and the reported center of gravity is shifting from CISA (defensive) to NSA (offensive/intelligence).


    The 271:1 Harness Result

    Two teams ran Claude Mythos against large C codebases in the same month. The results diverge by two orders of magnitude:

    DimensionMozilla + FirefoxStenberg + curl
    ModelClaude Mythos PreviewClaude Mythos Preview
    HarnessCustom agentic, fuzzer-integratedOut-of-box scan
    Bugs surfaced271 (UAFs, sandbox escapes)5 claimed, 1 real CVE
    CI integrationYes — patches scanned on landingNone

    Daniel Stenberg's verdict on the curl result: 'primarily marketing.' Mozilla's security team is integrating it into CI for all landing patches. Same model, same weights. The harness is the variable that moved.

    This lines up with what Microsoft's MDASH showed separately: 16 real Windows vulnerabilities patched in May Patch Tuesday, found by a multi-model ensemble system. The pattern is that value accrues to the Cartesian product of how you probe × how you mutate × how you measure, not to the weights alone.

    When a frontier model yields 271 bugs for one team and 1 CVE for another against the same language, the harness is the product, not the model.

    Offensive AI Is Now a Detected Incident Class

    Google's threat intelligence team identified a hacking group actively using LLMs to build cybercrime tooling, the first production-grade detection behind the post-Mythos misuse concerns. Palo Alto's AI-driven scanning surfaced serious vulnerabilities across 130+ products. The attacker economics have shifted. Inference is cheap, orchestration is cheap, and the expensive line item was always the operator. The model replaced the operator.

    For teams shipping coding agents, the implications are concrete:

    • Refusal-rate harnesses measure the wrong bottleneck. The thing a refusal rate doesn't tell you is whether the model can execute the chain. A staged rubric (recon, initial access, lateral movement, persistence, exfil) run against every model upgrade is the closer match.
    • Agent trajectory telemetry is security telemetry. If a frontier model can execute a takeover in a lab, misuse attempts in production chain tool calls the same way. Log agent action sequences and train a lightweight classifier on known-bad trajectories.
    • Patch SLAs need to benchmark against inference time, not human-weeks. Vulnerability discovery that used to take teams months now takes model-minutes.

    Action items

    • Add a cyber-capability tier to your model eval harness — include AISI-style staged attack tasks for any model with tool/shell access before next model upgrade
    • Spike a domain-specific agentic vuln-discovery harness on one internal service, modeled on Mozilla's pattern (reproducible test cases + ephemeral VMs + integration with existing pipelines)
    • Instrument agent action sequences in production and alert on tool-use patterns matching recon→lateral-movement→persistence signatures
    • Compress critical-patch SLA from quarterly to monthly cadence and add CVE volume monitoring

    Sources:Mythos cleared the AISI attack ranges · Mozilla shipped 271 bugs · CyberGym result multi-agent ensembles · The headline claim is that AI models have reached full network takeover · PraisonAI weaponized within four hours · Google's report of a threat actor using AI

  4. 04

    Lakehouse Trust Boundary Collapsed: Iceberg, Polaris, and Argo CD Create a Data-Poisoning Path

    Three CVEs That Chain Into Training Data Corruption

    A new batch of critical CVEs lands on the exact infrastructure most ML teams run in production. These are not generic patch advisories. They target the data layer, which is the trust boundary between what a model trains on and what an attacker can manipulate.

    The Attack Chain

    1. Apache Polaris (CVE-2026-42809/10/11, CVSS 9.9) — Credential-broadening bugs let an attacker with limited catalog access escalate to full S3/GCS credential exposure, enabling cross-tenant data access.
    2. Apache Iceberg (CVE-2026-42812, CVSS 9.9) — An attacker with table-write permission can redirect metadata pointers to an attacker-controlled S3 prefix, so the next query reads poisoned Parquet and the next training run ingests silently corrupted features.
    3. Argo CD (CVE-2026-42880, CVSS 9.6) — Read-only users can extract plaintext Kubernetes Secrets. For teams running model services via Argo CD, that means model-registry tokens, HuggingFace PATs, and cloud credentials.

    The combined path: compromised analyst notebook, Polaris credential escalation, Iceberg metadata redirect, poisoned training data. The failure mode is silent. Default lakehouse logging covers row changes, not pointer changes, so a row-count and schema check passes cleanly while the bytes have moved underneath.


    Additional ML-Stack Vectors

    ComponentCVE / CVSSImpact
    PraisonAI (agent framework)CVE-2026-44338 / highAuth bypass; exploited in 4 hours post-disclosure
    NGINX rewrite moduleCVSS 9.1Unauthenticated RCE; 18 years latent; affects model-serving ingress
    n8n workflow orchestratorCVE-2026-42233 / 9.8SQLi + OAuth token theft
    Kestra orchestratorCVE-2026-38428 / 9.8SQLi into pipeline metadata

    The PraisonAI four-hour exploitation window resets the floor. Agent frameworks, model-serving gateways, and workflow orchestrators all shipped fast, and are now receiving the security attention web frameworks got a decade ago. The thing the CVSS score doesn't measure is blast radius inside an ML platform: a read-only Argo CD bug is a 9.6, but for a team that stores HuggingFace PATs as Secrets, the practical impact is the model registry.

    If the reference architecture runs Iceberg, Polaris, Argo CD, or any modern orchestrator, there is patching homework before the next experiment and credential-rotation homework before the next sprint.

    Action items

    • Patch Iceberg/Polaris catalog configurations immediately: enforce explicit storage credential scoping, add write-path allowlisting for table metadata locations, and audit pointer mutations in catalog logs
    • Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every Kubernetes Secret in namespaces it can read — including model-registry tokens and HuggingFace PATs
    • Patch NGINX across all inference gateways and ingress-nginx controllers; audit rewrite-module usage in model routing configs
    • Inventory all agent frameworks (PraisonAI, LangChain, CrewAI, AutoGen), pin versions, subscribe to CVE feeds, and enforce same-day patching cadence

    Sources:LiteLLM landed in the KEV catalog this week · An Ollama endpoint exposed to the public internet · PraisonAI an open-source multi-agent framework · The Hacker News NGINX RCE

◆ QUICK HITS

  • DuckDB shipped Quack HTTP client-server mode — credible Spark-on-Glue replacement for single-node ETL under ~100GB; benchmark your p95 query before migrating

    DuckDB shipped a client-server mode this week

  • Kafka Share Groups report ~linear 8x throughput at 32 consumers by decoupling parallelism from partition count — spike on your most partition-bound embedding/enrichment consumer first

    DuckDB shipped a client-server mode this week

  • Only 15% of organizations have the data foundation for agentic AI (Fivetran); ~50% cite data quality/lineage as #1 blocker — score target domains before greenlighting agent projects

    DuckDB shipped a client-server mode this week

  • TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s (Gemini Flash) and 1.18s (GPT-Realtime-2.0) — a 3x gap on the metric that determines perceived naturalness in voice agents

    TML is reporting 0.40 seconds of full-duplex latency

  • Update: Exposed AI inference endpoints (Ollama, MCP, LangServe) indexed by Shodan within 3 hours; 23% of honeypot traffic now targets AI-specific paths like /.well-known/mcp.json

    An Ollama endpoint exposed to the public internet gets picked up by Shodan in about three hours

  • Nebius GPU demand ratio at 4:1 (customers per GPU), 684% YoY revenue growth, guiding $3-3.4B in 2026 — lock H2 GPU reservations across 2+ providers before quarterly sellouts

    The 4:1 ratio is the headline number

  • Opus 4.7 tripled image processing cost silently — re-price multimodal inference budgets and eval Gemini/GPT-4V on your actual vision workload

    Anthropic passes OpenAI in B2B

  • SAP (€100M partner fund) and ServiceNow (Action Fabric) both converged on Knowledge Graph + MCP as the enterprise agent architecture — RAG-over-docs losing ground to structured KG grounding

    MCP plus knowledge graphs is the combination showing up

  • AI agents bypass legacy bot detection at 81% success rate — retrain abuse models with agent-generated traffic and add behavioral/request-graph features before next model refresh

    MCP plus knowledge graphs is the combination showing up

  • Gemini reproducibly emits real phone numbers from training data (3 independent cases) — add PII extraction eval suite (canary insertion + divergence attacks) to LLM CI before next release

    Gemini is the latest model to surface PII from its training data

  • New COSO/PCAOB guidance requires deterministic execution and tamper-evident audit trails for ML in regulated finance — audit seed management, GPU non-determinism flags, and model-artifact immutability now

    The transformer underwriting models are outperforming

◆ Bottom line

The take.

Anthropic metered your Claude subscriptions overnight, admitted an 8x capacity planning miss, and set a June 15 deadline for third-party tool pricing — all while 59% of production tokens shifted to agentic workloads your single-turn eval harness can't measure, and Apache Iceberg/Polaris shipped CVSS 9.9 bugs that create a silent path from compromised notebook to poisoned training data. The week's action list: reconcile Claude spend before the credit cap bites, add trajectory-level metrics to the eval harness, and patch the lakehouse before the next training run ingests something an attacker put there.

— Promit, reading as Data Science ·

Frequently asked

How do I figure out if my Claude workloads will hit the 3-5x invoice spike?
Audit every Claude-backed workload running through Agent SDK, claude-p, GitHub Actions, batch evals, and third-party harnesses, then reconcile projected token burn against the new dollar-matched credit cap. The implicit 70-90% programmatic discount is gone, so any team that budgeted flat subscription cost is silently accruing overage. Deploy an LLM gateway like LiteLLM or Portkey with per-user, per-feature tagging and daily token budget alerts before the June 15 third-party tool split creates a separate, non-rolling credit bucket.
If 59% of tokens are agentic, what should my eval harness actually measure?
Add trajectory-level metrics alongside single-turn benchmarks: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate, and per-node model attribution. Single-turn final-answer accuracy hides 40,000-token planner loops and can't distinguish a clean first attempt from a 3-retry recovery. Also instrument per-node token cost so utility calls (summarization, extraction, query rewriting) route to Flash/Haiku-class models while Opus stays on planning.
Why did the same Claude model find 271 bugs for one team and only 1 CVE for another?
The harness was the variable. Mozilla wrapped Claude Mythos Preview in a custom agentic, fuzzer-integrated pipeline with reproducible test cases, ephemeral VMs, and CI integration on landing patches, surfacing 271 bugs including UAFs and sandbox escapes. Stenberg ran an out-of-box scan against curl and got 5 claims with 1 real CVE. Same weights, two orders of magnitude difference — value accrues to how you probe × mutate × measure, not to the model alone.
What's the concrete data-poisoning path through the new Iceberg and Polaris CVEs?
A compromised analyst notebook uses Polaris credential-broadening bugs (CVE-2026-42809/10/11, CVSS 9.9) to escalate to full S3/GCS credentials, then exploits Iceberg (CVE-2026-42812, CVSS 9.9) to redirect table metadata pointers to an attacker-controlled S3 prefix. The next training run ingests silently poisoned Parquet. Default lakehouse logging tracks row changes, not pointer changes, so row counts and schema checks pass cleanly while the underlying bytes have moved.
Is the OpenAI Codex 2-month-free promo worth taking seriously as a hedge?
Yes — it's an asymmetric-payoff free evaluation with a 60-day window, dropped into the news cycle the same day Anthropic metered programmatic usage. Run a head-to-head with matched prompts and tool schemas against your current Claude workflows. Ramp's April data shows Anthropic at 34.4% versus OpenAI at 32.3%, the first lead change, so OpenAI is pricing a counter-offensive at exactly the developers Anthropic just alienated. Worst case you confirm Claude parity; best case you find a migration path before the October IPO pricing settles.

◆ Same day, different angle

Read this day as…

◆ Recent in data science

Keep reading.