Data Science daily

Edition 2026-05-22 · read as Data Science

AnthropicEndsClaudeSubscriptionSubsidy,EvalCostsSpike

Sources
36
Words
1,362
Read
7min

Topics Agentic AI LLM Inference AI Capital

◆ The signal

Anthropic quietly metered Claude subscriptions to dollar-matched API credits, removing what had been a 70-90% effective subsidy on Agent SDK, GitHub Actions, and third-party harness calls. OpenAI announced a 2-month-free Codex enterprise switch promo the same day. The thing the pricing page doesn't tell you: any eval harness or batch pipeline budgeted against flat subscription cost is now charging at API rates, and the overrun shows up in this week's token burn, not next quarter's review.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic Triple Pricing Shock + Capacity Crisis

    act now

    Anthropic metered subscriptions at API rates, tripled Opus 4.7 image cost, and announced a June 15 third-party tool credit split — all while admitting an 80x capacity miss that forced leasing xAI's 220K-GPU Colossus 1 cluster. ServiceNow burned its full-year Claude budget by May. Any Claude cost model built before this week is wrong.

    80x
    capacity plan miss
    9
    sources
    • B2B share (Ramp)
    • ARR growth (4mo)
    • Colossus GPUs
    • Image cost change
    • June 15 credit split
    1. Planned growth10
    2. Actual growth80
    3. Capacity gap8
  2. 02

    59% Agentic Volume: Eval and Cost Models Obsolete

    act now

    Vercel's AI Gateway (200K teams, 7 months) reports 59% of tokens are now agentic multi-turn traffic. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Cost models built on 3:1 input-output ratios are off by ~5x; single-turn eval harnesses score the minority of production traffic.

    59%
    agentic token share
    4
    sources
    • Anthropic spend share
    • Google volume share
    • I/O ratio (agentic)
    • MCP token overhead
    1. Agentic tokens59
    2. Single-turn tokens41
  3. 03

    Training Efficiency Breakthroughs: 2-360x Gains

    monitor

    Three independent results shift pretraining and distillation economics: Nous TST delivers 2-3x wall-clock speedup at matched FLOPs with no inference change (validated 270M→10B). NVIDIA Star Elastic produces model-size families at 360x less cost than pretraining each. Datology beats InternVL3.5-2B by 10 pts at 17x less compute via pure data curation.

    17x
    compute reduction (VLM)
    2
    sources
    • TST speedup
    • Star Elastic savings
    • Datology bench gain
    • Datology compute cut
    1. Nous TST3
    2. NVIDIA Star Elastic360
    3. Datology curation17
  4. 04

    Compute Supply Crunch: 4:1 Demand and Neocloud Boom

    monitor

    Nebius reported 684% YoY revenue growth with 4+ customers per GPU, guiding $3-3.4B for 2026 vs $530M in 2025. Cerebras IPO'd at $56B with a $20B OpenAI commitment. Cisco AI orders jumping from $5B to $9B with explicit memory-hardware shortage. H2 capacity priced on today's availability is likely mispriced.

    4:1
    GPU demand-to-supply
    5
    sources
    • Nebius 2026 guide
    • Nebius YoY growth
    • Cerebras valuation
    • Cisco AI order jump
    1. Nebius 2025530
    2. Nebius 2026 guide3200
  5. 05

    AI Cyber Capability: AISI Ranges Saturating

    background

    Anthropic's Mythos cleared both AISI attack ranges (first model ever). GPT-5.5-cyber cleared one. Both achieved 'full network takeover' in controlled environments — a step-function above prior gen's 'advanced persistence.' AISI is already building harder tests. Google confirmed a threat actor using AI to build cybercrime tooling in the wild.

    2/2
    AISI ranges cleared
    4
    sources
    • Mythos ranges
    • GPT-5.5 ranges
    • Prior gen ceiling
    • Palo Alto scan
    1. 01Mythos (new)Full takeover
    2. 02GPT-5.5-cyberPartial takeover
    3. 03Mythos (prior)Adv. persistence

◆ DEEP DIVES

  1. 01

    Anthropic's Pricing Earthquake: Three Simultaneous Cost Shocks Hit Your Claude Stack

    What Happened

    The headline change: Claude subscriptions now convert to dollar-matched API credits for all programmatic usage, including Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The implicit 70-90% effective discount is gone. In the same week, Opus 4.7 tripled image processing cost, and starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) moves to a separate credit bucket equal to plan value with no rollover and overflow at API rates.

    The driver behind the pricing is capacity. Anthropic planned for 10x growth and is seeing 80x. The emergency fix is leasing xAI's entire Colossus 1 cluster, 220,000+ GPUs spanning H100, H200, and GB200. A CFO is in seat and the company is targeting an October IPO, which is a reasonable proxy for why margin-per-token is now a board metric.


    The Capacity Context

    The pricing changes read more cleanly alongside the capacity numbers. ServiceNow's CDIO burned through a full-year Claude budget by May. National Life Group's CIO called Claude 'not great for companies that want per-user monitoring,' and Anthropic ships no native per-user telemetry and no SLAs, which is unusual for a dependency sitting on production critical paths.

    ChangeImpactTimeline
    Subscription → API credits70-90% discount gone on batch/eval workloadsImmediate
    Opus 4.7 image cost3x on multimodal pipelinesImmediate
    June 15 third-party splitNo subsidized tokens for Zed/OpenCode/T330 days
    Colossus integrationp95/p99 variance during heterogeneous fleet mergeWeeks–months
    Any Claude benchmark from before May 7 is stale, and any cost model built against flat subscription rates is not directionally wrong, it is numerically wrong.

    The OpenAI Counter-Move

    Sam Altman posted a 2-month-free Codex enterprise switch promo on the same day Anthropic metered subscriptions. Ramp's April data showed Anthropic edging OpenAI for the first time, 34.4% vs 32.3%. The promo is an asymmetric-payoff bet: free to evaluate, with bounded switching cost if you already have a provider abstraction layer. OpenAI is pricing directly into the developer cohort Anthropic just alienated.

    Cross-Source Tension

    Sources disagree on the durability of Anthropic's lead. Ramp data is a card-spend proxy and measures who gets billed, not token volume or production criticality. A 20-seat pilot weighs the same as a company at inference scale, and OpenAI notes large enterprises rarely pay by card. The directional signal looks real; the magnitude does not. What is not uncertain is that the market is now genuinely multi-vendor, and architecture should reflect that.

    Action items

    • Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap by end of this week
    • Deploy an LLM gateway (LiteLLM, Portkey) with per-user, per-feature tagging and daily budget alerts within this sprint
    • Run a 2-month Codex evaluation under OpenAI's enterprise switch promo with matched prompts against your Claude harness
    • Reforecast Claude inference spend for any team using Zed/Conductor/OpenCode modeling the post-June-15 scenario

    Sources:Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with · Ramp's AI Index shows Anthropic at 34.4%

  2. 02

    59% Agentic Volume: Your Eval Harness and Cost Model Are Measuring the Minority

    The Production Shift, Quantified

    Vercel's AI Gateway telemetry — 200,000 teams over 7 months — reports that 59% of all token volume is now agentic: multi-turn, tool-calling traces, not single-shot completions. Six months ago the share was under 20%. The interesting structure is the spend-volume split. Anthropic captures 61% of spend via Opus on reasoning and planning nodes, while Google captures 38% of volume via Flash on throughput and utility calls. Vendor loyalty is not visible in the data. Customers churn freely.

    This is not a leaderboard result. It is production telemetry from a multi-tenant gateway, which means most tokens in the wild are inside multi-step tool loops with retries, not the single-turn completions an eval harness was built to score.


    What Breaks at 59%

    Eval harnesses: single-turn accuracy on held-out prompts measures the 41% minority. The thing this doesn't tell you is whether a planner burns 40K tokens arguing with itself before giving up. Pass rate looks fine. Spend is 5x budget. Cost models: most were calibrated when input-output ratios sat at 3:1. Agentic traces run at ~15:1 on input, with heavy cache reuse on some providers and none on others. A forecast carried over from last year's ratio is off by roughly 5x on spend.

    MetricSingle-Turn World59% Agentic World
    Input:Output ratio~3:1~15:1
    Cost driverOutput tokensTrajectory length × retries
    Eval metricAccuracy on final answerCost-per-successful-task
    Routing unitSingle requestSession with KV cache state
    Failure modeWrong answerCorrect answer at 10x budget
    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out — update the harness before you update the model.

    The Routing Architecture That Fits

    The Vercel pattern is consistent with what Abridge disclosed across 80M+ clinical conversations: a constellation of models, cheap fast triage in front, expensive reasoning behind, per-task selection. Given those numbers, the plausible envelope for cost reduction from tiered routing is 20-40% at constant trajectory completion rate. Glean's published benchmark puts off-the-shelf MCP at 30% more tokens than retrieval-tuned knowledge graphs. The methodology is not disclosed and the source is the vendor, so treat it as directional. It is consistent with the MCP context-window bloat people have measured independently.

    The Enterprise Convergence

    SAP (€100M partner investment) and ServiceNow (Action Fabric) have landed in the same place: agents need Knowledge Graph grounding + MCP-exposed workflows. The architectural drift is from RAG-over-docs to structured, entity-resolved context. The eval needs to move with it. Tool-use accuracy against grounded context is the production differentiator. It is on no public leaderboard.

    Action items

    • Add trajectory-level metrics to your eval harness this sprint: tool-call F1, steps-to-completion, cost-per-successful-task, recovery-from-error rate
    • Instrument per-node token cost across your agent graphs and route utility calls (summarization, extraction, query rewriting) to Flash/Haiku-class models within 2 weeks
    • Run a 1-hour spike measuring token overhead of current MCP/tool-calling setup vs. a retrieval-first baseline on 100 production traces
    • Add LLM-judge ↔ human-annotator agreement as a tracked SLI this quarter; re-calibrate when judge model changes

    Sources:Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · AI Gateway data puts agentic workloads at fifty-nine percent · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs

  3. 03

    Training Efficiency Frontier: Three Results That Change Unit Economics This Quarter

    Three Independent Wins, One Direction

    The marginal dollar in model development has moved from raw compute to architecture and curation. The evidence this week comes from three separate labs, each working a different layer of the training stack, and the results point the same way. The marginal dollar in model development has moved from raw compute to architecture and curation.

    WorkClaimScale ValidatedInference ImpactReplication Risk
    Nous Research TST2-3x wall-clock at matched FLOPs270M → 10B-A1B MoENone — no architecture changeMedium; single-source, clean claim
    NVIDIA Star Elastic360x cheaper model-family derivation; 7x vs SOTA compressionNot specifiedProduces family of sizes from one runHigh; big number, lab-reported
    Datology VLM+11.7 pts on 20 benchmarks at 17x less compute2B and 4B params3.3x lower response FLOPs (serving win)Medium; benchmark selection risk

    TST Is the One to Spike First

    Token Superposition Training is a pretraining recipe change with no inference-side downstream. If it replicates, it is a 2-3x improvement on wall-clock for free. The serving architecture is unchanged. The efficiency lives entirely in how gradients are computed during training. Validated from 270M up to a 10B-A1B MoE, which is the band most teams actually run for continued pretraining and domain adaptation.

    Datology: Curation Beats Compute

    Datology's result is the cleanest evidence so far that data curation now dominates compute scaling for VLMs. A 2B model lands about 10 points above InternVL3.5-2B at 17x less training compute, on data selection alone. At 4B, they reach near-frontier quality at 3.3x lower response FLOPs than Qwen3-VL-4B, which is a serving win and not just a training one. For any team maintaining a vision pipeline, this reorders the next budget line. Curation tooling before more GPUs.

    Star Elastic: Speculative but High-Leverage

    NVIDIA's claim that one post-training run produces a full family of model sizes at 360x lower cost than pretraining each is the kind of number that always shrinks under independent evaluation. The thing this number doesn't tell you is what holds at the small end versus the large end. Even a 30x hold would restructure how teams produce size tiers for routing. Paired with the 59% agentic routing pattern, cheap access to a 1B-to-70B spectrum from a single run is what a tiered architecture actually needs.

    TST is worth a spike on your next continued-pretraining run. If wall-clock comes in at even 1.6x with no val-loss regression, it pays for itself immediately.

    The Parallel Signal: Only 15% Are Ready

    Fivetran's readiness index reports that 15% of organizations have the data foundation for agentic AI. Data quality and lineage are the top blocker, cited by roughly 50%. Read against the efficiency results, the picture is consistent. Training is getting cheaper. Most teams cannot capitalize because the data layer is not there. Half the agent projects funded this quarter are data-platform projects with an agent bolted on.

    Action items

    • Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline within the next 2 weeks
    • Run an ablation benchmarking Qwen-2.5/3 or DeepSeek-V3 against your current production model on in-domain evals this quarter
    • Audit ANALYZE/compute-stats coverage across top-20 Iceberg/Delta tables; add stats freshness to table-level SLAs
    • Score target domains against Fivetran readiness dimensions (quality, lineage, governance) before greenlighting agentic AI projects

    Sources:Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week

◆ QUICK HITS

  • DuckDB shipped Quack HTTP client-server protocol — credible replacement for Spark-on-Glue for sub-100GB jobs; default config is localhost-only, no SSL, token auth

    DuckDB shipped a client-server mode this week

  • Kafka Share Groups report ~linear 8x throughput at 32 consumers on I/O-bound workloads — decouple consumer parallelism from partition count for embedding/enrichment pipelines

    DuckDB shipped a client-server mode this week

  • Update: Mythos cleared both AISI attack ranges (first model ever) — capability stepped from 'advanced persistence' to 'full network takeover'; AISI is building harder tests because current ladder is saturating

    Mythos cleared the AISI attack ranges this week

  • Apache Iceberg CVE-2026-42812 (CVSS 9.9): attacker with table-write permission can redirect metadata to attacker-controlled S3, poisoning training data on next read — audit Polaris catalog write-path allowlisting

    LiteLLM landed in the KEV catalog this week

  • Gemini reproducibly outputs real phone numbers from training data — three independent extraction cases; add PII canary + divergence-attack probes to LLM CI before next release

    Gemini is the latest model to surface PII

  • Duolingo publicly pegs AI-generated content rejection at ~20% requiring human QC — use as calibration anchor for your own generation pipeline acceptance rate

    Duolingo's twenty percent AI slop rate

  • TML-Interaction-Small hits 0.40s turn-taking latency vs 0.57s Gemini Flash and 1.18s GPT-Realtime — full-duplex voice becoming its own model class with 200ms micro-turn architecture

    TML is reporting 0.40 seconds of full-duplex latency

  • SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs across 3K repos and 16 languages — largest open agentic trace corpus; pull before licensing frictions arrive

    Claude just metered your agent SDK calls

  • Microsoft's agent memory architecture (consolidation + forgetting + delayed maturation) stabilizes at 400-500 memories with 97.2% retention precision — alternative to bloated prompts or flat vector top-k

    DuckDB shipped a client-server mode this week

  • Persona drift measurable within 8 dialogue turns per Li et al. (COLM 2024) — embed a verbal tic as a free drift canary via regex; costs one hour to instrument

    AI personas drift within eight turns

◆ Bottom line

The take.

Anthropic killed the flat-rate developer discount, tripled image costs, and announced a June 15 credit split — all while 59% of production tokens are now agentic and your eval harness still measures single-turn completions. The two cheapest things you can do this week: reconcile every Claude workload against the new metered credit cap, and add trajectory-level cost-per-successful-task to the eval harness that's currently scoring the minority of your traffic.

— Promit, reading as Data Science ·

Frequently asked

What changed in Anthropic's Claude subscription pricing?
Claude subscriptions now convert to dollar-matched API credits for all programmatic usage, including Agent SDK, claude-p, GitHub Actions, and third-party harnesses. This eliminates what had been a 70-90% effective discount on batch and eval workloads, so any pipeline budgeted against flat subscription cost is now charging at full API rates.
How should I reforecast Claude spend before the June 15 third-party tool change?
Model the post-June-15 scenario for any team using Zed, Conductor, OpenCode, or T3 Code, since third-party tool usage moves to a separate credit bucket equal to plan value with no rollover and overflow at API rates. Combine this with a per-workload audit of Agent SDK and batch eval token burn to catch silent overruns before credits exhaust mid-month.
Why are single-turn eval harnesses inadequate now?
Vercel gateway telemetry shows 59% of token volume is agentic multi-turn tool-calling, while most harnesses still measure single-turn accuracy on the 41% minority. Agentic traces also run at roughly 15:1 input-to-output versus the historical 3:1, so cost models calibrated on single-turn assumptions can be off by about 5x. Track tool-call F1, steps-to-completion, and cost-per-successful-task instead.
Is Token Superposition Training worth piloting now?
Yes — it's the lowest-risk efficiency bet on the table. TST is a pretraining recipe change with no inference-side impact, claiming 2-3x wall-clock improvement at matched FLOPs, validated from 270M up to a 10B-A1B MoE. Even a 1.6x replication on a continued-pretraining spike pays for itself immediately, since the serving architecture is unchanged.
Does the OpenAI Codex switch promo justify a real evaluation?
It's an asymmetric bet worth taking if you have a provider abstraction layer. The 2-month-free Codex enterprise promo lets you run matched prompts against your Claude harness at zero marginal cost, and Ramp's April data already shows a genuinely multi-vendor market (34.4% Anthropic vs 32.3% OpenAI on card spend). Worst case you validate confidence in Claude; best case you find a cost-competitive alternative.

◆ Same day, different angle

Read this day as…

◆ Recent in data science

Keep reading.