Data Science daily

Edition 2026-05-19 · read as Data Science

Anthropic Ends Claude Flat-Rate, Agent Stacks Face 3-5x Bill

Sources: 36 · Words: 1,722 · Read: 9 min

Topics: Agentic AI · LLM Inference · AI Capital

◆ The signal

Anthropic quietly killed the flat-rate Claude developer subsidy — subscriptions now convert to dollar-matched API credits, metering every Agent SDK, GitHub Action, and batch eval job at list price. This eliminates the 70-90% effective discount power users had been getting. OpenAI dropped a 2-month-free Codex enterprise switch promo the same day, and Vercel's production data shows 59% of all tokens are now agentic. If you haven't re-priced your Claude-dependent agent stack this sprint, you're making a pricing decision by default while the cost model underneath it just changed by 3-5x.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic's Triple Squeeze: Metered Credits, 80x Capacity Miss, June 15 Cliff

    act now

    Anthropic metered programmatic Claude usage at list-rate API credits, admitted an 80x-vs-10x capacity planning miss, leased xAI's 220K-GPU Colossus 1 cluster, and will split third-party tool billing on June 15. ServiceNow burned its full-year Claude budget by May. Enterprise share flipped to 34.4% vs OpenAI's 32.3%.

    80x growth vs plan miss · 9 sources
    [Stat cards: B2B share lead · Colossus GPUs leased · ARR growth (4 mo) · Opus 4.7 image cost · IPO target]
    [Chart: planned growth 10x · actual growth 80x · capacity gap 8x]
  2. 02

    59% Agentic: Eval Harnesses and Cost Models Are Measuring the Minority

    act now

    Vercel's AI Gateway puts agentic workloads at 59% of all production tokens. Anthropic captures 61% of spend (via Opus for reasoning), Google captures 38% of volume (via Flash for throughput). Single-turn eval harnesses now score the minority of traffic. Cost models built on 3:1 input-output ratios are off by ~5x for 15:1 agentic traces.

    59% tokens now agentic · 5 sources
    [Stat cards: Anthropic spend share · Google volume share · Agentic I/O ratio · MCP token overhead]
    [Chart: agentic tokens 59% · single-turn tokens 41%]
  3. 03

    AI Cyber Capability Crossed AISI Threshold — Harness Dominates Model

    monitor

    Anthropic's Mythos is the first model to clear both AISI simulated attack ranges (full network takeover). GPT-5.5-cyber cleared one of two. Mozilla's agentic harness found 271 Firefox bugs with the same model that found 1 CVE in curl without one. The 271:1 delta is strong evidence that harness engineering dominates model selection for vulnerability discovery.

    271:1 harness vs no-harness bugs · 6 sources
    [Stat cards: AISI ranges cleared · Mozilla bugs found · curl bugs found · Palo Alto products scanned]
    [Chart: with harness (Mozilla) 271 · without harness (curl) 1]
  4. 04

    Training Efficiency: Three Papers Move the Unit Economics

    monitor

    Nous TST reports 2-3x wall-clock speedup at matched FLOPs (validated 270M→10B MoE). Datology hit +11.7 pts on VLM benchmarks at 17x less training compute via pure data curation. NVIDIA Star Elastic claims 360x cheaper model-family derivation from a single post-training run. All three change the $/capability math for teams running their own training.

    17x compute reduction (Datology) · 2 sources
    [Stat cards: TST wall-clock gain · Star Elastic savings · Datology VLM lift · SWE-ZERO corpus]
    [Chart: Nous TST 3x · Datology VLM 17x · Star Elastic 360x]
  5. 05

    GPU Supply: 4:1 Demand Ratio and the Inference Silicon Fork

    background

    Nebius reports 4+ customers competing for every GPU online, posting 684% YoY revenue growth with $3-3.4B 2026 guidance. Cerebras IPO'd at $56B (+70% day one) backed by OpenAI's $20B commitment. Cisco AI orders jumping $5B→$9B. The inference hardware layer is diversifying away from Nvidia faster than most capacity plans assume.

    4:1 GPU demand-to-supply · 5 sources
    [Stat cards: Nebius YoY growth · Cerebras valuation · OpenAI-Cerebras deal · Cisco AI order growth]
    [Chart: Nebius 2025 rev 530 · Nebius 2026 guide 3200]

◆ DEEP DIVES

  1. 01

    Anthropic's Metering Cliff: Re-Price Your Agent Stack Before June 15

    What Happened

    Anthropic shipped three changes that interact badly for anyone running Claude in production. First, subscriptions now convert to dollar-matched API credits across Agent SDK, claude-p, GitHub Actions, and third-party harnesses. The implicit 70-90% subsidy on programmatic usage is gone. Second, Dario Amodei conceded the company planned for 10x growth and got 80x, which reframes weeks of degraded Claude Code performance as a capacity miss, not a model regression. Third, starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) moves to a separate credit bucket with no rollover and overflow at API rates.

    The capacity patch is leasing xAI's entire Colossus 1 cluster — 220,000+ GPUs spanning H100, H200, and GB200 — from a CEO who publicly insulted Anthropic three months ago. Rate limits are being raised in parallel: Claude Code 5-hour caps doubling, peak-hours throttling removed, Opus API limits "substantially raised."


    Why Sources Disagree on Market Position

    Seven independent sources reported Anthropic overtaking OpenAI in enterprise adoption (34.4% vs 32.3% per Ramp). What the headline leaves out is what the metric measures: credit-card billing share, not token volume, not workload criticality, not large-enterprise invoiced spend. OpenAI's point that $1M+ ACV accounts pay by ACH, not card, is correct and material. Ramp's own economist separately flagged Opus 4.7 tripling image costs and mounting reliability complaints. The crossover is real for bottom-up developer adoption. It is not yet a statement about the Fortune 500.

    ServiceNow burned its full-year Claude budget by May. National Life Group's CIO says Claude is 'not great for companies' wanting per-user monitoring. The frontier model most teams build on has no native cost attribution, no SLAs, and no per-user telemetry.

    The Compound Effect

    Token consumption in agentic workflows is non-linear. A reflection loop or tool-use chain can 10x spend per task with no proportional quality gain, and the per-task variance is wide enough that an average tells you very little. Remove the subscription subsidy while the vendor still ships no per-tenant attribution, and the failure mode is a silent budget overrun that surfaces in the monthly invoice rather than the observability dashboard. Teams with gateway-level logging in place before June 15 absorb the change. Teams discovering it in the invoice do not.

    Surface | Before | After (May-June 2026)
    Agent SDK / claude-p | Flat subscription covers heavy use | Dollar-matched API credits, then list rate
    Third-party tools (Zed, etc.) | Covered by plan | Separate bucket, no rollover (June 15)
    Claude Code caps | 5-hour limit, peak throttled | Doubled, throttling removed
    Opus API rate limits | Constrained during crunch | 'Substantially raised' post-Colossus

    Action items

    • Audit every Claude-backed workload (Agent SDK, claude-p, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap by end of this sprint
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily budget alerts before June 15
    • Run OpenAI's 2-month-free Codex enterprise switch promo as a controlled head-to-head on your actual eval harness with matched prompts and tool schemas
    • Re-baseline all Claude benchmarks (throughput, p95 latency, rate-limit headroom) after Colossus 1 integration stabilizes — do not commit to workarounds built against the degraded period
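
As a rough illustration of the gateway-level tagging and daily budget alert described in the action items, the sketch below tracks spend per user and feature in plain Python. It is not LiteLLM's or Portkey's API: `BudgetTracker`, its fields, and the per-million-token prices are hypothetical placeholders, and a real gateway would persist this state and reset it each day.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BudgetTracker:
    """Accumulates spend per (user, feature) tag and flags daily budget breaches."""
    daily_budget_usd: float
    # Hypothetical list prices in USD per million tokens; replace with real rates.
    input_price_per_mtok: float = 3.0
    output_price_per_mtok: float = 15.0
    spend: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, user: str, feature: str, input_tokens: int, output_tokens: int) -> float:
        """Price one call at list rate and add it to the (user, feature) bucket."""
        cost = (input_tokens * self.input_price_per_mtok
                + output_tokens * self.output_price_per_mtok) / 1_000_000
        self.spend[(user, feature)] += cost
        return cost

    def total(self) -> float:
        return sum(self.spend.values())

    def over_budget(self) -> bool:
        return self.total() > self.daily_budget_usd

    def top_spenders(self, n: int = 3):
        """Largest (user, feature) buckets first, for the alert message."""
        return sorted(self.spend.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    tracker = BudgetTracker(daily_budget_usd=1.0)
    tracker.record("alice", "batch-eval", input_tokens=100_000, output_tokens=10_000)
    tracker.record("bob", "agent-sdk", input_tokens=200_000, output_tokens=20_000)
    if tracker.over_budget():
        print("ALERT: daily budget exceeded", tracker.top_spenders(1))
```

Keying the buckets on (user, feature) is the design choice that matters here: it lets an overrun be traced to a workload on the day it happens rather than in the monthly invoice.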

    Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel's AI Gateway production index

  2. 02

    59% Agentic: Your Eval Harness and Cost Model Just Became Minority Instruments

    The Production Reality

    Vercel's AI Gateway production index is the only multi-tenant usage dataset with 200K+ teams and 7 months of data, so the caveat up front is that this is a single source. It puts agentic workloads at 59% of all token volume. Six months ago that number was under 20%. The composition shift is the fastest since the move from completion to chat endpoints, and it happened while most eval harnesses were still scoring single-turn responses against reference answers.

    Three structural mismatches follow. First, cost models are off by ~5x: agentic traces run 15:1 input-to-output versus the 3:1 most forecasts assume, with heavy cache reuse on some providers and none on others. The thing this doesn't tell you is which providers, so the 5x is a population estimate, not a per-vendor one. Second, eval harnesses measure the minority: single-turn accuracy at 90%+ hides the planner that burns 40K tokens arguing with itself before giving up. Third, the spend/volume split is the routing signal: Anthropic captures 61% of dollars via Opus (reasoning), Google captures 38% of tokens via Flash (throughput), with no vendor loyalty observed.
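
To make the first mismatch concrete, here is a hedged back-of-envelope sketch. It assumes equal output tokens per task, ignores caching, and uses placeholder prices (USD per million tokens) that are not any vendor's actual rates. The function names are illustrative. Under those assumptions the input volume is 15/3 = 5x the forecast, while the dollar gap depends on the input-to-output price ratio and approaches 5x only as input cost dominates the bill.

```python
def cost_per_task(output_tokens: int, io_ratio: float,
                  input_price: float, output_price: float) -> float:
    """Dollar cost of one task; prices are USD per million tokens."""
    input_tokens = output_tokens * io_ratio
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def underestimate_factor(assumed_ratio: float, actual_ratio: float,
                         input_price: float, output_price: float) -> float:
    """How many times larger the real bill is than the forecast, at equal output."""
    forecast = cost_per_task(1_000, assumed_ratio, input_price, output_price)
    actual = cost_per_task(1_000, actual_ratio, input_price, output_price)
    return actual / forecast


if __name__ == "__main__":
    # Placeholder prices: output priced at 5x input.
    print(underestimate_factor(3, 15, input_price=3.0, output_price=15.0))
    # Limiting case where output tokens are free: the gap equals the volume ratio.
    print(underestimate_factor(3, 15, input_price=3.0, output_price=0.0))
```

Running the same function with your own contract prices and measured ratios gives a per-vendor figure instead of a population estimate.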


    The Architecture That Works at Scale

    Abridge's disclosed architecture across 80M+ clinical conversations is a second data point on the same pattern: cheap fast model triages, expensive model reasons only when needed, with memory externalized from weights and LLM judges calibrated against human annotators. Two independent sources converging on confidence-gated routing across a constellation of models is enough to call it the de facto production pattern, not a research proposal.
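
The confidence-gated routing pattern described above can be sketched in a few lines. This is not Abridge's implementation: the `route` function, the 0.8 threshold, and both toy models are assumptions made for illustration, and real models would be provider API calls that return some calibrated confidence signal.

```python
from typing import Callable, Tuple

# A model is any callable returning (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(prompt: str, cheap: Model, expensive: Model,
          threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, tier). Escalate only when the cheap model is unsure."""
    answer, confidence = cheap(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = expensive(prompt)
    return answer, "expensive"


# Stand-in models for illustration only.
def toy_cheap(prompt: str) -> Tuple[str, float]:
    confident = len(prompt.split()) <= 5  # pretend short prompts are easy
    return f"cheap:{prompt}", 0.95 if confident else 0.40


def toy_expensive(prompt: str) -> Tuple[str, float]:
    return f"expensive:{prompt}", 0.99


if __name__ == "__main__":
    print(route("classify this note", toy_cheap, toy_expensive))
    print(route("reconcile the medication list against the prior visit summary",
                toy_cheap, toy_expensive))
```

The threshold is the cost lever: raising it sends more traffic to the expensive tier, so it is worth tuning against logged traces rather than guessing.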

    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out — update the harness before you update the model.
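
One way to start on trajectory-level scoring is sketched below. The `Trajectory` record and its fields are hypothetical. The metrics follow the usual definitions of precision, recall, and cost per successful task, and a real harness would populate them from logged agent traces.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Trajectory:
    tool_calls: List[str]      # tools the agent actually invoked, in order
    expected_tools: Set[str]   # tools a reference solution needs
    cost_usd: float
    succeeded: bool


def tool_precision_recall(t: Trajectory) -> Tuple[float, float]:
    """Set-based precision and recall of invoked tools against the reference set."""
    called = set(t.tool_calls)
    hits = len(called & t.expected_tools)
    precision = hits / len(called) if called else 0.0
    recall = hits / len(t.expected_tools) if t.expected_tools else 1.0
    return precision, recall


def summarize(trajs: List[Trajectory]) -> dict:
    successes = [t for t in trajs if t.succeeded]
    total_cost = sum(t.cost_usd for t in trajs)
    return {
        "success_rate": len(successes) / len(trajs),
        "mean_steps_on_success": (
            sum(len(t.tool_calls) for t in successes) / len(successes)
            if successes else float("nan")
        ),
        # Failed runs still cost money, so divide total spend by successes only.
        "cost_per_successful_task": (
            total_cost / len(successes) if successes else float("inf")
        ),
    }


if __name__ == "__main__":
    runs = [
        Trajectory(["search", "read"], {"search", "read"}, 0.10, True),
        Trajectory(["search", "search", "write", "search"], {"search", "read"}, 0.50, False),
    ]
    print(summarize(runs))
    print(tool_precision_recall(runs[1]))
```

Charging failed runs to the successful ones is deliberate: it is what makes a looping planner visible in the number.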

    What the MCP Overhead Tells You

    Glean reports off-the-shelf MCP burning 30% more tokens than retrieval-tuned knowledge graphs on agentic tasks and losing 2.5x head-to-head on preference. The result is vendor-published with no methodology disclosed, so treat the magnitude as a hypothesis. The directional claim, that naive tool listings balloon context windows while reranked snippets do the same work cheaper, matches what SAP and ServiceNow concluded independently this week. All three converge on Knowledge Graph grounding + MCP-governed execution as the enterprise agent reference architecture.

    Multi-agent decomposition adds a dimension worth a separate ablation. Microsoft's MDASH (100+ agents) beat Anthropic's Mythos on CyberGym by decomposing tasks into scan → adversarial debate → PoC exploitation stages. The 100-agent count is a design choice. No ablation isolates the staging from the ensemble size, so it is possible but not established that the staging is doing the work. The decompose-debate-verify pattern is consistent with ensemble priors and is the cheap experiment to run on any workload with auto-verifiable outputs.

    Action items

    • Add trajectory-level metrics to the eval harness this sprint: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
    • Instrument per-node token cost in your agent graph and route utility calls (summarization, JSON extraction, query rewriting) to Flash/Haiku-class models within 2 weeks
    • Run a 1-hour spike measuring MCP tool-calling overhead vs. a rerank/KG baseline on 100 replayed production agent traces
    • Add LLM-judge-to-human-annotator agreement (Cohen's kappa) as a tracked SLI in the eval pipeline; re-calibrate quarterly
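
For the judge-agreement SLI in the last item, Cohen's kappa can be computed in a few lines. This is the standard two-rater formula written from scratch rather than any particular library's implementation, and the labels in the usage example are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    llm_judge = ["pass", "pass", "fail", "fail"]
    annotator = ["pass", "fail", "fail", "fail"]
    print(cohens_kappa(llm_judge, annotator))  # 0.5
```

Raw agreement here is 75%, but kappa is 0.5 because half of that agreement is expected by chance, which is why kappa is the better number to track over time.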

    Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · Abridge runs model routing across 100M conversations · MCP plus knowledge graphs

  3. 03

    AISI Range Saturation: Autonomous Cyber Capability Is Now a First-Class Eval Axis

    The Threshold Crossing

    The UK AI Security Institute evaluated the newest Anthropic Mythos and OpenAI GPT-5.5-cyber on autonomous cyber-offense tasks. Both completed full network takeovers in controlled environments, one capability tier above the prior Mythos generation, which topped out at "advanced persistence." Mythos cleared both of AISI's hardest tests. GPT-5.5-cyber cleared one. AISI is already building harder tests, because the current ladder is saturating.

    This is the first time a national evaluator has publicly stated that frontier models can complete an end-to-end attack chain. Neither lab is releasing these variants broadly. Anthropic gates Mythos to select enterprises and government agencies.


    The 271:1 Harness Signal

    Two teams ran Claude Mythos against C codebases in the same month. Mozilla wrapped a custom agentic harness around existing fuzzing infrastructure and surfaced 271 bugs in Firefox 150, including sandbox escapes, use-after-frees, and race conditions. Daniel Stenberg pointed Mythos at curl with a generic scanner and got exactly 1 low-severity CVE with 4 false positives. His verdict: "primarily marketing."

    Same model, two orders of magnitude apart. The variable that moved was the harness. Mozilla's wrapper emits reproducible test cases, scales across ephemeral VMs, and integrates with their security lifecycle. The thing the leaderboard score doesn't tell you is which scaffold the model was wearing. This is the strongest public evidence to date that harness investment dominates model choice by at least 50x on real codebases.

    Dimension | Mozilla + Firefox | Stenberg + curl
    Model | Claude Mythos Preview | Claude Mythos Preview
    Harness | Custom agentic, fuzzer-integrated | Out-of-box scan
    True positives | 271 (incl. sandbox escapes) | 1 low-severity CVE
    False positive rate | Tooling-filtered | ~80%

    Vulnerability discovery just moved from human-weeks to model-minutes. If the patch SLA is not benchmarked against inference time, the defense is tuned to last year's threat model.

    Implications for Agent Deployments

    Refusal-rate harnesses calibrated on GPT-4-era capability assumptions will produce false negatives against Mythos-class attackers. The miscalibration is structural, not a tuning fix. For anything agentic with tool access, the eval needs a staged rubric covering recon, initial access, lateral movement, persistence, and exfil, not input-side prompt filters scored in isolation. Microsoft's MDASH shipped 16 real Windows fixes off multi-model bug-hunting, which is the load-bearing data point that the capability has crossed the utility threshold for both offense and defense at the same time.

    Action items

    • Add an autonomous-cyber-capability tier to your model eval harness this quarter — include AISI-style attack-range tasks for any model with tool/shell access
    • Spike a domain-specific agentic vulnerability harness on your own codebase modeled on Mozilla's pattern: reproducible test cases + ephemeral VMs + integration into existing signal pipelines
    • Instrument agent logs for tool-use patterns matching recon → lateral movement → persistence signatures, not just prompt-injection strings
    • Compress critical-patch SLA to model-release cadence rather than CVE-publication cadence
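
The staged log check in the third item could start as something like the sketch below. The substring markers and the three-stage order are illustrative assumptions, not a vetted detection ruleset. A production version would derive its signatures from your own tooling and threat model.

```python
from typing import Iterable, List, Optional

# Hypothetical mapping from command substrings to attack stages.
STAGE_MARKERS = {
    "recon": ("nmap", "whoami", "net view"),
    "lateral_movement": ("ssh ", "psexec", "wmic"),
    "persistence": ("crontab", "schtasks", "authorized_keys"),
}
STAGE_ORDER = ["recon", "lateral_movement", "persistence"]


def classify(command: str) -> Optional[str]:
    """Return the first stage whose marker appears in the command, if any."""
    lowered = command.lower()
    for stage, markers in STAGE_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return stage
    return None


def stages_reached(commands: Iterable[str]) -> List[str]:
    """Stages matched in order; a later stage counts only after the earlier ones."""
    reached: List[str] = []
    for command in commands:
        if len(reached) == len(STAGE_ORDER):
            break
        if classify(command) == STAGE_ORDER[len(reached)]:
            reached.append(STAGE_ORDER[len(reached)])
    return reached


if __name__ == "__main__":
    agent_log = ["ls -la", "nmap -sV 10.0.0.0/24", "ssh admin@10.0.0.5", "crontab -e"]
    progress = stages_reached(agent_log)
    if progress == STAGE_ORDER:
        print("ALERT: full recon -> lateral movement -> persistence sequence observed")
```

Matching on the ordered sequence rather than on single commands is the point of the exercise: any one of these commands is routine on its own.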

    Sources: CyberScoop: Mythos cleared AISI · The Information AM: full network takeover · Clint Gibler: Mozilla 271 vs curl 1 · The Hacker News: MDASH 16 fixes · Risky.Biz: guardrail bypass economy · Bloomberg Technology: AI cybercrime observed

  4. 04

    Three Training Efficiency Papers That Change This Quarter's Build Math

    Three recipe and curation results worth pricing in

    This week's drops share a through-line for teams running their own training or distillation: the marginal dollar in model development has moved from raw compute to recipe and curation. Each result shifts unit economics in a direction that matters, and each carries its own replication risk profile.


    1. Nous Token Superposition Training (TST)

    Reports 2-3x wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M through 10B-A1B MoE. The thing this measures is pretraining throughput. The thing it doesn't measure is whether the speedup holds at frontier scale under independent replication. Single source, but the mechanistic claim is clean. If it holds, it's a free 2-3x on training runs without touching the serving stack.

    2. NVIDIA Star Elastic

    Claims one post-training run produces a family of reasoning model sizes at 360x lower cost than pretraining a family, 7x better than SOTA compression. Headline numbers of that size always shrink under independent eval. The question is the floor, not the ceiling. A 30x hold would still restructure how size tiers are produced for routing. A 10x hold probably would not.

    3. Datology VLM Curation

    Hit +11.7 points on 20 VLM benchmarks at 2B params, beating InternVL3.5-2B by about 10 points at 17x less training compute. Produced a near-frontier 4B model at 3.3x lower response FLOPs than Qwen3-VL-4B. The lever was purely data curation, not architecture or scale. This is the clearest evidence this year that curation dominates compute at the VLM frontier.

    Work | Claim | Validated Scale | Inference Impact | Replication Risk
    Nous TST | 2-3x wall-clock at matched FLOPs | 270M → 10B MoE | None (no arch change) | Medium
    Star Elastic | 360x cheaper family derivation | Unspecified | Produces size tiers from one run | High
    Datology VLM | +11.7 pts at 17x less compute | 2B and 4B | 3.3x lower response FLOPs | Medium

    TST is the one to spike first: it's a training recipe with no inference-side tax. If it replicates at even 1.6x, it pays for itself on the next full run.
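
The break-even claim is simple arithmetic once you assume training cost scales with wall-clock time, as it does on rented GPUs. The sketch below makes that assumption explicit. The dollar figures are placeholders, not numbers from the paper.

```python
def fraction_saved(speedup: float) -> float:
    """Share of wall-clock time (and rented GPU-hours) saved at a given speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def spike_pays_off(full_run_cost: float, spike_cost: float, speedup: float) -> bool:
    """True when savings on one full run exceed what the spike cost."""
    return full_run_cost * fraction_saved(speedup) > spike_cost


if __name__ == "__main__":
    for s in (1.6, 2.0, 3.0):
        print(f"{s}x speedup saves {fraction_saved(s):.1%} of the run")
    # Placeholder budget: a 100k full run and a 10k spike.
    print(spike_pays_off(full_run_cost=100_000, spike_cost=10_000, speedup=1.6))
```

At 1.6x the saving is 37.5% of the run, so under this assumption the spike pays back whenever it costs less than roughly a third of the next full run.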

    Adjacent: SWE-ZERO-12M-Trajectories

    Kevin Li released 112B tokens, 12M trajectories, 122K PRs across 3K repos in 16 languages, positioned as the largest open agentic trace corpus. Useful for SFT, reward-model training, and synthetic eval generation. Open releases at this size tend to acquire licensing friction within months. Worth mirroring before that happens.

    Action items

    • Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline this month
    • Pull SWE-ZERO-12M-trajectories and stand up a preprocessing pipeline (dedup, license filter, language stratification) before licensing frictions accumulate
    • Run a data-curation ablation on your VLM pipeline: systematic filtering and reweighting against a compute-matched baseline
    • Benchmark Star Elastic's model-family derivation against your current distillation/quantization pipeline once the paper drops
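
A first pass at the preprocessing pipeline in the second item (dedup, license filter, language stratification) might look like the sketch below. The record schema, the license allow-list, and the per-language cap are all assumptions for illustration. The actual SWE-ZERO field names are not given in the source and may differ.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical record schema: {"text": str, "license": str, "language": str}.
ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}  # illustrative allow-list


def dedup(records: Iterable[dict]) -> List[dict]:
    """Drop exact duplicates by hashing whitespace-normalized text."""
    seen, kept = set(), []
    for rec in records:
        normalized = " ".join(rec["text"].split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept


def license_filter(records: Iterable[dict]) -> List[dict]:
    """Keep only records whose license is on the allow-list."""
    return [r for r in records if r.get("license", "").lower() in ALLOWED_LICENSES]


def stratify(records: Iterable[dict], per_language_cap: int) -> Dict[str, List[dict]]:
    """Group by language and cap each group so dominant languages do not swamp the mix."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        if len(buckets[rec["language"]]) < per_language_cap:
            buckets[rec["language"]].append(rec)
    return dict(buckets)


def preprocess(records: Iterable[dict], per_language_cap: int = 1000) -> Dict[str, List[dict]]:
    return stratify(license_filter(dedup(records)), per_language_cap)


if __name__ == "__main__":
    sample = [
        {"text": "fix  bug", "license": "MIT", "language": "python"},
        {"text": "fix bug", "license": "MIT", "language": "python"},
        {"text": "add test", "license": "GPL-3.0", "language": "go"},
        {"text": "refactor", "license": "Apache-2.0", "language": "rust"},
    ]
    print({lang: len(recs) for lang, recs in preprocess(sample).items()})
```

Exact-hash dedup is the cheapest starting point. Near-duplicate detection would be a natural next step for a corpus of this size.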

    Sources: Claude just metered your agent SDK calls · DuckDB shipped a client-server mode

◆ QUICK HITS

  • DuckDB shipped Quack HTTP client-server mode — single-node Spark/Glue jobs under ~100GB now have a clean migration path to ECS Fargate + DuckDB + Terraform at 50%+ cost reduction

    DuckDB shipped a client-server mode this week

  • Update: Apache Iceberg (CVE-2026-42812, CVSS 9.9) lets attackers redirect table metadata writes to attacker-controlled S3 — next query reads poisoned Parquet, next training run ingests corrupted features

    LiteLLM landed in the KEV catalog this week

  • TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini and 1.18s GPT-Realtime — a 3x spread on the metric that determines perceived voice-agent naturalness

    TML is reporting 0.40 seconds of full-duplex latency

  • Affirm says transformer-based underwriting outperforms legacy GBMs across 27M consumers — first production signal that sequence models beat tabular trees in consumer credit, but PCAOB now requires deterministic execution for regulated ML

    The transformer underwriting models are outperforming

  • Only 15% of organizations have the data foundation for agentic AI at scale (Fivetran); data quality/lineage is the #1 blocker cited by ~50% — scope agent projects as data-platform projects with an agent on top

    DuckDB shipped a client-server mode this week

  • LLM-as-a-Verifier beats LLM-as-a-Judge on tie-rate and decision accuracy by decomposing criteria into repeated binary verifications with token-level scoring — a one-day harness rewrite to test

    An Ollama endpoint exposed to the public internet

  • Duolingo publicly pegs AI-generated content 'slop' at ~20% requiring human QC and reversed its blanket 'evaluate employees on AI usage' policy after observing performative adoption without productivity lift

    Duolingo's twenty percent AI slop rate

  • AI agents bypass legacy bot detection at 81% success rate — any user-facing surface feeding experiments or online learning is ingesting traffic the classifiers cannot see

    SAP's Autonomous Enterprise and ServiceNow's Action Fabric

  • Microsoft agent memory architecture (consolidation + forgetting + delayed maturation) stabilizes at 400-500 memories with 97.2% retention precision — concrete alternative to bloated prompts or flat vector top-k

    DuckDB shipped a client-server mode this week

  • Persona drift in LLM agents is measurable within 8 conversational turns (Li et al., COLM 2024) — add a distinctive verbal tic as a regex-based drift canary before any long-session agent ships

    AI personas drift within eight turns
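
The regex-based drift canary in the last item can be prototyped as below. The tic phrase ("Onward!"), the tolerance of one missed turn, and the function name are hypothetical choices, not details from the cited paper.

```python
import re
from typing import List, Optional

# Hypothetical tic: the persona is instructed to end every reply with "Onward!".
TIC = re.compile(r"\bOnward!\s*$")


def first_drift_turn(agent_replies: List[str], tolerance: int = 1) -> Optional[int]:
    """1-based turn at which the tic has been missing for more than `tolerance` turns in a row."""
    misses = 0
    for turn, reply in enumerate(agent_replies, start=1):
        misses = 0 if TIC.search(reply) else misses + 1
        if misses > tolerance:
            return turn
    return None


if __name__ == "__main__":
    replies = [
        "Done. Onward!",
        "Here you go. Onward!",
        "Here is the summary.",
        "Anything else?",
        "Onward!",
    ]
    print(first_drift_turn(replies))  # 4
```

Allowing one missed turn before alerting keeps a single slip from paging anyone, while two in a row is treated as drift.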

◆ Bottom line

The take.

Anthropic killed the flat-rate Claude subsidy the same week Vercel's production data showed 59% of all tokens are agentic — meaning your cost model is wrong by the subscription change AND by the workload-mix shift simultaneously. The teams that survive this quarter are the ones that meter at the gateway, route by task difficulty, and never let a single vendor's pricing change become a sprint-level emergency because they already have the abstraction layer. If you're still calling anthropic.messages.create directly with no fallback, that's no longer technical debt — it's an unhedged financial position.
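
A minimal version of that abstraction layer is an ordered fallback over interchangeable providers. The sketch below uses stand-in callables rather than real vendor SDK calls. `complete_with_fallback`, `AllProvidersFailed`, and both toy providers are illustrative names, not part of any library.

```python
from typing import Callable, List, Tuple

# A provider is a (name, callable) pair; the callable takes a prompt and returns text.
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, text) from the first that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: rate limits, timeouts, budget guards
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers for illustration; real ones would wrap vendor SDK calls.
def primary(prompt: str) -> str:
    raise TimeoutError("rate limited")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


if __name__ == "__main__":
    print(complete_with_fallback("hi", [("primary", primary), ("secondary", secondary)]))
```

Because call sites depend only on the `Provider` shape, a pricing change becomes a reordering of the list rather than a code change.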

- Promit, reading as Data Science

Frequently asked

What exactly changed with Anthropic's pricing, and when does it take effect?
Subscriptions now convert to dollar-matched API credits across the Agent SDK, claude-p, GitHub Actions, and third-party harnesses, eliminating the prior 70-90% effective subsidy on programmatic usage. The change is already live, and on June 15 third-party tools (Zed, Conductor, OpenCode, T3 Code) move to a separate credit bucket with no rollover and overflow billed at API list rates.
How should I instrument cost attribution given Claude has no native per-user telemetry?
Deploy an LLM gateway like LiteLLM or Portkey in front of every Claude-backed call, with per-user, per-feature tagging and daily budget alerts wired in before June 15. Anthropic ships no native per-tenant attribution, so without gateway-level logging, overruns surface in the monthly invoice rather than your observability dashboard.
Why do single-turn eval scores no longer reflect production reality?
Vercel's production index shows agentic workloads now account for 59% of all token volume, with 15:1 input-to-output ratios versus the 3:1 most cost models assume. A planner that scores 90%+ on single-turn accuracy can still burn 40K tokens arguing with itself before failing, so trajectory-level metrics — tool-call precision, steps-to-completion, cost-per-successful-task, error recovery — are required to see where the bill actually lives.
How much does the eval harness matter relative to model choice for vulnerability discovery?
By at least 50x on real codebases. The same Claude Mythos Preview surfaced 271 true-positive bugs in Firefox 150 under Mozilla's custom agentic harness with reproducible test cases and ephemeral VMs, but produced only 1 low-severity CVE and roughly 80% false positives against curl with an out-of-box scanner. Harness investment dominates model selection on agentic security workloads.
Which training-efficiency result should I prioritize spiking first?
Nous Token Superposition Training, because it claims 2-3x pretraining wall-clock speedup at matched FLOPs with no inference-time architecture change, validated from 270M up to 10B-A1B MoE. Even if it replicates at only 1.6x on a 1B-param continued-pretraining run, it pays for itself on the next full run without touching the serving stack — unlike Star Elastic's 360x family-derivation claim, which carries much higher replication risk.
