Data Science daily

Edition 2026-05-24 · read as Data Science

Anthropic Ends Claude Subscription Discount, OpenAI Counters

Sources: 36 · Words: 1,503 · Read: 8 min

Topics: Agentic AI · LLM Inference · AI Regulation

◆ The signal

Anthropic killed the 70-90% effective discount on programmatic Claude usage overnight — subscriptions now convert to dollar-matched API credits across Agent SDK, GitHub Actions, and third-party harnesses. Hours later, OpenAI dropped a 2-month-free Codex enterprise switch promo. If you haven't reconciled your Claude token burn against the new credit cap this week, you're making a pricing decision by default, and the overrun is already accumulating.

◆ INTELLIGENCE MAP

  1. 01

    Anthropic Pricing Reset + Capacity Crisis

    act now

    Claude programmatic usage now metered at API rates (was 70-90% discounted). Anthropic's 80x growth vs 10x planned forced a Colossus 1 lease (220K GPUs). ServiceNow blew its full-year Claude budget by May. June 15 brings a separate credit split for third-party tools. OpenAI is counter-offering 2 months free Codex.

    80x growth vs plan (capacity) · 11 sources

    Metrics tracked: Growth vs plan, Colossus GPUs leased, Anthropic B2B share, OpenAI B2B share, Opus 4.7 image cost

    Chart: B2B share (%). Anthropic 34.4, OpenAI 32.3.
  2. 02

    59% Agentic Traffic vs. Single-Turn Eval Harnesses

    act now

    Vercel's AI Gateway production index: 59% of tokens are multi-turn agentic workloads. Anthropic captures 61% of spend via Opus; Google captures 38% of volume via Flash. Cost models built on single-turn ratios are off by ~5x. Duolingo anchors production slop at 20% — a rare honest quality number from a real deployment.

    59% of tokens now agentic · 5 sources

    Metrics tracked: Agentic token share, Anthropic spend share, Google volume share, Duolingo slop rate, Cost model error

    Chart: token share (%). Agentic 59, single-turn 41.
  3. 03

    Training Efficiency: 2-360x Improvements in One Week

    monitor

    Three independent results compress training economics. Nous TST delivers 2-3x wall-clock speedup at matched FLOPs with no inference-time change (validated to 10B). NVIDIA Star Elastic produces model-size families from one post-training run at 360x lower cost. Datology beats InternVL3.5-2B by 10 points on 20 VLM benchmarks at 17x less compute via pure curation.

    17x less VLM training compute · 2 sources

    Metrics tracked: TST wall-clock gain, Star Elastic cost cut, Datology compute cut, Datology benchmark gain, TST scale validated

    Chart: improvement factor (x). Nous TST 3, Datology VLM 17, Star Elastic 360.
  4. 04

    AI Cyber Offense Crosses Full-Takeover Threshold

    monitor

    AISI confirmed Mythos is the first model to clear both simulated attack ranges (full network takeover). Google intercepted AI-built cybercrime tooling in the wild — no longer hypothetical. Mozilla's 271 bugs vs curl's 1 CVE using the same model proves the harness, not the weights, determines vulnerability yield. Expect patch SLAs calibrated to quarterly CVE cadence to fail against model-release cadence.

    271 bugs (Mozilla harness) · 7 sources

    Metrics tracked: AISI ranges cleared, Mozilla bugs found, curl bugs found, Products scanned (Palo), PraisonAI exploit time

    Chart: bugs found. Mozilla (custom harness) 271, curl (generic scan) 1.
  5. 05

    Compute Supply Crunch: 4:1 Demand-to-Supply Ratio

    background

    Nebius posted 684% YoY revenue with 4+ customers per GPU and guides $3-3.4B for 2026. Cerebras IPO'd at $56B with a 70% first-day pop; OpenAI committed $20B. The 9GW Stratos data center drew 4,000 complaints and a referendum. H2 training capacity priced on today's availability is probably mispriced.

    4:1 demand-to-supply ratio · 5 sources

    Metrics tracked: Nebius YoY growth, Nebius 2026 guide, Cerebras market cap, Stratos power need, Cisco AI order growth

    Chart: Nebius revenue (units not shown in the source). 2025 revenue 530, 2026 guide 3200.

◆ DEEP DIVES

  1. 01

    Anthropic's Triple Shock: Your Claude Cost Model Broke Overnight

    What Happened

    Anthropic shipped three pricing changes inside a 48-hour window, and they compound. First, all programmatic subscription usage (Agent SDK, claude -p, GitHub Actions, third-party harnesses) converted from flat-rate to dollar-matched API credits. The 70-90% effective discount power users had been quietly running on is gone. Second, starting June 15, third-party tool usage (Zed, Conductor, OpenCode, T3 Code) lands in a separate credit bucket with no rollover, and overflow bills at API list rates. Third, Opus 4.7 tripled image processing costs.

    The capacity numbers explain the timing. Dario Amodei conceded Anthropic planned for 10x growth and got 80x in revenue and usage. The gap forced an emergency lease of xAI's entire Colossus 1 cluster — 220,000+ GPUs across H100, H200, and GB200. That is roughly 45% of xAI's current capacity changing hands.


    Why It Matters Now

    ServiceNow's CDIO burned through the full-year Claude budget by May. National Life Group's CIO publicly called Claude 'not great for companies' wanting per-user monitoring. Both complaints route to the same root cause. Anthropic provides no native per-user, per-tool usage telemetry and no SLAs on latency or availability. Customers wire their own observability or fly blind.

    If the vendor cannot tell you which user burned the token, the problem is not cost. It is observability, and it is yours to fix before the next invoice.

    OpenAI's response shipped the same day: a 2-month-free Codex enterprise switch promotion aimed at the developers Anthropic just repriced. Ramp puts Anthropic at 34.4% of paying businesses against OpenAI at 32.3%, the first crossover. What that snapshot cannot show is how share moves once the new pricing hits actual invoices, and that is the measurement that matters. The market is contested. Switching costs are the only moat left.


    The Capacity Math

    Surface | Before | After (May 7-14)
    Claude Code limits | 5-hour cap | Doubled
    Peak-hours throttle | Reduced limits | Removed
    Opus API rate limits | Squeezed | 'Substantially raised'
    Fleet composition | Anthropic-managed | +220K GPUs via Colossus

    Any Claude benchmark from before May 7 is stale. Architectural decisions made on April or early-May numbers describe a system that no longer exists; aggressive caching keyed to the old discount structure is the obvious example. Re-baseline after the new caps land.

    Action items

    • Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals) and reconcile projected token burn against the new credit cap before month-end
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature tagging and daily token budget alerts this sprint
    • Run a 2-month Codex evaluation under OpenAI's free enterprise switch promo using matched prompts and your own eval harness
    • Avoid long-term Anthropic commits (annual contracts, dedicated capacity) until post-Colossus integration stability is observable in 6-8 weeks
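
    The audit in the first two action items reduces to pricing logged tokens at list rates and comparing the total to the subscription credit. A minimal sketch, assuming your gateway already tags each call with a user and a surface; the `UsageRecord` shape, the `reconcile` helper, and the per-million-token prices are all hypothetical placeholders, not Anthropic's rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical list prices in USD per million tokens; substitute the
# rates on your own invoice. These are NOT Anthropic's published prices.
PRICE_PER_MTOK = {"input": 5.00, "output": 25.00}


@dataclass(frozen=True)
class UsageRecord:
    user: str           # who triggered the call (tag added at your gateway)
    surface: str        # e.g. "agent-sdk", "github-actions", "third-party"
    input_tokens: int
    output_tokens: int


def list_rate_cost(rec: UsageRecord) -> float:
    """Price one record at the assumed list rates, in USD."""
    return (
        rec.input_tokens * PRICE_PER_MTOK["input"]
        + rec.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def reconcile(records, monthly_credit_usd):
    """Return (spend per user, total spend, projected overrun) in USD."""
    per_user = defaultdict(float)
    for rec in records:
        per_user[rec.user] += list_rate_cost(rec)
    total = sum(per_user.values())
    overrun = max(0.0, total - monthly_credit_usd)
    return dict(per_user), total, overrun


records = [
    UsageRecord("alice", "agent-sdk", 40_000_000, 2_000_000),
    UsageRecord("bob", "github-actions", 10_000_000, 1_000_000),
]
per_user, total, overrun = reconcile(records, monthly_credit_usd=200.0)
print(per_user, total, overrun)
```

    Grouping by `surface` instead of `user` would give the split needed for the separate third-party credit bucket that starts June 15.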

    Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with · Ramp's AI Index shows Anthropic at 34.4%

  2. 02

    59% Agentic Traffic — Your Eval Harness Is Scoring the Minority

    The Production Reality

    Vercel's AI Gateway production index, across 200,000 teams over 7 months, reports 59% of all tokens are now agentic — multi-turn, tool-calling traces. Six months ago that figure was under 20%. Most cost models in use were fit when input-output ratios sat near 3:1. Agentic traces run closer to 15:1 on input, with heavy cache reuse on some providers and none on others. A forecast built on last year's ratio is off by roughly 5x on spend.
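
    The roughly 5x figure is the ratio shift itself (15:1 over 3:1), and it carries through to spend when input tokens dominate the bill. A minimal sketch of that arithmetic, assuming output-token volume is held constant and cache discounts are ignored; the relative prices are placeholders chosen only to show how the multiplier moves with the output price.

```python
def cost_per_output_token(in_out_ratio, price_in, price_out):
    """Cost of one output token plus the input tokens that accompany it."""
    return in_out_ratio * price_in + price_out


def forecast_multiplier(old_ratio, new_ratio, price_in, price_out):
    """How far actual spend exceeds a forecast fit on the old ratio,
    holding output-token volume constant."""
    return (cost_per_output_token(new_ratio, price_in, price_out)
            / cost_per_output_token(old_ratio, price_in, price_out))


# Input-token volume alone scales by new_ratio / old_ratio.
input_volume_multiplier = 15 / 3

# Hypothetical relative prices (input = 1 unit). Caching is ignored.
input_dominated = forecast_multiplier(3, 15, price_in=1.0, price_out=0.0)
output_5x_input = forecast_multiplier(3, 15, price_in=1.0, price_out=5.0)
print(input_volume_multiplier, input_dominated, output_5x_input)
```

    Under these assumptions the spend multiplier sits between the two printed cases, so re-fit the forecast with your own price sheet and cache hit rates before quoting a single number.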

    The eval harness mismatch is worse. Most harnesses still score single-turn responses against reference answers. What that score cannot show is what happens when the median request is a multi-step tool loop with retries and the failure mode is a planner burning 40,000 tokens arguing with itself before giving up. Final-answer accuracy reads 90%+ in both the healthy and the pathological case. The bill, not the accuracy, is where the failure lives.


    The Spend-Volume Divergence

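    Provider | Share of Spend | Share of Volume | Primary Model | Implied Role
    Anthropic | 61% | not reported | Opus | Reasoning / planning nodes
    Google | not reported | 38% | Flash | High-throughput utility calls
    Others | ~39% | ~62% | Mixed | Mixed

    The Others row is the remainder of each column, so its spend share includes Google and its volume share includes Anthropic.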

    That is a textbook tiered-routing signature. Expensive models handle planning and reasoning; cheap models do retrieval rewriting, extraction, and classification. Running Opus on every agent step leaves 20-40% of inference spend on the table at constant trajectory completion rate. The condition on that estimate is a stable router policy and unchanged tool surface; rewrite either and you re-measure.
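
    The routing pattern described here can be sketched as a policy that sends only planning and reasoning steps to the expensive tier. This is an illustration under stated assumptions: the `Step` type, the step kinds, and the blended per-million-token prices are hypothetical, and the example measures cost only, not trajectory completion rate.

```python
from dataclasses import dataclass

# Hypothetical blended prices in USD per million tokens; not vendor quotes.
PRICE_PER_MTOK = {"reasoning": 15.0, "utility": 0.5}

# Assumed policy: only these step kinds need the expensive tier.
REASONING_KINDS = {"plan", "reason"}


@dataclass(frozen=True)
class Step:
    kind: str     # "plan", "reason", "extract", "rewrite", "classify", ...
    tokens: int   # total tokens consumed by this step


def route(step: Step) -> str:
    """Pick a model tier for one agent step."""
    return "reasoning" if step.kind in REASONING_KINDS else "utility"


def trajectory_cost(steps, tier_for=route) -> float:
    """USD cost of a trajectory under a given routing policy."""
    return sum(s.tokens * PRICE_PER_MTOK[tier_for(s)] for s in steps) / 1_000_000


steps = [
    Step("plan", 700_000),
    Step("extract", 200_000),
    Step("rewrite", 100_000),
]
all_reasoning = trajectory_cost(steps, tier_for=lambda s: "reasoning")
tiered = trajectory_cost(steps)
savings = 1 - tiered / all_reasoning
print(f"all-reasoning ${all_reasoning:.2f}, tiered ${tiered:.2f}, saved {savings:.0%}")
```

    The saving depends entirely on the token mix and the price gap, so the completion rate has to be re-measured under the tiered policy before the number means anything.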


    The Duolingo Anchor

    Duolingo's CEO pegged AI-generated content 'slop' at ~20% requiring human QC. That is a rare production quality number from a deployment at volume, and it is the anchor an internal harness should be calibrated against. The same company reversed its blanket 'evaluate all employees on AI usage' policy after observing performative adoption without productivity lift. Goodhart's Law arrived on schedule.
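
    Calibrating against that anchor is a one-line comparison once reviewer rejections are counted. A minimal sketch, assuming "slop" maps to outputs a human reviewer rejected and using the 10-percentage-point threshold from this section's action items; the function names are hypothetical.

```python
DUOLINGO_SLOP_RATE = 0.20   # ~20% disclosed, as cited above
ALERT_DELTA_PP = 10.0       # threshold in percentage points


def slop_delta_pp(rejected: int, reviewed: int) -> float:
    """Observed rejection rate minus the anchor, in percentage points."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    return (rejected / reviewed - DUOLINGO_SLOP_RATE) * 100


def needs_hitl_review(rejected: int, reviewed: int) -> bool:
    """True when the gap to the anchor exceeds the threshold either way."""
    return abs(slop_delta_pp(rejected, reviewed)) > ALERT_DELTA_PP


print(needs_hitl_review(rejected=35, reviewed=100))  # 15 pp above the anchor
print(needs_hitl_review(rejected=22, reviewed=100))  # 2 pp above the anchor
```

    A rate far below the anchor is flagged too, since it can mean reviewers are under-rejecting as easily as it can mean the outputs are better.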

    LLM-as-Verifier vs. LLM-as-Judge

    A methodology paper this week argues that decomposing evaluation into repeated binary verifications at token granularity eliminates the tie-rate problem in pairwise judges. The mechanism is mathematically clean: one high-variance categorical judgment replaced by k lower-variance Bernoulli tests. Worth a one-day swap on one eval pipeline to measure the tie-rate reduction before committing further.
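
    The variance argument can be checked with a toy model. The sketch below is not the paper's method: it assumes each of k binary checks passes independently with a fixed probability per candidate, then computes the exact chance that two candidates end up with the same score and the variance of the averaged score. The pass probabilities are hypothetical.

```python
from math import comb


def binom_pmf(k: int, j: int, p: float) -> float:
    """P(exactly j of k independent checks pass), each passing with prob p."""
    return comb(k, j) * p**j * (1 - p) ** (k - j)


def tie_probability(k: int, p_a: float, p_b: float) -> float:
    """P(candidates A and B pass the same number of k binary checks)."""
    return sum(binom_pmf(k, j, p_a) * binom_pmf(k, j, p_b) for j in range(k + 1))


def score_variance(k: int, p: float) -> float:
    """Variance of the mean of k Bernoulli(p) checks."""
    return p * (1 - p) / k


# Hypothetical pass probabilities for two candidate responses.
for k in (1, 4, 16, 64):
    print(k, round(tie_probability(k, 0.6, 0.5), 3), round(score_variance(k, 0.6), 4))
```

    In this model both the tie probability and the score variance fall as k grows. Real verifier calls are correlated, so the measured reduction on an actual pipeline will likely be smaller.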

    If 59% of your tokens are agentic but 100% of your evals are single-turn, you're flying instruments-out.

    Action items

    • Add trajectory-level metrics (tool-call F1, steps-to-completion, cost-per-successful-task) to your eval harness alongside single-turn benchmarks this sprint
    • Instrument per-node token cost in agent graphs and route utility calls (JSON extraction, query rewriting) to Flash/Haiku-class models
    • Benchmark your LLM output acceptance rate against Duolingo's ~20% slop disclosure; adjust HITL budget if delta exceeds 10pp in either direction
    • Prototype LLM-as-Verifier on one existing eval pipeline; measure tie-rate and inter-run variance against current judge setup
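
    The three trajectory-level metrics in the first action item can be computed from logged runs. A minimal sketch under stated assumptions: the `Trajectory` record is hypothetical, tool-call F1 is taken here as multiset overlap on tool names with order ignored (one of several reasonable definitions), and failed runs count toward cost.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    tool_calls: tuple      # tool names the agent actually invoked, in order
    expected_calls: tuple  # reference tool names for the task
    succeeded: bool        # did the task pass its end-to-end check?
    cost_usd: float        # total spend for the trajectory


def tool_call_f1(actual, expected) -> float:
    """F1 over tool names, counting repeats (multiset overlap, order ignored)."""
    overlap = sum((Counter(actual) & Counter(expected)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


def summarize(trajectories):
    """Aggregate the three trajectory-level metrics over a batch."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    wins = [t for t in trajectories if t.succeeded]
    total_cost = sum(t.cost_usd for t in trajectories)
    return {
        "mean_tool_call_f1": sum(
            tool_call_f1(t.tool_calls, t.expected_calls) for t in trajectories
        ) / len(trajectories),
        # Steps-to-completion is only defined for runs that completed.
        "mean_steps_to_completion": (
            sum(len(t.tool_calls) for t in wins) / len(wins) if wins else None
        ),
        # Failed runs still cost money, so they stay in the numerator.
        "cost_per_successful_task": total_cost / len(wins) if wins else None,
    }


batch = [
    Trajectory(("search", "fetch", "answer"), ("search", "fetch", "answer"), True, 0.25),
    Trajectory(("search",) * 6, ("search", "fetch", "answer"), False, 0.75),
]
summary = summarize(batch)
print(summary)
```

    The second trajectory is the looping-planner case: it never succeeds, yet it triples the cost per successful task, which a final-answer accuracy score would not surface.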

    Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · Duolingo's twenty percent AI slop rate · LLM-as-a-Verifier reframes evaluation

  3. 03

    Training Efficiency Moved 2-360x — The Unit Economics of Your Next Run Just Changed

    Three Results, Same Direction

    Training unit economics shifted on three fronts this week, all pointing the same way. Each result matters for any team running its own pretraining, fine-tuning, or distillation.

    Work | Claim | Scale Validated | Inference Impact | Replication Risk
    Nous TST | 2-3x wall-clock at matched FLOPs | 270M → 10B-A1B MoE | None; no arch change | Medium; single-source
    NVIDIA Star Elastic | 360x cheaper model-family derivation | Not specified | Produces family from one run | High; lab-reported
    Datology VLM | +11.7 pts, 17x less compute | 2B and 4B params | Lower response FLOPs | Medium; benchmark-selection risk

    Which One to Spike First

    Token Superposition Training (TST) is the cleanest opportunity. It is a pretraining recipe change with no downstream effect on the inference side. If it replicates, it is a free 2-3x on wall-clock with no new serving complexity, since the architecture at serve time is unchanged. The validation range, 270M to 10B, covers most continued-pretraining workloads teams actually run.

    Star Elastic's 360x is the kind of headline number that always shrinks under independent eval. Given the gap between paper conditions and production data, I would plan around a 30x hold and treat anything above that as upside. Even at 30x, one post-training run producing the full edge/mobile/server family eliminates the multi-run coordination overhead teams currently eat to maintain deployment tiers.

    Datology's result is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation. They beat InternVL3.5-2B by ~10 points at 17x less training compute, purely via data selection. The serving side adds to that: the near-frontier 4B model achieves 3.3x lower response FLOPs than Qwen3-VL-4B, a real inference-cost win on top of the training-cost win.


    The DuckDB + Kafka Infrastructure Shift

    Two data-infrastructure releases land on the same architectural assumption. DuckDB's Quack HTTP protocol turns embedded analytics into a proper shared service, and is a credible replacement for Spark-on-Glue for single-node ETL under ~100GB. Kafka Share Groups break the partition==max-consumers ceiling with ~linear 8x scaling at 32 instances on I/O-bound workloads. Combined with the ECS Fargate + Terraform pattern, that is a template for deleting the distributed-compute costume a lot of single-node jobs are wearing.

    VLM training value moved from compute to curation. Data pipelines are following a parallel move from cluster orchestration to statistics hygiene.

    Action items

    • Spike Token Superposition Training on a 1B continued-pretraining run against a matched-FLOPs baseline this quarter
    • Audit Glue/EMR job catalog for single-node candidates (<100GB working set) and port one onto ECS Fargate + DuckDB + Terraform
    • Run ANALYZE/compute-stats coverage audit across Iceberg/Delta tables; add stats maintenance to table-level SLAs
    • Benchmark Kafka Share Groups against most partition-bound consumer group (embedding/enrichment workloads first)

    Sources: DuckDB shipped a client-server mode this week · Claude just metered your agent SDK calls · DuckDB + Quack plus ECS Fargate/DuckDB/Terraform ETL pattern

  4. 04

    AI Cyber Offense Hit Full Network Takeover — Your Eval Harness Doesn't Measure This

    The Capability Threshold

    UK AISI confirmed that Anthropic's Mythos cleared both simulated attack ranges, the first model to do so. The prior generation topped out at 'advanced persistence.' GPT-5.5-cyber cleared one of two. AISI is already building harder tests because the current ones are saturating, which is what you'd expect when the leaderboard catches up to the eval rather than the underlying capability. Separately, Google's threat-intel team observed a hacking group using AI to build cybercrime tooling in the wild. That is the first production-grade confirmation of agentic misuse moving from red-team hypothesis to detected incident.


    The Harness Delta That Matters

    Mozilla wrapped a custom agentic harness around existing fuzzing infrastructure and surfaced 271 bugs in Firefox 150, including sandbox escapes, use-after-frees, and race conditions. Daniel Stenberg pointed the same model at curl and got 1 low-severity CVE with 4 false positives. Same weights, two orders of magnitude difference in yield. The variable that moved was the harness.

    Dimension | Mozilla + Firefox | Stenberg + curl
    Model | Mythos Preview | Mythos Preview
    Harness | Custom agentic, fuzzer-integrated | Out-of-box scan
    Bugs surfaced | 271 | 5 claimed (4 FP)
    True positives | Acted upon | 1 low-severity CVE

    The implication for anyone shipping agents: eval budget dominates model choice by 50x or more. The thing the headline benchmark doesn't tell you is which harness produced it. A week of domain-specific scorer and reproducible test-case emitter work looks like a better trade than the next dollar of inference.
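
    The "two orders of magnitude" claim is simple arithmetic on the reported counts. The sketch below uses only the figures from the comparison above; the variable names are illustrative, and the ratio is not a controlled measurement because the two targets are different codebases and Mozilla's false-positive count is not stated.

```python
# Figures as reported in the comparison above.
mozilla_bugs = 271
curl_claimed = 5
curl_false_positives = 4

curl_true_positives = curl_claimed - curl_false_positives
curl_precision = curl_true_positives / curl_claimed
yield_ratio = mozilla_bugs / curl_true_positives

print(f"curl precision: {curl_precision:.0%}")
print(f"yield ratio: {yield_ratio:.0f}x")
```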


    New CVEs Targeting Your Stack This Week

    Beyond the previously reported LiteLLM KEV, new critical vulnerabilities landed on infrastructure data teams actually run:

    • Apache Iceberg (CVE-2026-42812, CVSS 9.9): attackers redirect table metadata writes to attacker-chosen S3 prefixes. Next query reads poisoned Parquet. Next training run ingests corrupted features.
    • Apache Polaris (CVE-2026-42809/10/11, CVSS 9.9): credential-broadening enables cross-tenant cloud access.
    • Argo CD (3.2.x/3.3.x, CVSS 9.6): read-only users extract plaintext K8s Secrets including model-registry tokens and cloud credentials.
    • NGINX rewrite module: 18-year latent unauthenticated RCE affecting every model-serving gateway behind NGINX.

    Vulnerability discovery just moved from human-weeks to model-minutes. If the patch SLA is not benchmarked against inference time, the defense is tuned to last year's threat model.

    Action items

    • Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every K8s Secret in reachable namespaces this week
    • Audit Iceberg/Polaris catalog configurations: enforce write-path allowlisting for table metadata locations and rotate exposed credentials
    • Patch NGINX across all inference gateways; audit rewrite-module usage in model routing configs
    • Add autonomous-cyber-capability tier to model eval harness (AISI-style attack-range tasks) before next model upgrade
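
    The Argo CD item is easy to turn into a fleet check. A minimal sketch, assuming you can list the running version string per cluster; the thresholds are the ones stated in the action item above, the helper names are hypothetical, and any release line outside 3.2.x and 3.3.x is reported as unknown instead of guessed.

```python
# Minimum patched release per affected minor line, as stated in the
# action item above. Other release lines are outside that statement.
MIN_PATCHED = {(3, 2): 12, (3, 3): 10}


def parse_version(text: str) -> tuple:
    """Parse 'v3.2.11' or '3.2.11' into (3, 2, 11); suffixes are ignored."""
    core = text.strip().lstrip("v").split("+")[0].split("-")[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def argo_cd_status(version: str) -> str:
    """Return 'patched', 'vulnerable', or 'unknown' for an Argo CD version."""
    major, minor, patch = parse_version(version)
    floor = MIN_PATCHED.get((major, minor))
    if floor is None:
        return "unknown"  # not covered by the versions cited above
    return "patched" if patch >= floor else "vulnerable"


for v in ("v3.2.11", "3.2.12", "3.3.9", "3.3.10", "3.1.4"):
    print(v, argo_cd_status(v))
```

    Treat "unknown" as a prompt to read the upstream advisory. A patched version does not remove the need to rotate Secrets that were readable before the upgrade.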

    Sources: LiteLLM landed in the KEV catalog · Mythos cleared the AISI attack ranges · PraisonAI exploited in 4 hours · Mozilla shipped 271 bugs · Google's report of a threat actor using AI · The UK AISI evaluations report full network takeovers

◆ QUICK HITS

  • Update: Apache Iceberg/Polaris CVSS 9.9 — metadata write redirect enables silent training data poisoning via attacker-chosen S3 prefixes; new this cycle, not previously reported

    LiteLLM landed in the KEV catalog this week

  • SWE-ZERO-12M-trajectories released: 112B tokens, 12M trajectories, 122K PRs, 3K repos, 16 languages — largest open agentic trace corpus; acquire before licensing frictions appear

    Claude just metered your agent SDK calls

  • Fivetran readiness index: only 15% of orgs have data foundation for agentic AI; data quality/lineage cited as #1 blocker by ~50% — use as gating scorecard before agent project approval

    DuckDB shipped a client-server mode this week

  • TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini-flash-live and 1.18s GPT-Realtime-2.0 — 3x gap on the metric that determines voice-agent naturalness

    TML is reporting 0.40 seconds of full-duplex latency

  • Microsoft's agent memory architecture (consolidation + forgetting + delayed maturation) stabilizes at 400-500 memories with 97.2% retention precision — alternative to flat vector top-k

    DuckDB shipped a client-server mode this week

  • Affirm claims transformer-based underwriting beats legacy GBMs at scale (27M consumers) — but new PCAOB/COSO guidance requires deterministic execution and tamper-evident audit trails that transformers don't satisfy by default

    The transformer underwriting models are outperforming

  • Claude Code /goal command: runs unattended against a stated objective with Haiku-as-evaluator, but termination judges only the transcript — wire PostToolUse hooks to pipe test output before trusting it

    Anthropic shipped a /goal command in Claude Code

  • GPT-5.5 Instant claims 52.5% fewer hallucinated claims vs 5.3 Instant — vendor-reported, no methodology disclosed; run against your internal hallucination set before changing defaults

    TML is reporting 0.40 seconds of full-duplex latency

  • Persona drift measurable within 8 dialogue turns (Li et al., COLM 2024) — add a verbal-tic canary and regex detector to agent logs as a zero-cost drift signal

    AI personas drift within eight turns

  • Gemini reproducibly leaks real phone numbers from training data — multiple independent incidents; add PII extraction eval (canary insertion + divergence attacks) to CI before next release cut

    Gemini is the latest model to surface PII
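
    The persona-drift quick hit above suggests a verbal-tic canary with a regex detector. One way to read that, sketched under assumptions: the persona is instructed to use a signature phrase, and drift is flagged when the phrase goes missing for several agent turns in a row. The phrase, the window size, and the function names are hypothetical.

```python
import re

# Hypothetical canary: the persona is instructed to use this phrase often.
CANARY = re.compile(r"\bright\s+then\b", re.IGNORECASE)


def canary_hits(agent_turns):
    """Return one boolean per agent turn: did the canary phrase appear?"""
    return [bool(CANARY.search(turn)) for turn in agent_turns]


def drift_turn(agent_turns, window=3):
    """Index of the first turn that starts `window` canary-free turns in a
    row, or None if the persona never goes that long without the tic."""
    misses = 0
    for i, hit in enumerate(canary_hits(agent_turns)):
        misses = 0 if hit else misses + 1
        if misses == window:
            return i - window + 1
    return None


turns = [
    "Right then, let's look at your logs.",
    "Right then: the job failed at step 3.",
    "The retry policy is misconfigured.",
    "You should raise the timeout.",
    "Let me know if that helps.",
]
print(drift_turn(turns))
```

    Run over existing agent logs, this costs nothing at inference time. The window should be tuned against how often the persona is actually expected to use the phrase.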

◆ Bottom line

The take.

Anthropic killed the flat-rate Claude discount overnight while admitting an 8x capacity-planning miss, 59% of production tokens are now agentic traces your single-turn eval harness can't score, and three training-efficiency results (2-360x) mean the unit economics of your next run are materially different from your last one — the teams that re-price, re-instrument, and re-baseline this sprint will be the ones still shipping when the invoice arrives.

- Promit, reading as Data Science

Frequently asked

How do I quickly tell if my Claude programmatic spend is overrunning the new credit cap?
Pull the last 30 days of token burn across Agent SDK, GitHub Actions, and any third-party harnesses (Zed, Conductor, OpenCode, T3 Code), price it at API list rates, and compare to your dollar-matched subscription credits. If list-rate spend exceeds credits, the delta is your projected monthly overrun. Anthropic ships no native per-user telemetry, so the reconciliation has to come from your gateway logs or provider invoices.
Why are single-turn eval scores misleading when most production traffic is agentic?
Single-turn harnesses score final-answer accuracy against a reference, which stays high even when a planner burns 40,000 tokens looping before it converges. With 59% of tokens now in multi-turn tool-calling traces, the dominant failure mode is cost and step-count blowup, not wrong answers. Trajectory-level metrics — tool-call F1, steps-to-completion, and cost-per-successful-task — are required to see it.
Which training-efficiency result is safest to pilot first, and why?
Token Superposition Training is the safest pilot because it is a pretraining recipe change with no inference-side impact. The serving architecture is unchanged, so a failed replication costs only the spike run, while a successful one delivers 2–3x wall-clock at matched FLOPs. It has also been validated from 270M up to a 10B-A1B MoE, which covers most continued-pretraining workloads.
Why does the Mozilla vs. curl bug-yield gap matter for agent design?
Same model weights produced 271 actionable Firefox bugs versus one low-severity curl CVE with four false positives — the only variable was the harness. Mozilla wrapped the model in a fuzzer-integrated agentic loop with reproducible test-case emission; the curl run was an out-of-box scan. The lesson is that eval and harness investment dominates model selection by roughly 50x on yield.
What's the immediate blast radius of the Iceberg and Argo CD CVEs for a data team?
The Iceberg CVE (9.9) lets an attacker redirect table metadata writes to an S3 prefix they control, so the next query or training run silently ingests poisoned Parquet — a feature-store and model-training corruption vector. The Argo CD CVE (9.6) lets read-only users extract plaintext Kubernetes Secrets, which typically include model-registry tokens, object-store keys, and cloud credentials. Both require patching and credential rotation this week, not next sprint.
