Data Science daily

Edition 2026-05-27 · read as Data Science

Anthropic Ends Claude Subscription Subsidy for Agent Pipelines

Sources: 36 · Words: 1,553 · Read: 8 min

Topics: Agentic AI · LLM Inference · AI Regulation

◆ The signal

Vercel's production traces show 59% of tokens are now agentic, and agentic traces compound 5-15x per task against single-shot baselines. Anthropic picked this week to convert Claude subscriptions into dollar-matched API credits across the Agent SDK, GitHub Actions, and third-party harnesses, which removes the 70-90% effective subsidy those pipelines were quietly running on. Third-party tool credits split off further on June 15, with no rollover. Any pipeline still budgeted on flat-subscription economics is carrying an overrun it has not measured yet.

◆ INTELLIGENCE MAP

  1. Anthropic's Triple Squeeze: Metering + Capacity + June 15 Deadline

    act now

    Anthropic metered all programmatic Claude usage (killing 70-90% alt-harness discount), admitted 80x growth vs 10x plan, leased xAI's entire 220K-GPU Colossus 1 cluster, and scheduled June 15 third-party credit separation. ServiceNow burned its full-year budget by May. Single-provider risk now has a number: 8x capacity miss.

    80x growth vs planned capacity · 9 sources
    [Chart: planned growth 10x vs. actual growth 80x]
  2. 59% Agentic Threshold: Your Eval Harness Measures the Minority

    act now

    Vercel's AI Gateway (200K teams, 7 months) shows 59% of production tokens are multi-turn agentic traces. Anthropic captures 61% of spend via Opus, Google captures 38% of volume via Flash. Single-turn eval harnesses and single-model routing are now measuring and optimizing the minority of production traffic.

    59% of tokens are agentic · 5 sources
    [Chart: agentic tokens 59% vs. single-turn tokens 41%]
  3. Training Efficiency Step-Change: 2-360x Cost Reductions

    monitor

    Three independent results this week move pre-training and post-training unit economics: Nous TST delivers 2-3x wall-clock speedup at matched FLOPs (validated 270M→10B MoE), Datology beats InternVL3.5-2B by 10pts at 17x less compute via curation alone, and NVIDIA Star Elastic produces model-size families at 360x less cost than pretraining each.

    17x compute reduction (VLM) · 2 sources
    [Chart: claimed efficiency multiples: Nous TST 3x, Datology VLM 17x, Star Elastic 360x]
  4. Lakehouse & Agent Framework CVEs Cascade

    monitor

    Apache Iceberg (CVSS 9.9) lets attackers redirect table metadata to poisoned S3 prefixes. Polaris (CVSS 9.9) broadens credentials cross-tenant. Argo CD (CVSS 9.6) exposes K8s Secrets to low-priv users. PraisonAI was weaponized within 4 hours of disclosure. Draw your data architecture and every layer has a 9.0+ CVE this cycle.

    4hrs PraisonAI time-to-exploit · 4 sources
    [Chart: CVSS scores: Iceberg metadata redirect 9.9, Polaris credential broadening 9.9, Argo CD secret exposure 9.6, n8n SQLi + OAuth theft 9.8, NGINX rewrite RCE 9.1]
  5. Compute Supply Crunch: 4:1 Demand Ratio, $56B Cerebras IPO

    background

    Nebius reports 4+ customers per GPU brought online, 684% YoY revenue growth, and $3-3.4B 2026 guidance. Cerebras IPO'd at $56B with OpenAI's $20B commitment validating non-Nvidia inference silicon. Cisco AI orders jumping from $5B to $9B. Memory hardware shortage driving product redesigns. Reserved capacity is the only hedge.

    $56B Cerebras IPO valuation · 5 sources
    [Chart: Nebius 2025 revenue 530 vs. Nebius 2026 guide 3200; units not labeled in the source]

◆ DEEP DIVES

  1. Anthropic's Triple Squeeze: Re-Price Everything Before June 15

    The New Economics

    Three Claude changes landed together this week, and they compound. The metering shift is the load-bearing one:

    1. Programmatic metering: Claude subscriptions now convert to dollar-matched API credits for Agent SDK, claude -p, GitHub Actions, and third-party harnesses. The implicit 70-90% subsidy on alternative-harness usage is gone.
    2. June 15 third-party credit separation: Usage through Conductor, Zed, OpenCode, and T3 Code gets a separate credit bucket. No rollover, overflow at API rates. Weekly limits bumped 50% as a two-month softener.
    3. Capacity-driven degradation: Dario admitted planning for 10x and getting 80x. Rate limits doubled this week, peak-hours throttling removed for Pro/Max, Opus limits 'substantially raised' — all signs that the serving fleet was oversubscribed.
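
The first item can be sized with simple arithmetic: if a pipeline was effectively paying only a fraction of API rates, removing the subsidy multiplies its cost by 1 / (1 - subsidy). The sketch below assumes the 70-90% subsidy range quoted above. The function names and the $200 spend and credit figures are hypothetical placeholders, not Anthropic's billing logic.

```python
def cost_multiplier(subsidy: float) -> float:
    """Multiplier on prior effective spend once a subsidy is removed."""
    if not 0.0 <= subsidy < 1.0:
        raise ValueError("subsidy must be in [0, 1)")
    return 1.0 / (1.0 - subsidy)


def projected_overrun(prior_monthly_spend: float, subsidy: float,
                      monthly_credit: float) -> float:
    """Dollars by which usage priced at API rates exceeds the credit pool."""
    unsubsidized = prior_monthly_spend * cost_multiplier(subsidy)
    return max(0.0, unsubsidized - monthly_credit)


if __name__ == "__main__":
    # Hypothetical pipeline: $200/month plan, $200 of dollar-matched credit.
    for subsidy in (0.70, 0.90):
        mult = cost_multiplier(subsidy)
        over = projected_overrun(200.0, subsidy, 200.0)
        print(f"subsidy={subsidy:.0%} multiplier={mult:.1f}x overrun=${over:,.0f}")
```

A 70% subsidy implies roughly a 3.3x multiplier and a 90% subsidy implies 10x, so the same workload lands somewhere between those bounds depending on how heavily it leaned on the flat plan.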

    The Capacity Fix That Changes Your Baseline

    Anthropic is absorbing xAI's Colossus 1 cluster: 220,000+ GPUs spanning H100, H200, and GB200. That is roughly 45% of xAI's current capacity moving providers overnight. The integration will produce heterogeneous serving infrastructure, which means p95/p99 latency will get weirder before it stabilizes.

    Any Claude benchmark from before May 7 is stale. The serving conditions just shifted again, and they will shift once more when the Colossus integration completes.

    ServiceNow's CDIO burned through the full-year Claude budget by May. National Life Group's CIO called Claude 'great for consumer usage but not great for companies' wanting per-user monitoring. Anthropic provides no native per-user telemetry, no SLAs on latency or availability, and no budget alerts. Benchmark numbers cannot tell you whether spend is attributable to a team, and with Claude today it is not. The observability gap is structural, not a bug to be fixed next sprint.

    The Competitive Counter

    OpenAI dropped a 2-month-free Codex enterprise switch promo the same day Anthropic metered usage. Ramp's April data shows Anthropic 34.4% vs OpenAI 32.3% — the first apparent lead change in business adoption. OpenAI is explicitly pricing against the developers Anthropic just alienated.

    The asymmetry is the part that decides this. OpenAI's promo is a free evaluation window with zero downside risk. Anthropic's change is an immediate cost increase for any team that was leveraging the subsidy. A two-month free trial is cheap signal; not running it leaves the comparison unmeasured.

    Action items

    • Audit every Claude-backed workload (Agent SDK, GitHub Actions, batch evals, third-party IDEs) and project token burn against new credit caps by end of this sprint
    • Deploy an LLM gateway (LiteLLM/Portkey) with per-user, per-feature token tagging and daily budget alerts within 2 weeks
    • Run OpenAI's 2-month Codex promo through your own eval harness with matched prompts and tool schemas before the window closes
    • Re-baseline all Claude benchmarks (throughput, p95 latency, rate-limit headroom) after Colossus 1 integration stabilizes in ~4-6 weeks
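
For the gateway item, the core bookkeeping is small: tag every call with a user and a feature, price it, and compare running totals against a daily budget. The sketch below is an in-memory stand-in for that logic, not the LiteLLM or Portkey API. The `SpendLedger` class, its field names, and the per-million-token prices in the usage lines are all assumptions for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SpendLedger:
    """Accumulates cost per (user, feature) tag and flags budget breaches."""
    daily_budget_usd: dict[str, float]  # budget per feature tag
    spend: dict[tuple[str, str], float] = field(
        default_factory=lambda: defaultdict(float))

    def record(self, user: str, feature: str, input_tokens: int,
               output_tokens: int, usd_per_mtok_in: float,
               usd_per_mtok_out: float) -> float:
        """Price one call and add it to the (user, feature) running total."""
        cost = (input_tokens * usd_per_mtok_in
                + output_tokens * usd_per_mtok_out) / 1_000_000
        self.spend[(user, feature)] += cost
        return cost

    def feature_total(self, feature: str) -> float:
        return sum(c for (_, f), c in self.spend.items() if f == feature)

    def alerts(self) -> list[str]:
        """Features whose summed spend exceeds their daily budget."""
        return [f for f, budget in self.daily_budget_usd.items()
                if self.feature_total(f) > budget]


if __name__ == "__main__":
    ledger = SpendLedger(daily_budget_usd={"code-review": 1.0})
    # Placeholder prices: $5 per million input tokens, $25 per million output.
    ledger.record("alice", "code-review", 100_000, 10_000, 5.0, 25.0)
    ledger.record("bob", "code-review", 100_000, 10_000, 5.0, 25.0)
    print(ledger.alerts())
```

In production the ledger would live in the gateway's datastore and reset daily; the point of the sketch is that per-user attribution only requires tagging at the call site.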

    Sources: Claude just metered your agent SDK calls · Claude Code latency on long-context requests drifted upward · Anthropic ships no per-user usage telemetry · Anthropic passes OpenAI in B2B · Vercel published a number worth sitting with

  2. 59% Agentic: Your Eval Harness and Cost Model Are Both Wrong

    The Production Data

    Vercel's AI Gateway telemetry — 200,000 teams, 7 months — now puts agentic workloads at 59% of token volume. Six months ago that share was under 20%. The composition has shifted faster than anything since completion-to-chat, and most production stacks were built before the shift started.

    The spend-versus-volume gap is where the architecture leaks through:

    Provider | Share of Spend | Share of Volume | Primary Model | Implied Role
    Anthropic | 61% | not stated | Opus | Reasoning/planning nodes
    Google | not stated | 38% | Flash | High-throughput utility calls
    Others | ~39% | ~62% | Mixed | Mixed

    Read it as tiered routing already in the wild: expensive models for planning, cheap models for throughput. The code should reflect what the traffic is doing.
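
A minimal sketch of what tiered routing looks like in code, assuming each call in the agent graph carries a task type. The tier names, the `AgentCall` type, and the set of utility task types are hypothetical; map them to the models and node labels you actually run.

```python
from dataclasses import dataclass

# Hypothetical tier names; substitute real model identifiers.
PLANNING_TIER = "reasoning-large"
UTILITY_TIER = "fast-small"

# Task types treated as cheap, high-throughput utility work (an assumption).
UTILITY_TASKS = {"extraction", "summarization", "routing"}


@dataclass(frozen=True)
class AgentCall:
    node: str       # name of the node in the agent graph
    task_type: str  # e.g. "planning", "extraction"


def pick_tier(call: AgentCall) -> str:
    """Send utility work to the cheap tier, everything else to the planner tier."""
    return UTILITY_TIER if call.task_type in UTILITY_TASKS else PLANNING_TIER


if __name__ == "__main__":
    for call in (AgentCall("plan_step", "planning"),
                 AgentCall("parse_invoice", "extraction")):
        print(call.node, "->", pick_tier(call))
```

Defaulting unknown task types to the planning tier is the conservative choice: a misrouted call costs more but does not silently degrade quality.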


    Why Single-Turn Evals Are Now Dangerous

    Agentic traces are multi-turn, tool-calling, bursty, and dominated by long contexts with small output deltas. Cost models pinned to a 3:1 input-to-output ratio understate input volume by roughly 5x: real agentic traffic runs closer to 15:1, with aggressive cache reuse on some providers and none on others. A forecast that averages over that produces asymmetric errors by vendor, not a uniform miss.
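
To see how the ratio and cache behavior interact, here is a small cost model under stated assumptions. The prices, the 2,000-token output, the 80% cache hit rate, and the cached-token discount are placeholders, not any provider's published rates.

```python
def cost_per_task(output_tokens: int, input_ratio: float,
                  usd_per_mtok_in: float, usd_per_mtok_out: float,
                  cache_hit_rate: float = 0.0,
                  cached_price_factor: float = 1.0) -> float:
    """Dollar cost of one task given an input:output token ratio.

    cache_hit_rate is the fraction of input tokens served from cache;
    cached_price_factor is a cached token's price relative to a fresh one.
    """
    input_tokens = output_tokens * input_ratio
    fresh = input_tokens * (1.0 - cache_hit_rate)
    cached = input_tokens * cache_hit_rate
    input_cost = (fresh + cached * cached_price_factor) * usd_per_mtok_in
    output_cost = output_tokens * usd_per_mtok_out
    return (input_cost + output_cost) / 1_000_000


if __name__ == "__main__":
    # Placeholder prices: $3 per million input tokens, $15 per million output.
    single_shot = cost_per_task(2_000, 3, 3.0, 15.0)
    agentic = cost_per_task(2_000, 15, 3.0, 15.0)
    agentic_cached = cost_per_task(2_000, 15, 3.0, 15.0,
                                   cache_hit_rate=0.8, cached_price_factor=0.1)
    print(f"3:1 ${single_shot:.4f}  15:1 ${agentic:.4f}  "
          f"15:1 cached ${agentic_cached:.4f}")
```

With these placeholder numbers the 15:1 task costs 2.5x the 3:1 task without caching and about 1.15x with it, which is the vendor asymmetry in miniature: the same trace shape is a large miss on a provider without cache reuse and a small one on a provider with it.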

    If 59% of your tokens are agentic but 100% of your evals are single-turn, you are flying without instruments.

    The CyberGym result lines up with this. Microsoft's MDASH (100+ agent ensemble) beat Anthropic's Mythos by decomposing tasks into scan, adversarial debate, and PoC stages. The multi-agent topology won the leaderboard. What the result does not show is whether the lift came from the decomposition or the agent count, because no ablation has been published. At production inference budgets I expect about half the benchmark gain to survive, and would not commit to migration on more than that.

    MCP Overhead Is Real but Measurable

    Glean reports off-the-shelf MCP burning 30% more tokens and losing 2.5x head-to-head against a retrieval-tuned knowledge graph. The number is vendor-published with undisclosed methodology, so treat it as directional. The failure modes it points at are real: MCP tool listings inflate context, and naive tool outputs return verbose blobs where a reranked snippet would do. Measure on your own corpus before you price the gap.

    Action items

    • Add trajectory-level metrics to eval harness this sprint: tool-call precision/recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate
    • Instrument per-node token cost across your agent graph and route utility calls (extraction, summarization, routing) to Flash/Haiku-class models within 2 weeks
    • Run a 1-hour spike: replay 100 production agent traces under current MCP setup vs BM25+rerank baseline, measuring tokens and task-win-rate
    • Add provider-agnostic routing abstraction (LiteLLM, OpenRouter, or in-house) if not already present — the Gateway data shows zero vendor loyalty
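
The trajectory-level metrics in the first item can be computed from a fairly small per-run record. The sketch below is one way to do it; the `Trajectory` fields and the choice to treat tool calls as sets (ignoring order and repeats) are assumptions, and a real harness would need its own definition of a correct tool call.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Trajectory:
    expected_tools: set[str]  # tools a correct run should call
    called_tools: set[str]    # tools the agent actually called
    steps: int                # turns taken until the run ended
    cost_usd: float           # total spend for the run
    succeeded: bool           # task-level success label
    errors: int               # tool or step errors encountered
    recovered: int            # errors the agent recovered from


def summarize(runs: list[Trajectory]) -> dict[str, float]:
    """Micro-averaged trajectory metrics over a batch of runs."""
    hits = sum(len(r.expected_tools & r.called_tools) for r in runs)
    called = sum(len(r.called_tools) for r in runs)
    expected = sum(len(r.expected_tools) for r in runs)
    wins = [r for r in runs if r.succeeded]
    errors = sum(r.errors for r in runs)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "tool_precision": hits / called if called else 0.0,
        "tool_recall": hits / expected if expected else 0.0,
        "mean_steps_to_completion":
            sum(r.steps for r in wins) / len(wins) if wins else 0.0,
        # Failed runs still cost money, so all spend is charged to the wins.
        "cost_per_successful_task":
            total_cost / len(wins) if wins else float("inf"),
        "recovery_rate":
            sum(r.recovered for r in runs) / errors if errors else 1.0,
    }
```

Charging failed-run spend to successful tasks is deliberate: it is what makes retries and tool loops visible in the cost number instead of hiding them in an average.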

    Sources: Agentic traffic crossed fifty-nine percent · Vercel published a number worth sitting with · The CyberGym result · MCP plus knowledge graphs · AI Gateway data puts agentic workloads at fifty-nine percent

  3. Training Efficiency Breakthroughs: The Build-vs-Buy Math Just Shifted

    Where the in-house/API crossover moves

    Three results landed this week, each touching a different stage of the training pipeline. Together they shift the point at which proprietary post-training beats frontier API calls.

    Work | Claim | Scale Validated | Inference Impact | Replication Risk
    Nous Research TST | 2-3x wall-clock at matched FLOPs | 270M → 10B-A1B MoE | None; no architecture change | Medium; single-source, clean claim
    Datology VLM curation | +11.7 pts on 20 benchmarks at 17x less compute | 2B and 4B | Lower response FLOPs; real serving win | Medium; benchmark-selection risk
    NVIDIA Star Elastic | 360x cheaper model-size family; 7x better than SOTA compression | Not specified | Produces family of sizes from one post-training run | High; big headline, lab-reported

    TST is the one to replicate first

    Token Superposition Training is a pretraining recipe change with no inference-side cost. If it replicates at even 1.6x on a matched-FLOPs baseline, it pays for itself on the next full run. The mechanism, training on superimposed token representations, is validated from 270M to a 10B-A1B MoE, a range of roughly 37x in total parameters. That does not establish whether the speedup holds at the scale you actually train at, though the trend is consistent across the sizes tested. No architectural change at inference means the deployment path is clean.

    Datology: curation displaces compute

    The Datology result is the clearest evidence this year that the marginal dollar in VLM training has moved from compute to curation. At 2B parameters, their curated dataset beats InternVL3.5-2B by about 10 points on 20 benchmarks using 17x less training compute. The near-frontier 4B model reports 3.3x lower response FLOPs than Qwen3-VL-4B. That last number is a serving-cost win, not just a training one, and it is the one worth tracking because it recurs on every request.

    Star Elastic: model families on one post-training run

    NVIDIA reports that one post-training run produces a family of reasoning model sizes at 360x lower cost than pretraining each individually. Headline numbers from labs usually compress under independent eval. Given the numbers in the paper, I expect this to hold at closer to 30x once someone outside NVIDIA runs it. 30x still restructures how teams produce size tiers for edge-vs-cloud deployment.

    If TST replicates at even half the claimed speedup, and Datology's curation-over-compute thesis holds, the case for in-house post-training on proprietary data gets materially stronger against frontier API rates.

    This connects to Abridge's production architecture: 80M+ clinical conversations, a constellation of models, a dedicated post-training team. The implicit claim is that at sufficient data scale, vertical post-training beats frontier-model calls on cost and latency for narrow tasks. The threshold question is volume. At low volume, the frontier API wins because you avoid maintaining the pipeline. At Abridge's scale the math is straightforward. These three results lower the crossover point for everyone in between.
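
The crossover itself is a break-even calculation: in-house serving carries a fixed monthly cost (training, pipeline, team) but a lower marginal cost per request. The sketch below assumes a simple linear cost model, and every dollar figure in the usage lines is a hypothetical placeholder.

```python
import math


def breakeven_requests(fixed_monthly_usd: float, inhouse_usd_per_req: float,
                       api_usd_per_req: float) -> float:
    """Monthly request volume above which in-house serving is cheaper.

    Returns math.inf when the in-house marginal cost is not below the API's.
    """
    saving = api_usd_per_req - inhouse_usd_per_req
    if saving <= 0:
        return math.inf
    return fixed_monthly_usd / saving


if __name__ == "__main__":
    # Placeholder costs: $50K/month fixed, $0.002 in-house vs $0.012 API per request.
    before = breakeven_requests(50_000, 0.002, 0.012)
    # Halving the fixed cost (e.g. cheaper training) halves the crossover volume.
    after = breakeven_requests(25_000, 0.002, 0.012)
    print(f"break-even before: {before:,.0f} req/month, after: {after:,.0f}")
```

Cheaper training lowers the fixed term and lower response FLOPs lower the in-house marginal term, so both kinds of result push the break-even volume down.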

    Action items

    • Spike Token Superposition Training on a 1B-param continued-pretraining run against a matched-FLOPs baseline this quarter
    • If you have proprietary task data >100K examples, scope a post-training experiment (SFT or DPO) on a 7-13B base vs current frontier API call
    • Pull SWE-ZERO-12M-trajectories (112B tokens, 12M trajectories, 3K repos) now for SFT/RM training corpus before licensing frictions appear
    • Audit your Iceberg/Delta tables for stats coverage (ANALYZE/compute-stats); add stats maintenance to table-level SLAs

    Sources: Claude just metered your agent SDK calls · DuckDB shipped a client-server mode this week · Abridge runs model routing across 100M conversations

  4. Lakehouse and Agent Framework Security: New 9.9 CVEs Across the Data Stack

    The Pattern

    This week's advisory is not generic patch noise. It is a direct hit on the modern data and ML architecture. Sketch the reference stack (lakehouse → orchestrator → inference gateway → agent framework → CD pipeline) and every layer shipped a CVSS 9.0+ vulnerability this cycle.

    Priority Patching Order

    Component | CVE / CVSS | Impact on Your Stack | Status
    Apache Iceberg | CVE-2026-42812 / 9.9 | Metadata write redirect → poisoned Parquet → corrupted training data | Disclosed
    Apache Polaris | CVE-2026-42809/10/11 / 9.9 | Credential broadening → S3/GCS creds, cross-tenant access | Disclosed
    Argo CD 3.2.x/3.3.x | CVE-2026-42880 / 9.6 | Low-priv → plaintext K8s Secrets in all reachable namespaces | Disclosed
    PraisonAI | CVE-2026-44338 | Auth bypass → agent runtime, secrets, connected DBs | Active, <4hr from disclosure
    NGINX rewrite module | Unauthenticated RCE | Model-serving ingress → registry creds, cross-model lateral | 18yr latent, disclosure-fresh

    Why Iceberg/Polaris Is the Scariest

    CVE-2026-42812 lets an attacker with table-write permission point metadata at an attacker-controlled S3 prefix. The next query reads poisoned Parquet. The next training run ingests silently corrupted features. Default lakehouse observability tracks row changes, not metadata-pointer mutations, so it will not tell you that a pointer moved. Combined with Polaris credential-broadening, there is a plausible path from compromised analyst notebook to cross-tenant data theft to training data poisoning, with none of it tripping standard alerts.
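
A detection-side sketch of the missing check: compare each table's metadata location against an allowlist of bucket and prefix pairs and flag anything that moves outside it. This is not the Iceberg or Polaris API and not a fix for the CVE. The allowlist contents and function names are hypothetical, and how you read the current metadata location from your catalog is left out.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: (bucket, key prefix) pairs a table may point at.
ALLOWED_LOCATIONS = [("analytics-lake", "warehouse/sales/")]


def location_allowed(metadata_location: str,
                     allowed=ALLOWED_LOCATIONS) -> bool:
    """True if an s3:// or s3a:// URI sits under an allowlisted bucket/prefix."""
    parsed = urlparse(metadata_location)
    if parsed.scheme not in ("s3", "s3a"):
        return False
    key = parsed.path.lstrip("/")
    if ".." in key.split("/"):
        return False  # reject path-traversal segments
    return any(parsed.netloc == bucket and key.startswith(prefix)
               for bucket, prefix in allowed)


def suspicious_pointer_move(previous: str, current: str) -> bool:
    """Flag a metadata-pointer change that lands outside the allowlist."""
    return previous != current and not location_allowed(current)
```

Ordinary commits change the metadata pointer too, so alerting on every change would be noise; the useful signal is a change whose destination falls outside the locations the table is supposed to use.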

    The 18-year NGINX RCE is a useful reminder that 'battle-tested' and 'audited' are not the same word. Every model-serving gateway behind NGINX is in the blast radius.

    PraisonAI: The 4-Hour Window

    PraisonAI was weaponized within 4 hours of CVE disclosure. That is not a patching window. It is a containment window. Agent frameworks inherit the attack surface of every tool they wire together, plus a new surface from prompt-mediated control flow, and maintainers are not staffed like the security teams at the vendors whose APIs they wrap. LangChain, CrewAI, and AutoGen sit in the same bucket until measured otherwise.

    Update on LiteLLM (covered Tuesday): remains on CISA KEV as actively exploited. If you patched, verify key rotation completed. If you didn't, assume exfiltration.

    Action items

    • Audit Iceberg/Polaris catalog configurations this week: enforce explicit storage credential scoping and add write-path allowlisting for table metadata locations
    • Patch Argo CD to ≥3.2.12 / ≥3.3.10 and rotate every K8s Secret in reachable namespaces by end of week
    • Inventory all agent frameworks (PraisonAI, LangChain, CrewAI, AutoGen) in production or staging; pin versions and subscribe to CVE feeds
    • Patch NGINX across all inference gateways and audit rewrite-module usage in model routing configs

    Sources: LiteLLM landed in the KEV catalog this week · An Ollama endpoint exposed to the public internet · PraisonAI, an open-source multi-agent framework · Agent stacks are now in scope for attackers

◆ QUICK HITS

  • DuckDB shipped Quack HTTP client-server mode — eliminates the 'one process at a time' constraint; spike one Glue/EMR job under 100GB onto ECS Fargate + DuckDB pattern

    DuckDB shipped a client-server mode this week

  • Kafka Share Groups decouple consumer parallelism from partition count with ~8x linear scaling at 32 instances on I/O-bound workloads — stop over-partitioning for scale

    DuckDB shipped a client-server mode this week

  • DeepSeek V4 Pro scored 77/100 on FlowGraph at $2.25/task, sitting between Opus and Kimi K2.6 — benchmark against your production model on cost-per-successful-task

    The CyberGym result

  • Mythos is the first model to clear both UK AISI simulated attack ranges; AISI already building harder tests because current ladder saturates

    Mythos cleared the AISI attack ranges this week

  • Mozilla's custom harness found 271 Firefox bugs with Claude Mythos; curl's out-of-box scan found 1 CVE — same model, 271x yield difference from harness engineering alone

    Mozilla shipped 271 bugs over the period in question

  • Duolingo publicly pegs AI-generated content slop at ~20% requiring human QC — rare production quality anchor; benchmark your acceptance rate against it

    Duolingo's twenty percent AI slop rate

  • Only 15% of orgs have data foundations for agentic AI; data quality/lineage cited by ~50% as the #1 blocker — scope agent projects as data-platform projects with an agent on top

    DuckDB shipped a client-server mode this week

  • TML-Interaction-Small reports 0.40s turn-taking latency vs 0.57s Gemini-flash-live and 1.18s GPT-Realtime — a 3x gap on the metric that determines voice-agent naturalness

    TML is reporting 0.40 seconds of full-duplex latency

  • PCAOB/COSO now require deterministic execution and tamper-evident audit trails for ML in regulated finance — stochastic decoding is non-compliant by design

    The transformer underwriting models are outperforming

  • LLM-as-a-Verifier (decomposed binary verifications with token-level scoring) beats LLM-as-a-Judge on tie-rate and decision accuracy — swap one eval pipeline as a day-long spike

    An Ollama endpoint exposed to the public internet

  • Update: Exposed AI endpoints (Ollama, LangServe, MCP) scanned within 3 hours; 23% of honeypot traffic targets AI-specific paths including /.well-known/mcp.json

    An Ollama endpoint exposed to the public internet

  • Persona drift in LLM agents measurable within 8 conversational turns (Li et al. COLM 2024) — add a verbal-tic canary to system prompts as a free drift detector

    AI personas drift within eight turns

◆ Bottom line

The take.

Anthropic metered the developer discount, Vercel confirmed 59% of production tokens are agentic, and the data stack shipped five CVSS 9.0+ CVEs in a single cycle. If you haven't deployed a multi-provider gateway with per-workload cost attribution, rebuilt your eval harness around trajectory-level metrics, and patched your Iceberg catalog this week, you're absorbing pricing decisions by default, measuring the minority of your traffic, and leaving your training data exposed to silent poisoning.

— Promit, reading as Data Science ·

Frequently asked

How do I quickly estimate the cost overrun on existing Claude agent pipelines?
Multiply your prior flat-subscription spend by roughly 3-10x to approximate the unsubsidized API-credit cost. A 70-90% implicit subsidy on Agent SDK, GitHub Actions, and third-party harness usage means you were paying 10-30% of API rates, and 1/0.3 ≈ 3.3 while 1/0.1 = 10. Then re-project against the dollar-matched credit cap and the June 15 third-party split (no rollover). Any pipeline whose monthly trace volume × per-trace token count, priced at API rates, exceeds the credit pool is in overrun starting now.
What trajectory-level metrics should replace single-turn pass@1 in an agent eval harness?
Track tool-call precision and recall, steps-to-completion, cost-per-successful-task, recovery-from-error rate, and per-node token cost across the agent graph. These capture the multi-turn, tool-calling, long-context behavior that dominates the 59% of production tokens now flowing through agentic traces. Pass@1 measures the minority of traffic and hides 5-15x compounding from retries and tool loops.
Where does the in-house post-training versus frontier-API crossover sit after this week's results?
It moved lower. Datology shows VLM curation beating a 2B baseline by ~10 points at 17x less compute, TST claims 2-3x wall-clock at matched FLOPs with no inference-side change, and NVIDIA's Star Elastic produces a model-size family from one post-training run. For teams with >100K proprietary task examples, SFT or DPO on a 7-13B base is now plausibly cheaper than frontier API calls at production volume.
Why is the Iceberg CVE more dangerous than typical lakehouse patch noise?
CVE-2026-42812 (CVSS 9.9) lets an attacker with table-write permission redirect table metadata to an attacker-controlled S3 prefix, so subsequent queries read poisoned Parquet and training runs silently ingest corrupted features. Default lakehouse observability tracks row changes, not metadata-pointer mutations, so the poisoning does not trip standard alerts. Combined with Apache Polaris credential-broadening, it opens a path from analyst notebook to cross-tenant theft and training-data poisoning.
Is OpenAI's two-month free Codex enterprise promo worth the integration effort to evaluate?
Yes, because it is a zero-cost asymmetric option. Run it through your own eval harness with matched prompts and tool schemas, scoring on trajectory-level metrics rather than pass@1. Even if you stay on Claude, the comparison gives you negotiating leverage and a measured fallback for the next capacity or pricing shock; skipping it leaves the comparison unmeasured at exactly the moment Anthropic's serving conditions are non-stationary due to the Colossus 1 integration.
