Data Science daily

Edition 2026-05-01 · read as Data Science

Cost-Per-Correct-Answer: The Eval Metric Finance Will Force

Sources: 40 · Words: 1,420 · Read: 7 min

Topics: Agentic AI · AI Regulation · LLM Inference

◆ The signal

The production question is tokens per correct answer, and accuracy-only evals don't measure it: at comparable quality, Granite 4.1 8B used 19.5× fewer tokens than Qwen3.5 9B, and on Factory AI's 13-model bakeoff a $1.25/PR model held up against ones costing 2×+. The Pragmatic Engineer's survey of 15 companies puts AI coding spend at $500/day per developer, up 10–15× in six months. Teams that aren't tracking cost-per-correct-answer tend to learn about it from finance.

◆ INTELLIGENCE MAP

  1. 01

    Token Efficiency Rewrites Model Selection Math

    act now

    Granite 4.1 8B spent 4M output tokens vs. 78M for Qwen3.5 9B, a 19.5× gap at similar capability. Factory AI found a $1.25/PR model beat a $3+/PR model on real code review. Meanwhile, token spend is hitting $500/day per developer in a survey of 15 companies. Accuracy-only evals are now actively misleading for production selection.

    19.5× token efficiency gap · 7 sources
    Metrics: Granite vs Qwen tokens · Dev daily token burn · Token spend growth · Opus→Sonnet savings
    Output tokens (millions): Granite 4.1 8B 4 · Qwen3.5 9B 78
  2. 02

    ML Serving Stack Exploited in Hours, Not Days

    act now

    LiteLLM's pre-auth SQLi (CVE-2026-42208) was exploited in <36h. LMDeploy's SSRF was weaponized in 12.5h with no public PoC. Three ML frameworks shipped pickle-deserialization RCEs at CVSS 9.8 in the same week. Attackers are feeding CVE advisories into LLMs to generate working exploits, compressing patch windows from days to hours.

    12.5h disclosure-to-exploit · 6 sources
    Metrics: LiteLLM exploit time · LMDeploy exploit time · Pickle RCEs this week · Peak CVSS score
    Hours: LMDeploy SSRF 12.5 · LiteLLM SQLi 36 · Typical patch SLA 720
  3. 03

    Multi-Model Routing Is Now the Reference Architecture

    monitor

    Microsoft confirmed Copilot runs on both OpenAI and Anthropic inside Word/Excel/Outlook — 20M seats, $9.25B quarterly AI revenue. Anthropic's annualized revenue jumped to ~$40B. Google is shipping TPUs externally for the first time. Combined hyperscaler Q1 capex hit $112B. Single-provider stacks are officially behind the reference architecture.

    $112B Q1 hyperscaler capex · 5 sources
    Metrics: Copilot paid seats · M365 penetration · Google Cloud growth · Google Cloud backlog
    Google Cloud 63 · Azure 40 · AWS 28
  4. 04

    Agent Governance: 93% Auto-Approval Meets Legal Liability

    monitor

    Anthropic reports 93% of agent actions are auto-approved, which makes human-in-the-loop a rubber stamp rather than a safety control. An N.D. Cal. court ruled that when AI exercises 'ultimate authority' over output, the platform is liable under Rule 10b-5. Meanwhile MOAK exploited 174/178 post-cutoff KEVs using off-the-shelf models. Authority level is now a legal design parameter.

    93% auto-approval rate · 4 sources
    Metrics: Auto-approved actions · KEVs exploited by agent · GPT-5.5 miss rate drop
  5. 05

    Chinese Open-Weight Models Close the Gap at 1/10th the Price

    background

    MiniMax M2.7 hit 56.2% on SWE-Pro and 57.0% on Terminal Bench 2 as open source. Kimi-K2.6 shipped at 1T params open-weight. Qwen 3.5 Plus serves at $3/M tokens. Simultaneously, Anthropic logged 16M extraction queries across 24K fraudulent accounts, and Congress is probing Airbnb and Cursor's parent for using Chinese-origin models. Capability is converging; the compliance surface is widening.

    56.2% M2.7 SWE-Pro score · 4 sources
    Metrics: Qwen 3.5 Plus price · MiMo-V2.5 Pro price · Extraction queries · Fraudulent accounts
    MiniMax M2.7 56.2 · Kimi-K2.6 (1T MoE) 57 · Qwen 3.5 Plus 3

◆ DEEP DIVES

  1. 01

    The Harness Quarter: Token Efficiency, Eval Economics, and the $500/Day Developer

    The New Axis Your Eval Harness Is Missing

    Accuracy-only model evaluation is now actively misleading for production decisions, and this week's evidence is specific enough to act on. The invoice is where model choice gets decided this quarter, not the leaderboard.

    IBM's Granite 4.1 8B spent 4M output tokens on the Artificial Analysis Intelligence Index against 78M for Qwen3.5 9B, a 19.5× efficiency gap at comparable capability. That gap does not appear on any accuracy leaderboard. It appears on the invoice. Factory AI's 13-model code-review bakeoff found a $1.25/PR model beat one costing more than 2× as much on real pull requests, with token spend uncorrelated to review quality on their harness. Harness engineering (iterating on prompts, tools, and middleware with the model held constant) lifted Terminal-Bench 2 pass@1 from 69.7% to 77.0% in 10 iterations, beating the human-designed Codex-CLI baseline at 71.9%.

    The harness, not the model, is where this quarter's agent economics live.

    The Cost Crisis Is Already Here

    Across 15 companies surveyed by The Pragmatic Engineer, AI coding token spend has grown 10–15× in six months. Individual developer burn is hitting $500/day, and one developer burned $10K in a week because of a caching bug. A 2,000-person SaaS cut costs 30% by changing the default model from Opus to Sonnet, which is a config change rather than an architecture change.

    On the eval side, evaluation compute has quietly crossed from overhead to bottleneck. Individual eval runs now cost tens of thousands of dollars. DeepMind's ProEval pattern, surrogate models scoring checkpoints cheaply with full evals reserved for release candidates, is the direction of travel. Netflix published the most portable eval blueprint of the quarter: ~600 expert-labeled golden examples with tiered reasoning and consensus scoring, landing at 83–92% accuracy across four quality dimensions.
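
    A minimal sketch of consensus scoring against a golden set, assuming each example carries one expert label and several judge verdicts. The field names (`dimension`, `expert_label`, `judge_verdicts`) and the dimension names are hypothetical, and the majority-vote rule is an illustration, not Netflix's published method.

```python
from collections import Counter
from typing import Optional


def consensus(verdicts: list[str]) -> Optional[str]:
    """Majority label among judge verdicts; None when there is no strict majority."""
    if not verdicts:
        return None
    label, count = Counter(verdicts).most_common(1)[0]
    return label if count > len(verdicts) / 2 else None


def agreement_by_dimension(examples: list[dict]) -> dict[str, float]:
    """Share of golden examples where the judges' consensus matches the expert label."""
    hits: Counter = Counter()
    totals: Counter = Counter()
    for ex in examples:
        totals[ex["dimension"]] += 1
        if consensus(ex["judge_verdicts"]) == ex["expert_label"]:
            hits[ex["dimension"]] += 1
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Made-up examples for illustration only (hypothetical schema and dimensions).
golden = [
    {"dimension": "factuality", "expert_label": "pass", "judge_verdicts": ["pass", "pass", "fail"]},
    {"dimension": "factuality", "expert_label": "fail", "judge_verdicts": ["pass", "fail", "pass"]},
    {"dimension": "tone", "expert_label": "pass", "judge_verdicts": ["pass", "fail", "pass"]},
]
scores = agreement_by_dimension(golden)
```

    Reporting agreement per dimension rather than as a single number keeps a weak dimension from hiding behind a strong one.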

    The awkward finding: a community benchmark showed the two-word prompt "be brief" matched a purpose-built compression plugin on both token reduction and output quality. Trivial baselines embarrass sophisticated prompt frameworks more often than practitioners admit. The thing this doesn't tell you is whether the plugin still wins on tasks the benchmark didn't cover, which is worth checking before ripping it out.

    Optimization Lever                   | Reported Impact                  | Effort
    Opus→Sonnet default swap             | 30% cost cut                     | Config change
    Harness engineering (10 iters)       | +7.3pp accuracy, no model change | 1-2 sprints
    Token-efficient model swap (Granite) | 19.5× fewer tokens               | Eval + migration
    Surrogate eval (ProEval pattern)     | 50%+ eval compute savings        | Quarter-long build
    "be brief" vs. plugin                | Parity on token reduction        | Two words

    What This Changes

    The binding constraint has shifted. Eval harnesses that report accuracy without token cost are no longer decision-grade instruments. Every model candidate should produce (accuracy, tokens-consumed) pairs plotted on a Pareto frontier. The Granite vs. Qwen gap only surfaces when token counts are logged per task. The Factory AI finding only surfaces when cost is a first-class eval dimension.
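
    One way the (accuracy, tokens-consumed) comparison could be sketched, assuming the harness already logs mean output tokens per task. The `EvalResult` type, the model names, and every number below are hypothetical; the price field is a placeholder, not any provider's actual rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    model: str
    accuracy: float         # fraction of tasks passed, in [0, 1]
    tokens_per_task: float  # mean output tokens per task
    usd_per_mtok: float     # hypothetical price per million output tokens


def usd_per_correct(r: EvalResult) -> float:
    """Mean cost per task divided by pass rate."""
    cost_per_task = r.tokens_per_task * r.usd_per_mtok / 1_000_000
    return float("inf") if r.accuracy == 0 else cost_per_task / r.accuracy


def pareto_frontier(results: list[EvalResult]) -> list[EvalResult]:
    """Keep models that no other model beats on both accuracy and token use."""
    frontier = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.tokens_per_task <= r.tokens_per_task
            and (o.accuracy > r.accuracy or o.tokens_per_task < r.tokens_per_task)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.tokens_per_task)


# Made-up numbers for illustration only.
results = [
    EvalResult("model-a", 0.80, 2_000, 1.0),
    EvalResult("model-b", 0.81, 40_000, 1.0),
    EvalResult("model-c", 0.70, 5_000, 1.0),
]
frontier = pareto_frontier(results)
```

    In this toy data, model-c drops out because model-a is both more accurate and cheaper. Whether model-b's extra point of accuracy justifies 20× the tokens is the decision an accuracy-only report never surfaces.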

    Microsoft's cloud margin dropped 5 percentage points to 56%, attributed explicitly to inference-heavy AI workloads including GitHub Copilot. Consumption pricing makes token-per-request a product metric rather than an infra footnote. Teams without per-request token accounting will learn their cost structure through a Finance escalation, not a dashboard.

    Action items

    • Add $/correct-answer and tokens-per-task as first-class metrics in your model eval harness alongside accuracy
    • Run one iteration of agentic harness engineering on your top agent benchmark, holding the model constant
    • Build a 600-example golden eval set with tiered + consensus LLM-as-Judge scoring for your highest-stakes generative output
    • Implement surrogate-model eval (ProEval pattern) for continuous checkpoint scoring, reserving full benchmark runs for release candidates

    Sources: AINews · The Pragmatic Engineer · TLDR AI · TLDR Dev · ben's bites · Aaron Holmes

  2. 02

    Your ML Serving Stack Is Being Exploited Faster Than You Can Patch

    Disclosure-to-Exploit Has Collapsed to Hours

    One seven-day window is a small sample, but the signal is hard to ignore. A pre-auth SQL injection (CVE-2026-42208) in LiteLLM, the gateway that fronts most teams' OpenAI, Anthropic, and Bedrock keys, was exploited within 36 hours of disclosure. LMDeploy's SSRF went from advisory to working exploit in 12.5 hours, and no public PoC was available. The plausible mechanism is attackers feeding CVE advisories into LLMs and getting usable exploits back.

    Patch-window SLAs were calibrated for a world where writing the exploit was the slow step. On this data, it is not the slow step anymore.

    Pickle Deserialization Is the 2026 Pattern-of-the-Year

    Three independent ML frameworks shipped the same bug class this week, unsafe deserialization of attacker-controlled artifacts, each at CVSS 9.8:

    Framework            | CVE            | CVSS | Attack Surface
    LeRobot ≤0.5.1       | CVE-2026-25874 | 9.8  | HF's robotics stack; checkpoints shared broadly
    KTransformers ≤0.5.3 | CVE-2026-26210 | 9.8  | Local inference on GPU hosts with credentials
    Pipecat              | CVE-2025-62373 | 9.8  | Voice/multimodal pipelines, network-facing

    On top of that: Claude Code's CVSS 10.0 sandbox escape via symlinks (CVE-2026-39861, patched in v2.1.64), NVFlare Dashboard's auth bypass (CVE-2026-24178, CVSS 9.8) in federated learning deployments, and 73 GlassWorm-linked fake extensions on Open VSX in April alone. That is the marketplace behind Cursor, VSCodium, and most VS Code forks.


    The Supply Chain Cascade Is the Structural Threat

    The TeamPCP/UNC6780 chain is worth tracing end to end. They poisoned checkmarx/kics:latest on Docker Hub. Bitwarden's Dependabot automation pulled it into CI. The malicious code shipped as @bitwarden/cli 2026.4.0. That is the exact automation topology most modern ML teams run: tag-based image pulls, bot-driven dependency updates, CI runners holding model registry credentials.

    In parallel, DPRK's HexagonalRodent is targeting ML and Web3 engineers through fake-recruiter coding assessments that abuse VSCode tasks.json auto-run. The scorecard so far: $12M exfiltrated across 2,726 developer machines in Q1 2026.

    LiteLLM deserves separate attention. Running it as a cost and routing gateway funnels every upstream provider key into a single database that sits behind a pre-auth endpoint. The working assumption should be that any keys transiting the service during the vulnerability window are burn-worthy. Rotate first, investigate second.


    Two Lever Fixes Cover Most of the Surface

    Two policy changes remove most of this week's blast radius: ban tag-based image pulls, and ban pickle loads from untrusted sources in CI. Both are enforceable with a linter and a registry policy, and between them they would have blocked the majority of the ML-relevant exposure listed above. What they do not cover is the pre-auth SQL injection and SSRF class. That one is still a patch-speed problem.
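
    A rough sketch of the image-pinning half of that policy, assuming a CI step that scans Dockerfile text before build. The function name is hypothetical and the regex covers only plain `FROM` lines, so images supplied through `ARG` substitution would be flagged rather than resolved.

```python
import re

# Matches "FROM [--flag=value ...] image [AS alias]".
FROM_RE = re.compile(
    r"^\s*FROM\s+(?:--\S+\s+)*(?P<image>\S+)(?:\s+AS\s+(?P<alias>\S+))?",
    re.IGNORECASE,
)
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_images(dockerfile_text: str) -> list[str]:
    """Return base images that are not pinned by sha256 digest."""
    stage_names: set[str] = set()
    offenders: list[str] = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if match is None:
            continue
        image, alias = match.group("image"), match.group("alias")
        # "scratch" and earlier build stages are not registry pulls.
        is_local = image.lower() == "scratch" or image.lower() in stage_names
        if not is_local and DIGEST_RE.search(image) is None:
            offenders.append(image)
        if alias:
            stage_names.add(alias.lower())
    return offenders


# Illustrative Dockerfile; the digest is a placeholder, not a real image.
sample = "\n".join([
    "FROM checkmarx/kics:latest AS scanner",
    "FROM scanner AS test",
    "FROM example.com/ml/base@sha256:" + "a" * 64,
    "FROM scratch",
])
offenders = unpinned_images(sample)
```

    Failing the build when the returned list is non-empty turns the policy into a gate instead of a guideline.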

    Action items

    • Rotate all provider keys (OpenAI, Anthropic, Bedrock) that transited LiteLLM and pull 30-day usage anomaly reports from each provider
    • Enforce torch.load(weights_only=True) or safetensors as a lint rule and block pickle artifacts from untrusted sources at the artifact-store layer
    • Pin all ML base images by sha256 digest, disable Dependabot auto-merge for container and pip dependencies, and set a <24h patch SLA for ML serving CVEs with technical detail
    • Disable VSCode 'folderOpen' auto-tasks org-wide and require external repos to run in disposable devcontainers with egress allowlisting
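
    The torch.load lint rule above could be approximated with Python's standard `ast` module. This is a sketch under simplifying assumptions: it only sees the literal spellings `torch.load`, `pickle.load`, and `pickle.loads`, so aliased imports such as `import torch as t` would slip through. The function name is hypothetical.

```python
import ast


def unsafe_load_calls(source: str) -> list[tuple[int, str]]:
    """Flag torch.load(...) without weights_only=True, plus any pickle.load(s) call."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        owner = node.func.value
        if not isinstance(owner, ast.Name):
            continue
        name = f"{owner.id}.{node.func.attr}"
        if name == "torch.load":
            safe = any(
                kw.arg == "weights_only"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
                for kw in node.keywords
            )
            if not safe:
                findings.append((node.lineno, name))
        elif name in {"pickle.load", "pickle.loads"}:
            findings.append((node.lineno, name))
    return sorted(findings)


# The sample is only parsed, never executed, so torch need not be installed.
sample = (
    "import torch, pickle\n"
    "a = torch.load('m.pt')\n"
    "b = torch.load('m.pt', weights_only=True)\n"
    "c = pickle.load(open('x.pkl', 'rb'))\n"
)
findings = unsafe_load_calls(sample)
```

    Each finding is a (line number, call name) pair, which maps directly onto a CI annotation.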

    Sources: SANS AtRisk · TLDR InfoSec · CSO Update · CSO First Look · Daniel Miessler · TLDR IT

  3. 03

    Agent Governance Has Legal Teeth Now — And Your Metrics Aren't Measuring the Right Thing

    The 93% Problem

    Anthropic disclosed that 93% of AI agent actions are auto-approved. At that rate, the approval signal is not a safety control. It is a gauge that happens to have a human attached, and it carries near-zero mutual information with risk. An agent that only ever takes the same three actions in risky contexts is not safe. It is under-explored, and this metric cannot tell the difference.

    Anthropic's own recommendation is to move from per-action approval to continuous policy monitoring. The metrics that carry actual signal are override rate on high-risk actions, policy-violation count per 1,000 traces, and action entropy per agent. A new agent version that pushes auto-approval from 93% to 96% is not an improvement. It is the same broken gauge reading slightly higher.
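
    The three metrics named above can be sketched over a list of trace records for a single agent. The trace schema (`action`, `risk`, `reviewed`, `overridden`, `policy_violation`) is an assumption made for illustration, not a format Anthropic has published.

```python
import math
from collections import Counter


def override_rate(traces: list[dict], risk: str = "high") -> float:
    """Share of human-reviewed actions at the given risk level that were overridden."""
    reviewed = [t for t in traces if t["risk"] == risk and t["reviewed"]]
    if not reviewed:
        return 0.0
    return sum(t["overridden"] for t in reviewed) / len(reviewed)


def violations_per_1000(traces: list[dict]) -> float:
    """Policy violations normalized to a count per 1,000 traces."""
    if not traces:
        return 0.0
    return 1000 * sum(t["policy_violation"] for t in traces) / len(traces)


def action_entropy(traces: list[dict]) -> float:
    """Shannon entropy, in bits, of the agent's action distribution."""
    counts = Counter(t["action"] for t in traces)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Made-up traces for one agent (hypothetical schema).
traces = [
    {"action": "read", "risk": "low", "reviewed": False, "overridden": False, "policy_violation": False},
    {"action": "read", "risk": "low", "reviewed": False, "overridden": False, "policy_violation": False},
    {"action": "write", "risk": "high", "reviewed": True, "overridden": True, "policy_violation": True},
    {"action": "deploy", "risk": "high", "reviewed": True, "overridden": False, "policy_violation": False},
]
```

    Low entropy combined with a high approval rate is the under-explored pattern described above: the agent looks safe only because it rarely tries anything else.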

    An agent eval harness without override rate, cost-per-task, and tool-boundary violations is measuring demos, not agents.

    Courts Just Made Autonomy a Design Parameter

    A Northern District of California court ruled under Rule 10b-5 that when a platform's AI exercises 'ultimate authority' over assembled content, the platform is a 'maker of fraudulent statements'. This is the first U.S. ruling to reject the premise that autonomous AI output is legally equivalent to user-generated content. Section 230-style shields do not automatically extend to agent decisions.

    Autonomy Level    | Control Flow                        | Liability Posture           | Engineering Signal to Log
    Tool-assist       | Human → AI suggests → human commits | Shield likely holds         | suggestion_id, accept/reject
    Human-gated agent | AI drafts → human approves → commit | Shield likely holds         | approval_latency, reviewer_id, diff
    Autonomous agent  | AI assembles and commits unreviewed | Platform exposed as 'maker' | Full chain, model version, guardrails

    Most production agentic stacks today ship in the third row and log like the first. The thing this gap doesn't tell you, until discovery does, is whether a human actually closed the loop. Observability stacks capture latency, token spend, and output quality. They rarely capture a first-class authority_level field attesting to human approval.
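
    A minimal sketch of what a first-class authority_level field might look like in a decision record. The three level names come from the action items in this section; the `DecisionRecord` type, its other fields, and the validation rules are hypothetical design choices.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class AuthorityLevel(str, Enum):
    TOOL_ASSIST = "tool_assist"
    HUMAN_GATED = "human_gated"
    AUTONOMOUS = "autonomous"


@dataclass(frozen=True)
class DecisionRecord:
    decision_id: str
    model_version: str
    authority_level: AuthorityLevel
    human_reviewer_id: Optional[str] = None

    def __post_init__(self) -> None:
        autonomous = self.authority_level is AuthorityLevel.AUTONOMOUS
        # A record that claims human involvement must name the human.
        if not autonomous and not self.human_reviewer_id:
            raise ValueError("non-autonomous decisions require human_reviewer_id")
        # An autonomous record must not imply a review that never happened.
        if autonomous and self.human_reviewer_id:
            raise ValueError("autonomous decisions must not carry a reviewer id")

    def to_json(self) -> str:
        payload = asdict(self)
        payload["authority_level"] = self.authority_level.value
        return json.dumps(payload, sort_keys=True)


record = DecisionRecord("d-1", "model-x", AuthorityLevel.HUMAN_GATED, "rev-42")
```

    Rejecting inconsistent records at write time means the log cannot silently claim a human approval that did not happen.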


    Offensive Capability Is Outpacing Defensive Measurement

    MOAK's agent built working exploits for 174 of 178 KEVs published after model training cutoffs, using publicly available Opus 4.6 and GPT-5.4 behind ordinary scaffolding. XBOW reports GPT-5.5 cut its miss rate from 40% to 10%, with black-box performance now exceeding what GPT-5 managed with source access. Persistence-on-failing-paths halved. That is a planning improvement, and planning improvements transfer to any long-horizon agent task.

    HackerOne paused its Internet Bug Bounty because AI-driven submission volume is outpacing remediation bandwidth. The same queue-overflow dynamic will replay inside every org: PR review, model-card approvals, security alerts. Any human-review queue in the ML stack is one capability jump from the same asymmetry.

    Action items

    • Add authority_level (tool_assist | human_gated | autonomous) and human_reviewer_id to every agentic decision record, persisted for artifact lifetime
    • Replace per-action approval metrics with a continuous-policy dashboard tracking override rate, policy-violation rate, and action entropy per agent
    • Gate destructive operations (DB writes, deploys, credential access) behind dry-run + human approval regardless of model confidence
    • Run FinBot CTF (OWASP) against your agent stack as a pre-production red-team gate, specifically testing MCP tool-description tampering and cross-tenant context bleed

    Sources: TLDR IT · Future Perfect · Clint Gibler · CyberScoop

◆ QUICK HITS

  • Pinterest abandoned click-proxy retrieval for a two-tower DCN v2 with advertiser-level conversion loss — a template for any ranking stack still optimizing engagement as a conversion proxy

    Pinterest's two-tower shift + the Linux 7.0 Postgres trap your DBs may be hitting

  • Linux 7.0 scheduling change halved PostgreSQL throughput via spinlock contention on page faults — fix is enabling huge pages (vm.nr_hugepages); audit any Postgres host on kernel ≥7.0

    Pinterest's two-tower shift + the Linux 7.0 Postgres trap your DBs may be hitting

  • Shopify's Flow agent: fine-tuned small OSS model beat frontier APIs on accuracy, latency, and cost simultaneously for NL-to-workflow — make distillation-then-fine-tuning the default for bounded-schema agent tasks

    Pinterest's two-tower shift + the Linux 7.0 Postgres trap your DBs may be hitting

  • Blockify claims 40× RAG corpus reduction via IdeaBlocks — open-source, LangChain/LlamaIndex compatible; run a one-day spike against your current chunker before re-index commitment

    Blockify claims a 40x reduction in RAG corpus size via IdeaBlocks

  • MCP confirmed as de facto agent-to-data standard: Google's Deep Research Max on Gemini 3.1 Pro now uses MCP alongside Anthropic's push — stand up MCP endpoints for feature store and warehouse before Q3

    Three new open-weight models landed this week

  • Deezer reports 44% of daily song uploads are AI-generated — any content-scraped dataset for fine-tuning on post-2024 data has material synthetic contamination; add provenance features to content classifiers

    Three new open-weight models landed this week

  • Update: Industrial-scale model extraction confirmed — Anthropic logged 16M queries across 24K fraudulent accounts (~667 queries each, engineered sub-threshold); per-account rate limits are insufficient; build cross-account behavioral clustering

    Distillation attacks hit sixteen million Claude queries

  • Mistral Medium 3.5 ships at 128B dense, 256k context, self-hostable with adjustable reasoning effort — benchmark against your current self-hosted baseline; if 256k holds, chunked-RAG pipelines may collapse

    Two empirical results landed in the same week

  • Factual knowledge scales log-linearly with parameter count (R²=0.917 across 1,400 questions / 188 models / 135M–1.6T params) — reasoning compresses, facts do not; route factual queries differently from reasoning queries if distilling

    AINews

  • Voice AI absorbed $7B+ in Q1 2026; Abridge won HonorHealth (200 centers, 17K staff) with proprietary domain models + self-hosted records, not frontier API wrappers — vertical fine-tunes + tenant isolation beat horizontal frontier

    Voice AI revenue crossed $7B/qtr

◆ Bottom line

The take.

Token efficiency just exposed a 19.5× gap between models of comparable capability, ML serving infrastructure is being exploited within 12.5 hours of disclosure, and a federal court just ruled your autonomous agent makes you the legal 'maker' of its output. Add $/correct-answer to every eval, rotate your LiteLLM keys today, and log authority_level on every agent decision before opposing counsel does it for you.

— Promit, reading as Data Science ·

Frequently asked

How do I add cost-per-correct-answer to an existing eval harness?
Log token counts per task alongside accuracy, then plot one (accuracy, tokens-consumed) point per model and read off the Pareto frontier. Derive $/correct-answer by multiplying mean tokens per task by provider pricing and dividing by pass rate. Without this, gaps like Granite 4.1 8B using 19.5× fewer tokens than Qwen3.5 9B at comparable quality stay invisible until Finance flags the invoice.
Why isn't accuracy alone enough to pick a production model anymore?
Because token consumption at comparable accuracy now varies by more than an order of magnitude between models, and consumption pricing makes that variance the dominant cost driver. Factory AI's 13-model bakeoff showed a $1.25/PR model holding up against ones costing 2×+, with token spend uncorrelated to review quality. Accuracy-only leaderboards hide that entire axis.
What's the fastest lever to cut AI coding spend without changing architecture?
Swap the default model for routine work: one 2,000-person SaaS cut costs 30% by moving the default from Opus to Sonnet, a config change. Pair that with harness engineering on top agents, which lifted Terminal-Bench 2 pass@1 from 69.7% to 77.0% over 10 iterations with the model held constant.
How many examples does a credible LLM-as-Judge eval set need?
Netflix's published blueprint uses ~600 expert-labeled golden examples with tiered reasoning and consensus scoring across multiple judge models, reaching 83–92% accuracy across four quality dimensions. Treat that as the empirical floor; smaller sets or single-judge setups generally aren't decision-grade for high-stakes generative output.
What's the ProEval pattern and when is it worth building?
ProEval uses cheap surrogate models to score checkpoints continuously, reserving full benchmark runs for release candidates. It's worth building once individual eval runs cost tens of thousands of dollars or eval compute starts rivaling training compute — a quarter-long investment that typically saves 50%+ of eval spend.
