How do I add cost-per-correct-answer to an existing eval harness?

Log token counts per task alongside accuracy, then plot (accuracy, tokens-consumed) pairs on a Pareto frontier per model. Derive $/correct-answer by multiplying tokens by provider pricing and dividing by pass rate. Without this, gaps like Granite 4.1 8B using 19.5× fewer tokens than Qwen3.5 9B at comparable quality stay invisible until Finance flags the invoice.

Why isn't accuracy alone enough to pick a production model anymore?

Because token consumption at comparable accuracy now varies by more than an order of magnitude between models, and consumption pricing makes that variance the dominant cost driver. Factory AI's 13-model bakeoff showed a $1.25/PR model holding up against ones costing 2×+, with token spend uncorrelated to review quality. Accuracy-only leaderboards hide that entire axis.

What's the fastest lever to cut AI coding spend without changing architecture?

Swap the default model for routine work: one 2,000-person SaaS cut costs 30% by moving the default from Opus to Sonnet, a config change. Pair that with harness engineering on your top agents, which lifted Terminal-Bench 2 pass@1 from 69.7% to 77.0% over 10 iterations with the model held constant.

How many examples does a credible LLM-as-Judge eval set need?

Netflix's published blueprint uses ~600 expert-labeled golden examples with tiered reasoning and consensus scoring across multiple judge models, reaching 83–92% accuracy across four quality dimensions. Treat that as the empirical floor; smaller sets or single-judge setups generally aren't decision-grade for high-stakes generative output.

What's the ProEval pattern and when is it worth building?

ProEval uses cheap surrogate models to score checkpoints continuously, reserving full benchmark runs for release candidates. It's worth building once individual eval runs cost tens of thousands of dollars or eval compute starts rivaling training compute — a quarter-long investment that typically saves 50%+ of eval spend.

Edition 2026-05-01 · read as Data Science

Cost-Per-Correct-Answer: The Eval Metric Finance Will Force

Sources: 40
Words: 1,420
Read: 7min

Topics: Agentic AI · AI Regulation · LLM Inference

◆ The signal

The production question is tokens per correct answer, and accuracy-only evals don't measure it: at comparable quality, Granite 4.1 8B used 19.5× fewer tokens than Qwen3.5 9B, and on Factory AI's 13-model bakeoff a $1.25/PR model held up against ones costing 2×+. The Pragmatic Engineer's survey of 15 companies found AI coding token spend up 10–15× in six months, with per-developer burn reaching $500/day. Teams that aren't tracking cost-per-correct-answer tend to learn about it from finance.

Key facts

IBM Granite 4.1 8B used 19.5× fewer output tokens than Qwen3.5 9B at comparable accuracy on the Artificial Analysis Intelligence Index (4M vs 78M tokens).
A Pragmatic Engineer survey of 15 companies found AI coding token spend grew 10–15× in six months, with per-developer burn reaching $500/day.
LiteLLM CVE-2026-42208, a pre-auth SQL injection exposing provider API keys, was exploited within 36 hours of disclosure; LMDeploy's SSRF was weaponized in 12.5 hours.
Three ML frameworks—LeRobot, KTransformers, and Pipecat—shipped CVSS 9.8 unsafe pickle deserialization CVEs in the same week.
Anthropic disclosed that 93% of AI agent actions are auto-approved by human reviewers, and a N.D. Cal. court ruled platforms with AI exercising 'ultimate authority' qualify as Rule 10b-5 'makers of fraudulent statements'.

◆ INTELLIGENCE MAP

01
Token Efficiency Rewrites Model Selection Math
act now
Granite 4.1 8B spent 4M output tokens vs. 78M for Qwen3.5 9B, a 19.5× gap at similar capability. Factory AI found a $1.25/PR model beat a $3+/PR model on real code review. Meanwhile, per-developer token spend is reaching $500/day in a survey of 15 companies. Accuracy-only evals are now actively misleading for production selection.
19.5× token efficiency gap · 7 sources
Output tokens (millions): Granite 4.1 8B 4 · Qwen3.5 9B 78
02
ML Serving Stack Exploited in Hours, Not Days
act now
LiteLLM's pre-auth SQLi (CVE-2026-42208) was exploited in <36h. LMDeploy's SSRF was weaponized in 12.5h with no public PoC. Three ML frameworks shipped pickle-deserialization RCEs at CVSS 9.8 in the same week. The plausible mechanism is attackers feeding CVE advisories into LLMs to generate working exploits, which compresses patch windows from days to hours.
12.5h disclosure-to-exploit · 6 sources
Hours: LMDeploy SSRF 12.5 · LiteLLM SQLi 36 · Typical patch SLA 720
03
Multi-Model Routing Is Now the Reference Architecture
monitor
Microsoft confirmed Copilot runs on both OpenAI and Anthropic inside Word/Excel/Outlook — 20M seats, $9.25B quarterly AI revenue. Anthropic's annualized revenue jumped to ~$40B. Google is shipping TPUs externally for the first time. Combined hyperscaler Q1 capex hit $112B. Single-provider stacks are officially behind the reference architecture.
$112B Q1 hyperscaler capex · 5 sources
Chart: Google Cloud 63 · Azure 40 · AWS 28
04
Agent Governance: 93% Auto-Approval Meets Legal Liability
monitor
Anthropic reports 93% of agent actions are auto-approved — making human-in-the-loop a rubber stamp, not a safety control. A N.D. Cal. court ruled that when AI exercises 'ultimate authority' over output, the platform is liable under Rule 10b-5. Meanwhile MOAK exploited 174/178 post-cutoff KEVs using off-the-shelf models. Authority level is now a legal design parameter.
93% auto-approval rate · 4 sources
05
Chinese Open-Weight Models Close the Gap at 1/10th the Price
background
MiniMax M2.7 hit 56.2% on SWE-Pro and 57.0% on Terminal Bench 2 as open source. Kimi-K2.6 shipped at 1T params open-weight. Qwen 3.5 Plus serves at $3/M tokens. Simultaneously, Anthropic logged 16M extraction queries across 24K fraudulent accounts, and Congress is probing Airbnb and Cursor's parent for using Chinese-origin models. Capability is converging; the compliance surface is widening.
56.2% M2.7 SWE-Pro score · 4 sources
Chart: MiniMax M2.7 56.2 · Kimi-K2.6 (1T MoE) 57 · Qwen 3.5 Plus 3

◆ DEEP DIVES

The Harness Quarter: Token Efficiency, Eval Economics, and the $500/Day Developer

The New Axis Your Eval Harness Is Missing

Accuracy-only model evaluation is now actively misleading for production decisions, and this week's evidence is specific enough to act on. The invoice is where model choice gets decided this quarter, not the leaderboard.

IBM's Granite 4.1 8B spent 4M output tokens on the Artificial Analysis Intelligence Index against 78M for Qwen3.5 9B, a 19.5× efficiency gap at comparable capability. That gap does not appear on any accuracy leaderboard. It appears on the invoice. Factory AI's 13-model code-review bakeoff found a $1.25/PR model beat one costing more than 2× on real pull requests, with token spend uncorrelated to review quality on their harness. Harness engineering (iterating on prompts, tools, and middleware while holding the model constant) lifted Terminal-Bench 2 pass@1 from 69.7% to 77.0% in 10 iterations, beating the human-designed Codex-CLI baseline at 71.9%.

The harness, not the model, is where this quarter's agent economics live.

The Cost Crisis Is Already Here

Across 15 companies surveyed by The Pragmatic Engineer, AI coding token spend has grown 10–15× in six months. Individual developer burn is hitting $500/day, and one developer burned $10K in a week because of a caching bug. A 2,000-person SaaS cut costs 30% by changing the default model from Opus to Sonnet, which is a config change rather than an architecture change.

On the eval side, evaluation compute has quietly crossed from overhead to bottleneck. Individual eval runs now cost tens of thousands of dollars. DeepMind's ProEval pattern (surrogate models score checkpoints cheaply, with full evals reserved for release candidates) is the direction of travel. Netflix published the most portable eval blueprint of the quarter: ~600 expert-labeled golden examples with tiered reasoning and consensus scoring, landing at 83–92% accuracy across four quality dimensions.
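
The consensus step in a blueprint like this can be sketched as a majority vote across judge models, scored against the expert labels. This is a minimal illustration only: the source does not describe Netflix's actual aggregation rule, so strict-majority voting, the function names, and the label values below are all assumptions.

```python
from collections import Counter
from typing import Optional


def consensus_label(judge_labels: list[str]) -> Optional[str]:
    """Majority vote across judges; None when there is no strict majority."""
    if not judge_labels:
        return None
    label, count = Counter(judge_labels).most_common(1)[0]
    return label if count > len(judge_labels) / 2 else None


def judge_accuracy(golden: dict[str, str], votes: dict[str, list[str]]) -> float:
    """Fraction of golden examples where the consensus matches the expert label.

    Examples with no majority count as misses, which keeps the score conservative.
    """
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(
        consensus_label(votes.get(example_id, [])) == expert
        for example_id, expert in golden.items()
    )
    return hits / len(golden)


# Hypothetical labels for illustration; a real golden set would hold ~600 examples.
golden = {"ex1": "pass", "ex2": "fail", "ex3": "pass"}
votes = {
    "ex1": ["pass", "pass", "fail"],
    "ex2": ["fail", "fail", "fail"],
    "ex3": ["pass", "fail"],  # tie, so no consensus
}
print(judge_accuracy(golden, votes))
```

Running this per quality dimension gives one accuracy figure per dimension, which is the shape of the 83–92% range quoted above.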

The awkward finding: a community benchmark showed the two-word prompt "be brief" matched a purpose-built compression plugin on both token reduction and output quality. Trivial baselines embarrass sophisticated prompt frameworks more often than practitioners admit. What the benchmark doesn't show is whether the plugin still wins on tasks it didn't cover, which is worth checking before ripping the plugin out.

Optimization Lever	Reported Impact	Effort
Opus→Sonnet default swap	30% cost cut	Config change
Harness engineering (10 iters)	+7.3pp accuracy, no model change	1-2 sprints
Token-efficient model swap (Granite)	19.5× fewer tokens	Eval + migration
Surrogate eval (ProEval pattern)	50%+ eval compute savings	Quarter-long build
"be brief" vs. plugin	Parity on token reduction	Two words

What This Changes

The binding constraint has shifted. Eval harnesses that report accuracy without token cost are no longer decision-grade instruments. Every model candidate should produce (accuracy, tokens-consumed) pairs plotted on a Pareto frontier. The Granite vs. Qwen gap only surfaces when token counts are logged per task. The Factory AI finding only surfaces when cost is a first-class eval dimension.
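
A minimal sketch of the frontier computation described above, assuming each candidate has already been reduced to one (accuracy, tokens) pair. The `ModelPoint` type and the numbers are hypothetical and are not measurements of any real model.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    accuracy: float  # higher is better
    tokens: float    # total tokens consumed on the eval; lower is better


def pareto_frontier(points: list[ModelPoint]) -> list[ModelPoint]:
    """Return the points not dominated by any other point.

    A point is dominated when another point is at least as accurate and
    uses no more tokens, and is strictly better on at least one axis.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.accuracy >= p.accuracy
            and q.tokens <= p.tokens
            and (q.accuracy > p.accuracy or q.tokens < p.tokens)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.tokens)


# Made-up numbers for illustration only.
candidates = [
    ModelPoint("model-a", accuracy=0.71, tokens=4e6),
    ModelPoint("model-b", accuracy=0.72, tokens=78e6),
    ModelPoint("model-c", accuracy=0.65, tokens=10e6),
    ModelPoint("model-d", accuracy=0.80, tokens=120e6),
]
for p in pareto_frontier(candidates):
    print(p.name, p.accuracy, p.tokens)
```

The pairwise check is quadratic in the number of candidates, which is fine for a model bakeoff. Anything off the frontier is both less accurate and more expensive than some alternative, so it can be dropped before any pricing discussion starts.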

Microsoft's cloud margin dropped 5 percentage points to 56%, attributed explicitly to inference-heavy AI workloads including GitHub Copilot. Consumption pricing makes token-per-request a product metric rather than an infra footnote. Teams without per-request token accounting will learn their cost structure through a Finance escalation, not a dashboard.
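
Per-request token accounting can be turned into $/correct-answer roughly as follows. This is a sketch under stated assumptions: the `TaskResult` schema is hypothetical, and the prices in the usage lines are placeholders, not any provider's actual rates.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskResult:
    """One eval task outcome (hypothetical schema, not from a specific harness)."""
    task_id: str
    correct: bool
    input_tokens: int
    output_tokens: int


def cost_per_correct_answer(
    results: Iterable[TaskResult],
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Total spend across all tasks divided by the number of correct answers.

    Failed tasks still cost money, so their tokens stay in the numerator.
    Returns float('inf') when nothing passed.
    """
    results = list(results)
    total_usd = sum(
        r.input_tokens * usd_per_m_input / 1e6
        + r.output_tokens * usd_per_m_output / 1e6
        for r in results
    )
    n_correct = sum(r.correct for r in results)
    return total_usd / n_correct if n_correct else float("inf")


# Placeholder prices (USD per million tokens) for illustration only.
runs = [
    TaskResult("t1", True, 1_000, 500),
    TaskResult("t2", False, 1_000, 1_500),
    TaskResult("t3", True, 2_000, 1_000),
]
print(cost_per_correct_answer(runs, usd_per_m_input=1.0, usd_per_m_output=2.0))
```

Dividing total spend by the count of correct answers is the same quantity as mean cost per task divided by pass rate, so it matches the tokens × price ÷ pass-rate framing while keeping the cost of failed attempts visible.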

Action items

Add $/correct-answer and tokens-per-task as first-class metrics in your model eval harness alongside accuracy
Run one iteration of agentic harness engineering on your top agent benchmark, holding the model constant
Build a 600-example golden eval set with tiered + consensus LLM-as-Judge scoring for your highest-stakes generative output
Implement surrogate-model eval (ProEval pattern) for continuous checkpoint scoring, reserving full benchmark runs for release candidates

Sources: AINews · The Pragmatic Engineer · TLDR AI · TLDR Dev · ben's bites · Aaron Holmes

Your ML Serving Stack Is Being Exploited Faster Than You Can Patch

Disclosure-to-Exploit Has Collapsed to Hours

One seven-day window is a small sample, but the signal is hard to ignore. LiteLLM, which fronts most teams' OpenAI, Anthropic, and Bedrock keys, had a pre-auth SQL injection (CVE-2026-42208) exploited within 36 hours of disclosure. LMDeploy's SSRF went from advisory to working exploit in 12.5 hours, with no public PoC available. The plausible mechanism is attackers feeding CVE advisories into LLMs and getting usable exploits back.

Patch-window SLAs were calibrated for a world where writing the exploit was the slow step. On this data, it is not the slow step anymore.

Pickle Deserialization Is the 2026 Pattern-of-the-Year

Three independent ML frameworks shipped the same bug class this week, unsafe deserialization of attacker-controlled artifacts, each at CVSS 9.8:

Framework	CVE	CVSS	Attack Surface
LeRobot ≤0.5.1	CVE-2026-25874	9.8	HF's robotics stack; checkpoints shared broadly
KTransformers ≤0.5.3	CVE-2026-26210	9.8	Local inference on GPU hosts with credentials
Pipecat	CVE-2025-62373	9.8	Voice/multimodal pipelines, network-facing

On top of that: Claude Code's CVSS 10.0 sandbox escape via symlinks (CVE-2026-39861, patched in v2.1.64), NVFlare Dashboard's auth bypass (CVE-2026-24178, CVSS 9.8) in federated learning deployments, and 73 GlassWorm-linked fake extensions in April alone on Open VSX, the marketplace behind Cursor, VSCodium, and most VS Code forks.

The Supply Chain Cascade Is the Structural Threat

The TeamPCP/UNC6780 chain is worth tracing end to end. They poisoned checkmarx/kics:latest on Docker Hub. Bitwarden's Dependabot automation pulled it into CI. The malicious code shipped as @bitwarden/cli 2026.4.0. That is the exact automation topology most modern ML teams run: tag-based image pulls, bot-driven dependency updates, CI runners holding model registry credentials.

In parallel, DPRK's HexagonalRodent is targeting ML and Web3 engineers through fake-recruiter coding assessments that abuse VSCode tasks.json auto-run. The scorecard so far: $12M exfiltrated across 2,726 developer machines in Q1 2026.

LiteLLM deserves separate attention. Running it as a cost and routing gateway funnels every upstream provider key into a single database that sits behind a pre-auth endpoint. The working assumption should be that any keys transiting the service during the vulnerability window are burn-worthy. Rotate first, investigate second.

Two Lever Fixes Cover Most of the Surface

Two policy changes remove most of this week's blast radius: ban tag-based image pulls, and ban pickle loads from untrusted sources in CI. Both are enforceable with a linter and a registry policy, and between them they would have blocked the majority of the ML-relevant exposure listed above. What they don't cover is the pre-auth SQL injection and SSRF class, which remains a patch-speed problem.
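
The linter half of that policy could look something like the sketch below, built on Python's standard `ast` module. It is illustrative rather than complete: it only catches the literal spellings `pickle.load`, `pickle.loads`, and `torch.load`, and would miss aliased imports such as `from pickle import load`. The function name and messages are hypothetical.

```python
import ast

UNSAFE_CALLS = {("pickle", "load"), ("pickle", "loads")}


def find_unsafe_loads(source: str, filename: str = "<string>") -> list[tuple[int, str]]:
    """Return (line, message) for pickle.load(s) and torch.load without weights_only=True."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name)):
            continue
        target = (func.value.id, func.attr)
        if target in UNSAFE_CALLS:
            findings.append((node.lineno, f"{target[0]}.{target[1]} on possibly untrusted data"))
        elif target == ("torch", "load"):
            # Only an explicit literal True counts as safe.
            safe = any(
                kw.arg == "weights_only"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
                for kw in node.keywords
            )
            if not safe:
                findings.append((node.lineno, "torch.load without weights_only=True"))
    return sorted(findings)


# The sample is parsed, never executed, so torch does not need to be installed.
sample = '''import pickle, torch
a = pickle.load(open("m.pkl", "rb"))
b = torch.load("ckpt.pt")
c = torch.load("ckpt.pt", weights_only=True)
'''
for line, message in find_unsafe_loads(sample):
    print(line, message)
```

Wired into CI as a failing check, a rule like this makes the safe path the default and forces any exception to be argued for in review.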

Action items

Rotate all provider keys (OpenAI, Anthropic, Bedrock) that transited LiteLLM and pull 30-day usage anomaly reports from each provider
Enforce torch.load(weights_only=True) or safetensors as a lint rule and block pickle artifacts from untrusted sources at the artifact-store layer
Pin all ML base images by sha256 digest, disable Dependabot auto-merge for container and pip dependencies, and set a <24h patch SLA for ML serving CVEs with technical detail
Disable VSCode 'folderOpen' auto-tasks org-wide and require external repos to run in disposable devcontainers with egress allowlisting
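
The digest-pinning item can be checked mechanically. Below is a rough sketch that scans Dockerfile text for `FROM` lines not pinned by sha256 digest. The function name and image names are hypothetical, and it deliberately ignores `ARG`-substituted image names, which a production check would need to resolve.

```python
import re

# Matches image references that end in an explicit sha256 digest.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_images(dockerfile_text: str) -> list[str]:
    """Return FROM image references that are not pinned by sha256 digest."""
    stage_names: set[str] = set()
    unpinned = []
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        is_stage_ref = image in stage_names  # e.g. "FROM build AS test"
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2])
        if image == "scratch" or is_stage_ref:
            continue
        if not DIGEST_RE.search(image):
            unpinned.append(image)
    return unpinned


# Hypothetical image names; the digest is a dummy value.
pinned = "ghcr.io/example/base@sha256:" + "0" * 64
dockerfile = (
    "FROM python:3.12-slim AS build\n"
    "FROM build AS test\n"
    f"FROM --platform=linux/amd64 {pinned}\n"
)
print(unpinned_images(dockerfile))
```

A tag such as `:latest` can be repointed by whoever controls the repository, which is the mechanism the poisoned-image chain above relied on. A digest cannot be repointed.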

Sources: SANS AtRisk · TLDR InfoSec · CSO Update · CSO First Look · Daniel Miessler · TLDR IT

Agent Governance Has Legal Teeth Now — And Your Metrics Aren't Measuring the Right Thing

The 93% Problem

Anthropic disclosed that 93% of AI agent actions are auto-approved by human reviewers. At that rate, the approval signal is not a safety control. It is a gauge that happens to have a human attached. The signal carries near-zero mutual information with risk. An agent that only ever takes the same three actions in risky contexts is not safe. It is under-explored, and this metric cannot tell the difference.

Anthropic's own recommendation is to move from per-action approval to continuous policy monitoring. The metrics that carry actual signal are override rate on high-risk actions, policy-violation count per 1,000 traces, and action entropy per agent. A new agent version that pushes auto-approval from 93% to 96% is not an improvement. It is the same broken gauge reading slightly higher.
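
The three metrics named above can be computed from an action log along these lines. The `ActionRecord` schema is a hypothetical stand-in for whatever the agent platform actually logs, and the definitions (for example, counting violations per distinct trace) are assumptions rather than a published standard.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRecord:
    """One logged agent action (hypothetical schema)."""
    trace_id: str
    action: str
    high_risk: bool
    overridden: bool         # a human reversed or blocked the action
    policy_violation: bool


def override_rate_high_risk(records: list[ActionRecord]) -> float:
    """Share of high-risk actions that a human overrode."""
    risky = [r for r in records if r.high_risk]
    return sum(r.overridden for r in risky) / len(risky) if risky else 0.0


def violations_per_1000_traces(records: list[ActionRecord]) -> float:
    """Policy violations normalized per 1,000 distinct traces."""
    traces = {r.trace_id for r in records}
    if not traces:
        return 0.0
    return 1000 * sum(r.policy_violation for r in records) / len(traces)


def action_entropy_bits(records: list[ActionRecord]) -> float:
    """Shannon entropy of the action distribution, in bits."""
    counts = Counter(r.action for r in records)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


log = [
    ActionRecord("tr1", "read_file", False, False, False),
    ActionRecord("tr1", "deploy", True, True, False),
    ActionRecord("tr2", "read_file", False, False, False),
    ActionRecord("tr2", "deploy", True, False, True),
]
print(override_rate_high_risk(log), violations_per_1000_traces(log), action_entropy_bits(log))
```

Low entropy on its own is ambiguous, since it can mean a narrow, well-behaved agent or an under-explored one, which is why it is worth reading next to override rate rather than in isolation.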

An agent eval harness without override rate, cost-per-task, and tool-boundary violations is measuring demos, not agents.

Courts Just Made Autonomy a Design Parameter

A Northern District of California court ruled under Rule 10b-5 that when a platform's AI exercises 'ultimate authority' over assembled content, the platform is a 'maker of fraudulent statements'. This is the first U.S. ruling to reject the premise that autonomous AI output is legally equivalent to user-generated content. Section 230-style shields do not automatically extend to agent decisions.

Autonomy Level	Control Flow	Liability Posture	Engineering Signal to Log
Tool-assist	Human → AI suggests → human commits	Shield likely holds	suggestion_id, accept/reject
Human-gated agent	AI drafts → human approves → commit	Shield likely holds	approval_latency, reviewer_id, diff
Autonomous agent	AI assembles and commits unreviewed	Platform exposed as 'maker'	Full chain, model version, guardrails

Most production agentic stacks today ship in the third row and log like the first. The thing this gap doesn't tell you, until discovery does, is whether a human actually closed the loop. Observability stacks capture latency, token spend, and output quality. They rarely capture a first-class authority_level field attesting to human approval.
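
One possible shape for such a record is sketched below. Only `authority_level`, its three values, and `human_reviewer_id` come from the text; the other field names and the validation rule are illustrative assumptions, not a legal or compliance standard.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class AuthorityLevel(Enum):
    TOOL_ASSIST = "tool_assist"
    HUMAN_GATED = "human_gated"
    AUTONOMOUS = "autonomous"


@dataclass(frozen=True)
class DecisionRecord:
    """One agentic decision, persisted alongside the artifact it produced."""
    decision_id: str
    model_version: str
    authority_level: AuthorityLevel
    human_reviewer_id: Optional[str]
    recorded_at: str  # ISO 8601 timestamp

    def __post_init__(self) -> None:
        # A human-gated decision with no named reviewer cannot attest to approval.
        if self.authority_level is AuthorityLevel.HUMAN_GATED and not self.human_reviewer_id:
            raise ValueError("human_gated decisions require human_reviewer_id")

    def to_json(self) -> str:
        payload = asdict(self)
        payload["authority_level"] = self.authority_level.value
        return json.dumps(payload, sort_keys=True)


record = DecisionRecord(
    decision_id="d-001",
    model_version="example-model-v1",  # hypothetical identifier
    authority_level=AuthorityLevel.HUMAN_GATED,
    human_reviewer_id="reviewer-42",
    recorded_at=datetime.now(timezone.utc).isoformat(),
)
print(record.to_json())
```

Rejecting a `human_gated` record without a reviewer at write time is the point of the design: the claim that a human closed the loop is either backed by an identity or never stored.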

Offensive Capability Is Outpacing Defensive Measurement

MOAK's agent built working exploits for 174 of 178 KEVs published after model training cutoffs, using publicly available Opus 4.6 and GPT-5.4 behind ordinary scaffolding. XBOW reports GPT-5.5 cut its miss rate from 40% to 10%, with black-box performance now exceeding what GPT-5 managed with source access. Persistence-on-failing-paths halved. That is a planning improvement, and planning improvements transfer to any long-horizon agent task.

HackerOne paused its Internet Bug Bounty because AI-driven submission volume is outpacing remediation bandwidth. The same queue-overflow dynamic will replay inside every org: PR review, model-card approvals, security alerts. Any human-review queue in the ML stack is one capability jump from the same asymmetry.

Action items

Add authority_level (tool_assist | human_gated | autonomous) and human_reviewer_id to every agentic decision record, persisted for artifact lifetime
Replace per-action approval metrics with a continuous-policy dashboard tracking override rate, policy-violation rate, and action entropy per agent
Gate destructive operations (DB writes, deploys, credential access) behind dry-run + human approval regardless of model confidence
Run FinBot CTF (OWASP) against your agent stack as a pre-production red-team gate, specifically testing MCP tool-description tampering and cross-tenant context bleed

Sources: TLDR IT · Future Perfect · Clint Gibler · CyberScoop

◆ QUICK HITS

Pinterest abandoned click-proxy retrieval for a two-tower DCN v2 with advertiser-level conversion loss — a template for any ranking stack still optimizing engagement as a conversion proxy
Pinterest's two-tower shift + the Linux 7.0 Postgres trap your DBs may be hitting
Linux 7.0 scheduling change halved PostgreSQL throughput via spinlock contention on page faults — fix is enabling huge pages (vm.nr_hugepages); audit any Postgres host on kernel ≥7.0
Shopify's Flow agent: fine-tuned small OSS model beat frontier APIs on accuracy, latency, and cost simultaneously for NL-to-workflow — make distillation-then-fine-tuning the default for bounded-schema agent tasks
Blockify claims 40× RAG corpus reduction via IdeaBlocks — open-source, LangChain/LlamaIndex compatible; run a one-day spike against your current chunker before re-index commitment
Blockify claims a 40x reduction in RAG corpus size via IdeaBlocks
MCP confirmed as de facto agent-to-data standard: Google's Deep Research Max on Gemini 3.1 Pro now uses MCP alongside Anthropic's push — stand up MCP endpoints for feature store and warehouse before Q3
Three new open-weight models landed this week
Deezer reports 44% of daily song uploads are AI-generated — any content-scraped dataset for fine-tuning on post-2024 data has material synthetic contamination; add provenance features to content classifiers
Update: Industrial-scale model extraction confirmed — Anthropic logged 16M queries across 24K fraudulent accounts (~667 queries each, engineered sub-threshold); per-account rate limits are insufficient; build cross-account behavioral clustering
Distillation attacks hit sixteen million Claude queries
Mistral Medium 3.5 ships at 128B dense, 256k context, self-hostable with adjustable reasoning effort — benchmark against your current self-hosted baseline; if 256k holds, chunked-RAG pipelines may collapse
Two empirical results landed in the same week
Factual knowledge scales log-linearly with parameter count (R²=0.917 across 1,400 questions / 188 models / 135M–1.6T params) — reasoning compresses, facts do not; route factual queries differently from reasoning queries if distilling
AINews
Voice AI absorbed $7B+ in Q1 2026; Abridge won HonorHealth (200 centers, 17K staff) with proprietary domain models + self-hosted records, not frontier API wrappers — vertical fine-tunes + tenant isolation beat horizontal frontier
Voice AI revenue crossed $7B/qtr

◆ Bottom line

The take.

Token efficiency just exposed a 19.5× gap between models with comparable accuracy, ML serving infrastructure is being exploited within 12.5 hours of disclosure, and a federal court just ruled that an autonomous agent can make its platform the legal 'maker' of its output. Add $/correct-answer to every eval, rotate your LiteLLM keys today, and log authority_level on every agent decision before opposing counsel does it for you.
