How do I quickly tell if my RAG eval is overstating production accuracy?

If your benchmark corpus is under 50K documents, assume it overstates deployed recall by 30–40 percentage points. The recall cliff for dense retrieval shows up between 100K and 1M documents depending on topical density, so any sub-50K eval should be deprecated as 'production-representative' and replaced with an ablation at 50K, 200K, and 500K on your actual corpus.

Why does dense retrieval collapse at scale while BM25 holds up better?

The two retrievers fail through orthogonal mechanisms. Dense retrieval suffers from embedding-space neighborhood density: as the corpus grows, semantically similar but irrelevant neighbors crowd canonical answers out of top-k. BM25 fails through term-overlap ambiguity, which scales much more slowly. That orthogonality — not either retriever being individually best — is the strongest argument for hybrid retrieval plus cross-encoder reranking.

Do the vLLM V1 fixes actually invalidate prior RL training runs?

Potentially yes. Four V0 issues — mismatched logprobs, default prefix caching in rollouts, weight drift between rollout and training, and bf16/fp16 lm_head — each independently bias policy gradients without crashing the run. PPO/GRPO importance ratios and advantage estimates computed under V0 may reflect infrastructure artifacts rather than learning. Re-baseline critical checkpoints on V1 and diff reward curves before trusting them.

What should cost dashboards track now that Anthropic prices per result?

Add a $/successful-task axis alongside $/1M-tokens, backed by a binary task_success label in telemetry. Token-pricing dashboards systematically mislead on agentic workloads where completion rate dominates cost, and they can't compare vendors that meter different units. Also log the chosen model per call, since vendor auto-selection breaks fixed-model A/B assumptions.

What's the immediate response to the Braintrust breach?

Rotate every LLM provider API key ever stored in Braintrust projects — OpenAI, Anthropic, Bedrock, Vertex, and any others — and audit provider billing dashboards for anomalous usage over the trailing 14 days. Leaked keys are typically monetized through proxy resellers burning inference against the victim's account, and detection lag is measured in days of spend rather than hours.

Edition 2026-05-08 · read as Data Science

Vector Recall Drops 40pts at 500K Docs; Hybrid Holds Up

Sources: 42
Words: 1,675
Read: 8min

Topics: Data Infrastructure · AI Regulation · Agentic AI

◆ The signal

EnterpriseRAG-Bench reports vector retrieval recall falling from 90.7% to 50.6% as the corpus scales from 5K to 500K documents. What a 10K-doc eval doesn't tell you is that production recall sits 30 to 40 points lower. BM25 degrades only 17.4pp over the same range, and that gap is the case for hybrid retrieval and the number worth acting on. Rerun the retriever at 500K before trusting the leaderboard figure.

Key facts

EnterpriseRAG-Bench shows dense vector retrieval recall falls from 90.7% at 5K documents to 50.6% at 500K, a 40.1pp drop.
BM25 retrieval degrades only 17.4pp (85.8% to 68.4%) across the same 5K-to-500K corpus range, making hybrid retrieval the cheapest recall recovery lever.
vLLM V1 fixed four silent bugs that bias RL training: mismatched logprobs, default prefix caching, rollout/training weight drift, and bf16 lm_head computation.
Anthropic became the first frontier lab to move off token-based metering, charging per delivered result, and doubled Claude Code 5-hour rate limits across Pro/Max/Team/Enterprise.
Braintrust confirmed unauthorized access to an AWS account holding customer LLM API keys, alongside critical CVEs in Apache Iceberg (9.9), DocsGPT (9.8), NVIDIA NVFlare (9.8), and Ollama (9.1).

◆ INTELLIGENCE MAP

01
RAG Retrieval Collapses at Enterprise Scale
act now
EnterpriseRAG-Bench (500K docs across 9 SaaS sources) shows dense retrieval falls from 90.7% to 50.6% recall while BM25 falls only to 68.4%. The mechanism is embedding neighborhood density: 3-5 docs per topic at 5K becomes 40-60 at 500K, pushing canonical answers out of top-k. Any eval harness capped at 10K docs is a sales demo.
50.6% vector recall at 500K docs · 3 sources
Recall (%): Vector @ 5K 90.7 · BM25 @ 5K 85.8 · BM25 @ 500K 68.4 · Vector @ 500K 50.6
02
vLLM V1 Correctness Fixes Invalidate RL Baselines
act now
vLLM V1 patched four silent bugs (logprob computation, prefix caching defaults, inflight weight sync, and a reduced-precision bf16/fp16 lm_head, now forced to fp32), each of which independently biases policy gradients. Any GRPO/PPO/DPO run using vLLM V0 as the rollout engine may have contaminated baselines. Re-score existing checkpoints against V1 before trusting any ranking.
4 silent RL-biasing bugs · 3 sources
- Logprobs: off-policy ratios wrong
- Prefix cache: stale logits → biased advantages
- Weight sync: on-policy → silently off-policy
- lm_head bf16→fp32: sampling distribution shift
03
Anthropic Ships Per-Result Pricing + Agent Platform
monitor
Anthropic broke the per-token pricing model, charging per delivered result. Claude Code Auto Mode targets 90% autonomous completion, Agent SDK went GA, and Managed Agents (Dreams, Outcomes, Multiagent) shipped. FinOps dashboards keyed on $/1M tokens are now stale — you need $/successful-task alongside. The Sonnet→Opus advisor pattern claims frontier quality at 5× lower cost.
5× cost reduction (advisor) · 9 sources
Relative cost (Pure Opus = 100): Pure Opus 100 · Sonnet→Opus advisor 20
04
ML Infrastructure Under Active Attack
monitor
Braintrust (LLM eval platform) confirmed an AWS breach exposing customer API keys; rotate today. Simultaneously: Apache Iceberg CVSS 9.9 metadata-write, Ollama CVSS 9.1 heap OOB, NVFlare CVSS 9.8 privesc, and DocsGPT CVSS 9.8 RCE. The Mini Shai-Hulud worm hit PyTorch Lightning on PyPI with credential-theft payloads and stood up 1,800 attacker-controlled GitHub repos in 48 hours.
9.9 Iceberg CVSS score · 4 sources
CVSS: Apache Iceberg 9.9 · DocsGPT MCP 9.8 · NVFlare Dashboard 9.8 · Ollama GGUF 9.1
05
Evaluation Harnesses Measure the Harness, Not the Model
background
Same base model + different harness = 10-20 point deltas on tau2-bench. Terminal-Bench 2.1 patched 28/89 tasks and scores shifted 12 points. Meta killed their token-consumption leaderboard after engineers scripted million-token loops. ProgramBench shows best models clear only 3% of real software-rebuild tasks. The proxy metric and the production outcome are decoupled across the board.
10-20 harness-driven point swing · 4 sources
Chart: tau2-bench harness swing 20 pts · Terminal-Bench patch shift 12 pts · ProgramBench task pass 3%

◆ DEEP DIVES

01
RAG at 500K Docs: The Recall Cliff Your Eval Doesn't See
The Benchmark That Breaks the Demo
Onyx released EnterpriseRAG-Bench this week: 500K+ synthetic documents spanning Slack, Gmail, Jira, GitHub, Confluence, Drive, HubSpot, Fireflies, and Linear, with misfiled files, near-duplicates, and conflicting versions baked in. The headline result is one every RAG architecture review this quarter should absorb. Dense vector retrieval drops from 90.7% recall at 5K documents to 50.6% at 500K. Same embedding model, same retriever, same queries. Corpus size is the only variable the experiment moves.
A RAG benchmark that tops out at 10K docs is a sales demo, not an eval. The cliff arrives somewhere between 100K and 1M documents depending on topical density.
Why BM25 Degrades More Gracefully
The ablation worth staring at: BM25 falls only from 85.8% to 68.4% across the same range. A 17pp drop against dense retrieval's 40pp collapse. The mechanism is structural, not stylistic. As the corpus grows, embedding-space neighborhood density grows monotonically with it. Where 3-5 documents touched a topic at 5K, 40-60 touch it at 500K. The canonical answer gets crowded out of top-k by semantically similar but irrelevant neighbors.
Retriever	5K docs	500K docs	Absolute drop	Failure mode
Dense / vector	90.7%	50.6%	-40.1 pp	Neighborhood density
BM25	85.8%	68.4%	-17.4 pp	Term overlap ambiguity
BM25's failure mode (term ambiguity) is orthogonal to dense retrieval's failure mode (neighborhood density). That orthogonality is the cleanest argument for hybrid retrieval this quarter. Not because either retriever is best, but because their errors are uncorrelated.
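
A minimal sketch of how uncorrelated errors can be exploited with rank fusion. It uses reciprocal rank fusion (RRF), which is an assumption here: the benchmark write-up does not say which fusion method was used. The document IDs and the constant k=60 are illustrative only.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of doc IDs into one ranking.

    k is a smoothing constant; 60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_top = ["d7", "d2", "d9", "d1"]  # hypothetical dense-retriever results
bm25_top = ["d1", "d7", "d4", "d2"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_top, bm25_top])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document only one retriever likes is kept but demoted. A cross-encoder reranker would then rescore the head of this fused list.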
The Production Implication
Two practical consequences fall out. First, fixed top-k is underfit for growing corpora. If top-k=10 worked at 5K docs where 3-5 documents were topically relevant, it fails at 500K where 40-60 are. Adaptive k, or aggressive cross-encoder reranking, becomes load-bearing. Second, any internal RAG demo benchmarked on fewer than 50K docs is systematically overstating production accuracy by 30-40 percentage points.
Separately, Bing's engineering team published a framing distinction between search indexing (optimizing what a human should read) and grounding indexing (optimizing what an LLM should cite). Standard recall@k measures the former. Production RAG failures come from the latter: retrieved passages that are topically relevant but evidentially insufficient for the generated claim. The eval harness needs a claim-level evidence sufficiency score sitting alongside retrieval metrics.
What the benchmark doesn't tell you
The corpus is synthetic. The embedding model and ANN configuration are unspecified. Whether late-interaction models (ColBERT) or Matryoshka embeddings compress the gap is open. The cleanest answer is an in-house ablation on the deployed corpus at deployed scale.
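
One way such an in-house ablation could be scored, as a rough sketch. The function names, the tier labels, and the toy query results are all hypothetical; only the metrics themselves (recall@k and MRR) come from the text.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc IDs that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant doc, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate_tier(runs, k=10):
    """runs: list of (retrieved_ids, relevant_ids) pairs for one corpus size."""
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }

# Hypothetical results for the same two queries at two corpus sizes.
tiers = {
    "50K": [(["a", "x", "y"], ["a"]), (["b", "z", "w"], ["b"])],
    "500K": [(["x", "a", "y"], ["a"]), (["z", "w", "v"], ["b"])],
}
report = {size: evaluate_tier(runs, k=3) for size, runs in tiers.items()}
print(report)
```

Holding the query set fixed across tiers is the point: the only thing that changes between rows of the report is corpus size.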
Action items
- Run your current retriever against EnterpriseRAG-Bench at 50K, 200K, and 500K scales — log recall@10/50 and MRR per scale tier
- A/B hybrid retrieval (BM25 + dense + cross-encoder rerank) against dense-only on a 200K+ document slice
- Instrument embedding neighborhood density (mean k-NN distance per ingest batch) as a production drift metric
- Deprecate any internal RAG benchmark running on <50K docs as 'production-representative'
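
The neighborhood-density item above could be instrumented roughly like this. It is a brute-force sketch in plain Python under stated assumptions: cosine distance as the metric, non-zero vectors, a non-empty index, and toy 2-D embeddings. A real pipeline would query the ANN index rather than scan it.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def mean_knn_distance(batch, index, k=5):
    """Mean distance from each new vector to its k nearest indexed vectors.

    A value that falls from one ingest batch to the next suggests
    neighborhoods are getting denser.
    """
    per_vector = []
    for vec in batch:
        nearest = sorted(cosine_distance(vec, other) for other in index)[:k]
        per_vector.append(sum(nearest) / len(nearest))
    return sum(per_vector) / len(per_vector)

# Toy 2-D embeddings: the second index adds near-duplicates around [1, 0].
sparse_index = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
dense_index = sparse_index + [[0.9, 0.1], [0.95, 0.05], [0.99, 0.01]]
batch = [[1.0, 0.0]]
sparse_score = mean_knn_distance(batch, sparse_index, k=2)
dense_score = mean_knn_distance(batch, dense_index, k=2)
print(sparse_score, dense_score)
```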
Sources: Daily Dose of DS · AINews · MarketingShot

vLLM V1 Fixed Four Silent Bugs — Your RL Baselines Are Contaminated

The Bugs That Don't Fail Loudly

vLLM V1 shipped corrections for four issues that independently bias policy gradients in RL training. None of them crash the run. They are silent numerical contamination that shifts reward curves in a way you cannot distinguish from a hyperparameter sweep:

V0 Issue	V1 Fix	What it breaks in RL
Processed logprobs mismatched raw	Returns pre-processing logprobs	Off-policy correction ratios in PPO/GRPO are wrong
Prefix caching on by default	Disabled in rollout paths	Cached prefixes produce stale logits → biased advantage estimates
Weight-update model drift	Matched rollout and training weights exactly	On-policy algorithm silently becomes off-policy
lm_head computed in bf16/fp16	Forced fp32	Logit tails shift; small but compounding in long rollouts

Any eval or training run on V0 should be treated as suspect until re-baselined. This is the exact failure mode that makes RL papers irreproducible.

Why This Matters More Than It Looks

The logprob bug by itself is enough to invalidate PPO importance sampling ratios. The prefix-caching bug means advantage estimates were computed against stale state. Together, the reward curves teams liked from V0 runs may reflect infrastructure artifacts, not learning. A run that looked like it converged may have converged to the bug's attractor rather than the policy's.
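
To make the importance-ratio point concrete, here is a small numeric sketch. The logprob values and the constant 0.3-nat offset are made up; the offset is only a stand-in for any mismatch between what the rollout engine reports and what the trainer computes, not a measurement of the actual V0 behavior.

```python
import math

def importance_ratios(train_logprobs, rollout_logprobs):
    """Per-token ratio pi_train(a|s) / pi_rollout(a|s) from log-probabilities."""
    return [math.exp(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]

def clipped_fraction(ratios, eps=0.2):
    """Share of tokens whose ratio falls outside the PPO clip range [1-eps, 1+eps]."""
    outside = [r for r in ratios if r < 1.0 - eps or r > 1.0 + eps]
    return len(outside) / len(ratios)

# Hypothetical per-token logprobs computed by the trainer for one sequence.
train = [-0.5, -1.2, -2.0, -0.1]

# Case 1: the rollout engine reports the same policy's logprobs, so ratios are 1.
exact = importance_ratios(train, train)

# Case 2: the rollout engine's reported logprobs are off by 0.3 nats per token.
skewed_rollout = [lp - 0.3 for lp in train]
skewed = importance_ratios(train, skewed_rollout)

print(clipped_fraction(exact), clipped_fraction(skewed))
```

With identical weights on both sides, every ratio should be 1 and nothing should clip. In the skewed case every token lands outside the clip range even though the policy never changed, which is the signature of an infrastructure artifact rather than learning.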

NVIDIA independently published lossless speculative decoding for RL rollout this week: 2.5× end-to-end speedup at 235B parameters without changing the policy distribution. The load-bearing word is 'lossless.' Speculative decoding can bias sampling at temperature > 0 unless the acceptance step is exact, which is why it has been treated as unsafe for on-policy RL data. If the claim holds under replication, the combination of V1 fixes and lossless spec-decode makes RL post-training both correct and fast for the first time in a production stack.

The Broader Harness Problem

The V1 fixes land alongside evidence that harness engineering alone swings agent benchmark scores 10-20 points on identical base models (tau2-bench). Terminal-Bench 2.1 patched 28 of 89 tasks and absolute scores moved by up to 12 points. Rankings held. Magnitudes did not. What a top-line score doesn't tell you: internal leaderboards that don't log (model_version, harness_version) as a tuple produce numbers that cannot be compared week to week.
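
A minimal sketch of what logging the tuple could look like. The class, field values, and version strings are hypothetical; the only idea taken from the text is that a score without its harness version should not be comparable to anything.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LeaderboardEntry:
    model_version: str
    harness_version: str
    score: float

    def __post_init__(self):
        # Reject entries that omit either half of the (model, harness) tuple.
        if not self.model_version or not self.harness_version:
            raise ValueError("model_version and harness_version are both required")

def score_delta(a, b):
    """Score difference b - a, refused when the harness changed between entries."""
    if a.harness_version != b.harness_version:
        raise ValueError(
            f"harness mismatch: {a.harness_version} vs {b.harness_version}"
        )
    return b.score - a.score

# Hypothetical entries: two on the same harness, one on a patched harness.
last_week = LeaderboardEntry("model-a", "harness-1", 0.5)
this_week = LeaderboardEntry("model-b", "harness-1", 0.75)
patched = LeaderboardEntry("model-b", "harness-2", 0.875)

print(score_delta(last_week, this_week))
```

Calling `score_delta(last_week, patched)` raises instead of returning a number, which turns a silent apples-to-oranges comparison into a visible failure.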

The contamination runs both directions. V0's bugs inflated some results and suppressed others. The only way to know which of your checkpoints survived is to replay against V1. If the ranking doesn't move, the migration still pays for itself in audit trail. If it does move, you needed to know before shipping another model card.

Action items

- Re-run the two most important RL training runs from last quarter on vLLM V1 and diff reward curves against V0 baselines
- Add harness_version as a mandatory field in every internal leaderboard entry, and rerun any model comparison from the last 90 days that didn't hold harness constant
- Evaluate lossless speculative decoding for your RL post-training rollout loop; run an A/A on reward curves before trusting the A/B
- Pin vLLM version in all training and eval Dockerfiles with hash verification

Sources: AINews · TLDR AI · Techpresso · Engineer's Codex

Anthropic's Per-Result Pricing Breaks Your Cost Model

The Token Model Is Dead for Agentic Work

Anthropic is the first frontier lab to move off token-based metering, charging per delivered result instead of per token. In the same week: Claude Code Auto Mode (targeting 90% autonomous task completion), Agent SDK going GA, and Managed Agents shipping Dreams (trajectory replay), Outcomes (success-criteria grading), and Multiagent orchestration. The old eval question was cost per million tokens at a fixed quality bar. The new question is cost per successful outcome. The two almost never correlate cleanly.

Dimension	Claude Code Auto Mode	OpenAI Agents SDK	Your in-house router
Pricing unit	Per result (new)	Per token	Per token (passthrough)
Model selection	Vendor auto-picks per task	Developer-specified	Hand-tuned heuristics
Autonomy target	~90% (unverified)	Not stated	Varies
Reproducibility	Non-deterministic	Deterministic if pinned	Deterministic

The Sonnet→Opus advisor pattern from the Code with Claude event claims frontier-quality benchmarks at 5× lower cost, routing cheap calls through Sonnet and escalating to Opus conditionally. The claim comes with no escalation-rate disclosure, no task breakdown, and no public eval harness, so it says nothing about how well calibrated the escalation decision is on your distribution. The economics only work if the escalation rate is low and the router is right when it escalates.
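
A back-of-envelope sketch of that escalation arithmetic. The cost model is an assumption, not Anthropic's disclosed mechanism: every task pays for the cheap model once, and an escalated task additionally pays for the expensive model. The relative prices are invented for illustration.

```python
def advisor_cost(cheap_cost, expensive_cost, escalation_rate):
    """Expected cost per task: cheap model always, expensive model on escalation."""
    return cheap_cost + escalation_rate * expensive_cost

def max_escalation_rate(cheap_cost, expensive_cost, target_reduction):
    """Largest escalation rate that still yields a `target_reduction`x saving
    versus sending every task to the expensive model."""
    budget = expensive_cost / target_reduction
    return max(0.0, (budget - cheap_cost) / expensive_cost)

# Hypothetical relative prices per task: expensive model 100 units, cheap model 12.
limit = max_escalation_rate(cheap_cost=12.0, expensive_cost=100.0, target_reduction=5.0)
cost_at_limit = advisor_cost(12.0, 100.0, limit)
print(limit, cost_at_limit)
```

Under these made-up prices the 5× claim only survives if fewer than about 8% of tasks escalate. Plugging in measured per-task costs and the observed escalation rate from your own traffic gives the number the announcement left out.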

FinOps dashboards keyed on $/1M tokens are already stale. They need a second axis: $/successful-task. That requires task-level success instrumentation most teams don't have.
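
A rough sketch of the second axis, assuming telemetry already carries a per-task cost and a binary task_success label. The record layout and the two example routes are hypothetical.

```python
def cost_per_successful_task(records):
    """records: iterable of dicts with 'cost_usd' (float) and 'task_success' (bool).

    Failed attempts still cost money, so total spend is divided by successes only.
    Returns None when there are no successes, since the ratio is undefined.
    """
    total_cost = 0.0
    successes = 0
    for rec in records:
        total_cost += rec["cost_usd"]
        successes += 1 if rec["task_success"] else 0
    return total_cost / successes if successes else None

# Hypothetical telemetry: route A is cheaper per call but completes less often.
route_a = [{"cost_usd": 0.10, "task_success": s} for s in (True, False, False, False)]
route_b = [{"cost_usd": 0.25, "task_success": s} for s in (True, True, True, False)]

a = cost_per_successful_task(route_a)
b = cost_per_successful_task(route_b)
print(a, b)
```

In this toy data the route with the lower per-call price is the more expensive one per completed task, which is the inversion a $/1M-tokens dashboard cannot show.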

What This Changes in Practice

When the vendor auto-selects models per call, A/B tests that assume a fixed-model baseline produce noisy results. The same prompt hits different weights across runs. Log the chosen model per call, pin models for experiments, and move to variance-aware significance testing.

The routing implication is concrete. Short, deterministic calls where token pricing is already near floor will look worse on outcome pricing. Multi-step agentic work where completion rate drives cost will look better, possibly much better. The only way to know which side your workload falls on is to run the A/B on real traffic.

Rate limits doubled, not just pricing

Claude Code 5-hour rate limits doubled across Pro/Max/Team/Enterprise, peak-hour throttling is gone, and Opus API ceilings are up, sourced to the 220K-GPU Colossus 1 lease. Any agent workflow that was throttled needs rebenchmarking. Last quarter's finding that "the agent gives up on long tasks" may have been measuring the rate limiter, not the model. I flagged that risk in the prior column. The fix is cheaper than the analysis we did last time.

EU AI Act extension buys 16 months

The high-risk deadline moved from August 2026 to December 2027. Keep the compliance roadmap, slow the hiring ramp, reclaim the budget for evaluation work. Connecticut SB5 (codifying that automated decisions are not a defense to discrimination) has a shorter fuse. Whistleblower triggers at 10^26 FLOPs go live October 2026.

Action items

- Rebuild LLM cost attribution dashboards to track $/successful-task alongside $/1M-tokens, starting with a binary task_success label piped into telemetry
- Spike the Sonnet→Opus advisor pattern against your top-3 LLM workloads and measure quality delta vs. pure-Opus on your existing eval set
- Rebenchmark Claude Code throughput under new 2× rate limits, comparing wall-clock and success@k versus April results
- Log cumulative training FLOPs as a first-class metric in your experiment tracker; CT SB5 makes this a regulatory trigger at 10^26
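
For the FLOPs item, a hedged sketch of a running ledger. It uses the common rule of thumb of roughly 6 FLOPs per parameter per training token for dense transformers, which is an approximation assumed here and not a regulatory definition. The class name and the example run are hypothetical.

```python
THRESHOLD_FLOPS = 1e26  # trigger level cited in the text

def approx_training_flops(n_params, n_tokens):
    """Rough dense-transformer estimate: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

class FlopsLedger:
    """Accumulates estimated training compute across runs."""

    def __init__(self):
        self.total = 0.0

    def log_run(self, n_params, n_tokens):
        self.total += approx_training_flops(n_params, n_tokens)
        return self.total

    def fraction_of_threshold(self):
        return self.total / THRESHOLD_FLOPS

ledger = FlopsLedger()
ledger.log_run(n_params=7e9, n_tokens=2e12)  # hypothetical 7B model, 2T tokens
print(ledger.total, ledger.fraction_of_threshold())
```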

Sources: AI Weekly · Future Perfect · TLDR AI · TLDR · Techpresso · ben's bites

Rotate Now: Braintrust Breach + Four Critical ML-Stack CVEs

The Eval Platform Had Your Keys

Braintrust, the LLM eval platform most ML teams use to score prompts, agents, and RAG chains, confirmed unauthorized access to an AWS account holding customer API keys. If anyone on the team ever pasted an OpenAI, Anthropic, Bedrock, or Vertex key into a Braintrust project, which is the standard integration path, treat those keys as compromised. The monetization pattern on leaked LLM keys is well-understood by now: attackers burn them on inference against the victim's billing account, usually through proxy resellers. The thing the platform alert doesn't tell you is that detection lag is measured in days of spend.
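
Because detection lag is measured in spend, a crude check over exported billing data can shorten it. The sketch below assumes you can export daily spend per provider as (label, USD) pairs; the threshold rule (baseline mean plus three standard deviations) and the synthetic history are illustrative choices, not a provider feature.

```python
from statistics import mean, stdev

def flag_anomalous_days(daily_spend, baseline_days=30, window_days=14, z=3.0):
    """daily_spend: list of (day_label, usd) ordered oldest-first.

    Uses the days before the trailing window as the baseline and flags any day
    in the window whose spend exceeds baseline mean + z * baseline stdev.
    """
    window = daily_spend[-window_days:]
    baseline = [
        usd for _, usd in daily_spend[-(window_days + baseline_days):-window_days]
    ]
    if len(baseline) < 2:
        raise ValueError("need at least two baseline days")
    threshold = mean(baseline) + z * stdev(baseline)
    return [(day, usd) for day, usd in window if usd > threshold]

# Synthetic history: 44 days alternating $100/$110, with one spike in the window.
history = [(f"day-{i:02d}", 100.0 + 10.0 * (i % 2)) for i in range(44)]
history[40] = ("day-40", 900.0)

flagged = flag_anomalous_days(history)
print(flagged)
```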

Four Simultaneous Critical CVEs in the ML Stack

Component	CVE / Severity	Attack	Where it hurts you
Apache Iceberg	CVE-2026-42812 / 9.9	Write table metadata to attacker location	Silent feature poisoning in training datasets
DocsGPT 0.15-0.16	CVE-2026-26015 / 9.8	MCP test bypass → RCE	RAG/agent stacks with MCP endpoints
NVIDIA NVFlare	CVE-2026-24178 / 9.8	AuthZ bypass + privesc	Federated learning trust boundary collapse
Ollama <0.17.1	CVE-2026-7482 / 9.1	Heap OOB read in GGUF loader	Dev workstations running local LLMs

Iceberg is the one to take seriously for data teams. Write access to the catalog is write access to the training data's ground truth. A compromised metadata path lets an attacker redirect feature materialization to poisoned tables, and no downstream query engine will flag it.

If requirements.txt isn't hash-pinned and the Iceberg catalog isn't tightly ACL'd, this week's CVEs already walked through the ML stack.

The Worm Is Cross-Ecosystem Now

The Mini Shai-Hulud worm ran for roughly 48 hours on April 29-30. It compromised PyTorch Lightning on PyPI, SAP packages on npm, and intercom-client and intercom-php on Packagist. It stood up 1,800 attacker-controlled GitHub repositories from stolen credentials. The mechanism is credential reuse across ecosystems: lift a GitHub token from an npm install, pivot to PyPI and Packagist, republish. Fully automated. Copycats follow within weeks, because they always do.

Separately, internet-exposed inference infrastructure is being actively enumerated. AIMap fingerprints MCP servers, Ollama, vLLM, LiteLLM, LangServe, Gradio, and ComfyUI through Shodan and Nuclei. If any of those stood up on a VPC with a public subnet, even briefly for a demo, assume the instance is already in someone's catalog.

Action items

- Rotate every LLM provider API key ever stored in Braintrust, and check provider billing dashboards for anomalous usage in the trailing 14 days
- Patch Apache Iceberg and tighten catalog write ACLs; add manifest hash verification to downstream feature materialization DAGs
- Upgrade Ollama to ≥0.17.1 across all dev workstations and enforce GGUF provenance checks from internal registry only
- Enforce hash-pinned lockfiles (pip-tools, uv lock) on all training and serving images; unpinned deps should fail CI
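
The "unpinned deps should fail CI" item could be approximated with a small lint like the one below. It is a simplified sketch: it assumes a pip-style requirements file with backslash continuations for hash lines, and it ignores nested includes, URLs, and environment markers. The package names in the sample are placeholders.

```python
import re

# An exact pin looks like "name==version" (extras such as name[extra] allowed).
PIN_RE = re.compile(r"^[A-Za-z0-9._\[\],-]+==[^\s;]+")

def unpinned_requirements(text):
    """Return requirement entries lacking an exact '==' pin or a sha256 hash."""
    joined = text.replace("\\\n", " ")  # fold continuation lines into one entry
    problems = []
    for line in joined.splitlines():
        line = line.split(" #")[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # skip blanks, comments, and option lines
        if not PIN_RE.match(line) or "--hash=sha256:" not in line:
            problems.append(line.split()[0])
    return problems

# Placeholder requirements: the first entry is pinned with a hash, the second is not.
sample = (
    "examplepkg==1.0.0 \\\n"
    "    --hash=sha256:" + "0" * 64 + "\n"
    "otherpkg>=2.0\n"
)
print(unpinned_requirements(sample))
```

A CI step would exit non-zero whenever the returned list is non-empty. pip's own hash-checking mode enforces the same property at install time, so this lint is a fast pre-check rather than a replacement.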

Sources: SANS AtRisk · TLDR InfoSec · The Hacker News · CyberScoop

◆ QUICK HITS

Update: Anthropic's Colossus lease is 100% of xAI's 300MW site — rate limits doubled for Claude Code and peak-hour throttling removed; rebenchmark throughput before next planning cycle
TLDR
GPT-5.5 claims 52.5% hallucination reduction on 'high-stakes prompts' — no eval set, sample size, or rubric disclosed; treat as hypothesis and rerun your factuality suite before routing traffic
ben's bites
Halodoc's self-healing CDC pipeline cuts recovery from 45min to <5min using checkpoint rewind + consistency checks — the primitives port to most Spark/CDC stacks in one sprint
TLDR Data
Fivetran compiled SQLGlot with mypyc for ~5× parsing speedup — zero API change, ships as optional package; install today if SQL parsing sits in any hot path
TLDR Data
GitHub uptime at 85.51% (2-3 hours downtime/day) driven by 3.5× AI-agent load growth — audit every ML pipeline for GitHub-as-SPOF including training triggers, model registry refs, and Actions-based evals
The Pragmatic Engineer
Pennsylvania suing Character.AI for chatbot fabricating a medical license number — add credential-claim detection (regex + NER) as a blocking release metric for any LLM touching regulated domains
The Hustle
Microsoft culling Copilot features across Windows/Xbox due to inference costs compressing margins — even with free OpenAI compute, ambient chatbot unit economics don't pencil; build cost-per-successful-task dashboards
Aaron Holmes
vLLM + Mooncake prefix caching: 92.2% hit rate on agentic workloads (up from 1.7%), yielding 3.8× throughput — expect ~50-70% of that in production with messier prompt shapes
AINews
PAN-OS CVE-2026-0300: unauthenticated root RCE under active exploitation, patches not until May 13 — if any ML endpoint sits behind PA-Series, isolate the management portal today
CyberScoop
Lightweight LLM agents beat complex reranker stacks on Amazon ESCI: NDCG 0.29→0.41-0.45 — expect half the lift on in-house data, but the simpler architecture wins if latency budget allows
TLDR Data

◆ Bottom line

The take.

Your baselines are lying across three layers simultaneously: vector retrieval halves at 500K documents (any eval under 50K is fiction), vLLM V0's four silent bugs contaminated RL reward curves, and Braintrust's breach means your LLM API keys may already be burning tokens on someone else's bill — fix the keys today, rerun the evals this sprint, and stop trusting demo-scale RAG benchmarks as production truth.

