Data Science daily

Edition 2026-05-08 · read as Data Science

Vector Recall Drops 40 pts at 500K Docs; Hybrid Holds Up

Sources: 42 · Words: 1,675 · Read: 8 min

Topics Data Infrastructure AI Regulation Agentic AI

◆ The signal

EnterpriseRAG-Bench reports vector retrieval recall falling from 90.7% to 50.6% as the corpus scales from 5K to 500K documents. A 10K-doc eval hides that gap: production-scale recall sits 30 to 40 points below the demo figure. BM25 degrades only 17pp over the same range, and that is the number that makes the case for hybrid retrieval. Rerun the retriever at 500K before trusting the leaderboard figure.

◆ INTELLIGENCE MAP

  1. 01

    RAG Retrieval Collapses at Enterprise Scale

    act now

    EnterpriseRAG-Bench (500K docs across 9 SaaS sources) shows dense retrieval falls from 90.7% to 50.6% recall while BM25 falls only to 68.4%. The mechanism is embedding neighborhood density: 3-5 docs per topic at 5K becomes 40-60 at 500K, pushing canonical answers out of top-k. Any eval harness capped at 10K docs is a sales demo.

    50.6% vector recall at 500K docs · 3 sources

    Recall by retriever and corpus size:
    • Vector @ 5K docs: 90.7%
    • BM25 @ 5K docs: 85.8%
    • BM25 @ 500K docs: 68.4%
    • Vector @ 500K docs: 50.6%
    • Recall drop (dense): 40.1 pp
  2. 02

    vLLM V1 Correctness Fixes Invalidate RL Baselines

    act now

    vLLM V1 patched four silent bugs — logprob computation, prefix caching defaults, inflight weight sync, and fp32 lm_head — each of which independently biases policy gradients. Any GRPO/PPO/DPO run using vLLM V0 as the rollout engine has contaminated baselines. Re-score existing checkpoints against V1 before trusting any ranking.

    4 silent RL-biasing bugs · 3 sources

    • Logprob mismatch: off-policy ratios wrong
    • Prefix cache default: stale logits → biased advantages
    • Weight sync drift: on-policy → silently off-policy
    • lm_head bf16→fp32: sampling distribution shift
  3. 03

    Anthropic Ships Per-Result Pricing + Agent Platform

    monitor

    Anthropic broke the per-token pricing model, charging per delivered result. Claude Code Auto Mode targets 90% autonomous completion, Agent SDK went GA, and Managed Agents (Dreams, Outcomes, Multiagent) shipped. FinOps dashboards keyed on $/1M tokens are now stale — you need $/successful-task alongside. The Sonnet→Opus advisor pattern claims frontier quality at 5× lower cost.

    Cost reduction (advisor claim) · 9 sources

    Relative cost, pure Opus = 100:
    • Pure Opus: 100
    • Sonnet→Opus advisor: 20
  4. 04

    ML Infrastructure Under Active Attack

    monitor

    Braintrust (LLM eval platform) confirmed an AWS breach exposing customer API keys. Rotate today. Simultaneously: Apache Iceberg CVSS 9.9 metadata-write, Ollama CVSS 9.1 heap OOB, NVFlare CVSS 9.8 privesc, and DocsGPT CVSS 9.8 RCE. The Mini Shai-Hulud worm hit PyTorch Lightning on PyPI with credential-theft payloads and stood up 1,800 attacker-controlled repos in 48 hours.

    9.9 Iceberg CVSS score · 4 sources

    CVSS scores:
    • Apache Iceberg: 9.9
    • DocsGPT MCP: 9.8
    • NVFlare Dashboard: 9.8
    • Ollama GGUF: 9.1
  5. 05

    Evaluation Harnesses Measure the Harness, Not the Model

    background

    Same base model + different harness = 10-20 point deltas on tau2-bench. Terminal-Bench 2.1 patched 28/89 tasks and scores shifted 12 points. Meta killed their token-consumption leaderboard after engineers scripted million-token loops. ProgramBench shows best models clear only 3% of real software-rebuild tasks. The proxy metric and the production outcome are decoupled across the board.

    10-20 harness-driven point swing · 4 sources

    • tau2-bench harness swing: up to 20 points
    • Terminal-Bench patch shift: 12 points
    • ProgramBench task pass: 3%

◆ DEEP DIVES

  1. 01

    RAG at 500K Docs: The Recall Cliff Your Eval Doesn't See

    The Benchmark That Breaks the Demo

    Onyx released EnterpriseRAG-Bench this week: 500K+ synthetic documents spanning Slack, Gmail, Jira, GitHub, Confluence, Drive, HubSpot, Fireflies, and Linear, with misfiled files, near-duplicates, and conflicting versions baked in. The headline result is one every RAG architecture review this quarter should absorb. Dense vector retrieval drops from 90.7% recall at 5K documents to 50.6% at 500K. Same embedding model, same retriever, same queries. Corpus size is the only variable the experiment moves.

    A RAG benchmark that tops out at 10K docs is a sales demo, not an eval. The cliff arrives somewhere between 100K and 1M documents depending on topical density.

    Why BM25 Degrades More Gracefully

    The ablation worth staring at: BM25 falls only from 85.8% to 68.4% across the same range. A 17pp drop against dense retrieval's 40pp collapse. The mechanism is structural, not stylistic. As the corpus grows, embedding-space neighborhood density grows monotonically with it. Where 3-5 documents touched a topic at 5K, 40-60 touch it at 500K. The canonical answer gets crowded out of top-k by semantically similar but irrelevant neighbors.

    Retriever | 5K docs | 500K docs | Absolute drop | Failure mode
    Dense / vector | 90.7% | 50.6% | -40.1 pp | Neighborhood density
    BM25 | 85.8% | 68.4% | -17.4 pp | Term overlap ambiguity
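
The neighborhood-density failure mode can be turned into a number worth tracking. A minimal sketch, assuming embeddings are available as plain lists of floats and that brute-force search over a sampled batch is affordable; `cosine_distance` and `mean_knn_distance` are illustrative names, not part of the benchmark:

```python
import math

def cosine_distance(a, b):
    # Assumes non-zero vectors; a zero vector would divide by zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean_knn_distance(vectors, k=3):
    """Average cosine distance from each vector to its k nearest neighbours."""
    if len(vectors) <= k:
        raise ValueError("need more than k vectors")
    total = 0.0
    for i, v in enumerate(vectors):
        dists = sorted(
            cosine_distance(v, w) for j, w in enumerate(vectors) if j != i
        )
        total += sum(dists[:k]) / k
    return total / len(vectors)

# Toy data: the second corpus adds near-duplicates around one topic.
sparse_corpus = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
crowded_corpus = sparse_corpus + [[1, 0.1, 0], [1, 0, 0.1], [0.9, 0.1, 0]]

sparse_density = mean_knn_distance(sparse_corpus, k=1)
crowded_density = mean_knn_distance(crowded_corpus, k=1)
```

A falling mean k-NN distance across ingest batches means neighborhoods are filling up, which is the condition under which a fixed top-k starts missing the canonical document.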

    BM25's failure mode (term ambiguity) is orthogonal to dense retrieval's failure mode (neighborhood density). That orthogonality is the cleanest argument for hybrid retrieval this quarter. Not because either retriever is best, but because their errors are uncorrelated.
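
One common way to combine two retrievers whose errors differ is reciprocal rank fusion (RRF). The source does not say which fusion method, if any, the benchmark used, so treat this as an illustrative sketch; the document ids are made up and `k=60` is the conventional RRF constant:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank) per list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_top = ["d7", "d2", "d9", "d4"]   # hypothetical dense-retriever output
bm25_top = ["d4", "d7", "d1", "d3"]    # hypothetical BM25 output
fused = reciprocal_rank_fusion([dense_top, bm25_top])
```

Documents that both retrievers surface (`d7`, `d4`) rise to the top, while a document only one retriever likes still stays in the candidate set for a reranker.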


    The Production Implication

    Two practical consequences fall out. First, fixed top-k is underfit for growing corpora. If top-k=10 worked at 5K docs where 3-5 documents were topically relevant, it fails at 500K where 40-60 are. Adaptive k, or aggressive cross-encoder reranking, becomes load-bearing. Second, any internal RAG demo benchmarked on fewer than 50K docs is systematically overstating production accuracy by 30-40 percentage points.

    Separately, Bing's engineering team published a framing distinction between search indexing (optimizing what a human should read) and grounding indexing (optimizing what an LLM should cite). Standard recall@k measures the former. Production RAG failures come from the latter: retrieved passages that are topically relevant but evidentially insufficient for the generated claim. The eval harness needs a claim-level evidence sufficiency score sitting alongside retrieval metrics.

    What the benchmark doesn't tell you

    The corpus is synthetic. The embedding model and ANN configuration are unspecified. Whether late-interaction models (ColBERT) or Matryoshka embeddings compress the gap is open. The cleanest answer is an in-house ablation on the deployed corpus at deployed scale.

    Action items

    • Run your current retriever against EnterpriseRAG-Bench at 50K, 200K, and 500K scales — log recall@10/50 and MRR per scale tier
    • A/B hybrid retrieval (BM25 + dense + cross-encoder rerank) against dense-only on a 200K+ document slice
    • Instrument embedding neighborhood density (mean k-NN distance per ingest batch) as a production drift metric
    • Deprecate any internal RAG benchmark running on <50K docs as 'production-representative'
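
The per-tier logging in the first action item can be sketched as follows. This assumes each query yields a ranked list of retrieved ids plus a set of gold ids; the function names and the toy data are illustrative:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant doc ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant hit, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate_tier(runs, k_values=(10, 50)):
    """runs: list of (retrieved_ids, relevant_ids) pairs for one corpus-size tier."""
    if not runs:
        raise ValueError("no queries to evaluate")
    n = len(runs)
    report = {
        f"recall@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n
        for k in k_values
    }
    report["mrr"] = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return report

# Toy tier with two queries; real runs would use k_values=(10, 50).
toy_runs = [(["a", "b", "c"], ["b"]), (["x", "y", "z"], ["q"])]
toy_report = evaluate_tier(toy_runs, k_values=(1, 3))
```

Running `evaluate_tier` once per corpus size (50K, 200K, 500K) and storing the reports side by side gives the degradation curve directly.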

    Sources: Daily Dose of DS · AINews · MarketingShot

  2. 02

    vLLM V1 Fixed Four Silent Bugs — Your RL Baselines Are Contaminated

    The Bugs That Don't Fail Loudly

    vLLM V1 shipped corrections for four issues that independently bias policy gradients in RL training. None of them crash the run. They are silent numerical contamination that shifts reward curves in a way you cannot distinguish from a hyperparameter sweep:

    V0 Issue | V1 Fix | What it breaks in RL
    Processed logprobs mismatched raw | Returns pre-processing logprobs | Off-policy correction ratios in PPO/GRPO are wrong
    Prefix caching on by default | Disabled in rollout paths | Cached prefixes produce stale logits → biased advantage estimates
    Weight-update model drift | Matched rollout and training weights exactly | On-policy algorithm silently becomes off-policy
    lm_head computed in bf16/fp16 | Forced fp32 | Logit tails shift; small but compounding in long rollouts

    Any eval or training run on V0 should be treated as suspect until re-baselined. This is the exact failure mode that makes RL papers irreproducible.

    Why This Matters More Than It Looks

    The logprob bug by itself is enough to invalidate PPO importance sampling ratios. The prefix-caching bug means advantage estimates were computed against stale state. Together, the reward curves teams liked from V0 runs may reflect infrastructure artifacts, not learning. A run that looked like it converged may have converged to the bug's attractor rather than the policy's.
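
To see why a logprob mismatch alone breaks importance ratios, consider a fully on-policy step where rollout and trainer share identical weights, so the ratio should be exactly 1. The sketch below uses temperature scaling as a stand-in for "processing"; which processing steps V0 actually applied is not spelled out here, so that choice is an assumption:

```python
import math

def log_softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

logits = [2.0, 1.0, 0.1]   # toy logits for one decoding step
token = 0                  # index of the sampled token

# Trainer recomputes the log-prob from raw logits (temperature 1.0).
trainer_logprob = log_softmax(logits, 1.0)[token]

# Rollout engine reporting the same stage: the ratio is 1, as it should be.
raw_rollout_logprob = log_softmax(logits, 1.0)[token]
ratio_consistent = math.exp(trainer_logprob - raw_rollout_logprob)

# Rollout engine reporting a processed value (assumed: temperature 0.7).
processed_rollout_logprob = log_softmax(logits, 0.7)[token]
ratio_mismatched = math.exp(trainer_logprob - processed_rollout_logprob)
```

With these toy numbers the mismatched ratio lands well below 1 even though nothing about the policy changed, so a clipped PPO or GRPO objective would be correcting for drift that does not exist.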

    NVIDIA independently published lossless speculative decoding for RL rollout this week: 2.5× end-to-end speedup at 235B parameters without changing the policy distribution. The load-bearing word is 'lossless.' Speculative decoding can bias sampling at temperature > 0 when an implementation relaxes exact rejection sampling, which is why it has been treated as unsafe for on-policy RL data. If the claim holds under replication, the combination of V1 fixes and lossless spec-decode makes RL post-training both correct and fast for the first time in a production stack.


    The Broader Harness Problem

    The V1 fixes land alongside evidence that harness engineering alone swings agent benchmark scores 10-20 points on identical base models (tau2-bench). Terminal-Bench 2.1 patched 28 of 89 tasks and absolute scores moved up to 12 points. Rankings held. Magnitudes did not. What a top-line score doesn't tell you: internal leaderboards that don't log (model_version, harness_version) as a tuple are producing numbers that cannot be compared week to week.
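
One way to enforce the (model_version, harness_version) tuple is to make it impossible to record a score without both. A minimal sketch; the class, field names, version strings, and scores are hypothetical and not tied to any particular leaderboard tool:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LeaderboardEntry:
    model_version: str
    harness_version: str
    benchmark: str
    score: float

    def __post_init__(self):
        # Reject entries that omit either half of the version tuple.
        if not self.model_version or not self.harness_version:
            raise ValueError("model_version and harness_version are both required")

def comparable(a, b):
    """Scores are comparable only on the same benchmark and harness version."""
    return a.benchmark == b.benchmark and a.harness_version == b.harness_version

week_1 = LeaderboardEntry("model-a", "harness-2.0", "terminal-bench", 41.0)
week_2 = LeaderboardEntry("model-b", "harness-2.1", "terminal-bench", 53.0)
same_harness = LeaderboardEntry("model-b", "harness-2.0", "terminal-bench", 47.0)
```

Here `comparable(week_1, week_2)` is false, so the 12-point gap between them cannot be attributed to the model; only the `week_1` versus `same_harness` comparison isolates the model change.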

    The contamination runs both directions. V0's bugs inflated some results and suppressed others. The only way to know which of your checkpoints survived is to replay against V1. If the ranking doesn't move, the migration still pays for itself in audit trail. If it does move, you needed to know before shipping another model card.

    Action items

    • Re-run the two most important RL training runs from last quarter on vLLM V1 and diff reward curves against V0 baselines
    • Add harness_version as a mandatory field in every internal leaderboard entry — rerun any model comparison from the last 90 days that didn't hold harness constant
    • Evaluate lossless speculative decoding for your RL post-training rollout loop — run an A/A on reward curves before trusting the A/B
    • Pin vLLM version in all training and eval Dockerfiles with hash verification

    Sources: AINews · TLDR AI · Techpresso · Engineer's Codex

  3. 03

    Anthropic's Per-Result Pricing Breaks Your Cost Model

    The Token Model Is Dead for Agentic Work

    Anthropic is the first frontier lab to move off token-based metering, charging per delivered result instead of per token. In the same week it shipped Claude Code Auto Mode (targeting 90% autonomous task completion), took the Agent SDK to GA, and launched Managed Agents with Dreams (trajectory replay), Outcomes (success-criteria grading), and Multiagent orchestration. The old eval question was cost per million tokens at a fixed quality bar. The new question is cost per successful outcome. The two almost never correlate cleanly.

    Dimension | Claude Code Auto Mode | OpenAI Agents SDK | Your in-house router
    Pricing unit | Per result (new) | Per token | Per token (passthrough)
    Model selection | Vendor auto-picks per task | Developer-specified | Hand-tuned heuristics
    Autonomy target | ~90% (unverified) | Not stated | Varies
    Reproducibility | Non-deterministic | Deterministic if pinned | Deterministic

    The Sonnet→Opus advisor pattern from the Code with Claude event claims frontier-quality benchmarks at 5× lower cost, routing cheap calls through Sonnet and escalating to Opus conditionally. No escalation-rate disclosure, no task breakdown, no public eval harness. The thing this doesn't tell you is how well-calibrated the escalation decision is on your distribution. The economics only work if escalation rate is low and the router is right when it escalates.
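
The dependence on escalation rate can be made explicit with a back-of-envelope model. This sketch assumes every task first hits the cheap model and a fraction escalates to the strong one; the per-task costs below are hypothetical placeholders, not published prices:

```python
def advisor_cost_per_task(cheap_cost, strong_cost, escalation_rate):
    """Expected cost when every task hits the cheap model and some escalate."""
    return cheap_cost + escalation_rate * strong_cost

def max_escalation_rate(cheap_cost, strong_cost, target_ratio):
    """Largest escalation rate that keeps cost at strong_cost / target_ratio.

    A negative result means the target is unreachable at these prices.
    """
    budget = strong_cost / target_ratio
    return (budget - cheap_cost) / strong_cost

# Hypothetical per-task costs in arbitrary units.
strong = 1.0
cheap = 0.1
ceiling = max_escalation_rate(cheap, strong, target_ratio=5)
cost_at_ceiling = advisor_cost_per_task(cheap, strong, ceiling)
```

Under these assumed prices a 5× saving survives only while roughly one task in ten escalates, which is why the undisclosed escalation rate matters more than the headline multiple.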

    FinOps dashboards keyed on $/1M tokens are already stale. They need a second axis: $/successful-task. That requires task-level success instrumentation most teams don't have.

    What This Changes in Practice

    When the vendor auto-selects models per call, A/B tests that assume a fixed-model baseline produce noisy results. The same prompt hits different weights across runs. Log the chosen model per call, pin models for experiments, and move to variance-aware significance testing.
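
A percentile bootstrap is one simple variance-aware option for comparing task success rates; it is named here as an illustration, not as the only suitable test. The sketch assumes binary per-task outcomes and uses made-up data:

```python
import random

def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(b) - mean(a) over binary outcomes."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(
            sum(resampled_b) / len(resampled_b) - sum(resampled_a) / len(resampled_a)
        )
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical outcomes: 60/100 successes for control, 62/100 for treatment.
control = [1] * 60 + [0] * 40
treatment = [1] * 62 + [0] * 38
ci_low, ci_high = bootstrap_diff_ci(control, treatment)
```

An interval that straddles zero, as this one does, says the two-point lift is indistinguishable from run-to-run noise at this sample size. Resampling within each logged model choice is a natural extension once the chosen model is recorded per call.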

    The routing implication is concrete. Short, deterministic calls where token pricing is already near floor will look worse on outcome pricing. Multi-step agentic work where completion rate drives cost will look better, possibly much better. The only way to know which side your workload falls on is to run the A/B on real traffic.

    Rate limits doubled, not just pricing

    Claude Code 5-hour rate limits doubled across Pro/Max/Team/Enterprise, peak-hour throttling is gone, and Opus API ceilings are up, sourced to the 220K-GPU Colossus 1 lease. Any agent workflow that was throttled needs rebenchmarking. Last quarter's finding that "the agent gives up on long tasks" may have been measuring the rate limiter, not the model. I flagged that risk in the prior column. The fix is cheaper than the analysis we did last time.

    EU AI Act extension buys 16 months

    The high-risk deadline moved from August 2026 to December 2027. Keep the compliance roadmap, slow the hiring ramp, reclaim the budget for evaluation work. Connecticut SB5 (codifying that automated decisions are not a defense to discrimination) has a shorter fuse. Whistleblower triggers at 10^26 FLOPs go live October 2026.

    Action items

    • Rebuild LLM cost attribution dashboards to track $/successful-task alongside $/1M-tokens — start with a binary task_success label piped into telemetry
    • Spike the Sonnet→Opus advisor pattern against your top-3 LLM workloads — measure quality delta vs. pure-Opus on your existing eval set
    • Rebenchmark Claude Code throughput under new 2× rate limits — compare wall-clock and success@k versus April results
    • Log cumulative training FLOPs as a first-class metric in your experiment tracker — CT SB5 makes this a regulatory trigger at 10^26
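
The dashboard rebuild in the first action item reduces to two ratios over the same telemetry. A minimal sketch, assuming each record carries `cost_usd`, `tokens`, and a boolean `task_success`; the field names and figures are illustrative:

```python
def cost_per_successful_task(records):
    """Total spend divided by the number of tasks that succeeded."""
    records = list(records)
    total_cost = sum(r["cost_usd"] for r in records)
    successes = sum(1 for r in records if r["task_success"])
    if successes == 0:
        return float("inf")   # spend with nothing to show for it
    return total_cost / successes

def cost_per_million_tokens(records):
    """Total spend normalised to one million tokens."""
    records = list(records)
    total_cost = sum(r["cost_usd"] for r in records)
    total_tokens = sum(r["tokens"] for r in records)
    return total_cost * 1_000_000 / total_tokens

# Hypothetical telemetry: four tasks, two of which succeeded.
telemetry = [
    {"cost_usd": 0.5, "tokens": 1000, "task_success": True},
    {"cost_usd": 0.5, "tokens": 1000, "task_success": False},
    {"cost_usd": 0.5, "tokens": 1000, "task_success": True},
    {"cost_usd": 0.5, "tokens": 1000, "task_success": False},
]
per_success = cost_per_successful_task(telemetry)
per_million = cost_per_million_tokens(telemetry)
```

The token metric is identical whether the tasks succeed or fail; only the per-success metric moves when completion rate changes.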

    Sources: AI Weekly · Future Perfect · TLDR AI · TLDR · Techpresso · ben's bites

  4. 04

    Rotate Now: Braintrust Breach + Four Critical ML-Stack CVEs

    The Eval Platform Had Your Keys

    Braintrust, the LLM eval platform most ML teams use to score prompts, agents, and RAG chains, confirmed unauthorized access to an AWS account holding customer API keys. If anyone on the team ever pasted an OpenAI, Anthropic, Bedrock, or Vertex key into a Braintrust project, which is the standard integration path, treat those keys as compromised. The monetization pattern on leaked LLM keys is well-understood by now: attackers burn them on inference against the victim's billing account, usually through proxy resellers. The thing the platform alert doesn't tell you is that detection lag is measured in days of spend.
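
Shortening that detection lag does not need much machinery. The sketch below flags upward spikes in daily spend using a modified z-score (median and MAD, with the usual 0.6745 constant and 3.5 cutoff); the spend series is invented, and pulling real figures from a provider's billing export is left out:

```python
import statistics

def anomalous_days(daily_spend, threshold=3.5):
    """Return (day_index, spend) pairs whose modified z-score exceeds threshold."""
    med = statistics.median(daily_spend)
    mad = statistics.median(abs(x - med) for x in daily_spend)
    if mad == 0:
        # Perfectly flat history: any day above the median is a spike.
        return [(i, x) for i, x in enumerate(daily_spend) if x > med]
    return [
        (i, x)
        for i, x in enumerate(daily_spend)
        if 0.6745 * (x - med) / mad > threshold
    ]

# Hypothetical trailing 14 days of spend in USD.
trailing_14 = [40, 42, 39, 41, 43, 40, 38, 41, 42, 40, 39, 41, 180, 220]
flagged = anomalous_days(trailing_14)
```

Median-based scoring is used because a burst of stolen-key spend would inflate a mean and standard deviation enough to hide itself.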

    Four Simultaneous Critical CVEs in the ML Stack

    Component | CVE / Severity | Attack | Where it hurts you
    Apache Iceberg | CVE-2026-42812 / 9.9 | Write table metadata to attacker location | Silent feature poisoning in training datasets
    DocsGPT 0.15-0.16 | CVE-2026-26015 / 9.8 | MCP test bypass → RCE | RAG/agent stacks with MCP endpoints
    NVIDIA NVFlare | CVE-2026-24178 / 9.8 | AuthZ bypass + privesc | Federated learning trust boundary collapse
    Ollama <0.17.1 | CVE-2026-7482 / 9.1 | Heap OOB read in GGUF loader | Dev workstations running local LLMs

    Iceberg is the one to take seriously for data teams. Write access to the catalog is write access to the training data's ground truth. A compromised metadata path lets an attacker redirect feature materialization to poisoned tables, and no downstream query engine will flag it.
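
A downstream job can refuse to materialize features when the metadata it is pointed at looks wrong. This is a generic sketch using only `hashlib`, not an Iceberg API; the bucket names, the pinned digest workflow, and both helper functions are assumptions:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def manifest_unchanged(manifest_bytes: bytes, pinned_sha256: str) -> bool:
    """True if the manifest still matches the digest recorded at approval time."""
    return sha256_hex(manifest_bytes) == pinned_sha256

def location_allowed(metadata_location: str, allowed_prefixes: tuple) -> bool:
    """True if the metadata path sits under an approved storage prefix."""
    return metadata_location.startswith(allowed_prefixes)

# Hypothetical values for illustration only.
approved_manifest = b"manifest-v1"
pinned = sha256_hex(approved_manifest)
allowed = ("s3://ml-features/warehouse/",)

ok_hash = manifest_unchanged(b"manifest-v1", pinned)
tampered_hash = manifest_unchanged(b"manifest-v1-poisoned", pinned)
ok_path = location_allowed("s3://ml-features/warehouse/t/metadata/v3.json", allowed)
bad_path = location_allowed("s3://attacker-bucket/t/metadata/v3.json", allowed)
```

Either check failing should stop the DAG before training data is read, since the query engine itself will not object to a redirected table.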

    If requirements.txt isn't hash-pinned and the Iceberg catalog isn't tightly ACL'd, this week's CVEs already walked through the ML stack.

    The Worm Is Cross-Ecosystem Now

    The Mini Shai-Hulud worm ran for roughly 48 hours on April 29-30. It compromised PyTorch Lightning on PyPI, SAP packages on npm, and intercom-client and intercom-php on Packagist. It stood up 1,800 attacker-controlled GitHub repositories from stolen credentials. The mechanism is credential reuse across ecosystems: lift a GitHub token from an npm install, pivot to PyPI and Packagist, republish. Fully automated. Copycats follow within weeks, because they always do.

    Separately, internet-exposed inference infrastructure is being actively enumerated. AIMap fingerprints MCP servers, Ollama, vLLM, LiteLLM, LangServe, Gradio, and ComfyUI through Shodan and Nuclei. If any of those was stood up on a VPC with a public subnet, even briefly for a demo, assume the instance is already in someone's catalog.

    Action items

    • Rotate every LLM provider API key ever stored in Braintrust — check provider billing dashboards for anomalous usage in the trailing 14 days
    • Patch Apache Iceberg and tighten catalog write ACLs — add manifest hash verification to downstream feature materialization DAGs
    • Upgrade Ollama to ≥0.17.1 across all dev workstations and enforce GGUF provenance checks from internal registry only
    • Enforce hash-pinned lockfiles (pip-tools, uv lock) on all training and serving images — unpinned deps should fail CI
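
The "unpinned deps should fail CI" rule can be approximated with a short check over a requirements file. This sketch only looks for an exact `==` pin and a `--hash=` option on each requirement; it is not a full parser, and the sample fragment uses a placeholder digest rather than a real hash:

```python
def unpinned_requirements(text):
    """Return requirements lacking an exact '==' pin or a '--hash=' option."""
    # Join backslash continuations so each requirement is one logical line.
    logical = text.replace("\\\n", " ")
    offenders = []
    for line in logical.splitlines():
        line = line.split(" #")[0].strip()   # drop inline comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue                         # skip blanks, comments, options
        if "==" not in line or "--hash=" not in line:
            offenders.append(line.split()[0])
    return offenders

# Hypothetical lockfile fragment; <digest> is a placeholder.
sample = """\
# training image dependencies
numpy==1.26.4 \\
    --hash=sha256:<digest>
requests>=2.0
torch==2.3.0
"""
offenders = unpinned_requirements(sample)
```

A CI step would exit non-zero when `offenders` is non-empty. Installing with pip's `--require-hashes` flag enforces the same property at install time, so the check above is a fast early warning rather than the last line of defense.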

    Sources: SANS AtRisk · TLDR InfoSec · The Hacker News · CyberScoop

◆ QUICK HITS

  • Update: Anthropic's Colossus lease is 100% of xAI's 300MW site — rate limits doubled for Claude Code and peak-hour throttling removed; rebenchmark throughput before next planning cycle

    TLDR

  • GPT-5.5 claims 52.5% hallucination reduction on 'high-stakes prompts' — no eval set, sample size, or rubric disclosed; treat as hypothesis and rerun your factuality suite before routing traffic

    ben's bites

  • Halodoc's self-healing CDC pipeline cuts recovery from 45min to <5min using checkpoint rewind + consistency checks — the primitives port to most Spark/CDC stacks in one sprint

    TLDR Data

  • Fivetran compiled SQLGlot with mypyc for ~5× parsing speedup — zero API change, ships as optional package; install today if SQL parsing sits in any hot path

    TLDR Data

  • GitHub uptime at 85.51% (about 3.5 hours of downtime per day at that rate), driven by 3.5× AI-agent load growth; audit every ML pipeline for GitHub-as-SPOF, including training triggers, model registry refs, and Actions-based evals

    The Pragmatic Engineer

  • Pennsylvania suing Character.AI for chatbot fabricating a medical license number — add credential-claim detection (regex + NER) as a blocking release metric for any LLM touching regulated domains

    The Hustle

  • Microsoft culling Copilot features across Windows/Xbox due to inference costs compressing margins — even with free OpenAI compute, ambient chatbot unit economics don't pencil; build cost-per-successful-task dashboards

    Aaron Holmes

  • vLLM + Mooncake prefix caching: 92.2% hit rate on agentic workloads (up from 1.7%), yielding 3.8× throughput — expect ~50-70% of that in production with messier prompt shapes

    AINews

  • PAN-OS CVE-2026-0300: unauthenticated root RCE under active exploitation, patches not until May 13 — if any ML endpoint sits behind PA-Series, isolate the management portal today

    CyberScoop

  • Lightweight LLM agents beat complex reranker stacks on Amazon ESCI: NDCG 0.29→0.41-0.45 — expect half the lift on in-house data, but the simpler architecture wins if latency budget allows

    TLDR Data

◆ Bottom line

The take.

Your baselines are lying across three layers simultaneously: vector retrieval halves at 500K documents (any eval under 50K is fiction), vLLM V0's four silent bugs contaminated RL reward curves, and Braintrust's breach means your LLM API keys may already be burning tokens on someone else's bill — fix the keys today, rerun the evals this sprint, and stop trusting demo-scale RAG benchmarks as production truth.

— Promit, reading as Data Science ·

Frequently asked

How do I quickly tell if my RAG eval is overstating production accuracy?
If your benchmark corpus is under 50K documents, assume it overstates deployed recall by 30–40 percentage points. The recall cliff for dense retrieval shows up between 100K and 1M documents depending on topical density, so any sub-50K eval should be deprecated as 'production-representative' and replaced with an ablation at 50K, 200K, and 500K on your actual corpus.
Why does dense retrieval collapse at scale while BM25 holds up better?
The two retrievers fail through orthogonal mechanisms. Dense retrieval suffers from embedding-space neighborhood density: as the corpus grows, semantically similar but irrelevant neighbors crowd canonical answers out of top-k. BM25 fails through term-overlap ambiguity, which scales much more slowly. That orthogonality — not either retriever being individually best — is the strongest argument for hybrid retrieval plus cross-encoder reranking.
Do the vLLM V1 fixes actually invalidate prior RL training runs?
Potentially yes. Four V0 issues — mismatched logprobs, default prefix caching in rollouts, weight drift between rollout and training, and bf16/fp16 lm_head — each independently bias policy gradients without crashing the run. PPO/GRPO importance ratios and advantage estimates computed under V0 may reflect infrastructure artifacts rather than learning. Re-baseline critical checkpoints on V1 and diff reward curves before trusting them.
What should cost dashboards track now that Anthropic prices per result?
Add a $/successful-task axis alongside $/1M-tokens, backed by a binary task_success label in telemetry. Token-pricing dashboards systematically mislead on agentic workloads where completion rate dominates cost, and they can't compare vendors that meter different units. Also log the chosen model per call, since vendor auto-selection breaks fixed-model A/B assumptions.
What's the immediate response to the Braintrust breach?
Rotate every LLM provider API key ever stored in Braintrust projects — OpenAI, Anthropic, Bedrock, Vertex, and any others — and audit provider billing dashboards for anomalous usage over the trailing 14 days. Leaked keys are typically monetized through proxy resellers burning inference against the victim's account, and detection lag is measured in days of spend rather than hours.
