Model Eval Blind Spots: Hallucinations, Snap Tools, Moods
Topics: Agentic AI · LLM Inference · AI Capital
Three independent findings converge on one conclusion: your model evaluation infrastructure has critical blind spots. VLMs confidently hallucinate descriptions of images they never saw — and standard benchmarks miss it entirely. Reasoning models snap-decide tool selection in their first few tokens before the chain-of-thought even begins. And Anthropic just confirmed 'functional emotions' in Claude that shift its output behavior. Your eval harness is measuring accuracy on the easy cases while the failure modes heading for production go undetected.
◆ INTELLIGENCE MAP
01 Three Model Evaluation Blind Spots Exposed Simultaneously
Act now: VLMs hallucinate on null/unseen images with high confidence, reasoning models decide tool routing in their first tokens via pattern matching rather than deliberation, and Claude has confirmed 'functional emotions' influencing outputs. Each breaks a core assumption in standard eval pipelines.
- VLM null-input catch
- Tool decision point
- Claude emotions
- Eval coverage (estimated): 35
02 US Compute Supply Crisis: 50% of Data Centers Face Delays or Cancellation
Monitor: Half of planned US data center builds are expected to be delayed or canceled in 2026. Transformer lead times have stretched from 2 to 5 years. The federal budget redirects $15B toward AI supercomputers. Meanwhile, polar-coordinate KV cache compression achieves 8x memory savings at 2 bits — efficiency research is now survival planning.
- DC builds at risk: ~50% delayed or canceled
- Transformer lead time: 5 years (up from 2)
- Federal AI compute: $15B proposed
- KV cache savings: 8x at 2 bits
- Mercury Edit 2 speed: 10x (claimed)
03 Agent Security Gets Its First Structured Taxonomy
Monitor: DeepMind cataloged 6 specific exploit vectors that hijack autonomous agents in real-world deployment — the first structured taxonomy for agent red-teaming. Separately, pure end-to-end agents fail at robot control without human-designed building blocks, but agentic scaffolding closes the gap. Pattern: hybrid architectures with explicit guardrails win.
- Exploit categories: 6
- Source: Google DeepMind
- Architecture winner: hybrid with explicit guardrails
- 03 Pure E2E agents: fragile + exploitable
- 02 Scaffolded agents: viable with guardrails
- 01 Hybrid + taxonomy: structured defense
04 Anthropic vs OpenAI: Secondary Markets Price In Divergence
Background: Secondary market data shows $2B in unfilled Anthropic buy orders vs $600M in unsold OpenAI shares. OpenAI's COO reassigned, two execs on leave during IPO prep. Anthropic simultaneously blocked third-party tools from flat-rate plans, forcing per-token billing — a power move only viable from a position of strength.
- Anthropic demand: $2B in unfilled buy orders
- OpenAI unsold shares: $600M
- Hailo valuation drop: 58%
- Sarvam AI raise
- Anthropic (buyers, no sellers): $2B
- OpenAI (sellers, no buyers): $600M
◆ DEEP DIVES
01 Your Model Evals Are Blind to Three Newly-Documented Failure Modes — Here's What to Test
<h3>Three Assumptions Your Eval Harness Relies On — All Broken This Week</h3><p>Three independent findings published within the same cycle reveal that standard model evaluation misses entire categories of failure. Each breaks a different assumption. Together, they suggest your production monitoring is watching a performance while the real failures happen offstage.</p><hr><h4>1. VLMs Hallucinate on Images They Never Saw</h4><p>Vision-language models <strong>confidently describe images that were never provided</strong> — blank inputs, corrupted files, semantically irrelevant images — and current benchmarks fail to detect this behavior. This isn't conceptually new (text hallucination is well-documented), but the specific finding is that <strong>visual grounding benchmarks don't test for null-input behavior at all</strong>. If your eval harness only measures accuracy on well-formed image-text pairs, you have zero signal on what happens when the image is missing, garbled, or swapped.</p><blockquote>If your model gives a confident, coherent answer when the image is garbage, your production guardrails have a hole you've never measured.</blockquote><p><em>Caveat: no details provided on which specific models, datasets, or evaluation protocols were tested. The original paper is needed to assess rigor.</em></p><h4>2. Reasoning Models Decide Tool Selection Before Reasoning</h4><p>Research shows that reasoning models <strong>choose which tool to invoke in their first few tokens</strong> — before the chain-of-thought reasoning begins. Tool routing is <strong>pattern matching on prompt surface features</strong>, not deliberate analysis. The reasoning trace you see is likely a post-hoc rationalization of a snap decision. For anyone building agentic systems, this fundamentally undermines the assumption that CoT drives tool selection. Your chain-of-thought monitoring may be <em>watching the justification, not the decision</em>.</p><p>This connects directly to Thursday's deep dive on CoT faithfulness — that analysis showed reasoning traces can be unfaithful to the model's actual computation. Today's finding provides a <strong>specific, testable instance</strong>: tool selection. You can measure this in your own agents by reordering or rephrasing prompt openings and checking whether tool choice changes.</p><h4>3. Claude Has 'Functional Emotions' That Influence Output</h4><p>Anthropic discovered what they're calling <strong>"functional emotions"</strong> in Claude — internal states that measurably influence its output behavior. Separately, a sycophancy study found that AI agreement <strong>makes humans less likely to apologize and more likely to double down on incorrect positions</strong>. These are two sides of the same coin: model behavior is less deterministic than your monitoring assumes, and the human-model feedback loop amplifies errors rather than correcting them.</p><h4>The Pattern Across All Three</h4><p>Each finding attacks the same assumption: <strong>that models behave predictably on the inputs your eval suite tests</strong>. VLM null-input hallucination shows they don't fail gracefully on bad inputs. Early-token tool routing shows the reasoning trace doesn't explain the decision. Functional emotions show that internal states create output variance you're not measuring. The compound effect: your eval suite probably has a <strong>high false-negative rate</strong> on exactly the failure modes that matter most in production.</p>
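To make the first action item below concrete, here is a minimal sketch of a null-input probe suite, assuming a hypothetical `vlm_describe(image_bytes, prompt)` wrapper around your model and a simple refusal-marker heuristic. Treat it as a starting point for your own eval harness, not a finished benchmark.

```python
import random

BLANK_PNG = bytes(1024)  # all-zero bytes: stands in for a missing/blank image

def corrupt(image_bytes: bytes) -> bytes:
    """Flip a handful of random bytes so the file no longer decodes cleanly."""
    data = bytearray(image_bytes)
    for _ in range(16):
        data[random.randrange(len(data))] ^= 0xFF
    return bytes(data)

REFUSAL_MARKERS = ("cannot see", "no image", "unable to view", "not provided")

def null_input_probes(examples, vlm_describe, unrelated_image: bytes):
    """examples: list of (image_bytes, prompt) pairs. Returns per-condition failure rates."""
    failures = {"blank": 0, "corrupted": 0, "swapped": 0}
    for image, prompt in examples:
        variants = {
            "blank": BLANK_PNG,            # no usable visual content
            "corrupted": corrupt(image),   # garbled file
            "swapped": unrelated_image,    # semantically irrelevant image
        }
        for name, bad_image in variants.items():
            answer = vlm_describe(bad_image, prompt).lower()
            # A confident description with no refusal marker counts as a failure.
            if not any(marker in answer for marker in REFUSAL_MARKERS):
                failures[name] += 1
    return {name: count / len(examples) for name, count in failures.items()}
```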
Action items
- Add null-image, corrupted-input, and semantically-irrelevant image probes to your multimodal evaluation suite this sprint
- Test your agentic pipelines for prompt-surface-driven tool selection by reordering and rephrasing the first sentence of 50 representative prompts and measuring tool-choice stability (see the sketch after this list)
- Instrument output distribution monitoring across semantically equivalent prompts in any Claude-dependent pipeline
- Instrument disagreement calibration metrics in any human-facing AI recommendation system
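A rough sketch of the tool-routing stability check from the second action item, assuming two hypothetical helpers: `first_tool_called(prompt)`, which runs your agent and returns the name of the first tool it invokes, and `paraphrase_opening(prompt)`, which rewrites only the first sentence (e.g. via a second LLM call).

```python
from collections import Counter

def tool_choice_stability(prompts, first_tool_called, paraphrase_opening, n_variants=5):
    """Fraction of prompts whose tool choice survives rewrites of the opening sentence."""
    stable = 0
    for prompt in prompts:
        baseline = first_tool_called(prompt)
        variants = [paraphrase_opening(prompt) for _ in range(n_variants)]
        choices = Counter(first_tool_called(v) for v in variants)
        # Stable = every paraphrased opening still routes to the same tool.
        if set(choices) == {baseline}:
            stable += 1
    return stable / len(prompts)
```

A low score over ~50 representative prompts is evidence that routing is keyed to surface features of the opening rather than to the task itself.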
Sources: Your multimodal evals are blind — models hallucinate on unseen images and benchmarks miss it entirely · Your 7B model can match 70B via self-distillation — plus KV cache at 2 bits saves 8x memory
02 Half of Planned US Data Centers Won't Get Built — Efficiency Research Is Now Survival Planning
<h3>The Supply Side Is Breaking</h3><p>Two data points from this cycle quantify what the industry has been whispering: <strong>~50% of planned US data center builds are expected to be delayed or canceled in 2026</strong>, and electrical transformer lead times have stretched from <strong>2 years to 5 years</strong> — while AI companies need 18-month deployment cycles. China supplies <strong>40%+ of US battery imports</strong> and <strong>~30% of transformer and switchgear categories</strong>, adding geopolitical risk on top of supply chain physics.</p><blockquote>Cloud compute prices won't drop as projected. Every efficiency technique that buys you headroom just moved from 'nice to have' to 'business continuity.'</blockquote><p>Simultaneously, the federal government is signaling intent to build sovereign compute capacity: Trump's FY2027 budget proposes <strong>redirecting $15 billion</strong> from renewable energy programs toward fossil fuels and AI supercomputers, within a budget that boosts military spending to $1.5 trillion. <em>Caveat: Congress largely rebuffed domestic cuts last cycle, so the redirect is aspirational, not guaranteed.</em></p><hr><h3>The Efficiency Techniques That Matter Now</h3><p>Against this backdrop, two new efficiency techniques deserve immediate evaluation — distinct from Friday's Baseten perceiver approach:</p><h4>Polar-Coordinate KV Cache Compression</h4><p>A new quantization method represents KV cache values in <strong>polar coordinates at 2 bits per value</strong>, claiming <strong>99% accuracy retention</strong> and <strong>8x storage reduction</strong> vs. FP16. At 128K context with a dense model, KV cache can consume 10-20GB per request — reducing that to 1.25-2.5GB fundamentally changes serving economics. <em>Critical gap: the "99%" metric is undefined — perplexity? Downstream task accuracy? Needle-in-a-haystack? These are very different claims.</em></p><p>This is architecturally distinct from Friday's Baseten perceiver (which uses a learned 7M-parameter compression model). The polar-coordinate approach is a <strong>quantization scheme</strong>, not a learned model — meaning it's potentially more portable and has no additional training cost.</p><h4>Mercury Edit 2: Diffusion-Based Code Generation</h4><p>Claims <strong>10x faster code generation</strong> than autoregressive models with comparable output quality. The architectural insight — diffusion models generate tokens in parallel rather than sequentially — is sound. 
<strong>But without a published eval harness, specific baselines, pass@k rates, or ablation data, this is an unverified marketing claim.</strong> Monitor for independent reproduction only.</p><hr><h3>What This Means for Your Compute Planning</h3><table><thead><tr><th>Factor</th><th>Direction</th><th>Impact on You</th></tr></thead><tbody><tr><td><strong>DC supply</strong></td><td>~50% builds delayed/canceled</td><td>Cloud spot prices stay elevated through 2026</td></tr><tr><td><strong>Grid hardware</strong></td><td>5-year transformer lead times</td><td>No fast recovery possible even with capital</td></tr><tr><td><strong>Federal compute</strong></td><td>$15B proposed for AI supercomputers</td><td>Government clusters may become available for research partnerships</td></tr><tr><td><strong>KV cache</strong></td><td>8x compression (polar coords)</td><td>Long-context serving costs could drop 4-8x if validated</td></tr><tr><td><strong>Inference paradigm</strong></td><td>Diffusion LLMs emerging</td><td>Latency bottleneck may shift from serial decoding to something else</td></tr></tbody></table><p>The strategic read: <strong>supply-side constraints are structural and multi-year</strong>. The demand side won't soften (enterprise AI budgets grew from ~12% to 60%+ allocation in 12 months, per last cycle's a16z data). Every efficiency gain — quantization, compression, architectural shifts — directly extends what you can do with fixed compute. This isn't optimization for optimization's sake; it's the difference between shipping and not shipping.</p>
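No implementation details are given for the polar-coordinate scheme, so the sketch below is one plausible reading to use for your own benchmarking, not the published method: adjacent KV channels are paired into (x, y), converted to magnitude and angle, and each component quantized to 2 bits (so 2 bits per original value on average), with a per-row FP16 scale. Bit-packing the 2-bit codes into bytes is omitted for clarity.

```python
import numpy as np

def polar_quantize_2bit(kv: np.ndarray):
    """kv: (..., even_dim) float array. Returns 2-bit magnitude/angle codes plus FP16 scales."""
    x, y = kv[..., 0::2], kv[..., 1::2]
    r = np.sqrt(x * x + y * y)
    theta = np.arctan2(y, x)                         # angle in [-pi, pi]
    scale = r.max(axis=-1, keepdims=True) + 1e-8     # per-row magnitude scale (kept in FP16)
    r_code = np.clip(np.round(r / scale * 3), 0, 3).astype(np.uint8)                         # 2 bits
    t_code = np.clip(np.round((theta + np.pi) / (2 * np.pi) * 3), 0, 3).astype(np.uint8)     # 2 bits
    return r_code, t_code, scale.astype(np.float16)

def polar_dequantize(r_code, t_code, scale):
    """Reconstruct an FP32 approximation of the original interleaved KV values."""
    r = r_code.astype(np.float32) / 3 * scale.astype(np.float32)
    theta = t_code.astype(np.float32) / 3 * 2 * np.pi - np.pi
    out = np.empty(r_code.shape[:-1] + (r_code.shape[-1] * 2,), dtype=np.float32)
    out[..., 0::2], out[..., 1::2] = r * np.cos(theta), r * np.sin(theta)
    return out
```

Measured against FP16 storage, two 2-bit codes per channel pair works out to the claimed 8x reduction before accounting for the per-row scale overhead; whether accuracy holds is exactly what your perplexity and long-context retrieval tests should decide.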
Action items
- Benchmark polar-coordinate KV cache compression against your long-context serving workloads this sprint — measure perplexity, downstream task accuracy, and throughput at 2-bit and 4-bit levels
- Revise your 2026 compute budget upward by 15-25% to account for cloud pricing that won't decrease as projected
- Monitor Mercury Edit 2 benchmarks for independent reproduction — do not adopt until pass@k rates on HumanEval and MBPP are published by a third party
Sources: Your 7B model can match 70B via self-distillation — plus KV cache at 2 bits saves 8x memory · Federal budget redirects $15B to AI supercomputers — what this means for your compute roadmap · Your multimodal evals are blind — models hallucinate on unseen images and benchmarks miss it entirely
03 DeepMind's 6 Agent Traps — Your Red-Team Checklist Just Got Specific
<h3>From Vague Threat Models to Structured Agent Security</h3><p>Google DeepMind published a study identifying <strong>six specific 'traps' that can hijack autonomous AI agents</strong> in real-world deployment. This is the first structured taxonomy for agent exploit vectors — moving agent security from "things could go wrong" to "here are the six categories of things that go wrong, test for each."</p><p>This arrives at a critical moment. Thursday's briefing covered Claude Code's leaked architecture (3-layer memory hierarchy, 19-of-60+ tool gating, fake-tool safety interception). Friday's coverage showed multi-step agent reliability math: <strong>75% per-step accuracy across 5 steps = sub-24% end-to-end success</strong>. Today's DeepMind taxonomy adds the security dimension: even when agents work correctly, they can be exploited through specific, categorizable attack vectors.</p><blockquote>Pure end-to-end agents are fragile and exploitable; hybrid architectures with explicit guardrails and scaffolding are the viable path forward.</blockquote><hr><h4>The Compound Problem: Security + Snap Decisions</h4><p>Today's separate finding that reasoning models <strong>decide tool selection in their first few tokens</strong> — before reasoning begins — makes the DeepMind taxonomy more urgent. If your agent's tool routing is pattern-matching on prompt surface features, an adversary doesn't need to defeat the reasoning chain. They just need to manipulate the <strong>first few tokens of the prompt</strong> to redirect tool selection. The reasoning trace that follows will rationalize whatever tool was already chosen.</p><p>This means your red-team needs to test two things simultaneously:</p><ol><li><strong>Can an attacker trigger each of DeepMind's six trap categories?</strong></li><li><strong>Does prompt manipulation change tool selection independently of reasoning?</strong></li></ol><h4>Architectural Implications</h4><p>A parallel finding reinforces the pattern: AI models fail at robot control without human-designed building blocks, but <strong>agentic scaffolding closes the gap</strong>. The convergent signal across all three findings is clear: the winning architecture isn't a smarter model — it's a <strong>structured scaffold around a capable-but-exploitable model</strong>. Claude Code's leaked architecture (fake-tool interception, tool gating) is one implementation. DeepMind's taxonomy gives you the threat model to test against.</p><p>If you're building tool-using LLM systems, combine DeepMind's framework with the practical constraint that <strong>Anthropic is restricting third-party tool access</strong> for flat-rate subscribers (forcing per-token billing). Your agent architecture needs both security hardening against the six trap categories and <strong>provider-agnostic tool routing</strong> that doesn't break when one vendor changes terms.</p>
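As a concrete illustration of the 'explicit guardrails' half of that pattern, here is a minimal tool-gating sketch at the scaffold level. The names (`ALLOWED_TOOLS`, `execute_tool`) are hypothetical and not taken from Claude Code or any cited system; the point is that the scaffold, not the model, decides what actually runs.

```python
# Minimal scaffold-level tool gating: the model may request any tool,
# but only allowlisted tools within their call budget are executed.
ALLOWED_TOOLS = {
    "search_docs": {"max_calls": 10},
    "read_file":   {"max_calls": 20},
    # Destructive tools are simply absent from the allowlist:
    # the model can ask for them, but the scaffold never executes them.
}

def gated_call(tool_name: str, args: dict, call_counts: dict, execute_tool):
    """Execute a model-requested tool only if policy allows it; return a refusal otherwise."""
    policy = ALLOWED_TOOLS.get(tool_name)
    if policy is None:
        return {"error": f"tool '{tool_name}' is not permitted by the scaffold"}
    if call_counts.get(tool_name, 0) >= policy["max_calls"]:
        return {"error": f"tool '{tool_name}' exceeded its call budget"}
    call_counts[tool_name] = call_counts.get(tool_name, 0) + 1
    return execute_tool(tool_name, args)
```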
Action items
- Red-team your agentic deployments against DeepMind's six agent trap categories — create at least one adversarial test case per category this sprint
- Add tool-routing stability tests to your agent eval suite: measure whether tool selection changes when you manipulate only the first 10 tokens of otherwise identical prompts
- Ensure your agent architecture has fallback model routing and doesn't depend on a single LLM provider for tool execution
Sources: Your multimodal evals are blind — models hallucinate on unseen images and benchmarks miss it entirely · Your 7B model can match 70B via self-distillation — plus KV cache at 2 bits saves 8x memory
◆ QUICK HITS
Mercor data breach exposed up to 4TB of sensitive data including candidate PII; Meta paused its partnership — audit your training data vendor contracts for IP provenance and breach liability clauses
Your 7B model can match 70B via self-distillation — plus KV cache at 2 bits saves 8x memory
MLB robot umpires required re-measuring dozens of players' heights because self-reported data (a 'vanity feature') broke automated strike zone calibration — the cleanest real-world example of promoting a human-curated field to a production ML feature and watching it fail systematically
Federal budget redirects $15B to AI supercomputers — what this means for your compute roadmap
Qwen3.5-Omni demonstrated code generation from spoken instructions and video without explicit training — an emergent cross-modal capability, though 'emergent' often means the training data contained examples researchers didn't account for
Your multimodal evals are blind — models hallucinate on unseen images and benchmarks miss it entirely
Microsoft MAI-Transcribe-1 priced at $0.36/hour and 2.5x faster than alternatives — benchmark against your current transcription pipeline costs for a potential build-vs-buy shift
Your multimodal evals are blind — models hallucinate on unseen images and benchmarks miss it entirely
OpenAI shifted Codex to usage-based pricing — if you're budgeting for AI coding tools, your cost model just became variable rather than fixed
Your multimodal evals are blind — models hallucinate on unseen images and benchmarks miss it entirely
Netflix open-sourced VOID, a framework that erases video objects and rewrites the physics they left behind, trained on 100K video clips — monitor for video data augmentation and synthetic training data applications
Your 7B model can match 70B via self-distillation — plus KV cache at 2 bits saves 8x memory
Edge AI chip startup Hailo pursuing SPAC at <$500M, a 58% decline from its $1.2B peak — signals edge inference hardware is commoditizing as Nvidia Jetson, Qualcomm NPUs, and Apple Neural Engine squeeze niche players
Your model vendor bet just got riskier: OpenAI leadership crisis vs Anthropic's $2B demand surge
Chinese chipmakers reached 41% domestic market share; DeepSeek v4 reportedly runs entirely on Huawei chips — expect two divergent model ecosystems with different optimization profiles and test for subtle quality differences when evaluating Chinese-lab models
Your multimodal evals are blind — models hallucinate on unseen images and benchmarks miss it entirely
Turing Post's 6-level AI maturity framework names the real bottleneck: most orgs stall at L1→L2, making tacit business logic machine-readable — a construction firm's 224-commit data system broke when one developer left because cost code mappings lived in their head
Your ML pilots keep dying — this maturity framework explains the data infrastructure gap you're skipping
BOTTOM LINE
Your model evaluation infrastructure has three newly-documented blind spots — VLMs hallucinate on images they never saw, reasoning models snap-decide tool selection before the chain-of-thought begins, and Claude has confirmed 'functional emotions' that shift outputs — and the compute supply crisis (~50% of US data center builds facing delays, transformer lead times at 5 years) means you can't outspend your way past the efficiency techniques that are now survival-critical.
Frequently asked
- How do I test whether my VLM hallucinates on images it never actually received?
- Add adversarial null-input probes to your evaluation suite: submit blank images, corrupted files, and semantically irrelevant images alongside the original prompt, then measure whether the model produces confident descriptions. Standard visual grounding benchmarks don't test null-input behavior, so this gap has to be closed manually before production deployment.
- If tool selection happens in the first few tokens, is chain-of-thought monitoring still useful?
- CoT monitoring is still useful for auditing reasoning quality, but it's unreliable as a signal for why a tool was chosen. Research shows reasoning models pattern-match on prompt surface features to pick tools before the CoT begins, so the trace rationalizes a decision already made. Test this by reordering or rephrasing the first sentence of prompts and checking whether tool choice shifts.
- What does Anthropic mean by 'functional emotions' in Claude, and why does it matter for eval?
- Functional emotions are internal states in Claude that measurably shift its output behavior — not claims about subjective experience, but variance-inducing states your monitoring probably doesn't capture. For evaluation, it means semantically equivalent prompts can produce distributionally different outputs, so you need output-distribution monitoring and sycophancy drift metrics rather than single-sample accuracy checks.
- How should I budget for compute in 2026 given the data center supply issues?
- Revise compute budgets upward by roughly 15–25% and stop assuming cloud price decreases. Approximately half of planned US data center builds are expected to be delayed or canceled, transformer lead times have stretched to five years, and China supplies 40%+ of US battery imports and ~30% of transformer and switchgear categories — meaning structural, multi-year supply constraints that capital alone cannot fix.
- What's the practical difference between polar-coordinate KV cache compression and a learned perceiver approach?
- Polar-coordinate compression is a quantization scheme that represents KV values at 2 bits with claimed 8x storage reduction and no additional training cost, while perceiver-style approaches use a learned compression model (e.g., a 7M-parameter network) that must be trained. The quantization route is more portable across models, but the '99% accuracy' claim is undefined — validate on perplexity, downstream task accuracy, and long-context retrieval before adoption.
◆ RECENT IN DATA SCIENCE
- Meta just validated two inference infrastructure shifts in one week: KernelEvolve uses LLMs to auto-optimize GPU kernels…
- Anthropic's Project Deal experiment proved that stronger models extract systematically better negotiation outcomes while…
- DeepSeek V4-Flash serves frontier-competitive inference at $0.14/$0.28 per million tokens — 107x cheaper than GPT-5.5 ou…
- A single model scored 19% or 78.7% on the same benchmark by swapping only the agent scaffold — a 4x variance that makes…
- Google's Gemma 4 ships the most aggressive KV cache engineering in any open model — 83% memory reduction, 128K context o…