PROMIT NOW · DATA SCIENCE DAILY · 2026-04-26

Anthropic Project Deal Exposes Capability as Hidden Weapon

· Data Science · 8 sources · 950 words · 5 min

Topics: LLM Inference · AI Capital · Agentic AI

Anthropic's Project Deal experiment proved that stronger models extract systematically better negotiation outcomes while the losing side perceives the deal as perfectly fair — the first empirical evidence that model capability is an invisible competitive weapon. Combine this with DeepSeek V4 Pro scoring #1 on agentic benchmarks while hallucinating 94% of the time on factual tasks, and the message is clear: your evaluation harness needs separate axes for 'can it do things' and 'does it know things,' because frontier models are diverging on these two dimensions faster than your benchmarks can track.

◆ INTELLIGENCE MAP

  1. 01

    The Invisible Capability Gap: Doing vs. Knowing Diverge

    act now

    Anthropic's Project Deal shows Opus agents beat Haiku agents across 186 deals while losers rated fairness identically. V4 Pro leads agentic benchmarks (1554 GDPval-AA) but hallucinates 94% on factual tasks. A new benchmark shows all frontier models lose ~50% on complex charts. Capability and reliability are now separate dimensions.

    Key stat: 94% hallucination rate · 3 sources
    Metrics tracked: V4 Pro hallucination · V4 Flash hallucination · Project Deal deals · chart perf drop
    Chart: V4 Pro factual hallucination 94% · V4 Flash factual hallucination 96% · complex-chart perf drop ~50%
  2. 02

    Vendor & Framework Dependencies Under Stress

    monitor

    PyTorch creator Soumith Chintala left Meta for Thinking Machines Lab. OpenAI faces a $100B+ lawsuit starting Monday that could disrupt API service. Google's $40B Anthropic investment bundles compute lock-in on GCP. Three independent risk vectors hit your provider stack simultaneously — abstraction layers are no longer optional.

    Key stat: $40B Google → Anthropic · 4 sources
    Metrics tracked: Google investment · OpenAI lawsuit · Anthropic valuation · Oracle DC push
    Chart: Google → Anthropic $40B · OpenAI lawsuit exposure $100B · Oracle DC capex ~$300B
  3. 03

    Three New Efficiency Techniques for Your Pipeline

    monitor

    Tool Attention cuts function-calling overhead 95% (47.3k→2.4k tokens/turn). Selective on-policy distillation on <10% 'confident-wrong' tokens matches full training with 47% memory savings. High-dimensional embeddings show interference — not time — drives forgetting, with direct RAG retrieval implications. Each is deployable without retraining your base model.

    Key stat: 95% tool-token reduction · 2 sources
    Metrics tracked: Tool Attention savings · distillation memory saved · token subset needed · Hyperloop param cut
    Chart: Tool Attention token reduction 95% · selective distillation memory savings 47% · Hyperloop param cut 50%
  4. 04

    Physical Infrastructure as Binding Constraint

    background

    Oracle's ~$300B data center push is straining Wall Street's ability to syndicate AI debt. Nvidia's Vera CPU alone requires 1.5TB RAM per server (4,600 Galaxy S26 equivalents), creating structural DRAM shortages. Maine nearly passed the first statewide data center moratorium. Compute pricing volatility is a function of physical constraints, not model efficiency.

    Key stat: 1.5TB RAM per Vera server · 3 sources
    Metrics tracked: Vera CPU RAM · phone equivalents · Oracle DC capex · X-energy IPO
    Chart: Vera CPU server 1,500 GB · Galaxy S26 Ultra 0.33 GB

◆ DEEP DIVES

  1. 01

    The Invisible Capability Gap — When 'Better at Doing' and 'Worse at Knowing' Coexist in the Same Model

    <h3>Three Independent Signals Point to the Same Problem</h3><p>In December 2025, <strong>69 Anthropic employees</strong> let Claude agents negotiate on their behalf in a controlled marketplace for a week. Over <strong>186 deals</strong> worth ~$4,000, Opus-backed agents systematically extracted better outcomes than Haiku-backed agents — Opus sellers earned more, Opus buyers paid less. The critical finding: <strong>participants using Haiku rated deal fairness almost identically to Opus users</strong>. The losing side had no idea they were losing. Anthropic called this an "uncomfortable implication."</p><p>On the same day, DeepSeek V4 Pro posted the <strong>#1 score among open-weight models on GDPval-AA</strong> (1554), an agentic benchmark — while scoring a <strong>94% hallucination rate on AA-Omniscience</strong>. V4 Flash is worse at 96%. And a separate benchmark found that <strong>all frontier models lose approximately 50% performance when chart complexity increases</strong>.</p><blockquote>Models are getting dramatically better at executing tasks while remaining unreliable at stating facts — and the gap between these capabilities is now invisible to end users.</blockquote><h3>Why This Is a New Evaluation Crisis</h3><p>GPT-5.5 tops benchmarks including CursorBench (72.8%) and Terminal-Bench (82.7%) while <strong>hallucination frequency remains high</strong> — multiple sources confirm this independently. The pattern across V4 Pro, V4 Flash, and GPT-5.5 is consistent: <strong>benchmark performance and production reliability are decoupling</strong>, not converging. 
Your single-score leaderboard ranking is now actively misleading.</p><h4>The Cross-Source Evidence</h4><table><thead><tr><th>Model</th><th>Capability Signal</th><th>Reliability Signal</th><th>Gap</th></tr></thead><tbody><tr><td><strong>V4 Pro</strong></td><td>#1 open agentic (1554 GDPval-AA)</td><td>94% hallucination (AA-Omniscience)</td><td>Extreme</td></tr><tr><td><strong>V4 Flash</strong></td><td>$0.14/M tok, competitive benchmarks</td><td>96% hallucination rate</td><td>Extreme</td></tr><tr><td><strong>GPT-5.5</strong></td><td>Tops CursorBench, Terminal-Bench</td><td>"Still hallucinates frequently"</td><td>Unknown (unquantified)</td></tr><tr><td><strong>All frontier</strong></td><td>Improving on simple tasks</td><td>~50% drop on complex charts</td><td>Structural</td></tr></tbody></table><hr><h3>Architectural Implications</h3><p>V4 Pro's divergence has a <strong>specific architectural explanation</strong>: it excels at tool use and code generation (agentic tasks that don't require factual recall) while failing catastrophically at open-domain knowledge retrieval. This maps directly to how you should architect agent pipelines — <strong>use V4 Pro for execution, never for unsupported factual claims</strong>.</p><p>Project Deal's finding compounds the risk. If you deploy agents in any adversarial or competitive context — <em>procurement, negotiation, pricing, bidding</em> — the model tier becomes a hidden variable that determines outcome quality. And the counterparty <strong>cannot perceive the disadvantage</strong>. This has immediate implications for multi-agent systems, automated marketplaces, and any pipeline where LLM agents interact with opposing objectives.</p><p><em>Methodology caveat on Project Deal: n=69, low stakes (play money), no reported confidence intervals or effect sizes. The finding is directional, not statistically definitive. But the direction is alarming enough to act on.</em></p>

    Action items

    • Add separate 'agentic capability' and 'factual reliability' axes to your model evaluation harness, with explicit hallucination rate measurement on your domain
    • Design adversarial cross-tier model pairing tests for any agentic workflow involving negotiation, procurement, or competitive interaction
    • Add chart complexity tiers to any multimodal evaluation pipeline processing financial dashboards, scientific figures, or data visualizations
    • Do NOT deploy V4 Pro or Flash for knowledge-intensive tasks without a retrieval/grounding layer — mandate RAG or tool-verified outputs for any factual claims
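
    The two-axis evaluation the first action item calls for can be sketched in a few lines. This is a minimal illustration, not a production harness: the exact-match scorers, threshold values, and all function names here are assumptions invented for the example.

    ```python
    from dataclasses import dataclass

    @dataclass
    class EvalResult:
        agentic_score: float       # fraction of execution tasks completed
        hallucination_rate: float  # fraction of factual answers that are wrong

    def evaluate(agentic_answers, factual_answers, agentic_gold, factual_gold):
        """Score capability and reliability on separate axes.

        Inputs are parallel lists of model outputs and gold answers;
        exact-match scoring is used for brevity -- a real harness
        would plug in task-specific checkers.
        """
        agentic_hits = sum(a == g for a, g in zip(agentic_answers, agentic_gold))
        factual_misses = sum(a != g for a, g in zip(factual_answers, factual_gold))
        return EvalResult(
            agentic_score=agentic_hits / len(agentic_gold),
            hallucination_rate=factual_misses / len(factual_gold),
        )

    def gate(result, max_hallucination=0.05, min_agentic=0.8):
        """A model passes only if BOTH axes clear their thresholds."""
        return (result.agentic_score >= min_agentic
                and result.hallucination_rate <= max_hallucination)
    ```

    In practice the agentic scorer would run task-specific checks (tests passing, tool calls succeeding) and the factual scorer would grade against a grounded answer key. The point is the shape: report and gate on both numbers, never their average.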

    Sources: DeepSeek V4's KV cache trick cuts 1M-context memory 8.7x · GPT-5.5 tops benchmarks at 2x cost but still hallucinates · Opus beats Haiku in blind negotiation

  2. 02

    Your Framework and Provider Dependencies All Got Riskier This Week — Three Vectors at Once

    <h3>Vector 1: PyTorch's Architect Left the Building</h3><p><strong>Soumith Chintala</strong> — creator of PyTorch — left Meta for <strong>Thinking Machines Lab</strong>, taking <strong>Piotr Dollár</strong> (Detectron, FAIR research lead), Weiyao Wang, and Kenneth Li with him. This is not a routine departure. Chintala has been the spiritual architect of PyTorch since inception, and Dollár shaped Meta's computer vision research for years.</p><p>PyTorch is open-source with broad community governance, so this doesn't kill the framework. But Meta has been the <strong>primary steward of PyTorch's bleeding-edge features</strong> — torch.compile, Dynamo, FSDP, DTensor. The risk isn't that PyTorch disappears; it's that <strong>development velocity on these subsystems slows</strong> while Meta backfills. And we have zero visibility into what Thinking Machines Lab is building — competing tooling is a real possibility.</p><hr><h3>Vector 2: OpenAI in Court Monday</h3><p>The Musk v. Altman trial begins <strong>Monday, April 27</strong>, seeking <strong>$100B+ in damages</strong> and reversal of OpenAI's for-profit conversion. 
Probability-weighted scenarios:</p><table><thead><tr><th>Scenario</th><th>Estimated Likelihood</th><th>Impact on Your Stack</th></tr></thead><tbody><tr><td>OpenAI wins decisively</td><td>~50%</td><td>Low — business as usual</td></tr><tr><td>Settlement before verdict</td><td>~25%</td><td>Low-Medium — possible structural concessions</td></tr><tr><td>Partial Musk win</td><td>~20%</td><td>High — IPO delayed, capital constrained, release cadence slows</td></tr><tr><td>Full Musk win ($100B+)</td><td>~5%</td><td>Critical — API pricing, availability, roadmap disrupted</td></tr></tbody></table><p><em>These are editorial estimates, not quantitative predictions.</em> But even low-probability, high-impact scenarios demand mitigation when they affect production infrastructure.</p><hr><h3>Vector 3: Google Locks Claude into GCP</h3><p>Google's <strong>$40B Anthropic investment</strong> at $350B valuation includes massive compute supply bundled alongside cash. $30B is contingent on performance milestones, creating <strong>aggressive capability release pressure</strong>. The structural implication: Claude will be increasingly optimized for GCP TPUs/GPUs, and cross-cloud parity may quietly degrade. <em>This isn't hypothetical — Azure already gets preferred OpenAI access patterns.</em></p><p>Sources offer a telling data point on the money flow: <strong>Anthropic's annualized revenue surged from ~$9B to $30B+ in roughly four months</strong> (233% growth), driven primarily by coding tools. This much demand chasing one provider's capacity means potential rate limit tightening or tier restructuring before infrastructure catches up.</p><blockquote>The cloud-model provider coupling now means your 'model selection' and 'cloud selection' decisions are the same decision. Maintaining multi-provider flexibility is a strategic infrastructure choice, not a feature preference.</blockquote><h3>The Convergence That Matters</h3><p>These three vectors aren't independent. 
If PyTorch development slows, alternatives optimized for specific clouds (JAX for GCP/TPU, for instance) gain relative appeal. If OpenAI's litigation constrains capital, GPT model release cadence may slow, pushing more workloads toward Anthropic — <strong>whose infrastructure is now GCP-coupled</strong>. The web of dependencies is tightening at every layer.</p>

    Action items

    • Audit your PyTorch stack for dependencies on Meta-maintained subsystems (torch.compile, FSDP, DTensor) vs. community-maintained components by end of this sprint
    • Implement a model-provider abstraction layer (LiteLLM, Portkey, or custom gateway) for all production LLM calls before the Musk v. Altman trial potentially disrupts OpenAI
    • Run baseline Claude evaluations on your top-3 production tasks this week to enable a data-driven switch if Google's $40B accelerates Anthropic's next model generation
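
    The abstraction layer in the second action item doesn't have to start as a full gateway product; a thin routing shim already decouples application code from any single vendor SDK. A minimal sketch (class and backend names are illustrative; LiteLLM and Portkey are hardened versions of the same idea):

    ```python
    from typing import Callable

    class ProviderGateway:
        """Route completions to a primary backend and fall back in
        order when a provider fails. Backends are plain callables
        (prompt -> text), so the rest of the codebase never imports
        a vendor SDK directly."""

        def __init__(self, backends: dict[str, Callable[[str], str]],
                     order: list[str]):
            self.backends = backends
            self.order = order

        def complete(self, prompt: str) -> tuple[str, str]:
            last_err = None
            for name in self.order:
                try:
                    # Return which provider served the request, for logging
                    return name, self.backends[name](prompt)
                except Exception as e:  # outage, rate limit, auth failure
                    last_err = e
            raise RuntimeError(f"all providers failed: {last_err}")
    ```

    Swapping the `order` list is then a config change, not a code change — which is exactly the flexibility a litigation-driven outage would test.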

    Sources: Soumith Chintala left Meta · GPT-5.5 tops benchmarks at 2x cost but still hallucinates · Your OpenAI API dependency just became a litigation risk · Opus beats Haiku in blind negotiation

◆ QUICK HITS

  • Update: GPT-5.5 delivers 45-56% token reduction over GPT-5.4 and tops CursorBench at 72.8% and Terminal-Bench at 82.7% — migrate existing GPT-5.4 API workloads this sprint for a free efficiency gain

    DeepSeek V4's KV cache trick cuts 1M-context memory 8.7x

  • Tool Attention reduces function-calling token overhead 95% (47.3k → 2.4k tokens/turn) via dynamic gating and lazy schema loading — audit your agentic tool-use prompts for token waste this sprint

    DeepSeek V4's KV cache trick cuts 1M-context memory 8.7x
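
    The lazy-schema half of that audit can be approximated even without Tool Attention's learned gating: ship full JSON schemas only for tools that look relevant to the current turn, and stub the rest. A toy sketch with a naive keyword gate standing in for the learned one (tool names and schemas are invented for illustration):

    ```python
    import json

    TOOL_SCHEMAS = {
        "get_weather": {"description": "Current weather for a city",
                        "parameters": {"city": "string"}},
        "search_flights": {"description": "Search flights between airports",
                           "parameters": {"origin": "string", "dest": "string"}},
        "create_invoice": {"description": "Create a billing invoice",
                           "parameters": {"customer_id": "string", "amount": "number"}},
    }

    def build_tool_context(query: str) -> str:
        """Ship full schemas only for tools whose name tokens appear
        in the query; every other tool becomes a one-line stub the
        model can ask to expand."""
        lines = []
        for name, schema in TOOL_SCHEMAS.items():
            relevant = any(tok in query.lower() for tok in name.split("_"))
            if relevant:
                lines.append(f"{name}: {json.dumps(schema)}")
            else:
                lines.append(f"{name}: <schema available on request>")
        return "\n".join(lines)
    ```

    Even this crude gate shrinks the per-turn tool context roughly in proportion to how many tools are irrelevant to the turn, which is where the reported 47.3k→2.4k savings comes from.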

  • Selective on-policy distillation: training on <10% 'confident-wrong' tokens nearly matches full distillation with 47% memory savings — log per-token teacher confidence and student correctness to identify the subset

    DeepSeek V4's KV cache trick cuts 1M-context memory 8.7x
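
    The logging recipe in this item reduces to a per-token predicate. A sketch, assuming you can read off the teacher's probability of each reference token and the student's argmax prediction (the 0.9 confidence threshold is an illustrative assumption, not the paper's value):

    ```python
    def confident_wrong_mask(teacher_probs, student_preds, targets,
                             conf_threshold=0.9):
        """Flag the 'confident-wrong' subset for selective distillation:
        tokens where the teacher puts high probability on the reference
        token but the student's argmax prediction misses it.

        teacher_probs[i] -- teacher probability of targets[i]
        student_preds[i] -- student argmax token id at position i
        """
        return [
            p >= conf_threshold and s != t
            for p, s, t in zip(teacher_probs, student_preds, targets)
        ]
    ```

    Training loss is then computed only where the mask is True, which is how the subset stays under 10% of tokens while retaining most of the distillation signal.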

  • Embedding spaces naturally reproduce human forgetting curves with interference (not time) driving recall degradation — rethink your vector DB strategy: prioritize deduplication and embedding diversity over freshness

    Opus beats Haiku in blind negotiation
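
    The deduplication side of that recommendation can start as a greedy similarity filter at ingestion time. A minimal sketch (the 0.95 threshold is an illustrative assumption, and large corpora would use an ANN index instead of this O(n²) loop):

    ```python
    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    def dedup_embeddings(vectors, threshold=0.95):
        """Keep a vector only if it is below `threshold` cosine
        similarity to everything already kept, so near-duplicates
        never enter the index and interfere with recall."""
        kept = []
        for v in vectors:
            if all(cosine(v, k) < threshold for k in kept):
                kept.append(v)
        return kept
    ```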

  • Trojanized Bitwarden CLI published to npm registry in escalating supply chain campaign — audit and pin npm/PyPI dependencies in your ML pipelines, verify hashes, and run dependency scanning in CI

    Your ML pipeline's npm deps just got riskier

  • Anthropic revenue surged from $9B to $30B+ annualized in ~4 months, driven by coding tools — expect rate limit tightening or tier restructuring as demand outpaces capacity

    Opus beats Haiku in blind negotiation

  • US privacy bills (SECURE Data Act, GUARD Financial Data Act) now explicitly target 'AI profiling' and data minimization — document which of your models score, rank, or classify users as proactive compliance

    Your ML pipeline's npm deps just got riskier

  • MiMo-V2.5 Voice: open-source 8B speech model handling Mandarin, English, 8 Chinese dialects, and code-switching — benchmark against Whisper if you have multilingual ASR requirements

    Opus beats Haiku in blind negotiation

BOTTOM LINE

Frontier models are getting dramatically better at executing tasks while remaining catastrophically unreliable at stating facts — V4 Pro is #1 on agentic benchmarks and hallucinates 94% of the time on knowledge tasks — and Anthropic's Project Deal experiment proved the capability gap is invisible to the disadvantaged party. If your eval harness uses a single score, you're measuring the wrong thing. If your production stack depends on a single provider, the PyTorch creator's departure, OpenAI's $100B lawsuit starting Monday, and Google locking Claude into GCP mean your provider abstraction layer is overdue.

Frequently asked

How should evaluation harnesses change in response to the capability-reliability divergence?
Split evals into two independent axes: agentic capability (task execution, tool use, code) and factual reliability (hallucination rate on your domain). A single leaderboard score now hides the gap — V4 Pro ranks #1 on GDPval-AA while hallucinating 94% on AA-Omniscience. Measure both on representative workloads before model selection.
Is it safe to deploy DeepSeek V4 Pro or Flash in production?
Only for execution-oriented tasks with grounded inputs — never for open-domain factual claims. V4 Pro hallucinates 94% and V4 Flash 96% on AA-Omniscience, so any factual output must be gated by RAG, tool-verified retrieval, or citation checks. They're viable for code generation, orchestration, and tool calls where correctness is externally verifiable.
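
The gating described above can be prototyped as a crude lexical check before anything learned: accept a factual answer only if each of its sentences overlaps heavily with some retrieved passage. A sketch (the 0.6 overlap threshold and whitespace tokenization are illustrative assumptions; production gates typically use NLI or citation-verification models):

```python
import re

def grounded(answer: str, retrieved_passages: list[str],
             overlap: float = 0.6) -> bool:
    """Crude lexical grounding gate: every sentence of `answer` must
    share at least `overlap` of its tokens with a retrieved passage."""
    sentences = [s for s in re.split(r"[.!?]\s*", answer) if s.strip()]
    for sent in sentences:
        toks = set(sent.lower().split())
        supported = any(
            len(toks & set(p.lower().split())) >= overlap * len(toks)
            for p in retrieved_passages
        )
        if not supported:
            return False
    return True
```
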
What does Project Deal actually prove, given its sample size and low stakes?
It provides directional — not statistically definitive — evidence that stronger models extract better negotiation outcomes while the losing side perceives fairness as equivalent. With n=69, play money, and no reported effect sizes, it's not conclusive, but the asymmetry-invisibility finding is alarming enough to motivate adversarial cross-tier testing in any competitive agent deployment.
Why does the Google-Anthropic deal matter for model and cloud selection?
The $40B investment (with $30B milestone-contingent) ties Claude's optimization path increasingly to GCP infrastructure, meaning cross-cloud parity may quietly degrade over time. Combined with Anthropic's revenue surge from ~$9B to $30B+ annualized in four months, expect capacity pressure and tier restructuring. Model selection and cloud selection are now effectively the same decision.
What's the concrete risk from Soumith Chintala leaving Meta for Thinking Machines Lab?
PyTorch itself is safe due to open governance, but Meta-maintained bleeding-edge subsystems — torch.compile, Dynamo, FSDP, DTensor — face velocity risk while Meta backfills leadership. Audit your stack for dependencies on these specific components versus broadly community-maintained ones, and watch Thinking Machines Lab for potentially competing tooling.
