Public AI Benchmarks Are Contaminated — Build Your Own Evals
Topics: Agentic AI · LLM Inference · AI Capital
Contamination of public AI benchmarks is now confirmed — GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash all memorized SWE-bench solutions, and 59.4% of 'unsolved' problems had flawed tests that rejected correct answers. If your team is selecting models based on public benchmark scores, you're making procurement decisions on corrupted data. Harvey, Cursor, and Anthropic itself have already shifted to custom domain-specific evals — and reproducing a benchmark like SnitchBench costs as little as $10. Build your own eval suite this sprint or accept that your model selection is vibes-based.
◆ INTELLIGENCE MAP
01 Public Benchmarks Are Broken — Custom Evals Are the New Competitive Moat
Act now · Confirmed contamination across all three frontier labs invalidates SWE-bench-based model selection; companies like Harvey and Cursor are already building domain-specific evals as a competitive advantage, and the cost barrier is near zero.
02 Model Cost Collapse Forces Build-vs-Buy Recalculation
Act now · Qwen3.5-Flash at $0.50/1M tokens with open-source parity to frontier models, combined with $300B+ in committed infrastructure spend, means your AI cost structure and vendor lock-in are both under pressure simultaneously.
03 Human-AI Collaboration Underperforms on Judgment Tasks — Copilot UX Needs Redesign
Monitor · A Nature meta-analysis of 106 experiments shows human-AI teams perform worse than either alone on decision tasks due to automation bias, while HubSpot's AI lead confirms trust and explainability — not model capability — are the binding constraints on AI feature adoption.
04 Design Process Compression and AI-Native Team Structures
Monitor · Anthropic's head of design (ex-Figma) declares the discovery→mock→iterate cycle obsolete as AI tools compress prototyping from weeks to hours, while HubSpot's AI lead emphasizes that winning is an adoption problem requiring workflow redesign, not just better models.
05 Pre-Competitive Land Grabs and Bottom-Up GTM Playbooks
Background · SpaceX is giving away $600 hardware and cutting Starlink to $50/mo to lock in 10M subscribers before Amazon's Kuiper launch, while SendCutSend's hobbyist-to-Fortune-500 pipeline and Zillow's ChatGPT integration demonstrate three distinct patterns for building distribution moats before well-funded competitors arrive.
◆ DEEP DIVES
01 Your Model Selection Process Is Built on Contaminated Data — Here's How to Fix It This Sprint
<p>OpenAI's own audit confirmed what many suspected: <strong>GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash</strong> all memorized SWE-bench Verified solutions during training — reproducing original code fixes from memory, including variable names, inline comments, and implementation details. Worse, <strong>59.4% of problems</strong> the best model couldn't consistently solve had flawed test cases that rejected correct solutions. The benchmark the entire industry used for coding capability was simultaneously measuring recall <em>and</em> penalizing correct answers.</p><blockquote>If you've made any model selection decision based on SWE-bench scores in the past year, that decision is compromised.</blockquote><p>This isn't an isolated incident — it's the culmination of a pattern accelerating since GLUE was surpassed within a year of its 2018 launch. <strong>MMLU plateaued</strong> after GPT-4 hit 86.4% in March 2023. BIG-Bench Hard now shows near-perfect scores, replaced by BIG-Bench Extra Hard where the best model scores just 23.9%. The benchmark treadmill is spinning faster than the benchmarks can be refreshed.</p><hr/><h4>The Verification Gap Is a Product Architecture Problem</h4><p>GPT-5.2 now scores <strong>93.2% on GPQA Diamond</strong>, a benchmark where PhD-holding domain experts score about 65% and skilled non-experts with internet access barely beat random chance at 34%. First Proof — 10 research-level math problems from unpublished work by leading mathematicians including a Fields Medalist — took domain experts <em>days</em> to verify. As Scientific American noted, 'judging whether a proof is truly original is even tougher than judging if it is correct.' If your product positions AI as an expert-level tool in any specialized domain, your human reviewers increasingly <strong>cannot reliably validate the AI's output</strong>. That's not a QA problem — it's a product architecture problem affecting how you design human-in-the-loop workflows and manage liability.</p><h4>Behavioral Benchmarks Reveal Failure Modes Your QA Won't Catch</h4><p>New behavioral benchmarks like <strong>Vending-Bench</strong> — which drops an AI agent into a simulated vending machine business requiring inventory management, supplier negotiation, and price setting over months — reveal catastrophic failure modes invisible to short-form tests. Claude 3.5 Sonnet hallucinated that a product order had arrived, failed to restock, then spiraled into trying to 'close' the business. Gemini 2.0 Flash decided its business had failed and started offering to search for cat videos. A single run burns <strong>60-100M output tokens</strong>, meaning these breakdowns only emerge at scale. If you're building agentic features, your standard QA process is structurally incapable of catching these 'meltdown loops.'</p><h4>Who's Already Moved</h4><p>Harvey built <strong>BigLaw Bench</strong> for legal AI evaluation. Cursor built IDE-specific evals. Anthropic is explicitly advocating that PMs, CSMs, and salespeople should contribute eval tasks as pull requests — treating evals like CI/CD for AI products. Simon Willison reproduced a subset of SnitchBench for about <strong>$10</strong>. The barrier to entry is near zero; the barrier to <em>not</em> doing this is growing daily.</p>
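What 'evals as CI/CD' can look like in practice: below is a minimal sketch in Python, assuming eval cases live as JSONL in version control so anyone on the team can add one via pull request. The file path, case format, and `call_model` stub are illustrative placeholders, not Anthropic's or Simon Willison's actual tooling.

```python
import json
from pathlib import Path

# One eval case per JSON line, pulled from real user sessions or support
# tickets. PMs, CSMs, and salespeople add cases via pull request:
# {"id": "refund-policy-01", "prompt": "...", "must_include": ["30 days"]}

def call_model(prompt: str) -> str:
    # Stub: replace with your provider's API call.
    return f"(model output for: {prompt})"

def grade(output: str, case: dict) -> bool:
    # Simplest useful grader: required-substring checks. Real suites layer
    # on regex rules, structured-output validation, or an LLM-as-judge pass.
    return all(phrase.lower() in output.lower()
               for phrase in case.get("must_include", []))

def run_suite(path: str = "evals/cases.jsonl") -> float:
    lines = Path(path).read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]
    passed = {c["id"]: grade(call_model(c["prompt"]), c) for c in cases}
    for case_id, ok in sorted(passed.items()):
        print(f"{'PASS' if ok else 'FAIL'}  {case_id}")
    return sum(passed.values()) / len(passed)

if __name__ == "__main__":
    print(f"pass rate: {run_suite():.0%}")
```

Run it in CI on every model or prompt change; a falling pass rate is a regression signal no public benchmark can give you.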
Action items
- Build a custom eval suite based on your product's top 20 real-world use cases by end of March — pull from actual user sessions, support tickets, and edge cases
- Audit any model selection decisions made using SWE-bench Verified scores and re-evaluate with fresh, uncontaminated tests this sprint
- Implement long-horizon stress tests (60M+ token runs) for any agentic features before shipping to production
- Create a model behavioral profile matrix mapping model personality traits (hallucination patterns, failure modes, risk tolerance) to your product's specific use cases
Sources: BYOB: Build Your Own Benchmark · 🤖 AI Weekly Recap (Week 8)
02 The $0.50/1M Token Floor Changes Your Cost Structure — But Open-Source Parity Changes Your Moat
<p>Alibaba's <strong>Qwen3.5-Flash at $0.50 per 1M tokens</strong> isn't just cheap — it's an order of magnitude cheaper than most proprietary alternatives. The open-source Qwen3.5-35B-A3B runs on a <strong>single 32GB consumer GPU</strong> with 1M+ token context thanks to a MoE architecture (35B total parameters, only 3B active at inference). Benchmark claims — outperforming GPT-5-mini and Claude Sonnet 4.5 in reasoning — should be taken with appropriate skepticism given the contamination issues above, but even at 80% of claimed performance, the cost-performance ratio is transformative.</p><blockquote>If your competitive moat is 'we use GPT-4/Claude,' that moat just got a lot shallower. Your differentiation needs to come from proprietary data, unique workflow integration, or domain-specific fine-tuning — not from which foundation model you call.</blockquote><h4>Three Sources Converge on the Same Conclusion</h4><p>Multiple analyses this week independently reached the same verdict: open-source models have achieved <strong>functional parity</strong> with closed models for most production use cases. Qwen3 is matching top-tier closed models in GUI-based tasks and visual comprehension. Mistral's multi-year partnership with <strong>Accenture</strong> to scale open-weight models across their global enterprise client base means open-source isn't just for startups — it's getting enterprise distribution. Meanwhile, the infrastructure buildout continues to accelerate: <strong>Broadcom's AI chip revenue grew 74%</strong> last quarter, Google and Meta are doing custom chip deals, and MatX raised $500M for specialized LLM processors claiming 10x training performance over GPUs.</p><h4>The Cost Curve vs. The Capability Curve</h4><p>Here's the tension multiple sources surfaced: model capabilities are improving faster than costs are decreasing, but the <strong>floor for 'good enough'</strong> is dropping precipitously. The combined committed AI infrastructure spend announced this week — $110B OpenAI round, $100B OpenAI-AWS commitment, $100B Meta-AMD deal — totals over <strong>$300B</strong>. This capital will eventually translate to lower inference costs, but in the near term, the companies that control both models and infrastructure (Google, Meta, soon OpenAI+AWS) will have pricing power that pure API wrappers cannot match.</p><table><thead><tr><th>Model</th><th>Cost per 1M Tokens</th><th>Self-Hostable</th><th>Context Window</th></tr></thead><tbody><tr><td>Qwen3.5-Flash (API)</td><td>$0.50</td><td>Yes (32GB GPU)</td><td>1M+ tokens</td></tr><tr><td>GPT-5-mini</td><td>~$3-5 (estimated)</td><td>No</td><td>128K tokens</td></tr><tr><td>Claude Sonnet 4.5</td><td>~$3-5 (estimated)</td><td>No</td><td>200K tokens</td></tr></tbody></table><p><em>Note: GPT-5-mini and Claude Sonnet 4.5 pricing is estimated from current tier structures; Qwen benchmarks are self-reported.</em></p><h4>What This Means for Your Margins</h4><p>If your product's AI features are powered by a proprietary API, you need to articulate exactly what you're paying for above what Qwen3.5 delivers for free. The answer should be <strong>'enterprise support, compliance, and reliability'</strong> — not 'slightly better benchmark scores' (which are unreliable anyway). If you can't articulate that delta, your margins are at risk from any competitor willing to self-host. 
Nobel laureate <strong>Daron Acemoglu</strong> argues AI has yet to deliver meaningful productivity gains — use this as a stress test: for each AI feature, can you point to user data showing it delivers measurable value at current costs?</p>
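To put the table above in margin terms, here is a quick back-of-envelope script using its per-1M-token prices (the closed-model figures are midpoints of the estimated ranges, not published list prices) and an arbitrary illustrative volume of 100M tokens per month:

```python
# Monthly inference cost at a given token volume, using the per-1M-token
# prices from the table above. Closed-model prices are estimates.
PRICE_PER_1M = {
    "Qwen3.5-Flash (API)": 0.50,
    "GPT-5-mini (est.)": 4.00,        # midpoint of the ~$3-5 estimate
    "Claude Sonnet 4.5 (est.)": 4.00, # midpoint of the ~$3-5 estimate
}

def monthly_cost(tokens_per_month: float, price_per_1m: float) -> float:
    return tokens_per_month / 1_000_000 * price_per_1m

volume = 100_000_000  # 100M tokens/month, illustrative
for model, price in PRICE_PER_1M.items():
    print(f"{model:28s} ${monthly_cost(volume, price):>8,.2f}/mo")
# Qwen3.5-Flash (API)          $   50.00/mo
# GPT-5-mini (est.)            $  400.00/mo
# Claude Sonnet 4.5 (est.)     $  400.00/mo
```

At this volume the closed-model premium is roughly $350/mo per 100M tokens; multiply by your actual volume and that is the number your 'support, compliance, and reliability' story has to justify.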
Action items
- Run a cost-comparison analysis of your current LLM API provider vs. Qwen3.5-Flash ($0.50/1M tokens) and self-hosted Qwen3.5-35B-A3B for your top 3 use cases by token volume before next sprint planning
- Evaluate building a model abstraction layer this quarter if you're currently locked to a single LLM provider (a minimal sketch follows this list)
- Document the specific value your closed-model API provides above open-source alternatives — share with leadership as a margin defense brief
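On the abstraction-layer item above: the goal is to make 'which model' a routing decision rather than a code change. A minimal sketch, assuming nothing about any vendor's SDK; the provider classes are empty placeholders where real client calls would go, and the task names are invented for illustration.

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Uniform surface so product code never imports a vendor SDK directly."""

    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 1024) -> str: ...

class QwenProvider(LLMProvider):
    def complete(self, prompt: str, max_tokens: int = 1024) -> str:
        # Placeholder: call a self-hosted Qwen3.5 endpoint here.
        return f"[qwen stub] {prompt[:40]}"

class ClosedModelProvider(LLMProvider):
    def complete(self, prompt: str, max_tokens: int = 1024) -> str:
        # Placeholder: call the proprietary vendor's SDK here.
        return f"[closed-model stub] {prompt[:40]}"

# Routing is configuration, not code: send bulk traffic to the cheap model
# and high-stakes traffic to the premium one, and swap either in one line.
ROUTES: dict[str, LLMProvider] = {
    "bulk_summarization": QwenProvider(),
    "customer_facing_drafts": ClosedModelProvider(),
}

def run(task: str, prompt: str) -> str:
    return ROUTES[task].complete(prompt)

print(run("bulk_summarization", "Summarize this support ticket thread..."))
```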
Sources: 🤖 AI Weekly Recap (Week 8) · The Sequence Radar #816: Last Week in AI · 🔮 Exponential View #563 · DevOps'ish 298
03 106 Experiments Say Your Copilot Makes Users Worse — Redesign AI Features Around Trust, Not Capability
<p>A <strong>Nature Human Behaviour meta-analysis of 106 experiments</strong> on human-AI collaboration delivers a finding that should stop every PM shipping copilot features: on average, the human-AI combination performed <em>worse</em> than whichever was best alone. The failures clustered specifically around <strong>decision tasks</strong> — judgment, accountability, and human skill. This is devastating for the copilot paradigm most product teams are shipping.</p><blockquote>If you're building AI features that sit alongside human decision-makers — AI-suggested next actions, smart recommendations, automated triage with human override — this research says you may be making your users worse at their jobs.</blockquote><h4>The Mechanism: Automation Bias Meets Algorithmic Drift</h4><p>The failure mode is well-documented: a confident machine proposing the wrong answer pulls a tired human toward it (<strong>automation bias</strong>). Combined with 'algorithmic drift' — the slow process by which systems start making choices on your behalf so smoothly you stop noticing — you get a product that feels helpful short-term but erodes user competence over time. The Air France 447 disaster is the extreme case: pilots so dependent on autopilot they couldn't recognize a basic aerodynamic stall. Your product probably won't kill anyone, but the pattern is identical.</p><h4>HubSpot's Fix: Trust as the Product</h4><p>Neha Monga, a product executive at HubSpot building AI platforms, independently arrived at the same conclusion from the practitioner side. Her team discovered that when users have to <strong>double or triple-check every AI output</strong>, the tool adds cognitive overhead rather than efficiency. The fix wasn't better models — it was better UX. By insisting that AI-generated outputs always show <strong>reasoning, sourcing, and confidence signals</strong>, they converted even the most skeptical customers. Monga's framework: 'Trust is not a feature — it is the product.'</p><h4>The Practical Taxonomy</h4><p>The implication isn't 'don't ship AI features' — it's 'be surgical about where AI assists vs. automates vs. stays out of the way.' Cross-referencing the Nature data with Monga's operational experience yields a clear taxonomy:</p><ul><li><strong>Execution tasks</strong> (drafting, formatting, data retrieval): Safe territory for full AI automation</li><li><strong>Judgment tasks</strong> (prioritization, diagnosis, strategic decisions): Need fundamentally different UX — friction by design, confidence calibration, forced engagement rather than rubber-stamping</li><li><strong>Hybrid tasks</strong>: AI handles the commodity work, humans focus on the judgment layer — but the UX must make the handoff explicit</li></ul><p>Monga also calibrates the copilot-to-agent hype: she warns that today's AI 'is powerful but not yet reliable enough to carry high-stakes judgment on its own' and that 'digital workers, once thought to be transformational, are still far from reality.' The sweet spot right now is <strong>supervised agents</strong> — AI that executes multi-step workflows for mechanical tasks but escalates to humans for judgment calls.</p><h4>The Expertise Paradox</h4><p>One additional finding deserves attention: <strong>AI's biggest productivity gains go to the least experienced workers</strong>, compressing visible skill differences. This means your AI features may be simultaneously helping juniors and harming seniors — a segmentation problem most products aren't designed for. 
The WEF's fastest-rising skills list reinforces this: analytical thinking, creative thinking, resilience, leadership, empathy. Prompt engineering is notably absent. The market is already looking past 'AI as tool' toward 'AI as infrastructure.'</p>
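One way to operationalize this taxonomy in code: classify each AI feature as execution or judgment at design time, and let that classification drive the UX contract (auto-apply versus friction by design). A minimal sketch; the thresholds, task kinds, and field names are illustrative, not HubSpot's implementation.

```python
from dataclasses import dataclass
from enum import Enum

class TaskKind(Enum):
    EXECUTION = "execution"  # drafting, formatting, retrieval: automation-safe
    JUDGMENT = "judgment"    # prioritization, diagnosis, strategy: force engagement

@dataclass
class AIOutput:
    text: str
    confidence: float   # 0..1, model- or heuristic-derived
    reasoning: str      # always surfaced to the user
    sources: list[str]

@dataclass
class ReviewMode:
    auto_apply: bool           # may the output land without a human click?
    show_reasoning: bool
    require_confirmation: bool

def review_policy(kind: TaskKind, out: AIOutput) -> ReviewMode:
    if kind is TaskKind.EXECUTION and out.confidence >= 0.9:
        return ReviewMode(auto_apply=True, show_reasoning=True,
                          require_confirmation=False)
    # Judgment tasks, and any low-confidence output, get friction by design:
    # reasoning and sources up front plus an explicit confirm step, so users
    # engage with the suggestion instead of rubber-stamping it.
    return ReviewMode(auto_apply=False, show_reasoning=True,
                      require_confirmation=True)
```

The payoff is that reasoning, sourcing, and confidence signals become enforced properties of every AI feature spec rather than per-team UX debates.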
Action items
- Audit every AI feature in your product and categorize as execution task, judgment task, or hybrid — redesign judgment-task UX to include friction, confidence signals, and reasoning transparency by end of Q2
- Add explainability requirements (reasoning, sourcing, confidence levels) to your AI feature spec template as a mandatory field starting this sprint
- Segment your AI feature experience by user expertise level — design differentiated assistance for novice vs. expert users
- Run a shadow AI audit: survey your user base to understand what AI tools they're bringing into workflows your product touches
Sources: Are You Flying, Or Are You Being Flown? · We interviewed an Agentic AI expert!
04 The Design Process Is Compressing — And Your Team Structure Hasn't Caught Up
<p>Jenny Wen left a Director of Design role at <strong>Figma</strong> — the company that literally makes the tools for the traditional design process — to go back to IC work as head of design at <strong>Anthropic's Claude</strong>. When someone with experience at Dropbox, Square, Shopify, and Figma makes that move, it signals that the most interesting design problems in tech have migrated to AI-native companies, and that the nature of design work itself is fundamentally changing.</p><h4>What's Actually Happening: Compression, Not Death</h4><p>The headline claim — that discovery→mock→iterate is dead — is provocative but directionally correct. AI tools like <strong>Claude Code</strong> (now integrated into VS Code and Slack), v0 from Vercel, and Cursor are enabling engineers to generate functional prototypes directly from problem statements. The designer's role shifts from 'produce artifacts that engineers implement' to <strong>'set direction, define taste, and curate AI-generated options.'</strong> Anthropic is hiring for three new design archetypes that don't have standard job descriptions yet — a leading indicator of what AI-native product teams will look like.</p><blockquote>The PM who still writes a PRD, waits for mocks, then hands off to eng is running a process designed for 2020.</blockquote><h4>The Chatbot Durability Thesis</h4><p>Perhaps the most strategically interesting claim: Wen argues <strong>chatbot interfaces are more durable than most people expect</strong>. The prevailing narrative has been that chat is transitional — a stepping stone to agentic or embedded AI. But Anthropic is investing in all three simultaneously: Claude chat, Claude Cowork (collaborative AI), and Claude Code (embedded in developer tools). The person designing Claude's UX is telling you chat isn't going away — it's becoming one surface in a multi-modal product ecosystem. If you've been deprioritizing conversational UI based on the assumption that 'chat is dead,' this is a strong counter-signal.</p><h4>The Talent Migration Signal</h4><p>Look at who the leading AI companies have assembled: <strong>Mike Krieger</strong> (Instagram co-founder) as Anthropic's CPO, <strong>Kevin Weil</strong> (ex-Instagram, ex-Twitter) as OpenAI's CPO, Jenny Wen from Figma. Both leading AI companies are hiring consumer product leaders, not enterprise software veterans. The implication: the bar for AI product experience is being set by people who built Instagram and Figma. If your AI features feel like enterprise software bolted onto a chatbot, you're going to lose to products designed by teams with this caliber of consumer product instinct. The competitive moat in AI products is shifting from <strong>model capability</strong> (commoditizing) to <strong>product experience</strong> (not commoditizing).</p>
Action items
- Run a 2-week experiment where one squad skips the mock phase entirely — have engineers prototype directly with AI tools (Claude Code, v0, Cursor) from problem statements
- Revisit your AI product's UI strategy — add a 'chatbot durability' scenario to your next strategy review if you've been betting entirely on embedded/agentic UI
- Map Anthropic's three new design archetypes against your current team composition to identify gaps in your AI-native product team structure
Sources: The design process is dead. Here's what's replacing it. · We interviewed an Agentic AI expert!
◆ QUICK HITS
Update: Ingress NGINX is deprecated as of March 2026, requiring migration to Gateway API or Traefik with 50+ annotation mappings; confirm with your platform team this week if you're affected
DevOps'ish 298
AI-generated images now render accurate text up to 4K resolution — Google's Nano Banana 2 produces infographics with 'structured accuracy' and is available across Google Ads and Cloud APIs
🤖 AI Weekly Recap (Week 8)
QuiverAI's Arrow-1.0 (a16z-backed, $8.3M seed) outputs editable SVG code instead of pixels — evaluate for any feature requiring programmatic generation of UI components, icons, or diagrams
🤖 AI Weekly Recap (Week 8)
Premium AI pricing is converging at $100–$200/month across Anthropic (Claude Max) and Perplexity (Max) — benchmark your AI feature pricing tier against this new willingness-to-pay anchor
🤖 AI Weekly Recap (Week 8)
Zillow's live ChatGPT integration turns LLMs into a real estate search layer — test whether your product/brand surfaces correctly when users ask natural language questions in your category
☕ Home sweet home
Perplexity Computer orchestrates 19 AI models per task, routing to the best model per sub-job — multi-model orchestration is now a shipped consumer product pattern, not just infrastructure theory
🤖 AI Weekly Recap (Week 8)
SendCutSend grew from a $750K laser cutter to ~60% Fortune 500 penetration via hobbyist word-of-mouth — audit your self-serve tier for prosumer users whose corporate email domains match target enterprise accounts
The job shop on steroids outperforming China
SpaceX is giving away $600 Starlink hardware and cutting service to $50/mo to lock in 10M subscribers before Amazon Kuiper launches later in 2026 — a live case study in pre-competitive switching-cost building
Editor's Pick: Inside SpaceX's Pre-IPO Push to Block Amazon
BOTTOM LINE
Benchmark contamination is now confirmed across all three frontier labs, Qwen3.5 just set a $0.50/1M-token floor that threatens your API margins, and 106 experiments show your copilot features may be making users worse at judgment tasks — not better. The PMs who build custom eval suites, architect for multi-model flexibility, and redesign AI UX around trust instead of capability will own the next cycle; the ones still selecting models by benchmark score and shipping black-box copilots are building on sand.
Frequently asked
- How do I know if my team's model selection is compromised by benchmark contamination?
- If you've made a model choice in the past year based on SWE-bench Verified, MMLU, or similar public benchmarks, assume it's compromised. OpenAI's audit confirmed GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash memorized SWE-bench solutions, and 59.4% of 'unsolved' problems had flawed tests that rejected correct answers. Re-evaluate with custom evals pulled from your own user sessions and edge cases.
- What's the fastest way to build a credible custom eval suite this sprint?
- Pull your top 20 real-world use cases from actual user sessions, support tickets, and known edge cases, then score model outputs against them. Simon Willison reproduced a subset of SnitchBench for about $10, and Anthropic treats evals like CI/CD — PMs, CSMs, and salespeople contribute tasks as pull requests. The barrier is workflow discipline, not cost or tooling.
- Does Qwen3.5-Flash at $0.50/1M tokens actually threaten products built on GPT or Claude?
- Yes, if your only differentiation is which foundation model you call. Qwen3.5-35B-A3B runs on a single 32GB consumer GPU with 1M+ token context, and Mistral's Accenture partnership shows open-weight models are getting enterprise distribution. The defensible value above open-source is enterprise support, compliance, and reliability — not marginal benchmark deltas, which are unreliable anyway.
- Should I stop shipping copilot features given the Nature meta-analysis findings?
- No, but be surgical about where AI automates versus assists. The 106-experiment meta-analysis found human-AI combinations underperform on judgment tasks specifically, not execution tasks. Keep full AI automation for drafting, formatting, and retrieval; redesign judgment-task UX with friction, confidence signals, and reasoning transparency so users engage rather than rubber-stamp.
- Is conversational UI still worth investing in, or should I bet on embedded and agentic surfaces?
- Invest in multiple surfaces simultaneously. Anthropic's new head of design argues chatbot interfaces are more durable than the market expects, and Anthropic itself is building Claude chat, Claude Cowork, and Claude Code in parallel. Treat chat as one surface in a multi-modal ecosystem rather than a transitional UI to deprecate.