PROMIT NOW · DATA SCIENCE DAILY · 2026-02-25

Frontier Models Fracture: Route Tasks, Trust Your Evals

· Data Science · 49 sources · 1,590 words · 8 min

Topics: Data Infrastructure · LLM Inference · Agentic AI

The frontier model landscape fractured into task-specific dominance this week — Gemini 3.1 Pro hits 77.1% on ARC-AGI-2 (2.5x its predecessor), Sonnet 4.6 sets records on OS World with a 1M-token context window at unchanged pricing, and GPT-5.3-Codex leads SWE-Bench Pro at 56.8%. Meanwhile, SWE-Bench Verified is officially broken (OpenAI abandoned it, citing flawed tests and contamination), and Anthropic disclosed that 24,000 fake accounts ran 16M exchanges to distill Claude's agentic reasoning capabilities. Your model selection should now be a routing system, not a vendor choice — and your evaluation harness is the only benchmark you can trust.

◆ INTELLIGENCE MAP

  1. 01

    Task-Fragmented Frontier: Model Routing Replaces Model Selection

    act now

    No single model dominates all benchmarks — Gemini 3.1 Pro leads abstract reasoning, GPT-5.3-Codex leads code generation, Sonnet 4.6 leads agentic computer use — while SWE-Bench Verified is now unreliable and OpenAI is deprecating 5 models, making internal evaluation harnesses the only trustworthy selection mechanism.

    5 sources
  2. 02

    Industrial-Scale Model Distillation as Attack Vector

    act now

    Anthropic disclosed that DeepSeek, Moonshot, and MiniMax used 24,000 fake accounts to generate 16M exchanges targeting agentic reasoning, tool orchestration, and RL reward model grading — revealing both the specific capabilities hardest to replicate and the inadequacy of standard rate-limiting defenses.

    12 sources
  3. 03

    Inference Hardware Disruption: Weight-Etched Silicon and Cost Uncertainty

    monitor

    Taalas's HC1 chip hits ~17,000 tok/s by etching Llama 3.1 8B into silicon (8.5x Cerebras) but with acknowledged quality degradation and zero model flexibility, while OpenAI's inference costs quadrupled to $8.4B (missing forecasts by $1.8B) and data center capacity is critically constrained at 1% vacancy with $96B in projects blocked.

    5 sources
  4. 04

    Recall Bottleneck and RAG Architecture Shifts

    monitor

    Google research shows recall — not knowledge encoding — is the factual performance bottleneck in parametric models, while Sonnet 4.6's 1M-token context at unchanged pricing and EPFL's finding that representation convergence happens in local neighborhood structure both reshape the RAG-vs-fine-tuning and embedding quality evaluation calculus.

    3 sources
  5. 05

    Databricks Vendor Lock-in via Iceberg Neutralization

    background

    Databricks' managed Iceberg tables deliberately lack hidden partitioning, manual file compaction, and snapshot management — forcing dependency on proprietary Predictive Optimization and threatening ML pipeline reproducibility and multi-engine portability.

    1 source

◆ DEEP DIVES

  1. 01

    The Frontier Fragmented: Why Your Model Selection Must Become a Routing System

    <h3>Three Providers, Zero Overlap at the Top</h3><p>February 2026 delivered benchmark results that shatter any remaining case for a single-model strategy. The performance gaps are large enough to be operationally significant:</p><table><thead><tr><th>Benchmark</th><th>Leader</th><th>Score</th><th>Runner-Up</th><th>Gap</th></tr></thead><tbody><tr><td><strong>ARC-AGI-2</strong></td><td>Gemini 3.1 Pro</td><td>77.1%</td><td>Claude Opus 4.6 (68.8%)</td><td>+8.3pp</td></tr><tr><td><strong>SWE-Bench Pro</strong></td><td>GPT-5.3-Codex</td><td>56.8%</td><td>GPT-5.2 (55.6%)</td><td>+1.2pp</td></tr><tr><td><strong>GPQA Diamond</strong></td><td>Gemini 3.1 Pro</td><td>94.3%</td><td>—</td><td>—</td></tr><tr><td><strong>OS World</strong></td><td>Claude Sonnet 4.6</td><td>Record*</td><td>—</td><td>—</td></tr><tr><td><strong>Humanity's Last Exam</strong></td><td>Gemini 3.1 Pro</td><td>44.4%</td><td>—</td><td>—</td></tr></tbody></table><p><em>*Exact Sonnet 4.6 scores on OS World and SWE-Bench Verified described as "new records" without specific numbers.</em></p><h3>SWE-Bench Is Dead — Build Your Own</h3><p>Compounding the routing problem: <strong>SWE-Bench Verified is no longer reliable</strong>. Multiple sources confirm two compounding failures — flawed test cases that reject correct fixes, and training data contamination making exposure a significant scoring factor. OpenAI has <strong>officially abandoned it</strong>. If you're still using SWE-Bench scores to select coding models, you're optimizing for a broken signal.</p><h3>Sonnet 4.6: The Cost-Performance Calculus Shifts</h3><p>Sonnet 4.6 ships with a <strong>1M-token context window</strong> (4x previous, currently in beta) at <strong>unchanged pricing</strong>. Early testers preferred it over Sonnet 4.5 ~70% of the time and over Opus 4.5 ~60% of the time. This fundamentally changes the RAG tradeoff — for document QA workloads where your corpus fits in context, you can potentially eliminate your entire chunking → embedding → vector DB → retrieval pipeline. <em>Caveat: preference rates lack reported sample sizes or statistical significance. The 1M window is in beta — validate latency and accuracy at extreme context lengths before migrating production workloads.</em></p><h3>The New Iteration Cadence</h3><p>Sonnet 4.6 shipped <strong>12 days after Opus 4.6</strong>. Gemini 3.1 Pro more than doubled its predecessor's ARC-AGI-2 score in a single generation. OpenAI is deprecating <strong>5 models</strong> including GPT-5 and GPT-4.1. Quarterly model evaluation reviews are obsolete — you need automated eval triggers that fire within 48 hours of any frontier release.</p><blockquote>The era of a single frontier model is over. The era of task-specific model routing has arrived.</blockquote>
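
    A minimal sketch of what "model selection as a routing system" can look like in practice, assuming you maintain a routing table fed by your own eval harness. The model identifiers, task labels, and the call_model() helper below are illustrative placeholders, not any provider's actual API.

    ```python
    # Task-based model routing sketch. Model IDs, task labels, and call_model()
    # are placeholders -- wire them to your provider SDKs and your eval results.
    from dataclasses import dataclass, field

    @dataclass
    class Route:
        primary: str                                         # current winner on your internal evals
        fallbacks: list[str] = field(default_factory=list)   # ordered fallback chain for deprecations/outages

    # Routing table derived from your own harness, not public leaderboards.
    ROUTES: dict[str, Route] = {
        "abstract_reasoning": Route("gemini-3.1-pro", ["claude-opus-4.6"]),
        "code_generation":    Route("gpt-5.3-codex", ["claude-sonnet-4.6"]),
        "computer_use":       Route("claude-sonnet-4.6", ["gemini-3.1-pro"]),
        "long_doc_qa":        Route("claude-sonnet-4.6", ["gemini-3.1-pro"]),
    }

    def route(task_type: str, prompt: str, call_model) -> str:
        """Try the primary model for a task, then walk the fallback chain."""
        cfg = ROUTES[task_type]
        for model in [cfg.primary, *cfg.fallbacks]:
            try:
                return call_model(model, prompt)   # provider-specific client or gateway goes here
            except Exception:
                continue                           # deprecated model, rate limit, outage: try next
        raise RuntimeError(f"all routes exhausted for task '{task_type}'")
    ```

    The point of the abstraction is that the routing table, not your application code, is what changes when the 48-hour eval trigger fires after a frontier release.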

    Action items

    • Build or update your internal evaluation harness covering ARC-AGI-2, SWE-Bench Pro, GPQA Diamond, OS World, and your domain-specific tasks — then benchmark Gemini 3.1 Pro, Sonnet 4.6, and GPT-5.3-Codex head-to-head on your actual workload distribution
    • Run a cost-benefit analysis of RAG pipeline vs. long-context stuffing using Sonnet 4.6's 1M-token window on your top 3 document QA workloads by end of March
    • Add model abstraction layers and fallback routing to any production pipeline using OpenAI models — GPT-4o, GPT-5, GPT-4.1, GPT-4.1 mini, and o4-mini are all being deprecated
    • Set up CI/CD-style eval runs that trigger within 48 hours of any frontier model release announcement

    Sources: Last Week in AI #336 - Sonnet 4.6, Gemini 3.1 Pro, Anthropic vs Pentagon · Gemini tops benchmarks, again · ChatGPT Pro Lite 🤖, Anthropic distillation 🧪, Perplexity Messages credits 💬 · FOD#141: What Happens to Software Engineering When Anyone Can Build?

  2. 02

    Distillation Attacks at Industrial Scale: The Capability Map Hidden in Anthropic's Disclosure

    <h3>The Attack Surface Is Bigger Than Rate Limiting</h3><p>Anthropic's disclosure is the most detailed public account of systematic model extraction ever published. The numbers across 12+ sources are consistent: <strong>24,000 fraudulent accounts</strong>, <strong>16M+ exchanges</strong>, with MiniMax responsible for approximately <strong>13 million</strong> (81%) of those exchanges alone. DeepSeek targeted <strong>logic and alignment</strong>, Moonshot AI targeted <strong>agentic reasoning and coding</strong>, and MiniMax targeted <strong>agentic coding</strong>.</p><table><thead><tr><th>Attacker</th><th>Target Capabilities</th><th>Est. Volume</th><th>Share</th></tr></thead><tbody><tr><td>MiniMax</td><td>Agentic coding</td><td>~13M</td><td>81%</td></tr><tr><td>DeepSeek</td><td>Logic, alignment</td><td>~1.5M</td><td>~9%</td></tr><tr><td>Moonshot AI</td><td>Agentic reasoning, coding</td><td>~1.5M</td><td>~9%</td></tr></tbody></table><h3>The Target List Is Your Capability Priority Map</h3><p>The most strategically valuable insight isn't the attack itself — it's <strong>what they targeted</strong>. The distillation campaigns focused on: agentic reasoning, tool use/orchestration, coding and data analysis, computer-use agent development, computer vision, and <strong>rubric-based grading for RL reward models</strong>. That last item is particularly significant — it suggests these labs are trying to bootstrap their own RLHF/RLAIF pipelines by extracting Claude's reward signal. This is the clearest public signal of which capabilities represent genuine technical moats.</p><h3>Why Standard Defenses Failed</h3><p>At ~667 conversations per account on average, individual account behavior was <strong>well within normal usage patterns</strong>. Standard per-account rate limiting completely missed this. The campaigns used <strong>"hydra cluster" proxy networks</strong> to bypass regional restrictions. The detection signal lives in <strong>cross-account coordination</strong>: query embedding similarity across accounts, temporal burst correlation, IP/proxy overlap, and account creation velocity. This is a graph problem — you need to model relationships between accounts, not just individual account behavior.</p><h4>Sources Disagree on Detection Timing</h4><p>Multiple sources note that 16M exchanges running before disclosure suggests either delayed detection or delayed public response. Anthropic hasn't disclosed their detection methodology — whether this was real-time behavioral analysis or forensic reconstruction matters enormously for whether this is a repeatable defense. Google reported a similar attack two weeks prior — over <strong>100,000 prompts</strong> targeting Gemini — suggesting this is an industry-wide pattern, not an Anthropic-specific vulnerability.</p><blockquote>If 16 million distillation queries can slip past Anthropic's defenses, your model API's biggest vulnerability isn't prompt injection — it's systematic capability extraction at industrial scale.</blockquote>
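
    Since the detection signal lives in cross-account coordination rather than per-account volume, a first pass can treat it as the graph problem described above. A minimal sketch, assuming you already log per-account query embeddings; the similarity threshold and cluster-size cutoff are illustrative, not values from Anthropic's disclosure.

    ```python
    # Cross-account coordination detection sketch: link accounts whose query
    # distributions are near-duplicates, then flag large connected components.
    import numpy as np
    import networkx as nx

    def flag_coordinated_accounts(
        account_embeddings: dict[str, np.ndarray],   # account_id -> (n_queries, dim) matrix
        similarity_threshold: float = 0.97,          # illustrative cutoff
        min_cluster_size: int = 20,
    ) -> list[set[str]]:
        ids = list(account_embeddings)
        # Summarize each account by its L2-normalized centroid query embedding.
        centroids = np.stack([account_embeddings[a].mean(axis=0) for a in ids])
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
        sims = centroids @ centroids.T               # pairwise cosine similarity between accounts

        g = nx.Graph()
        g.add_nodes_from(ids)
        # O(n^2) pairwise pass for clarity; use approximate nearest neighbors at real scale.
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                if sims[i, j] >= similarity_threshold:
                    g.add_edge(ids[i], ids[j])

        # Large clusters of near-identical accounts are the distillation signal, even
        # though each account looks benign in isolation.
        return [c for c in nx.connected_components(g) if len(c) >= min_cluster_size]
    ```

    In production you would add the other edge types the disclosure points to (temporal burst correlation, IP/proxy overlap, account creation velocity) and score clusters on combined evidence rather than embedding similarity alone.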

    Action items

    • Implement behavioral clustering on query distributions across accounts to detect coordinated multi-account distillation campaigns — deploy by end of Q1
    • Add output watermarking (statistical signatures in generated text) to any externally-served model API this quarter
    • Monitor for high-volume chain-of-thought elicitation, systematic capability probing across task categories, and coordinated account clusters in your API logs
    • Document provenance of any synthetic training data generated via commercial LLM APIs — the legal ground is shifting as Anthropic weaponizes distillation as an IP theft narrative

    Sources: Last Week in AI #336 - Sonnet 4.6, Gemini 3.1 Pro, Anthropic vs Pentagon · Hacked? You've only got 30 minutes. · CIA warned Tim Cook about China threat to Taiwan by 2027 · The Pentagon Calls Anthropic on the Carpet · Anthropic data harvested 🤖, AI memo crashes stocks 📉, don't quit your job 💼

  3. 03

    SAGE, Recall Bottlenecks, and the Research That Actually Changes Your Pipeline

    <h3>SAGE: Self-Aware Reasoning Termination</h3><p>SAGE introduces a mechanism for reasoning models to <strong>self-determine when to stop thinking</strong>, eliminating redundant chain-of-thought tokens after reaching a correct conclusion. SAGE-RL distills these efficient patterns into standard pass@1 inference. The claim: <strong>accuracy improvement with fewer tokens across six math benchmarks</strong>. If you're running o1-class or DeepSeek-R1-class models and paying per-token, even modest reductions in average chain-of-thought length translate directly to cost savings.</p><p><em>Critical gap: no specific numbers are reported — we don't know if this is 5% or 50% token reduction. Results are math-only; generalization to code, multi-hop QA, or planning tasks is unvalidated. Treat as a research direction to prototype, not a proven technique.</em></p><h3>The Recall Bottleneck: Your Fine-Tuning May Be Solving the Wrong Problem</h3><p>Google's "Empty Shelves or Lost Keys?" paper makes a clean distinction: when a model gets a factual question wrong, is it because the fact was <strong>never encoded</strong> (empty shelf) or because the model <strong>can't access</strong> what it knows (lost keys)? The finding — that <strong>recall, not encoding, is the limiting factor</strong> — has immediate implications. If you've been fine-tuning models on domain knowledge and seeing diminishing returns, the knowledge may already be there. Your investment should shift toward inference-time retrieval mechanisms, better prompting, or architectural changes to attention and memory.</p><p><strong>Diagnostic test:</strong> Take your worst-performing factual queries and test whether the model can answer them with heavy context priming (essentially giving it the answer in the prompt neighborhood). If accuracy jumps, your problem is recall, not knowledge — and your investment should go to retrieval infrastructure, not more fine-tuning data.</p><h3>Embedding Quality: Local Neighborhoods, Not Global Cosine</h3><p>EPFL's revision of the Platonic Representation Hypothesis shows that representation convergence happens in <strong>local neighborhood structure</strong>, not global embeddings. This means your embedding quality metric should be <strong>k-NN neighborhood preservation</strong>, not just cosine similarity. If you're evaluating embedding models for RAG or search, test local neighborhood consistency as your primary quality signal.</p><h3>Masked Gradient Updates: A Cheap Stability Fix</h3><p>Google's work on <strong>masked gradient updates in adaptive optimizers</strong> shows surprising effectiveness for training stability — a potentially cheap fix for loss spikes that doesn't require architectural changes. 
If you're training or fine-tuning large models, try masking out low-magnitude gradient components in Adam/AdamW and compare training curves before reaching for more expensive stability interventions.</p><hr><h4>Five New Architectures Worth Tracking</h4><table><thead><tr><th>Model</th><th>Key Innovation</th><th>Your Action</th></tr></thead><tbody><tr><td><strong>Kimi K2.5</strong></td><td>Parallel-Agent RL orchestrating up to 100 subagents; optimizes critical-path latency</td><td>Benchmark for agentic task orchestration</td></tr><tr><td><strong>Qwen3.5</strong></td><td>397B params, 201 languages, sparse MoE + hybrid attention + native multimodal</td><td>Evaluate for multilingual or self-hosted needs</td></tr><tr><td><strong>GLM-5</strong></td><td>Long-horizon RL with async training for multi-step engineering</td><td>Monitor for autonomous workflow tasks</td></tr><tr><td><strong>Causal-JEPA</strong></td><td>Object-level masking inducing latent counterfactual structure</td><td>Evaluate for causal reasoning downstream tasks</td></tr><tr><td><strong>FDM-1</strong></td><td>11M hours video, 2hr→1M token compression, inverse dynamics for action prediction</td><td>Review architecture if working on video/temporal data</td></tr></tbody></table>
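
    To make the recall diagnostic above concrete, here is a minimal sketch of the context-priming A/B test. The ask_model() and is_correct() callables and the query record format are placeholders for whatever your eval harness already provides.

    ```python
    # Recall-vs-encoding diagnostic ("empty shelves or lost keys"): re-ask failing
    # factual queries with heavy context priming and compare accuracy.
    def recall_diagnostic(failing_queries, ask_model, is_correct):
        """Return (bare_accuracy, primed_accuracy) over queries the model currently gets wrong."""
        bare_hits = primed_hits = 0
        for q in failing_queries:
            # q is assumed to look like {"question": ..., "answer": ..., "context": ...}
            bare = ask_model(q["question"])
            primed = ask_model(f"Background notes:\n{q['context']}\n\nQuestion: {q['question']}")
            bare_hits += is_correct(bare, q["answer"])
            primed_hits += is_correct(primed, q["answer"])
        n = len(failing_queries)
        return bare_hits / n, primed_hits / n

    # If primed accuracy jumps well above bare accuracy, the facts are encoded but not
    # retrievable (a recall problem): invest in retrieval and prompting, not more
    # fine-tuning data. If both stay low, the knowledge was likely never encoded.
    ```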

    Action items

    • Prototype SAGE-style early stopping on your highest-volume reasoning endpoint — measure tokens-per-correct-answer before and after
    • Run the recall diagnostic on your worst-performing factual queries: test with heavy context priming to distinguish recall failures from encoding failures
    • Add k-NN neighborhood preservation as an embedding quality metric alongside cosine similarity in your RAG evaluation suite (see the sketch after this list)
    • Try masked gradient updates in your next large-model fine-tuning run before reaching for more expensive stability interventions
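
    As referenced in the embedding-quality action item above, a minimal sketch of k-NN neighborhood preservation as a metric. It compares each item's k-nearest-neighbor set under two embedding models; the function names and Jaccard-overlap scoring are illustrative choices, not a procedure taken from the EPFL paper.

    ```python
    # k-NN neighborhood preservation: how well does a candidate embedding space
    # preserve the local neighborhoods of a reference embedding space?
    import numpy as np

    def knn_indices(embeddings: np.ndarray, k: int) -> np.ndarray:
        """Indices of each item's k nearest neighbors by cosine similarity."""
        x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        sims = x @ x.T
        np.fill_diagonal(sims, -np.inf)              # exclude self-matches
        return np.argsort(-sims, axis=1)[:, :k]

    def neighborhood_preservation(ref: np.ndarray, cand: np.ndarray, k: int = 10) -> float:
        """Mean Jaccard overlap between each item's k-NN sets in the two spaces."""
        ref_nn, cand_nn = knn_indices(ref, k), knn_indices(cand, k)
        overlaps = [
            len(set(r) & set(c)) / len(set(r) | set(c))
            for r, c in zip(ref_nn, cand_nn)
        ]
        return float(np.mean(overlaps))
    ```

    A candidate model can look fine on average cosine similarity while scrambling local neighborhoods; tracking this score alongside retrieval metrics catches that failure mode.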

    Sources: ChatGPT Pro Lite 🤖, Anthropic distillation 🧪, Perplexity Messages credits 💬 · FOD#141: What Happens to Software Engineering When Anyone Can Build? · A Foundational Guide to Evaluation of LLM Apps

  4. 04

    Databricks Is Quietly Neutralizing Iceberg — What This Means for Your ML Pipeline Portability

    <h3>The Feature Gap Is Deliberate</h3><p>Nearly two years after spending <strong>over $1 billion to acquire Tabular</strong> (the company behind Apache Iceberg), Databricks' managed Iceberg implementation is deliberately limited. The gap between open-source Iceberg and Databricks managed Iceberg is not a matter of catching up — it's architectural:</p><table><thead><tr><th>Capability</th><th>Open-Source Iceberg</th><th>Databricks Managed</th></tr></thead><tbody><tr><td>Hidden Partitioning</td><td>✅ Full support</td><td>❌ Not available</td></tr><tr><td>Manual File Compaction</td><td>✅ Full control</td><td>❌ Auto-only via Liquid Clustering</td></tr><tr><td>Snapshot Management</td><td>✅ Full control</td><td>❌ Not available</td></tr><tr><td>Deterministic File Layout</td><td>✅ User-controlled</td><td>❌ Platform-managed (opaque)</td></tr></tbody></table><p>You <strong>must enable Predictive Optimization</strong> to use managed Iceberg at all. The platform's new guidance is blunt: <em>"don't partition or sort or bucket your tables."</em> Liquid Clustering, Predictive Optimization, and Photon replace every physical data modeling lever that data engineers previously controlled.</p><h3>Why This Matters for ML Pipelines</h3><p><strong>Reproducibility risk:</strong> If your feature engineering pipelines depend on deterministic file layouts for reproducible training runs, Liquid Clustering's opaque layout management means you can't guarantee the same query over the same logical data will scan the same physical files across runs.</p><p><strong>Multi-engine access:</strong> Many ML teams read training data through Spark but serve features through different engines (Trino, DuckDB, Presto). Managed Iceberg's Predictive Optimization requirement creates a <strong>Databricks dependency in your read path</strong> that undermines the multi-engine promise of open table formats.</p><p><strong>The debugging problem:</strong> When your 4TB daily feature engineering job suddenly takes 3x longer because Liquid Clustering made a suboptimal layout decision, you have <strong>no manual override</strong>. No partitioning knobs, no compaction controls, no file layout inspection. You file a support ticket.</p><h3>The Power User Signal</h3><p>Practitioner Zach Wilson migrated <strong>13,000 tables to AWS Glue Catalog</strong> instead of Unity Catalog after Databricks deprecated the Tabular metastore — a concrete data point that experienced users are routing around Databricks' governance layer to retain engine flexibility.</p><p><em>Important caveat: This analysis comes from a single practitioner's experience. There are no performance benchmarks comparing Liquid Clustering vs. manual partitioning, no TCO analysis, and no discussion of whether Predictive Optimization's decisions are inspectable. Before making architectural decisions, run your own benchmarks on your own workloads.</em></p><blockquote>Databricks is trading your physical data modeling controls for platform-managed automation, and its managed Iceberg implementation strips the features that make Iceberg worth using.</blockquote>
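
    One cheap way to surface the reproducibility risk described above is to fingerprint the physical file layout behind each training query and compare it across runs. A minimal sketch using Spark's DataFrame.inputFiles(); the table name and query are placeholders, and inputFiles() is best-effort, so treat this as a drift detector rather than a guarantee.

    ```python
    # Fingerprint the physical files a feature query scans, so you can detect when
    # Liquid Clustering / Predictive Optimization rewrites the layout between runs.
    import hashlib
    from pyspark.sql import SparkSession

    def physical_layout_fingerprint(spark: SparkSession, query: str) -> str:
        """Hash the sorted set of data files backing a query's result."""
        df = spark.sql(query)
        files = sorted(df.inputFiles())   # physical data files Spark plans to scan
        return hashlib.sha256("\n".join(files).encode()).hexdigest()

    if __name__ == "__main__":
        spark = SparkSession.builder.getOrCreate()
        fp = physical_layout_fingerprint(
            spark, "SELECT * FROM ml.features_daily WHERE ds = '2026-02-24'"   # placeholder table
        )
        # Persist fp with your training-run metadata (MLflow tags, run manifests, etc.).
        # If it drifts between runs over the same logical data, the platform has
        # rewritten the physical layout underneath your pipeline.
        print(fp)
    ```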

    Action items

    • Audit your Databricks Iceberg tables for missing optimization capabilities — specifically check hidden partitioning, manual file compaction, and snapshot management availability
    • Benchmark Liquid Clustering + Predictive Optimization against your hand-tuned partitioning schemes for your top 5 most expensive Spark jobs
    • Evaluate your metastore strategy — compare Unity Catalog vs. AWS Glue Catalog for ML governance needs and document exit costs
    • Document your team's physical data modeling knowledge in runbooks that exist independent of Databricks before it atrophies

    Sources: Databricks is no longer about tuning knobs

◆ QUICK HITS

  • Update: OpenAI inference costs quadrupled to $8.4B in 2025, missing their own 46% gross margin forecast by 13pp (actual: 33%) — per-user compute efficiency improved 35%→70% but was swamped by reasoning models and video generation demand

    Dealmaker: Why OpenAI, Anthropic Are Missing Their Own Margin Forecasts

  • Update: Anthropic-Pentagon standoff reaches binary outcome — Dario Amodei met Defense Secretary Hegseth today over $200M contract; Pentagon CTO threatened 'supply chain risk' designation that would exile Anthropic from all government work

    The Briefing: Anthropic: Foe or Frenemy?

  • Meta signed a $100B+ AMD deal (MI450 + EPYC + ROCm) with 160M-share performance warrants — first credible hyperscaler-validated alternative to NVIDIA for AI training, deploying H2 2026

    CIA warned Tim Cook about China threat to Taiwan by 2027

  • SAGE-RL enables reasoning models to self-terminate thinking with accuracy gains and token savings across 6 math benchmarks — directly applicable to inference cost optimization if it generalizes beyond math

    ChatGPT Pro Lite 🤖, Anthropic distillation 🧪, Perplexity Messages credits 💬

  • OpenAI deprecating GPT-4o, GPT-5, GPT-4.1, GPT-4.1 mini, and o4-mini — hundreds of thousands of users affected; start migration testing now if your production prompts are tuned to any of these

    Last Week in AI #336 - Sonnet 4.6, Gemini 3.1 Pro, Anthropic vs Pentagon

  • Alibaba's Qwen3.5: 397B parameter open-weight model with 201 languages and native multimodal — new open-weight frontier for multilingual coverage or self-hosting

    Last Week in AI #336 - Sonnet 4.6, Gemini 3.1 Pro, Anthropic vs Pentagon

  • Princeton's agent reliability framework decomposes into 4 orthogonal dimensions — consistency, robustness, predictability, safety — each requiring separate test suites before production deployment

    FOD#141: What Happens to Software Engineering When Anyone Can Build?

  • LLM-powered attack toolkit exposed: DeepSeek + Claude Code via MCP server targeted 2,516 FortiGate appliances across 106 countries, built in ~8 weeks from open-source framework

    LLM Powered FortiGate Attacks 🤖, Pulsar RAT in NPM PNGs 🖼️, Paypal SSN Leak 🔓

  • Data center compute at 1% vacancy with $96B in projects blocked in Q2 2025 and 200+ regulatory bills introduced — budget for 15-25% GPU price increases over the next 12 months

    Axios Pro Rata: AI speed bump

  • Implement content-hash-based smart diffing in your RAG ingestion pipeline to avoid re-embedding unchanged documents — can reduce re-embedding volume by 80-95% in typical document stores (a minimal sketch follows below)

    A Foundational Guide to Evaluation of LLM Apps
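
    A minimal sketch of the content-hash diffing idea, assuming a simple JSON file as the hash store and an embed() callable standing in for your chunk-embed-upsert path; swap both for your metadata DB and vector store in production.

    ```python
    # Content-hash smart diffing for RAG ingestion: only re-embed documents whose
    # content hash changed since the last run.
    import hashlib
    import json
    from pathlib import Path

    HASH_STORE = Path("doc_hashes.json")      # placeholder; use your metadata DB in production

    def content_hash(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def ingest(documents: dict[str, str], embed) -> list[str]:
        """Re-embed only new or modified docs; return the ids that were re-embedded."""
        seen = json.loads(HASH_STORE.read_text()) if HASH_STORE.exists() else {}
        changed = []
        for doc_id, text in documents.items():
            h = content_hash(text)
            if seen.get(doc_id) != h:         # new document or content changed
                embed(doc_id, text)           # your chunk -> embed -> upsert path goes here
                seen[doc_id] = h
                changed.append(doc_id)
        HASH_STORE.write_text(json.dumps(seen, indent=2))
        return changed
    ```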

BOTTOM LINE

No single frontier model wins across all tasks — Gemini 3.1 Pro leads reasoning at 77.1% ARC-AGI-2, GPT-5.3-Codex leads coding at 56.8% SWE-Bench Pro, and Sonnet 4.6 leads agentic use with a 1M-token context window — while SWE-Bench Verified is officially broken and 24,000 fake accounts distilled Claude at industrial scale; your model selection needs to be a routing system, your evaluation harness is the only benchmark you can trust, and your API needs distillation detection before someone extracts your capabilities with nothing more than email addresses and a training loop.

Frequently asked

Why can't I trust SWE-Bench Verified scores for model selection anymore?
SWE-Bench Verified has two compounding failures: flawed test cases that reject correct fixes, and training data contamination that makes prior exposure a significant scoring factor. OpenAI has officially abandoned it. If you're using it to select coding models, you're optimizing for a broken signal — build your own evaluation harness against your actual workload distribution instead.
Does Sonnet 4.6's 1M-token context window mean I should rip out my RAG pipeline?
Potentially, for document QA workloads where your corpus fits in context. The 1M window ships at unchanged pricing (4x previous), which fundamentally changes the RAG tradeoff and could eliminate your chunking → embedding → vector DB → retrieval stack. But the window is in beta, and early preference numbers (~70% vs. Sonnet 4.5) lack reported sample sizes. Run a cost-benefit analysis on your top workloads before migrating.
How do I detect distillation attacks when individual accounts look normal?
Treat it as a graph problem, not a per-account rate-limiting problem. At ~667 conversations per account, the MiniMax/DeepSeek/Moonshot campaigns stayed inside normal usage envelopes. The signal lives in cross-account coordination: query embedding similarity across accounts, temporal burst correlation, IP/proxy overlap, and account creation velocity. Adding output watermarking also enables post-hoc provenance tracking.
How do I tell whether my model's factual errors are a knowledge problem or a recall problem?
Run a context-priming diagnostic: take your worst-performing factual queries and re-ask them with heavy context priming that puts the answer in the prompt neighborhood. If accuracy jumps sharply, the knowledge is encoded but not retrievable — your problem is recall, and investment should shift to retrieval infrastructure, prompting, or attention/memory architecture rather than more fine-tuning data.
What do I lose by using Databricks managed Iceberg instead of open-source Iceberg?
You lose hidden partitioning, manual file compaction, snapshot management, and deterministic file layout — all replaced by mandatory Predictive Optimization and Liquid Clustering. For ML pipelines this creates reproducibility risk (non-deterministic physical layouts across training runs), a Databricks dependency in multi-engine read paths, and no manual override when Liquid Clustering makes a bad layout decision on a large job.
