PROMIT NOW · DATA SCIENCE DAILY · 2026-03-31

ARC-AGI-3 Shows Scaffolding Beats Frontier LLMs by 30×

· Data Science · 31 sources · 1,427 words · 7 min

Topics: LLM Inference · Data Infrastructure · Agentic AI

ARC-AGI-3 just proved that RL+graph-search outperforms every frontier LLM by 30× on interactive reasoning (12.58% vs. Gemini's 0.37%), Meta's open-source HyperAgents deliver 2-6× gains by rewriting scaffolding on frozen Claude Sonnet 4.5, and AutoBe's constrained output harness turned 6.75% function-calling success into 99.8%. Your next order-of-magnitude improvement comes from architecture around the model, not from upgrading the model itself.

◆ INTELLIGENCE MAP

  01

    Scaffolding Beats Scale: 30× Gains from Architecture, Not Parameters

    act now

    RL+graph-search scored 12.58% on ARC-AGI-3 while every frontier LLM scored <1%. Meta's open-source HyperAgents deliver 2-6× gains by rewriting scaffolding on frozen Claude. AutoBe's constrained harness turned 6.75% function calling into 99.8%. Architecture around the model beats bigger models.

    Key stat: 30× search-vs-LLM gap (4 sources)
    Chart, ARC-AGI-3 scores (%): Humans 100 · RL+Search 12.58 · Gemini 3.1 0.37 · GPT 5.4 0.26 · Opus 4.6 0.25
  02

    Your Evaluation Pipeline Has Three Confirmed Failure Modes

    act now

    LLM evaluators agree with humans only 44% of the time (vs. a 65% human-human baseline). Gemini scores 98% on ARC-AGI-1 but 0.37% on ARC-AGI-3: contamination confirmed. HorizonMath's unsolved problems cap GPT 5.4 Pro at 7%. Static benchmarks and LLM judges are both broken.

    Key stat: 44% LLM-human agreement (4 sources)
    Chart, grader agreement (%): Human-Human 65 · LLM-Human 44
  03

    Three Production Inference Playbooks with Real Numbers

    monitor

    Roblox ships a single MoE model for 256 language directions at 100ms/5K rps with 3-layer caching. Notion cut vector search costs 60% and embeddings costs 90%+ via turbopuffer+Ray. Voxtral collapses TTS inference from K steps to 4 via flow matching. Steal the patterns.

    Key stat: 60% Notion search cost cut (3 sources)
    Chart, reported reductions (%): Notion embeds 90 · Notion search 60 · Roblox params 35 · Voxtral steps 75
  04

    LLM Orchestration Stack Under Active Attack — CVSS 9.3

    monitor

    Langflow CVE-2026-33017 (CVSS 9.3) enables full server takeover via one HTTP request, exposing all API keys. TeamPCP's cascading attack backdoored PyPI packages including Telnyx's SDK. Attacker breakout time collapsed to 22 seconds. Patch and rotate keys today.

    Key stat: CVSS 9.3 severity (3 sources)
    Chart: Langflow RCE severity, CVSS 9.3 out of 10
  05

    Compute Supply Squeeze Deepens — Infrastructure and Margins

    background

    241 GW US data center pipeline is 67% stuck in grid queues. 89% of under-construction capacity is pre-leased. Anthropic burns ~$32B in OpEx against $18B revenue, projecting only 24% EBITDA margin at $200B revenue by 2031. HBM is 18+ months out. Plan for scarcity.

    Key stat: 67% of the US DC pipeline stuck in grid queues (3 sources)
    Chart, 241 GW US pipeline breakdown (%): Grid-stuck 67 · Under construction 22 · Available 11

◆ DEEP DIVES

  01

    Scaffolding Beats Scale — Three Architecture Patterns Delivering Order-of-Magnitude Gains on Frozen Models

    The Convergence

    Three independent results from this cycle point the same direction: the biggest performance gains aren't coming from better models; they're coming from better architecture around models. ARC-AGI-3 shows RL+search beats frontier LLMs by 30×. Meta's HyperAgents deliver 2-6× by rewriting scaffolding. AutoBe turns catastrophic function calling (6.75%) into near-perfect (99.8%) with a constrained output harness. The common thread: the model stays frozen; the gains come from search, self-modification, and structured verification.

    ARC-AGI-3: The Benchmark That Broke Every Frontier Model

    ARC-AGI-3 is the first interactive reasoning benchmark: turn-based games with no instructions, no known rules, no predefined goals. Over 1,200 human testers solved 100% of environments. Every frontier model scored below 1%:

        System                      ARC-AGI-3 score   vs. best LLM
        Human testers (n=1,200+)    100%              270×
        RL + graph search           12.58%            30×
        Gemini 3.1 Pro              0.37%             n/a
        GPT 5.4 High                0.26%             n/a
        Opus 4.6                    0.25%             n/a
        Grok                        0.00%             n/a

    The devastating comparison: the RL+search approach isn't a billion-parameter model; it's classical search over a structured state space. The LLM provides hypothesis generation; the search module provides systematic exploration. Caveat: compute-budget comparisons between the two approaches are not published.

    Meta's HyperAgents: Self-Modification at Scale

    Meta open-sourced the Darwin Gödel Machine HyperAgent (DGM-H), an architecture in which a meta-agent modifies both the task agent and its own modification procedure, evolving over generations until performance saturates. Using Claude Sonnet 4.5 as the frozen base model, results include:

    • Polyglot coding: 0.14 → 0.34 (~2.4×)
    • Paper review prediction: 0.0 → 0.71 (from zero to meaningful)
    • Robotics reward design: 0.06 → 0.37 (~6.2×)

    Cross-domain transfer is the headline: standard agents show ~0 progress in new domains without redesign, while HyperAgents hit +0.63 in 50 iterations. Critical safety constraint: agents cannot modify the outer evaluation function. This is both the guardrail and the limitation.

    "Meta used Anthropic's Claude, not Llama, as the base model: an implicit endorsement of Claude for agentic self-modification tasks that standard benchmarks don't capture."

    AutoBe: Constrained Harness for Function Calling

    AutoBe demonstrated that wrapping qwen3-coder-next in a type-schema + compiler-verification + structured-feedback harness boosted function-calling success from 6.75% to 99.8%. The pattern: constrain → verify → diagnose → correct. Type schemas narrow the output space, compilers provide binary pass/fail verification, and structured error messages give actionable correction signals. This is compiler-in-the-loop reinforcement at inference time.
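
    The loop is cheap to prototype. A minimal sketch below, assuming a generic call_llm function and JSON Schema validation as a stand-in for AutoBe's type-schema + compiler stack; the schema, retry budget, and function names here are illustrative, not AutoBe's actual implementation:

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

# Constrain: a schema that narrows the output space. A real harness would
# derive this from the tool's type signature (AutoBe adds a compiler pass).
TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "enum": ["create_table", "add_column"]},
        "arguments": {"type": "object"},
    },
    "required": ["name", "arguments"],
    "additionalProperties": False,
}

def call_llm(prompt: str) -> dict:
    """Placeholder: your model call, returning a parsed candidate tool call."""
    raise NotImplementedError

def constrained_tool_call(task: str, max_retries: int = 3) -> dict:
    """Constrain -> verify -> diagnose -> correct."""
    prompt = task
    for _ in range(max_retries):
        candidate = call_llm(prompt)
        try:
            validate(candidate, TOOL_SCHEMA)   # verify: binary pass/fail
            return candidate
        except ValidationError as err:
            # diagnose: structured error -> correct: feed it back to the model
            prompt = f"{task}\n\nPrevious call was invalid: {err.message}. Fix it."
    raise RuntimeError("tool call still invalid after retries")
```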

    Action items

    • Clone Meta's HyperAgents repo from Facebook Research GitHub and run the self-improvement loop on one existing LLM pipeline (prompt optimization or data cleaning) this sprint
    • Implement a constrained output harness (type schemas + compiler verification + structured error feedback) for your highest-value function-calling agent within 2 weeks
    • Add explicit search modules (MCTS, beam search, or graph search) to any agent handling exploration or hypothesis-testing tasks (a minimal sketch follows this list)
    • Ensure any self-improving agent has an immutable, external evaluation function that cannot be modified by the agent itself
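
    For the search-module item above, a minimal sketch of the division of labor the ARC-AGI-3 result points at: classical breadth-first search guarantees systematic exploration, while an LLM hook only proposes candidate actions. The propose_actions hook, transition function, and node budget are all illustrative; the winning RL+search system is not published in this form.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Optional

def graph_search(
    start: Hashable,
    is_goal: Callable[[Hashable], bool],
    propose_actions: Callable[[Hashable], Iterable[str]],  # e.g. LLM hypotheses
    step: Callable[[Hashable, str], Hashable],             # environment transition
    max_nodes: int = 10_000,
) -> Optional[list[str]]:
    """BFS over states: the LLM narrows branching, the search covers it."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier and len(seen) < max_nodes:
        state, path = frontier.popleft()
        if is_goal(state):
            return path                       # action sequence reaching the goal
        for action in propose_actions(state):
            nxt = step(state, action)
            if nxt not in seen:               # never re-explore a state
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None                               # search budget exhausted
```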

    Sources: Your LLM agents score <1% where RL+search hits 12.58% · AutoBe's constrained harness turns 6.75% function-calling into 99.8% · Self-improving agents are here: Meta's Hyperagents rewrites its own code · Self-rewriting agents, a 20B model beating GPT-5 at retrieval

  02

    Your Eval Harness Is Your Biggest Liability — Three Proofs from This Week

    The Problem

    Three independent findings converge on the same conclusion: the tools you're using to evaluate models are systematically misleading you. LLM judges are unreliable, static benchmarks are contaminated, and cascading AI verification creates correlated failure modes. If you're making model-selection, fine-tuning, or deployment decisions based on these signals, you're optimizing against noise.

    Finding 1: LLM Judges Agree with Humans 44% of the Time

    New research quantifies the LLM-as-judge reliability gap: language-model essay grading agrees with human graders only 44% of the time, compared to 65% inter-rater agreement between two human graders. That's a 21-percentage-point gap; not a calibration issue, a fundamental reliability problem.

        Evaluation method   Agreement rate                         Failure mode
        Human-Human         65%                                    Subjective variance, fatigue
        LLM-Human           44%                                    Mean regression, distribution blindness
        LLM-LLM             Unknown (likely higher, meaningless)   Correlated errors, shared biases

    The researchers attribute this partly to an architectural property: LLMs trained on diverse data regress toward mean responses. If your eval pipeline uses GPT-4 or Claude to score outputs, you may be measuring evaluator noise rather than genuine quality differences. Which LLMs were tested and which prompting strategies were used are not disclosed.

    Finding 2: Benchmark Contamination Is Worse Than You Think

    Gemini 3.1 Pro scores 98% on ARC-AGI-1 and 0.37% on ARC-AGI-3, a 97.63-point gap. The ARC Prize team found Gemini's reasoning chain correctly referenced ARC's integer-to-color mapping without being told, evidence of training-data contamination. Separately, HorizonMath (Oxford/Harvard/Princeton) used 100 predominantly unsolved math problems to create a contamination-proof benchmark. GPT 5.4 Pro scored 7% overall (50% on the easiest tier). The gap between tiers suggests frontier models handle algorithmic math but fail at problems requiring genuinely novel insight.

    Finding 3: AI Verifying AI Creates Correlated Failures

    An Oregon court fined a lawyer $10K for 15 hallucinated citations, and when staff tried to verify them with Google AI Search, Google's AI confirmed the fabricated cases were real. Both systems draw from overlapping training distributions, so a plausible-sounding hallucination gets "recognized" as familiar by the verifier. Familiarity is not factuality, but transformer architectures have no mechanism to distinguish the two without grounded retrieval.

    "If your hallucination guardrail is 'check it with another AI,' the Oregon case proves that's a correlated failure mode, not a safety net."
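
    Measuring your own judge's agreement rate (first action item below) takes a dozen lines. A minimal sketch, assuming paired labels from one LLM judge and one human rater on the same items; raw agreement plus Cohen's kappa as a chance-corrected check are conventional choices, not the paper's disclosed methodology:

```python
from collections import Counter

def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Raw agreement: fraction of items where the labels match exactly."""
    assert len(judge) == len(human)
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement: guards against inflated raw agreement
    when one label dominates the distribution."""
    n = len(judge)
    p_o = agreement_rate(judge, human)                     # observed
    jc, hc = Counter(judge), Counter(human)
    labels = set(jc) | set(hc)
    p_e = sum((jc[l] / n) * (hc[l] / n) for l in labels)   # expected by chance
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Example: a judge that looks passable on raw counts
judge_scores = ["pass", "pass", "fail", "pass", "fail"]
human_scores = ["pass", "fail", "fail", "pass", "pass"]
print(agreement_rate(judge_scores, human_scores))  # 0.6
print(cohens_kappa(judge_scores, human_scores))    # ~0.17, far weaker
```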

    Action items

    • Add human calibration holdout sets to every LLM-as-judge evaluation pipeline this sprint — measure your specific LLM judge's agreement rate against human raters
    • Replace any static reasoning benchmarks in your model selection pipeline with dynamically generated or held-out test sets by end of quarter
    • Replace any 'verify with a second AI' pipeline with deterministic verification against canonical databases for factual claims
    • Build contamination-proof evaluation sets for your domain using programmatically generated novel test cases with deterministic automated verification (a minimal sketch follows this list)
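
    The last item is the easiest to prototype. A minimal sketch of a programmatically generated, deterministically verifiable eval set; synthetic integer arithmetic stands in for your domain, and the seed-per-run scheme is an illustrative convention:

```python
import random

def make_case(rng: random.Random) -> tuple[str, int]:
    """One freshly generated test case with a checkable ground truth."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return f"What is {a} * {b}?", a * b

def build_eval_set(seed: int, n: int = 200) -> list[tuple[str, int]]:
    """A new seed per eval run means no case can sit in training data."""
    rng = random.Random(seed)
    return [make_case(rng) for _ in range(n)]

def score(answers: list[str], cases: list[tuple[str, int]]) -> float:
    """Deterministic verification: exact match against the ground truth."""
    hits = sum(a.strip() == str(t) for a, (_, t) in zip(answers, cases))
    return hits / len(cases)
```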

    Sources: Your LLM agents score <1% where RL+search hits 12.58% · Self-improving agents are here: Meta's Hyperagents rewrites its own code · LLM evaluators agree with humans only 44% of the time · Google AI Search validated hallucinated citations

  03

    Three Production Inference Architectures You Can Steal This Week

    The Pattern

    Three companies published production inference playbooks with concrete numbers this cycle: Roblox, Notion, and Mistral. Each solved a different bottleneck (multi-target serving, vector search at scale, sequential generation speed), and each contains transferable patterns for your own serving stack. The convergent theme: caching, compression, and architectural innovation matter more than raw GPU count.

    Roblox: Three-Layer Caching for MoE Translation

    Roblox ships a single unified MoE transformer handling 256 language directions at ~100ms latency and 5,000+ rps for 70M+ DAU. The model was distilled from 1B to <650M parameters (~35% reduction) with quantization and model compilation.

    The real innovation is the three-layer caching architecture:

    1. Translation cache: exact source+target lookup, zero inference cost on a hit
    2. Embedding cache: encoder output computed once, reused across all decoder passes for fan-out patterns. This is the most transferable pattern in the entire system.
    3. Dynamic batcher: converts latency-sensitive requests into throughput-efficient GPU batches on a cache miss

    They also built a reference-free quality estimation model scoring at word-level granularity across accuracy, fluency, and consistency, trained on human-labeled error annotations rather than reference translations.

    Notion: turbopuffer + Ray at Scale

    Notion's vector search evolution delivered a 60% cost reduction, 50-70ms p50 latency, and a 600× onboarding throughput increase through four changes: dual ingestion, page-state optimization, serverless migration, and switching to turbopuffer. Ray/Anyscale cut embeddings infrastructure costs by 90%+.

    Caveat: the 600× figure lacks a baseline, and the four changes are confounded. But the technology selection is the actionable signal: turbopuffer for serving, Ray for embedding generation. At millions of users, this is the strongest public production reference turbopuffer has.

    Voxtral TTS: Flow Matching Migrates from Images to Audio

    Mistral's Voxtral TTS (~3-4B params, open weights) replaces depth transformers with auto-regressive flow matching, collapsing inference from K steps to 4-16 steps. A custom 12.5 Hz neural codec compresses audio to ~750 tokens/minute, enabling 30-minute generation in a 32K context. Claimed 68.4% win rate vs. ElevenLabs Flash v2.5. No evaluation methodology is disclosed for the win rate; treat it as directional.

    "Roblox's real innovation isn't the translation model; it's the three-layer inference caching architecture and reference-free quality estimation pipeline, both patterns you can steal for any latency-constrained serving system."
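
    The embedding cache is the layer worth copying first. A minimal sketch, assuming an encoder-decoder model where one source text fans out to many target languages; the encode/decode hooks and the unbounded dict store are illustrative, not Roblox's implementation (production would want LRU or TTL eviction):

```python
import hashlib
from typing import Any, Callable

class EmbeddingCache:
    """Layer 2 of a Roblox-style stack: run the encoder once per source
    text, reuse the hidden states for every target language in the fan-out."""

    def __init__(
        self,
        encode: Callable[[str], Any],        # text -> encoder hidden states
        decode: Callable[[Any, str], str],   # (hidden states, lang) -> text
    ):
        self.encode = encode
        self.decode = decode
        self._store: dict[str, Any] = {}     # illustrative: no eviction

    @staticmethod
    def _key(text: str) -> str:
        """Stable cache key for the source text."""
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def translate(self, text: str, targets: list[str]) -> dict[str, str]:
        k = self._key(text)
        if k not in self._store:             # encoder runs at most once
            self._store[k] = self.encode(text)
        hidden = self._store[k]
        # Decoder fan-out: N target languages, 1 encoder pass instead of N.
        return {lang: self.decode(hidden, lang) for lang in targets}
```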

    Action items

    • Audit your inference pipeline for encoder-decoder fan-out patterns and implement embedding caching between encoder and decoder stages this sprint
    • Benchmark turbopuffer against your current vector DB on your actual workload within 2 weeks
    • Evaluate Ray/Anyscale for embedding generation pipelines if running custom GPU orchestration
    • Build a reference-free quality estimation model for your generative outputs using human-labeled error taxonomies (budget 2-5K annotated examples)

    Sources: Roblox's MoE Translation Stack Is a Blueprint for Your Real-Time Inference Pipeline · Notion cut vector search costs 60% and hit 50ms p50 · Flow matching just jumped from images to audio — Voxtral TTS cuts inference from K steps to 4

  04

    Patch Now — Your LLM Orchestration Layer Is a Confirmed High-Value Target

    The Threat

    Two separate attack campaigns converge on your ML infrastructure this week. Langflow's CVE-2026-33017 (CVSS 9.3) enables full server takeover via a single HTTP request, exposing every API key and credential your LLM orchestration layer touches. Separately, TeamPCP's cascading supply chain attack compromised PyPI packages, including Telnyx's official SDK, and breached thousands of organizations via GitHub infrastructure. If you run LangChain, LangGraph, or Langflow, or pull Python packages from PyPI without hash-pinning, you have immediate exposure.

    Langflow: Full Server Compromise via One Request

    Multiple critical vulnerabilities across LangChain, LangGraph, and Langflow expose filesystem data, environment secrets, and conversation history. The Langflow RCE is the most severe: a single HTTP request grants full server control. Think about what sits in a typical deployment's environment:

    • OpenAI, Anthropic, and other model-provider API keys
    • Pinecone/Weaviate/Qdrant vector DB credentials
    • Database connection strings for feature stores
    • AWS/GCP/Azure cloud tokens

    A single exploit exposes your entire model-serving supply chain.

    PyPI Supply Chain: Legitimate Packages Compromised

    TeamPCP executed a multi-phase attack stealing devops credentials from GitHub infrastructure and code-security scanners. The Telnyx Python SDK, the real package rather than a typosquat, was backdoored on PyPI. Hash-pinning only protects you if you pinned before the compromise.

    Mandiant's M-Trends report adds urgency: attacker breakout time has collapsed to 22 seconds from initial access to interactive control. Human-led incident response is too slow by orders of magnitude.

    "Your LLM orchestration layer is now a confirmed high-value attack surface with CVSS 9.3 exploits. If you haven't patched and rotated secrets this week, your API keys are already someone else's API keys."
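
    Hash-pinning (see the action items below) only helps if it's enforced. A minimal CI lint sketch, assuming a pip-style requirements file where every dependency line should carry a --hash annotation so that pip install --require-hashes can refuse anything unpinned; the file path and exit behavior are illustrative:

```python
import sys
from pathlib import Path

def unhashed_requirements(path: str = "requirements.txt") -> list[str]:
    """Return dependency lines that lack a --hash annotation.

    pip treats a trailing backslash as a line continuation, so a pinned
    entry usually spans several lines; join continuations first.
    """
    raw = Path(path).read_text().replace("\\\n", " ")
    offenders = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "-r", "--")):
            continue                        # comments, includes, global flags
        if "--hash=" not in line:
            offenders.append(line)
    return offenders

if __name__ == "__main__":
    bad = unhashed_requirements()
    for entry in bad:
        print(f"UNPINNED: {entry}")
    sys.exit(1 if bad else 0)               # fail CI on any unhashed dep
```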

    Action items

    • Patch Langflow against CVE-2026-33017 and rotate every secret accessible to LangChain/LangGraph/Langflow today
    • Enable pip's --require-hashes flag and pin all PyPI dependencies to verified hashes in your ML pipeline requirements files this week
    • Pin all GitHub Actions to full commit SHAs (not tags) and audit third-party actions in your ML CI/CD pipelines (a lint sketch follows this list)
    • Move all LLM orchestration secrets to vault-based injection (HashiCorp Vault, AWS Secrets Manager) and network-isolate orchestration services behind authenticated API gateways
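
    For the Actions-pinning item above, a minimal sketch that flags workflow steps referencing an action by tag or branch instead of a full 40-character commit SHA. The directory layout is GitHub's standard; the regexes are illustrative and err toward over-flagging:

```python
import re
from pathlib import Path

SHA_RE = re.compile(r"@[0-9a-f]{40}$")                    # full commit SHA pin
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^\s#]+)", re.MULTILINE)

def unpinned_actions(workflow_dir: str = ".github/workflows") -> list[str]:
    """Find actions referenced by mutable tag/branch instead of a SHA."""
    offenders = []
    for wf in Path(workflow_dir).glob("*.y*ml"):
        for ref in USES_RE.findall(wf.read_text()):
            if ref.startswith("./"):                      # local action: no pin
                continue
            if not SHA_RE.search(ref):
                offenders.append(f"{wf.name}: {ref}")
    return offenders

if __name__ == "__main__":
    for entry in unpinned_actions():
        print(f"UNPINNED ACTION: {entry}")
```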

    Sources: Your LangChain pipeline is leaking API keys — CVSS 9.3 RCE in Langflow · Your PyPI packages and GitHub Actions are under active supply chain attack

◆ QUICK HITS

  • Update: Sora's full autopsy — $15M/day inference burn against $2.1M total lifetime revenue (7,100:1 cost-to-revenue ratio), Disney $1B deal collapse; most expensive inference economics case study to date

    Sora's $15M/day inference burn just killed video AI

  • Agents can be guilt-tripped: Northeastern researchers got Claude and Kimi agents to disable apps, leak secrets, and autonomously email the lab director via conversational manipulation — a new attack surface beyond prompt injection that RLHF doesn't address

    Your LLM agents score <1% where RL+search hits 12.58%

  • GRPO breaks for long trajectories: Mistral's chief scientist confirms GRPO doesn't work for 6+ hour reasoning chains — if your agentic pipeline uses GRPO-style training for multi-step workflows, you're building on a known-broken foundation

    Flow matching just jumped from images to audio — Voxtral TTS cuts inference from K steps to 4

  • Copilot injecting hidden HTML ('COPILOT CODING AGENT TIPS') into 11,000+ PRs across GitHub and GitLab — run grep -r 'COPILOT CODING AGENT' across your repos and add CI linting rules for HTML in markdown

    LLM evaluators agree with humans only 44% of the time

  • Self-distillation suppresses uncertainty: compressing reasoning traces can degrade calibration while accuracy benchmarks stay flat; add ECE and Brier scores to your distillation eval pipeline (a minimal sketch of both metrics follows these quick hits)

    TurboQuant delivers 8× faster attention with zero retraining

  • Feast ships native Prometheus + OpenTelemetry monitoring with SLO-driven alerting for latency, throughput, and feature retrieval health — enable this week if running Feast

    Notion cut vector search costs 60% and hit 50ms p50

  • Gemini 3.1 Flash-Lite at $0.25/M input tokens with 2.5× TTFT improvement — benchmark against your cheap-tier model for routing threshold updates

    Sora's $15M/day inference burn just killed video AI

  • Meta routing production Meta AI requests through Google's Gemini due to Avocado delays — reduce Llama-ecosystem dependency in your long-term open-weight model strategy

    AutoBe's constrained harness turns 6.75% function-calling into 99.8%

  • Update: Anthropic's leaked financials show ~$32B OpEx against $18B revenue (2026), with only 24% EBITDA margin projected at $200B revenue by 2031 — current API prices are likely subsidized

    Anthropic's $32B/yr burn rate reveals the real cost of inference at scale

  • LinkedIn's Autopilot for Torch: AI agent for TF→PyTorch migration uses automated verification loops (trainability → numerical stability → metric parity) — steal the verify-before-trust pattern for any AI-assisted code generation

    Notion cut vector search costs 60% and hit 50ms p50
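
For the self-distillation quick hit above: the Brier score is the mean squared error of predicted probabilities, and Expected Calibration Error bins predictions by confidence and sums the confidence-accuracy gap per bin. A minimal sketch for binary outcomes; the 10-bin equal-width scheme is a conventional choice, not prescribed by the source:

```python
def brier_score(probs: list[float], labels: list[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(
    probs: list[float], labels: list[int], n_bins: int = 10
) -> float:
    """Weighted gap between mean confidence and accuracy per equal-width bin."""
    n = len(probs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, p in enumerate(probs)
               if lo < p <= hi or (b == 0 and p == 0)]
        if not idx:
            continue
        conf = sum(probs[i] for i in idx) / len(idx)   # mean confidence
        acc = sum(labels[i] for i in idx) / len(idx)   # empirical accuracy
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# A distilled model can hold accuracy steady while both numbers get worse.
probs = [0.9, 0.8, 0.95, 0.7, 0.99]
labels = [1, 0, 1, 1, 0]
print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```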

BOTTOM LINE

Three independent results converged today: RL+search beats frontier LLMs 30× on interactive reasoning, Meta's open-source self-improving agents deliver 2-6× gains by rewriting scaffolding on frozen models, and a constrained output harness turns 6.75% function-calling into 99.8% — meanwhile your LLM evaluator probably agrees with humans less often than two humans agree with each other (44% vs. 65%), and Langflow's CVSS 9.3 vulnerability means every API key your orchestration layer touches may already be compromised.

Frequently asked

How much better is RL plus graph search than frontier LLMs on ARC-AGI-3?
The RL+search approach scored 12.58% versus Gemini 3.1 Pro's 0.37% — roughly a 30× gap — with GPT 5.4 High at 0.26%, Opus 4.6 at 0.25%, and Grok at 0.00%. Human testers solved 100% of environments. The search system isn't a larger model; it's classical graph search using the LLM for hypothesis generation, though compute budget comparisons between approaches are not published.
What concrete steps turn a 6.75% function-calling success rate into 99.8%?
AutoBe wrapped qwen3-coder-next in a constrained harness following the pattern: constrain → verify → diagnose → correct. Type schemas narrow the output space, a compiler provides binary pass/fail verification, and structured error messages feed actionable correction signals back to the model. The approach is model-agnostic, so you can apply it to any function-calling agent without retraining or switching providers.
Why is using an LLM to verify another LLM's output a dangerous pattern?
LLM-as-judge agrees with human raters only 44% of the time, versus 65% human-human agreement, and LLM-LLM evaluation produces correlated failures from shared training distributions. The Oregon court case demonstrated this concretely: Google AI Search confirmed 15 fabricated legal citations as real because plausible-sounding hallucinations get 'recognized' as familiar. Familiarity is not factuality, and cascading AI verification provides false confidence rather than error correction.
What safety constraint is non-negotiable when deploying self-modifying agents?
The evaluation function must be immutable and external to the agent — the agent cannot be allowed to modify the procedure that scores its own performance. This is the explicit guardrail in Meta's HyperAgents (DGM-H) design. Without it, the system becomes a Goodhart machine that optimizes the metric instead of the task, and self-improvement loops will converge on reward hacking rather than genuine capability gains.
What immediate actions reduce exposure to the Langflow RCE and PyPI supply chain attacks?
Patch Langflow against CVE-2026-33017 (CVSS 9.3, full server takeover from one HTTP request) and rotate every secret accessible to LangChain, LangGraph, or Langflow today. Enable pip's --require-hashes flag, pin all GitHub Actions to full commit SHAs rather than mutable tags, and migrate orchestration secrets from environment variables to vault-based injection. Mandiant reports attacker breakout time has collapsed to 22 seconds, so manual response is too slow.
