ARC-AGI-3 Pegs Frontier Models Below 1% on Reasoning
Topics: Data Infrastructure · Agentic AI · AI Regulation
ARC-AGI-3 just scored every frontier model below 1% on interactive reasoning tasks humans solve at 100% — Gemini Pro at 0.37%, GPT-5.4 at 0.26%, Grok-4.20 at literal 0%. If your agentic pipeline assumes the LLM can discover rules or form strategies in unfamiliar environments, that assumption now has a measured empirical ceiling. Design your agents for tool-orchestrated pattern matching with human fallbacks, not open-ended reasoning — the competitive advantage is in the scaffold, not the model.
◆ INTELLIGENCE MAP
01 ARC-AGI-3 Resets the Reasoning Scoreboard to Near-Zero
act now · All frontier models score <1% on 135 interactive mini-games humans solve at 100%. Gemini Pro leads at 0.37%, GPT-5.4 at 0.26%, Grok-4.20 at 0%. Labs previously pushed ARC-AGI-2 from 3% to ~50% by training on it — ARC-AGI-3 is designed to resist that.
02 Three Database Fixes That Outperform Your Last Model Optimization
act now · Snowflake OR-joins silently force Cartesian products — rewriting as UNION ALL yields 100–200x speedups. Postgres ON CONFLICT DO UPDATE writes WAL even on no-ops, doubling Datadog's disk writes. Airbnb's COVID-era fix decoupled booking volume from lead-time composition for shock-resilient forecasting.
03 ML Infrastructure CVE Cluster Expands Beyond LiteLLM
monitor · Six critical CVEs hit ML-specific tools this week: Langflow RCE exploited in 20 hours, MLflow arbitrary file write (CVSS 9.1), NVIDIA APEX pickle RCE in PyTorch <2.6 (CVSS 9.0), gRPC-Go auth bypass (CVSS 9.1), Harbor hard-coded creds (CVSS 9.4). Pattern: ML tools assume trusted environments and ship without input validation.
- 01 Mesop (Google): CVSS 10.0
- 02 Langflow RCE: CVSS 9.9
- 03 Harbor Registry: CVSS 9.4
- 04 MLflow pyfunc: CVSS 9.1
- 05 gRPC-Go bypass: CVSS 9.1
- 06 NVIDIA APEX: CVSS 9.0
04 Recommendation Algorithms Ruled 'Defective Products' in Court
monitor · California jury found Meta ($4.2M) and YouTube ($1.8M) negligent for addictive design — targeting algorithmic features, not content, bypassing Section 230. The legal theory extends to AI chatbots, and thousands of pending cases will use this as a template. Your objective function is now discoverable evidence.
- Meta (CA bellwether): $4.2M
- YouTube (CA bellwether): $1.8M
05 Model Commoditization Accelerates — Data Moat Is All That Remains
background · Xiaomi anonymously shipped a 1T-param model (Hunter Alpha) that users mistook for DeepSeek v4. Frontier training costs falling to $50–100M. Open-source monetizable spread closing faster than capability spread. Apple's Gemini distillation deal validates teacher→domain-adapt→distill as the production edge deployment pattern.
- Frontier training cost (est.): $100M
- Open-source gap: 3 months
◆ DEEP DIVES
01 ARC-AGI-3: Your Agentic Pipeline Has a Sub-1% Reasoning Floor
<h3>The Benchmark That Breaks Everything</h3><p>ARC-AGI-3 launched with <strong>135 interactive mini-games across ~1,000 levels</strong>, all verified as solvable by humans on first contact with no training. The results are devastating for anyone betting on agentic AI reasoning:</p><table><thead><tr><th>Model</th><th>Lab</th><th>ARC-AGI-3 Score</th></tr></thead><tbody><tr><td><strong>Gemini Pro</strong></td><td>Google</td><td>0.37%</td></tr><tr><td>GPT-5.4 High</td><td>OpenAI</td><td>0.26%</td></tr><tr><td>Opus 4.6</td><td>Anthropic</td><td>0.25%</td></tr><tr><td>Grok-4.20</td><td>xAI</td><td>0.00%</td></tr><tr><td><em>Humans</em></td><td>—</td><td><strong>100%</strong></td></tr></tbody></table><p>The spread between first and last place among frontier models is <strong>0.37 percentage points</strong> — statistically indistinguishable noise. This isn't a tuning gap; it's a <strong>structural limitation of current approaches</strong>. Chain-of-thought, tree-of-thought, and tool use are all insufficient for adaptive real-time reasoning in novel environments.</p><hr><h3>Why This Benchmark Is Different</h3><p>ARC-AGI-3 tests <strong>zero-instruction game-like scenarios</strong> requiring rule discovery, goal formation, and strategy planning entirely from interaction. This is fundamentally different from standard benchmarks that test pattern completion over trained distributions. The critical context: labs spent millions training specifically on ARC-AGI-2 and pushed scores from <strong>3% to ~50% in under a year</strong>. ARC-AGI-3 is designed to resist this Goodhart's Law dynamic.</p><blockquote>A model that scores 90% on MMLU but <1% on ARC-AGI-3 has fundamentally different reasoning capabilities than its leaderboard position suggests.</blockquote><p>Five independent sources corroborate these scores. 
The uniform failure across architecturally different models from four separate labs confirms this is <strong>not a prompt engineering problem</strong> — it's a capability ceiling. 25 games are publicly available for human play, and spending an hour with them calibrates your intuition about what these models genuinely cannot do.</p><hr><h3>What This Means for Your Agents</h3><p>If your agentic architecture assumes the LLM can discover rules in unfamiliar environments, plan strategies without explicit instructions, or generalize from zero-shot interaction, the empirical evidence is now clear: <strong>it fails at rates above 99%</strong>. The competitive advantage isn't picking the "smartest" model — all models reason at roughly the same (near-zero) level on novel tasks. The advantage is in the <strong>scaffold design</strong>: tool-orchestrated pattern matching, structured fallback logic, and human-in-the-loop gates at reasoning boundaries.</p><p>Separately, new research shows <strong>step-wise RL rewards improve multi-step agent task success by up to 40%</strong> compared to terminal-only rewards. This is directly actionable: most agent training frameworks default to terminal rewards, and adding intermediate signals is a reward-architecture change, not a model change. <em>The 40% figure lacks full methodology disclosure, but aligns with classical reward shaping theory and is worth a controlled experiment.</em></p>
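The dense-vs-sparse reward change is small enough to sketch. Below is a minimal Python illustration, assuming a hypothetical episode trace with per-step subgoal flags; the cited research's exact environment and weighting are not disclosed.

```python
# Minimal sketch of dense (step-wise) vs. sparse (terminal-only) rewards
# for a multi-step agent episode. The episode format, subgoal flags, and
# bonus weight are hypothetical. The point: dense rewards assign credit
# at the steps where verifiable progress happened.

def terminal_reward(steps, success):
    """Sparse: one signal at episode end, zero everywhere else."""
    return [0.0] * (len(steps) - 1) + [1.0 if success else 0.0]

def stepwise_reward(steps, success, subgoal_bonus=0.2):
    """Dense: small bonuses for verifiable intermediate subgoals
    (e.g. a tool call that returned valid output), plus the terminal signal."""
    rewards = [subgoal_bonus if s["subgoal_met"] else 0.0 for s in steps]
    rewards[-1] += 1.0 if success else 0.0
    return rewards

episode = [
    {"action": "inspect_env",     "subgoal_met": True},
    {"action": "bad_tool_call",   "subgoal_met": False},
    {"action": "retry_tool_call", "subgoal_met": True},
]

print(terminal_reward(episode, success=True))  # credit only at the very end
print(stepwise_reward(episode, success=True))  # credit at each verified step
```

The controlled experiment from the action items is exactly this swap: same agent, same tasks, `terminal_reward` vs. `stepwise_reward` as the training signal.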
Action items
- Run your production LLMs against ARC-AGI-3's 25 public games this sprint to establish a reasoning capability baseline
- Add interactive reasoning tasks (rule discovery, goal formation from interaction) to your agent eval pipeline by end of quarter
- Experiment with per-step RL rewards in your agent training pipelines — same agent, same tasks, dense vs. sparse rewards
- Monitor ARC-AGI-3 leaderboard progression over 6 months to calibrate model selection decisions
Sources:TurboQuant cuts your KV-cache 6x with zero accuracy loss — and ARC-AGI-3 just exposed your frontier model's reasoning ceiling · TurboQuant claims 8x inference + 6x memory via KV-cache compression — here's what to validate before you rearchitect your serving stack · TurboQuant cuts your KV cache to 3 bits with zero accuracy loss — 8x attention speedup on H100s · ARC-AGI-3 breaks every model (<1%) — your reasoning benchmarks need a hard reset · TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
02 Three Database Fixes Worth More Than Your Last Model Optimization
<h3>Snowflake's Disjunctive Join Trap: 100–200x Hidden Tax</h3><p>When you write <code>ON a.id = b.id OR a.alt_id = b.alt_id</code> in Snowflake, the <strong>hash join optimizer silently gives up</strong>. It can't partition on an OR condition, so it falls back to a <strong>Cartesian product</strong> — joining every row against every other row, then filtering. The fix: rewrite as two separate equi-joins with UNION ALL for <strong>100–200x speedups</strong>.</p><p><em>The magnitude is directionally believable (Cartesian-to-hash-join is exactly that kind of asymptotic improvement), though actual gains depend on table sizes and join selectivity.</em> This should be an automated lint rule in your SQL CI pipeline. Every Snowflake query in your feature engineering and training data pipelines with OR in a JOIN clause is a potential order-of-magnitude win.</p><hr><h3>Postgres Upsert: The No-Op Write Amplification Bug</h3><p>Datadog discovered that <code>ON CONFLICT DO UPDATE</code> in Postgres <strong>always acquires a row lock and writes to WAL</strong>, even when the incoming data is identical to the existing row. At the scale of millions of ephemeral hosts, this <strong>doubled disk writes and quadrupled WAL syncs</strong>. The fix: add a <code>WHERE</code> clause comparing old vs. new values to skip no-op updates.</p><p>This is relevant anywhere you're doing high-frequency upserts with mostly unchanged data — <strong>feature freshness tracking, model status heartbeats, entity metadata refreshes</strong>. The write amplification is invisible unless you're monitoring WAL metrics specifically.</p><hr><h3>Airbnb's Forecasting Decomposition: A Distribution Shift Playbook</h3><p>In March 2020, Airbnb's demand models broke across <strong>three simultaneous failure modes</strong>: massive booking volume swings, unpredictable cancellation spikes, and the collapse of the normal booking-to-travel-date relationship. 
A monolithic model couldn't isolate which signal was shifting.</p><p>The fix was architectural: <strong>decouple forecasting into two independent models</strong> — one for gross booking metrics on the booking-date axis, one for lead-time composition (what proportion of bookings convert to trips on future dates). Each component can be independently recalibrated when one signal regime-shifts while the other holds steady.</p><blockquote>Separate the signals that have different failure modes so you don't have to retrain everything when one distribution shifts.</blockquote><p>This is the same intuition behind mixture-of-experts and modular forecasting. No quantitative recovery metrics are published, but the architectural principle is sound and generalizable to any multi-step funnel model where upstream volume and downstream conversion have independent drift dynamics.</p>
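The OR-join rewrite above is easy to get subtly wrong: plain UNION ALL double-counts pairs that satisfy both predicates. A sketch of the safe form, checked for row-equivalence on SQLite, which stands in for Snowflake here (the rewrite's semantics are portable; the 100–200x performance effect is Snowflake-specific). Table and column names are illustrative.

```python
# Verify the OR-join -> UNION ALL rewrite returns identical rows.
# Assumes non-NULL join keys; NULLs in id/alt_id need explicit handling.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INT, alt_id INT, val TEXT);
    CREATE TABLE b (id INT, alt_id INT, val TEXT);
    INSERT INTO a VALUES (1, 10, 'a1'), (2, 20, 'a2'), (3, 30, 'a3');
    INSERT INTO b VALUES (1, 99, 'b1'), (9, 20, 'b2'), (3, 30, 'b3');
""")

# Original form: the disjunctive predicate defeats the hash join.
or_join = """
    SELECT a.val, b.val FROM a JOIN b
    ON a.id = b.id OR a.alt_id = b.alt_id
"""

# Rewrite: two equi-joins. The second branch excludes pairs already
# matched by the first, so rows satisfying BOTH predicates (the id=3
# pair here) are not double-counted, as plain UNION ALL alone would.
union_rewrite = """
    SELECT a.val, b.val FROM a JOIN b ON a.id = b.id
    UNION ALL
    SELECT a.val, b.val FROM a JOIN b
    ON a.alt_id = b.alt_id AND a.id <> b.id
"""

assert sorted(conn.execute(or_join)) == sorted(conn.execute(union_rewrite))
```

The exclusion predicate in the second branch is the part worth encoding in the lint rule, not just the presence of UNION ALL.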
Action items
- Grep all Snowflake SQL for OR in JOIN clauses today — every instance is a potential 100x+ speedup
- Check Postgres-backed feature stores for high-frequency upsert patterns where most rows don't change; add WHERE clause to skip no-ops
- Refactor multi-step forecasting models to decompose volume from composition signals, following Airbnb's pattern
- Backtest forecasting decomposition by simulating regime changes on historical data to validate the architecture before the next shock
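The Postgres upsert guard from the action items is a one-line WHERE on the conflict action. A sketch on SQLite, which shares the ON CONFLICT ... DO UPDATE ... WHERE syntax; the Postgres spelling of the null-safe comparison is IS DISTINCT FROM.

```python
# No-op upsert guard: the UPDATE only fires when the value actually
# changes. On Postgres the skipped update is what avoids the row lock
# and WAL write; here we just observe that the guarded statement touches
# zero rows when nothing changed. Schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO hosts VALUES (1, 'up')")

# Postgres spelling of the guard:
#   WHERE hosts.status IS DISTINCT FROM excluded.status
# SQLite's null-safe inequality is IS NOT.
guarded = """
    INSERT INTO hosts VALUES (?, ?)
    ON CONFLICT(id) DO UPDATE SET status = excluded.status
    WHERE hosts.status IS NOT excluded.status
"""

before = conn.total_changes
conn.execute(guarded, (1, "up"))     # identical value: update skipped
noop_writes = conn.total_changes - before

conn.execute(guarded, (1, "down"))   # real change: update fires
real_writes = conn.total_changes - before - noop_writes

assert noop_writes == 0 and real_writes == 1
```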
Sources:Your Snowflake joins may be 200x slower than necessary — plus Airbnb's black-swan-proof forecasting architecture · Airbnb rebuilt its forecasting models for shock resilience — here's what that means for your time-series pipelines
03 Six New Critical CVEs Hit ML-Specific Infrastructure — The Attack Surface Is Expanding
<h3>This Week's ML Vulnerability Cluster</h3><p>The SANS @RISK bulletin revealed <strong>six critical CVEs targeting core ML infrastructure tools</strong> — tools many teams run in production today. The dominant pattern: ML platforms assume a trusted environment. They're built for researcher notebooks and deployed into multi-tenant production without security hardening.</p><table><thead><tr><th>Tool</th><th>CVE</th><th>CVSS</th><th>Vulnerability</th><th>Status</th></tr></thead><tbody><tr><td><strong>Langflow</strong></td><td>CVE-2026-33017+</td><td>9.1–9.9</td><td>Unauth RCE, file write, shell injection</td><td><strong>Exploited in 20 hrs</strong></td></tr><tr><td><strong>MLflow</strong></td><td>CVE-2025-15031</td><td>9.1</td><td>Arbitrary file write via tar.gz Zip Slip</td><td>High risk in multi-tenant</td></tr><tr><td><strong>NVIDIA APEX</strong></td><td>CVE-2025-33244</td><td>9.0</td><td>Pickle deserialization RCE (PyTorch <2.6)</td><td>Patch: upgrade PyTorch</td></tr><tr><td><strong>gRPC-Go</strong></td><td>CVE-2026-33186</td><td>9.1</td><td>Auth bypass via HTTP/2 :path header</td><td>22,844 GitHub stars exposed</td></tr><tr><td><strong>Harbor</strong></td><td>CVE-2026-4404</td><td>9.4</td><td>Hard-coded credentials</td><td>Upgrade from ≤2.15.0</td></tr><tr><td><strong>Mesop (Google)</strong></td><td>CVE-2026-33054/57</td><td>9.8–10.0</td><td>Path traversal + code injection</td><td>6,521 GitHub stars</td></tr></tbody></table><hr><h3>The Pickle Problem — Three RCEs in One Week</h3><p>Three distinct pickle deserialization RCEs appeared in a single weekly bulletin: NVIDIA APEX, OmniGen2-RL, and MLflow's tar.gz variant. <strong>Pickle is essentially eval() with extra steps</strong>, and it remains wired into the default serialization path of most ML frameworks. 
The migration path exists — <strong>safetensors</strong> for model weights, <strong>protobuf</strong> for structured data — but adoption remains slow.</p><h3>Infrastructure Components You Probably Run</h3><p>The gRPC-Go vulnerability is particularly insidious. With <strong>22,844 GitHub stars</strong>, it's a transitive dependency in countless Go-based serving systems. The authorization bypass via HTTP/2 <code>:path</code> pseudo-header manipulation means your model endpoint's access control may be silently ineffective. This bug doesn't show up in application-layer testing because it operates at the <strong>protocol layer</strong>.</p><p>Meanwhile, AWS Bedrock AgentCore's "complete isolation" sandbox was demonstrated to allow <strong>bidirectional C2 via DNS tunneling</strong> — a full interactive reverse shell from a supposedly air-gapped sandbox. AWS's response: they'll update the documentation, not fix the bug. The researcher received a $100 gift card.</p><blockquote>If your ML platform doesn't get the same security hardening as your databases, it's a matter of when, not if.</blockquote>
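The "eval() with extra steps" claim takes a few lines to demonstrate. This is a deliberately benign proof of concept, not any of the exploits above: the payload writes a marker file where a real exploit would spawn a shell. It is also why safetensors, which stores raw tensors with no code path, is the recommended replacement for pickled weights.

```python
# Benign proof that unpickling executes attacker-chosen code: pickle's
# __reduce__ hook lets any object name a callable to run at load time.
import os
import pickle

MARKER = "pwned_by_pickle.txt"
if os.path.exists(MARKER):
    os.remove(MARKER)

class Payload:
    def __reduce__(self):
        # pickle.loads will call exec(<string>) during deserialization.
        return (exec, (f"open('{MARKER}', 'w').write('rce')",))

blob = pickle.dumps(Payload())   # what a malicious checkpoint contains
assert not os.path.exists(MARKER)
pickle.loads(blob)               # merely LOADING the bytes runs the code
assert os.path.exists(MARKER)
os.remove(MARKER)
```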
Action items
- Run 'python -c "import torch; print(torch.__version__)"' across all training environments — anything below 2.6 is vulnerable via APEX; upgrade and remove standalone APEX (replaced by torch.amp)
- Pin and verify gRPC-Go version ≥1.79.3 across all model serving infrastructure by checking go.sum files
- Implement model artifact scanning in MLflow — reject tar.gz artifacts with path traversal patterns and sandbox extraction for pyfunc models
- If using AWS Bedrock AgentCore sandbox, implement DNS egress filtering as compensating control — do not rely on AWS's isolation claim
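The artifact-scanning item above reduces to a pre-extraction member check. A generic sketch of the traversal guard (not MLflow's API; a check you would run before any extractall):

```python
# Pre-extraction guard against Zip Slip / path traversal in tar.gz
# artifacts: reject any member whose resolved path escapes the
# extraction root, and any symlink/hardlink member that could redirect
# later writes. (Python 3.12+ tarfile also offers extractall(
# filter="data"), which enforces similar rules natively.)
import os
import tarfile

def is_safe_tar(path, dest="."):
    """Return False if any member of the archive would escape dest."""
    dest = os.path.realpath(dest)
    with tarfile.open(path, "r:*") as tf:
        for m in tf.getmembers():
            target = os.path.realpath(os.path.join(dest, m.name))
            if not target.startswith(dest + os.sep):
                return False   # member resolves outside the root ("../...")
            if m.issym() or m.islnk():
                return False   # link members can redirect writes outside
    return True
```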
Sources:Your ML stack has 6 critical RCEs this week — Langflow, MLflow, PyTorch, LoLLMs all compromised · Your LiteLLM dependency is compromised — pin to ≤1.82.6 and rotate all credentials now · Your LiteLLM proxy may have leaked every cloud credential — check if you ran Python on March 24 · Your LiteLLM proxy and Trivy scanner were compromised — audit your ML pipeline dependencies now
04 Courts Validated 'Defective Product Design' Against Recommendation Algorithms — Your Objective Function Is Evidence
<h3>The Legal Theory That Bypasses Section 230</h3><p>A California jury found <strong>Meta ($4.2M) and YouTube ($1.8M) liable for negligence</strong> in the first bellwether social media addiction case. The critical innovation: plaintiffs didn't argue about harmful content. They argued <strong>platform design features</strong> — infinite scroll, algorithmic recommendations, engagement-maximizing mechanics — constituted negligence. The jury agreed, and <strong>Section 230 didn't save them</strong> because the theory targets the algorithm, not the content.</p><table><thead><tr><th>Legal Theory</th><th>Target</th><th>Section 230 Defense</th><th>Status</th></tr></thead><tbody><tr><td>Content liability</td><td>User-generated content</td><td>Protected</td><td>Traditional approach</td></tr><tr><td><strong>Product design liability</strong></td><td>Algorithmic features (recs, scroll)</td><td><strong>Not protected</strong></td><td>Jury-validated</td></tr><tr><td>Child safety failure</td><td>Platform safety systems</td><td>Not invoked</td><td>$375M NM verdict</td></tr></tbody></table><p>One day earlier, a New Mexico jury hit Meta with <strong>$375M</strong> for failing to protect minors from predators. Both companies will appeal, but the legal attack vector is validated: <em>target the algorithm, not the content</em>. Thousands of similar cases are queued, including federal cases from school districts naming Meta, YouTube, TikTok, and Snap. The theory is being <strong>explicitly extended to AI chatbot makers</strong> including OpenAI and Google.</p><hr><h3>What This Means for Your Models</h3><p>Five independent sources confirm the same analysis: your <strong>optimization objective is now legally discoverable evidence</strong>. If your recommendation system maximizes watch time, and your A/B test logs show you chose the variant that increased session duration for adolescents, that's exhibit A in litigation. 
The legal standard is "negligence," not "intent" — you don't have to have intended harm for liability to attach.</p><blockquote>Document your safety trade-offs like they'll be read by a jury, because they might be.</blockquote><p><em>Both verdicts will be appealed, and a single bellwether doesn't set binding precedent. But it signals how juries perceive algorithmic engagement systems — and creates the template for thousands of upcoming cases.</em></p>
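What an "explicit safety constraint" can look like in ranking code, purely as illustration: every metric name, weight, and threshold below is hypothetical. The point is that the trade-off becomes a named, documented, auditable parameter rather than an emergent property of an engagement-only objective.

```python
# Illustrative ranking objective with a documented wellbeing penalty and
# a hard per-cohort engagement cap. All signals and values hypothetical.

def score(item, user, wellbeing_weight=0.3, minor_session_cap_min=45):
    # Hard cap for minors: past the session limit, the item is
    # unrankable regardless of its engagement value.
    if user["is_minor"] and user["session_min"] >= minor_session_cap_min:
        return float("-inf")
    engagement = item["p_click"] * item["expected_watch_min"]
    # Explicit penalty on a harm-adjacent signal, kept next to the
    # engagement term so the trade-off is auditable after the fact.
    harm_proxy = item["p_compulsive_pattern"]
    return engagement - wellbeing_weight * harm_proxy
```

The cap corresponds to the "configurable engagement caps per user cohort" action item; the named penalty term is the documented safety constraint the audit item asks for.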
Action items
- Audit your recommendation system's objective function for harm-adjacent proxy metrics (session duration, scroll depth, notification re-engagement) and document explicit safety constraints this quarter
- Add user wellbeing metrics to your A/B testing framework alongside engagement KPIs
- If your platform serves minors, implement configurable engagement caps per user cohort as a model feature
Sources:Your recommendation engine is now a legal liability — platform design found negligent in landmark trial · Your recommendation models may now be legally 'defective products' — plus AI agent deployment hits inflection · Addiction liability verdict could rewrite your recommendation objective functions · Social media negligence verdict may force your engagement models into a legal minefield
◆ QUICK HITS
Update: TurboQuant's 8x KV-cache claim faces direct contradiction — Auto-Inference-Optimiser found KV-cache quantization hurt throughput on Apple Silicon, proving gains are hardware-dependent; realistic speedup over FP16 is likely 2–4x, not 8x
TurboQuant claims 8x inference + 6x memory via KV-cache compression — here's what to validate before you rearchitect your serving stack
Update: LiteLLM forensics reveal March 24 attack window (09:00–13:30 UTC), .pth file persistence that survives downgrades, and the payload executing on every Python interpreter startup — check site-packages for unexpected .pth files even if you've already patched
Your LiteLLM proxy may have leaked every cloud credential — check if you ran Python on March 24
ASMR agentic retrieval claims ~99% on LongMemEval using 12-agent decision forest instead of vector search — open-source release planned early April 2026; expect 100–1000x cost increase per query vs. ANN lookup
Your vector search pipeline may be obsolete — ASMR's 12-agent retrieval hits ~99% on LongMemEval
Apple's Gemini distillation hitting domain mismatch: Gemini was tuned for chatbot/coding tasks that don't match device-assistant use cases — textbook distribution shift in knowledge distillation worth studying for your own teacher→student pipelines
TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
Novo Nordisk killed Claude-powered 'Found Data' tool for mining decades of clinical trial data — expensive, no noticeable advances; CDO: 'If I can do it better in Excel, stay in Excel'
Novo Nordisk's multi-model agent architecture has real MLOps lessons — and one honest failure case worth studying
AWS Security Agent scores 92.5% on CVE Bench with scaffolding, drops to 65% when LLM knowledge predates the benchmark — a 27.5pp delta measuring memorization, not generalization; add knowledge-cutoff ablations to your eval standard
Your LiteLLM proxy and Trivy scanner were compromised — audit your ML pipeline dependencies now
Volga rewrote from Python+Ray to Rust (DataFusion+Arrow+SlateDB), unifying streaming, batch, and request-time ML feature compute — early but architecturally compelling replacement for Flink+Spark+Redis stacks
Your Snowflake joins may be 200x slower than necessary — plus Airbnb's black-swan-proof forecasting architecture
GitHub Copilot now uses your code for AI training by default (opt-out, not opt-in) — audit org settings today if proprietary model architectures or pipeline logic live in Copilot-enabled repos
TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
Reddit removes ~100,000 bot accounts per day — if Reddit data is in your NLP training corpus, expect measurable distribution shift as bot-generated text gets purged
TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
68% of executives report 10%+ energy cost increases from AI workloads in past 12 months (Pure Storage/Everpure survey, likely biased upward) — re-baseline your training compute cost models with escalation scenarios
AI-for-science hits two milestones — but your compute budget faces regulatory and energy headwinds
Update: Sora pivot confirmed as video-gen→robotics training — OpenAI believes diffusion world models have higher ROI as physics simulators for embodied AI than consumer content; if you have video diffusion skills, evaluate sim-to-real transfer
Video world models → robotics training: Sora's death signals where your generative modeling skills pay off next
BOTTOM LINE
ARC-AGI-3 scored every frontier model below 1% on reasoning tasks humans solve at 100%, confirming that agentic pipelines relying on novel LLM reasoning have a near-zero capability floor. Meanwhile, a Snowflake OR-join audit and a Postgres upsert WHERE clause will deliver more immediate compute savings than your last model optimization; six new critical CVEs prove ML infrastructure is now a first-class attack surface requiring database-grade hardening; and a California jury just ruled that recommendation algorithms can be legally 'defective products' whose optimization objectives are courtroom evidence.
Frequently asked
- What ARC-AGI-3 score did frontier models actually achieve?
- All frontier models scored below 1% on ARC-AGI-3's interactive reasoning tasks that humans solve at 100%. Specifically: Gemini Pro at 0.37%, GPT-5.4 High at 0.26%, Anthropic's Opus 4.6 at 0.25%, and Grok-4.20 at literal 0%. The 0.37-point spread across four architecturally distinct labs indicates a structural capability ceiling, not a tuning gap.
- How should I redesign my agentic pipeline given these reasoning limits?
- Shift from open-ended reasoning to tool-orchestrated pattern matching with structured human fallbacks at reasoning boundaries. The competitive advantage is in the scaffold, not the model — all frontier models reason at roughly the same near-zero level on novel tasks. Also consider replacing terminal-only RL rewards with step-wise rewards, which early research suggests can improve multi-step agent success by up to 40%.
- Why does 'ON a.id = b.id OR a.alt_id = b.alt_id' destroy Snowflake performance?
- Snowflake's hash join optimizer cannot partition on an OR condition, so it silently falls back to a Cartesian product and filters afterward — yielding 100–200x slowdowns on non-trivial tables. The fix is to rewrite the query as two separate equi-joins combined with UNION ALL. This should be codified as an automated SQL lint rule in CI, since every OR-in-JOIN in your feature pipelines is a potential order-of-magnitude win.
- Which ML infrastructure CVEs need immediate patching?
- Six critical CVEs hit ML tooling this week, with three requiring action now: NVIDIA APEX pickle deserialization RCE (CVE-2025-33244, fixed by upgrading to PyTorch ≥2.6), gRPC-Go HTTP/2 :path auth bypass (CVE-2026-33186, pin to ≥1.79.3), and MLflow tar.gz Zip Slip arbitrary file write (CVE-2025-15031). Langflow unauth RCE was exploited in the wild within 20 hours of disclosure. Separately, AWS Bedrock AgentCore's sandbox allows DNS-tunneled C2 and AWS declined to fix — implement DNS egress filtering as a compensating control.
- Why aren't social media platforms protected by Section 230 in these verdicts?
- The plaintiffs targeted algorithmic design features — infinite scroll, recommendation systems, engagement-maximizing mechanics — rather than user-generated content, which is the narrow scope of Section 230 immunity. Juries found Meta ($4.2M) and YouTube ($1.8M) negligent under a product-defect theory, and separately hit Meta with $375M in New Mexico for child safety failures. The practical implication for data scientists: your optimization objective, A/B test logs, and safety trade-off documentation are now legally discoverable evidence in thousands of queued cases, which are being explicitly extended to AI chatbot makers.