ARC-AGI-3 Pegs Frontier Models Below 1% on Reasoning
Topics: Data Infrastructure · Agentic AI · AI Regulation
ARC-AGI-3 just scored every frontier model below 1% on interactive reasoning tasks humans solve at 100% — Gemini Pro at 0.37%, GPT-5.4 at 0.26%, Grok-4.20 at literal 0%. If your agentic pipeline assumes the LLM can discover rules or form strategies in unfamiliar environments, that assumption now has a measured empirical ceiling. Design your agents for tool-orchestrated pattern matching with human fallbacks, not open-ended reasoning — the competitive advantage is in the scaffold, not the model.
◆ INTELLIGENCE MAP
01 ARC-AGI-3 Resets the Reasoning Scoreboard to Near-Zero
act now · All frontier models score <1% on 135 interactive mini-games humans solve at 100%. Gemini Pro leads at 0.37%, GPT-5.4 at 0.26%, Grok-4.20 at 0%. Labs previously pushed ARC-AGI-2 from 3% to ~50% by training on it — ARC-AGI-3 is designed to resist that.
02 Three Database Fixes That Outperform Your Last Model Optimization
act now · Snowflake OR-joins silently force Cartesian products — rewriting as UNION ALL yields 100–200x speedups. Postgres ON CONFLICT DO UPDATE writes WAL even on no-ops, doubling Datadog's disk writes. Airbnb's COVID-era fix decoupled booking volume from lead-time composition for shock-resilient forecasting.
03 ML Infrastructure CVE Cluster Expands Beyond LiteLLM
monitor · Six critical CVEs hit ML-specific tools this week: Langflow RCE exploited in 20 hours, MLflow arbitrary file write (CVSS 9.1), NVIDIA APEX pickle RCE in PyTorch <2.6 (CVSS 9.0), gRPC-Go auth bypass (CVSS 9.1), Harbor hard-coded creds (CVSS 9.4). Pattern: ML tools assume trusted environments and ship without input validation.
- 01 Mesop (Google): CVSS 10.0
- 02 Langflow RCE: CVSS 9.9
- 03 Harbor Registry: CVSS 9.4
- 04 MLflow pyfunc: CVSS 9.1
- 05 gRPC-Go bypass: CVSS 9.1
- 06 NVIDIA APEX: CVSS 9.0
04 Recommendation Algorithms Ruled 'Defective Products' in Court
monitor · California jury found Meta ($4.2M) and YouTube ($1.8M) negligent for addictive design — targeting algorithmic features, not content, bypassing Section 230. The legal theory extends to AI chatbots, and thousands of pending cases will use this as a template. Your objective function is now discoverable evidence.
- Meta (CA bellwether): $4.2M
- YouTube (CA bellwether): $1.8M
05 Model Commoditization Accelerates — Data Moat Is All That Remains
background · Xiaomi anonymously shipped a 1T-param model (Hunter Alpha) that users mistook for DeepSeek v4. Frontier training costs falling to $50–100M. Open-source monetizable spread closing faster than capability spread. Apple's Gemini distillation deal validates teacher→domain-adapt→distill as the production edge deployment pattern.
- Frontier training cost (est.): $100M
- Open-source gap: 3 months
◆ DEEP DIVES
01 ARC-AGI-3: Your Agentic Pipeline Has a Sub-1% Reasoning Floor
<h3>The Benchmark That Breaks Everything</h3><p>ARC-AGI-3 launched with <strong>135 interactive mini-games across ~1,000 levels</strong>, all verified as solvable by humans on first contact with no training. The results are devastating for anyone betting on agentic AI reasoning:</p><table><thead><tr><th>Model</th><th>Lab</th><th>ARC-AGI-3 Score</th></tr></thead><tbody><tr><td><strong>Gemini Pro</strong></td><td>Google</td><td>0.37%</td></tr><tr><td>GPT-5.4 High</td><td>OpenAI</td><td>0.26%</td></tr><tr><td>Opus 4.6</td><td>Anthropic</td><td>0.25%</td></tr><tr><td>Grok-4.20</td><td>xAI</td><td>0.00%</td></tr><tr><td><em>Humans</em></td><td>—</td><td><strong>100%</strong></td></tr></tbody></table><p>The spread between first and last place among frontier models is <strong>0.37 percentage points</strong> — statistically indistinguishable noise. This isn't a tuning gap; it's a <strong>structural limitation of current approaches</strong>. Chain-of-thought, tree-of-thought, and tool use are all insufficient for adaptive real-time reasoning in novel environments.</p><hr><h3>Why This Benchmark Is Different</h3><p>ARC-AGI-3 tests <strong>zero-instruction game-like scenarios</strong> requiring rule discovery, goal formation, and strategy planning entirely from interaction. This is fundamentally different from standard benchmarks that test pattern completion over trained distributions. The critical context: labs spent millions training specifically on ARC-AGI-2 and pushed scores from <strong>3% to ~50% in under a year</strong>. ARC-AGI-3 is designed to resist this Goodhart's Law dynamic.</p><blockquote>A model that scores 90% on MMLU but <1% on ARC-AGI-3 has fundamentally different reasoning capabilities than its leaderboard position suggests.</blockquote><p>Five independent sources corroborate these scores. 
The uniform failure across architecturally different models from four separate labs confirms this is <strong>not a prompt engineering problem</strong> — it's a capability ceiling. 25 games are publicly available for human play, and spending an hour with them calibrates your intuition about what these models genuinely cannot do.</p><hr><h3>What This Means for Your Agents</h3><p>If your agentic architecture assumes the LLM can discover rules in unfamiliar environments, plan strategies without explicit instructions, or generalize from zero-shot interaction, the empirical evidence is now clear: <strong>it fails at rates above 99%</strong>. The competitive advantage isn't picking the "smartest" model — all models reason at roughly the same (near-zero) level on novel tasks. The advantage is in the <strong>scaffold design</strong>: tool-orchestrated pattern matching, structured fallback logic, and human-in-the-loop gates at reasoning boundaries.</p><p>Separately, new research shows <strong>step-wise RL rewards improve multi-step agent task success by up to 40%</strong> compared to terminal-only rewards. This is directly actionable: most agent training frameworks default to terminal rewards, and adding intermediate signals is a reward-architecture change, not a model change. <em>The 40% figure lacks full methodology disclosure, but aligns with classical reward shaping theory and is worth a controlled experiment.</em></p>
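The dense-vs-sparse reward change is small enough to sketch. Below is a minimal Python illustration, assuming a hypothetical episode trace with per-step subgoal flags; the cited research's exact environment and weighting are not disclosed.

```python
# Minimal sketch of dense (step-wise) vs. sparse (terminal-only) rewards
# for a multi-step agent episode. The episode format, subgoal flags, and
# bonus weight are hypothetical. The point: dense rewards assign credit
# at the steps where verifiable progress happened.

def terminal_reward(steps, success):
    """Sparse: one signal at episode end, zero everywhere else."""
    return [0.0] * (len(steps) - 1) + [1.0 if success else 0.0]

def stepwise_reward(steps, success, subgoal_bonus=0.2):
    """Dense: small bonuses for verifiable intermediate subgoals
    (e.g. a tool call that returned valid output), plus the terminal signal."""
    rewards = [subgoal_bonus if s["subgoal_met"] else 0.0 for s in steps]
    rewards[-1] += 1.0 if success else 0.0
    return rewards

episode = [
    {"action": "inspect_env",     "subgoal_met": True},
    {"action": "bad_tool_call",   "subgoal_met": False},
    {"action": "retry_tool_call", "subgoal_met": True},
]

print(terminal_reward(episode, success=True))  # credit only at the very end
print(stepwise_reward(episode, success=True))  # credit at each verified step
```

The controlled experiment from the action items is exactly this swap: same agent, same tasks, `terminal_reward` vs. `stepwise_reward` as the training signal.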
Action items
- Run your production LLMs against ARC-AGI-3's 25 public games this sprint to establish a reasoning capability baseline
- Add interactive reasoning tasks (rule discovery, goal formation from interaction) to your agent eval pipeline by end of quarter
- Experiment with per-step RL rewards in your agent training pipelines — same agent, same tasks, dense vs. sparse rewards
- Monitor ARC-AGI-3 leaderboard progression over 6 months to calibrate model selection decisions
Sources:TurboQuant cuts your KV-cache 6x with zero accuracy loss — and ARC-AGI-3 just exposed your frontier model's reasoning ceiling · TurboQuant claims 8x inference + 6x memory via KV-cache compression — here's what to validate before you rearchitect your serving stack · TurboQuant cuts your KV cache to 3 bits with zero accuracy loss — 8x attention speedup on H100s · ARC-AGI-3 breaks every model (<1%) — your reasoning benchmarks need a hard reset · TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
02 Three Database Fixes Worth More Than Your Last Model Optimization
<h3>Snowflake's Disjunctive Join Trap: 100–200x Hidden Tax</h3><p>When you write <code>ON a.id = b.id OR a.alt_id = b.alt_id</code> in Snowflake, the <strong>hash join optimizer silently gives up</strong>. It can't partition on an OR condition, so it falls back to a <strong>Cartesian product</strong> — joining every row against every other row, then filtering. The fix: rewrite as two separate equi-joins with UNION ALL for <strong>100–200x speedups</strong>.</p><p><em>The magnitude is directionally believable (Cartesian-to-hash-join is exactly that kind of asymptotic improvement), though actual gains depend on table sizes and join selectivity.</em> This should be an automated lint rule in your SQL CI pipeline. Every Snowflake query in your feature engineering and training data pipelines with OR in a JOIN clause is a potential order-of-magnitude win.</p><hr><h3>Postgres Upsert: The No-Op Write Amplification Bug</h3><p>Datadog discovered that <code>ON CONFLICT DO UPDATE</code> in Postgres <strong>always acquires a row lock and writes to WAL</strong>, even when the incoming data is identical to the existing row. At the scale of millions of ephemeral hosts, this <strong>doubled disk writes and quadrupled WAL syncs</strong>. The fix: add a <code>WHERE</code> clause comparing old vs. new values to skip no-op updates.</p><p>This is relevant anywhere you're doing high-frequency upserts with mostly unchanged data — <strong>feature freshness tracking, model status heartbeats, entity metadata refreshes</strong>. The write amplification is invisible unless you're monitoring WAL metrics specifically.</p><hr><h3>Airbnb's Forecasting Decomposition: A Distribution Shift Playbook</h3><p>In March 2020, Airbnb's demand models broke across <strong>three simultaneous failure modes</strong>: massive booking volume swings, unpredictable cancellation spikes, and the collapse of the normal booking-to-travel-date relationship. 
A monolithic model couldn't isolate which signal was shifting.</p><p>The fix was architectural: <strong>decouple forecasting into two independent models</strong> — one for gross booking metrics on the booking-date axis, one for lead-time composition (what proportion of bookings convert to trips on future dates). Each component can be independently recalibrated when one signal regime-shifts while the other holds steady.</p><blockquote>Separate the signals that have different failure modes so you don't have to retrain everything when one distribution shifts.</blockquote><p>This is the same intuition behind mixture-of-experts and modular forecasting. No quantitative recovery metrics are published, but the architectural principle is sound and generalizable to any multi-step funnel model where upstream volume and downstream conversion have independent drift dynamics.</p>
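The OR-join rewrite above is easy to get subtly wrong: plain UNION ALL double-counts pairs that satisfy both predicates. A sketch of the safe form, checked for row-equivalence on SQLite, which stands in for Snowflake here (the rewrite's semantics are portable; the 100–200x performance effect is Snowflake-specific). Table and column names are illustrative.

```python
# Verify the OR-join -> UNION ALL rewrite returns identical rows.
# Assumes non-NULL join keys; NULLs in id/alt_id need explicit handling.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INT, alt_id INT, val TEXT);
    CREATE TABLE b (id INT, alt_id INT, val TEXT);
    INSERT INTO a VALUES (1, 10, 'a1'), (2, 20, 'a2'), (3, 30, 'a3');
    INSERT INTO b VALUES (1, 99, 'b1'), (9, 20, 'b2'), (3, 30, 'b3');
""")

# Original form: the disjunctive predicate defeats the hash join.
or_join = """
    SELECT a.val, b.val FROM a JOIN b
    ON a.id = b.id OR a.alt_id = b.alt_id
"""

# Rewrite: two equi-joins. The second branch excludes pairs already
# matched by the first, so rows satisfying BOTH predicates (the id=3
# pair here) are not double-counted, as plain UNION ALL alone would.
union_rewrite = """
    SELECT a.val, b.val FROM a JOIN b ON a.id = b.id
    UNION ALL
    SELECT a.val, b.val FROM a JOIN b
    ON a.alt_id = b.alt_id AND a.id <> b.id
"""

assert sorted(conn.execute(or_join)) == sorted(conn.execute(union_rewrite))
```

The exclusion predicate in the second branch is the part worth encoding in the lint rule, not just the presence of UNION ALL.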
Action items
- Grep all Snowflake SQL for OR in JOIN clauses today — every instance is a potential 100x+ speedup
- Check Postgres-backed feature stores for high-frequency upsert patterns where most rows don't change; add WHERE clause to skip no-ops
- Refactor multi-step forecasting models to decompose volume from composition signals, following Airbnb's pattern
- Backtest forecasting decomposition by simulating regime changes on historical data to validate the architecture before the next shock
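The Postgres upsert guard from the action items is a one-line WHERE on the conflict action. A sketch on SQLite, which shares the ON CONFLICT ... DO UPDATE ... WHERE syntax; the Postgres spelling of the null-safe comparison is IS DISTINCT FROM.

```python
# No-op upsert guard: the UPDATE only fires when the value actually
# changes. On Postgres the skipped update is what avoids the row lock
# and WAL write; here we just observe that the guarded statement touches
# zero rows when nothing changed. Schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hosts (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO hosts VALUES (1, 'up')")

# Postgres spelling of the guard:
#   WHERE hosts.status IS DISTINCT FROM excluded.status
# SQLite's null-safe inequality is IS NOT.
guarded = """
    INSERT INTO hosts VALUES (?, ?)
    ON CONFLICT(id) DO UPDATE SET status = excluded.status
    WHERE hosts.status IS NOT excluded.status
"""

before = conn.total_changes
conn.execute(guarded, (1, "up"))     # identical value: update skipped
noop_writes = conn.total_changes - before

conn.execute(guarded, (1, "down"))   # real change: update fires
real_writes = conn.total_changes - before - noop_writes

assert noop_writes == 0 and real_writes == 1
```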
Sources:Your Snowflake joins may be 200x slower than necessary — plus Airbnb's black-swan-proof forecasting architecture · Airbnb rebuilt its forecasting models for shock resilience — here's what that means for your time-series pipelines
03 Six New Critical CVEs Hit ML-Specific Infrastructure — The Attack Surface Is Expanding
<h3>This Week's ML Vulnerability Cluster</h3><p>The SANS @RISK bulletin revealed <strong>six critical CVEs targeting core ML infrastructure tools</strong> — tools many teams run in production today. The dominant pattern: ML platforms assume a trusted environment. They're built for researcher notebooks and deployed into multi-tenant production without security hardening.</p><table><thead><tr><th>Tool</th><th>CVE</th><th>CVSS</th><th>Vulnerability</th><th>Status</th></tr></thead><tbody><tr><td><strong>Langflow</strong></td><td>CVE-2026-33017+</td><td>9.1–9.9</td><td>Unauth RCE, file write, shell injection</td><td><strong>Exploited in 20 hrs</strong></td></tr><tr><td><strong>MLflow</strong></td><td>CVE-2025-15031</td><td>9.1</td><td>Arbitrary file write via tar.gz Zip Slip</td><td>High risk in multi-tenant</td></tr><tr><td><strong>NVIDIA APEX</strong></td><td>CVE-2025-33244</td><td>9.0</td><td>Pickle deserialization RCE (PyTorch <2.6)</td><td>Patch: upgrade PyTorch</td></tr><tr><td><strong>gRPC-Go</strong></td><td>CVE-2026-33186</td><td>9.1</td><td>Auth bypass via HTTP/2 :path header</td><td>22,844 GitHub stars exposed</td></tr><tr><td><strong>Harbor</strong></td><td>CVE-2026-4404</td><td>9.4</td><td>Hard-coded credentials</td><td>Upgrade from ≤2.15.0</td></tr><tr><td><strong>Mesop (Google)</strong></td><td>CVE-2026-33054/57</td><td>9.8–10.0</td><td>Path traversal + code injection</td><td>6,521 GitHub stars</td></tr></tbody></table><hr><h3>The Pickle Problem — Three RCEs in One Week</h3><p>Three distinct pickle deserialization RCEs appeared in a single weekly bulletin: NVIDIA APEX, OmniGen2-RL, and MLflow's tar.gz variant. <strong>Pickle is essentially eval() with extra steps</strong>, and it remains wired into the default serialization path of most ML frameworks. 
The migration path exists — <strong>safetensors</strong> for model weights, <strong>protobuf</strong> for structured data — but adoption remains slow.</p><h3>Infrastructure Components You Probably Run</h3><p>The gRPC-Go vulnerability is particularly insidious. With <strong>22,844 GitHub stars</strong>, it's a transitive dependency in countless Go-based serving systems. The authorization bypass via HTTP/2 <code>:path</code> pseudo-header manipulation means your model endpoint's access control may be silently ineffective. This bug doesn't show up in application-layer testing because it operates at the <strong>protocol layer</strong>.</p><p>Meanwhile, AWS Bedrock AgentCore's "complete isolation" sandbox was demonstrated to allow <strong>bidirectional C2 via DNS tunneling</strong> — a full interactive reverse shell from a supposedly air-gapped sandbox. AWS's response: they'll update the documentation, not fix the bug. The researcher received a $100 gift card.</p><blockquote>If your ML platform doesn't get the same security hardening as your databases, it's a matter of when, not if.</blockquote>
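The "eval() with extra steps" claim takes a few lines to demonstrate. This is a deliberately benign proof of concept, not any of the exploits above: the payload writes a marker file where a real exploit would spawn a shell. It is also why safetensors, which stores raw tensors with no code path, is the recommended replacement for pickled weights.

```python
# Benign proof that unpickling executes attacker-chosen code: pickle's
# __reduce__ hook lets any object name a callable to run at load time.
import os
import pickle

MARKER = "pwned_by_pickle.txt"
if os.path.exists(MARKER):
    os.remove(MARKER)

class Payload:
    def __reduce__(self):
        # pickle.loads will call exec(<string>) during deserialization.
        return (exec, (f"open('{MARKER}', 'w').write('rce')",))

blob = pickle.dumps(Payload())   # what a malicious checkpoint contains
assert not os.path.exists(MARKER)
pickle.loads(blob)               # merely LOADING the bytes runs the code
assert os.path.exists(MARKER)
os.remove(MARKER)
```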
Action items
- Run 'python -c "import torch; print(torch.__version__)"' across all training environments — anything below 2.6 is vulnerable via APEX; upgrade and remove standalone APEX (replaced by torch.amp)
- Pin and verify gRPC-Go version ≥1.79.3 across all model serving infrastructure by checking go.sum files
- Implement model artifact scanning in MLflow — reject tar.gz artifacts with path traversal patterns and sandbox extraction for pyfunc models
- If using AWS Bedrock AgentCore sandbox, implement DNS egress filtering as compensating control — do not rely on AWS's isolation claim
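The artifact-scanning item above reduces to a pre-extraction member check. A generic sketch of the traversal guard (not MLflow's API; a check you would run before any extractall):

```python
# Pre-extraction guard against Zip Slip / path traversal in tar.gz
# artifacts: reject any member whose resolved path escapes the
# extraction root, and any symlink/hardlink member that could redirect
# later writes. (Python 3.12+ tarfile also offers extractall(
# filter="data"), which enforces similar rules natively.)
import os
import tarfile

def is_safe_tar(path, dest="."):
    """Return False if any member of the archive would escape dest."""
    dest = os.path.realpath(dest)
    with tarfile.open(path, "r:*") as tf:
        for m in tf.getmembers():
            target = os.path.realpath(os.path.join(dest, m.name))
            if not target.startswith(dest + os.sep):
                return False   # member resolves outside the root ("../...")
            if m.issym() or m.islnk():
                return False   # link members can redirect writes outside
    return True
```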
Sources:Your ML stack has 6 critical RCEs this week — Langflow, MLflow, PyTorch, LoLLMs all compromised · Your LiteLLM dependency is compromised — pin to ≤1.82.6 and rotate all credentials now · Your LiteLLM proxy may have leaked every cloud credential — check if you ran Python on March 24 · Your LiteLLM proxy and Trivy scanner were compromised — audit your ML pipeline dependencies now
04 Courts Validated 'Defective Product Design' Against Recommendation Algorithms — Your Objective Function Is Evidence
<h3>The Legal Theory That Bypasses Section 230</h3><p>A California jury found <strong>Meta ($4.2M) and YouTube ($1.8M) liable for negligence</strong> in the first bellwether social media addiction case. The critical innovation: plaintiffs didn't argue about harmful content. They argued <strong>platform design features</strong> — infinite scroll, algorithmic recommendations, engagement-maximizing mechanics — constituted negligence. The jury agreed, and <strong>Section 230 didn't save them</strong> because the theory targets the algorithm, not the content.</p><table><thead><tr><th>Legal Theory</th><th>Target</th><th>Section 230 Defense</th><th>Status</th></tr></thead><tbody><tr><td>Content liability</td><td>User-generated content</td><td>Protected</td><td>Traditional approach</td></tr><tr><td><strong>Product design liability</strong></td><td>Algorithmic features (recs, scroll)</td><td><strong>Not protected</strong></td><td>Jury-validated</td></tr><tr><td>Child safety failure</td><td>Platform safety systems</td><td>Not invoked</td><td>$375M NM verdict</td></tr></tbody></table><p>One day earlier, a New Mexico jury hit Meta with <strong>$375M</strong> for failing to protect minors from predators. Both companies will appeal, but the legal attack vector is validated: <em>target the algorithm, not the content</em>. Thousands of similar cases are queued, including federal cases from school districts naming Meta, YouTube, TikTok, and Snap. The theory is being <strong>explicitly extended to AI chatbot makers</strong> including OpenAI and Google.</p><hr><h3>What This Means for Your Models</h3><p>Five independent sources confirm the same analysis: your <strong>optimization objective is now legally discoverable evidence</strong>. If your recommendation system maximizes watch time, and your A/B test logs show you chose the variant that increased session duration for adolescents, that's exhibit A in litigation. 
The legal standard is "negligence," not "intent" — you don't have to have intended harm for liability to attach.</p><blockquote>Document your safety trade-offs like they'll be read by a jury, because they might be.</blockquote><p><em>Both verdicts will be appealed, and a single bellwether doesn't set binding precedent. But it signals how juries perceive algorithmic engagement systems — and creates the template for thousands of upcoming cases.</em></p>
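What an "explicit safety constraint" can look like in ranking code, purely as illustration: every metric name, weight, and threshold below is hypothetical. The point is that the trade-off becomes a named, documented, auditable parameter rather than an emergent property of an engagement-only objective.

```python
# Illustrative ranking objective with a documented wellbeing penalty and
# a hard per-cohort engagement cap. All signals and values hypothetical.

def score(item, user, wellbeing_weight=0.3, minor_session_cap_min=45):
    # Hard cap for minors: past the session limit, the item is
    # unrankable regardless of its engagement value.
    if user["is_minor"] and user["session_min"] >= minor_session_cap_min:
        return float("-inf")
    engagement = item["p_click"] * item["expected_watch_min"]
    # Explicit penalty on a harm-adjacent signal, kept next to the
    # engagement term so the trade-off is auditable after the fact.
    harm_proxy = item["p_compulsive_pattern"]
    return engagement - wellbeing_weight * harm_proxy
```

The cap corresponds to the "configurable engagement caps per user cohort" action item; the named penalty term is the documented safety constraint the audit item asks for.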
Action items
- Audit your recommendation system's objective function for harm-adjacent proxy metrics (session duration, scroll depth, notification re-engagement) and document explicit safety constraints this quarter
- Add user wellbeing metrics to your A/B testing framework alongside engagement KPIs
- If your platform serves minors, implement configurable engagement caps per user cohort as a model feature
Sources:Your recommendation engine is now a legal liability — platform design found negligent in landmark trial · Your recommendation models may now be legally 'defective products' — plus AI agent deployment hits inflection · Addiction liability verdict could rewrite your recommendation objective functions · Social media negligence verdict may force your engagement models into a legal minefield
◆ QUICK HITS
Update: TurboQuant's 8x KV-cache claim faces direct contradiction — Auto-Inference-Optimiser found KV-cache quantization hurt throughput on Apple Silicon, proving gains are hardware-dependent; realistic speedup over FP16 is likely 2–4x, not 8x
TurboQuant claims 8x inference + 6x memory via KV-cache compression — here's what to validate before you rearchitect your serving stack
Update: LiteLLM forensics reveal March 24 attack window (09:00–13:30 UTC), .pth file persistence that survives downgrades, and the payload executing on every Python interpreter startup — check site-packages for unexpected .pth files even if you've already patched
Your LiteLLM proxy may have leaked every cloud credential — check if you ran Python on March 24
ASMR agentic retrieval claims ~99% on LongMemEval using 12-agent decision forest instead of vector search — open-source release planned early April 2026; expect 100–1000x cost increase per query vs. ANN lookup
Your vector search pipeline may be obsolete — ASMR's 12-agent retrieval hits ~99% on LongMemEval
Apple's Gemini distillation hitting domain mismatch: Gemini was tuned for chatbot/coding tasks that don't match device-assistant use cases — textbook distribution shift in knowledge distillation worth studying for your own teacher→student pipelines
TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
Novo Nordisk killed Claude-powered 'Found Data' tool for mining decades of clinical trial data — expensive, no noticeable advances; CDO: 'If I can do it better in Excel, stay in Excel'
Novo Nordisk's multi-model agent architecture has real MLOps lessons — and one honest failure case worth studying
AWS Security Agent scores 92.5% on CVE Bench with scaffolding, drops to 65% when LLM knowledge predates the benchmark — a 27.5pp delta measuring memorization, not generalization; add knowledge-cutoff ablations to your eval standard
Your LiteLLM proxy and Trivy scanner were compromised — audit your ML pipeline dependencies now
Volga rewrote from Python+Ray to Rust (DataFusion+Arrow+SlateDB), unifying streaming, batch, and request-time ML feature compute — early but architecturally compelling replacement for Flink+Spark+Redis stacks
Your Snowflake joins may be 200x slower than necessary — plus Airbnb's black-swan-proof forecasting architecture
GitHub Copilot now uses your code for AI training by default (opt-out, not opt-in) — audit org settings today if proprietary model architectures or pipeline logic live in Copilot-enabled repos
TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
Reddit removes ~100,000 bot accounts per day — if Reddit data is in your NLP training corpus, expect measurable distribution shift as bot-generated text gets purged
TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
68% of executives report 10%+ energy cost increases from AI workloads in past 12 months (Pure Storage/Everpure survey, likely biased upward) — re-baseline your training compute cost models with escalation scenarios
AI-for-science hits two milestones — but your compute budget faces regulatory and energy headwinds
Update: Sora pivot confirmed as video-gen→robotics training — OpenAI believes diffusion world models have higher ROI as physics simulators for embodied AI than consumer content; if you have video diffusion skills, evaluate sim-to-real transfer
Video world models → robotics training: Sora's death signals where your generative modeling skills pay off next
BOTTOM LINE
ARC-AGI-3 scored every frontier model below 1% on reasoning tasks humans solve at 100%, confirming that agentic pipelines relying on novel LLM reasoning have a near-zero capability floor. Meanwhile, a Snowflake OR-join audit and a Postgres upsert WHERE clause will deliver more immediate compute savings than your last model optimization; six new critical CVEs prove ML infrastructure is now a first-class attack surface requiring database-grade hardening; and a California jury just ruled that recommendation algorithms can be legally 'defective products' whose optimization objectives are courtroom evidence.
Frequently asked
- What ARC-AGI-3 score did frontier models actually achieve?
- All frontier models scored below 1% on ARC-AGI-3's interactive reasoning tasks that humans solve at 100%. Specifically: Gemini Pro at 0.37%, GPT-5.4 High at 0.26%, Anthropic's Opus 4.6 at 0.25%, and Grok-4.20 at literal 0%. The 0.37-point spread across four architecturally distinct labs indicates a structural capability ceiling, not a tuning gap.
- How should I redesign my agentic pipeline given these reasoning limits?
- Shift from open-ended reasoning to tool-orchestrated pattern matching with structured human fallbacks at reasoning boundaries. The competitive advantage is in the scaffold, not the model — all frontier models reason at roughly the same near-zero level on novel tasks. Also consider replacing terminal-only RL rewards with step-wise rewards, which early research suggests can improve multi-step agent success by up to 40%.
- Why does 'ON a.id = b.id OR a.alt_id = b.alt_id' destroy Snowflake performance?
- Snowflake's hash join optimizer cannot partition on an OR condition, so it silently falls back to a Cartesian product and filters afterward — yielding 100–200x slowdowns on non-trivial tables. The fix is to rewrite the query as two separate equi-joins combined with UNION ALL. This should be codified as an automated SQL lint rule in CI, since every OR-in-JOIN in your feature pipelines is a potential order-of-magnitude win.
- Which ML infrastructure CVEs need immediate patching?
- Six critical CVEs hit ML tooling this week, with three requiring action now: NVIDIA APEX pickle deserialization RCE (CVE-2025-33244, fixed by upgrading to PyTorch ≥2.6), gRPC-Go HTTP/2 :path auth bypass (CVE-2026-33186, pin to ≥1.79.3), and MLflow tar.gz Zip Slip arbitrary file write (CVE-2025-15031). Langflow unauth RCE was exploited in the wild within 20 hours of disclosure. Separately, AWS Bedrock AgentCore's sandbox allows DNS-tunneled C2 and AWS declined to fix — implement DNS egress filtering as a compensating control.
- Why aren't social media platforms protected by Section 230 in these verdicts?
- The plaintiffs targeted algorithmic design features — infinite scroll, recommendation systems, engagement-maximizing mechanics — rather than user-generated content, which is the narrow scope of Section 230 immunity. Juries found Meta ($4.2M) and YouTube ($1.8M) negligent under a product-defect theory, and separately hit Meta with $375M in New Mexico for child safety failures. The practical implication for data scientists: your optimization objective, A/B test logs, and safety trade-off documentation are now legally discoverable evidence in thousands of queued cases, which are being explicitly extended to AI chatbot makers.