Edition 2026-03-27 · read as Data Science
ARC-AGI-3 Pegs Frontier Models Below 1% on Reasoning
- Sources: 39
- Words: 1,414
- Read: 7 min
◆ The signal
ARC-AGI-3 just scored every frontier model below 1% on interactive reasoning tasks humans solve at 100% — Gemini Pro at 0.37%, GPT-5.4 at 0.26%, Grok-4.20 at literal 0%. If your agentic pipeline assumes the LLM can discover rules or form strategies in unfamiliar environments, that assumption now has a measured empirical ceiling. Design your agents for tool-orchestrated pattern matching with human fallbacks, not open-ended reasoning — the competitive advantage is in the scaffold, not the model.
◆ INTELLIGENCE MAP
01 ARC-AGI-3 Resets the Reasoning Scoreboard to Near-Zero
[act now] All frontier models score <1% on 135 interactive mini-games humans solve at 100%. Gemini Pro leads at 0.37%, GPT-5.4 at 0.26%, Grok-4.20 at 0%. Labs previously pushed ARC-AGI-2 from 3% to ~50% by training on it; AGI-3 is designed to resist that.
[Chart: ARC-AGI-3 score for Gemini Pro, GPT-5.4 High, Opus 4.6, Grok-4.20, and the human baseline]
02 Three Database Fixes That Outperform Your Last Model Optimization
[act now] Snowflake OR-joins silently force Cartesian products; rewriting as UNION ALL yields 100–200x speedups. Postgres ON CONFLICT DO UPDATE writes WAL even on no-ops, doubling Datadog's disk writes. Airbnb's COVID-era fix decoupled booking volume from lead-time composition for shock-resilient forecasting.
[Chart: Snowflake OR-join, Postgres WAL bloat, Postgres WAL syncs, Airbnb model split]
03 ML Infrastructure CVE Cluster Expands Beyond LiteLLM
[monitor] Six critical CVEs hit ML-specific tools this week: Langflow RCE exploited in 20 hours, MLflow arbitrary file write (CVSS 9.1), NVIDIA APEX pickle RCE in PyTorch <2.6 (CVSS 9.0), gRPC-Go auth bypass (CVSS 9.1), Harbor hard-coded creds (CVSS 9.4), and Mesop path traversal plus code injection (CVSS 9.8–10.0). Pattern: ML tools assume trusted environments and ship without input validation.
[Chart: Langflow exploit, MLflow CVSS, NVIDIA APEX CVSS, gRPC-Go stars, Harbor CVSS]
- 01 Mesop (Google): 10
- 02 Langflow RCE: 9.9
- 03 Harbor Registry: 9.4
- 04 MLflow pyfunc: 9.1
- 05 gRPC-Go bypass: 9.1
- 06 NVIDIA APEX: 9
04 Recommendation Algorithms Ruled 'Defective Products' in Court
[monitor] A California jury found Meta ($4.2M) and YouTube ($1.8M) negligent for addictive design. The case targeted algorithmic features, not content, bypassing Section 230. The legal theory extends to AI chatbots. Thousands of pending cases will use this as a template. Your objective function is now discoverable evidence.
[Chart: Meta damages, YouTube damages, NM Meta verdict, pending cases]
- Meta (CA bellwether): $4.2M
- YouTube (CA bellwether): $1.8M
05 Model Commoditization Accelerates — Data Moat Is All That Remains
[background] Xiaomi anonymously shipped a 1T-param model (Hunter Alpha) that users mistook for DeepSeek v4. Frontier training costs are falling to $50–100M. The open-source monetizable spread is closing faster than the capability spread. Apple's Gemini distillation deal validates teacher→domain-adapt→distill as the production edge deployment pattern.
[Chart: Hunter Alpha params, context window, tokens processed, training cost trend, Xiaomi stock jump]
- Frontier training cost (est.): 100
- Open-source gap (months): 3
◆ DEEP DIVES
01 ARC-AGI-3: Your Agentic Pipeline Has a Sub-1% Reasoning Ceiling
The Benchmark That Breaks Everything
ARC-AGI-3 launched with 135 interactive mini-games across ~1,000 levels, all verified as solvable by humans on first contact with no training. The results are devastating for anyone betting on agentic AI reasoning:
| Model | Lab | ARC-AGI-3 Score |
| --- | --- | --- |
| Gemini Pro | Google | 0.37% |
| GPT-5.4 High | OpenAI | 0.26% |
| Opus 4.6 | Anthropic | 0.25% |
| Grok-4.20 | xAI | 0.00% |
| Humans | n/a | 100% |

The spread between first and last place among frontier models is 0.37 percentage points: statistically indistinguishable noise. This isn't a tuning gap; it's a structural limitation of current approaches. Chain-of-thought, tree-of-thought, and tool use are all insufficient for adaptive real-time reasoning in novel environments.
Why This Benchmark Is Different
ARC-AGI-3 tests zero-instruction game-like scenarios requiring rule discovery, goal formation, and strategy planning entirely from interaction. This is fundamentally different from standard benchmarks that test pattern completion over trained distributions. The critical context: labs spent millions training specifically on ARC-AGI-2 and pushed scores from 3% to ~50% in under a year. ARC-AGI-3 is designed to resist this Goodhart's Law dynamic.
A model that scores 90% on MMLU but <1% on ARC-AGI-3 has fundamentally different reasoning capabilities than its leaderboard position suggests.
Five independent sources corroborate these scores. The uniform failure across architecturally different models from four separate labs confirms this is not a prompt engineering problem — it's a capability ceiling. 25 games are publicly available for human play, and spending an hour with them calibrates your intuition about what these models genuinely cannot do.
What This Means for Your Agents
If your agentic architecture assumes the LLM can discover rules in unfamiliar environments, plan strategies without explicit instructions, or generalize from zero-shot interaction, the empirical evidence is now clear: it fails at rates above 99%. The competitive advantage isn't picking the "smartest" model — all models reason at roughly the same (near-zero) level on novel tasks. The advantage is in the scaffold design: tool-orchestrated pattern matching, structured fallback logic, and human-in-the-loop gates at reasoning boundaries.
Separately, new research shows step-wise RL rewards improve multi-step agent task success by up to 40% compared to terminal-only rewards. This is directly actionable: most agent training frameworks default to terminal rewards, and adding intermediate signals is a reward-architecture change, not a model change. The 40% figure lacks full methodology disclosure, but aligns with classical reward shaping theory and is worth a controlled experiment.
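To make the dense-versus-sparse comparison concrete, here is a minimal sketch of the two reward schemes applied to the same episode. It is not the method from the cited research, whose details are not disclosed. The `Step` type, the `subgoal_done` flag, and the 0.2 bonus are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent action; subgoal_done marks a verified intermediate milestone (hypothetical)."""
    action: str
    subgoal_done: bool


def terminal_rewards(steps: List[Step], success: bool) -> List[float]:
    """Sparse scheme: zero everywhere, 1.0 on the final step if the task succeeded."""
    rewards = [0.0] * len(steps)
    if steps and success:
        rewards[-1] = 1.0
    return rewards


def stepwise_rewards(steps: List[Step], success: bool,
                     subgoal_bonus: float = 0.2) -> List[float]:
    """Dense scheme: the same terminal reward plus a small bonus per verified subgoal."""
    rewards = terminal_rewards(steps, success)
    for i, step in enumerate(steps):
        if step.subgoal_done:
            rewards[i] += subgoal_bonus
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return-to-go per step, the weight a policy-gradient update would typically use."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


# A failed episode that still completed two of three subgoals.
episode = [
    Step("open_ticket", subgoal_done=True),
    Step("query_db", subgoal_done=True),
    Step("write_reply", subgoal_done=False),
]
sparse = terminal_rewards(episode, success=False)
dense = stepwise_rewards(episode, success=False)
```

On a failed episode the sparse scheme gives the learner no signal at all, while the dense scheme still credits the steps that made progress. That difference is what a controlled experiment would measure. Poorly chosen intermediate bonuses can also be gamed, so the subgoal checks deserve the same scrutiny as the terminal success check.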
Action items
- Run your production LLMs against ARC-AGI-3's 25 public games this sprint to establish a reasoning capability baseline
- Add interactive reasoning tasks (rule discovery, goal formation from interaction) to your agent eval pipeline by end of quarter
- Experiment with per-step RL rewards in your agent training pipelines — same agent, same tasks, dense vs. sparse rewards
- Monitor ARC-AGI-3 leaderboard progression over 6 months to calibrate model selection decisions
Sources: TurboQuant cuts your KV-cache 6x with zero accuracy loss — and ARC-AGI-3 just exposed your frontier model's reasoning ceiling · TurboQuant claims 8x inference + 6x memory via KV-cache compression — here's what to validate before you rearchitect your serving stack · TurboQuant cuts your KV cache to 3 bits with zero accuracy loss — 8x attention speedup on H100s · ARC-AGI-3 breaks every model (<1%) — your reasoning benchmarks need a hard reset · TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
02 Three Database Fixes Worth More Than Your Last Model Optimization
Snowflake's Disjunctive Join Trap: 100–200x Hidden Tax
When you write 'ON a.id = b.id OR a.alt_id = b.alt_id' in Snowflake, the hash join optimizer silently gives up. It can't partition on an OR condition, so it falls back to a Cartesian product: joining every row against every other row, then filtering. The fix: rewrite as two separate equi-joins with UNION ALL for 100–200x speedups. Note that a row pair satisfying both conditions appears twice under a plain UNION ALL, so exclude the overlap in the second branch (or deduplicate) when the result must match the original exactly.

The magnitude is directionally believable (Cartesian-to-hash-join is exactly that kind of asymptotic improvement), though actual gains depend on table sizes and join selectivity. This should be an automated lint rule in your SQL CI pipeline. Every Snowflake query in your feature engineering and training data pipelines with OR in a JOIN clause is a potential order-of-magnitude win.
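The rewrite can be checked for result equivalence before it goes anywhere near production. The sketch below uses Python's built-in `sqlite3` purely as a stand-in engine, so it says nothing about Snowflake performance. The tables `a` and `b` and their rows are made up for illustration, and the overlap exclusion assumes the join keys are never NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, alt_id INTEGER, label TEXT);
    CREATE TABLE b (id INTEGER, alt_id INTEGER, value TEXT);
    INSERT INTO a VALUES (1, 10, 'a1'), (2, 20, 'a2'), (3, 30, 'a3');
    INSERT INTO b VALUES (1, 10, 'both'), (9, 20, 'alt_only'), (3, 99, 'id_only');
""")

# Original form: a disjunctive join condition.
OR_JOIN = """
    SELECT a.label, b.value
    FROM a JOIN b ON a.id = b.id OR a.alt_id = b.alt_id
"""

# Rewritten form: two equi-joins. The second branch skips pairs the first
# branch already matched, so UNION ALL does not emit them twice.
# Assumption: join keys are non-NULL (otherwise `<>` would drop rows).
UNION_REWRITE = """
    SELECT a.label, b.value
    FROM a JOIN b ON a.id = b.id
    UNION ALL
    SELECT a.label, b.value
    FROM a JOIN b ON a.alt_id = b.alt_id AND a.id <> b.id
"""

or_rows = sorted(conn.execute(OR_JOIN).fetchall())
union_rows = sorted(conn.execute(UNION_REWRITE).fetchall())
```

Comparing `or_rows` with `union_rows` on a representative sample is a cheap regression check to run alongside the timing comparison on the real warehouse.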
Postgres Upsert: The No-Op Write Amplification Bug
Datadog discovered that 'ON CONFLICT DO UPDATE' in Postgres always acquires a row lock and writes to WAL, even when the incoming data is identical to the existing row. At the scale of millions of ephemeral hosts, this doubled disk writes and quadrupled WAL syncs. The fix: add a WHERE clause comparing old vs. new values to skip no-op updates.

This is relevant anywhere you're doing high-frequency upserts with mostly unchanged data: feature freshness tracking, model status heartbeats, entity metadata refreshes. The write amplification is invisible unless you're monitoring WAL metrics specifically.
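A minimal sketch of the guarded upsert follows. It uses SQLite through Python's `sqlite3` so it runs without a database server; SQLite shares the `ON CONFLICT ... DO UPDATE ... WHERE` shape but not Postgres's WAL behavior. The `hosts` table and the `version` counter are hypothetical, added only to make skipped updates observable. In Postgres the equivalent guard would be `WHERE hosts.status IS DISTINCT FROM EXCLUDED.status`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hosts (
        host    TEXT PRIMARY KEY,
        status  TEXT,
        version INTEGER NOT NULL DEFAULT 1   -- bumped on every real update
    )
""")

# The WHERE guard turns identical heartbeats into no-ops.
# SQLite's `IS NOT` is NULL-safe, like Postgres's IS DISTINCT FROM.
UPSERT = """
    INSERT INTO hosts (host, status) VALUES (?, ?)
    ON CONFLICT (host) DO UPDATE
        SET status = excluded.status,
            version = version + 1
        WHERE hosts.status IS NOT excluded.status
"""


def heartbeat(host: str, status: str) -> int:
    """Upsert one heartbeat and return the row's version counter afterwards."""
    conn.execute(UPSERT, (host, status))
    conn.commit()
    row = conn.execute("SELECT version FROM hosts WHERE host = ?", (host,)).fetchone()
    return row[0]


v_insert = heartbeat("web-1", "healthy")    # first sighting: plain insert
v_noop = heartbeat("web-1", "healthy")      # identical data: update skipped
v_change = heartbeat("web-1", "degraded")   # changed data: update applied
```

The version counter stays put on the identical heartbeat and moves only on the real change. One caveat to verify against the Postgres documentation for your version: the guard avoids rewriting the row, but the conflicting row may still be locked, so measure WAL volume before and after instead of assuming the savings.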
Airbnb's Forecasting Decomposition: A Distribution Shift Playbook
In March 2020, Airbnb's demand models broke across three simultaneous failure modes: massive booking volume swings, unpredictable cancellation spikes, and the collapse of the normal booking-to-travel-date relationship. A monolithic model couldn't isolate which signal was shifting.
The fix was architectural: decouple forecasting into two independent models — one for gross booking metrics on the booking-date axis, one for lead-time composition (what proportion of bookings convert to trips on future dates). Each component can be independently recalibrated when one signal regime-shifts while the other holds steady.
Separate the signals that have different failure modes so you don't have to retrain everything when one distribution shifts.
This is the same intuition behind mixture-of-experts and modular forecasting. No quantitative recovery metrics are published, but the architectural principle is sound and generalizable to any multi-step funnel model where upstream volume and downstream conversion have independent drift dynamics.
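One way to picture the decomposition is as two separately maintained inputs combined at the end. The sketch below is an illustrative assumption, not Airbnb's published implementation: the function name, the numbers, and the simple "volume times lead-time share" combination are all hypothetical.

```python
from typing import Dict, List


def trips_by_travel_day(bookings_by_day: List[float],
                        lead_time_share: Dict[int, float],
                        horizon: int) -> List[float]:
    """Spread each booking day's volume across future travel days
    according to the lead-time shares, then sum per travel day."""
    trips = [0.0] * horizon
    for booking_day, volume in enumerate(bookings_by_day):
        for lead, share in lead_time_share.items():
            travel_day = booking_day + lead
            if travel_day < horizon:
                trips[travel_day] += volume * share
    return trips


# Component 1 (hypothetical model output): gross bookings per booking date.
volume_forecast = [100.0, 100.0, 100.0]

# Component 2 (hypothetical model output): share of bookings by days of lead time.
normal_mix = {0: 0.1, 1: 0.3, 2: 0.6}
shock_mix = {0: 0.6, 1: 0.3, 2: 0.1}   # travellers suddenly book last-minute

baseline = trips_by_travel_day(volume_forecast, normal_mix, horizon=5)
after_shock = trips_by_travel_day(volume_forecast, shock_mix, horizon=5)
```

Only the lead-time component changed between the two runs; the volume forecast was reused untouched. That is the practical payoff of the split: when one signal regime-shifts, you recalibrate one component and leave the other alone.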
Action items
- Grep all Snowflake SQL for OR in JOIN clauses today — every instance is a potential 100x+ speedup
- Check Postgres-backed feature stores for high-frequency upsert patterns where most rows don't change; add WHERE clause to skip no-ops
- Refactor multi-step forecasting models to decompose volume from composition signals, following Airbnb's pattern
- Backtest forecasting decomposition by simulating regime changes on historical data to validate the architecture before the next shock
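For the grep-style audit in the first action item, a rough lint sketch is shown below. It is a regex heuristic, not a SQL parser: it can misfire on string literals, comments, or functions named like keywords, so treat hits as candidates for review. The function name `find_or_joins` and the keyword list are assumptions made for this example.

```python
import re
from typing import List

# Keywords that (heuristically) end an ON condition.
_CLAUSE_END = (
    r"JOIN|INNER|LEFT|RIGHT|FULL|CROSS|NATURAL|WHERE|GROUP\s+BY|ORDER\s+BY|"
    r"HAVING|QUALIFY|LIMIT|UNION"
)
_ON_CLAUSE = re.compile(
    rf"\bON\b(?P<cond>.*?)(?=\b(?:{_CLAUSE_END})\b|;|\Z)",
    re.IGNORECASE | re.DOTALL,
)
_OR = re.compile(r"\bOR\b", re.IGNORECASE)


def find_or_joins(sql: str) -> List[str]:
    """Return each ON condition that contains an OR, whitespace-normalized."""
    hits = []
    for match in _ON_CLAUSE.finditer(sql):
        cond = " ".join(match.group("cond").split())
        if _OR.search(cond):
            hits.append(cond)
    return hits


flagged = find_or_joins(
    "SELECT * FROM a JOIN b ON a.id = b.id OR a.alt_id = b.alt_id WHERE a.x > 1"
)
clean = find_or_joins(
    "SELECT * FROM a JOIN b ON a.id = b.id WHERE a.x = 1 OR a.y = 2"
)
```

An OR in the WHERE clause is deliberately ignored, since the trap described above concerns the join condition only. A production rule would be better built on a real SQL parser.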
Sources: Your Snowflake joins may be 200x slower than necessary — plus Airbnb's black-swan-proof forecasting architecture · Airbnb rebuilt its forecasting models for shock resilience — here's what that means for your time-series pipelines
03 Six New Critical CVEs Hit ML-Specific Infrastructure — The Attack Surface Is Expanding
This Week's ML Vulnerability Cluster
The SANS @RISK bulletin revealed six critical CVEs targeting core ML infrastructure tools — tools many teams run in production today. The dominant pattern: ML platforms assume a trusted environment. They're built for researcher notebooks and deployed into multi-tenant production without security hardening.
| Tool | CVE | CVSS | Vulnerability | Status |
| --- | --- | --- | --- | --- |
| Langflow | CVE-2026-33017+ | 9.1–9.9 | Unauth RCE, file write, shell injection | Exploited in 20 hrs |
| MLflow | CVE-2025-15031 | 9.1 | Arbitrary file write via tar.gz Zip Slip | High risk in multi-tenant |
| NVIDIA APEX | CVE-2025-33244 | 9.0 | Pickle deserialization RCE (PyTorch <2.6) | Patch: upgrade PyTorch |
| gRPC-Go | CVE-2026-33186 | 9.1 | Auth bypass via HTTP/2 :path header | 22,844 GitHub stars exposed |
| Harbor | CVE-2026-4404 | 9.4 | Hard-coded credentials | Upgrade from ≤2.15.0 |
| Mesop (Google) | CVE-2026-33054/57 | 9.8–10.0 | Path traversal + code injection | 6,521 GitHub stars |

The Pickle Problem: Three RCEs in One Week
Three distinct pickle deserialization RCEs appeared in a single weekly bulletin: NVIDIA APEX, OmniGen2-RL, and MLflow's tar.gz variant. Pickle is essentially eval() with extra steps, and it remains wired into the default serialization path of most ML frameworks. The migration path exists — safetensors for model weights, protobuf for structured data — but adoption remains slow.
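To see why pickle behaves like eval(), and what a stopgap looks like while a safetensors or protobuf migration is pending, here is a minimal sketch of the allow-list pattern from the Python standard library documentation. It does not patch any of the CVEs above. The `SAFE_BUILTINS` set and the `_Payload` class are illustrative assumptions.

```python
import builtins
import io
import pickle

# Only these builtins may be resolved during unpickling (illustrative allow-list).
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "slice"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Every global a pickle stream asks for passes through here.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads(), but refuses globals outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


class _Payload:
    """Stand-in for a malicious object: unpickling it would call eval()."""
    def __reduce__(self):
        return (eval, ("1 + 1",))


blocked = False
try:
    restricted_loads(pickle.dumps(_Payload()))
except pickle.UnpicklingError:
    blocked = True

plain = restricted_loads(pickle.dumps({"weights": [1.0, 2.0]}))
```

Plain containers and numbers load normally because they never request a global. Real model checkpoints reference framework classes, so an allow-list quickly becomes long and fragile. That is the practical argument for formats that cannot carry code at all.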
Infrastructure Components You Probably Run
The gRPC-Go vulnerability is particularly insidious. With 22,844 GitHub stars, it's a transitive dependency in countless Go-based serving systems. The authorization bypass via HTTP/2 :path pseudo-header manipulation means your model endpoint's access control may be silently ineffective. This bug doesn't show up in application-layer testing because it operates at the protocol layer.

Meanwhile, AWS Bedrock AgentCore's "complete isolation" sandbox was demonstrated to allow bidirectional C2 via DNS tunneling: a full interactive reverse shell from a supposedly air-gapped sandbox. AWS's response: they'll update the documentation, not fix the bug. The researcher received a $100 gift card.
If your ML platform doesn't get the same security hardening as your databases, it's a matter of when, not if.
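For the gRPC-Go issue, a first-pass inventory can be scripted against go.sum files. The sketch below assumes the standard go.sum line layout (module path, version, hash) and the 1.79.3 threshold quoted in this briefing; the helper names are hypothetical. go.sum can list versions that are not the one actually built, so treat a hit as a prompt to check `go list -m all`, not as proof of exposure.

```python
import re
from typing import List, Tuple

GRPC_MODULE = "google.golang.org/grpc"
MIN_VERSION = (1, 79, 3)   # threshold quoted in this briefing

_SEMVER = re.compile(r"^v(\d+)\.(\d+)\.(\d+)")


def grpc_versions(go_sum_text: str) -> List[Tuple[int, int, int]]:
    """Distinct gRPC-Go versions mentioned in a go.sum file, sorted ascending."""
    versions = set()
    for line in go_sum_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0] != GRPC_MODULE:
            continue
        match = _SEMVER.match(parts[1])   # also covers "v1.2.3/go.mod" entries
        if match:
            versions.add(tuple(int(g) for g in match.groups()))
    return sorted(versions)


def below_minimum(go_sum_text: str) -> List[Tuple[int, int, int]]:
    """Versions older than MIN_VERSION that appear in the file."""
    return [v for v in grpc_versions(go_sum_text) if v < MIN_VERSION]


sample = (
    "google.golang.org/grpc v1.65.0 h1:abc=\n"
    "google.golang.org/grpc v1.65.0/go.mod h1:def=\n"
    "google.golang.org/grpc v1.79.3 h1:ghi=\n"
    "github.com/other/mod v0.1.0 h1:xyz=\n"
)
stale = below_minimum(sample)
```

Comparing parsed integer tuples avoids the string-comparison trap where "v1.9.0" sorts after "v1.79.3".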
Action items
- Run 'python -c "import torch; print(torch.__version__)"' across all training environments — anything below 2.6 is vulnerable via APEX; upgrade and remove standalone APEX (replaced by torch.amp)
- Pin and verify gRPC-Go version ≥1.79.3 across all model serving infrastructure by checking go.sum files
- Implement model artifact scanning in MLflow — reject tar.gz artifacts with path traversal patterns and sandbox extraction for pyfunc models
- If using AWS Bedrock AgentCore sandbox, implement DNS egress filtering as compensating control — do not rely on AWS's isolation claim
Sources: Your ML stack has 6 critical RCEs this week — Langflow, MLflow, PyTorch, LoLLMs all compromised · Your LiteLLM dependency is compromised — pin to ≤1.82.6 and rotate all credentials now · Your LiteLLM proxy may have leaked every cloud credential — check if you ran Python on March 24 · Your LiteLLM proxy and Trivy scanner were compromised — audit your ML pipeline dependencies now
04 Courts Validated 'Defective Product Design' Against Recommendation Algorithms — Your Objective Function Is Evidence
The Legal Theory That Bypasses Section 230
A California jury found Meta ($4.2M) and YouTube ($1.8M) liable for negligence in the first bellwether social media addiction case. The critical innovation: plaintiffs didn't argue about harmful content. They argued platform design features — infinite scroll, algorithmic recommendations, engagement-maximizing mechanics — constituted negligence. The jury agreed, and Section 230 didn't save them because the theory targets the algorithm, not the content.
| Legal Theory | Target | Section 230 Defense | Status |
| --- | --- | --- | --- |
| Content liability | User-generated content | Protected | Traditional approach |
| Product design liability | Algorithmic features (recs, scroll) | Not protected | Jury-validated |
| Child safety failure | Platform safety systems | Not invoked | $375M NM verdict |

One day earlier, a New Mexico jury hit Meta with $375M for failing to protect minors from predators. Both companies will appeal, but the legal attack vector is validated: target the algorithm, not the content. Thousands of similar cases are queued, including federal cases from school districts naming Meta, YouTube, TikTok, and Snap. The theory is being explicitly extended to AI chatbot makers including OpenAI and Google.
What This Means for Your Models
Five independent sources confirm the same analysis: your optimization objective is now legally discoverable evidence. If your recommendation system maximizes watch time, and your A/B test logs show you chose the variant that increased session duration for adolescents, that's exhibit A in litigation. The legal standard is "negligence," not "intent" — you don't have to have intended harm for liability to attach.
Document your safety trade-offs like they'll be read by a jury, because they might be.
Both verdicts will be appealed, and a single bellwether doesn't set binding precedent. But it signals how juries perceive algorithmic engagement systems — and creates the template for thousands of upcoming cases.
Action items
- Audit your recommendation system's objective function for harm-adjacent proxy metrics (session duration, scroll depth, notification re-engagement) and document explicit safety constraints this quarter
- Add user wellbeing metrics to your A/B testing framework alongside engagement KPIs
- If your platform serves minors, implement configurable engagement caps per user cohort as a model feature
Sources: Your recommendation engine is now a legal liability — platform design found negligent in landmark trial · Your recommendation models may now be legally 'defective products' — plus AI agent deployment hits inflection · Addiction liability verdict could rewrite your recommendation objective functions · Social media negligence verdict may force your engagement models into a legal minefield
◆ QUICK HITS
Update: TurboQuant's 8x KV-cache claim faces direct contradiction — Auto-Inference-Optimiser found KV-cache quantization hurt throughput on Apple Silicon, proving gains are hardware-dependent; realistic speedup over FP16 is likely 2–4x, not 8x
TurboQuant claims 8x inference + 6x memory via KV-cache compression — here's what to validate before you rearchitect your serving stack
Update: LiteLLM forensics reveal March 24 attack window (09:00–13:30 UTC), .pth file persistence that survives downgrades, and the payload executing on every Python interpreter startup — check site-packages for unexpected .pth files even if you've already patched
Your LiteLLM proxy may have leaked every cloud credential — check if you ran Python on March 24
ASMR agentic retrieval claims ~99% on LongMemEval using 12-agent decision forest instead of vector search — open-source release planned early April 2026; expect 100–1000x cost increase per query vs. ANN lookup
Your vector search pipeline may be obsolete — ASMR's 12-agent retrieval hits ~99% on LongMemEval
Apple's Gemini distillation hitting domain mismatch: Gemini was tuned for chatbot/coding tasks that don't match device-assistant use cases — textbook distribution shift in knowledge distillation worth studying for your own teacher→student pipelines
TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
Novo Nordisk killed Claude-powered 'Found Data' tool for mining decades of clinical trial data — expensive, no noticeable advances; CDO: 'If I can do it better in Excel, stay in Excel'
Novo Nordisk's multi-model agent architecture has real MLOps lessons — and one honest failure case worth studying
AWS Security Agent scores 92.5% on CVE Bench with scaffolding, drops to 65% when LLM knowledge predates the benchmark — a 27.5pp delta measuring memorization, not generalization; add knowledge-cutoff ablations to your eval standard
Your LiteLLM proxy and Trivy scanner were compromised — audit your ML pipeline dependencies now
Volga rewrote from Python+Ray to Rust (DataFusion+Arrow+SlateDB), unifying streaming, batch, and request-time ML feature compute — early but architecturally compelling replacement for Flink+Spark+Redis stacks
Your Snowflake joins may be 200x slower than necessary — plus Airbnb's black-swan-proof forecasting architecture
GitHub Copilot now uses your code for AI training by default (opt-out, not opt-in) — audit org settings today if proprietary model architectures or pipeline logic live in Copilot-enabled repos
TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
Reddit removes ~100,000 bot accounts per day — if Reddit data is in your NLP training corpus, expect measurable distribution shift as bot-generated text gets purged
TurboQuant cuts your KV cache 6x — plus step-wise RL rewards boost agent success 40%
68% of executives report 10%+ energy cost increases from AI workloads in past 12 months (Pure Storage/Everpure survey, likely biased upward) — re-baseline your training compute cost models with escalation scenarios
AI-for-science hits two milestones — but your compute budget faces regulatory and energy headwinds
Update: Sora pivot confirmed as video-gen→robotics training — OpenAI believes diffusion world models have higher ROI as physics simulators for embodied AI than consumer content; if you have video diffusion skills, evaluate sim-to-real transfer
Video world models → robotics training: Sora's death signals where your generative modeling skills pay off next
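For the LiteLLM item above, which recommends checking site-packages for unexpected .pth files, a minimal scan could look like the sketch below. It relies on documented behavior of Python's `site` module: lines in a .pth file that start with `import` followed by a space or tab are executed at interpreter startup. The function names are hypothetical. Legitimate packages also ship such lines, so every hit needs manual review.

```python
import site
import sysconfig
from pathlib import Path
from typing import Dict, List


def executable_pth_lines(directory: Path) -> Dict[str, List[str]]:
    """Map each .pth file in directory to the lines Python would execute at startup."""
    findings: Dict[str, List[str]] = {}
    for pth in sorted(directory.glob("*.pth")):
        try:
            lines = pth.read_text(errors="replace").splitlines()
        except OSError:
            continue   # unreadable file: skip here, investigate separately
        code_lines = [ln for ln in lines if ln.startswith(("import ", "import\t"))]
        if code_lines:
            findings[str(pth)] = code_lines
    return findings


def candidate_site_dirs() -> List[Path]:
    """Site-packages directories of the current interpreter that exist on disk."""
    paths = sysconfig.get_paths()
    dirs = {paths["purelib"], paths["platlib"], site.getusersitepackages()}
    return [Path(d) for d in sorted(dirs) if Path(d).is_dir()]


if __name__ == "__main__":
    for directory in candidate_site_dirs():
        for pth_file, lines in executable_pth_lines(directory).items():
            print(pth_file)
            for line in lines:
                print("   ", line[:120])
```

Run it once per virtual environment and container image, since each interpreter has its own site-packages and a downgrade of the package does not remove a stray .pth file.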
◆ Bottom line
The take.
ARC-AGI-3 scored every frontier model below 1% on reasoning tasks humans solve at 100%, confirming that agentic pipelines relying on novel LLM reasoning have a near-zero capability ceiling. Meanwhile, a Snowflake OR-join audit and a Postgres upsert WHERE clause will deliver more immediate compute savings than your last model optimization. Six new critical CVEs show that ML infrastructure is now a first-class attack surface requiring database-grade hardening. And a California jury just found that recommendation algorithms can be treated as 'defective products' whose optimization objectives are courtroom evidence.
Frequently asked
- What ARC-AGI-3 score did frontier models actually achieve?
- All frontier models scored below 1% on ARC-AGI-3's interactive reasoning tasks that humans solve at 100%. Specifically: Gemini Pro at 0.37%, GPT-5.4 High at 0.26%, Anthropic's Opus 4.6 at 0.25%, and Grok-4.20 at literal 0%. The 0.37-point spread across four architecturally distinct labs indicates a structural capability ceiling, not a tuning gap.
- How should I redesign my agentic pipeline given these reasoning limits?
- Shift from open-ended reasoning to tool-orchestrated pattern matching with structured human fallbacks at reasoning boundaries. The competitive advantage is in the scaffold, not the model — all frontier models reason at roughly the same near-zero level on novel tasks. Also consider replacing terminal-only RL rewards with step-wise rewards, which early research suggests can improve multi-step agent success by up to 40%.
- Why does 'ON a.id = b.id OR a.alt_id = b.alt_id' destroy Snowflake performance?
- Snowflake's hash join optimizer cannot partition on an OR condition, so it silently falls back to a Cartesian product and filters afterward — yielding 100–200x slowdowns on non-trivial tables. The fix is to rewrite the query as two separate equi-joins combined with UNION ALL. This should be codified as an automated SQL lint rule in CI, since every OR-in-JOIN in your feature pipelines is a potential order-of-magnitude win.
- Which ML infrastructure CVEs need immediate patching?
- Six critical CVEs hit ML tooling this week, with three requiring action now: NVIDIA APEX pickle deserialization RCE (CVE-2025-33244, fixed by upgrading to PyTorch ≥2.6), gRPC-Go HTTP/2 :path auth bypass (CVE-2026-33186, pin to ≥1.79.3), and MLflow tar.gz Zip Slip arbitrary file write (CVE-2025-15031). Langflow unauth RCE was exploited in the wild within 20 hours of disclosure. Separately, AWS Bedrock AgentCore's sandbox allows DNS-tunneled C2 and AWS declined to fix — implement DNS egress filtering as a compensating control.
- Why aren't social media platforms protected by Section 230 in these verdicts?
- The plaintiffs targeted algorithmic design features — infinite scroll, recommendation systems, engagement-maximizing mechanics — rather than user-generated content, which is the narrow scope of Section 230 immunity. Juries found Meta ($4.2M) and YouTube ($1.8M) negligent under a product-defect theory, and separately hit Meta with $375M in New Mexico for child safety failures. The practical implication for data scientists: your optimization objective, A/B test logs, and safety trade-off documentation are now legally discoverable evidence in thousands of queued cases, which are being explicitly extended to AI chatbot makers.