Edition 2026-05-11 · read as Data Science
GLM-5.1 Open-Weights MoE Edges GPT-5.4 on SWE-Bench Pro
- Sources: 11
- Words: 1,358
- Read: 7 min
Topics: LLM Inference · Agentic AI · AI Capital
◆ The signal
GLM-5.1, a 744B MoE with 40B active params under an MIT license, posted 58.4 on SWE-Bench Pro against 57.7 for GPT-5.4 and 57.3 for Claude Opus 4.6. Grok 4.3 shipped the same week at $1.25/$2.50 per M tokens. The last time an open-weights model tied the frontier on a coding benchmark, the lead evaporated on our internal task distribution inside a week. A one-day eval against the actual workload is still the cheapest day on this week's calendar.
◆ INTELLIGENCE MAP
01 Open-Weight Frontier Parity Breaks Coding Agent Economics
act now · GLM-5.1 (MIT, 744B MoE/40B active) scored 58.4 on SWE-Bench Pro vs GPT-5.4's 57.7. Grok 4.3 launched at $1.25/M input. AirLLM runs 70B on 4GB via layer streaming. Per-token coding costs drop ~50% for teams willing to self-host or switch.
- GLM-5.1 SWE-Bench: 58.4
- GPT-5.4 SWE-Bench: 57.7
- Grok 4.3 input $/M: 1.25
- GLM-5.1 active params: 40B
02 12M-Token Context Claims Collide with Reasoning Ceilings
monitor · SubQ claims 12M-token subquadratic attention with ~1000x compute reduction. Stanford's Stable Counting Capacity proves LLMs use finite internal states that collapse to guessing at depth. Grok's 1M context doubles in price past 200K. Capacity is scaling; verified competence at length is not.
- SubQ context claim: 12M tokens
- Grok price cliff: 2× past 200K tokens
- Attention compute cut: ~1000x
- ProgramBench full-solve: 0%
03 Enterprise Agent Access Being Gatekept
monitor · SAP blocks all third-party agents except Joule and Nvidia NemoClaw. Anthropic ships 10 finance agents with M365 + Moody's integrations; FactSet drops 8%. The access pattern your agents rely on today is not the one you'll have in 6 months. Contract-level fallbacks needed now.
- Anthropic agents: 10
- FactSet drop: 8%
- SAP allowed agents: 2
- Propagation horizon: 12 months
- SAP lockdown: only Joule + NemoClaw allowed
- Anthropic finance launch: 10 vertical agents ship
- FactSet repriced: -8% in single session
- Oracle/Salesforce next: expected within 12mo
04 Inference Subsidy: $700B Capex vs. $40B Revenue
background · Hyperscalers spend $700B on AI infra against ~$40B revenue, a roughly 17x gap. Cerebras IPOs at $35B with AWS deal. Figma shows 75% weekly AI credit consumption. Your API bill is subsidized by someone else's balance sheet. Plan architecture for the moment that stops.
- 2026 AI capex: $700B
- 2025 AI revenue: ~$40B
- Cerebras IPO: $35B
- Figma weekly AI usage: 75%
05 GPU Hardware Trust Boundary Broken
monitor · Two independent teams demonstrated Rowhammer on NVIDIA GPU GDDR memory; one variant bypasses IOMMU entirely. Multi-tenant GPU isolation is no longer a trust boundary. Separately, Meta's 267TB piracy lawsuit makes dataset provenance a legal artifact, not a research footnote.
- Attack variants
- IOMMU bypass
- Meta data alleged: 267TB
- Licensing deal walked: $200M
- Multi-tenant GPU isolation confidence: 25
◆ DEEP DIVES
01 MIT-Licensed 744B Ties Frontier: Your Coding Agent Budget Needs a Rewrite
The Convergence
Three data points landed in the same week, pointing the same way: the frontier API premium on coding workloads is compressing toward zero. Zhipu AI's GLM-5.1, a 744B MoE with roughly 40B active parameters and an MIT license, posted 58.4 on SWE-Bench Pro. GPT-5.4 scored 57.7. Claude Opus 4.6 scored 57.3. That is a 1.1-point spread across the top three, comfortably inside the noise band for eval harnesses with fewer than a few hundred tasks.
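To make the noise-band point concrete, here is a minimal sketch of the sampling error on a single benchmark score. It uses a normal-approximation 95% interval for a pass rate. The task counts are assumptions chosen for illustration (the harness sizes are not given here), and the function name is hypothetical.

```python
import math

def ci_half_width(score_pct: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width, in points, of a normal-approximation 95% CI for a pass rate."""
    p = score_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_tasks)

# ASSUMPTION: illustrative task counts, not the real size of any harness.
for n_tasks in (100, 300, 1000):
    print(f"n={n_tasks}: 58.4 +/- {ci_half_width(58.4, n_tasks):.1f} points")
```

At a few hundred tasks the interval on one score is several points wide, so a 1.1-point spread between models cannot be separated from sampling noise without more tasks or paired comparisons.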
The real signal is not that GLM-5.1 is 'best.' It's that an MIT-licensed model you can self-host is now competitive with closed frontier APIs on the coding benchmark enterprises actually cite.
In the same week, Grok 4.3 launched at $1.25/$2.50 per million tokens with a 1M context window, a material undercut of incumbent pricing. And AirLLM demonstrated layer-streaming inference that runs 70B models on a 4GB GPU without quantization, trading VRAM for I/O bandwidth.
What the Headlines Don't Tell You
The SWE-Bench Pro gap at the top is 0.7 points, and no one is publishing confidence intervals. Read the leaderboard as 'GLM-5.1 is competitive,' not 'GLM-5.1 is best.' The benchmark measures issue resolution under one harness with specific retrieval assumptions. It does not measure:
- Latency under production batch sizes
- Cost per resolved ticket at your context lengths
- Behavior on half-written internal tickets, the kind filed on a Friday afternoon
- Tool-call reliability and retry behavior on long agent sessions
Grok's 1M window has a cost cliff: pricing doubles past 200K tokens to $2.50/$5.00. Pull the production token-length distribution and compute blended cost at p50/p95/p99 before committing. For most RAG workloads this wins. For whole-codebase agents it may not.
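The percentile check described above can be sketched as follows. The prices and the 200K threshold come from the figures quoted here; everything else is an assumption. In particular, the sketch assumes that crossing the threshold reprices the whole request, which should be verified against the vendor's billing documentation. The sample token counts are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens

BASE = Tier(1.25, 2.50)   # quoted list price
LONG = Tier(2.50, 5.00)   # quoted price past the cliff
CLIFF_TOKENS = 200_000

def request_cost(input_tokens: int, output_tokens: int) -> float:
    # ASSUMPTION: a request whose input exceeds the cliff is billed
    # entirely at the higher tier. Check the actual billing rules.
    tier = LONG if input_tokens > CLIFF_TOKENS else BASE
    total = input_tokens * tier.input_per_m + output_tokens * tier.output_per_m
    return total / 1_000_000

def percentile(values, q: int):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# HYPOTHETICAL (input_tokens, output_tokens) samples; replace with production logs.
samples = [(8_000, 600)] * 90 + [(150_000, 2_000)] * 8 + [(400_000, 4_000)] * 2
costs = [request_cost(i, o) for i, o in samples]

for q in (50, 95, 99):
    print(f"p{q} cost per request: ${percentile(costs, q):.4f}")
print(f"blended mean: ${sum(costs) / len(costs):.4f}")
```

Comparing the blended mean against the p50 shows how much of the bill comes from the long tail, which is the number that decides whether whole-codebase agents fit under the cliff.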
AirLLM is architecturally honest and operationally a batch-only tool. Throughput is bounded by disk-to-GPU streaming bandwidth. Useful for offline eval harness runs. Approximately unusable for interactive latency.
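A back-of-envelope sketch of why streaming bandwidth sets the floor on latency. It assumes every forward pass re-reads all weights from disk once, with no caching of layers in VRAM and a batch of one. The 2 bytes per parameter (16-bit weights) and the 3.5 GB/s read speed are illustrative assumptions, not measurements of AirLLM.

```python
def seconds_per_forward_pass(params_billion: float,
                             bytes_per_param: float,
                             read_gb_per_s: float) -> float:
    """Lower bound on one forward pass when all weights stream from disk."""
    # 1e9 params * bytes per param / 1e9 bytes per GB cancels to GB directly.
    model_gb = params_billion * bytes_per_param
    return model_gb / read_gb_per_s

# ASSUMPTIONS: 70B params, 16-bit weights, 3.5 GB/s sustained sequential read.
floor_s = seconds_per_forward_pass(70, 2, 3.5)
print(f"at least {floor_s:.0f} s per generated token")
```

Under these assumptions each generated token costs tens of seconds of I/O before any compute, which is why the tool suits offline batch runs and not interactive use.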
The Math That Changed
| Model | SWE-Bench Pro | Context | Pricing | License |
|---|---|---|---|---|
| GLM-5.1 (744B/40B active) | 58.4 | 200K | Self-host only | MIT |
| GPT-5.4 | 57.7 | n/a | Premium (closed) | Proprietary |
| Claude Opus 4.6 | 57.3 | n/a | Premium (closed) | Proprietary |
| Grok 4.3 | Not reported | 1M (2× >200K) | $1.25/$2.50 | Proprietary |
The prior on benchmark swaps like this one: about half of public-benchmark gains survive contact with a production eval, and the surviving half is usually smaller than the headline. That still often justifies migration, but only once cost and latency sit on the same page as accuracy.
The Architecture Tell
MoE with roughly 5% active parameters (40B of 744B) is now the dominant frontier pattern. For training or fine-tuning, dense architectures above 70B are hard to justify on FLOPs-per-quality. Combine that with the progressive-disclosure pattern from Claude Skills (a tiny description stays in active context, and full instructions are lazy-loaded on an embedding match), and the entire prompt-routing architecture deserves a revisit.
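A minimal sketch of progressive-disclosure routing under stated assumptions: each skill keeps a one-line description in context, and its full instructions load only when the query matches. The bag-of-words cosine below is a stand-in for a real embedding model, and the names `Skill`, `route`, `build_system_prompt`, and the 0.2 threshold are hypothetical. This is not how Claude Skills is implemented internally.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Skill:
    name: str
    description: str                       # short text, always in context
    load_instructions: Callable[[], str]   # called only after a match

def embed(text: str) -> Counter:
    # STAND-IN for an embedding model: lowercase token counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def route(query: str, skills: List[Skill], threshold: float = 0.2) -> Optional[Skill]:
    """Return the best-matching skill, or None if nothing clears the threshold."""
    q = embed(query)
    scored = [(cosine(q, embed(s.description)), s) for s in skills]
    if not scored:
        return None
    score, best = max(scored, key=lambda pair: pair[0])
    return best if score >= threshold else None

def build_system_prompt(query: str, skills: List[Skill]) -> str:
    index = "\n".join(f"- {s.name}: {s.description}" for s in skills)
    chosen = route(query, skills)
    body = chosen.load_instructions() if chosen else ""  # lazy load
    return f"Available skills:\n{index}\n\n{body}".strip()

loaded = []  # records which full instruction sets were actually fetched

def make_loader(name: str) -> Callable[[], str]:
    def load() -> str:
        loaded.append(name)
        return f"[full instructions for {name}]"
    return load

skills = [
    Skill("sql-review", "review sql queries and database migrations", make_loader("sql-review")),
    Skill("pdf-extract", "extract tables and text from pdf files", make_loader("pdf-extract")),
]
print(build_system_prompt("extract text from this pdf", skills))
```

Only the matched skill's instructions enter the prompt, so context cost grows with the number of descriptions and not with the total size of all instruction sets.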
Action items
- Run your internal coding eval (SWE-Bench Pro subset + proprietary repo tasks) against GLM-5.1, GPT-5.4, and Claude Opus 4.6 this sprint
- Price-model Grok 4.3 for top-3 LLM workloads using actual token distribution at p50/p95/p99 context lengths
- Adopt progressive-disclosure prompt routing (embed-match to skill-specific system prompts) in your agent framework
- Pilot AirLLM for offline eval harness runs needing full-precision 70B outputs without GPU rental
Sources: Simplifying AI · Alejandro Saucedo - The Institute for Ethical AI & ML
02 12M-Token Context Claims Meet Hard Reasoning Ceilings — The RAG Rewrite Clock Starts (But Don't Move Yet)
Two Results That Should Be Read Together
SubQ announced a subquadratic attention architecture (SSA) with a 12M-token context claim and roughly 1000x attention-compute reduction. The same week, Stanford's Stable Counting Capacity work showed LLMs rely on finite count-like internal states, so procedural rule-following collapses into guessing once the counter runs out. These two results are not in conflict. They are complementary.
A 12M-token window that guesses past its counting budget is worse than a 128K window that doesn't, because it sells false confidence in a retrieval-free architecture.
It is entirely possible that SubQ is right about capacity and Stanford is right about competence. Context windows are scaling faster than the evaluation tooling that would measure them honestly.
The Verification Problem
SubQ's number is a vendor announcement. No ablations, no independent runs, no adversarial long-context evals (RULER, LongBench-v2, InfiniteBench). The 12M figure is real as a capacity claim. Whether recall holds past 2M tokens, which is where most long-context models quietly degrade, is unestablished.
| Dimension | SubQ 12M Claim | Stanford Counting Probe |
|---|---|---|
| Claim | 12M-token context, ~1000x compute reduction | LLMs use finite count-like internal states |
| Evidence level | Vendor announcement, no ablations | Published probe methodology |
| Implication | Context capacity scales dramatically | Reasoning over context has a hard ceiling |
| Production test | Needle-in-haystack at 1M+ tokens | Procedural rule-following at depth |
Grok 4.3's 1M-token context fits the same pattern: the headline is a capacity claim, not a quality claim. Retrieval accuracy past ~128K is where most models quietly degrade, and published benchmarks rarely cover the region you would actually use the long context for.
What This Changes (And Doesn't) for RAG
If SubQ holds at half the reported context length with acceptable recall, the chunking and RAG roadmap needs a significant rewrite. If it holds at a quarter, the roadmap survives with minor edits. Neither outcome is established yet.
The correct response is not to throw out the harness or shelve RAG planning. It is to do three things:
- Build a length-stratified evaluation before the vendors do, using actual document shapes from production
- Track per-slice accuracy separately from the aggregate number
- Accept that public benchmarks and production workloads have stopped overlapping in the places that matter
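The first two points above can be sketched as a small scoring helper that reports accuracy per context-length slice next to the aggregate. The bucket edges follow the 128K/512K/2M tiers used in this section's action items. The result tuples are hypothetical placeholders for real eval output.

```python
from collections import defaultdict

# Upper bounds (in tokens) and labels for each length slice.
BUCKETS = [(128_000, "<=128K"), (512_000, "<=512K"), (2_000_000, "<=2M")]

def bucket_for(context_tokens: int) -> str:
    for upper, label in BUCKETS:
        if context_tokens <= upper:
            return label
    return ">2M"

def accuracy_by_slice(results):
    """results: iterable of (context_tokens, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # label -> [passes, attempts]
    for tokens, passed in results:
        slot = totals[bucket_for(tokens)]
        slot[0] += int(passed)
        slot[1] += 1
    return {label: hits / n for label, (hits, n) in totals.items()}

# HYPOTHETICAL eval results; substitute runs on real production document shapes.
results = (
    [(60_000, True)] * 9 + [(60_000, False)] * 1
    + [(300_000, True)] * 3 + [(300_000, False)] * 1
    + [(1_500_000, True)] * 1 + [(1_500_000, False)] * 3
)

per_slice = accuracy_by_slice(results)
aggregate = sum(passed for _, passed in results) / len(results)
for label, acc in per_slice.items():
    print(f"{label}: {acc:.2f}")
print(f"aggregate: {aggregate:.2f}")
```

In this made-up sample the aggregate looks acceptable while the longest slice fails most of the time, which is the failure mode a single headline number hides.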
Several results point the same way. ProgramBench's 0% full-solve rate on real coding tasks, SWE-Bench Pro's tight clustering at the top, and Stanford's counting ceiling all suggest the research leaderboard winner and the production winner are drifting apart every quarter. The leaderboard delta is no longer a sufficient proxy for a deployment decision.
Action items
- Run a length-stratified needle-in-haystack + procedural-reasoning benchmark at 128K/512K/2M tokens on your current stack this sprint
- Defer any SubQ-based architecture decisions until independent evals on RULER/LongBench-v2/InfiniteBench appear
- Add procedural-reasoning depth as a test axis alongside retrieval accuracy in your long-context eval
- Audit your Grok 4.3 cost model for the >200K token price cliff before committing any long-context workloads
Sources: TheSequence · Simplifying AI · Alejandro Saucedo - The Institute for Ethical AI & ML
03 SAP Locks the API, Anthropic Opens a Vertical — Enterprise Agent Access Is Reshaping Around You
Two Moves, Same Pattern
Two enterprise AI events this week, same underlying structure. SAP blocked all third-party agents from its APIs except Joule (its own) and Nvidia NemoClaw. In the same window, Anthropic shipped 10 finance agents with Microsoft 365 and Moody's integrations, covering pitchbooks, credit memos, KYC, and month-end close. FactSet dropped 8% in a single session.
The pattern is symmetric. Enterprise incumbents are gating access to their data while AI labs are building vertical distribution through data partnerships. If an agent workflow touches SAP, Oracle, Salesforce, or Workday data, the access pattern available today is not the one that will be available in six months.
The allowlist pattern will propagate from SAP to Oracle, Salesforce, and Workday inside 12 months. Sanctioned-agent fallback paths belong in quarterly planning, not the someday bucket.
The FactSet Repricing: Market Signal, Not Methodology
No one has published a head-to-head eval of Claude finance agents versus FactSet workflows, so the price move does not tell you which system is better on task. What the market priced in is distribution risk, not quality superiority:
- Microsoft 365 integration: zero switching friction for finance teams in Excel/Outlook/Teams
- Moody's data tie-in: authoritative credit data behind the agents, eroding the 'proprietary data' moat
- Trajectory pricing: 'this is the worst version of Claude we will ever see'
The FactSet moat was never the NLP. It was normalized data, permissions plumbing, audit trail, and workflow embedding. Agents erode the top layer. On current evidence, they do not erode the middle. A demo repricing and a production repricing are separated by roughly 18 months of integration work.
The Agent Supply-Chain Problem
SAP's lockdown is the more immediately actionable signal for data science teams. Any agent framework reaching into enterprise ERP data now has a supply-chain problem, and the mitigation is not a clever prompt. It is a contract. The right response is:
| If your agents touch... | Risk level | Action |
|---|---|---|
| SAP data | Immediate | Verify sanctioned-agent status or plan fallback |
| Oracle ERP | 6-month horizon | Document access pattern; negotiate contract terms |
| Salesforce | 6-month horizon | Same; expect Agentforce-first policies |
| Workday | 12-month horizon | Flag in quarterly planning |
What Survives on Your Roadmap
For teams building finance or analytics LLM workflows, the triage is clear. Anything that is now a first-party Anthropic agent and lacks a proprietary-data moat should be deprecated. Maintaining a worse version of a commodity is not a career move. Anything that depends on a proprietary data join, a compliance constraint, or a latency budget the agent cannot hit stays on the roadmap and probably gets more funding.
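The triage rule in this paragraph can be written down as a small decision function, sketched here under assumptions. The field names are hypothetical. The third outcome, for projects that neither overlap a first-party agent nor have a moat, is an assumption added for completeness; the paragraph does not address that case.

```python
from dataclasses import dataclass

@dataclass
class Project:
    name: str
    overlaps_first_party_agent: bool
    proprietary_data_join: bool = False
    compliance_constraint: bool = False
    latency_budget_agent_cannot_hit: bool = False

def triage(project: Project) -> str:
    has_moat = (
        project.proprietary_data_join
        or project.compliance_constraint
        or project.latency_budget_agent_cannot_hit
    )
    if has_moat:
        return "keep"
    if project.overlaps_first_party_agent:
        return "deprecation review"
    # ASSUMPTION: the text does not cover this case; default to manual review.
    return "case-by-case review"

# HYPOTHETICAL roadmap entries for illustration only.
roadmap = [
    Project("generic pitchbook drafting", overlaps_first_party_agent=True),
    Project("KYC with internal ledger join", overlaps_first_party_agent=True,
            proprietary_data_join=True),
    Project("internal search ranking", overlaps_first_party_agent=False),
]
for project in roadmap:
    print(f"{project.name}: {triage(project)}")
```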
On the roadmaps our sources have seen, the split is roughly half and half. The half that survives is the half that was always the actual work.
Action items
- Audit all agent integrations touching SAP, Oracle, Salesforce, or Workday for API-policy exposure by end of sprint
- Map active ML projects against Anthropic's declared 10 finance workflows—flag any with direct overlap and no proprietary-data moat for deprecation review
- Build a rolling eval set of 50-100 real tasks from your domain and re-run on every major model release, version-pinned to model SHA
- Negotiate explicit agent-access terms in your next ERP vendor contract renewal
Sources: TheSequence · Edwin Dorsey from The Bear Cave · Simplifying AI
◆ QUICK HITS
Update: Gemma MTP speculative decoding now ships with vLLM, MLX, and Transformers first-party support—integration cost dropped from 'research project' to 'drop-in flag'
Alejandro Saucedo - The Institute for Ethical AI & ML
Intel auto-round quantization integrates with vLLM/SGLang/Transformers across CPU/XPU/CUDA—expect ~1.5-2x throughput on 4-bit weight-only quant with task-specific accuracy delta as the open question
Chris Short
NVIDIA GPU Rowhammer bypasses IOMMU—hardware isolation between co-tenants on shared GPU hosts is no longer a trust boundary; run a tenancy audit on any frontier-weight jobs sharing physical hosts
Chris Short
Netflix published ML metadata graph architecture: events + hydration (not dual-writes), Datomic for relationships, Elasticsearch for search—the lineage pattern most teams rebuild every 18 months
Alejandro Saucedo - The Institute for Ethical AI & ML
Anthropic NLAs decode model activations into human-readable text and flag eval-awareness (model knowing it's being tested)—a pre-ship interpretability check, not a post-incident tool
TheSequence
Meta 267TB piracy lawsuit (alleging Zuckerberg-authorized torrenting after walking from $200M licensing deal) is shaping up as the test case for training-data liability—build your provenance ledger now
Chris Short
Codex Chrome extension reads authenticated DOM from Salesforce, Gmail, LinkedIn via DevTools Protocol—flag to security before it spreads organically through engineering
Simplifying AI
Cerebras IPOs Thursday at $35B with AWS supplemental deal—wafer-scale inference now on the procurement menu for long-context/low-batch workloads
Martin Peers
◆ Bottom line
The take.
An MIT-licensed 744B model just tied GPT-5.4 on coding benchmarks, Grok halved the API price floor, and SAP locked its APIs to two sanctioned agents—in a single week. The frontier premium for coding workloads is collapsing while enterprise data access is being gated. Teams that benchmark the open-weight alternative on their actual task distribution this sprint will either save 50% on inference or confirm they can't switch—either answer is worth the day of compute.
Frequently asked
- Should I migrate coding workloads from GPT-5.4 or Claude Opus 4.6 to GLM-5.1 based on the SWE-Bench Pro numbers?
- Not yet — the 1.1-point spread across the top three is inside benchmark noise, and historically about half of public-benchmark gains survive contact with a production eval. Run a one-day spike against your actual task distribution (proprietary repo tasks plus a SWE-Bench Pro subset) before committing. If current code-gen spend exceeds $10K/month, that eval is ROI-positive before it finishes.
- What's the catch with Grok 4.3's $1.25/$2.50 pricing and 1M context window?
- Pricing doubles to $2.50/$5.00 past 200K tokens, which changes the blended cost calculation for most agent loops. Pull your production token-length distribution and compute cost at p50/p95/p99 before committing. RAG workloads usually stay under the cliff and win; whole-codebase agents at p95 often don't.
- Does SubQ's 12M-token context claim mean we can stop investing in RAG and chunking?
- No. The 12M figure is a vendor capacity claim with no ablations or independent long-context evals, and Stanford's Stable Counting Capacity work shows reasoning over long context hits a hard ceiling regardless of window size. Capacity and competence are different axes. Keep the RAG roadmap, but add length-stratified needle-in-haystack and procedural-reasoning tests at 128K/512K/2M to establish your own breakeven curve.
- How should I respond to SAP blocking third-party agents from its APIs?
- Treat it as a supply-chain problem, not a technical one. Audit every agent integration touching SAP this sprint and verify sanctioned-agent status or plan a fallback path. Expect the allowlist pattern to propagate to Oracle, Salesforce, and Workday within 12 months, so document access patterns and negotiate explicit agent-access terms in upcoming ERP contract renewals.
- How do I decide which finance/analytics ML projects to keep after Anthropic shipped 10 finance agents?
- Map active projects against Anthropic's declared workflows (pitchbooks, credit memos, KYC, month-end close) and deprecate anything that overlaps directly without a proprietary-data moat, compliance constraint, or latency budget the agent can't hit. Maintaining a worse version of a commodity that improves every Claude release is negative EV. Projects anchored to a proprietary data join or regulatory boundary stay funded.