Should I migrate coding workloads from GPT-5.4 or Claude Opus 4.6 to GLM-5.1 based on the SWE-Bench Pro numbers?

Not yet — the 1.1-point spread across the top three is inside benchmark noise, and historically about half of public-benchmark gains survive contact with a production eval. Run a one-day spike against your actual task distribution (proprietary repo tasks plus a SWE-Bench Pro subset) before committing. If current code-gen spend exceeds $10K/month, that eval is ROI-positive before it finishes.

What's the catch with Grok 4.3's $1.25/$2.50 pricing and 1M context window?

Pricing doubles to $2.50/$5.00 past 200K tokens, which changes the blended cost calculation for most agent loops. Pull your production token-length distribution and compute cost at p50/p95/p99 before committing. RAG workloads usually stay under the cliff and win; whole-codebase agents at p95 often don't.

Does SubQ's 12M-token context claim mean we can stop investing in RAG and chunking?

No. The 12M figure is a vendor capacity claim with no ablations or independent long-context evals, and Stanford's Stable Counting Capacity work shows reasoning over long context hits a hard ceiling regardless of window size. Capacity and competence are different axes. Keep the RAG roadmap, but add length-stratified needle-in-haystack and procedural-reasoning tests at 128K/512K/2M to establish your own breakeven curve.

How should I respond to SAP blocking third-party agents from its APIs?

Treat it as a supply-chain problem, not a technical one. Audit every agent integration touching SAP this sprint and verify sanctioned-agent status or plan a fallback path. Expect the allowlist pattern to propagate to Oracle, Salesforce, and Workday within 12 months, so document access patterns and negotiate explicit agent-access terms in upcoming ERP contract renewals.

How do I decide which finance/analytics ML projects to keep after Anthropic shipped 10 finance agents?

Map active projects against Anthropic's declared workflows (pitchbooks, credit memos, KYC, month-end close) and deprecate anything that overlaps directly without a proprietary-data moat, compliance constraint, or latency budget the agent can't hit. Maintaining a worse version of a commodity that improves every Claude release is negative EV. Projects anchored to a proprietary data join or regulatory boundary stay funded.

Edition 2026-05-11 · read as Data Science

GLM-5.1 Open-Weights MoE Edges GPT-5.4 on SWE-Bench Pro

Sources: 11
Words: 1,358
Read: 7min

Topics: LLM Inference · Agentic AI · AI Capital

◆ The signal

GLM-5.1, a 744B MoE with 40B active params under an MIT license, posted 58.4 on SWE-Bench Pro against 57.7 for GPT-5.4 and 57.3 for Claude Opus 4.6. Grok 4.3 shipped the same week at $1.25/$2.50 per M tokens. The last time an open-weights model tied the frontier on a coding benchmark, the lead evaporated on our internal task distribution inside a week. A one-day eval against the actual workload is still the cheapest item on this week's calendar.

◆ INTELLIGENCE MAP

01
Open-Weight Frontier Parity Breaks Coding Agent Economics
act now
GLM-5.1 (MIT, 744B MoE/40B active) scored 58.4 on SWE-Bench Pro vs GPT-5.4's 57.7. Grok 4.3 launched at $1.25/M input. AirLLM runs 70B on 4GB via layer streaming. Per-token coding costs drop ~50% for teams willing to self-host or switch.
58.4 on SWE-Bench Pro (MIT model) · 2 sources
SWE-Bench Pro scores:
1. GLM-5.1 (MIT): 58.4
2. GPT-5.4: 57.7
3. Claude Opus 4.6: 57.3
4. Grok 4.3: not reported
02
12M-Token Context Claims Collide with Reasoning Ceilings
monitor
SubQ claims 12M-token subquadratic attention with ~1000x compute reduction. Stanford's Stable Counting Capacity proves LLMs use finite internal states that collapse to guessing at depth. Grok's 1M context doubles in price past 200K. Capacity is scaling; verified competence at length is not.
12M tokens claimed (SubQ) · 2 sources
Context length (K tokens):
1. SubQ (claimed): 12,000
2. Grok 4.3 (cheap tier): 200
3. GLM-5.1: 200
4. Verified recall floor: 128
03
Enterprise Agent Access Being Gatekept
monitor
SAP blocks all third-party agents except Joule and Nvidia NemoClaw. Anthropic ships 10 finance agents with M365 + Moody's integrations; FactSet drops 8%. The access pattern your agents rely on today is not the one you'll have in 6 months. Contract-level fallbacks needed now.
-8% FactSet stock (one day) · 3 sources
1. SAP lockdown: only Joule + NemoClaw allowed
2. Anthropic finance launch: 10 vertical agents ship
3. FactSet repriced: -8% in a single session
4. Oracle/Salesforce next: expected within 12 months
04
Inference Subsidy: $700B Capex vs. $40B Revenue
background
Hyperscalers spend $700B on AI infra against ~$40B revenue—a 17x gap. Cerebras IPOs at $35B with AWS deal. Figma shows 75% weekly AI credit consumption. Your API bill is subsidized by someone else's balance sheet. Plan architecture for the moment that stops.
17x capex-to-revenue gap · 2 sources
1. AI capex (2026): $700B
2. AI revenue (2025): ~$40B
05
GPU Hardware Trust Boundary Broken
monitor
Two independent teams demonstrated Rowhammer on NVIDIA GPU GDDR memory—one variant bypasses IOMMU entirely. Multi-tenant GPU isolation is no longer a trust boundary. Separately, Meta's 267TB piracy lawsuit makes dataset provenance a legal artifact, not a research footnote.
1 source
1. Multi-tenant GPU isolation confidence: 25

◆ DEEP DIVES

MIT-Licensed 744B Ties Frontier: Your Coding Agent Budget Needs a Rewrite

The Convergence

Three data points landed in the same week, pointing the same way: the frontier API premium on coding workloads is compressing toward zero. Zhipu AI's GLM-5.1, a 744B MoE with roughly 40B active parameters and an MIT license, posted 58.4 on SWE-Bench Pro. GPT-5.4 scored 57.7. Claude Opus 4.6 scored 57.3. That is a 1.1-point spread across the top three, comfortably inside the noise band for eval harnesses with fewer than a few hundred tasks.
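
To make the noise-band point concrete, here is a rough sketch of a normal-approximation confidence interval for a pass-rate benchmark. The task count (`N_TASKS = 500`) is a hypothetical placeholder, since the source does not state the harness size, and the calculation treats tasks as independent pass/fail trials. A paired comparison on identical tasks would be tighter, so read this as an order-of-magnitude check rather than a verdict.

```python
import math


def pass_rate_ci(score_pct: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval (in points) for a pass rate measured on n_tasks."""
    p = score_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n_tasks)
    half_width = z * standard_error * 100.0
    return score_pct - half_width, score_pct + half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_TASKS = 500  # ASSUMPTION: hypothetical task count, not taken from the source

# Scores as reported in the article.
scores = {"GLM-5.1": 58.4, "GPT-5.4": 57.7, "Claude Opus 4.6": 57.3}
cis = {name: pass_rate_ci(score, N_TASKS) for name, score in scores.items()}

for name, (low, high) in cis.items():
    print(f"{name}: {low:.1f} to {high:.1f}")
```

Under that assumed task count the half-width comes out at several points, so all three intervals overlap heavily. Swap in the real task count for your own harness before drawing conclusions.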

The real signal is not that GLM-5.1 is 'best.' It's that an MIT-licensed model you can self-host is now competitive with closed frontier APIs on the coding benchmark enterprises actually cite.

In the same week, Grok 4.3 launched at $1.25/$2.50 per million tokens with a 1M context window, a material undercut of incumbent pricing. And AirLLM demonstrated layer-streaming inference that runs 70B models on a 4GB GPU without quantization, trading VRAM for I/O bandwidth.

What the Headlines Don't Tell You

The gap between first and second place on SWE-Bench Pro is 0.7 points (58.4 vs 57.7), and no one is publishing confidence intervals. Read the leaderboard as 'GLM-5.1 is competitive,' not 'GLM-5.1 is best.' The benchmark measures issue resolution under one harness with specific retrieval assumptions. It does not measure:

Latency under production batch sizes
Cost per resolved ticket at your context lengths
Behavior on half-written internal tickets, the kind filed on a Friday afternoon
Tool-call reliability and retry behavior on long agent sessions

Grok's 1M window has a cost cliff: pricing doubles past 200K tokens to $2.50/$5.00. Pull the production token-length distribution and compute blended cost at p50/p95/p99 before committing. For most RAG workloads this wins. For whole-codebase agents it may not.
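
A minimal sketch of that percentile calculation follows. The two price tiers and the 200K threshold come from the figures above. Everything else is an assumption for illustration: the request-length sample (`INPUT_LENGTHS`), the fixed `OUTPUT_TOKENS`, and in particular the billing rule, which here charges the whole request at the higher tier once the input exceeds 200K. Check the provider's actual billing terms before relying on it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


BASE = Tier(1.25, 2.50)   # prices quoted in the article
LONG = Tier(2.50, 5.00)   # prices quoted in the article, past 200K tokens
CLIFF_TOKENS = 200_000


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request.

    ASSUMPTION: the entire request is billed at the long-context tier
    once the input exceeds the cliff. Marginal billing would be cheaper.
    """
    tier = LONG if input_tokens > CLIFF_TOKENS else BASE
    return (input_tokens * tier.input_per_m + output_tokens * tier.output_per_m) / 1_000_000


def percentile(values: list[int], q: int) -> int:
    """Nearest-rank percentile of a list of token counts."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical production sample: replace with your own logged request lengths.
INPUT_LENGTHS = [20_000] * 60 + [120_000] * 30 + [260_000] * 8 + [600_000] * 2
OUTPUT_TOKENS = 2_000  # hypothetical fixed completion size

for q in (50, 95, 99):
    n = percentile(INPUT_LENGTHS, q)
    print(f"p{q}: {n:>7,} input tokens -> ${request_cost(n, OUTPUT_TOKENS):.4f}")

blended = sum(request_cost(n, OUTPUT_TOKENS) for n in INPUT_LENGTHS) / len(INPUT_LENGTHS)
print(f"blended mean per request: ${blended:.4f}")
```

In this made-up sample the median request is cheap while the tail crosses the cliff, which is the shape that makes a p50-only cost model misleading.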

AirLLM is architecturally honest and operationally a batch-only tool. Throughput is bounded by disk-to-GPU streaming bandwidth. Useful for offline eval harness runs. Approximately unusable for interactive latency.
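
The bandwidth bound can be checked on the back of an envelope. The sketch below assumes every weight is read from disk once per forward pass with no caching or overlap, and uses a hypothetical sustained read speed (`READ_GB_PER_S`). It does not describe AirLLM's actual implementation, which may prefetch or cache layers.

```python
def seconds_per_forward_pass(params_billion: float, bytes_per_param: float,
                             read_gb_per_s: float) -> float:
    """Lower bound on one forward pass when all weights stream from disk."""
    model_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return model_gb / read_gb_per_s


def tokens_per_second(batch_size: int, pass_seconds: float) -> float:
    """Aggregate decode throughput if one streamed pass serves the whole batch."""
    return batch_size / pass_seconds


PARAMS_B = 70          # model size from the article
BYTES_PER_PARAM = 2    # 16-bit weights, i.e. no quantization
READ_GB_PER_S = 3.5    # ASSUMPTION: hypothetical sustained NVMe read bandwidth

pass_s = seconds_per_forward_pass(PARAMS_B, BYTES_PER_PARAM, READ_GB_PER_S)
print(f"one decode step: at least {pass_s:.0f} s")

for batch in (1, 32):
    print(f"batch {batch:>2}: {tokens_per_second(batch, pass_s):.3f} tokens/s aggregate")
```

Batching raises aggregate throughput but leaves per-token latency untouched, which is why the tool fits offline eval runs and not interactive use.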

The Math That Changed

Model	SWE-Bench Pro	Context	Pricing	License
GLM-5.1 (744B/40B active)	58.4	200K	Self-host only	MIT
GPT-5.4	57.7	—	Premium (closed)	Proprietary
Claude Opus 4.6	57.3	—	Premium (closed)	Proprietary
Grok 4.3	Not reported	1M (2× >200K)	$1.25/$2.50	Proprietary

The prior on benchmark swaps like this one: about half of public-benchmark gains survive contact with a production eval, and the surviving half is usually smaller than the headline. That still often justifies migration, but only once cost and latency sit on the same page as accuracy.
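
One way to put cost and latency on the same page as accuracy is a cost-per-resolved-ticket scorecard. The sketch below is illustrative only: the candidate names and every number in `CANDIDATES` are hypothetical placeholders for figures you would measure on your own eval, and the latency cutoff is an assumed requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    resolve_rate: float      # fraction of internal eval tasks resolved
    cost_per_attempt: float  # USD per attempted ticket at your context lengths
    p95_latency_s: float     # measured end-to-end p95 latency in seconds


def cost_per_resolved(c: Candidate) -> float:
    """Expected USD spent per ticket that actually gets resolved."""
    if not 0.0 < c.resolve_rate <= 1.0:
        raise ValueError("resolve_rate must be in (0, 1]")
    return c.cost_per_attempt / c.resolve_rate


def scorecard(candidates: list[Candidate], max_p95_latency_s: float) -> list[Candidate]:
    """Drop candidates that miss the latency budget, then rank by cost per resolved ticket."""
    eligible = [c for c in candidates if c.p95_latency_s <= max_p95_latency_s]
    return sorted(eligible, key=cost_per_resolved)


# Hypothetical measurements, NOT benchmark results from the article.
CANDIDATES = [
    Candidate("model-a", resolve_rate=0.42, cost_per_attempt=0.90, p95_latency_s=45.0),
    Candidate("model-b", resolve_rate=0.40, cost_per_attempt=0.45, p95_latency_s=60.0),
    Candidate("model-c", resolve_rate=0.44, cost_per_attempt=0.30, p95_latency_s=400.0),
]

for c in scorecard(CANDIDATES, max_p95_latency_s=120.0):
    print(f"{c.name}: ${cost_per_resolved(c):.2f} per resolved ticket, p95 {c.p95_latency_s:.0f}s")
```

In this invented example the most accurate candidate is excluded by the latency budget, and a slightly less accurate one wins on cost per resolved ticket.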

The Architecture Tell

MoE with roughly 5% active parameters (40B of 744B) is now the dominant frontier pattern. For training or fine-tuning, dense architectures above 70B are hard to justify on FLOPs-per-quality. Combine that with the progressive-disclosure pattern from Claude Skills (a tiny description in active context, with full instructions lazy-loaded on an embedding match), and the entire prompt-routing architecture deserves a revisit.
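
A minimal sketch of progressive-disclosure routing, under stated assumptions: the `Skill` type, the example skills, and the similarity threshold are hypothetical, and a bag-of-words cosine stands in for a real embedding model. It is not how Claude Skills is implemented. It only illustrates the shape of the pattern: short descriptions are always in context, and full instructions load only on a match.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts. Swap in a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass(frozen=True)
class Skill:
    name: str
    description: str                     # always present in the system prompt
    load_instructions: Callable[[], str]  # called only when the skill matches


def route(query: str, skills: list[Skill], threshold: float = 0.2) -> Optional[Skill]:
    """Return the best-matching skill, or None if nothing clears the threshold."""
    q = embed(query)
    scored = [(cosine(q, embed(s.description)), s) for s in skills]
    if not scored:
        return None
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None


def build_system_prompt(query: str, skills: list[Skill]) -> str:
    catalog = "\n".join(f"- {s.name}: {s.description}" for s in skills)
    prompt = "Available skills:\n" + catalog
    match = route(query, skills)
    if match is not None:
        # Lazy load: full instructions enter the context only on a match.
        prompt += f"\n\nActive skill: {match.name}\n" + match.load_instructions()
    return prompt


# Hypothetical skills for illustration.
SKILLS = [
    Skill("sql-review", "review sql queries and migrations", lambda: "FULL SQL INSTRUCTIONS"),
    Skill("pdf-extract", "extract tables from pdf reports", lambda: "FULL PDF INSTRUCTIONS"),
]

print(build_system_prompt("please review this sql migration", SKILLS))
```

The design choice worth noting is the threshold: with no match, the prompt carries only the short catalog, so unused skills cost a line of context each instead of their full instruction text.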

Action items

Run your internal coding eval (SWE-Bench Pro subset + proprietary repo tasks) against GLM-5.1, GPT-5.4, and Claude Opus 4.6 this sprint
Price-model Grok 4.3 for top-3 LLM workloads using actual token distribution at p50/p95/p99 context lengths
Adopt progressive-disclosure prompt routing (embed-match to skill-specific system prompts) in your agent framework
Pilot AirLLM for offline eval harness runs needing full-precision 70B outputs without GPU rental

Sources: Simplifying AI · Alejandro Saucedo - The Institute for Ethical AI & ML

12M-Token Context Claims Meet Hard Reasoning Ceilings — The RAG Rewrite Clock Starts (But Don't Move Yet)

Two Results That Should Be Read Together

SubQ announced a subquadratic attention architecture (SSA) with a 12M-token context claim and roughly 1000x attention-compute reduction. The same week, Stanford's Stable Counting Capacity work showed LLMs rely on finite count-like internal states, so procedural rule-following collapses into guessing once the counter runs out. These two results are not in conflict. They are complementary.

A 12M-token window that guesses past its counting budget is worse than a 128K window that doesn't, because it sells false confidence in a retrieval-free architecture.

It is entirely possible that SubQ is right about capacity and Stanford is right about competence. Context windows are scaling faster than the evaluation tooling that would measure them honestly.

The Verification Problem

SubQ's number is a vendor announcement. No ablations, no independent runs, no adversarial long-context evals (RULER, LongBench-v2, InfiniteBench). The 12M figure is real as a capacity claim. Whether recall holds past 2M tokens is unestablished, and most long-context models quietly degrade well before that, past roughly 128K.

Dimension	SubQ 12M Claim	Stanford Counting Probe
Claim	12M-token context, ~1000x compute reduction	LLMs use finite count-like internal states
Evidence level	Vendor announcement, no ablations	Published probe methodology
Implication	Context capacity scales dramatically	Reasoning over context has a hard ceiling
Production test	Needle-in-haystack at 1M+ tokens	Procedural rule-following at depth

Grok 4.3's 1M-token context fits the same pattern: the headline is a capacity claim, not a quality claim. Retrieval accuracy past ~128K is where most models quietly degrade, and published benchmarks rarely cover the region you would actually use the long context for.

What This Changes (And Doesn't) for RAG

If SubQ holds at half the reported context length with acceptable recall, the chunking and RAG roadmap needs a significant rewrite. If it holds at a quarter, the roadmap survives with minor edits. Neither outcome is established yet.

The correct response is not to throw out the harness or shelve RAG planning. It is to do three things:

Build a length-stratified evaluation before the vendors do, using actual document shapes from production
Track per-slice accuracy separately from the aggregate number
Accept that public benchmarks and production workloads have stopped overlapping in the places that matter
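
The first two steps above can be sketched as a small aggregation routine. The bucket boundaries mirror the 128K/512K/2M lengths named in this edition. The `RESULTS` sample is invented purely to show how an aggregate number can hide a failing slice.

```python
from collections import defaultdict

# Upper bounds (in tokens) for each length slice.
BUCKETS = [(128_000, "<=128K"), (512_000, "<=512K"), (2_000_000, "<=2M")]


def bucket_for(n_tokens: int) -> str:
    """Map a document length to its slice label."""
    for upper, label in BUCKETS:
        if n_tokens <= upper:
            return label
    return ">2M"


def stratified_accuracy(results: list[tuple[int, bool]]) -> tuple[dict[str, float], float]:
    """Return (accuracy per length slice, aggregate accuracy).

    Each result is (document length in tokens, whether the answer was correct).
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # label -> [correct, count]
    for n_tokens, correct in results:
        label = bucket_for(n_tokens)
        totals[label][0] += int(correct)
        totals[label][1] += 1
    per_slice = {label: c / n for label, (c, n) in totals.items()}
    overall = sum(c for c, _ in totals.values()) / sum(n for _, n in totals.values())
    return per_slice, overall


# Hypothetical eval outcomes, skewed toward short documents as production data often is.
RESULTS = (
    [(50_000, True)] * 18 + [(50_000, False)] * 2
    + [(400_000, True)] * 3 + [(400_000, False)] * 2
    + [(1_500_000, True)] * 1 + [(1_500_000, False)] * 4
)

per_slice, overall = stratified_accuracy(RESULTS)
print(f"aggregate: {overall:.2f}")
for label, acc in per_slice.items():
    print(f"{label}: {acc:.2f}")
```

Because short documents dominate the sample, the aggregate looks healthy while the longest slice is mostly wrong, which is the case for reporting slices separately.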

Several results point the same way. ProgramBench's 0% full-solve rate on real coding tasks, SWE-Bench Pro's tight clustering at the top, and Stanford's counting ceiling all suggest the research leaderboard winner and the production winner are drifting apart every quarter. The leaderboard delta is no longer a sufficient proxy for a deployment decision.

Action items

Run a length-stratified needle-in-haystack + procedural-reasoning benchmark at 128K/512K/2M tokens on your current stack this sprint
Defer any SubQ-based architecture decisions until independent evals on RULER/LongBench-v2/InfiniteBench appear
Add procedural-reasoning depth as a test axis alongside retrieval accuracy in your long-context eval
Audit your Grok 4.3 cost model for the >200K token price cliff before committing any long-context workloads
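
For the needle-in-haystack and procedural-depth axes in the list above, a rough sketch of two test-case generators. Both are hypothetical probes written for illustration. The counting probe is not the Stanford methodology, and the filler sentence, needle wording, and scoring rule are all assumptions you would replace with production document shapes.

```python
import random


def make_needle_prompt(n_filler: int, position: float, secret: str) -> str:
    """Bury one fact at a relative position (0.0 to 1.0) inside repeated filler text."""
    sentences = ["The sky was a uniform grey that afternoon."] * n_filler
    sentences.insert(int(position * n_filler), f"The access code is {secret}.")
    return " ".join(sentences) + "\nWhat is the access code?"


def make_counting_probe(depth: int, seed: int) -> tuple[str, int]:
    """Build a task that requires tracking a counter through `depth` steps.

    Returns the prompt and the ground-truth final value.
    """
    rng = random.Random(seed)
    steps = [rng.choice(("inc", "dec")) for _ in range(depth)]
    answer = sum(1 if step == "inc" else -1 for step in steps)
    prompt = (
        "A counter starts at 0. Apply each step in order, then report the final value.\n"
        + " ".join(steps)
        + "\nFinal value:"
    )
    return prompt, answer


def score_counting(model_output: str, answer: int) -> bool:
    """Exact-match scoring; any non-integer output counts as wrong."""
    try:
        return int(model_output.strip()) == answer
    except ValueError:
        return False


# Sweep depth the same way you sweep context length, and record accuracy per depth.
for depth in (10, 100, 1000):
    prompt, answer = make_counting_probe(depth, seed=depth)
    print(f"depth {depth}: prompt has {len(prompt.split())} words, answer {answer}")
```

Feed the prompts to the model under test and plot accuracy against depth. A cliff in that curve is the behavior the counting result predicts, independent of how large the window is.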

Sources: TheSequence · Simplifying AI · Alejandro Saucedo - The Institute for Ethical AI & ML

SAP Locks the API, Anthropic Opens a Vertical — Enterprise Agent Access Is Reshaping Around You

Two Moves, Same Pattern

Two enterprise AI events this week, same underlying structure. SAP blocked all third-party agents from its APIs except Joule (its own) and Nvidia NemoClaw. In the same window, Anthropic shipped 10 finance agents with Microsoft 365 and Moody's integrations, covering pitchbooks, credit memos, KYC, and month-end close. FactSet dropped 8% in a single session.

The pattern is symmetric. Enterprise incumbents are gating access to their data while AI labs are building vertical distribution through data partnerships. If an agent workflow touches SAP, Oracle, Salesforce, or Workday data, the access pattern available today is not the one that will be available in six months.

The allowlist pattern will propagate from SAP to Oracle, Salesforce, and Workday inside 12 months. Sanctioned-agent fallback paths belong in quarterly planning, not the someday bucket.

The FactSet Repricing: Market Signal, Not Methodology

No one has published a head-to-head eval of Claude finance agents versus FactSet workflows. The price move does not tell you which system is better on task. What the market priced in is distribution risk, not quality superiority:

Microsoft 365 integration: zero switching friction for finance teams in Excel/Outlook/Teams
Moody's data tie-in: authoritative credit data behind the agents, eroding the 'proprietary data' moat
Trajectory pricing: 'this is the worst version of Claude we will ever see'

The FactSet moat was never the NLP. It was normalized data, permissions plumbing, audit trail, and workflow embedding. Agents erode the top layer. On current evidence, they do not erode the middle. A demo repricing and a production repricing are separated by roughly 18 months of integration work.

The Agent Supply-Chain Problem

SAP's lockdown is the more immediately actionable signal for data science teams. Any agent framework reaching into enterprise ERP data now has a supply-chain problem, and the mitigation is not a clever prompt. It is a contract. The right response is:

If your agents touch...	Risk level	Action
SAP data	Immediate	Verify sanctioned-agent status or plan fallback
Oracle ERP	6-month horizon	Document access pattern; negotiate contract terms
Salesforce	6-month horizon	Same—expect Agentforce-first policies
Workday	12-month horizon	Flag in quarterly planning

What Survives on Your Roadmap

For teams building finance or analytics LLM workflows, the triage is clear. Anything that is now a first-party Anthropic agent and lacks a proprietary-data moat should be deprecated. Maintaining a worse version of a commodity is not a career move. Anything that depends on a proprietary data join, a compliance constraint, or a latency budget the agent cannot hit stays on the roadmap and probably gets more funding.

The split is roughly half and half on the roadmaps our sources have seen. The half that survives is the half that was always the actual work.

Action items

Audit all agent integrations touching SAP, Oracle, Salesforce, or Workday for API-policy exposure by end of sprint
Map active ML projects against Anthropic's declared 10 finance workflows—flag any with direct overlap and no proprietary-data moat for deprecation review
Build a rolling eval set of 50-100 real tasks from your domain and re-run it on every major model release, with each run pinned to the exact model version (weights SHA for self-hosted models, versioned model ID for APIs)
Negotiate explicit agent-access terms in your next ERP vendor contract renewal
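
A minimal sketch of what version-pinned eval records could look like. The `EvalRun` fields, the fingerprint scheme, and the example task shape (a dict with an `id` key) are assumptions for illustration, not a prescribed format. The point is that a delta between two runs only means something when both used the same task set and each run names the exact model version.

```python
import hashlib
import json
from dataclasses import dataclass


def task_set_fingerprint(tasks: list[dict]) -> str:
    """Order-independent short hash of the task set, so silent edits are detectable."""
    canonical = json.dumps(sorted(tasks, key=lambda t: t["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class EvalRun:
    model: str
    model_pin: str         # weights checksum or versioned API model ID
    task_fingerprint: str  # from task_set_fingerprint
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def delta(baseline: EvalRun, candidate: EvalRun) -> float:
    """Pass-rate change, refusing to compare runs made on different task sets."""
    if baseline.task_fingerprint != candidate.task_fingerprint:
        raise ValueError("runs used different task sets; not comparable")
    return candidate.pass_rate - baseline.pass_rate


# Hypothetical two-task set and runs, for illustration only.
TASKS = [{"id": "t1", "prompt": "fix failing test"}, {"id": "t2", "prompt": "add retry logic"}]
fingerprint = task_set_fingerprint(TASKS)
baseline = EvalRun("model-a", "version-1", fingerprint, passed=40, total=80)
candidate = EvalRun("model-a", "version-2", fingerprint, passed=44, total=80)
print(f"task set {fingerprint}: delta {delta(baseline, candidate):+.3f}")
```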

Sources: TheSequence · Edwin Dorsey from The Bear Cave · Simplifying AI

◆ QUICK HITS

Update: Gemma MTP speculative decoding now ships with vLLM, MLX, and Transformers first-party support—integration cost dropped from 'research project' to 'drop-in flag'
Alejandro Saucedo - The Institute for Ethical AI & ML
Intel auto-round quantization integrates with vLLM/SGLang/Transformers across CPU/XPU/CUDA—expect ~1.5-2x throughput on 4-bit weight-only quant with task-specific accuracy delta as the open question
Chris Short
NVIDIA GPU Rowhammer bypasses IOMMU—hardware isolation between co-tenants on shared GPU hosts is no longer a trust boundary; run a tenancy audit on any frontier-weight jobs sharing physical hosts
Chris Short
Netflix published ML metadata graph architecture: events + hydration (not dual-writes), Datomic for relationships, Elasticsearch for search—the lineage pattern most teams rebuild every 18 months
Alejandro Saucedo - The Institute for Ethical AI & ML
Anthropic NLAs decode model activations into human-readable text and flag eval-awareness (model knowing it's being tested)—a pre-ship interpretability check, not a post-incident tool
TheSequence
Meta 267TB piracy lawsuit (alleging Zuckerberg-authorized torrenting after walking from $200M licensing deal) is shaping up as the test case for training-data liability—build your provenance ledger now
Chris Short
Codex Chrome extension reads authenticated DOM from Salesforce, Gmail, LinkedIn via DevTools Protocol—flag to security before it spreads organically through engineering
Simplifying AI
Cerebras IPOs Thursday at $35B with AWS supplemental deal—wafer-scale inference now on the procurement menu for long-context/low-batch workloads
Martin Peers

◆ Bottom line

The take.

An MIT-licensed 744B model just tied GPT-5.4 on coding benchmarks, Grok halved the API price floor, and SAP locked its APIs to two sanctioned agents—in a single week. The frontier premium for coding workloads is collapsing while enterprise data access is being gated. Teams that benchmark the open-weight alternative on their actual task distribution this sprint will either save 50% on inference or confirm they can't switch—either answer is worth the day of compute.

