Edition 2026-05-05 · read as Engineer

Prompt and Middleware Ablation Lifts Codex 13 Points on TB 2.0

Sources: 36
Words: 1,561
Read: 8min

Topics: Agentic AI · LLM Inference · AI Regulation

◆ The signal

A controlled ablation moved gpt-5.2-codex from 52.8% to 66.5% on Terminal-Bench 2.0 — a 13-point swing — by changing only prompts and middleware, not weights. That delta is larger than most model-generation upgrades. If your roadmap is 'wait for the next frontier release,' you're optimizing the wrong layer. The competitive surface is your context pipeline, and the staff engineers should be sitting there, not on model selection.

◆ INTELLIGENCE MAP

01
Harness Engineering Delivers Model-Generation Gains
act now
Multiple teams proved this week that orchestration engineering — prompt layout, context pipelines, middleware — moves coding benchmarks 13+ points holding the model constant. Meanwhile Uber burned its 2026 AI budget in 4 months at $500–$2K/engineer/month and Claude Code doubled pricing. The fix isn't cheaper tokens. It's better harness design.
13.7 benchmark points from harness · 6 sources
Tracked: harness-only lift, Uber burn rate, Claude Code price change, DeepClaude cost claim
Chart: before harness fix 52.8 · after harness fix 66.5 · typical model upgrade 56
02
npm Cooldowns Ship — Supply Chain Attacks Go Multi-Ecosystem
act now
npm 11.10.0 ships `min-release-age` — a native cooldown that blocks versions newer than a configured window. TeamPCP poisoned SAP packages (572K weekly downloads), Intercom SDK, and Lightning framework. Attacks now span npm, RubyGems, Go modules, and Packagist simultaneously. Set 7-day cooldown by Friday.
572K weekly downloads poisoned · 3 sources
Tracked: SAP package downloads, ecosystems hit, Axios dependents, npm cooldown config
Targets:
1. Axios (npm): 57M weekly downloads
2. SAP packages (npm): 572K weekly downloads
3. Ruby gems (new): CI/CD targeting
4. Go modules (new): credential theft
5. Packagist/PHP (new): Mini Shai-Hulud
03
Postgres Upsert WAL Amplification
monitor
Datadog found INSERT ON CONFLICT writes WAL even when the update is a no-op: 2x disk writes and 4x WAL syncs on upsert-heavy tables. The speculative insertion path acquires locks and emits WAL before deciding the row is unchanged. Audit wal_bytes in pg_stat_statements divided by affected rows; flag anything over ~2KB/row on narrow tables.
2x disk write amplification · 2 sources
Tracked: disk write overhead, WAL sync overhead, alert threshold, fix (MERGE available)
Chart: expected writes per upsert 1 · actual writes (no-op) 2
04
AI Agent Duration Hits 12 Hours — Orchestration Breaks
monitor
METR's autonomous task horizon is on a clean 10x/year exponential: 30s (2022) → 12h (2026), with 100h projected by year-end. Scaffolding built for minute-scale runs fails at this duration. State must be checkpointed to durable storage, observability needs phase-level traces, and the supervisor cannot hold handles. This is batch infrastructure, not HTTP middleware.
10x per year growth rate · 2 sources
Tracked: GPT-3.5 (2022), GPT-4 (2023), Opus 4.6 (2026), projected EOY 2026
Chart (task horizon, minutes): 2022: 0.5 · 2023: 4 · 2024: 40 · 2025: 360 · 2026: 720
05
Ubuntu DDoSed, Build Pipelines Broke
background
Ubuntu/Canonical infrastructure was hit with 3.5 Tbps DDoS via Beamed (DDoS-for-hire), causing 20+ hour outages. Security APIs and update servers went down. Any Dockerfile with `apt-get update` pinned to a single upstream mirror broke. Multi-terabit DDoS is now commodity. Single-source package infrastructure needs fallback mirrors in CI.
3.5 Tbps DDoS volume · 3 sources
Tracked: DDoS volume, outage duration, attack tool, impact
Chart: build pipeline risk 72

◆ DEEP DIVES

01
Your Context Pipeline Is Worth 13 Benchmark Points — More Than the Next Model
The Data Point That Changes Priorities
Mason Drxy ran controlled ablations on coding agent orchestration this week. Holding the model constant at gpt-5.2-codex, changes to prompts and middleware alone moved Terminal-Bench 2.0 from 52.8% to 66.5%. That is a 13.7-point swing with the model fixed. The same technique pulled 20% on tau2-bench for gpt-5.3-codex. These are production coding benchmarks, not demos.
A 4-point gap between two frontier models, reported without harness control, is noise. With harness effects this large, such an experiment cannot show that the two models actually differ.
Where the Points Come From
Swap the system prompt and the token distribution over the first decision shifts. Reformat tool schemas in middleware and the model's prior over JSON shapes shifts with it. Retry logic hides or exposes parse failures. Truncation policy decides whether the 8K-token trace survives to the next step. None of this is on the model card. All of it is in the harness.
Anthony Maio named the lock-in surface. It is not the harness shell, which LangGraph and deepagents-cli have already commoditized. It is the context pipeline: how repo state gets fetched, ranked, and compressed into the window. That is what compounds with a codebase.
The Cost Dimension Makes This Urgent
Uber reported Claude Code running at $500–$2,000 per engineer per month, burning the full-year AI coding budget in four months. A single Copilot session consumed roughly $221 of inference, over sixty million tokens across 15 messages, against a $40/month subscription. Claude Code enterprise pricing doubled this week.
Three responses landed the same week. Caveman strips about 75% of Claude Code output tokens. Deepclaude proxies to DeepSeek V4 Pro at roughly 17x lower cost. Mistral Medium 3.5 adds self-hosting at 128B dense on 4 GPUs. Which of these actually works is a function of the harness, not the model.
Sources Disagree on What to Build
The principle converges. The implementation does not. One camp (Browser Harness, 592 lines) hands the LLM raw protocol access and lets it write tools at runtime. The other camp (Hermes Agent, SQLite Kanban) builds boring persistent infrastructure. Pick based on trust surface: LLM-generated tools are untested code running in your production environment.
The Architectural Pattern
Pin the harness before A/B testing models. Log the exact system prompt, tool schema serialization, retry policy, and truncation rule on every eval run. Version the context pipeline as production code, not prompt engineering. If 400 tokens of instructions sit buried in 200K of context, the model is not reasoning. It is searching for its own task.
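One way to make "pin the harness" concrete is to fingerprint the harness configuration and attach it to every eval record. The sketch below is a minimal illustration, not a description of any team's tooling: the `HarnessConfig` schema, its field names, and the sample prompt and policy strings are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class HarnessConfig:
    """Everything besides the weights that can move an eval score (hypothetical schema)."""
    model: str
    system_prompt: str
    tool_schema_json: str   # tool schemas exactly as serialized into the request
    retry_policy: str       # free-form description, e.g. "max=3;backoff=exp"
    truncation_rule: str    # free-form description, e.g. "keep_last_tokens=8000"

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys) so the same config always hashes the same.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]


def eval_record(config: HarnessConfig, benchmark: str, score: float) -> dict:
    """Attach the full harness config and its fingerprint to one eval result."""
    return {
        "benchmark": benchmark,
        "score": score,
        "harness_fingerprint": config.fingerprint(),
        "harness": asdict(config),
    }


# Illustrative values only; none of these strings come from a real harness.
baseline = HarnessConfig(
    model="gpt-5.2-codex",
    system_prompt="You are a terminal agent. Restate the task before acting.",
    tool_schema_json='{"name":"run","parameters":{"cmd":"string"}}',
    retry_policy="max=3;backoff=exp",
    truncation_rule="keep_last_tokens=8000",
)
# Same model, different truncation rule: a different harness, so a different fingerprint.
variant = HarnessConfig(**{**asdict(baseline), "truncation_rule": "keep_last_tokens=2000"})
```

Two eval runs are only comparable as a model A/B test when their `harness_fingerprint` values differ solely because of the `model` field. Storing the full config alongside the hash keeps the record auditable after prompts change.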
Action items
- Run an ablation on your top coding agent: hold the model constant, vary system prompt and middleware, measure Terminal-Bench or your internal eval
- Instrument per-engineer monthly token spend with hard ceilings and per-PR cost attribution
- Evaluate Caveman (`claude install-skill JuliusBrussee/caveman`) for 75% output token reduction on Claude Code
- Build a model routing layer that dispatches by task complexity: self-hosted open-weight for scaffolding, frontier API for hard problems
Sources: Your agent harness matters more than your model: 13-point benchmark swings from prompt/middleware alone · The 200k context window is 99.8% noise · Harness engineering is the new infra · MCP patches an agent with tools · Claude Code pricing doubled this week · Uber burned its entire 2026 AI coding budget in four months
02
npm 11.10.0 Ships Dependency Cooldowns — Set Them Before the Next TeamPCP
The Mechanism That's Being Exploited
package-lock.json pins the resolved tree at install time. package.json still ships semver ranges. When a transitive dep is compromised and a new patch version is published, anyone running npm install on a fresh checkout — CI, a new contributor, a rebuilt container — resolves the attacker's version before the lockfile is regenerated. The lockfile guarantees reproducibility of an install that already happened. It says nothing about the next install.
The attacker does not need to touch your lockfile. They need to touch a lockfile you haven't generated yet.
What Happened This Week
TeamPCP pushed malicious versions of SAP's npm packages (572K weekly downloads), the Intercom SDK, and the Lightning deep learning framework. The Axios campaign hit 57M+ weekly downloads across 84K dependents and the s1ngularity attack replicated worldwide within minutes of publication. A parallel campaign landed on Ruby gems, Go modules, and Packagist simultaneously. Four ecosystems, one actor.
PromptMink malware was traced to a library added by an Anthropic Claude Opus commit in February. Spoofed attribution or genuine AI-suggested malware, it doesn't matter which. Both paths say the same thing: AI-assisted dependency work needs a human review gate.
The Fix That Ships Today
Package Manager	Config Key	Recommended Value
npm 11.10.0+	`min-release-age` in .npmrc	7d
pnpm	`minimumReleaseAge`	10080 (pnpm takes minutes; 7 days)
Yarn	`npmMinimalAgeGate`	'7d'
Dependabot	cooldown settings	Extends to Actions + Python
A 7-day cooldown means a freshly published malicious version is invisible to your installs for a week. That's long enough for the community to flag it. The Axios compromise was caught in 3–4 hours. The malicious versions had already replicated worldwide in minutes. A 12-hour cooldown would have blocked it outright.
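The effect of a cooldown on version resolution can be sketched in a few lines. This is a simplified model, not how npm, pnpm, or Yarn implement it: it ignores semver ranges entirely, and the `newest_allowed` helper, the version numbers, and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def newest_allowed(versions, now, cooldown):
    """Return the newest version published at least `cooldown` before `now`.

    `versions` is a list of (version_string, published_at) pairs, assumed
    to be sorted oldest to newest. Returns None if nothing is old enough.
    """
    cutoff = now - cooldown
    eligible = [v for v, published in versions if published <= cutoff]
    return eligible[-1] if eligible else None


now = datetime(2026, 5, 5, 12, 0, tzinfo=timezone.utc)
versions = [
    ("1.4.0", now - timedelta(days=30)),  # long-standing release
    ("1.4.1", now - timedelta(hours=2)),  # hypothetical malicious publish
]

no_cooldown = newest_allowed(versions, now, timedelta(0))          # picks 1.4.1
half_day = newest_allowed(versions, now, timedelta(hours=12))      # picks 1.4.0
seven_days = newest_allowed(versions, now, timedelta(days=7))      # picks 1.4.0
```

In this model a fresh install with no cooldown resolves the two-hour-old version, while both the 12-hour and 7-day windows fall back to the older release. The longer window buys margin for compromises that take days rather than hours to surface.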
The Tradeoff
Legitimate fast-moving dependencies lag by the cooldown window. For packages you control, pin the exact version or override per-dependency. For transitive dependencies from a registry you don't own, the cooldown is the point.
Pair it with npm ci everywhere that isn't a developer explicitly updating — CI, Dockerfiles, rebuilt images. Add fallback Ubuntu/Debian mirrors to CI/CD. The Canonical DDoS this week (3.5 Tbps, 20+ hours) broke every build that pinned a single upstream.
Action items
- Add `min-release-age=7d` to .npmrc in all Node.js projects and equivalent configs for pnpm/Yarn
- Replace `npm install` with `npm ci` in all CI pipelines and Dockerfiles by end of week
- Audit direct deps for SAP npm packages, Intercom SDK, and Lightning framework for TeamPCP compromise
- Add human review gates for AI-generated code commits that add new dependencies
Sources: The lockfile is not the boundary people think it is · "Copy Fail" is the name · MCP had its first supply chain attack in September 2025

03

Postgres INSERT ON CONFLICT Writes 2x WAL on No-Ops — The 30-Minute Audit

The Mechanism

Datadog published the finding this week: INSERT ... ON CONFLICT DO UPDATE takes a row lock and writes to WAL even when the proposed update changes nothing. The speculative insertion path grabs the lock, emits a WAL record, then discovers the row is unchanged. The "do nothing" branch is not free. It is a full write path that happens to produce an identical row.

One logical upsert. Two physical writes. Sometimes three, if a unique index triggers a second pass.

At Datadog's scale this doubled disk writes and quadrupled WAL syncs. The amplification stays invisible until you graph bytes_written against rows_affected and notice the ratio sits around 2x on upsert-heavy tables. This is documented behavior of the speculative insertion path. The docs say so. Nobody reads that section.

Where This Hurts

- Replication lag spikes on read replicas
- Backup sizes growing faster than data volume
- Archive storage costs climbing with no apparent table growth
- IOPS budget consumed by no-op writes in event processing, CDC, and idempotent API handlers

The Audit

Pull WAL generation per statement from pg_stat_statements (the wal_bytes column, available in PG 13+). Divide by affected rows. Flag anything north of ~2KB per row on narrow tables. Check pg_stat_user_tables.n_tup_ins versus n_tup_upd on each target table. If most upserts resolve to updates, the amplification is worse than the headline number suggests.
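The audit can be sketched as one query plus a small filter. The `wal_bytes`, `rows`, `calls`, `queryid`, and `query` columns exist in pg_stat_statements on PG 13+; the `ILIKE` filter, the `flag_amplified` helper, and the sample rows are assumptions made for illustration, and the 2048-byte threshold is just the ~2KB rule of thumb from the text.

```python
# Per-statement WAL volume for upserts; run this against the target database.
AUDIT_SQL = """
SELECT queryid, query, calls, rows, wal_bytes
FROM pg_stat_statements
WHERE query ILIKE '%ON CONFLICT%'
ORDER BY wal_bytes DESC
LIMIT 10;
"""

THRESHOLD_BYTES_PER_ROW = 2048  # the ~2KB/row rule of thumb


def flag_amplified(stats, threshold=THRESHOLD_BYTES_PER_ROW):
    """Return (queryid, wal_bytes_per_row) for statements above the threshold.

    `stats` is an iterable of dicts with the columns selected by AUDIT_SQL.
    Statements that affected zero rows are skipped to avoid dividing by zero.
    """
    flagged = []
    for s in stats:
        if s["rows"] == 0:
            continue
        per_row = s["wal_bytes"] / s["rows"]
        if per_row > threshold:
            flagged.append((s["queryid"], round(per_row)))
    # Worst offenders first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up sample rows standing in for the query result.
sample = [
    {"queryid": 1, "rows": 1000, "wal_bytes": 4_500_000},  # 4500 bytes/row
    {"queryid": 2, "rows": 1000, "wal_bytes": 300_000},    # 300 bytes/row
    {"queryid": 3, "rows": 0, "wal_bytes": 8192},          # skipped: no rows
]
flagged = flag_amplified(sample)
```

The counters are cumulative since the last statistics reset, so comparing two snapshots taken over a known interval gives a cleaner ratio than a single read.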

Fix Options by Workload

Pattern	Fix	Tradeoff
Most upserts are updates	MERGE (PG 15+) or UPDATE then INSERT on miss	Race window needs retry logic
Most upserts are inserts	Keep ON CONFLICT — amplification is the price	Accept the WAL cost
No-ops dominate	Pre-filter with WHERE clause or EXISTS check	Trades a read for the write
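The "UPDATE then INSERT on miss" row, combined with a no-op filter, can be sketched as the control flow below. It uses Python's built-in sqlite3 purely as a stand-in so the logic runs anywhere; it says nothing about WAL volume, which has to be measured on Postgres. The `kv` table and the `upsert_if_changed` helper are hypothetical, and on Postgres the null-safe comparison would be `IS DISTINCT FROM` rather than SQLite's `IS NOT`.

```python
import sqlite3


def upsert_if_changed(conn, key, value, max_retries=3):
    """UPDATE only when the value differs; INSERT only on a miss.

    Returns "updated", "inserted", or "noop".
    """
    for _ in range(max_retries):
        cur = conn.execute(
            "UPDATE kv SET value = ? WHERE key = ? AND value IS NOT ?",
            (value, key, value),
        )
        if cur.rowcount == 1:
            conn.commit()
            return "updated"
        # rowcount == 0: the row is either missing or already holds `value`.
        row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
        if row is not None and row[0] == value:
            conn.commit()
            return "noop"
        if row is None:
            try:
                conn.execute("INSERT INTO kv (key, value) VALUES (?, ?)", (key, value))
                conn.commit()
                return "inserted"
            except sqlite3.IntegrityError:
                # Another writer inserted the key between our UPDATE and INSERT.
                conn.rollback()
        # The row appeared or changed under us: loop and try the UPDATE again.
    raise RuntimeError("upsert_if_changed: retries exhausted")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
first = upsert_if_changed(conn, "a", "1")   # no row yet
second = upsert_if_changed(conn, "a", "1")  # identical value
third = upsert_if_changed(conn, "a", "2")   # changed value
```

The extra SELECT is the "trades a read for the write" cost from the table, and the retry loop covers the race window. Whether that trade pays off depends on the no-op ratio of the specific statement.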

Stripe's Related Pattern Worth Stealing

Stripe published their zero-downtime migration architecture this week: 2,000+ MongoDB shards at 5M QPS with 99.9995% reliability. The reconciliation loop is the part worth copying: row hash comparison instead of row counts, CDC from a pinned LSN, version-gated cutovers. The lesson for any Postgres shard rebalance is simple. Run reconciliation for a week before flipping anything. If the diff count does not trend to zero, the bug is in the dual-write path, not in CDC.

Action items

- Audit top 10 upsert-heavy tables: divide wal_bytes in pg_stat_statements by rows affected, flag anything >2KB/row
- Rewrite the 3 highest-WAL upsert queries using conditional UPDATE with WHERE clause filtering no-ops
- For any planned Postgres migration, adopt Stripe's reconciliation pattern: hash comparison, not row counts

Sources: INSERT ... ON CONFLICT DO UPDATE looks like one write. It isn't. · Stripe published the pattern

04
12-Hour Agent Runs Need Batch Infrastructure, Not Request Middleware
The Trendline
METR's autonomous task horizon measures how long a system runs unattended at 50% reliability. The curve is clean exponential. GPT-3.5 managed 30 seconds in 2022. GPT-4 hit 4 minutes in 2023. o1 reached 40 minutes in 2024. GPT 5.2 hit 6 hours in 2025. Opus 4.6 is around 12 hours in 2026. Ajeya Cotra projects 100 hours by year-end. That is roughly 10x per year with no plateau in the data.
Most agent orchestration shipped in 2026 was designed for the 4-to-40-minute regime. The 12-hour regime is a different system.
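The growth rate can be checked directly from the figures quoted above. The sketch below only does arithmetic on those numbers, converted to minutes; treating the 2026 entry as a mid-year reading is an assumption based on this edition's May date.

```python
# Task-horizon figures quoted above, converted to minutes.
horizon_minutes = {2022: 0.5, 2023: 4, 2024: 40, 2025: 360, 2026: 720}

years = sorted(horizon_minutes)
# Year-over-year multiplier for each consecutive pair of years.
ratios = {
    year: horizon_minutes[year] / horizon_minutes[prev]
    for prev, year in zip(years, years[1:])
}

# Geometric mean growth over the three completed steps (2022 -> 2025).
completed = (horizon_minutes[2025] / horizon_minutes[2022]) ** (1 / 3)

# What the 100-hour year-end projection would imply for 2025 -> 2026.
projected_ratio = (100 * 60) / horizon_minutes[2025]

for year, r in ratios.items():
    print(f"{year}: {r:.1f}x over the previous year")
print(f"2022-2025 average: {completed:.1f}x per year")
print(f"2026 at 100h: {projected_ratio:.1f}x over 2025")
```

The completed years come out at 8x, 10x, and 9x, which supports "roughly 10x per year." The 12-hour reading so far in 2026 is only 2x over 2025, while the 100-hour projection would imply about 17x for the full year.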
What Breaks at Duration
The supervisor dies first. It was written assuming it holds a handle to the work. At 12 hours the process gets restarted, the upstream rotates an API key, a dependency rate-limits, something OOMs. Work has to be checkpointed to durable storage, keyed by a task ID the agent owns, and resumed by whichever worker picks it up.
Observability breaks second. Logs that read fine for one request are unreadable across 12 hours. Traces need segmenting by phase, not by call. If the only signal is "still running," the operator has nothing to decide with.
Context management breaks third. Retry logic assumes transient failures and cheap state rebuilds. Checkpoint intervals assume losing the gap is tolerable. The context window budget was set when the agent ran for four minutes. None of those assumptions survive twelve hours.
Sources Disagree on Current Reliability
METR reports 12-hour autonomous capability at the frontier. ClawMark, the multi-day benchmark, reports low task success rates on sustained autonomous work. SWE-Bench at 93.9% measures implementation capability in isolation. The gap between completing a coding task and sustaining a multi-day project is still wide. Design for bounded autonomy with human checkpoints at natural phase boundaries.
The Architecture
This is workflow-engine territory. Temporal or Step Functions, not HTTP middleware. Durable checkpoints the agent can resume from without replaying tool calls. Idempotency on every side-effecting action. A state store the agent reads from, rather than a transcript it scrolls through. Explicit trust gates on irreversible actions. Separate budgets for compute spend and wall-clock. If the current system treats the transcript as the state, that is the thing to rewrite first.
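The checkpoint and idempotency pieces can be sketched with a small durable store. This is an illustration under stated assumptions, not how Temporal or Step Functions work internally: sqlite3 stands in for the durable store, and the `TaskStore` class, its table layout, and the task and key names are hypothetical.

```python
import json
import sqlite3


class TaskStore:
    """Durable state for long-running tasks, keyed by a task ID."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(task_id TEXT PRIMARY KEY, phase TEXT, state TEXT)"
        )
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS effects "
            "(idempotency_key TEXT PRIMARY KEY, result TEXT)"
        )
        self.conn.commit()

    def save(self, task_id, phase, state):
        """Overwrite the checkpoint for this task with the latest phase and state."""
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (task_id, phase, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, task_id):
        """Return (phase, state), or (None, {}) if the task has no checkpoint."""
        row = self.conn.execute(
            "SELECT phase, state FROM checkpoints WHERE task_id = ?", (task_id,)
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else (None, {})

    def run_once(self, idempotency_key, action):
        """Skip the action if a result is already recorded for this key.

        A crash between action() and the INSERT can still repeat the action,
        so real systems also pass the key to the downstream API.
        """
        row = self.conn.execute(
            "SELECT result FROM effects WHERE idempotency_key = ?", (idempotency_key,)
        ).fetchone()
        if row:
            return json.loads(row[0])
        result = action()
        self.conn.execute(
            "INSERT INTO effects VALUES (?, ?)", (idempotency_key, json.dumps(result))
        )
        self.conn.commit()
        return result


store = TaskStore(":memory:")  # use a file path for real durability
calls = []


def open_pr():
    calls.append("open_pr")  # stands in for a real side effect
    return {"pr": "opened"}


store.run_once("task-42:open-pr", open_pr)
store.save("task-42", "review", {"files_done": 3})
# A resumed worker replays the same step without repeating the side effect.
replayed = store.run_once("task-42:open-pr", open_pr)
phase, state = store.load("task-42")
```

Any worker that picks up `task-42` can call `load` and continue from the recorded phase, which is the difference between a state store the agent reads from and a transcript it has to replay.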
Action items
- Audit your longest-running agent workflow: measure max execution time, identify state held in memory vs. durable storage
- Implement durable checkpointing (Temporal, Step Functions, or custom) for any agent task exceeding 30 minutes
- Add phase-level observability segments to agent traces, not just per-call spans
- Do NOT architect fully autonomous multi-day workflows yet — ClawMark results show reliability is too low
Sources: AI agents now sustain twelve-hour autonomous tasks · DeepSeek V4 landed with a 1M-token MoE and an open-source license

◆ QUICK HITS

- Zyphra's folded Tensor/Sequence Parallelism hits 173M tok/sec vs 86M standard on 1024 MI300X GPUs: the first credible 2x throughput number that makes AMD inference serious, not just a procurement hedge
  Source: Your agent harness matters more than your model: 13-point benchmark swings from prompt/middleware alone
- Bluesky postmortem: ephemeral port exhaustion cascaded into a logging storm that consumed the same ports it needed to report the original failure. Separate your error-reporting path from your data path's resource pool
  Source: INSERT ... ON CONFLICT DO UPDATE looks like one write. It isn't.
- K8s 1.36 Pod-Level Resource Managers (alpha): sidecars now share a separate CPU pool from NUMA-pinned main containers, preserving Guaranteed QoS without wasting exclusive cores on Envoy and log collectors
  Source: Kubernetes 1.36 addresses the sidecar-versus-QoS interaction
- IBM Granite 4.1 ships 30B dense, 512K context under Apache 2.0: the first open-weight model with enough context for whole-codebase analysis at zero licensing cost. Evaluate as tiered routing fallback for the 70% of tasks that don't need frontier quality
  Source: Uber burned its entire 2026 AI coding budget in four months
- Stripe's zero-downtime migration at 5M QPS uses version-gated cutovers: verify consistency at a specific version boundary before committing. Rollback is trivial because you just don't commit. Steal this for your next shard rebalance
  Source: Stripe published the pattern
- Update: Copy Fail Linux privesc. CISA confirmed active exploitation within 24 hours of disclosure. If you haven't patched since Saturday's advisory, every host is now assumed compromised until verified
  Source: "Copy Fail" is the name
- AI-generated traffic creating novel capacity planning failure: correlated tail latencies across tenants sharing an upstream model invalidate independent-percentile assumptions in autoscaler policies
  Source: Kubernetes 1.36 addresses the sidecar-versus-QoS interaction
- Google DeepMind's CaMeL: planner-executor separation that reports 'practically solving' the AgentDojo prompt injection benchmark. The planner sees untrusted input but holds zero tool handles; the executor has tools but never touches raw input
  Source: Most LLM agent deployments run one defense layer against prompt injection

◆ Bottom line

The take.

The single biggest performance lever for your AI coding agents this week isn't a model upgrade — it's harness engineering, which delivered a 13-point benchmark swing while Uber burned its entire 2026 AI budget in four months at $500–$2K per engineer. Set min-release-age=7d in your .npmrc before Friday because four package ecosystems are being poisoned simultaneously, audit your Postgres upserts for the 2x WAL amplification Datadog just documented, and start treating agent orchestration as batch infrastructure — because the autonomy window jumped to 12 hours and your scaffolding was built for four minutes.

Frequently asked

How can prompt and middleware changes alone produce a 13-point benchmark swing?

Holding gpt-5.2-codex constant, system prompts shift the token distribution over the model's first decision, tool schema serialization shifts its prior over JSON shapes, and truncation policy decides what survives into the next step. None of these appear on the model card, but together they moved Terminal-Bench 2.0 from 52.8% to 66.5% in a controlled ablation. The harness is doing more work than the weights.

What's the minimum config change to defend against this quarter's npm supply chain attacks?

Add `min-release-age=7d` to .npmrc (npm 11.10.0+), or the equivalent `minimumReleaseAge` in pnpm and `npmMinimalAgeGate` in Yarn. A 7-day cooldown hides freshly published malicious versions from your installs long enough for the community to flag them; the Axios compromise was caught in 3–4 hours. Pair it with `npm ci` everywhere that isn't an explicit update so fresh resolutions don't bypass the lockfile.

Why does INSERT ON CONFLICT DO UPDATE double WAL writes even on no-ops?

The speculative insertion path takes a row lock and emits a WAL record before discovering the row is unchanged, so the "do nothing" branch still produces a full physical write. At Datadog's scale this doubled disk writes and quadrupled WAL syncs. To find it, divide WAL bytes by affected rows in pg_stat_statements and flag anything north of ~2KB per row on narrow tables.

When should I rewrite an upsert versus accept the WAL amplification?

Rewrite when most upserts resolve to updates or no-ops: use MERGE (PG 15+), or a conditional UPDATE with a WHERE clause that filters unchanged rows, accepting a small race window with retry logic. Keep ON CONFLICT when most upserts are genuine inserts; the amplification is the price of the conflict-handling path. Don't rewrite every upsert, only the three that dominate your WAL graph.

What architecture do 12-hour agent runs need that current orchestration doesn't provide?

Workflow-engine infrastructure like Temporal or Step Functions, not HTTP middleware. The supervisor cannot hold a handle to the work for 12 hours: processes restart, keys rotate, dependencies rate-limit. You need durable checkpoints keyed by a task ID the agent owns, idempotency on every side-effecting action, a state store the agent reads from instead of a transcript it scrolls through, and phase-level trace segmentation so operators can tell stuck from working.
