Monday, February 23, 2026 ~5 min

OpenAI's 33% margin and Stripe's 1,000 weekly PRs are the same story

Three engineers just shipped a million lines with coding agents while OpenAI's gross margin landed thirteen points below its own forecast. The productivity curve and the cost curve are bending in opposite directions, and your roadmap has to pick a side.

Three engineers at OpenAI built a million-line internal product in five months with zero hand-written code. Stripe's internal agents now merge over a thousand PRs a week. A solo developer pushed 6,600 commits in a single month running five to ten agents in parallel. These aren't demo numbers — they're production throughput at companies that ship to paying customers.

Now read the other half of the week's news. OpenAI's leaked 2025 financials show gross margin at 33%, against its own internal forecast of 46%. Model serving costs quadrupled year over year. Projected cumulative cash burn through 2030 has more than doubled to $111 billion. The path to break-even assumes training costs drop $28 billion in 2030 to offset inference growth, a heroic assumption that depends on factors nobody at OpenAI controls.

These two stories are the same story. The productivity unlock everyone is celebrating runs on infrastructure whose unit economics are getting worse, not better, and the company most exposed to that math just quietly cut its compute target from $1.4 trillion to $600 billion through 2030. If you build with AI, both numbers belong on the same slide.

The harness, not the model

Mitchell Hashimoto's term for what OpenAI, Stripe, and Anthropic have independently converged on is harness engineering — the discipline of building constraints, linters, documentation, and sandboxed environments around coding agents. The patterns are concrete and adoptable this week:

- An AGENTS.md at the repo root, updated every time an agent fails, so each line traces to a past mistake.
- Custom linters whose error messages double as remediation instructions, so the agent reads its own violation and fixes it without a human in the loop.
- Internal tools exposed via MCP; Stripe's Toolshed wraps over 400 of them.
- JSON over Markdown for agent state, because Anthropic found agents treat Markdown as prose they can rewrite but respect JSON's structure.
- Plan-then-execute as a mandatory workflow, with human approval before code generation.
- Sandboxed devboxes that look identical to a normal development environment but cannot reach production or the internet.
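To make the linter pattern concrete, here is a minimal sketch in Python. It is not any of these companies' tooling. The layer names, the `ALLOWED_IMPORTS` table, and the `lint_layer_imports` function are all assumptions invented for illustration. The only point is the shape of the error message: it names the violation and then tells the agent what to do about it.

```python
import ast

# Hypothetical layering rule: each layer may import only the layers listed for it.
ALLOWED_IMPORTS = {
    "ui": {"ui", "service"},
    "service": {"service", "repository"},
    "repository": {"repository"},
}


def lint_layer_imports(source: str, layer: str, filename: str = "<agent-output>") -> list[str]:
    """Return one actionable message per import that crosses a forbidden layer boundary."""
    allowed = ALLOWED_IMPORTS[layer]
    messages = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            top = module.split(".")[0]
            # Only police imports of known layers; stdlib and third-party imports pass.
            if top in ALLOWED_IMPORTS and top not in allowed:
                messages.append(
                    f"{filename}:{node.lineno}: layer '{layer}' must not import '{module}'. "
                    f"Fix: depend only on {sorted(allowed)} and move this call behind "
                    f"a function in one of those layers."
                )
    return messages


if __name__ == "__main__":
    sample = "import os\nfrom repository.users import get_user\n"
    for message in lint_layer_imports(sample, layer="ui", filename="ui/page.py"):
        print(message)
```

Because the remediation lives in the message itself, the same string serves a human reading CI output and an agent reading its own failed check.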

The counterintuitive lesson across every success story: you get more from agents by constraining them harder, not loosening the leash. The bottleneck was never code generation. It was the environment around the model.

And then the reliability cliff. Anthropic's own Opus 4.6 benchmarks: 80% accuracy on 1-hour tasks, 50% at 14.5 hours. Chain five sub-tasks at 80% each and your end-to-end success rate is 0.8^5, or about 33%. Anything you scope well past an hour without a checkpoint is drifting toward a coin flip. Engineers cap out at three or four parallel agent sessions before becoming the bottleneck themselves. And every published success story is greenfield: retrofitting harness engineering to a ten-year-old codebase with patchy tests and implicit conventions remains an open problem nobody has convincingly solved.
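The chaining arithmetic is worth running yourself. The sketch below assumes sub-tasks fail independently, which real agent runs may not. The retry variant further assumes a verification gate that catches every failure, so treat its output as an upper bound, not a forecast. The function names are illustrative.

```python
def chained_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return per_step ** steps


def max_steps(per_step: float, target: float) -> int:
    """Largest chain length that keeps end-to-end success at or above target."""
    if not (0.0 < per_step < 1.0 and 0.0 < target <= 1.0):
        raise ValueError("per_step must be in (0, 1) and target in (0, 1]")
    steps = 0
    while chained_success(per_step, steps + 1) >= target:
        steps += 1
    return steps


def success_with_retries(per_step: float, retries: int) -> float:
    """Per-step success when a gate catches each failure and allows `retries` re-runs.

    Assumes a perfect verifier and independent attempts (both optimistic).
    """
    return 1.0 - (1.0 - per_step) ** (retries + 1)


if __name__ == "__main__":
    print(round(chained_success(0.8, 5), 3))   # five ungated steps at 80% each
    print(max_steps(0.8, target=0.5))          # chain length before dropping under 50%
    gated = success_with_retries(0.8, retries=1)
    print(round(chained_success(gated, 5), 3)) # same chain with one gated retry per step
```

Under these assumptions, five ungated steps land near 33%, the chain falls below a coin flip after three steps, and one gated retry per step lifts the five-step chain to roughly 82%. That is the quantitative case for checkpoints.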

So the operating envelope is sharp: greenfield projects, sub-hour atomic tasks, mandatory verification gates, a designated agents captain per team, and tool surface area exposed via MCP. Inside that envelope, three engineers can do the work of fifteen. Outside it, you're shipping unreviewed code at scale.

The attack surface you didn't add to the threat model

The same protocol that makes harness engineering work, MCP, is already under active attack. Cisco's SVP of AI was direct about it: agents are being hijacked, impersonated, and manipulated to exfiltrate data at machine speed. A Cambridge study found only 4 of 30 top AI agents have published formal safety evaluations; browser agents, the most autonomous category, are missing 64% of safety disclosures. Amazon disclosed that a small group of Russian-speaking attackers used commercial AI tools to breach over 600 Fortinet firewalls across 55 countries in weeks. In Amazon's telling, that scale would have been impossible without AI.

The attack vectors on the firewalls were mundane: weak passwords, exposed management ports. The novelty was throughput. A small team, off-the-shelf tools, nation-state-scale impact in weeks. Your detection thresholds were calibrated for human-speed attackers. The attackers aren't human-speed anymore.

For agent infrastructure specifically, treat agents as first-class IAM principals. Audit every MCP connection, OAuth grant, and persistent token issued to an AI tool. Put AGENTS.md, CLAUDE.md, and equivalent agent-config files under CODEOWNERS: a malicious commit to any of them is a control-plane compromise that affects every future code generation session. Insert hard human-in-the-loop gates before any agent action that touches privilege escalation, production environments, or irreversible data operations. The 80/20 design pattern Cisco articulated is the right shape: full autonomy for routine pattern-matching, explicit escalation for the consequential 20%.
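A human-in-the-loop gate can be very small. The sketch below is one possible shape, not Cisco's design. The category names, `AgentAction`, `run_with_gate`, and the `ask_human` callback are all hypothetical, and a real system would back the approval step with your ticketing or chat tooling.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical categories for the "consequential 20%"; derive the real list
# from your own threat model.
ESCALATION_CATEGORIES = frozenset(
    {"privilege_escalation", "production_change", "irreversible_data_operation"}
)


@dataclass(frozen=True)
class AgentAction:
    principal: str    # the agent's own identity, not a shared service account
    category: str     # e.g. "read_logs" or "production_change"
    description: str


class ApprovalDenied(Exception):
    """Raised when a human reviewer rejects a gated action."""


def run_with_gate(
    action: AgentAction,
    execute: Callable[[AgentAction], str],
    ask_human: Callable[[AgentAction], bool],
) -> str:
    """Run routine actions directly; hold gated ones until a human approves."""
    if action.category in ESCALATION_CATEGORIES and not ask_human(action):
        raise ApprovalDenied(f"{action.principal}: {action.description}")
    return execute(action)
```

Routine actions never reach `ask_human`, so the gate adds no latency to the autonomous 80%. Carrying `principal` on every action is what lets an audit log attribute each decision to a specific agent identity.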

What to do this week

Stop arguing about whether AI tools are useful and start building the harness. Create an AGENTS.md at the root of your primary repo by Friday — even a thin one, started from your last three post-mortems. Pick one greenfield workstream on the next planning cycle and scope it agent-native from day one: strict layered architecture, custom linters with remediation instructions in the error messages, MCP-exposed internal tooling, and a designated owner.

Then do the budget side. Pull your 2026 inference cost projections and stress-test them against OpenAI's reality: 4x cost growth in a single year and a 13-point gross margin miss against their own forecast. If your AI feature P&L assumes per-token prices declining, you're modeling fiction. Add a second foundation model provider behind an abstraction layer so model choice is a routing decision, not a hardcoded constant. Benchmark Llama 3, Mistral, and Qwen against your specific production tasks now, while you have the time to do it deliberately.
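Here is a rough sketch of what "model choice is a routing decision" could look like in code. Everything in it is an assumption made for illustration: the `ModelRouter` class, the `(model, prompt) -> text` client shape, and the provider and model names are placeholders. No real SDK calls are shown; you would wrap each vendor's client to fit the callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A provider client is any callable taking (model, prompt) and returning text.
ProviderClient = Callable[[str, str], str]


@dataclass(frozen=True)
class Route:
    provider: str
    model: str


class ModelRouter:
    """Maps task names to (provider, model) so callers never hardcode either."""

    def __init__(self, clients: Dict[str, ProviderClient], routes: Dict[str, Route]) -> None:
        self._clients = clients
        self._routes = dict(routes)

    def set_route(self, task: str, route: Route) -> None:
        if route.provider not in self._clients:
            raise KeyError(f"no client registered for provider '{route.provider}'")
        self._routes[task] = route

    def complete(self, task: str, prompt: str) -> str:
        route = self._routes[task]
        return self._clients[route.provider](route.model, prompt)


def fake_client(name: str) -> ProviderClient:
    """Stand-in for a wrapped vendor SDK; echoes which provider and model were hit."""
    return lambda model, prompt: f"[{name}/{model}] {prompt}"


if __name__ == "__main__":
    router = ModelRouter(
        clients={"provider_a": fake_client("provider_a"), "provider_b": fake_client("provider_b")},
        routes={"summarize": Route("provider_a", "model-large")},
    )
    print(router.complete("summarize", "hello"))
    router.set_route("summarize", Route("provider_b", "model-small"))  # a config change, not a code change
    print(router.complete("summarize", "hello"))
```

Application code only ever names the task, so the same seam lets you run a benchmark of candidate models against production prompts before moving any traffic.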

Wednesday's earnings — Nvidia, Salesforce, Snowflake, Zoom on the same day — will tell you which layer of the stack is actually capturing the value. Nvidia's forward guidance on Blackwell ramp matters more than the backward number. Salesforce's Agentforce ARR trajectory matters more than the EPS print. Watch which way the margin pressure flows.

The teams that win the next twelve months won't be the ones with the best model access. They'll be the ones who built the harness fastest, hardened the agent attack surface in lockstep with adoption, and refused to model AI costs as a curve that only goes down.

◆ Behind the synthesis

Six specialist takes that fed this piece.

The piece above is one stream in my voice. Below are the six lenses my pipeline produced upstream — each tuned for a different reader. Use them when you want the angle that matters most to your role.

Harness engineering — the discipline of building constraints, linters, documentation, and sandboxed environments around coding agents — has independently emerged at OpenAI, Stripe, and Anthropic as the critical unlock for AI-assisted development.

A codified 'harness engineering' playbook has emerged simultaneously from OpenAI, Stripe, and Anthropic — with hard data showing 3-person teams outputting at 15-person rates (3.5 PRs/engineer/day, 1,000+ merged PRs/week at Stripe).

Three engineers at OpenAI built a million-line product in five months with zero hand-written code, while the company's own financials reveal AI gross margins collapsing to 33% with $111B in projected cash burn through 2030.