PROMIT NOW · ENGINEER DAILY · 2026-02-23

Harness Engineering Emerges as the AI Coding Unlock

· Engineer · 22 sources · 1,435 words · 7 min

Topics: Agentic AI · LLM Inference · AI Capital

Harness engineering — the discipline of building constraints, linters, documentation, and sandboxed environments around coding agents — has independently emerged at OpenAI, Stripe, and Anthropic as the critical unlock for AI-assisted development. OpenAI's 3-person team shipped a million-line product in five months with zero hand-written code; Stripe's agents merge 1,000+ PRs per week. The bottleneck was never the model — it was your environment. Start building AGENTS.md and agent-friendly linter rules this week, because this infrastructure compounds and every day you wait widens the gap.

◆ INTELLIGENCE MAP

  01

    Harness Engineering: The New Discipline for Agent-Assisted Development

    act now

    OpenAI, Stripe, and Anthropic have converged on a concrete set of patterns — AGENTS.md, custom linters with remediation instructions, MCP-exposed tooling, and plan-then-execute workflows — that make coding agents production-viable, but every success story is greenfield and legacy adoption remains unsolved.

    3 sources
  02

    MCP and Agent Protocol Security Is the New Attack Surface

    act now

    Cisco confirms attackers are already probing MCP and agent-to-agent protocols — the same protocols that harness engineering relies on for tool exposure — meaning agent infrastructure hardening must happen in parallel with adoption, not after.

    2 sources
  03

    Multi-Model Routing and LLM Architecture Patterns

    monitor

    Multi-model workflows (Claude for coding, Gemini for long context, GPT-5 for general tasks) are now standard practice at the executive level, and GPT-5's native platform connectors signal OpenAI is moving from model provider to integration platform — your LLM abstraction layer needs model routing as a first-class capability.

    3 sources
  04

    Developer Productivity Measurement and AI Workflow Gaps

    monitor

    LinkedIn open-sourced their Developer Productivity & Happiness Framework while practitioners report a widening gap between engineers who effectively use LLMs and those who don't — structured team retrospectives on AI tool usage are overdue.

    2 sources
  05

    AI Infrastructure Energy and Ownership Constraints

    background

    72% of enterprises report being blocked by infrastructure debt when scaling AI, and the AI-energy nexus is becoming a top editorial theme — the bottleneck for most organizations isn't model capability but the data, power, and infrastructure plumbing feeding the models.

    2 sources

◆ DEEP DIVES

  01

    Harness Engineering Is Here: The Concrete Playbook from OpenAI, Stripe, and Anthropic

    <h3>The Environment Was Always the Bottleneck</h3><p>Mitchell Hashimoto coined the term <strong>harness engineering</strong> — the practice of building constraints, tools, documentation, and feedback loops that keep coding agents productive. The key insight, validated independently across OpenAI, Stripe, and Anthropic: <strong>agent capability was never the constraint</strong>. The environment was. Think Docker for AI-generated code — it didn't make applications faster, it made deployment reliable by constraining the environment.</p><blockquote>OpenAI's 3-person team built a million-line internal product in five months with zero hand-written code, averaging 3.5 PRs per engineer per day. Their secret wasn't a better model — it was a strict layered architecture with rigid dependency boundaries enforced by custom linters.</blockquote><h3>The Patterns That Work</h3><p>The convergence across organizations is striking. Here are the concrete, adoptable patterns:</p><ul><li><strong>AGENTS.md as a living feedback loop</strong>: Not static documentation — every time an agent makes a mistake, you add a line preventing that class of error permanently. Ghostty's AGENTS.md has each line corresponding to a specific past agent failure. OpenAI uses a hierarchical approach: a small AGENTS.md pointing to deeper design docs, architecture maps, and quality grades, all versioned in the repo.</li><li><strong>Custom linter rules with remediation instructions</strong>: When an agent violates an architectural constraint, the error message tells it exactly how to fix the violation. This creates a self-correcting loop that doesn't require human intervention for known failure modes. 
The linters themselves were Codex-generated.</li><li><strong>MCP-exposed internal tooling</strong>: Stripe's Toolshed platform exposes <strong>400+ internal tools via MCP servers</strong>, giving agents the same operational surface area as human engineers.</li><li><strong>JSON over Markdown</strong> for agent-facing structured data: Anthropic discovered agents treat Markdown as prose they can freely rewrite, but respect JSON's structure. Small detail, big implications for agent-facing configuration.</li><li><strong>Plan-then-execute as mandatory workflow</strong>: No agent writes code until a human has reviewed and approved a written plan.</li></ul><h3>The Honest Failure Modes</h3><p>Agent-generated code <strong>accumulates entropy differently</strong> than human-written code. OpenAI runs periodic 'garbage collection' agents but admits it's an emerging practice. Anthropic found agents marking features as complete without proper end-to-end testing — their browser automation tools have blind spots (<em>Puppeteer can't see native alert modals</em>). The human review bottleneck is real: practitioners cap out at <strong>3-4 parallel agent sessions</strong> before becoming the constraint.</p><blockquote>Every success story is greenfield. Applying harness engineering to a legacy codebase with inconsistent testing, implicit conventions, and patchy documentation is an open problem nobody has convincingly solved.</blockquote><h3>Why This Compounds</h3><p>Every AGENTS.md update, every custom linter rule, every MCP-exposed tool accelerates all future agent work. Stripe's unattended model — developer posts a task in Slack, agent writes code in a pre-warmed sandboxed devbox, passes CI, opens a PR — produces <strong>1,000+ merged PRs per week</strong>. But it requires mature infrastructure most teams don't have yet. The role of the engineer is bifurcating: you're either building the environment or managing the work. Start now, even incrementally.</p>
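The linter-with-remediation loop is easy to sketch. The following is a minimal Python illustration, not any team's actual tooling: the layer names, the policy table, and the wording of the remediation text are all hypothetical.

```python
import ast

# Hypothetical layer policy: maps each layer to the layers it may import from.
ALLOWED_IMPORTS = {
    "ui": {"ui", "services"},
    "services": {"services", "db"},
    "db": {"db"},
}

def lint_imports(source: str, module_layer: str) -> list[str]:
    """Return agent-readable violations, each with a remediation instruction."""
    violations = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # Collect the dotted module names this statement imports.
            if isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                names = [alias.name for alias in node.names]
            for name in names:
                layer = name.split(".")[0]
                if layer in ALLOWED_IMPORTS and layer not in ALLOWED_IMPORTS[module_layer]:
                    violations.append(
                        f"line {node.lineno}: layer '{module_layer}' may not import "
                        f"'{name}'. REMEDIATION: go through the 'services' layer "
                        f"instead of importing '{layer}' directly."
                    )
    return violations

# A 'ui' module importing the 'db' layer directly triggers the rule,
# and the message tells the agent exactly how to fix it.
errors = lint_imports("import db.users\n", module_layer="ui")
print(errors[0])
```

The point is the error string: it is written for the agent, so a known failure mode corrects itself without a human in the loop.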

    Action items

    • Create an AGENTS.md at the root of your primary repositories by end of this week — start with architectural constraints, common pitfalls, and testing conventions
    • Audit your linter rules and add agent-friendly remediation instructions to error messages within this sprint, prioritizing architectural boundary violations
    • Designate an 'agents captain' on each team by end of month — someone responsible for evaluating agent fit, maintaining the harness, and championing adoption
    • Inventory your internal tools and create a prioritized list for MCP server exposure this quarter, starting with most-used CLI tools, deployment scripts, and monitoring dashboards
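As a starting point for the first action item, here is a hypothetical AGENTS.md skeleton; the section names, layer names, file paths, and `make test` target are placeholders to adapt, not conventions taken from the sources.

```markdown
# AGENTS.md (hypothetical starter)

## Architecture
- Three layers: ui -> services -> db. Never import db from ui.
- See docs/architecture.md for the full dependency map.

## Known pitfalls (add one line per past agent failure)
- Do not mark a feature complete without an end-to-end test.
- Use JSON, not Markdown, for structured config the agent must not rewrite.

## Testing
- Run `make test` before opening a PR; CI mirrors this exactly.
```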

    Sources: The Emerging "Harness Engineering" Playbook · 🧠 Intelligence should be owned, not rented · Welcome to the free edition of The Pragmatic Engineer Newsletter

  02

    MCP Is the New Microservices Security Problem — and Attackers Are Already There

    <h3>The Protocol Layer Is Under Active Attack</h3><p>Here's the tension you need to internalize: the same MCP protocol that harness engineering depends on for tool exposure is <strong>already being actively probed by attackers</strong>. Cisco's SVP of AI Software and Platform confirms it plainly — agents are being hijacked, impersonated, and manipulated to exfiltrate data or execute unauthorized commands <strong>'at machine speed.'</strong></p><blockquote>MCP and agent-to-agent protocols scaled as connectivity standards, not security standards. They tell agents how to talk to tools and each other, but the identity, authorization, and behavioral monitoring layers are bolted on after the fact, if at all.</blockquote><h3>The Microservices Parallel Is Exact</h3><p>If you lived through the early microservices era, you've seen this movie. Everyone was excited about service decomposition, but the real production pain was <strong>service-to-service auth, mTLS, and observability</strong>. We're in that exact same phase with agentic AI. The prescription maps cleanly:</p><table><thead><tr><th>Microservices Pattern</th><th>Agent Equivalent</th></tr></thead><tbody><tr><td>Service identity (mTLS)</td><td>Zero-trust identity for agents as first-class IAM principals</td></tr><tr><td>API gateway / service mesh</td><td>Controlled tool registries — agents access only explicitly granted capabilities</td></tr><tr><td>Distributed tracing</td><td>Continuous behavioral monitoring — detecting action pattern deviation</td></tr><tr><td>Circuit breakers</td><td>Human-in-the-loop gates for privileged operations</td></tr></tbody></table><h3>Where to Draw the Human-in-the-Loop Line</h3><p>The design constraint is crisp: anything that affects <strong>trust, access, or control over critical systems</strong> — granting privileges, changing production environments, authorizing sensitive data access, initiating irreversible actions — should never run fully autonomous. 
The 80/20 heuristic for autonomous incident resolution gives you a planning model: design for full autonomy on the <strong>pattern-matching 80%</strong>, and design explicit escalation paths for the complex 20%.</p><p>This directly intersects with harness engineering. If you're exposing 400 tools via MCP like Stripe, each tool needs an explicit authorization model. <em>The speed at which you adopt agent tooling must not outpace the speed at which you secure it.</em></p>
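The table's tool-registry and human-gate rows can be sketched in a few lines of Python. This is an illustrative model, not Stripe's Toolshed or any real MCP implementation; the tool names, grants, and approval callback are invented.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    run: Callable[..., object]
    privileged: bool = False  # trust/access/control operations need a human gate

@dataclass
class ToolRegistry:
    tools: dict[str, Tool] = field(default_factory=dict)
    grants: dict[str, set[str]] = field(default_factory=dict)  # agent_id -> tools

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def grant(self, agent_id: str, tool_name: str) -> None:
        self.grants.setdefault(agent_id, set()).add(tool_name)

    def call(self, agent_id: str, tool_name: str,
             approve: Callable[[str, str], bool], **kwargs) -> object:
        # Fail closed: no grant, no call.
        if tool_name not in self.grants.get(agent_id, set()):
            raise PermissionError(f"{agent_id} has no grant for {tool_name}")
        tool = self.tools[tool_name]
        # Privileged tools require explicit human approval before executing.
        if tool.privileged and not approve(agent_id, tool_name):
            raise PermissionError(f"human approval denied for {tool_name}")
        return tool.run(**kwargs)

registry = ToolRegistry()
registry.register(Tool("read_logs", run=lambda service: f"logs for {service}"))
registry.register(Tool("deploy_prod", run=lambda service: f"deployed {service}",
                       privileged=True))
registry.grant("agent-1", "read_logs")  # note: deploy_prod is never granted

print(registry.call("agent-1", "read_logs", approve=lambda a, t: False,
                    service="billing"))
```

The two PermissionError paths are the point: capability is explicitly granted per agent, and privileged operations cannot run without the human callback returning true.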

    Action items

    • Audit all MCP and agent-to-agent protocol implementations for identity, authorization, and behavioral monitoring gaps by end of this sprint
    • Map your agent capabilities against a privileged-operations checklist (production deployments, PII access, privilege grants) and add human approval gates for each by end of month
    • Evaluate adding agent identity as first-class principals in your IAM system this quarter

    Sources: 🧠 Intelligence should be owned, not rented · The Emerging "Harness Engineering" Playbook

  03

    Your LLM Abstraction Layer Needs Model Routing — Here's the Evidence

    <h3>Multi-Model Is Now Default, Not Experimental</h3><p>Three independent sources this cycle confirm the same pattern: <strong>no single model wins across all tasks</strong>, and the teams getting the most value are routing tasks to the right model. Even GPT-5 power users still reach for <strong>Claude for coding</strong> and <strong>Gemini for long-context analysis</strong>. This isn't benchmark data — it's consistent practitioner behavior across different organizations and use cases.</p><h3>GPT-5's Platform Play Changes the Integration Calculus</h3><p>GPT-5 now ships with <strong>native platform connectors</strong> (e.g., HubSpot), signaling OpenAI is moving from model provider to integration platform. You now have three competing patterns for LLM-powered tool integration:</p><ol><li><strong>Custom function-calling</strong>: You define tools and handle API calls yourself. Maximum control, maximum maintenance.</li><li><strong>MCP or similar open protocols</strong>: Tool discovery and execution via an open standard. Portable but security-immature (see above).</li><li><strong>OpenAI native connectors</strong>: Fastest to ship, hardest to migrate away from. Vendor lock-in by design.</li></ol><p>If you're hardcoding <code>model='gpt-5'</code> throughout your codebase, you're accumulating tech debt. A routing layer that selects models based on task classification pays for itself quickly — and becomes essential as the model landscape continues to shift quarterly.</p><h3>The Two-Pass Pattern for User-Facing Features</h3><p>The 'meta prompting' technique — asking an LLM to rewrite a prompt before executing it — is the prompt-engineering equivalent of <strong>query rewriting in search/RAG systems</strong>. The interesting UX detail: generating multiple-choice clarification options instead of free-text follow-ups. This structured disambiguation reduces cognitive load on users while constraining the input space, making the refinement step more reliable. 
If you're building user-facing AI features where prompt quality varies wildly, this two-pass pattern with structured output on the first pass is worth implementing.</p>
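A routing layer of the kind described can start very small. The sketch below is illustrative Python; the model identifiers and the keyword classifier are placeholders, not real API values or a production-grade task classifier.

```python
# Hypothetical routing table: the model is a swappable dependency selected by
# task class, never a hardcoded constant scattered through the codebase.
ROUTES = {
    "coding": "claude-x",        # placeholder ids, not real model names
    "long_context": "gemini-x",
    "general": "gpt-x",
}

def classify(task: str) -> str:
    """Toy classifier; a real one might use heuristics or a cheap LLM call."""
    text = task.lower()
    if any(k in text for k in ("refactor", "bug", "implement", "code")):
        return "coding"
    if len(task) > 50_000:  # very long inputs go to the long-context model
        return "long_context"
    return "general"

def route(task: str) -> str:
    return ROUTES[classify(task)]

print(route("implement a retry decorator"))  # routes to the coding model
```

Because callers depend only on `route`, swapping a model when the landscape shifts next quarter is a one-line change to the table.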

    Action items

    • Add model routing to your LLM abstraction layer this quarter if you don't already have it — treat the model as a swappable dependency, not a hardcoded constant
    • Evaluate GPT-5's native connector architecture within the next two sprints if you're building SaaS integrations through LLMs
    • Prototype a two-pass prompt refinement pattern for one user-facing AI feature this quarter
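The two-pass prototype can start as a few lines. In this Python sketch, `call_llm` is a stub standing in for a real provider client, and the JSON shape (rewritten prompt plus multiple-choice clarifications) is an assumed schema, not a documented API.

```python
import json

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call your model provider here.
    return json.dumps({
        "rewritten_prompt": "Summarize the Q3 incident report in 5 bullets.",
        "clarifications": [
            {"question": "Audience?",
             "options": ["executives", "on-call engineers"]},
        ],
    })

def refine(raw_prompt: str) -> dict:
    """Pass 1: structured refinement. Requesting JSON constrains the model
    to a fixed shape instead of freely rewritten prose."""
    meta = (
        "Rewrite the user's prompt to be specific and unambiguous. Return JSON "
        'with keys "rewritten_prompt" and "clarifications" (multiple-choice). '
        f"User prompt: {raw_prompt}"
    )
    return json.loads(call_llm(meta))

result = refine("summarize the incident thing")
print(result["rewritten_prompt"])
# Pass 2 would execute result["rewritten_prompt"] after the user picks
# from the multiple-choice clarifications.
```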

    Sources: 🧠 Intelligence should be owned, not rented · 🤖 Meta Prompting: The Secret to Better AI Results · Welcome Email 2/3: Our Most Popular Issue

  04

    Developer Productivity Measurement: LinkedIn's DPH Framework and the LLM Usage Gap

    <h3>LinkedIn Open-Sources Their Productivity Framework</h3><p>LinkedIn has released their internal <strong>Developer Productivity & Happiness (DPH) Framework</strong> — a structured system of metrics, processes, and feedback loops for understanding developer needs. This isn't just a dashboard template; it's an operational playbook covering systems, processes, metrics, and feedback systems. The interesting question is how their metric choices compare to the <strong>DORA/SPACE frameworks</strong> that have dominated the DevEx conversation.</p><p><em>The cargo-culting risk is real</em>: LinkedIn has thousands of engineers, and their measurement overhead makes sense at that scale. For a 20-person team, you need to ruthlessly simplify. But as a baseline taxonomy for what to measure and why, it's the most concrete reference implementation available.</p><h3>The LLM Usage Gap Is Widening</h3><p>A staff engineer's observation resonates with the harness engineering data: engineers getting the most value from LLMs aren't using them for the obvious use case (generate this function). They're using them for <strong>codebase exploration</strong> ('explain this legacy module's error handling patterns'), <strong>RFC drafting</strong> ('here's my architecture decision, argue against it'), and <strong>test generation</strong>. 
At the staff level, work is less about writing code and more about understanding systems, communicating decisions, and reviewing others' work — all areas where LLMs are surprisingly effective.</p><blockquote>If your team hasn't had a structured conversation about LLM workflows, you're leaving productivity on the table — and the gap between effective and ineffective users is widening every month.</blockquote><p>This connects directly to harness engineering: the teams that formalize how agents and LLMs fit into their workflows — with AGENTS.md, designated champions, and shared patterns — will compound their advantage over teams where adoption is ad hoc and individual.</p>

    Action items

    • Pull LinkedIn's DPH Framework repo and evaluate their metric taxonomy against your current developer productivity measurements within the next month
    • Run a team retrospective on LLM tool usage patterns within the next two sprints — identify where AI tools are working, where they're not, and share effective workflows

    Sources: Welcome Email 2/3: Our Most Popular Issue · The Emerging "Harness Engineering" Playbook

◆ QUICK HITS

  • Anthropic found JSON outperforms Markdown for agent-facing structured data — agents treat Markdown as rewritable prose but respect JSON's structure

    The Emerging "Harness Engineering" Playbook

  • 72% of enterprises report being blocked by infrastructure debt (legacy networks, fragmented data, siloed tooling) when scaling AI beyond pilots

    🧠 Intelligence should be owned, not rented

  • The SQL-as-API pattern is being re-evaluated — with Postgres row-level security and parameterized query APIs, the traditional objections are weaker than in 2005, especially for internal service-to-service communication

    Welcome Email 2/3: Our Most Popular Issue

  • PostHog published their fully-async operations playbook across 11 countries — worth reading if you're running distributed engineering teams

    Welcome Email 2/3: Our Most Popular Issue

  • Voice AI agents in production suffer from hallucination, drift, and prompt breakage at scale — ElevenLabs investing in enterprise go-to-market signals the market is moving from pilot to production phase

    📬 Write prompts that power better AI conversations

BOTTOM LINE

Harness engineering — AGENTS.md files, custom linters with remediation instructions, MCP-exposed tooling, and plan-then-execute workflows — is the discipline that separates teams shipping 1,000 agent-generated PRs per week from teams still debating whether AI coding tools are useful. But the same MCP protocol enabling this is already under active attack, so you must harden agent infrastructure in lockstep with adoption. Create your AGENTS.md this week, audit your MCP security this sprint, and add model routing to your LLM layer this quarter.

Frequently asked

What is harness engineering and why does it matter now?
Harness engineering is the practice of building constraints, linters, documentation, and sandboxed environments that keep coding agents productive. It matters now because the bottleneck to AI-assisted development was never model capability — it was the environment. Teams that invest in harnesses compound their advantage: every new rule, doc line, or exposed tool accelerates all future agent work.
How do I start building an AGENTS.md file effectively?
Place an AGENTS.md at the root of your repo and seed it with architectural constraints, common pitfalls, and testing conventions. Treat it as a living feedback loop: every time an agent makes a mistake, add a line that prevents that class of error permanently. For larger codebases, use a hierarchical approach — a short root file pointing to deeper design docs, architecture maps, and quality grades.
What are the security risks of exposing internal tools via MCP?
MCP was designed as a connectivity standard, not a security standard, and attackers are already hijacking and impersonating agents to exfiltrate data at machine speed. Each exposed tool is an unaudited attack surface unless you add agent identity as a first-class IAM principal, explicit per-tool authorization, and behavioral monitoring. Treat it like the early microservices era: mTLS, gateways, tracing, and circuit breakers all have agent equivalents you need to build.
Where should human-in-the-loop gates sit in an agent workflow?
Require human approval for anything affecting trust, access, or control over critical systems — privilege grants, production changes, sensitive data access, and irreversible actions. A useful planning heuristic is the 80/20 split: design for full autonomy on the pattern-matching 80% of tasks, and build explicit escalation paths for the complex 20%. Plan-then-execute, where a human approves the written plan before code is written, is also emerging as a mandatory default.
Should I hardcode a single LLM, or build a routing layer?
Build a routing layer. Practitioners consistently reach for different models for different tasks — Claude for coding, Gemini for long-context analysis, GPT-5 for general work — and the model landscape shifts quarterly. Hardcoding model='gpt-5' throughout your codebase accumulates tech debt fast; treating the model as a swappable dependency keyed on task classification is now standard practice, not premature optimization.
