PROMPT NOW · DATA SCIENCE DAILY · 2026-02-23

Harness Engineering Emerges as Agent Reliability Playbook

· Data Science · 27 sources · 1,417 words · 7 min

Topics: Agentic AI · AI Capital · LLM Inference

Agent reliability decays steeply past 1 hour of autonomous operation, reaching a coin flip by 14.5 hours (Opus 4.6: 80% at 1 hr, 50% at 14.5 hrs), and the emerging discipline to fix this — 'harness engineering' — is converging across OpenAI, Stripe, and Anthropic on identical patterns: AGENTS.md files, remediation linters, JSON-over-Markdown state, and sandboxed execution. If you're deploying agents against your ML codebase, the playbook is crystallizing now, and the teams that invest in constraints today will compound a productivity gap that late adopters can't close.

◆ INTELLIGENCE MAP

  01 · Agent Reliability & Harness Engineering

    act now

    Opus 4.6's steep reliability decay curve (80% → 50% over 1 → 14.5 hrs) validates the harness engineering thesis: the bottleneck is never the agent's code-generation ability but the lack of constraints, verification gates, and structured documentation surrounding it.

    2 sources

  02 · MCP Security & Agentic Infrastructure Risk

    act now

    MCP is becoming the de facto agent-to-tool integration layer, but Cisco confirms attackers are already probing agent hijacking vectors, a Cambridge study finds 87% of top agents lack formal safety evals, and AI-augmented cyberattacks breached 600+ Fortinet firewalls across 55 countries — your agent infrastructure is simultaneously standardizing and becoming a target.

    3 sources

  03 · LLM Inference Economics Under Pressure

    monitor

    OpenAI's inference costs 4x'd in 2025 (gross margin: 33% vs. 46% projected), the company forecasts $111B cash burn through 2030, and the physical infrastructure bottleneck — 1 GW data centers built by executives commanding $10M+ packages — means the cost curve is not bending anytime soon.

    3 sources

  04 · Enterprise AI Adoption Reality Check

    monitor

    Enterprise AI monetization is real but tiny — Salesforce Agentforce at $500M+ ARR (~1% of revenue), Snowflake AI at $100M (~2%) — while 72% of enterprises remain blocked by infrastructure debt, not model capability.

    2 sources

  05 · AI Compute & Infrastructure Supply Constraints

    background

    The AI bottleneck is shifting from model talent to physical infrastructure — 1 GW data centers, $10M+ executive packages with key-man clauses on project financing — while Nvidia is expected to report 67% YoY revenue growth to $65.7B, signaling that GPU demand still outstrips supply.

    2 sources

◆ DEEP DIVES

  01 · Harness Engineering Meets Agent Reliability: The Playbook Your ML Team Needs Now

    The Convergence

    Two independent intelligence streams this week point to the same conclusion: coding agents are production-ready in capability but production-dangerous without scaffolding. Anthropic's Opus 4.6 benchmarks reveal a steep reliability decay — 80% accuracy at 1-hour tasks, dropping to 50% at 14.5 hours — while OpenAI, Stripe, and Anthropic are independently converging on a discipline called harness engineering to solve exactly this problem.

    The throughput numbers are seductive: 3 engineers built a million-line product in 5 months at OpenAI with zero hand-written code. Stripe's internal agents produce 1,000+ merged PRs per week. A solo developer made 6,600+ commits in a single month running 5-10 agents simultaneously. But the compound error math is brutal — chain 5 sub-tasks at 80% reliability each, and your end-to-end success rate drops to roughly 33% (0.8^5 ≈ 0.33).

    "The bottleneck for coding agents is never the agent's ability to write code — it's the lack of structure, tools, and feedback mechanisms surrounding it. This is the MLOps lesson all over again: the model isn't the bottleneck, the system around it is."

    The Emerging Playbook

    Here's what the convergent practices look like, translated for ML teams:

    Practice | What It Is | Who's Doing It | Your ML Application
    AGENTS.md | Living doc at repo root; updated on every agent failure | OpenAI, Hashimoto (Ghostty) | Encode feature naming, train/test split rules, forbidden patterns (leakage), logging requirements
    Remediation linters | Custom linters whose error messages tell the agent how to fix violations | OpenAI (Codex-generated) | Catch data leakage, unlogged experiments, missing null handling
    JSON over Markdown | Structured formats for agent state — agents corrupt freeform text more readily | Anthropic | Experiment configs, feature lists, pipeline DAG definitions
    Planning-execution separation | Agent writes plan, human approves before code generation | Cloudflare, Anthropic | Agent proposes experiment design for human review before implementation
    Sandboxed devboxes | Isolated environments identical to dev but cut off from production | Stripe (Minions) | Prevent agent-generated training jobs from accessing production data
    MCP tool exposure | Internal tools accessible via Model Context Protocol | Stripe (400+ tools via Toolshed) | Feature store, experiment tracker, model registry as MCP endpoints

    The Verification Gap — Worse for ML

    Anthropic discovered that agents mark features as complete without proper end-to-end testing. For application code, this produces visible bugs. For data science code, the consequences are far more insidious: a subtle feature transformation error that passes unit tests but introduces distribution shift is invisible until it degrades production predictions. No harness pattern yet addresses statistical correctness verification — this is the open problem your team should be thinking about.

    Where This Doesn't Work Yet

    • Legacy ML codebases — your existing Airflow DAGs, undocumented Spark jobs, and notebook-to-production scripts are the brownfield problem that remains unsolved
    • Review bandwidth — human reviewers become the bottleneck at 3-4 parallel agent sessions, and reviewing agent-generated statistical code requires more domain expertise per line than CRUD endpoints
    • Code entropy — agent-generated code accumulates cruft differently than human-written code, manifesting as redundant preprocessing steps and unnecessarily complex transformations
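
    To make the remediation-linter row concrete, here is a minimal sketch of the pattern: a linter whose error message tells the agent how to fix the violation. The single leakage rule, the message wording, and the AGENTS.md anchor are illustrative assumptions, not any named team's actual tooling.

    ```python
    import ast
    import sys

    REMEDIATION = (
        "ERROR {path}:{line}: fit_transform called before train_test_split. "
        "Fit transformers on training folds only, inside the CV loop. "
        "See AGENTS.md#feature-engineering."
    )

    def _call_name(call: ast.Call) -> str:
        """Return the bare name for both f(...) and obj.f(...) call forms."""
        func = call.func
        if isinstance(func, ast.Attribute):
            return func.attr
        if isinstance(func, ast.Name):
            return func.id
        return ""

    def lint_leakage(path: str) -> list[str]:
        """Flag fit_transform calls occurring before any train/test split."""
        with open(path) as f:
            tree = ast.parse(f.read(), filename=path)
        calls = [n for n in ast.walk(tree) if isinstance(n, ast.Call)]
        splits = [c.lineno for c in calls if _call_name(c) == "train_test_split"]
        first_split = min(splits, default=float("inf"))
        return [
            REMEDIATION.format(path=path, line=c.lineno)
            for c in calls
            if _call_name(c) == "fit_transform" and c.lineno < first_split
        ]

    if __name__ == "__main__":
        errors = [e for p in sys.argv[1:] for e in lint_leakage(p)]
        print("\n".join(errors))
        sys.exit(1 if errors else 0)
    ```

    The design choice that matters is the message: it names the violation, the fix, and where the convention lives, so the agent can self-correct without a human in the loop.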

    Action items

    • Create an AGENTS.md for your primary ML repository encoding team conventions: feature naming standards, train/test split protocols, forbidden patterns (e.g., no data leakage across folds), and pointers to feature store schema docs
    • Build custom linters for your ML codebase that catch data science anti-patterns AND include remediation instructions in error messages (e.g., 'ERROR: Feature computed before train/test split. Move inside cross-validation loop. See AGENTS.md#feature-engineering')
    • Pilot agent-assisted development on one greenfield ML component (new feature store module, experiment tracking integration, or model serving endpoint) — not on legacy pipelines
    • Decompose any agent workflow exceeding 1 hour into checkpointed sub-tasks with verification gates between each step (a minimal sketch follows this list)
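
    A minimal sketch of that decomposition, assuming hypothetical run and verify callables rather than any vendor's agent API:

    ```python
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class SubTask:
        name: str
        run: Callable[[dict], dict]     # agent step: state in, new state out
        verify: Callable[[dict], bool]  # deterministic gate, not model self-report

    def run_checkpointed(tasks: list[SubTask], state: dict, max_retries: int = 2) -> dict:
        """Run sub-tasks in sequence, re-running any step whose gate fails.

        Gates between steps are what stop errors from compounding: five
        chained 80%-reliable steps succeed end-to-end only ~33% of the
        time (0.8**5) when run unverified.
        """
        for task in tasks:
            for _ in range(max_retries + 1):
                candidate = task.run(state)
                if task.verify(candidate):
                    state = candidate  # checkpoint: commit only verified state
                    break
            else:
                raise RuntimeError(f"Sub-task {task.name!r} failed verification")
        return state
    ```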

    Sources: The Emerging "Harness Engineering" Playbook · 🐱 She forgot 3 emails. Then built this.

  02 · MCP Is the New Attack Surface — And 87% of Agents Have Never Been Safety-Tested

    Three Signals, One Threat Model

    Three independent sources this week converge on a single uncomfortable truth: the agentic AI stack is scaling faster than its security. Cisco's SVP of AI confirms attackers are "already probing gaps" in MCP and agent-to-agent protocols. A Cambridge-led study found only 4 of 30 top AI agents (13%) have published formal safety evaluations — with browser agents missing 64% of safety disclosures despite being the most autonomous category. And Amazon reports that 600+ Fortinet firewalls were breached across 55 countries in weeks by a small group using commercial AI tools — the attack vectors were mundane (weak passwords, exposed ports), but AI enabled scanning and exploitation at unprecedented scale.

    "If you're deploying agents without checkpoints and your own evaluation harness, you're the one running the experiment — and attackers are already probing the same infrastructure your agents connect to."

    The Dual Threat

    Cisco's framework identifies two distinct threat vectors that require different mitigations:

    1. Protecting the enterprise from agents — rogue or compromised agents acting against organizational interests (data exfiltration, unauthorized privilege escalation, production environment modification)
    2. Protecting agents from the world — adversarial inputs, prompt injection, and data poisoning targeting agent decision-making

    The proposed security layers are sound in principle:

    • Zero-trust identity — every agent treated as an untrusted actor requiring continuous verification
    • Control over agent protocols and tool registries — governing what agents can access and execute
    • Continuous behavioral monitoring — detecting anomalous agent actions in real time

    No implementation specifics, detection rates, or false-positive management data were disclosed by any source.

    MCP: Standard and Target Simultaneously

    MCP is coalescing as the de facto agent-to-tool integration layer — Stripe exposes 400+ internal tools via their Toolshed MCP implementation, Claude Code uses it as the reference integration point for Gmail/Slack/Notion/Calendar, and third-party tools like runCLAUDErun are building on it. But this standardization creates a monoculture risk: a vulnerability in MCP's protocol layer would affect every agent ecosystem built on it.

    For ML teams specifically, MCP is how you'd expose your feature store CLI, experiment tracker, model registry, and data catalog to agents. Every one of those tools contains sensitive data about your models, training data, and production infrastructure. The question isn't whether to adopt MCP — it's whether your security posture is ready for it.

    Your Specific Threat Surface

    • Can an attacker impersonate one of your agents to another via MCP?
    • Can a compromised agent escalate its own privileges through tool-call chains?
    • Are you logging all agent actions with enough granularity to detect behavioral anomalies?
    • Do you have hard human-in-the-loop gates before any agent can modify production environments or trigger irreversible operations? (A minimal sketch of such a gate follows below.)
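
    As one way to approach the last two checklist items, here is a minimal sketch combining granular action logging with a hard human-in-the-loop gate on irreversible operations. The tool names and approval mechanism are hypothetical assumptions, not the MCP SDK's actual interface.

    ```python
    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("agent.audit")

    # Hypothetical tool names for operations this team treats as irreversible.
    IRREVERSIBLE = {"model_registry.promote", "feature_store.delete", "prod.deploy"}

    def gated_tool_call(agent_id: str, tool: str, args: dict, execute, approve):
        """Log every agent tool call; block irreversible ones behind human approval."""
        record = {"ts": time.time(), "agent": agent_id, "tool": tool, "args": args}
        logger.info(json.dumps(record))  # granular enough to replay and diff later
        if tool in IRREVERSIBLE and not approve(record):
            logger.warning(json.dumps({**record, "status": "denied"}))
            raise PermissionError(f"{tool} requires human approval")
        result = execute(tool, args)
        logger.info(json.dumps({**record, "status": "ok"}))
        return result
    ```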

    Action items

    • Audit your agentic pipeline's MCP and tool-access protocols for identity spoofing, privilege escalation, and unauthorized data exfiltration vectors by end of next sprint
    • Build an internal safety evaluation harness for any third-party AI agents you deploy, covering adversarial inputs, boundary conditions, and failure mode cataloging — complete within this quarter
    • Enforce mandatory human-in-the-loop gates for any agent action involving privilege changes, production environment modifications, or irreversible operations — implement immediately
    • Audit ML infrastructure credentials and firewall configurations — specifically Fortinet devices — and rotate any weak or default passwords this week

    Sources: 🧠 Intelligence should be owned, not rented · 🐱 She forgot 3 emails. Then built this. · The Emerging "Harness Engineering" Playbook

  03 · Inference Economics Are Broken — Build for Cost Uncertainty, Not Cost Decline

    OpenAI Can't Forecast Its Own Costs

    OpenAI's leaked financials reveal a cost structure that should recalibrate every ML team's inference budget planning. Model serving costs quadrupled in 2025, compressing gross margins to 33% — a 13-percentage-point miss against the company's own internal forecast of 46%. Revenue hit $13.1B (slightly above forecast), but the cost side tells the real story: the company now projects $111 billion in cumulative cash burn through 2030, more than double its previous estimate.

    Metric | 2025 Actual | 2025 Expected | 2030 Projected
    Revenue | $13.1B | ~$13B | ~$285B implied
    Gross Margin | 33% | 46% | Not disclosed
    Cumulative Cash Burn | — | ~$55B (prior) | $111B
    Training Cost Change (2030) | — | — | -$28B (assumed)

    The path to profitability rests on a single assumption: training costs drop ~$28B in 2030, fully offsetting inference growth. This requires either post-Blackwell hardware delivering dramatic FLOPS-per-dollar improvements or algorithmic breakthroughs reducing compute by an order of magnitude. Neither is guaranteed, and both are outside OpenAI's direct control.

    "OpenAI can't forecast its own inference costs one year out (33% vs. 46% gross margin), so treat anyone's 5-year AI cost projections as fiction and build your pipelines for cost uncertainty, not cost decline."

    The Physical Bottleneck

    The cost pressure has a physical dimension. New AI campuses are designed for 1 gigawatt of power or more — over 10x traditional data centers. The executives who can build at this scale are commanding $10M+ compensation packages, and lenders are inserting key-man clauses that can pull project financing if a specific executive leaves mid-build. Nvidia's expected 67% YoY revenue growth to $65.7B signals that GPU demand still outstrips supply.

    For your planning: if the binding constraint on AI scaling is now physical infrastructure (power, cooling, construction) rather than chip fabrication, GPU availability growth may plateau or become lumpy. Wednesday's Nvidia earnings call guidance matters more than backward-looking numbers — listen for Blackwell ramp timeline and supply-constraint commentary.

    What This Means for Your Stack

    If you're consuming OpenAI APIs or any hosted LLM, a 33% gross margin is not sustainable for a company burning $111B. Either prices go up, rate limits tighten, or cheaper models get pushed more aggressively. The engineering response:

    • Build model-routing logic that dynamically shifts between providers and model sizes based on cost/quality tradeoffs (a minimal sketch follows this list)
    • Benchmark smaller open-source models (Llama 3, Mistral) against your specific use cases — the quality gap may be smaller than you think for many production tasks
    • Invest in efficiency techniques (distillation, quantization, MoE) as a hedge against both cost increases and capacity constraints
    • Stress-test your 2026-2027 inference budget against scenarios of 2x-3x price changes in either direction
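
    A minimal sketch of the routing logic from the first bullet above. Model names, prices, and quality scores are placeholder assumptions; calibrate against your own eval set and current price sheets.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Model:
        name: str
        usd_per_1k_tokens: float  # blended input/output price (placeholder)
        quality: float            # score on YOUR eval set, 0-1 (placeholder)

    CANDIDATES = [
        Model("small-open-weights", 0.0002, 0.78),
        Model("mid-hosted", 0.002, 0.86),
        Model("frontier-hosted", 0.015, 0.94),
    ]

    def route(min_quality: float, est_tokens: int) -> Model:
        """Pick the cheapest candidate that clears the task's quality bar."""
        viable = [m for m in CANDIDATES if m.quality >= min_quality]
        if not viable:
            raise ValueError(f"no model meets quality bar {min_quality}")
        return min(viable, key=lambda m: m.usd_per_1k_tokens * est_tokens / 1000)

    # route(0.85, est_tokens=4000) picks mid-hosted; the frontier model is used
    # only when the quality bar demands it. Re-run selection when prices move.
    ```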

    Action items

    • Stress-test inference cost projections against OpenAI's revealed 4x cost increase — if your 2026-2027 model serving budget assumes declining per-token costs, validate with actual billing data trends this sprint
    • Benchmark top 3 open-source models (Llama 3, Mistral, Qwen) against your production use cases within this quarter to establish fallback options
    • Monitor Wednesday's Nvidia earnings call for Blackwell ramp timeline and data center revenue mix — flag any supply constraint commentary for your compute procurement team

    Sources: The Briefing: Nvidia, Salesforce on Deck · Still interested in The Information? Save 25% today · Editor's Pick: The $10 Million Power Players of the AI Buildout

◆ QUICK HITS

  • NVIDIA open-sourced DreamDojo — a robotics world model trained on 44K hours of human video that predicts physical interactions without a physics engine; worth evaluating if you work in embodied AI or sim-to-real transfer

    🐱 She forgot 3 emails. Then built this.

  • Stanford's Harper Carroll recommends placing the task/question BEFORE context for non-reasoning models but context-first for reasoning models — a zero-cost A/B test for RAG pipelines with mixed model types (see the sketch below)

    🐱 She forgot 3 emails. Then built this.
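
    A minimal sketch of that A/B test, with the ordering heuristic as a single branch; the flag and prompt template are illustrative, not Carroll's published recipe:

    ```python
    def build_prompt(question: str, context: str, reasoning_model: bool) -> str:
        """Task-first for non-reasoning models; context-first for reasoning models."""
        if reasoning_model:
            return f"Context:\n{context}\n\nQuestion: {question}"
        return f"Question: {question}\n\nContext:\n{context}"

    # A/B it: run both orderings over the same eval set per model and compare.
    ```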

  • Salesforce Agentforce hit $500M+ ARR and Snowflake AI revenue run rate reached $100M — enterprise AI is monetizing but still represents ~1-2% of core business revenue

    The Briefing: Nvidia, Salesforce on Deck

  • Cisco reports only 28% of organizations believe they're ready for AI workloads — the blocker is infrastructure debt (legacy networks, fragmented data, siloed tooling), not model capability

    🧠 Intelligence should be owned, not rented

  • Multi-model review workflows (draft in model A, critique in model B) are becoming standard practice among AI-forward teams — a hallucination check that costs almost nothing to trial

    🧠 Intelligence should be owned, not rented

BOTTOM LINE

Coding agents hit 80% reliability at 1-hour tasks but degrade to a coin flip at 14.5 hours, the MCP protocol connecting them to your tools is already being probed by attackers, and OpenAI's own inference costs quadrupled while they missed their gross margin forecast by 13 points — the teams that win from here are the ones building constraints around agents (AGENTS.md, remediation linters, verification gates), hardening their agent security posture now, and planning for cost uncertainty rather than cost decline.

Frequently asked

What's the practical implication of Opus 4.6 dropping from 80% to 50% reliability between 1 and 14.5 hour tasks?
Chain five sub-tasks at 80% reliability each and your end-to-end success rate collapses to roughly 33% (0.8^5 ≈ 0.33), so any agent workflow exceeding one hour needs checkpointed sub-tasks with verification gates between steps. Unmonitored multi-hour autonomous runs are near-certain failures without this decomposition.
Why is agent verification especially dangerous for ML code versus application code?
Agents mark features complete without proper end-to-end testing, and for ML code the failure mode is insidious: a subtle feature transformation error can pass unit tests while introducing distribution shift that only degrades production predictions later. No current harness pattern addresses statistical correctness verification, making it an open problem teams must solve in-house.
What should be in an AGENTS.md for an ML repository?
Encode team conventions that trace to past failures: feature naming standards, train/test split protocols, forbidden patterns like computing features before splits or leaking across CV folds, logging and experiment-tracking requirements, and pointers to feature store schema docs. Update it on every agent failure so rules compound across future runs.
How should ML teams plan inference budgets given OpenAI's cost trajectory?
Plan for cost uncertainty rather than cost decline. OpenAI's serving costs quadrupled in 2025 and gross margin came in at 33% versus a 46% forecast, so stress-test 2026-2027 budgets against 2x-3x price movement in either direction, benchmark open-source alternatives like Llama 3 and Mistral for fallback routing, and invest in distillation and quantization as hedges.
What's the concrete MCP security audit checklist for an ML pipeline?
Check whether attackers can impersonate one agent to another via MCP, whether a compromised agent can escalate privileges through tool-call chains, whether agent actions are logged with granularity sufficient for behavioral anomaly detection, and whether hard human-in-the-loop gates exist before any agent can modify production or trigger irreversible operations. Feature stores, experiment trackers, and model registries exposed via MCP all fall inside this surface.
