Wednesday, May 13, 2026

The harness, not the model, is where your inference bill lives

A 30x cost spread across coding-agent harnesses just got benchmarked, Anthropic bought the SDK layer underneath its competitors, and two npm campaigns torched the trust model that was supposed to prevent exactly this. Three things to do before Friday.

The Artificial Analysis Coding Agent Index landed this week with a number that should reorganize most inference budgets: more than 30x cost-per-task variance across model-and-harness pairs at comparable quality, with greater than 7x wall-time spread and cache hit rates ranging from 80% to 96% on the same workload. The leaderboard everyone is going to screenshot is the wrong artifact. The Pareto frontier is the artifact. A smaller model in a well-built harness beats a frontier model in a naive one, consistently, and the gap between best and worst harness on identical tasks is roughly an order of magnitude larger than the gap between Opus 4.7 and GPT-5.5.
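
One way to read a benchmark like this as a frontier rather than a ranking is to keep only the configurations that nothing else beats on both cost and quality. A minimal sketch follows; the configuration names and numbers are invented for illustration and are not taken from the index.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str             # hypothetical model+harness label
    cost_per_task: float  # USD, lower is better
    quality: float        # task success rate, higher is better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs not dominated on (cost, quality), cheapest first."""
    frontier: list[Run] = []
    best_quality = float("-inf")
    # Sort by cost ascending; on equal cost, consider the higher-quality run first.
    for run in sorted(runs, key=lambda r: (r.cost_per_task, -r.quality)):
        # A run survives only if it beats every cheaper run on quality.
        if run.quality > best_quality:
            frontier.append(run)
            best_quality = run.quality
    return frontier


# Illustrative numbers only.
runs = [
    Run("small-model/tuned-harness", 0.12, 0.71),
    Run("frontier-model/naive-harness", 3.60, 0.70),
    Run("frontier-model/tuned-harness", 0.90, 0.78),
    Run("small-model/naive-harness", 0.40, 0.55),
]

for run in pareto_frontier(runs):
    print(f"{run.name}: ${run.cost_per_task:.2f}/task at {run.quality:.0%}")
```

Anything that drops off the frontier is a configuration you are overpaying for, whatever its rank on a single-axis leaderboard.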

Most teams have spent the last year optimizing the wrong layer.

The second number worth your sprint: Llama 3.2 1B as a speculative drafter delivers 2.31x throughput on a 70B target, with outputs drawn from the same distribution as the target alone (token-for-token identical under greedy decoding). The 8B drafter gets 2.08x. That is worse despite a higher acceptance rate, because drafting latency compounds at every step and eats the verification gain. Google runs speculative decoding for AI Overviews at over a billion users. Anthropic and Meta ship it by default. If your vLLM deployment isn't running it, then at those speedups roughly half of your decode GPU time (52-57%) is buying nothing. The integration is a flag and a measurement.
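
The drafter trade-off can be put in numbers with the standard back-of-envelope estimate for speculative decoding. The sketch below assumes each drafted token is accepted independently with a fixed probability, which real traffic only approximates. The acceptance rates and cost ratios in the example are invented to show the shape of the effect, not measurements behind the 2.31x and 2.08x figures.

```python
def expected_speedup(acceptance: float, draft_len: int, cost_ratio: float) -> float:
    """Estimated wall-time speedup over plain autoregressive decoding.

    acceptance: probability a drafted token is accepted (assumed independent per token)
    draft_len:  tokens drafted before each target verification pass
    cost_ratio: drafter step time divided by target step time
    """
    if not 0.0 <= acceptance < 1.0:
        raise ValueError("acceptance must be in [0, 1)")
    # Expected tokens emitted per verification pass:
    # the accepted drafts plus one token supplied by the target itself.
    tokens_per_pass = (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)
    # Time per pass, in units of one target step:
    # draft_len drafter steps plus one target verification step.
    time_per_pass = draft_len * cost_ratio + 1
    return tokens_per_pass / time_per_pass


def idle_fraction(speedup: float) -> float:
    """Share of decode time that a given speedup would remove."""
    return 1 - 1 / speedup


# Hypothetical drafters: the larger one is accepted more often but costs more per step.
small = expected_speedup(acceptance=0.70, draft_len=4, cost_ratio=0.03)
large = expected_speedup(acceptance=0.80, draft_len=4, cost_ratio=0.15)
print(f"small drafter: {small:.2f}x, large drafter: {large:.2f}x")
```

The denominator is where the larger drafter loses: every extra point of acceptance is bought with drafter time on every single step.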

These two facts stack multiplicatively. Audit the harness, enable the drafter, and the floor on inference cost moves down 5-10x at equal quality. That's the week's actual story at the engineering layer, and it's the one most teams will skip past on the way to the louder headlines.

The SDK layer stopped being neutral

Anthropic is acquiring Stainless for $300M+. Stainless is not a model company. It is the OpenAPI-to-SDK shop that generates the official Python, TypeScript, Go, Java, and Ruby clients for Anthropic, OpenAI, and Google. Every team running pip install openai or pip install google-genai is shipping Stainless-generated code into production. As of this week, one frontier lab owns the developer experience layer of its two largest competitors.

The immediate failure mode isn't a breach. It's drift. A deprecation here, a lag in supporting a new tool-call schema there, retry semantics that shift in a minor version. Drift is harder to detect than an outage and harder to attribute than a model regression: your eval will move and you'll blame the wrong layer. OpenAI and Google will almost certainly bring SDK generation in-house within 6-12 months, and the divergence period between now and then is when subtle behavior changes will land in production agents and look like model-quality issues.

The defensive posture is cheap if done now. Most teams use a narrow slice of the SDK surface: auth, typed requests, retries, streaming. A motivated engineer replaces the hot path with about 200 lines of httpx in an afternoon. The point isn't to actually rip out the SDK. The point is to know exactly what it would cost, pin the version, and put automated diff-monitoring on the upstream spec so behavior changes don't land silently.
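
Diff-monitoring a spec does not need much machinery. The sketch below assumes you already fetch and store snapshots of the upstream spec as JSON; the field names in the example are hypothetical and stand in for whatever the real spec contains.

```python
import hashlib
import json


def flatten(node, prefix=""):
    """Yield (path, leaf_value) pairs for a nested JSON-like structure.

    Keys containing "/" are not escaped, and empty containers yield nothing;
    the fingerprint below still changes in those cases.
    """
    if isinstance(node, dict):
        for key in sorted(node):
            yield from flatten(node[key], f"{prefix}/{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from flatten(item, f"{prefix}/{index}")
    else:
        yield prefix, node


def diff_specs(old: dict, new: dict) -> dict[str, list[str]]:
    """Report which leaf paths were added, removed, or changed between snapshots."""
    old_leaves = dict(flatten(old))
    new_leaves = dict(flatten(new))
    shared = old_leaves.keys() & new_leaves.keys()
    return {
        "added": sorted(new_leaves.keys() - old_leaves.keys()),
        "removed": sorted(old_leaves.keys() - new_leaves.keys()),
        "changed": sorted(p for p in shared if old_leaves[p] != new_leaves[p]),
    }


def fingerprint(spec: dict) -> str:
    """Stable hash of a spec, for a cheap 'did anything change' check."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical spec fragments.
last_week = {"retry": {"max_retries": 2, "backoff": "exponential"},
             "tool_call": {"schema_version": "1"}}
this_week = {"retry": {"max_retries": 3, "backoff": "exponential"},
             "tool_call": {"schema_version": "1", "strict": True}}

if fingerprint(last_week) != fingerprint(this_week):
    print(diff_specs(last_week, this_week))
```

Run it on a schedule and route any non-empty diff to whoever owns the evals, so a shifted default shows up as a spec change before it shows up as a quality regression.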

Trusted publishing died this week

Two coordinated npm campaigns hit 253 package names. The TanStack vector chained GitHub Actions vulnerabilities to mint legitimate publishing tokens, then shipped 84 malicious versions across 42 packages with 12M+ weekly downloads. The Bun-based worm (call it Mini Shai-Hulud) abused optionalDependencies and prepare hooks to exfiltrate GitHub PATs, npm tokens, and cloud credentials from CI runners across the Mistral and TanStack ecosystems. Trusted publishing did not prevent either. The lockfile hashes match. The provenance checks pass. The packages are "trusted" because the workflow that published them was, briefly, a legitimate publisher.
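
Since provenance says nothing about what a package does at install time, it helps to see which manifests declare install-time code at all. A minimal sketch follows; it assumes you can read each dependency's package.json, and the manifest shown is a made-up example rather than one of the affected packages.

```python
import json

# Lifecycle scripts npm may run around an install
# (prepare applies to local and git-based installs).
INSTALL_HOOKS = ("preinstall", "install", "postinstall", "prepare")


def install_time_surface(manifest: dict) -> dict:
    """Summarize the install-time hooks and optional dependencies a manifest declares."""
    scripts = manifest.get("scripts") or {}
    optional = manifest.get("optionalDependencies") or {}
    return {
        "name": manifest.get("name", "<unnamed>"),
        "hooks": {hook: scripts[hook] for hook in INSTALL_HOOKS if hook in scripts},
        "optional_dependencies": sorted(optional),
    }


# Hypothetical manifest for illustration.
manifest = json.loads("""
{
  "name": "example-package",
  "version": "1.0.0",
  "scripts": {"test": "node test.js", "prepare": "node setup.js"},
  "optionalDependencies": {"example-optional-helper": "^2.0.0"}
}
""")

print(install_time_surface(manifest))
```

A non-empty result is not evidence of compromise, only a list of places where code can run before your own code does. Installing in CI with npm's --ignore-scripts option is one way to shrink that list, at the cost of breaking packages that rely on those hooks.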

Running in parallel: TeamPCP backdoored the Checkmarx Jenkins AST plugin (v2026.5.09), the third CI/CD distribution channel from this actor since February. And Cyera's CVE-2026-7482 dropped: a pre-auth heap leak in Ollama, exploitable in three unauthenticated API calls, with roughly 300,000 instances exposed on the public internet. The heap of a running inference server contains exactly what you'd expect: prompts, system prompts, API keys, environment variables. Inference servers are now in the same risk class as the misconfigured Elasticsearch instances of 2018. Same failure mode, new process name.

If any CI runner installed an affected package or ran the backdoored Jenkins plugin since May 11, every credential reachable from that runner is presumed compromised. Rotation is the remediation. Patching is not.
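
Deciding whether a runner installed an affected package comes down to matching resolved name-and-version pairs against the published indicators. The sketch below assumes a package-lock.json in the v2/v3 layout (a top-level "packages" map) and an IOC list you have already loaded as pairs. The package names here are placeholders, not entries from the real lists.

```python
def locked_packages(lock: dict):
    """Yield (name, version) for every dependency in a package-lock.json 'packages' map."""
    for path, entry in lock.get("packages", {}).items():
        if not path:  # the "" key is the root project, not a dependency
            continue
        # Prefer an explicit name (aliases); otherwise take what follows the last node_modules/.
        name = entry.get("name") or path.rpartition("node_modules/")[2]
        version = entry.get("version")
        if version:
            yield name, version


def find_iocs(lock: dict, iocs: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the locked (name, version) pairs that appear in the IOC set."""
    return sorted(set(locked_packages(lock)) & iocs)


# Placeholder lockfile and indicators for illustration.
lock = {
    "lockfileVersion": 3,
    "packages": {
        "": {"name": "my-app", "version": "0.1.0"},
        "node_modules/@example/compromised": {"version": "1.2.4"},
        "node_modules/left-alone": {"version": "3.0.0"},
        "node_modules/left-alone/node_modules/@example/compromised": {"version": "1.2.3"},
    },
}
iocs = {("@example/compromised", "1.2.4"), ("another-bad-package", "9.9.9")}

for name, version in find_iocs(lock, iocs):
    print(f"affected: {name}@{version}")
```

Check lockfiles from every branch and build cache a runner touched since the exposure window opened, not only the current default branch. A hit means rotate.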

What to do this week

Three moves, in order of payback.

Audit the harness. Pick one production workload. Run the Artificial Analysis methodology on three model-and-harness combinations against your own repo. Score cost-per-task, tokens-per-task, cache hit rate, and wall time. Most defaults sit 5-10x on the wrong side of the cost curve at equal quality, and a one-week spike protects a multi-quarter commitment. While you're there, enable speculative decoding with a same-family 1B drafter and instrument acceptance rate per request, alerting when it drops below 60% for any traffic slice. That's where the drafter stops paying rent.
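
The acceptance-rate alert is a small aggregation once the counts exist. A sketch follows, assuming your serving stack can emit drafted and accepted token counts per request along with some traffic-slice label. The record shape, slice names, and numbers below are hypothetical.

```python
from collections import defaultdict


def low_acceptance_slices(records, threshold=0.60, min_drafted=100):
    """Return {slice: acceptance_rate} for slices whose pooled rate is below threshold.

    records: iterable of (slice_name, drafted_tokens, accepted_tokens), one per request.
    min_drafted: ignore slices with too few drafted tokens to judge.
    """
    drafted = defaultdict(int)
    accepted = defaultdict(int)
    for slice_name, n_drafted, n_accepted in records:
        drafted[slice_name] += n_drafted
        accepted[slice_name] += n_accepted

    flagged = {}
    for slice_name, total in drafted.items():
        if total < min_drafted:
            continue
        rate = accepted[slice_name] / total
        if rate < threshold:
            flagged[slice_name] = rate
    return flagged


# Hypothetical per-request counts.
records = [
    ("code-completion", 400, 310),
    ("code-completion", 200, 150),
    ("long-context-refactor", 300, 150),
    ("long-context-refactor", 100, 60),
]

for slice_name, rate in low_acceptance_slices(records).items():
    print(f"ALERT {slice_name}: acceptance {rate:.1%}")
```

Pooling tokens per slice, rather than averaging per-request rates, keeps a handful of tiny requests from tripping the alert.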

Pin the SDK and own the abstraction. Inventory every repo for direct openai/anthropic/google-genai imports. Pin versions. Set up automated diff-monitoring against the upstream API specs. Prototype a 200-line httpx shim for your two highest-volume inference calls so provider-swap is a config change, not a refactor. The optionality just got more valuable than it was last Friday.
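
For a sense of what the shim's core looks like, here is a sketch of the retry path with the transport injected, so the logic can be exercised without a network or an SDK. Everything named here is illustrative: the function names, the retryable status set, and the backoff schedule are assumptions, and a real shim would also need streaming, timeout handling, and retries on transport errors.

```python
import time
from typing import Callable

# Status codes treated as transient in this sketch (an assumption, tune per provider).
RETRYABLE = {429, 500, 502, 503, 504}

Send = Callable[[str, dict, dict, float], tuple[int, dict]]


def httpx_send(url: str, headers: dict, payload: dict, timeout: float) -> tuple[int, dict]:
    """Default transport. httpx is imported lazily so the retry logic has no hard dependency."""
    import httpx

    response = httpx.post(url, headers=headers, json=payload, timeout=timeout)
    try:
        body = response.json()
    except ValueError:  # non-JSON error pages
        body = {"raw": response.text}
    return response.status_code, body


def post_with_retries(
    url: str,
    headers: dict,
    payload: dict,
    *,
    send: Send = httpx_send,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    timeout: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """POST a JSON payload, retrying transient HTTP failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send(url, headers, payload, timeout)
        if status < 400:
            return body
        if status not in RETRYABLE or attempt == max_attempts:
            raise RuntimeError(f"request failed with HTTP {status}: {body}")
        sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

With the URL, headers, and payload builder held in per-provider config, swapping providers means changing which config entry is passed in. The retry semantics stay yours and stay the same across SDK releases.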

Rotate every CI credential and unbind every Ollama instance from 0.0.0.0. Grep lockfiles against the published IOC list for both npm campaigns. Roll back the Checkmarx plugin to 2.0.13-829. Block /api/push at the perimeter for any Ollama instance you can't patch tonight. Pin GitHub Actions to commit SHAs and set workflow-level permissions: {} with minimum grants per job. The next post-mortem will read exactly like this one with a different package name in the headline; the controls that matter are the ones that contain blast radius when the marketplace lies to you again.
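
Checking the SHA-pinning item across a lot of repositories is easy to script. The sketch below scans workflow text line by line for uses: references whose ref is not a full 40-character commit SHA. It is a regex pass, not a YAML parser, so treat it as a first filter; the organization and action names in the example are placeholders apart from actions/checkout.

```python
import re

# Matches "uses: owner/repo@ref" with optional list dash and quotes.
USES = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a full commit SHA."""
    findings = []
    for line in workflow_text.splitlines():
        match = USES.match(line)
        if not match:
            continue
        target = match.group(1)
        if target.startswith(("./", "docker://")):
            continue  # local actions and container images are out of scope here
        _, _, ref = target.partition("@")
        if not FULL_SHA.match(ref):
            findings.append(target)
    return findings


# Example workflow: deny-all at the top, minimum grants per job.
workflow = """
permissions: {}
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: example-org/example-action@0123456789abcdef0123456789abcdef01234567
      - uses: ./local-action
"""

print(unpinned_actions(workflow))
```

A tag like v4 can be moved to point at different code after you reviewed it; a commit SHA cannot, which is why the tag reference is the one that gets flagged.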

◆ Behind the synthesis

Six specialist takes that fed this piece.

The piece above is one stream in my voice. Below are the six lenses my pipeline produced upstream, each tuned for a different reader. Use them when you want the angle that matters most to your role.

Two coordinated npm campaigns hit 253 packages this week: 84 TanStack versions (12M+ weekly downloads) via GitHub Actions credential exfiltration, and 169 packages through a Bun-based worm abusing optionalDependencies and prepare hooks across the Mistral and TanStack ecosystems.

Two credential-theft campaigns are live in CI/CD pipelines.

The Artificial Analysis Coding Agent Index shows more than 30x cost-per-task variance across model and harness pairs at comparable quality.

Anthropic is paying more than $300M for Stainless, the company that builds the developer SDKs for OpenAI and Google.

Anthropic grew from $9B to $45B annualized revenue in five months (5x growth, roughly 48x if that pace compounded for a full year) and is now raising at $1 trillion, while Cerebras prints Thursday at $50B+ on a single OpenAI contract that converts compute spend into 11% equity with termination rights.