Sunday, March 22, 2026 · ~4 min read

The AI bill came due this week — from six different directions

Microsoft retreated on Copilot, Alibaba and Tencent shed $66B in a day, METR proved benchmarks lie by half, and one user burned 870M tokens on a Monday. The 'add AI everywhere' era ended.

Six things happened this week that, taken together, mark the end of a phase.

Microsoft stripped Copilot out of Snipping Tool, Photos, Widgets, and Notepad after acknowledging "near-universal" negative feedback. Xbox's new lead Asha Sharma opened her tenure with the phrase "No Soulless AI Slop." Alibaba and Tencent lost $66 billion in combined market cap in twenty-four hours — not for bad AI, but for vague monetization. METR published research showing roughly half of AI-generated PRs that pass SWE-bench Verified would fail human review. Anthropic acquired NVIDIA's enterprise crown — surging from 40% to 73% share in ninety days, while Claude Code alone reportedly cleared $2.5B in February revenue. And one documented power user, Azeem Azhar, hit 870 million tokens in a single day running a four-sub-agent setup — up from 150K/day in summer 2024.

None of these are the story. The story is that they all happened in the same week.

The copilot ceiling is real and someone finally measured it

NVIDIA's own chip-design AI failed in 2023. Not on capability — on adoption. Hardware engineers wouldn't trust outputs they couldn't trace. Product Lead Shraddha Sridhar rebuilt the system around source attribution and verifiability, and only then did the team use it. She also gave the cleanest framework I've seen for AI deployment: Tier 1 is individual productivity (the copilot in your sidebar, capped around 30% time saved). Tier 2 is team-level scaling. Tier 3 is capability expansion — things you couldn't do at all before.

Most enterprises are entirely in Tier 1. That's why ROI feels anemic. The Microsoft retreat, the Xbox positioning, the consumer hostility: all of it is the market punishing Tier 1 deployments dressed up as transformation. Hachette pulled a novel on suspicion of AI use. Not proof. Suspicion. When that's the brand environment, shipping a chat panel inside a tool where nobody asked for one is negative-value work.

Traceability is the unlock. NVIDIA proved it internally. If your AI feature can't show its sources at the point of output, your sophisticated users will reject it the same way NVIDIA's hardware engineers did — and you won't get a second chance to earn the trust back.

Your benchmarks are lying by a factor of two

METR's number — ~50% of SWE-bench-passing PRs fail human review — should be the most-cited finding of the quarter. Every vendor pitch, every internal eval, every board slide that cites a benchmark score is overstated by roughly 2x against the thing you actually care about, which is whether the code merges and ships.

This matters more because of how it compounds with the "almost perfect" problem. Raffi Krikorian, who ran Uber's self-driving program, crashed his Tesla and wrote the line: a system that works 97% of the time trains humans to stop catching the 3%. Your highest-accuracy AI features are where your worst incidents will originate. The supervisor isn't supervising. They're rubber-stamping. Audit your human-in-the-loop gates — any one with a 98%+ approval rate is a checkbox, not a control.
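A minimal sketch of that audit, assuming you can export approve/reject counts per review gate. The gate names, the counts, and the 50-review minimum are hypothetical; only the 98% threshold comes from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class GateStats:
    name: str
    approved: int
    rejected: int

    @property
    def approval_rate(self) -> float:
        total = self.approved + self.rejected
        return self.approved / total if total else 0.0


def rubber_stamp_gates(gates, threshold=0.98, min_reviews=50):
    """Return gates whose approval rate is at or above the threshold.

    min_reviews is an assumed guard so a gate with a handful of
    decisions is not flagged on thin evidence.
    """
    return [
        g for g in gates
        if g.approved + g.rejected >= min_reviews
        and g.approval_rate >= threshold
    ]


# Hypothetical export of human-in-the-loop decisions.
gates = [
    GateStats("refund-approval", approved=990, rejected=10),
    GateStats("code-merge-review", approved=820, rejected=180),
    GateStats("contract-summary", approved=49, rejected=0),
]

flagged = [g.name for g in rubber_stamp_gates(gates)]
print(flagged)
```

A flagged gate is not proof that reviewers have stopped reading. It is a prompt to sample a few approved items and check them by hand.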

The fix isn't more humans or more AI. It's curation. Stripe and Coinbase run autonomous coding agents in production with roughly fifteen tools and an AGENTS.md file encoding their conventions. LangChain just open-sourced the same pattern as Open SWE under MIT. The capability is no longer the constraint — the constraint is whether you've codified your team's quality bar in a form a model can act on.

Token consumption is exponential and your spreadsheet is linear

870 million tokens in a day from one user is the data point that should reset every infrastructure forecast in your company. The driver isn't heavier chat use — it's multi-agent orchestration. One coordinator, four sub-agents, each paying full token cost on each call. No shared context. No deduplication. The economics are 1,000x to 6,000x what your chat-era model assumes.
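The arithmetic is worth running once. The sketch below takes the two daily figures reported above and adds a toy cost model for orchestration without shared context. The context sizes, output sizes, and call counts are illustrative assumptions, not measurements from Azhar's setup.

```python
# Figures reported in the piece.
SUMMER_2024_TOKENS_PER_DAY = 150_000
PEAK_TOKENS_PER_DAY = 870_000_000

growth = PEAK_TOKENS_PER_DAY / SUMMER_2024_TOKENS_PER_DAY


def tokens_per_task(context_tokens, output_tokens, sub_agents, calls_per_agent):
    """Tokens for one task when every call resends its full context.

    Models the no-shared-context case: the coordinator and each
    sub-agent pay the whole context plus output on every call.
    """
    agents = 1 + sub_agents  # coordinator plus sub-agents
    per_call = context_tokens + output_tokens
    return agents * calls_per_agent * per_call


# Hypothetical chat-era request: one model, one call, small context.
single_chat = tokens_per_task(
    context_tokens=2_000, output_tokens=500, sub_agents=0, calls_per_agent=1
)

# Hypothetical orchestrated task: four sub-agents, 20 calls each, large context.
orchestrated = tokens_per_task(
    context_tokens=40_000, output_tokens=2_000, sub_agents=4, calls_per_agent=20
)

print(f"daily growth: {growth:,.0f}x")
print(f"per-task multiplier: {orchestrated // single_chat:,}x")
```

Under these assumed inputs the per-task multiplier lands inside the 1,000x to 6,000x band. The point of the model is that the multiplier is a product of agent count, calls per agent, and context size, so a forecast that scales only one of them will undershoot.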

NVIDIA's response — paying $20B for Groq, announcing a Vera Rubin + Groq architecture claiming 35x throughput per megawatt over Blackwell — is the supply side conceding that GPUs were never the right shape for inference. Prefill is compute-bound; decode is memory-bandwidth-bound; one chip can't be optimal for both. Demand-paging research claiming 90% memory reduction with sub-1% accuracy loss is the second pincer. Inference costs will fall hard in the next 12-18 months even as aggregate consumption explodes.

Which means: the AI feature you killed last quarter for cost reasons is worth re-prototyping now. The pricing model you set for usage-based AI is probably fine. The pricing model you set as fixed-fee with variable inference cost is a margin trap waiting to spring. And if your CFO is still treating tokens as an IT line item rather than a productive input, you're going to lose the budget fight to a competitor whose CFO learned the difference.
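To see how the fixed-fee trap springs, here is a rough sketch with made-up numbers. The $30 monthly fee, the $2 per million tokens, and both usage levels are hypothetical and chosen only to make the shape of the problem visible.

```python
def gross_margin(monthly_fee, tokens_per_month, cost_per_million_tokens):
    """Gross margin on one seat when inference is the only variable cost."""
    inference_cost = tokens_per_month / 1_000_000 * cost_per_million_tokens
    return (monthly_fee - inference_cost) / monthly_fee


FEE = 30.0         # hypothetical fixed monthly price per seat
COST_PER_M = 2.0   # hypothetical blended inference cost per million tokens

# Chat-era user: about 3M tokens a month.
chat_user = gross_margin(FEE, 3_000_000, COST_PER_M)

# Agentic user on the same plan: about 60M tokens a month.
agent_user = gross_margin(FEE, 60_000_000, COST_PER_M)

print(f"chat user margin:  {chat_user:.0%}")
print(f"agent user margin: {agent_user:.0%}")
```

The same seat goes from a healthy margin to a loss without any change in price, which is why a usage-based plan survives the shift and a fixed-fee plan may not.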

What to do this week

One thing, specifically. Pick the AI feature in your product with the highest usage and instrument three numbers against it: real-world success rate (not benchmark score), source-traceability coverage (can the user verify any output to its origin), and tokens-per-successful-outcome (not tokens-per-call).
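As a starting point, the three numbers can be computed from an interaction log like the one sketched below. The `Interaction` fields and the sample log are hypothetical; what counts as "succeeded" (merged, shipped, accepted by the user) is something your team has to define.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    succeeded: bool    # did the outcome hold up in real use?
    has_sources: bool  # could the user trace the output to its origin?
    tokens: int        # all tokens spent, including retries and sub-agent calls


def feature_metrics(log):
    """Compute the three numbers, plus tokens-per-call for contrast."""
    if not log:
        raise ValueError("log is empty")
    n = len(log)
    successes = sum(i.succeeded for i in log)
    traceable = sum(i.has_sources for i in log)
    total_tokens = sum(i.tokens for i in log)
    return {
        "success_rate": successes / n,
        "traceability_coverage": traceable / n,
        "tokens_per_success": total_tokens / successes if successes else float("inf"),
        "tokens_per_call": total_tokens / n,
    }


# Hypothetical sample of four interactions.
log = [
    Interaction(succeeded=True, has_sources=True, tokens=12_000),
    Interaction(succeeded=False, has_sources=True, tokens=30_000),
    Interaction(succeeded=True, has_sources=False, tokens=8_000),
    Interaction(succeeded=False, has_sources=False, tokens=50_000),
]

m = feature_metrics(log)
for name, value in m.items():
    print(f"{name}: {value:,.2f}")
```

In this sample, tokens-per-successful-outcome is double tokens-per-call, because the failed interactions still consumed tokens. That gap is the cost that a per-call dashboard hides.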

Those three numbers tell you whether you're in Tier 1 with a 30% ceiling, whether your users will stay once the novelty wears off, and whether your unit economics survive the consumption curve that's already arrived for early adopters and is on its way to everyone else. Every other AI metric on your dashboard is downstream of these.

The "add AI" phase is over. The "prove AI" phase started this week.

◆ Behind the synthesis

Six specialist takes that fed this piece.

The piece above is one stream in my voice. Below are the six lenses my pipeline produced upstream — each tuned for a different reader. Use them when you want the angle that matters most to your role.

METR just quantified what every senior engineer suspected: ~50% of AI-generated PRs that pass SWE-bench automated grading would fail human code review.

Claude Code Channels now bridges Telegram and Discord directly to live code execution sessions — protected only by a sender allowlist and pairing code.

Multi-agent workflows are driving 1,000–6,000x increases in per-user token consumption — and NVIDIA just valued Groq at $20B to solve it.

NVIDIA just paid $20B for inference chip maker Groq and announced 35x throughput gains over its own Blackwell — while real-world token consumption among agentic early adopters has exploded 6,000x in two years.

Microsoft just retreated on Copilot after 'near-universal' negative user feedback, NVIDIA's own chip-design AI failed until they rebuilt their entire org around it, and three sources independently confirm copilot ROI is hitting a hard ceiling at ~30% task acceleration.