Edition 2026-05-08 · read as Product
MicrosoftCutsAIin81Products:AuditYourRoadmapNow
- Sources
- 42
- Words
- 1,873
- Read
- 9min
Topics Agentic AI LLM Inference AI Capital
◆ The signal
Microsoft killed AI features across 81 products this week after customers called them 'functionally useless' — while the surviving features (365 Copilot) grew paying users 33%. The dividing line: features that automate a task users already hate, producing output good enough to ship without editing, live. Everything else dies. Your roadmap's 17 'AI-powered' items have the same split sitting inside them — run the audit before your next planning review, because 12 of them are inference cost with no retention signal attached.
◆ INTELLIGENCE MAP
01 Microsoft's 81-Product AI Pruning: The Kill Framework
act nowMicrosoft killed Copilot in Gaming, Photos, Widgets, and Notepad while 365 Copilot grew 33% in paying users. The survivors automate weekly workflows (meeting transcription, email drafts). The dead added chat surfaces where nobody was asking questions. Nadella merged all Copilot under one EVP to enforce centralized quality gates.
- Products with Copilot
- Paying user growth
- Features killed
- Models now used
02 Agent Payment Rails Ship: Stripe, Google Cloud, Coinbase Go Live
monitorStripe shipped 280+ agentic commerce features including agent wallets. Google Cloud launched pay.sh with Solana for $0.001-$20 stablecoin micropayments. Coinbase shipped agentic.market. NFX's Project Deal proved agents close real transactions ($4K, 46% WTP). Per-seat pricing's expiration date moved from 'someday' to 'this cycle.'
- Stripe new features
- pay.sh providers
- WTP for agent commerce
- Payment range
- Stripe agent walletsLive now
- Google pay.sh + SolanaLive now
- Coinbase agentic.marketLive now
- Anchorage KYA BankingLive now
- Per-seat model expires3-5 years
03 AI Inference Margin Crisis: The 30-Point Gap Has a Fix
act nowAI features run at 50-60% gross margins vs. 80-90% for traditional SaaS. Reasoning models push per-query cost 10x higher. But vLLM delivers 2-4x speedups, prompt caching cuts 90% of repeated costs, and EAGLE decoding adds 45% speed. At 10M daily requests, a 30% optimization saves $20M+/year. Microsoft admits even free-model inference drags margins.
- SaaS margin gap
- Reasoning cost spike
- vLLM speedup
- Prompt cache savings
- EAGLE speed gain
04 Verification > Generation: RAG Cliff + Legal Exposure
monitorRAG accuracy collapses from 90.7% at 5K docs to 50.6% at 500K — a coin flip at enterprise scale. Character.AI is being sued for fabricating a medical license number. Apple paid $250M for marketing undelivered AI features. Dario Amodei named verification as the Amdahl's Law bottleneck. Value is migrating from output generation to output trust.
- RAG at 5K docs
- RAG at 500K docs
- Apple AI settlement
- BM25 at 500K docs
- 5K documents91
- 100K documents70
- 500K documents51
05 GitHub at Zero Nines: AI Load Breaks Developer Infrastructure
backgroundGitHub dropped to 86% uptime — 2-3 hours of daily degradation. AI agents drove 3.5x load growth in 2 years; GitHub revised capacity targets from 10x to 30x in 4 months. Mitchell Hashimoto (18-year user) left publicly. Competitors handle similar load without collapse. Any product shipping through GitHub Actions has release cadence coupled to daily outage windows.
- Daily degradation
- AI load growth
- Scale target revision
- GitHub engineers
- GitHub uptime86
- Typical SaaS SLA99.9
◆ DEEP DIVES
01 Ship or Kill: Microsoft's $100B Pruning Experiment Gives You the Decision Grid
The Experiment, Concluded
A user opened Notepad this week to jot a phone number. The Copilot button was there. She did not press it. She has never pressed it. Microsoft shipped Copilot into 81 distinct product surfaces over 18 months and this week killed it in Gaming, Photos, Widgets, and Notepad, while 365 Copilot grew paying users 33% quarter-over-quarter. Pavan Davuluri's phrasing gives it away: users "want it to be better," not "want more of it." Coverage was never the product.
AI distribution is not AI value. The survivors live inside paid workflows and replace work the user was doing by hand. The dead added a chat surface to utility apps where no conversation was happening.
The Grid That Predicts Survival
Two axes fall out of the data. Axis 1: the AI replaces a task the user actively avoids (meeting transcription, email drafts), or it adds a layer to a task the user already does competently. Axis 2: the output has to be correct, or plausible is enough. The shippable cell is "replaces avoided task" where "plausible is enough." Every other cell is a demo with a roadmap ticket attached.
Why This Applies to 17 AI Features on a Roadmap
Meta's internal token-consumption leaderboard was gamed immediately. Engineers wrote scripts that burned millions of tokens doing nothing. Meta shut it down. The same pattern shows up in product dashboards. Teams count tokens consumed and sessions opened and call it adoption. The durable numbers are harder to collect: retention of AI-assisted workflows after 30 days, time-to-first-useful-output, percentage of AI output that ships to production without rewrite.
Multiple sources confirm the bottleneck has moved from engineering capacity to discovery and specification quality. When shipping takes an afternoon, shipping the wrong thing takes an afternoon too. Feature count goes up. Time-to-value does not.
The Margin Forcing Function
Microsoft gets OpenAI's technology at preferential rates and still admitted Copilot inference costs drag margins. Every AI feature carries an ongoing compute cost that scales with usage, not with value delivered. A feature with 5% engagement and 100% inference cost on every page load is burning money on 95% of impressions. Microsoft is not cutting features because users hate them. It is cutting features because each impression has a marginal cost and most impressions don't earn it back.
The Organizational Response
Nadella merged consumer and enterprise Copilot under a single EVP, Jacob Andreou. The diagnosis is governance, not product. When every product team independently bolted a chatbot onto its surface, nobody owned the holistic experience or the total inference bill. Teams with distributed AI feature ownership and no central quality gate are on the trajectory Microsoft was on six months ago.
Action items
- Map every AI feature in your product onto the 2x2 grid (replaces-avoided-task vs. adds-layer, and plausible-output vs. must-be-correct) by end of this sprint
- Pull 30-day retention and usage depth for each AI feature shipped in the last 90 days — replace token/session metrics on leadership dashboards
- Propose a centralized AI product owner or quality council to leadership using Microsoft's 81-product cautionary tale
- Kill or pause at least 3 AI features that show no usage lift after 4+ weeks in production
Sources:Aaron Holmes · Brian Ardinger, Inside Outside Innovation · Engineer's Codex · ben's bites
02 Your Next Buyer Has No Eyes: Agent Commerce Infrastructure Went Live This Week
Three Payment Rails, One Week
A product manager watching the agent-commerce space opened three tabs this week and found the same pattern shipping from different directions:
- Stripe Sessions 2026: 280+ features organized around "agentic commerce" — agent wallets, autonomous transacting across fiat and stablecoin rails, AI checkout optimization
- Google Cloud + Solana: pay.sh puts Gemini, BigQuery, and Vertex AI behind stablecoin micropayments at $0.001-$20 per call, no account creation required, MCP-server compatible with Claude
- Coinbase: agentic.market with x402 protocol; Anchorage launched KYA-compliant Agentic Banking
When Google is willing to sell its own AI services on Solana stablecoin rails at a tenth of a cent per call, the interesting thing is not the crypto. It is the pricing model.
The Demand Signal Is Real
Anthropic's Project Deal put 69 participants through actual agent-to-agent negotiation. The agents moved $4,000 in actual transactions. 46% of participants said they would pay to keep using it. That is the number that matters — not demo traffic, not pilot NPS, willingness-to-pay on a workflow that already exists. NFX published a category thesis calling agentic marketplaces "bigger than SaaS" to 160K+ founders. Expect 50+ funded startups inside 12 months.
Per-Seat Pricing Has a Timer
HubSpot is pursuing full API parity with their UI so "agents can run on HubSpot, and agents can run HubSpot." Coming from a company with 200K+ customers, that sentence is the admission that the human dashboard is no longer the primary surface. ServiceNow shipped AI Control Tower with shadow agent discovery. Microsoft's Agent 365 is GA. The governance layer for agent-as-user is shipping faster than most products are designing for it.
The thing pitched as "AI pricing strategy" is usually a discount grid. The thing actually being done is re-pricing for a customer that calls an API ten thousand times in an hour and then goes quiet for a week. That customer is a terrible subscription and a natural micropayment customer. A product with an API and no agent-native billing path has a distribution channel that is closed on Monday.
The Standards War
Standard Backer Best For Differentiator pay.sh Google + Solana AI developers 75 providers, MCP/Claude native x402 / agentic.market Coinbase Consumer apps Exchange liquidity, brand trust Agentic Banking Anchorage Enterprise/regulated KYA identity, compliance Developer adoption velocity decides this one, not spec elegance. pay.sh has early advantage with 75 launch providers and Claude compatibility.
Action items
- Map your product's transaction flow from an AI agent's perspective — identify every point requiring human presence (CAPTCHAs, email confirmations, visual browsing, seat-based auth)
- Model revenue impact if 30% of 'seats' become AI agents over 3 years — identify which pricing levers (API calls, outcomes delivered, data processed) supplement per-seat
- Evaluate pay.sh integration as a distribution channel for any API you ship — test whether per-request stablecoin payments unlock agent usage impossible under current billing
- Add 'agent-as-user' persona to your next user research cycle with specific questions about delegatable transactions
Sources:TLDR Fintech · TLDR IT · NFX · TLDR Crypto · ben's bites · TLDR Marketing
03 The 30-Point Margin Leak: An Inference Optimization Playbook You Can Execute in Weeks
The Problem Is Structural
A PM looked at her inference line last month and saw it had tripled while revenue doubled. She is not alone. AI features run at 50-60% gross margins against 80-90% for traditional SaaS, per BVP data. This is not a phase companies grow out of. COGS scales with usage in a way SaaS COGS never did. Every API call has a real marginal cost. Reasoning models make it worse. A single deep-reasoning response burns 100 turns' worth of tokens, pushing per-query cost up 10x even as per-token costs fall.
If the company with free models cannot make the economics work at scale, the smaller shops pretending they can are fooling themselves.
The Optimization Stack Is Production-Ready
A year ago these techniques were conference talks and research papers. Today they ship as infrastructure:
- Prompt caching: Anthropic's technique cuts 90% of costs on repeated system prompts. vLLM plus Mooncake hit 92.2% cache hit rates on agentic workloads, up from 1.7%, with 3.8x throughput and 46x lower time-to-first-token.
- Model routing: Route 80% of queries to fast, cheap models. Reserve expensive reasoning for the cases users will pay a premium for. Google's GKE Inference Gateway routes to warm caches automatically.
- vLLM + speculative decoding: vLLM delivers 2-4x speedups from a backend switch. EAGLE speculative decoding adds 45% on top. Both are in SGLang and TensorRT-LLM already.
- Full-stack stacking: Yandex published a blueprint combining quantization, EAGLE3, KV cache reuse, and parallelization for 5.8x total speedup, cutting token generation from 140ms to 67ms.
The Dollar Math
At $0.02 per inference and 10M daily requests, the inference line runs $70M a year. A 30% reduction, which any one of the techniques above can deliver on its own, saves $20M+ annually. That is not an engineering efficiency metric. It is a P&L line larger than most features sitting on the roadmap this quarter.
The Anthropic Advisor Pattern
Anthropic shipped an advisor strategy this week. Sonnet runs as primary inference and calls Opus on-demand when the problem warrants it. The claim is frontier-model quality at 5x lower cost. The old binary of expensive-and-good versus cheap-and-worse no longer describes the Pareto frontier. OpenAI claims 1,000x cost reduction over 14 months through stacked optimizations.
The PM Decision
Inference optimization is not a platform ticket to defer to next quarter. It is a product commitment, and the PM owns the margin number the same way they own retention and time-to-value. The forcing function is straightforward. Put the annualized dollar value of a 30% inference cost reduction on one side of the page. Put the expected revenue contribution of Feature X on the other. At scale, optimization almost always wins the sprint.
Action items
- Build a per-feature inference cost model mapping each AI capability to actual cost-per-use — present to finance as 'AI ROI dashboard' within 2 weeks
- Commission a 2-week infra spike on prompt caching and vLLM adoption for your highest-volume AI endpoints
- Add 'inference cost impact' as a required field in your PRD template for any AI-powered capability
- Implement tiered model routing: fast/cheap models for 80% of queries, expensive reasoning only where users demonstrably value it
Sources:🔳 Turing Post · AINews · The Information AM · Future Perfect · Aaron Holmes
04 Verification Is Where Value Migrates: The RAG Cliff, Credential Lawsuits, and the $250M Apple Precedent
The RAG Scaling Cliff, Quantified
A PM ships a knowledge assistant into a 5K-document pilot and it works. Retrieval feels crisp. Legal signs off. Six months later the corpus hits 500K documents and the complaints start. That curve has a number on it now. Onyx's EnterpriseRAG-Bench measures vector search accuracy falling from 90.7% at 5K documents to 50.6% at 500K. At enterprise scale, the retrieval layer is a coin flip. The mechanism is not mysterious. At 5K docs, 3-5 documents touch any given topic and top-k retrieval lands the right ones. At 500K, 40-60 documents sit in the same embedding neighborhood and the system cannot tell them apart.
BM25 keyword search degrades more gracefully (85.8% → 68.4%), which reframes hybrid retrieval P0 infrastructure, not a nice-to-have. Adding BM25 reranking is typically 1-2 sprints for roughly 18 percentage points of accuracy recovery. Knowledge-graph architectures like Rowboat (13K+ GitHub stars) hold query cost flat as the corpus grows. That is a different scaling curve, not a faster version of the same one.
Legal Exposure From Unverified Output
Three legal developments landed in the same week:
- Character.AI: Pennsylvania sued after the chatbot fabricated a specific medical license number and claimed to be a licensed psychiatrist. A novel consumer-protection theory any state AG can copy.
- Apple: Settled for $250M over AI features marketed before delivery. 37M devices, $25-$95 per claim. First enforceable precedent coupling GTM timeline to engineering delivery.
- Connecticut SB5: Automated decision-making is codified as not a defense to discrimination. Deployers own discriminatory outcomes whether or not AI made the call.
As generation gets faster, verification becomes the serial bottleneck. Value is moving from 'AI writes it' to 'can I trust what AI wrote.'
The Product Implication
Dario Amodei invoked Amdahl's Law on AI coding: generation speed is irrelevant if a human still spends 40 minutes reviewing a 400-line diff trying to decide what to trust. Cognition's Devin Review targets that gap directly. The feature the team is pitched on is generation throughput. The feature users actually need two quarters out is the one that cuts verification time: confidence scoring, automated review, audit trails, formal verification. Generation is commoditizing. Verification is not.
The Design Response
Here is the 2x2 worth drawing on a whiteboard this sprint. One axis: corpus size above or below 100K documents. Other axis: does the product act on the output, or surface it for human approval. Products in the large-corpus-plus-act-without-checking cell face a 50.6% accuracy rate that is functionally a lawsuit. The fix hierarchy is unambiguous. Ship hybrid retrieval with BM25 reranking this sprint. Put reranker investment ahead of the next embedding model upgrade. Get a knowledge-graph-based architecture on the roadmap before the corpus exceeds 100K docs. For conversational AI, the minimum viable guardrail is an output classifier that detects credential claims, professional titles, and license-number patterns before the response leaves the server. That is the Character.AI case, caught at the edge.
Action items
- Run your RAG pipeline against EnterpriseRAG-Bench at projected 12-month corpus size — not pilot size — before next enterprise deal closes
- Add hybrid retrieval (BM25 + vector search) to your retrieval architecture if not already present — 1-2 sprint investment for 18-point accuracy recovery
- Audit all customer-facing AI claims — marketing pages, app store descriptions, sales decks — for features marketed but not fully delivered, using Apple's $250M settlement as the legal bar
- Add an output classifier detecting credential claims, professional titles, and license-number patterns to any conversational AI feature before next release
Sources:Daily Dose of DS · Future Perfect · Morning Brew · The Hustle · AINews · Techpresso
◆ QUICK HITS
Anthropic's Sonnet-calling-Opus 'advisor pattern' achieves frontier quality at 5x lower cost — re-run your cost model against this new Pareto frontier before next pricing review
Future Perfect
EU AI Act high-risk deadline pushed 16 months from August 2026 to December 2027 — free capacity for feature work, but build compliance into architecture now rather than bolting on later
AI Weekly
Google's Universal Commerce Protocol enables in-SERP purchases — users buy without visiting merchant sites, severing email capture, pixel fires, and retargeting events from the conversion funnel
MarketingShot
Pinterest hit $1B quarterly revenue (+18% YoY) by matching intent signals to outcomes — 24% higher conversions and 80% A/B win rate versus engagement-optimized targeting
TLDR Design
ProgramBench: best AI model passes 95% of tests on only 3% of real software rebuild tasks — scope AI coding features to bounded single-module tasks where reliability is proven
Techpresso
Update: Anthropic doubled Claude Code rate limits and removed peak-hour throttling via SpaceX Colossus 1 (220K GPUs) — revisit features previously deferred due to rate limit constraints
Simplifying AI
AI productivity gains plateau at ~6 months without operating model changes (team composition, career ladders, decision rights) — pair every AI capability investment with a defined org change
TLDR Data
Google standalone AI agent (Project Mariner) killed after 17 months, folded into Chrome — standalone agents are being absorbed into platforms, not becoming products
Techpresso
CISA considering cutting patch deadlines from 3 weeks to 3 days while Mozilla found 271 Firefox vulnerabilities with one AI fuzzer run — patch pipeline architecture becomes a product requirement
Risky.Biz
Google Chrome silently installing ~4GB Gemini Nano on desktop devices plus testing agentic AI 'Remy' — any feature competing with 'good enough' in-browser assistance faces zero-latency default behavior
Simplifying AI
Bing formally split index into 'search' (ranks pages for humans) and 'grounding' (retrieves passages for AI answers) — content strategy needs two separate optimization targets
MarketingShot
Meta rolling out behavioral AI age detection (birthday posts, school references, bios) across US/EU/UK/Brazil by June 2026 — raises regulatory floor for every consumer app with social features
Mindstream
◆ Bottom line
The take.
Microsoft spent $100B shipping AI into 81 products and just proved that AI distribution is not AI value — the features that survived automate hated weekly tasks with good-enough output, everything else burned inference cost for zero retention. Your roadmap has the same split: run the 2x2 audit this sprint, kill the features in the wrong cell before they become your own 'functionally useless' story, and redirect that inference budget into the optimization stack (vLLM + prompt caching recovers 20-30 margin points in weeks, not quarters) and the verification layer where defensible value is migrating as generation commoditizes.
Frequently asked
- How do I decide which AI features on my roadmap to kill?
- Map each feature onto two axes: does it replace a task users actively avoid, and is plausible output good enough (no editing required)? Only the cell where both are true survives. Microsoft's pruning of 81 product surfaces — while 365 Copilot grew paid users 33% — confirms features outside that cell don't earn back their inference cost.
- What metrics should replace tokens consumed and sessions opened on AI dashboards?
- Track 30-day retention of AI-assisted workflows, time-to-first-useful-output, and the percentage of AI output shipped to production without rewrite. Meta's internal token leaderboard was gamed within days by engineers running scripts that burned millions of tokens doing nothing. Input metrics tell you nothing about whether users kept the workflow.
- How much can inference optimization actually save, and how fast?
- At $0.02 per inference and 10M daily requests, a 30% reduction saves $20M+ annually — and any one of prompt caching, vLLM adoption, or model routing can hit that alone. Prompt caching cuts 90% of repeated-prompt costs; vLLM delivers 2-4x speedups from a backend switch. Two-week infra spikes are realistic timelines.
- Why does my RAG demo work at pilot scale but break in enterprise deployments?
- Vector search accuracy drops from 90.7% at 5K documents to 50.6% at 500K, because dense embedding neighborhoods get crowded and top-k retrieval can't disambiguate. BM25 keyword search degrades more gracefully (85.8% to 68.4%), so hybrid retrieval with BM25 reranking is now P0 infrastructure — typically 1-2 sprints for ~18 points of accuracy recovery.
- What does agent-native commerce mean for per-seat SaaS pricing?
- Per-seat pricing faces structural pressure as agents become primary users — HubSpot is pursuing full API/UI parity so 'agents can run HubSpot,' and pay.sh, x402, and Anchorage's Agentic Banking all shipped agent-payment rails this week. Model what happens if 30% of seats become agents over 3 years, and identify pricing levers (API calls, outcomes delivered, data processed) that supplement seat-based revenue before renewals force the conversation.
◆ Same day, different angle
Read this day as…
◆ Recent in product
Keep reading.
- Princeton's ICML 2026 study proved that GPT 5.5, Gemini 3.1 Pro, and Claude Opus 4.7 are NOT more reliable than their predecessors on agent…
- GitHub logged 17 million agent-generated pull requests in March 2026 — 3x their projected growth — and switches to usage-based billing June…
- Anthropic eliminates the 70-90% implicit discount on third-party Claude tool usage starting June 15 — and OpenAI is offering 2 months free C…
- Anthropic's June 15 pricing change eliminates the 70-90% implicit discount on Claude usage through third-party tools (Cursor, Cline, Zed, Op…
- Anthropic's June 15 pricing restructure eliminates the 70-90% implicit discount third-party harness users (Cursor, Cline, OpenCode) have bee…