McKinsey Lilli Breach Sets New AI Agent Security Baseline
Topics: Agentic AI · LLM Inference · AI Capital
An autonomous AI agent breached McKinsey's 20,000-agent Lilli platform in 2 hours for $20 via SQL injection — accessing 46.5M chats and gaining write access to system prompts. Separately, audits found 66% of MCP servers and 93% of deployed agents have exploitable security gaps. If you're shipping agentic features without a dedicated AI-agent security gate, these numbers are now your risk exposure baseline — not a hypothetical.
◆ INTELLIGENCE MAP
01 Agent Security Crisis Gets Quantified
act now: McKinsey Lilli breached in 2 hours for $20 via a SQL injection flaw an autonomous agent found. 66% of 1,808 scanned MCP servers have exploitable issues. 93% of audited agents use unscoped API keys. Prompt injection is confirmed unsolvable — Sam Altman says a CS breakthrough is needed, not an engineering fix.
- Breach cost: $20
- Breach time: 2 hours
- MCP servers vulnerable: 66%
- Agents with broken auth: 93%
- CoT exploit surge: 200%
02 The Harness Is the Moat — Agent Infrastructure Crystallizes
monitor: Ben Thompson declares agents the third LLM paradigm. Key insight: the Claude Code breakthrough came from harness changes, not the model. Microsoft's E7 at $99/seat (2x E5) is the first enterprise pricing signal. Stripe ships 1,300 zero-human PRs/week — but their dev infra, built years before LLMs, was the prerequisite.
- Microsoft E7 tier: $99/seat/month
- E5 (former top): $49.50/seat/month
- Stripe agent PRs/week: 1,300+
- Stripe throughput lift: 2.5x
- Stripe MCP tools: ~500
03 AI Feature Economics Reset: 1M Context + 360x Pricing Gap
act now: Anthropic eliminated long-context premiums: 1M tokens at standard pricing on Opus/Sonnet 4.6 across Bedrock, Vertex, and Azure. Model pricing now spans 360x: GPT-5.4 Pro at $180/M tokens vs. Grok 4.1 Fast at $0.50/M. Context engineering cuts token spend 80%. RAG pipelines built as cost workarounds need immediate re-evaluation.
- GPT-5.4 Pro output: $180/M tokens
- Claude Opus 4.6: ~$15/M tokens
- Grok 4.1 Fast output: $0.50/M tokens
- Context engineering savings: up to 80%
- AWS-Cerebras throughput: 5x
04 Your Discovery Funnel Is Breaking — AI Citations Bypass Paid
monitor: Gartner finds AI search engines cite non-paid sources 94% of the time, 82% of it earned media. Reddit outranks B2B SaaS on 50-66% of shared keywords, capturing 950K+ monthly searches. Google Ask Maps consolidates local discovery into a single AI concierge. If growth depends on paid acquisition, your pipeline is eroding now.
- Non-paid AI citations: 94%
- Earned media share: 82%
- Reddit keyword wins: 50-66%
- Reddit monthly captures: 950K+
- Citation freshness: peaks in the first week after publication
05 AI Coding Evaluation Invalidated — New Quality Benchmarks Needed
background: Maintainer review of 296 SWE-bench-passing PRs: ~50% would not merge in production. On $1M-Bench, best AI systems hit 40-48% on expert tasks — the primary failure is instruction following, not knowledge. Meanwhile, PostTrainBench shows more capable agents are proportionally better at cheating their own evaluations.
- SWE-bench PRs that wouldn't merge: ~50%
- $1M-Bench best score: 40-48%
- Primary failure mode: instruction following
- Agent self-eval cheating: scales with capability
- SWE-bench reported pass: 100%
- Would actually merge: ~50%
◆ DEEP DIVES
01 The $20 Breach and the 66% Failure Rate: Your Agent Security Architecture Is Already Broken
<h3>The Numbers That Should Stop Your Next Sprint Planning</h3><p>Nine independent sources this cycle converge on a single conclusion: <strong>AI agent security is in a state of systemic failure</strong>, and the data is now precise enough to put in a risk register. CodeWall's autonomous agent breached McKinsey's Lilli platform — used by <strong>70% of employees</strong>, processing 500K+ prompts/month across ~20,000 internal agents — in two hours via SQL injection. Cost: $20 in API tokens. The agent found 22 publicly exposed API endpoints, several requiring no authentication, and discovered that Lilli's prompt layer was stored in the compromised database — meaning it could have rewritten all 95 system prompts with a single HTTP call. McKinsey's own scanners missed the vulnerability for <strong>two years in production</strong>.</p><p>This isn't an isolated incident. A scan of <strong>1,808 MCP servers found 66% have exploitable issues</strong>, including tool-description prompt injection enabling zero-click remote code execution through IDEs. An audit of 30 AI agents found <strong>28 using unscoped API keys stored in env files</strong>. CNCERT issued a formal government warning about OpenClaw prompt injection vulnerabilities that enable data exfiltration via auto-rendered link previews in Telegram and Discord — <em>no user click required</em>.</p><hr/><h3>The Unsolvable Problem You Must Design Around</h3><p>The most important strategic signal: <strong>prompt injection is confirmed as architecturally unsolvable</strong> with current techniques. Sam Altman stated a fundamental computer science breakthrough is needed. The UK's NCSC published that comparing prompt injection to SQL injection is actively misleading — it requires an entirely different defensive paradigm. 
CAICT's 2026 evaluations add a new dimension: chain-of-thought reasoning models are <strong>200% more exploitable</strong> under adversarial attacks, and 6% of reasoning traces leak content that output filters would otherwise block. DeepSeek R1 has a trivial-to-trigger 'infinite output' vulnerability — specific prompts cause an unstoppable reasoning loop, creating a novel denial-of-service class.</p><blockquote>The economics of offensive testing have fundamentally shifted. Your enterprise AI product will face autonomous reconnaissance at machine speed — not human speed.</blockquote><h3>The Emerging Defense Stack</h3><p>Three architectural patterns are converging as the industry response: <strong>deterministic rule-based guardrails</strong> at the speed layer (sub-millisecond checks), <strong>probabilistic AI-based governance</strong> at the intelligence layer, and <strong>human-in-the-loop escalation</strong> at the trust layer. Onyx Security's $40M launch for an AI agent governance control plane validates this as a distinct product category. Anthropic published an attack-agent security blueprint. The recommended default posture: <strong>treat every AI agent as an untrusted client</strong>, route all requests through an identity/permission gateway, and implement reasoning-step token budgets with circuit breakers.</p>
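The recommended controls above (hard token limits, cost caps, circuit breakers) can be expressed as a thin wrapper around every agent inference call. The following is a minimal sketch, not any vendor's SDK: the class name, thresholds, and trip policy are all illustrative assumptions.

```python
import time

class BudgetExceeded(Exception):
    """Raised when a request trips a hard limit."""

class InferenceCircuitBreaker:
    """Deterministic speed-layer guardrail: per-request token caps, an
    hourly cost cap, and a trip switch that quarantines the agent after
    repeated violations. All thresholds are illustrative defaults."""

    def __init__(self, max_tokens_per_request=8_000,
                 max_cost_per_hour=5.00, trip_after=3):
        self.max_tokens = max_tokens_per_request
        self.max_cost = max_cost_per_hour
        self.trip_after = trip_after
        self.violations = 0
        self.hour_start = time.monotonic()
        self.hour_cost = 0.0
        self.open = False  # an "open" breaker rejects all traffic

    def charge(self, tokens: int, price_per_m_tokens: float) -> float:
        if self.open:
            raise BudgetExceeded("circuit open: agent quarantined")
        # Reset the rolling cost window each hour.
        if time.monotonic() - self.hour_start > 3600:
            self.hour_start, self.hour_cost = time.monotonic(), 0.0
        cost = tokens / 1_000_000 * price_per_m_tokens
        if tokens > self.max_tokens or self.hour_cost + cost > self.max_cost:
            self.violations += 1
            if self.violations >= self.trip_after:
                self.open = True  # stop the agent entirely, not just this call
            raise BudgetExceeded(f"limit hit (violation {self.violations})")
        self.hour_cost += cost
        return cost
```

Routing every agent call through `charge()` before it reaches the model API is the kind of deterministic check that would contain a DeepSeek-style infinite-output loop: the loop still starts, but it cannot run up an unbounded bill.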
Action items
- Commission an adversarial AI pen-test of your product's AI-facing endpoints this sprint — specifically targeting unauthenticated API routes, database access paths, and system prompt storage. Use the McKinsey breach as justification.
- Implement prompt injection attack surface documentation for every agent feature in your current and planned PRDs by end of month.
- Add reasoning-trace content moderation and inference circuit-breakers (hard token limits, cost caps, anomaly detection) before any CoT model integration or upgrade.
- Evaluate Onyx Security or similar agent governance platform for your enterprise AI deployment by Q3.
Sources: Your AI product's security model is already broken · 66% of MCP servers are vulnerable · Your AI agent roadmap has a security crisis — 93% of deployed agents fail basic auth · McKinsey's AI got hacked in 2 hours via SQL injection · Prompt injection is still unsolvable · Reasoning models have a new vulnerability class
02 The Harness Is the Product Now: Stripe, Ben Thompson, and the Architecture That Actually Wins
<h3>The Third LLM Paradigm Demands a New Playbook</h3><p>Ben Thompson's latest Stratechery piece crystallizes the shift: we're in the <strong>third paradigm of LLMs</strong> — agents — and the competitive moat isn't the model, it's the harness. The evidence is specific: Anthropic's Opus 4.5 launched November 24, 2025 to relative silence. It was the <strong>harness upgrade weeks later</strong> that made Claude Code transformative. Microsoft validated this by launching its E7 enterprise tier at <strong>$99/seat/month</strong> — double the former top-tier E5 — and had to abandon model-agnosticism and share margin with Anthropic to ship it. If Microsoft can't make model-agnostic agents work, that's your signal.</p><h3>Stripe Proves Infrastructure Is the Prerequisite</h3><p>Stripe's disclosure is the most important production data point in this cycle. Their Minions agents merge <strong>1,300+ PRs per week with zero human-written code</strong>, achieving a <strong>2.5x throughput multiplier</strong>. But the decisive enabler wasn't any AI model — it was devboxes spinning up in under 10 seconds, a battery of <strong>3 million+ tests</strong>, sub-5-second linting, and isolated QA environments, <em>all built years before LLMs existed</em>. Stripe caps agent CI retries at exactly 2 rounds — a deliberate 'good enough' philosophy where a partially correct PR polished by an engineer in 20 minutes is still a significant win.</p><blockquote>The model isn't the moat — the platform is. 
Infrastructure maturity, not model selection, is the primary constraint on AI agent deployment ROI.</blockquote><h3>The Architecture Is Converging</h3><p>Across a16z's $380B enterprise AI thesis, NVIDIA's agentic scaling framework, Anthropic's SDK (now shipping <strong>sub-agents and agent teams</strong> as first-class primitives), and the 16-feature agent scorecard, a common architecture emerges:</p><ol><li><strong>Semantic data model</strong> of business objects → governed actions with RBAC/approvals → thin composable apps → reusable 'intent packs'</li><li><strong>Push-based AI</strong> (cron + autorun + multi-channel delivery) replacing pull-based chat — identified as 'the most slept-on feature in agent infrastructure'</li><li><strong>MCP as the integration standard</strong> — Chrome DevTools ships it in M144, Stripe hosts ~500 MCP tools internally, 300+ pre-built servers in the Docker ecosystem</li><li><strong>Multi-model routing</strong> within single workflows — cheap models for tool calls, frontier models for reasoning</li></ol><p>The agent creation layer has commoditized (120+ agents under MIT license, 31K GitHub stars). Value is migrating to <strong>orchestration</strong>, <strong>proprietary data integration</strong>, and <strong>governance</strong>. Claude's SDK distinction between sub-agents (isolated context, fire-and-forget) and agent teams (persistent, peer-to-peer messaging, shared task lists) is the most important architectural decision framework for your next PRD.</p><h4>The 'SaaS Extinction Test'</h4><p>AI agents are bifurcating enterprise SaaS into survivors and the disrupted. The dividing line: <strong>data gravity</strong>. Salesforce survives because replacing it means migrating decades of customer records. PagerDuty is vulnerable because an AI agent can replicate its alerting logic in days. Salesforce and ServiceNow are building native AI agent layers to maintain <strong>130%+ net revenue retention</strong>. 
Run this test on your own product.</p>
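The multi-model routing pattern in the converging architecture above reduces to a small routing policy. This sketch is illustrative only: the tier names, prices, and task taxonomy are placeholders keyed to the pricing spread discussed in this issue, not any provider's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelTier:
    name: str
    output_price_per_m_tokens: float  # illustrative, not a quote

# Two tiers mirroring the 360x spread: cheap for structured work,
# frontier for open-ended reasoning.
CHEAP = ModelTier("fast-cheap-model", 0.50)
FRONTIER = ModelTier("frontier-reasoner", 180.00)

# Task kinds that are cheap-tier safe (policy is an assumption;
# tune against your own eval results).
STRUCTURED_TASKS = {"tool_call", "classification", "extraction", "batch"}

def route(task_kind: str) -> ModelTier:
    """Send structured, verifiable work to the cheap tier and reserve
    the frontier model for reasoning-heavy steps in the same workflow."""
    return CHEAP if task_kind in STRUCTURED_TASKS else FRONTIER
```

The design choice worth noting: routing happens per step inside a single workflow, not per product or per feature, which is what makes the cost gap between tiers compound across an agent's tool-calling loop.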
Action items
- Audit your AI architecture for model-harness coupling this quarter. Run a spike comparing your current abstracted approach vs. tightly integrated model+harness for your top use case.
- Conduct a 'developer infrastructure audit' with your eng lead: map CI speed, test coverage, and environment provisioning against Stripe's benchmarks (sub-10s devbox, sub-5s lint, 3M+ tests).
- Run the SaaS Extinction Test: classify every feature as system-of-record vs. pure workflow, and calculate what percentage of revenue comes from workflow features an AI agent could replicate.
- Expose your product's top 3 capabilities via MCP this quarter — both consuming and serving MCP endpoints.
Sources: The harness is the moat now · Stripe's 1,300 zero-human-code PRs/week · a16z just mapped the $380B AI layer over legacy enterprise · Jensen just declared your SaaS product obsolete · A 16-feature scorecard for AI agents · The agent stack just commoditized
03 The Pricing Earthquake: 1M Context at Flat Rate + 360x Model Spread Rewrites Your Feature Economics
<h3>The RAG Workaround Tax Just Got Eliminated</h3><p>Anthropic's decision to offer <strong>1M token context at standard pricing</strong> — no multiplier, no premium — on Claude Opus 4.6 and Sonnet 4.6 across AWS Bedrock, Google Vertex AI, and Azure Foundry is a market structure change, not a feature update. For two years, the implicit deal was: big context windows exist but they're expensive, so you build RAG pipelines, chunking strategies, and multi-call orchestration. That entire architectural assumption just got <strong>repriced to zero</strong>. The 78.3% score on MRCR v2 benchmark means retrieval accuracy in massive contexts is genuinely usable. Claude Code now defaults to 1M context for enterprise tiers. 600 images/PDF pages per request.</p><blockquote>If you've been using RAG as a workaround for economic constraints that no longer exist, your competitor who just stuffs the whole document into a single 1M call will ship faster and fail less.</blockquote><h3>The 360x Pricing Spread Is Your Biggest Optimization Lever</h3><p>The model pricing landscape now spans an extraordinary range:</p><table><thead><tr><th>Model</th><th>Output Price/M tokens</th><th>Best For</th></tr></thead><tbody><tr><td>GPT-5.4 Pro</td><td><strong>$180</strong></td><td>Frontier reasoning</td></tr><tr><td>Claude Opus 4.6</td><td>~$15</td><td>Complex tasks + 1M context</td></tr><tr><td>GLM-5-Turbo (744B MoE)</td><td>$0.96</td><td>Agentic tool-calling</td></tr><tr><td>Grok 4.1 Fast</td><td><strong>$0.50</strong></td><td>Classification, extraction</td></tr></tbody></table><p>Combined with the finding that <strong>context engineering and knowledge graphs cut token usage by up to 80%</strong>, the gap between a well-architected AI product and a naively built one isn't 20-30% — it's <strong>orders of magnitude</strong>. 
Additionally, the AWS-Cerebras partnership delivering <strong>5x token throughput</strong> via disaggregated inference (Trainium for prefill, Cerebras WSE for decode) is now available via Bedrock.</p><h3>Production Patterns From Billion-User Scale</h3><p><strong>Spotify</strong> generated 1.4 billion personalized narrative reports using a fine-tuned LLM, then distilled a smaller model for economic viability at scale. They built distributed pipelines and used <strong>automated LLM-based evaluation</strong> for accuracy and safety. This is the production pattern: fine-tune → distill → automate evaluation → ship. <strong>LinkedIn</strong> replaced demographic-based feed ranking with LLM-generated embeddings and a Generative Recommender using causal attention transformers — explicitly abandoning demographic features. If LinkedIn, with the richest professional demographic data on earth, says behavioral embeddings beat demographics, that's your signal to reassess your own personalization stack.</p><h3>New Metrics for AI-Native Products</h3><p>The traditional SaaS measurement stack is breaking. ARR is described as 'lying to you' for AI-native companies because <strong>token consumption</strong> is the hardest-to-fake engagement metric. DAUs become meaningless when AI agents interact via MCP instead of humans clicking buttons. Investors are shifting to <strong>gross profit per million tokens</strong> as the actual scorecard. If your product has AI agents as users, you need a parallel measurement system: tokens consumed, margin per inference, agent-vs-human interaction ratios.</p>
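The gross-profit-per-million-tokens metric described above is a one-line calculation worth instrumenting directly. The formula is the straightforward reading of the metric as described here, and the figures in the example are invented for illustration.

```python
def gross_profit_per_m_tokens(revenue: float, inference_cost: float,
                              tokens_consumed: int) -> float:
    """Gross profit for a period, normalized per million tokens served.
    revenue and inference_cost are in the same currency for the same
    period; tokens_consumed is the raw token count for that period."""
    return (revenue - inference_cost) / (tokens_consumed / 1_000_000)

# Example: $50K revenue, $12K inference spend, 800M tokens served
# → (50,000 − 12,000) / 800 = $47.50 gross profit per M tokens.
```

Tracked alongside agent-vs-human traffic ratios, this gives the parallel measurement system the text argues for: it falls when naive prompting burns tokens and rises when context engineering or model routing improves margin, which DAU/ARR dashboards cannot show.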
Action items
- Re-cost every LLM-powered feature against 1M context at standard pricing this sprint. Identify features previously deprioritized due to token cost and reassess feasibility.
- Build a tiered model architecture: frontier models for high-stakes reasoning, cheap models ($0.50-$1/M tokens) for classification, extraction, and batch processing. Document in an ADR by end of month.
- Instrument agent/MCP traffic separately from human sessions. Create a dashboard showing token consumption, cost-per-inference, and agent-vs-human ratios alongside existing DAU/MAU.
- Audit whether your RAG pipeline is a genuine quality differentiator or was a workaround for economic constraints. A/B test 'full context window' against your chunking pipeline for your top 3 document-heavy use cases.
Sources: Anthropic just killed long-context pricing · Anthropic's 1M context at flat pricing rewrites your LLM cost model · Spotify & LinkedIn just showed you how to ship LLM personalization at billion-user scale · Claude's 1M context at standard pricing + MCP winning enterprise · Your KPIs are about to break — token consumption is replacing ARR · AI just shifted from IT to labor budgets
04 Your Discovery Funnel Is Breaking: 94% of AI Citations Bypass Paid, Reddit Owns Your Keywords
<h3>AI Search Is a Structural Break, Not an Incremental Shift</h3><p>Gartner's data is unambiguous: <strong>94% of AI search citations come from non-paid sources</strong>, with earned media accounting for 82% and journalism alone at 20-30%. Over half of citations reference content published in the last 12 months, with rates <strong>peaking in the first week after publication</strong>. Gartner is advising CMOs to double PR and earned media budgets by 2027. If your product's top-of-funnel relies on paid search, Google Ads, or SEM, your visibility in the AI-mediated buying journey is actively eroding.</p><h3>Reddit Is Your Shadow Competitor</h3><p>The data is startling: Reddit outranks B2B SaaS vendors on <strong>50-66% of shared keywords</strong> across three of four major verticals, capturing <strong>950K+ monthly searches</strong> before any vendor appears. Reddit dominates long-tail queries (73-100% win rate on 6+ word searches) — exactly the high-intent, consideration-stage queries where your conversion rates should be highest. Just five subreddits generate 3,709 keyword rankings and 1.1M+ monthly searches. Whatever sentiment and recommendations exist in your relevant subreddits are now your <em>de facto</em> product positioning for a massive chunk of organic discovery.</p><blockquote>A prospective buyer searching for your product category is more likely to land on a Reddit thread than your website. If you don't have a Reddit monitoring strategy, you're flying blind at the highest-intent moment.</blockquote><h3>Google Maps Consolidates Local Discovery</h3><p>Google's 'Ask Maps' Gemini integration transforms Maps from navigation to an AI concierge. Google owns the user's intent, the data (reviews, photos, search history), and distribution (pre-installed on every Android). Adding Gemini collapses the journey from 'find a good restaurant → open Yelp → read reviews → switch to Maps' into a single interaction, with Street View previews and parking suggestions. 
When asked about sponsored placements, Google PM Andrew Duchi gave a notably vague answer — <strong>the absence of a 'no' is a roadmap signal</strong>. There's a narrow window to position on transparent, unbiased AI recommendations before Google monetizes this surface.</p><h3>New Distribution Surfaces Are Forming</h3><p>AI chat interfaces are becoming primary distribution channels. <strong>Experian</strong> launched a credit score tool inside ChatGPT targeting 18-34 users who've never visited a credit bureau. <strong>Perplexity</strong> built a portfolio analyzer with FactSet, S&P Global, and LSEG data via Plaid. AWS and Visa are building <strong>network-agnostic payment rails for AI shopping agents</strong>. The pattern generalizes: for every product category, there's a version of 'our target users are already inside AI interfaces but have never used our product.'</p>
Action items
- Query your product name and category keywords in ChatGPT, Gemini, and Perplexity this week. Document citations, accuracy, and competitive positioning.
- Map your product's Reddit presence: identify 3-5 relevant subreddits, audit sentiment, and establish weekly monitoring. Share with GTM team.
- Add 'earned media potential' as a scoring criterion in your next feature prioritization exercise — score likelihood of press, community discussion, or UGC.
- Evaluate AI-native distribution (ChatGPT plugins, Perplexity integrations, Claude tool-use) as a formal channel alongside app/web. Map which user journeys could originate inside AI interfaces.
Sources: Your discovery funnel is breaking — 94% of AI citations bypass paid · Google Maps just became a discovery platform · AI agents are getting payment rails · 30% of Polymarket wallets are now bots
◆ QUICK HITS
SWE-bench is broken: maintainer review of 296 AI-generated PRs found ~50% would not actually merge — replace benchmark scores with internal merge-rate metrics for any AI coding tool evaluation.
Your AI coding tool bets need recalibrating
Figma employees demoed a live production→Figma→code bidirectional loop using MCP and Claude Code — the design handoff is collapsing into a near-realtime cycle. Run a 2-week pilot.
Your design handoff is dead
Lean FRO used Claude to convert zlib (production C library) to formally verified Lean code with machine-checked proofs — formal verification of AI-generated code is now possible, not theoretical.
AI agents are gaming your metrics by design
Google Stitch (formerly a design tool) is becoming a 3D workspace that generates functional React apps from designs — every design-to-code startup has ~6 months before Google I/O 2026.
Anthropic's 1M context at flat pricing rewrites your LLM cost model
Agent identity platform war: Meta acquired Moltbook (agent social network) and OpenAI hired OpenClaw's creator within weeks. Altman says 'OpenClaw is not a fad.' Pick your protocol side this quarter.
Your AI product's security model is already broken
Update: Adobe dark-pattern penalty now $150M total ($75M fine + $75M in free services) — DOJ is treating subscription friction as consumer fraud, not bad UX. Audit your cancellation flow.
Adobe's $150M dark-pattern penalty is your wake-up call
Qwen has overtaken Meta's Llama as the most-deployed self-hosted LLM across 500K+ developer infrastructure logs, despite lower brand visibility. Re-evaluate if Llama is your default.
Spotify & LinkedIn just showed you how to ship LLM personalization at billion-user scale
AI-free labeling movement is crystallizing: a globally recognized 'AI-free' logo is in development, and the QuitGPT campaign is gaining traction. Ensure AI features are opt-in with transparent labeling.
Anti-AI backlash is now a product risk
AWS + Visa are building network-agnostic payment rails specifically for AI shopping agents — agentic commerce infrastructure is now under active construction by the biggest players.
AI agents are getting payment rails
Karpathy's autoresearch ran 700 experiments in 2 days, found 20 improvements; Shopify's Tobi Lutke got ~53% speed improvement on 20-year-old Liquid codebase. Pilot automated experimentation on mature codebases.
Your AI product's security model is already broken
Docker is building a full AI agent runtime: Agent framework, Sandboxes (microVM isolation), Model Runner, 300+ MCP servers, Claude Code integration. Monitor as potential default dev AI layer.
Docker's AI platform grab + the codegen-isn't-productivity debate
OpenAI's age-verification system misclassifies minors as adults 12% of the time — if building age-gated AI features, this is the failure-rate benchmark regulators will reference.
OpenAI's 12% age-gate failure rate just set the benchmark
Dick's Sporting Goods app briefly topped ChatGPT, Claude, and Gemini on the App Store via a single viral gamification feature converting fitness milestones to loyalty points. Study the real-world→digital reward loop.
Dick's app beat ChatGPT on the App Store
BOTTOM LINE
The AI agent stack is simultaneously commoditizing (120+ agents free under MIT, 1M context at flat pricing, 360x model cost spread) and catastrophically insecure (66% of MCP servers vulnerable, $20 to breach McKinsey, prompt injection confirmed unsolvable). The PMs who win this cycle aren't shipping the most agent features — they're shipping agent features with security architecture, harness-level differentiation, and economics that survive the 360x pricing reality. Your model choice is a commodity; your harness, your data, and your governance are the moat.
Frequently asked
- What concrete steps should a PM take this sprint in response to the McKinsey Lilli breach?
- Commission an adversarial AI pen-test against your product's AI-facing endpoints — focused on unauthenticated routes, database access paths, and system prompt storage — and add prompt-injection attack-surface documentation to every agent feature in current PRDs. The $20, two-hour breach economics mean autonomous scanning is already happening, so treat every agent as an untrusted client behind an identity/permission gateway.
- Does Anthropic's 1M-token context at flat pricing mean we should abandon our RAG pipeline?
- Not automatically, but you should A/B test full-context calls against your chunking pipeline for top document-heavy use cases. Much of the industry's RAG architecture was a workaround for token economics that no longer exist on Claude Opus 4.6 and Sonnet 4.6 (78.3% MRCR v2, standard pricing on Bedrock/Vertex/Azure). Keep RAG only where it's a genuine quality or latency differentiator.
- Why are traditional SaaS metrics like ARR and DAU insufficient for AI-native products?
- Because AI agents interacting via MCP don't click buttons, and token consumption, not seat count, drives cost and engagement. Investors are shifting to gross profit per million tokens as the real scorecard. Instrument agent vs. human traffic separately and track tokens consumed, cost per inference, and margin per workflow alongside existing engagement metrics.
- How should we think about model selection given the 360x pricing spread?
- Build a tiered architecture rather than standardizing on one model. Use frontier models (GPT-5.4 Pro at $180/M, Opus 4.6) only for high-stakes reasoning, and route classification, extraction, and batch work to models like Grok 4.1 Fast ($0.50/M) or GLM-5-Turbo ($0.96/M). Combined with context engineering, this can cut token spend by up to 80%.
- What does the Reddit keyword dominance data mean for product marketing?
- Reddit is effectively your shadow competitor at the top of the funnel, outranking B2B SaaS vendors on 50-66% of shared keywords and winning 73-100% of long-tail consideration-stage queries. Whatever sentiment lives in your category's subreddits is your de facto positioning for a huge share of organic discovery, so establish weekly monitoring of 3-5 relevant subreddits and feed findings into GTM and roadmap.