The over-engineering tax is bigger than your model bill
Three independent benchmarks landed the same verdict this week: simpler beats complex across RAG, agents, and inference economics. Meanwhile your security stack just failed three live stress tests.
Three numbers from this week, in the order I'd put them on a whiteboard.
On an H100 generating tokens, you're using roughly 1% of the compute you're paying for. Not because your code is bad — because decode is memory-bound at ~1 FLOP/byte and the H100 wants 295. The gap gets worse every chip generation, since compute grows 3x every two years and bandwidth grows at half that rate. FlashAttention doesn't fix it. A bigger GPU doesn't fix it. The physics doesn't care.
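The 295 figure is a roofline ridge point: peak compute divided by memory bandwidth. A minimal sketch of that arithmetic follows. The hardware numbers are my assumptions (approximate public H100 SXM specs for dense BF16 and HBM bandwidth), not figures from the benchmark itself. At exactly 1 FLOP/byte this lands at a fraction of a percent, the same order of magnitude as the ~1% above.

```python
# Roofline-style estimate of compute utilization during decode.
# Hardware figures are ASSUMPTIONS (approximate public H100 SXM specs).
PEAK_FLOPS = 989e12       # assumed dense BF16 peak, FLOP/s
MEM_BANDWIDTH = 3.35e12   # assumed HBM bandwidth, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) needed to keep the compute units busy."""
    return peak_flops / bandwidth


def compute_utilization(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Fraction of peak compute attainable at a given arithmetic intensity."""
    attainable = min(peak_flops, intensity * bandwidth)
    return attainable / peak_flops


ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
decode_util = compute_utilization(1.0, PEAK_FLOPS, MEM_BANDWIDTH)

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"decode utilization at 1 FLOP/byte: {decode_util:.2%}")
```

Batching raises the effective intensity because one weight read serves many requests, which is why the same formula explains both the waste and the only real lever against it.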
FloTorch's 2026 RAG benchmark: simple 512-token recursive character splitting beats semantic and proposition-based chunking on accuracy and produces 3–5x fewer vectors. If your team has been investing sprints in clever chunking, you're paying 3–5x more for worse retrieval.
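For reference, recursive character splitting is small enough to fit on one screen. This is a minimal sketch, not FloTorch's implementation: the separator order is my assumption, and whitespace-separated words stand in for tokenizer tokens, so swap in your real tokenizer before comparing numbers.

```python
# Minimal recursive character splitter (illustrative sketch).
# ASSUMPTIONS: separator order, and whitespace words as a proxy for tokens.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def count_tokens(text):
    """Approximate token count using whitespace-separated words."""
    return len(text.split())


def recursive_split(text, max_tokens=512, separators=SEPARATORS):
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if count_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard-cut on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current.strip():
            chunks.append(current)
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            # Piece is still too large: retry with finer separators.
            chunks.extend(recursive_split(piece, max_tokens, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks
```

Calling `recursive_split(document_text)` returns a list of chunks of at most 512 approximate tokens each. There is no embedding call and no LLM call anywhere in the splitter, which is where the cost difference comes from.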
LangChain's coding agent jumped from Top 30 to Top 5 on Terminal Bench 2.0. Same model. They added self-verification and structured tracing to the harness. That's it.
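To make "self-verification and structured tracing" concrete, here is a minimal sketch of the pattern. It is not LangChain's harness. `generate` and `verify` are hypothetical callables you would back with your model and your test runner, and the `Trace` class is an illustration of structured logging, not a real library API.

```python
# Generic generate -> verify -> retry loop with a structured trace.
# `generate` and `verify` are HYPOTHETICAL callables supplied by the caller.
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Trace:
    events: List[dict] = field(default_factory=list)

    def log(self, step: str, **data) -> None:
        """Record one structured event per harness step."""
        self.events.append({"step": step, **data})


def run_with_verification(
    task: str,
    generate: Callable[[str, str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[Optional[str], Trace]:
    """Retry generation, feeding verifier feedback into the next attempt."""
    trace = Trace()
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task, feedback)
        trace.log("generate", attempt=attempt, output=candidate)
        passed, feedback = verify(task, candidate)
        trace.log("verify", attempt=attempt, passed=passed, feedback=feedback)
        if passed:
            return candidate, trace
    return None, trace  # every attempt failed verification


# Stub model and verifier, for illustration only.
def fake_generate(task: str, feedback: str) -> str:
    return "4" if feedback else "5"


def fake_verify(task: str, candidate: str) -> Tuple[bool, str]:
    ok = candidate == "4"
    return ok, "" if ok else "expected the sum of 2 + 2"


result, trace = run_with_verification("2 + 2", fake_generate, fake_verify)
```

The model never changes in this loop. The only additions are a check the harness can run and a record of what happened on each attempt.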
Three independent domains — inference, retrieval, agents — converging on the same lesson: the dominant failure mode in production AI right now is over-engineering. The highest-leverage work this sprint isn't adding a layer. It's deleting one.
Context length is the bill nobody priced
The single most expensive product decision you'll make this year isn't which model you pick. It's how much context you give it.
A 7B model on an H100 serves 278 concurrent users at 4K context. At 128K, it serves 8. That's a 35x per-user cost increase from the same hardware, and it's because the KV cache eats GPU memory linearly while the quadratic attention term goes from 8% of total compute at 1K to 92% at 128K. Every "agent memory," "document ingestion," or "long conversation history" feature on your roadmap pushes you up that curve. Multi-agent setups where agents pass full traces between each other compound it multiplicatively.
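The memory side of that curve is simple enough to compute yourself. The sketch below assumes a hypothetical 7B-class configuration (32 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache, 14 GB of weights on an 80 GB card). Those are my assumptions, so the absolute user counts will not match the 278 and 8 above, which depend on the benchmark's own model and serving stack. The shape is what matters: per-user KV memory scales linearly with context.

```python
# KV-cache sizing sketch. Every model and hardware number is an ASSUMPTION.
GPU_BYTES = 80 * 10**9      # assumed 80 GB of device memory
WEIGHT_BYTES = 14 * 10**9   # assumed 7B parameters stored in FP16


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache per token: keys and values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_concurrent_users(gpu_bytes, weight_bytes, context_tokens, per_token_bytes):
    """How many full-length contexts fit in the memory left after weights."""
    free_bytes = gpu_bytes - weight_bytes
    return free_bytes // (context_tokens * per_token_bytes)


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2)
users_4k = max_concurrent_users(GPU_BYTES, WEIGHT_BYTES, 4_096, per_token)
users_128k = max_concurrent_users(GPU_BYTES, WEIGHT_BYTES, 131_072, per_token)

print(f"KV cache per token: {per_token} bytes")
print(f"users at 4K: {users_4k}, users at 128K: {users_128k}")
```

Plug in your own layer count, KV head count, and cache precision. The ratio between the two user counts is the number to bring to the pricing conversation.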
Claude Sonnet 4.6 with a 1M token window is a great API product and an economically devastating self-host. Use the long context through providers who absorb the utilization problem. For everything you run yourself, cap defaults at 4K–8K, price longer context as a tier, and design inter-agent protocols that summarize rather than pass raw context. Treat KV cache as a first-class budgeted resource, not a side effect.
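One way to make "cap defaults, price longer context as a tier, summarize between agents" enforceable is to put the budget in code. A minimal sketch under stated assumptions: the tier names and limits are hypothetical, `summarize` is a caller-supplied function (for example, a cheap model call), and whitespace words stand in for tokens.

```python
# Context budgeting sketch. Tier names, limits, and `summarize` are HYPOTHETICAL.
DEFAULT_CAP = 8_192
TIER_CAPS = {"standard": 8_192, "extended": 32_768, "long": 131_072}


def resolve_context_cap(tier):
    """Unknown or missing tiers fall back to the default cap."""
    return TIER_CAPS.get(tier, DEFAULT_CAP)


def approx_tokens(text):
    """ASSUMPTION: whitespace words approximate tokenizer tokens."""
    return len(text.split())


def build_handoff(trace_messages, summarize, budget_tokens):
    """Pass the raw trace only if it fits the budget; otherwise a summary."""
    raw = "\n".join(trace_messages)
    if approx_tokens(raw) <= budget_tokens:
        return raw
    summary = summarize(raw, budget_tokens)
    # Enforce the budget even if the summarizer overshoots.
    return " ".join(summary.split()[:budget_tokens])
```

The hard truncation at the end is deliberate: the budget holds even when the summarizer misbehaves, so KV cache growth stays bounded by configuration rather than by agent behavior.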
The gap between the raw compute floor (~$0.004 per million tokens at full utilization) and what APIs charge ($0.30–$1.25 per million) is roughly 75–310x. That's not margin. That's the operational overhead of running this stuff well, and it's the addressable market for anyone who closes it.
Patch the three things that broke this week
While we were optimizing prompts, three trust assumptions in the security stack failed simultaneously, and at least one is being actively exploited.
Dell RecoverPoint, CVE-2026-22769, CVSS 10.0. Hardcoded admin credential in tomcat-users.xml. Unauthenticated WAR deploy via /manager/text/deploy gets you root. UNC6201 is in the wild with GRIMBOLT — a native-AOT C# backdoor that strips CIL metadata, persists via convert_hosts.sh from rc.local, and pivots through VMware via Ghost NICs and iptables Single Packet Authorization on vCenter. They are deliberately targeting backup infrastructure to deny recovery. Patch today. Audit /home/kos/auditlog/fapi_cl_audit_log.log for /manager requests. Hash-check convert_hosts.sh. Deploy Mandiant's YARA rules.
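If you need a starting point for the audit steps, here is a minimal sketch. It assumes nothing about the log format beyond plain text and simply matches the substring `/manager`. The location of `convert_hosts.sh` and its known-good hash are not given here, so both are parameters you must supply from a trusted source such as a clean appliance image or the vendor advisory.

```python
# Triage helpers (illustrative sketch, not vendor tooling).
# ASSUMPTION: the audit log is plain text; matching is a simple substring test.
import hashlib

AUDIT_LOG = "/home/kos/auditlog/fapi_cl_audit_log.log"  # path as named above


def find_manager_requests(log_path=AUDIT_LOG):
    """Return (line_number, line) pairs for every log line mentioning /manager."""
    hits = []
    with open(log_path, "r", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            if "/manager" in line:
                hits.append((line_number, line.rstrip("\n")))
    return hits


def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def script_is_unmodified(script_path, known_good_sha256):
    """Compare a script's hash against a value obtained from a trusted source."""
    return sha256_of(script_path) == known_good_sha256.lower()
```

Run `find_manager_requests()` on the appliance and treat any hit as a reason to escalate, not as proof of compromise. An empty result is not a clean bill of health either, since an attacker with root can edit logs.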
ADWSDomainDump bypasses CrowdStrike Falcon and Microsoft Defender for Endpoint. It enumerates Active Directory over ADWS on port 9389 instead of LDAP. Both EDR platforms simply don't watch that protocol. This isn't a signature gap — it's an architectural one, and the tool is public. Add network-level detection on port 9389 and segment access to it before next Friday.
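Network-level detection here can start as a filter over flow records you already collect. A minimal sketch, assuming each flow is a dict with `src_ip` and `dst_port` keys (a hypothetical schema) and that you maintain an allowlist of admin subnets permitted to reach ADWS. The subnet shown is a placeholder.

```python
# Flag connections to ADWS (TCP 9389) from sources outside an allowlist.
# ASSUMPTIONS: the flow record schema and the allowlisted subnet are hypothetical.
from ipaddress import ip_address, ip_network

ADWS_PORT = 9389
ALLOWED_SOURCES = [ip_network("10.0.5.0/24")]  # placeholder admin subnet


def flag_adws_flows(flows, allowed=ALLOWED_SOURCES):
    """Return the flows that reach port 9389 from a non-allowlisted source."""
    alerts = []
    for flow in flows:
        if flow["dst_port"] != ADWS_PORT:
            continue
        source = ip_address(flow["src_ip"])
        if not any(source in network for network in allowed):
            alerts.append(flow)
    return alerts


sample_flows = [
    {"src_ip": "10.0.5.20", "dst_ip": "10.0.1.10", "dst_port": 9389},   # allowed admin host
    {"src_ip": "10.0.9.77", "dst_ip": "10.0.1.10", "dst_port": 9389},   # unexpected source
    {"src_ip": "10.0.9.77", "dst_ip": "10.0.1.10", "dst_port": 389},    # LDAP, out of scope here
]
alerts = flag_adws_flows(sample_flows)
```

The same allowlist doubles as the segmentation policy: whatever this filter would alert on is what the firewall rule should block.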
ETH Zurich broke zero-knowledge across Bitwarden, LastPass, and Dashlane. 25 demonstrated attacks. Roughly 60M users affected. Lightweight server impersonation during sync, exploiting feature bloat over 1990s primitives. Full paper drops at USENIX Security 2026, which means weaponized tooling follows shortly after. Start migration planning now, not when the headlines hit.
If your password manager, your backup infrastructure, and your EDR all have confirmed trust failures in the same week, the diagnosis isn't three bugs. It's a security architecture that took vendor claims at face value.
What to do this week
Profile your context-length distribution in production today. Not your max; your actual median and p95 by feature. If your median is over 8K and you don't have a unit-economics number for those sessions, you have a hidden cost bomb and you don't know how big it is.
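The profiling itself is a few lines once you can export one record per request. A minimal sketch, assuming records arrive as `(feature, prompt_tokens)` pairs. That shape is my assumption, so adapt it to whatever your request logs actually contain.

```python
# Median and p95 context length per feature.
# ASSUMPTION: records are (feature_name, prompt_tokens) pairs from request logs.
from collections import defaultdict
from statistics import median, quantiles

MEDIAN_ALERT_TOKENS = 8_192  # the 8K threshold discussed above


def profile_context_lengths(records):
    """Group token counts by feature and report count, median, and p95."""
    by_feature = defaultdict(list)
    for feature, tokens in records:
        by_feature[feature].append(tokens)

    report = {}
    for feature, lengths in by_feature.items():
        if len(lengths) > 1:
            # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
            p95 = quantiles(lengths, n=100, method="inclusive")[94]
        else:
            p95 = lengths[0]
        mid = median(lengths)
        report[feature] = {
            "count": len(lengths),
            "median": mid,
            "p95": p95,
            "over_threshold": mid > MEDIAN_ALERT_TOKENS,
        }
    return report
```

Any feature where `over_threshold` is true is a candidate for the unit-economics review. The p95 column shows how far the tail runs past the median.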
Then pick the simplest thing on your AI roadmap and benchmark it against an even simpler version. Your semantic chunker against 512-token splits. Your model upgrade against a self-verification loop on the model you already have. Your 128K context window against retrieval over 4K. The 2026 evidence is that the simpler version wins more often than the complexity-bias of the field will let you believe.
And before any of that — patch RecoverPoint. The actively-exploited CVSS 10.0 in your DR layer outranks every architecture conversation on your calendar.
◆ Behind the synthesis
Six specialist takes that fed this piece.
The piece above is one stream in my voice. Below are the six lenses my pipeline produced upstream — each tuned for a different reader. Use them when you want the angle that matters most to your role.
Dell RecoverPoint CVE-2026-22769 (CVSS 10.0) is being actively exploited by UNC6201 via a hardcoded Tomcat credential — if you run RecoverPoint for Virtual Machines, stop reading and patch now.
Dell RecoverPoint has a CVSS 10.0 actively exploited hardcoded credential (CVE-2026-22769), your EDR is blind to AD enumeration over ADWS port 9389, and ETH Zurich broke zero-knowl…
8 sources · 6 min
CVE-2026-22769 is a CVSS 10.0 hardcoded credential in Dell RecoverPoint actively exploited by UNC6201 with a new GRIMBOLT backdoor that pivots through VMware via Ghost NICs — patch immediately and hunt for compromise indicators in your DR infrastructure.
Your disaster recovery infrastructure has a CVSS 10.0 actively exploited hardcoded credential (CVE-2026-22769), your EDR is blind to AD enumeration on port 9389, your enterprise pa…
24 sources · 7 min
Your GPU is running at 1% utilization during token generation, your RAG chunking is probably over-engineered, and your A/B tests are likely reporting inflated lifts — three independent sources converge on the same meta-insight today: the biggest cost and accuracy gains come from simplifying, not adding complexity.
Today's strongest signal across 16 sources is that simplicity systematically beats complexity in production ML: 512-token chunking outperforms semantic methods at 3-5x lower cost,…
16 sources · 10 min
Your AI features are hiding a 35x cost multiplier in context length, not model size — and the fix is simpler than you think.
The three biggest AI product levers right now aren't model selection — they're context window sizing (35x cost swing), RAG chunking simplicity (3-5x savings), and harness engineeri…
20 sources · 9 min
Your enterprise security assumptions just failed three simultaneous stress tests: ETH Zurich broke zero-knowledge encryption across all major password managers (60M users exposed), a CVSS 10.0 Dell zero-day is being actively exploited by nation-state actors targeting backup infrastructure, and both CrowdStrike and Microsoft Defender have a confirmed protocol-level blind spot.
Three enterprise security pillars — password managers, backup infrastructure, and EDR detection — all failed empirically this week while AI is simultaneously repricing headcount (K…
23 sources · 9 min
AI capital is repricing at every layer simultaneously: $5B+ in mega-seed rounds dropped this week (Ineffable Intelligence at $4B, World Labs at $1B, Entire at $300M), while inference economics reveal a structural memory-bandwidth wall that makes current GPU infrastructure 99% wasteful for the workloads that matter most.
AI capital is simultaneously repricing at the top (coconut rounds hitting $2B for pre-product companies) and being constrained at the bottom (inference runs at 1% GPU utilization d…
23 sources · 10 min