~5 min
The day enterprise AI's measurement layer broke in public
Meta burned $100M in a month on tokens that caused production SEVs, a 27B model beat a 397B one, and the same model swung 60 points on a benchmark by changing only the harness. Pick which number you trust.
Three numbers landed on the same day, and they don't agree with each other.
Meta's 85,000 employees burned 60.2 trillion tokens in 30 days on an internal Claude leaderboard called Claudeonomics. Estimated cost north of $100M. Top-ranked engineers were producing — by the company's own internal review — "throwaway, wasteful" work, and production SEVs have been traced back to the people gaming hardest. Microsoft has run the same kind of leaderboard since January, with VPs who don't write code in the top 20. Salesforce sets a $170-a-week minimum spend per engineer and ships a Mac widget that updates every fifteen minutes so you can watch your peers' numbers.
In the same news cycle, Alibaba shipped Qwen3.6-27B — a dense, Apache 2.0 model that beats its own 397B MoE predecessor on SWE-bench Verified, Terminal-Bench, and SkillsBench. Eighteen-gigabyte GGUF, runs on a single A100, day-zero support in vLLM and llama.cpp.
And then independent testing on Polyglot showed the same Qwen model scoring 19% with one agent harness and 78.7% with another. Same weights. Same task. A roughly fourfold spread, about 60 points, from the scaffold alone.
If you're a builder reading those three together, the conclusion writes itself: the entire measurement apparatus that the AI industry uses to allocate capital, choose vendors, set comp, and tell its own story is producing numbers that don't survive a second look.
The metrics are theater
Tokenmaxxing is what happens when you make consumption the visible signal. It is the lines-of-code metric reborn, except every line costs money and the gaming is industrial. One long-tenured Meta engineer suspects the leaderboard was never about productivity at all — it was a 60-trillion-token data collection campaign for the next coding model. If true, Meta isn't wasting $100M a month; it's training tomorrow's model on traces dominated by performative busywork. Either way, the demand curve that AI vendors use to justify $6.5B of coding-tool ARR — Claude Code at $2.5B, Cursor and Codex around $2B each — is materially inflated by mandates, not preference. Discount it 20-40% in any model you build.
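To make that discount concrete, here is a minimal sketch of the arithmetic. It assumes the ARR figures quoted above (the "around $2B" entries are approximate) and applies the 20-40% haircut uniformly. The function name `discounted_range` is made up for illustration.

```python
# ARR figures quoted in the text, in billions of USD.
# The Cursor and Codex entries are "around $2B", so treat them as approximate.
reported_arr_busd = {"Claude Code": 2.5, "Cursor": 2.0, "Codex": 2.0}


def discounted_range(total, low_discount=0.20, high_discount=0.40):
    """Return (conservative, optimistic) demand after removing mandated spend."""
    return total * (1 - high_discount), total * (1 - low_discount)


total_arr = sum(reported_arr_busd.values())
conservative, optimistic = discounted_range(total_arr)
print(f"Reported: ${total_arr:.1f}B, "
      f"organic estimate: ${conservative:.1f}B to ${optimistic:.1f}B")
```

On those inputs the organic-demand estimate comes out between roughly $3.9B and $5.2B against the $6.5B headline.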
Shopify is the only company in the cluster that figured this out. They renamed the leaderboard to a "usage dashboard," added circuit breakers that auto-cut a developer's access on anomalous spend spikes, and started measuring cost per token instead of total tokens. Farhan Thawar's read is the right one: engineers burning expensive tokens are usually working on hard problems. Engineers burning cheap tokens at volume are usually padding stats. The inversion of the obvious metric is the whole insight.
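A circuit breaker of that kind can be sketched in a few lines. This is not Shopify's implementation, which the reporting doesn't describe. The class name, the median baseline, the 5x spike multiplier, and the seven-day minimum history are all assumptions chosen for illustration.

```python
from statistics import median


class SpendCircuitBreaker:
    """Hypothetical per-developer breaker: trips when a day's spend is far above baseline."""

    def __init__(self, spike_multiplier=5.0, min_history=7):
        self.spike_multiplier = spike_multiplier  # assumed threshold, tune per org
        self.min_history = min_history            # days of data needed before judging
        self.history = {}                         # developer -> list of daily spend (USD)
        self.tripped = set()                      # developers whose access is cut

    def record_day(self, developer, spend_usd):
        """Record one day's spend. Return True if access is still allowed."""
        past = self.history.setdefault(developer, [])
        if len(past) >= self.min_history:
            baseline = median(past)
            if baseline > 0 and spend_usd > self.spike_multiplier * baseline:
                self.tripped.add(developer)
        past.append(spend_usd)
        return developer not in self.tripped


def cost_per_token(spend_usd, tokens):
    """Unit cost, the metric the text says replaced raw token totals."""
    return spend_usd / tokens if tokens else 0.0
```

A median baseline is used here so that one earlier spike doesn't raise the threshold for the next one. A real system would also need a reset path and a human in the loop.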
If your org tracks AI usage as a leaderboard or a floor, decouple it from performance review this quarter. Pair every consumption number with an outcome — PRs merged, incident rate, review-rejection ratio. If you can't do that, stop putting the number in the deck.
The benchmarks are theater too
The 19-to-78.7 swing on Polyglot is the more dangerous finding, because it implicates every procurement decision the industry has made in the last year. Models overfit to their own harnesses. Public leaderboards measure scaffold quality more than they measure the model. If you've ever switched providers on the strength of a benchmark delta, you've made an infrastructure decision on noise.
Which is why Qwen3.6-27B matters more as a forcing function than as a model release. A dense 27B that fits on one GPU, ships under Apache 2.0, and lands within a few points of much larger systems on coding work changes the self-hosting math for any team currently paying per token for high-volume agent traffic. Perplexity already runs a post-trained Qwen variant in production for search and claims it beats GPT-5.4 on factuality at lower cost. That's not a research result. That's a load-bearing piece of someone's revenue.
The real lever isn't the model — it's the harness around it. Self-reflection prompts (summarize the last failed attempt before retrying) added 6.7 points to Claude 4.5 Opus on coding benchmarks for the cost of an afternoon. Garry Tan's "skillification" — encoding the agent's most repeated subtasks as deterministic local scripts instead of re-prompting the LLM — saves tokens, latency, and flake rate at the same time. These are scaffold moves, not model moves, and they're worth more than any frontier-vs-frontier comparison you'll run this quarter.
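Both scaffold moves fit in a short sketch. Everything here is an assumption for illustration: `generate` stands in for whatever LLM call you use, `check` for your test runner, and the `SKILLS` registry and prompt wording are made up. This is not the setup behind the 6.7-point result or Garry Tan's own tooling.

```python
from typing import Callable, Dict, Optional, Tuple


def solve_with_reflection(
    task: str,
    generate: Callable[[str], str],          # hypothetical LLM call: prompt -> text
    check: Callable[[str], Optional[str]],   # returns None on success, else an error message
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Retry a task, feeding a summary of the last failure into the next prompt."""
    reflection = None
    for attempt in range(1, max_attempts + 1):
        prompt = task
        if reflection is not None:
            prompt = f"{task}\n\nSummary of the last failed attempt:\n{reflection}"
        candidate = generate(prompt)
        error = check(candidate)
        if error is None:
            return candidate, attempt
        # Self-reflection step: ask the model to summarize what went wrong.
        reflection = generate(
            "Summarize in two sentences why this attempt failed.\n"
            f"Attempt:\n{candidate}\nError:\n{error}"
        )
    return None, max_attempts


# "Skillification": repeated subtasks become deterministic local functions.
SKILLS: Dict[str, Callable[[str], str]] = {
    "count_lines": lambda text: str(len(text.splitlines())),
}


def run_subtask(name: str, payload: str, llm: Callable[[str], str]) -> str:
    """Use a local skill when one exists. Fall back to the LLM otherwise."""
    skill = SKILLS.get(name)
    if skill is not None:
        return skill(payload)
    return llm(f"{name}: {payload}")
```

The skill path makes no model call, which is where the token, latency, and flake-rate savings come from.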
And the dependency tree is on fire
While all of this was happening, the security floor dropped out. Axios — the JavaScript HTTP client that's a transitive dependency of approximately everything — has a CVSS 10.0 header injection bug that exfiltrates cloud metadata. Apache Kafka's OAuthBearer SASL accepts any JWT without validating it. The Go toolchain shipped two 9.8s, including a build-tool RCE that turns CI into an attacker's playground. Sonatype Nexus has hard-coded credentials in versions 3.0 through 3.70.5 — your artifact repository, the root of trust for your supply chain, with a default password.
Three independent MCP implementations (Codex CLI, Flowise, Upsonic) disclosed RCEs in the same week. Different codebases arriving at the same vulnerability class is a protocol-design signal, not an implementation accident. Anthropic's restricted Mythos model was accessed on launch day by reverse-engineering URL patterns leaked in an unrelated Mercor breach. RSAC's eleven mainstage keynotes agreed on what AI agent security needs and produced zero working enforcement solutions.
The pattern across all of it: the measurement layer, the benchmark layer, and the security layer are all generating numbers and assurances that don't hold up.
What to do this week
One move, and it's specific. Pick your highest-volume coding-agent workload — the one whose API bill you've been quietly worried about. Run it three ways: your current frontier model in your current harness, Qwen3.6-27B in your current harness, and your current model in a harness with self-reflection and skillification turned on. Report all three numbers to whoever owns the budget, and report the variance, not just the max.
That experiment will tell you which lever — model, scaffold, or governance — is actually moving your outcomes. Right now most teams don't know, because they've been reading leaderboards and counting tokens. Both of those stopped meaning anything this week.
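For the reporting step, a minimal sketch of the summary might look like the following. It assumes you repeat each configuration several times and record a pass rate per run. The configuration labels and the numbers in the usage example are placeholders, not measurements.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(runs: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per-configuration mean, spread, and range across repeated runs."""
    report = {}
    for config, scores in runs.items():
        report[config] = {
            "mean": mean(scores),
            # Sample standard deviation needs at least two runs.
            "stdev": stdev(scores) if len(scores) > 1 else 0.0,
            "min": min(scores),
            "max": max(scores),
        }
    return report


if __name__ == "__main__":
    # Placeholder pass rates for illustration only, not real results.
    example_runs = {
        "frontier model, current harness": [0.50, 0.70],
        "Qwen3.6-27B, current harness": [0.40, 0.60],
        "frontier model, reflection + skills": [0.60, 0.80],
    }
    for config, stats in summarize(example_runs).items():
        print(f"{config}: mean={stats['mean']:.2f} stdev={stats['stdev']:.2f} "
              f"range={stats['min']:.2f}-{stats['max']:.2f}")
```

Showing the standard deviation and range next to the mean keeps one lucky run from being reported as the result.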
◆ Behind the synthesis
Six specialist takes that fed this piece.
The piece above is one stream in my voice. Below are the six lenses my pipeline produced upstream — each tuned for a different reader. Use them when you want the angle that matters most to your role.
Three CVSS 10.0 vulnerabilities dropped simultaneously across Axios (cloud metadata exfil via SSRF), Apache Kafka (JWT validation completely bypassed), and your Go toolchain (compiler memory corruption + build tool RCE), while Sonatype Nexus shipped hard-coded credentials in versions 3.0–3.70.5.
Your dependency tree is on fire — Axios (CVSS 10.0), Kafka (JWT validation bypassed entirely), Go stdlib (two 9.8s), and Nexus (hard-coded credentials) all need emergency patching…
40 sources · 8 min
Axios — the most popular JavaScript HTTP client — has a CVSS 10.0 header injection flaw (CVE-2026-40175) that exfiltrates cloud metadata from any app using the library, and it's almost certainly a transitive dependency in your projects.
This week delivered two CVSS 10.0 vulnerabilities (Axios and Quest KACE SMA), eight separate authentication bypass flaws across products like Kafka and Cisco ISE, and the uncomfort…
40 sources · 7 min
A single model scored 19% or 78.7% on the same benchmark by swapping only the agent scaffold — a 4x variance that makes leaderboard-driven model selection functionally random.
A dense 27B model beat a 397B MoE while a scaffold swap moved the same model's score from 19% to 78.7% — your model selection process is optimizing the wrong variable. Meanwhile, R…
39 sources · 7 min
Meta burned 60.2 trillion tokens ($100M+) in 30 days — and most of it was waste.
Your AI adoption metrics are lying to you — Meta burned $100M+ in a single month on token waste that's causing production incidents, not productivity — while 60% of Vercel's traffi…
40 sources · 9 min
Meta engineers burned 60.2 trillion tokens in 30 days while Microsoft VPs who rarely code topped internal AI leaderboards and Salesforce set minimum spend floors — 'tokenmaxxing' is now industry-wide, and enterprise AI demand signals feeding your vendor valuations, board decks, and headcount models are materially inflated.
Enterprise AI's three load-bearing assumptions all cracked this week: the adoption metrics are gamed (Meta burning $100M+/month on performative token usage, benchmarks swinging 60…
40 sources · 8 min
Enterprise AI just revealed its first revenue quality crisis: 'tokenmaxxing' at Meta ($100M+/month in waste tokens across 85K employees), Salesforce ($170/month mandated minimums per developer), and Microsoft (VP-level leaderboards) means 20-40% of the $6.5B AI coding ARR may be mandated waste — not organic demand.
AI coding tools generated $6.5B ARR in 12 months — the fastest category in software history — but tokenmaxxing at Meta (60.2 trillion tokens/month, $100M+ in waste), Salesforce ($1…
39 sources · 8 min