GRPO + RULER Make Agent RL as Easy as SFT for Data Teams
Topics: LLM Inference · Agentic AI · AI Capital
GRPO + RULER has made reinforcement learning for agents as accessible as SFT was two years ago — the open-source ART framework wraps DeepSeek-R1's algorithm with LLM-as-judge ranking into a production loop with LoRA hot-swapping, zero reward engineering, and zero labeled data. If you're still SFT-only for multi-step agents, you're leaving the single highest-leverage optimization technique untouched while paying 50% more for the GPUs that run them.
◆ INTELLIGENCE MAP
01 Agent RL Training Crosses the Usability Threshold
Act now: GRPO (relative ranking only) + RULER (LLM-as-judge comparative scoring) eliminates reward engineering and labeled data from RL agent training. ART framework wraps this with vLLM + Unsloth + LoRA hot-swap. Meta FAIR separately shows experience replay cuts RL inference compute costs while preserving diversity.
- Reward functions needed: 0
- Labeled data required: 0
- On-device speed: ~25 tok/s (Qwen3-0.6B, iPhone 17 Pro)
- Model export size: 470MB (.pte)
- SFT-only: 40 · GRPO+RULER: 85
02 KV-Cache Routing Is Your Agent Inference Bottleneck
Monitor: NVIDIA is formalizing cache-aware routing, agent_hints metadata, and multi-tier KV storage as the serving layer for multi-agent systems. When coding agents make hundreds of sequential calls with shared history, the bottleneck shifts from GPU throughput to KV-cache lifecycle management. Round-robin routing leaves latency on the table.
- NVIDIA primitives: 4
- Protocol layer: agent_hints
- Open alternatives: vLLM prefix caching, SGLang RadixAttention
- 1. Cache-aware routing (high impact)
- 2. Differentiated cache blocks (medium impact)
- 3. Multi-tier KV storage (medium impact)
- 4. agent_hints protocol (emerging)
03 AI Coding Tools Are Now an Active Supply Chain Attack Vector
Act now: AI coding assistants hallucinate plausible package names that attackers are already squatting on public registries — your next pip install from a Copilot suggestion could be RCE in your ML pipeline. Separately, a Wharton study shows persuasion techniques more than double LLM safety bypass rates, an attack class most red teams don't test.
- Attack type: hallucinated-package squatting (RCE via pip install)
- Jailbreak multiplier: >2x with persuasion techniques
- Leak vectors: Sentry stack traces, npm configs, job postings, error bundles
- CISOs aware: most admit they don't know if they're vulnerable
04 GPU Prices +50% While Zombie Cgroups Kill Training Jobs
Monitor: GPU prices surged ~50% from agent compute demand, with provider outages and cancellations. Simultaneously, Pinterest published a post-mortem tracing Ray training crashes to zombie memory cgroups left by malfunctioning ECS agents — a silent failure that presents as network timeouts, not memory issues. Both are fixable this week.
- GPU price increase: ~50%
- Cause: agent compute demand
- Pinterest root cause: zombie memory cgroups
- Failure presentation: network timeouts
- GPU cost index: baseline 100 → current 150
05 Hardware Bifurcation: CUDA vs Huawei Ascend Ecosystem Fork
Background: DeepSeek V4 is adapting to Huawei Ascend silicon — the first major open-weights lineage decoupling from CUDA. With 50% of AI developers in China and Meta spending $2.3B on Broadcom custom silicon (133% YoY), the global compute ecosystem is fracturing. Every hard CUDA dependency is now portability debt.
- Meta→Broadcom spend: $2.3B
- YoY increase: 133%
- China AI developers: 50% of global
◆ DEEP DIVES
01 The Production Agent RL Stack: GRPO + RULER + Experience Replay
<h3>Why This Changes Your Agent Training Loop</h3><p>The reinforcement fine-tuning stack for LLM agents has crossed a <strong>usability threshold</strong>. Three developments from independent groups converge into a single actionable pipeline: GRPO (DeepSeek-R1's algorithm) provides outcome-based training using only relative rankings; RULER provides those rankings via LLM-as-judge comparative scoring; and Meta FAIR's experience replay technique slashes the compute cost of generating RL rollouts.</p><blockquote>SFT teaches models what to say. GRPO trains them on whether they succeeded. If your agents call tools, reason across steps, or operate in multi-turn settings, SFT-only training is leaving the most impactful optimization on the table.</blockquote><h4>How GRPO Actually Works</h4><ol><li>Generate <strong>N completions</strong> from current policy for each prompt</li><li>Score each via reward function (or RULER judge)</li><li><strong>Normalize within group</strong> — only the ordering drives learning</li><li>Reinforce above-average behaviors, suppress below-average</li></ol><p>The critical property: whether completions score 0.3/0.5/0.7 or 30/50/70, <strong>only the rank order matters</strong>. This makes GRPO robust to noisy judges, poorly calibrated signals, and even binary pass/fail evaluations.</p><h4>RULER Eliminates Reward Engineering</h4><p>RULER exploits a psychometric finding: comparative judgment ("which of these 4 is best?") is <strong>far more reliable</strong> than absolute scoring ("rate 0-10"). It passes N trajectories to a judge LLM which produces relative rankings fed directly as GRPO rewards. <em>No reward function design. No labeled data. No human annotation.</em></p><h4>The ART Framework Architecture</h4><p>The open-source ART framework wraps this into a production-ready system with four components: a <strong>client</strong> (LangGraph/CrewAI/ADK agent code with trajectory recording), a <strong>vLLM inference backend</strong> that hot-loads new LoRA checkpoints without downtime, an <strong>Unsloth training backend</strong> running GRPO, and RULER as the reward signal. The LoRA hot-swap is the key engineering decision — your agent improves while serving production traffic.</p><h4>Experience Replay Cuts Compute Cost</h4><p>Meta FAIR and NYU showed that a <strong>well-designed replay buffer</strong> drastically cuts inference compute during RL training. The dominant cost in RLHF is generating rollouts — replay buffers reuse previous completions across multiple training steps. Critically, they measured and reported <strong>maintained output diversity</strong>, addressing the primary concern that naive replay collapses policy modes.</p><h4>Ceiling and Gaps</h4><p>The fine-tuned model's quality is bounded by the <strong>judge LLM's discrimination ability</strong> — no ablation exists on when this becomes the binding constraint. No GRPO vs. PPO vs. DPO ablation on equivalent tasks. No production metrics from ART deployments. <em>Run your own benchmarks rather than trusting headline claims.</em></p><hr><h4>Connecting the Dots</h4><p>TSMC's $35.9B Q1 revenue (+40.6% YoY) and Cerebras swinging to profitability ($87.9M net income vs. -$484.8M loss) confirm compute supply is expanding. Combined with experience replay cutting training costs, the economics of agent RL training have improved on both the supply and demand side simultaneously. The constraint has shifted from "can we afford this?" to "do we have the architecture?"</p>
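To make the loop concrete, here is a minimal sketch of one GRPO step with RULER-style comparative rewards. The `generate`, `judge_rank`, and `update_policy` callables are hypothetical stand-ins supplied by the caller (this is not the ART API); only the group sampling, rank normalization, and update order mirror the mechanism described above.

```python
import statistics

def grpo_step(prompt, generate, judge_rank, update_policy, group_size=4):
    """One GRPO step with RULER-style comparative rewards.

    `generate`, `judge_rank`, and `update_policy` are caller-supplied
    stand-ins, not the ART framework's API.
    """
    # 1. Sample N completions from the current policy for this prompt.
    completions = [generate(prompt) for _ in range(group_size)]

    # 2. RULER pattern: a single judge call ranks all N trajectories
    #    ("which is best?") instead of scoring each on an absolute scale.
    #    Returns completion indices ordered best -> worst.
    ranking = judge_rank(prompt, completions)

    # 3. Map ranks to scores. Any monotone mapping works: GRPO only
    #    uses the ordering, so judge calibration doesn't matter.
    scores = [0.0] * group_size
    for points, idx in enumerate(reversed(ranking)):  # worst gets 0 points
        scores[idx] = float(points)

    # 4. Normalize within the group: advantage = (score - mean) / std.
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0  # guard against all-equal ranks
    advantages = [(s - mean) / std for s in scores]

    # 5. Reinforce above-average completions, suppress below-average ones.
    update_policy(prompt, completions, advantages)
    return advantages
```

The rank-to-score mapping is deliberately arbitrary: scores of 0.3/0.5/0.7 and 30/50/70 produce identical advantages after within-group normalization, which is why noisy or uncalibrated judges still work.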
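The source doesn't spell out Meta FAIR's buffer design, so the sketch below shows only the generic pattern: store rollouts, then fill a fraction of each training batch from the buffer instead of regenerating. The class name, capacity, and `reuse_ratio` are illustrative assumptions, and uniform sampling is a naive stand-in for their diversity-preserving mechanism.

```python
import random
from collections import deque

class ReplayBuffer:
    """Reuse past rollouts across training steps to cut rollout-generation
    compute. Generic sketch, not Meta FAIR's published design."""

    def __init__(self, capacity=10_000, reuse_ratio=0.5):
        self.buffer = deque(maxlen=capacity)  # FIFO eviction keeps entries fresh
        self.reuse_ratio = reuse_ratio        # fraction of each batch replayed

    def add(self, prompt, completion, score):
        self.buffer.append((prompt, completion, score))

    def mix_batch(self, fresh_rollouts):
        # Replace up to reuse_ratio of the batch with stored rollouts.
        # Uniform sampling over the whole buffer is a naive guard against
        # over-weighting any single policy mode; FAIR measured diversity
        # explicitly, which a production version should replicate.
        n_replay = min(int(len(fresh_rollouts) * self.reuse_ratio),
                       len(self.buffer))
        replayed = random.sample(list(self.buffer), n_replay)
        return fresh_rollouts[:len(fresh_rollouts) - n_replay] + replayed
```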
Action items
- Spin up ART framework notebook (3B model learning MCP server tool use) to validate the approach on your infrastructure
- Replace absolute-score LLM-as-judge evals with comparative ranking (RULER pattern) in existing eval pipelines
- Implement experience replay with diversity-preserving buffer design in any existing RL/RLHF pipeline
- Benchmark a fine-tuned sub-3B model against your current API calls on your highest-volume narrow task
Sources: GRPO + RULER eliminates reward engineering from your agent fine-tuning loop — here's the production stack · Looped transformers and diffusion LMs are coming for your inference costs — two architectures to benchmark now
02 Your Agent Inference Stack Has a Cache-Shaped Bottleneck — And a Zombie-Shaped Time Bomb
<h3>Two Infrastructure Failures Converging</h3><p>Two independent signals paint the same picture: multi-agent inference at scale is failing not from GPU limitations but from <strong>infrastructure blind spots</strong>. NVIDIA is formalizing KV-cache lifecycle management as the primary serving concern for agent swarms, while Pinterest's post-mortem reveals that zombie memory cgroups silently kill distributed training by masquerading as network failures. Both are fixable — if you know where to look.</p><h4>NVIDIA's Cache-Aware Agent Infrastructure</h4><p>When coding agents make <strong>hundreds of sequential calls with shared history</strong>, the bottleneck shifts from raw GPU throughput to KV-cache management. NVIDIA proposes four primitives:</p><ol><li><strong>Cache-aware routing by KV overlap</strong> — route to the GPU already holding relevant cache, replacing round-robin</li><li><strong>agent_hints metadata</strong> — priority and expected output length, enabling optimized scheduling</li><li><strong>Differentiated cache blocks</strong> — persistent system context vs. ephemeral reasoning get different retention policies</li><li><strong>Multi-tier KV storage with prefetching</strong> — session-aware inference where context survives across agent turns</li></ol><blockquote>Scaling agents ≠ scaling inference. The moment your agents share context across turns, your serving layer needs session awareness — and round-robin load balancing is actively destroying your latency budget.</blockquote><p><em>Critical gap:</em> No published benchmarks compare this against baseline vLLM or TGI. This is a <strong>design document, not a benchmark paper</strong>. But the architecture is sound for long-context multi-turn agents, and open-source alternatives exist: vLLM's prefix caching and SGLang's RadixAttention provide starting points.</p><h4>Pinterest's Zombie Cgroup Cascade</h4><p>The failure chain that killed Pinterest's Ray training jobs:</p><ol><li>Malfunctioning ECS agent fails to clean up memory cgroups after container termination</li><li>Zombie cgroups <strong>accumulate silently</strong>, consuming kernel memory-tracking resources</li><li>CPU scheduling starves as the kernel manages thousands of dead cgroups</li><li>CPU starvation triggers <strong>AWS ENA network driver resets</strong></li><li>Network resets disconnect Ray workers, killing distributed training</li></ol><p>The insidious part: <strong>this presents as network timeouts, not memory or CPU issues</strong>. Standard ML monitoring (GPU utilization, training loss, memory) shows nothing until the crash. You need kernel-level observability — specifically cgroup counts and CPU scheduling latency.</p><h4>GPU Prices Compound the Pressure</h4><p>GPU prices surged ~50% specifically from <strong>AI agent compute demand</strong> (not just training), with service outages and product cancellations at major providers. A 50% price increase means your cost-per-experiment has materially changed: if you were running a quarter's hyperparameter sweeps at $X, you're now at ~$1.5X. The ROI on compute optimization techniques just increased by 50% — making cache-aware routing and the GRPO efficiency techniques from today's first report even more urgent.</p><hr><h4>Sources Converge</h4><p>NVIDIA's architecture, Pinterest's failure, and the GPU price surge all point to the same conclusion: the <strong>infrastructure layer</strong> — not the model layer — is where agent systems break and where money is wasted. Cache management, container hygiene, and routing intelligence determine your effective cost and reliability more than model selection.</p>
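As a sketch of the first primitive, cache-aware routing reduces to: score each worker by how much of the incoming request's token prefix it already holds in KV cache, and fall back to load-based routing on a miss. This is a toy illustration, not NVIDIA's implementation; the `workers` schema is a hypothetical stand-in.

```python
def common_prefix_len(a, b):
    """Length of the shared token prefix between two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens, workers):
    """Cache-aware routing by KV overlap: send the request to the worker
    already holding the longest matching prefix, instead of round-robin.
    `workers` is a hypothetical list of dicts with 'cached_prefixes'
    (token lists) and 'active_requests' (int)."""
    def best_overlap(worker):
        return max((common_prefix_len(request_tokens, p)
                    for p in worker["cached_prefixes"]), default=0)

    best = max(workers, key=best_overlap)
    if best_overlap(best) == 0:
        # Cache miss everywhere: fall back to least-loaded routing.
        return min(workers, key=lambda w: w["active_requests"])
    return best
```

Production systems such as vLLM's prefix caching and SGLang's RadixAttention track prefixes in radix trees rather than scanning lists, but the routing decision has the same shape.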
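For the Pinterest failure mode, the fix is observability rather than architecture. A minimal watchdog, equivalent to the `find` one-liner in the action items below, might look like this; the alert threshold and poll interval are illustrative assumptions to baseline against your own container churn.

```python
import pathlib
import time

CGROUP_ROOT = pathlib.Path("/sys/fs/cgroup")
ALERT_THRESHOLD = 5_000   # illustrative; set from normal container churn
POLL_SECONDS = 300

def cgroup_count() -> int:
    # Python equivalent of: find /sys/fs/cgroup -type d | wc -l
    return sum(1 for p in CGROUP_ROOT.rglob("*") if p.is_dir())

if __name__ == "__main__":
    while True:
        n = cgroup_count()
        if n > ALERT_THRESHOLD:
            # Zombie cgroups accumulate silently; page before the kernel
            # starves CPU scheduling and ENA resets kill Ray workers.
            print(f"ALERT: {n} cgroups on this host (threshold {ALERT_THRESHOLD})")
        time.sleep(POLL_SECONDS)
```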
Action items
- Add cgroup count monitoring to any ECS-based Ray/distributed training cluster using `find /sys/fs/cgroup -type d | wc -l` with alert thresholds based on expected container churn
- Measure KV-cache hit rate per agent session in your serving infrastructure — if you don't have this metric, instrument it this sprint
- Evaluate vLLM prefix caching or SGLang RadixAttention as cache-aware routing alternatives to round-robin load balancing
- Re-baseline GPU training cost models with current spot/on-demand prices and evaluate LoRA/QLoRA to offset 50% price surge
Sources: Your agentic inference stack has a KV-cache bottleneck — NVIDIA's fix and a SQL interface for model internals · GPU costs up 50%, Pinterest's Ray crash post-mortem, and why your Python tooling is stuck in 2023
03 Your AI Coding Assistant Is an Active Supply Chain Attack Vector
<h3>Two New Attack Classes Your Red-Team Isn't Testing</h3><p>Two independent security findings converge into a single message: the AI tools in your development workflow have become attack surfaces, not just productivity tools. AI coding assistants hallucinate package names that attackers are already squatting, and persuasion-structured prompts more than double safety-filter bypass rates.</p><h4>AI-Hallucinated Package Squatting</h4><p>The attack chain is straightforward and already being exploited:</p><ol><li>Copilot/Cursor/Claude Code suggests a <strong>plausible-sounding but non-existent package</strong></li><li>Attackers monitor hallucinated names and <strong>register them on public registries</strong></li><li>Your next <code>pip install</code> from an AI suggestion = <strong>RCE in your ML pipeline</strong></li></ol><p>For ML engineers, this is particularly dangerous because:</p><ul><li>ML codebases rely on <strong>niche packages</strong> where hallucination rates are likely higher</li><li>A compromised package in your training pipeline has <strong>full access to data, models, and credentials</strong></li><li>Internal package names leak via Sentry stack traces, npm configs, job postings, and error bundles</li></ul><blockquote>Most CISOs spoken to in Q1 2026 admitted they don't know if their org is even vulnerable to basic dependency confusion. The AI hallucination vector makes this worse — it's automated social engineering of developers.</blockquote><h4>Persuasion Jailbreaking: The Unmonitored Attack Class</h4><p>A Wharton Generative AI Labs study shows that applying <strong>Cialdini's persuasion principles</strong> (authority, commitment, scarcity, reciprocity) to LLM prompts can more than double compliance with safety-blocked requests. This exploits a different dimension than technical jailbreaks — it manipulates the model's trained tendency to be helpful and responsive to social cues.</p><p>Current safety alignment (RLHF, constitutional AI) is trained primarily against <strong>explicit harmful requests and technical bypass patterns</strong>. Persuasion-structured prompts look like normal conversation to pattern-matching defenses. If your red-team only tests prompt injection, role-play, and encoding tricks, you're missing the highest-probability real-world attack class.</p><p><em>Limitation: the Wharton study provides no sample size, no specific models tested, and no definition of "more than double." Going from 2% to 5% vs. 15% to 35% has very different risk profiles. Await the primary paper before calibrating your response.</em></p><h4>Cross-Source Pattern: AI Tools Inherit and Amplify Risk</h4><p>These findings connect to a broader pattern visible across today's intelligence: AI tools trained on older data <strong>actively fight modern security practices</strong> (LLMs recommend pip over uv at roughly 70/30 despite widespread developer preference for uv). The same staleness that slows tooling adoption makes hallucinated package names more convincing — they reference patterns from the LLM's training distribution rather than current registry state.</p>
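A starting point for the package audit: the script below checks every pinned dependency against PyPI's public JSON API. An existence check alone can't catch already-squatted names (by definition they now exist), so it also flags recently registered packages; `MIN_AGE_DAYS` is an arbitrary assumption, and the requirements parsing is deliberately naive.

```python
import json
import sys
import urllib.error
import urllib.request
from datetime import datetime, timezone

MIN_AGE_DAYS = 90   # arbitrary threshold; tune to your risk tolerance

def first_upload(name):
    """Earliest release upload time for a package via PyPI's public JSON API,
    or None if the package doesn't exist."""
    url = f"https://pypi.org/pypi/{name}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            data = json.load(resp)
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None
        raise
    times = [f["upload_time_iso_8601"].replace("Z", "+00:00")
             for files in data["releases"].values() for f in files]
    return min(map(datetime.fromisoformat, times)) if times else None

for line in open(sys.argv[1]):          # e.g. requirements.txt
    # Naive parsing: strips comments and the most common version pins.
    name = line.split("#")[0].split("==")[0].split(">=")[0].strip()
    if not name:
        continue
    uploaded = first_upload(name)
    if uploaded is None:
        print(f"NOT ON PYPI (likely hallucinated): {name}")
    elif (datetime.now(timezone.utc) - uploaded).days < MIN_AGE_DAYS:
        print(f"SUSPICIOUSLY NEW ({uploaded.date()}): {name}")
```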
Action items
- Audit all AI-suggested packages in your requirements files against official PyPI/npm registries — verify nothing was installed solely on an AI recommendation without manual verification
- Pin all GitHub Actions to commit SHAs (not tags) in ML training and deployment workflows
- Build a persuasion-structured red-team prompt library organized by Cialdini principle and test against your deployed models
- Add secret scanning to CI/CD output logs (not just commits) for ML training pipelines
- Inventory all AI agents on your team with production credentials and scope down to read-only minimum with short-lived tokens
Sources: Your AI coding assistant is suggesting packages that don't exist — and attackers are squatting those names · Wharton study: persuasion techniques 2x+ your LLM safety bypass rate — red-team your guardrails now
◆ QUICK HITS
uv has only 30% adoption in new Python repos — LLMs trained on older data keep defaulting to pip, creating measurable ecosystem inertia. Add uv to your CLAUDE.md/cursor rules to override.
GPU costs up 50%, Pinterest's Ray crash post-mortem, and why your Python tooling is stuck in 2023
LARQL introduces SQL-style queries for transformer internals (demoed on Gemma 3) — early prototype for programmatic mechanistic interpretability, treating model features as a queryable graph database. Time-box a 2-3 day research spike.
Your agentic inference stack has a KV-cache bottleneck — NVIDIA's fix and a SQL interface for model internals
JHU's ManyIH benchmark shows frontier models fail at resolving conflicts across multiple privilege levels — if your agents use MCP integrations with layered permissions, add multi-tier instruction conflict tests to your eval suite.
Looped transformers and diffusion LMs are coming for your inference costs — two architectures to benchmark now
Canva trains on edit sequences (ordered human steps to build designs) + perturbation-based error detection (deliberately breaking good outputs) — both patterns are cheap to implement for any structured-output pipeline with iterative workflows.
Canva's perturbation training + edit-sequence modeling: patterns worth stealing for your structured output pipelines
On-device pipeline now end-to-end viable: Qwen3-0.6B fine-tuned → TorchAO quantization → ExecuTorch export = 470MB .pte file at ~25 tok/s on iPhone 17 Pro. Scope to classification, extraction, short-form generation (200 tokens = 8 seconds).
GRPO + RULER eliminates reward engineering from your agent fine-tuning loop — here's the production stack
Update: DeepSeek V4 is adapting to Huawei Ascend silicon — first major open-weights lineage decoupling from CUDA. Audit your codebase for hard CUDA dependencies vs. framework-abstracted code (PyTorch/JAX with XLA, Triton).
Your CUDA lock-in is now a geopolitical risk — Huawei Ascend bifurcation changes your hardware strategy
Google in talks with Marvell specifically for inference-optimized chips (not training) — hardware beneath your API calls is likely to change materially in 12-18 months. Invest in hardware-agnostic serving frameworks.
Your inference cost calculus is shifting — hyperscalers are betting billions on custom silicon over NVIDIA
Anthropic's 81,000-person survey across 159 countries: 81% report AI delivered value, but unreliability is the #1 concern — calibration metrics (ECE, Brier score) and confidence-gated responses are product differentiators.
Your agentic inference stack has a KV-cache bottleneck — NVIDIA's fix and a SQL interface for model internals
BOTTOM LINE
The agent training stack just had its 'SFT moment' — GRPO + RULER eliminates reward engineering and labeled data from RL fine-tuning while GPU prices are up 50% and your AI coding assistant is actively suggesting packages that attackers have squatted. The infrastructure layer (cache routing, container hygiene, supply chain verification) now determines your agent system's cost, reliability, and security more than model selection does.
Frequently asked
- How does GRPO differ from SFT for training multi-step agents?
- GRPO trains on outcomes rather than imitation: it generates N completions per prompt, scores them, normalizes within the group, and reinforces above-average behaviors while suppressing below-average ones. Only the relative ranking drives learning, making it robust to noisy or poorly calibrated reward signals. SFT only teaches what to say, not whether the agent succeeded — which is the dimension that matters for tool use and multi-turn reasoning.
- Why use comparative ranking (RULER) instead of absolute LLM-as-judge scoring?
- Comparative judgment ("which of these 4 is best?") is empirically far more reliable than absolute scoring ("rate 0-10"), a well-established psychometric finding. RULER passes N trajectories to a judge LLM that produces relative rankings, which feed directly into GRPO as rewards. The upshot: no reward function engineering, no labeled data, and no human annotation — and the eval signal improves even if you don't adopt GRPO.
- What silent failure mode should I monitor on ECS-based Ray clusters?
- Zombie memory cgroups. When the ECS agent fails to clean up cgroups after container termination, they accumulate and starve CPU scheduling, which triggers AWS ENA network driver resets that disconnect Ray workers. The failure presents as network timeouts, so GPU/memory/loss dashboards show nothing until the crash. Add an alert on `find /sys/fs/cgroup -type d | wc -l` and track CPU scheduling latency.
- What's the concrete risk from AI-hallucinated package names?
- Attackers monitor package names that coding assistants hallucinate and register them on PyPI/npm, so the next `pip install` from an AI suggestion can execute arbitrary code inside your ML pipeline — with access to training data, model weights, and credentials. ML codebases are especially exposed because they pull from niche packages where hallucination rates are higher. Manually verify every AI-suggested dependency against the official registry before installing.
- How should I adjust GPU cost models given the ~50% price surge?
- Re-baseline Q2 compute budgets against current spot and on-demand prices rather than late-2025 numbers, since cost-per-experiment has risen materially. The ROI on optimization techniques — LoRA/QLoRA, cache-aware routing, experience replay, and fine-tuning sub-3B models for narrow high-volume tasks — has increased proportionally. Benchmarking a small self-hosted model against API calls is now much more likely to pay back.