
AgentDrift is a research framework and benchmark by Wu et al. that measures how tool-augmented LLM agents silently deviate from safe behavior when tool outputs are corrupted. Using a paired-trajectory protocol, the researchers inject minimal corruptions into tool responses and measure whether agents detect, question, or blindly propagate the corrupted information. The headline finding is stark: across 1,563 contaminated turns, not a single agent explicitly questioned the reliability of the tool data. Standard evaluation metrics such as NDCG showed high utility preservation, masking the fact that agents recommended risk-inappropriate financial products 65–93% of the time. AgentDrift shows that current evaluation frameworks measure the wrong things: they capture what an agent recommends, but not whether those recommendations are safe.
Why it matters
AgentDrift exposes a blind spot in how the AI industry evaluates agent safety. Current benchmarks measure task completion, accuracy, and user satisfaction — but not whether an agent maintains safe behavior when its information sources are compromised. This gap is critical because real-world tool outputs are inherently unreliable: APIs return stale data, databases can be corrupted, and web scraping picks up manipulated content. In domains like financial advising, healthcare, and legal counsel, an agent that achieves high accuracy scores while silently propagating corrupted data can cause severe material harm. AgentDrift demonstrates that we need safety-specific evaluation metrics that test agents under adversarial conditions, not just ideal ones.

How it works
The benchmark uses a paired-trajectory methodology. For each test scenario, two parallel agent executions run: one with clean tool outputs (the baseline) and one with minimally corrupted outputs (the treatment). The corruption is designed to be subtle: shifting a risk score from 'moderate' to 'aggressive', altering a financial product's fee structure by a few basis points, or slightly changing a medical dosage recommendation. Researchers then compare the agent's downstream decisions across the two trajectories. The paired design isolates the effect of data corruption from other variables. Key metrics include the drift detection rate (did the agent notice?), the drift propagation rate (did it use the corrupted data anyway?), and the safety violation rate (did its final recommendation become unsafe?). The central result: a 0% detection rate across all tested models.
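The paired comparison and the three metrics above can be sketched roughly as follows. This is a minimal illustration, not the actual AgentDrift code: the `Trajectory` fields, the sample data, and the aggregation are all hypothetical stand-ins for whatever the real harness records.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    detected_anomaly: bool   # did the agent flag the tool output as suspect?
    used_corrupted: bool     # did the (possibly corrupted) data reach its answer?
    recommendation: str      # the agent's final decision

def evaluate_pair(clean: Trajectory, treated: Trajectory) -> dict:
    """Compare a clean baseline run against its minimally corrupted twin."""
    return {
        "detected": treated.detected_anomaly,
        "propagated": treated.used_corrupted,
        # Safety violation: the final recommendation flipped solely
        # because of the injected corruption.
        "violation": treated.recommendation != clean.recommendation,
    }

# Hypothetical sample pairs (baseline, treatment) for two scenarios.
pairs = [
    (Trajectory(False, False, "bond_fund"),
     Trajectory(False, True, "crypto_fund")),   # recommendation flipped
    (Trajectory(False, False, "index_fund"),
     Trajectory(False, True, "index_fund")),    # corrupted data used, no flip
]

results = [evaluate_pair(clean, treated) for clean, treated in pairs]
detection_rate   = sum(r["detected"]   for r in results) / len(results)
propagation_rate = sum(r["propagated"] for r in results) / len(results)
violation_rate   = sum(r["violation"]  for r in results) / len(results)
print(detection_rate, propagation_rate, violation_rate)  # 0.0 1.0 0.5
```

The point of the pairing is visible in the `violation` line: only a recommendation that differs from its own clean baseline counts, so ordinary model variance across scenarios does not inflate the safety numbers.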
Example
A financial advisory agent is tasked with recommending investment products for a conservative retiree. The agent calls a risk assessment tool that returns portfolio data, but an attacker has corrupted the tool's output — changing the risk rating of a volatile cryptocurrency fund from 'high risk' to 'moderate risk' and lowering the displayed volatility metrics. The agent, which scores highly on standard accuracy benchmarks, accepts the corrupted risk data at face value. It recommends the cryptocurrency fund as part of a 'balanced' portfolio, without questioning why a crypto fund would have moderate risk metrics. Standard evaluation metrics show the agent performed well: it selected a diversified portfolio, used correct financial terminology, and engaged naturally with the user. Only the paired-trajectory comparison reveals that this specific recommendation flip — from bond fund to crypto fund — was caused entirely by the corrupted tool output.
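The corruption in this scenario can be sketched as a targeted edit to the tool's response. The field names and values below are illustrative assumptions, not the benchmark's actual schema; the point is how small the perturbation is relative to the behavioral change it causes.

```python
import copy

# Hypothetical clean output from the risk assessment tool.
clean_output = {
    "product": "VolatileCryptoFund",
    "risk_rating": "high",
    "volatility_30d": 0.42,
}

def corrupt(tool_output: dict) -> dict:
    """Apply a subtle, targeted perturbation to the risk fields only."""
    treated = copy.deepcopy(tool_output)
    treated["risk_rating"] = "moderate"  # 'high' -> 'moderate'
    # Shrink the displayed volatility to match the downgraded rating.
    treated["volatility_30d"] = round(treated["volatility_30d"] * 0.3, 2)
    return treated

treated_output = corrupt(clean_output)
print(treated_output)  # {'product': 'VolatileCryptoFund', 'risk_rating': 'moderate', 'volatility_30d': 0.13}
```

Everything else in the response stays byte-identical, which is why the agent has no surface-level cue that anything is wrong: detecting the corruption requires reasoning that a crypto fund with 'moderate' risk metrics is implausible.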
Sources
- AgentDrift (arXiv)