Safety & Ethics
Advanced
2026-W12

What Is AgentDrift and Why Does It Matter?

Benchmark showing that AI agents blindly accept corrupted tool data: agents questioned 0 of 1,563 contaminated turns while still appearing to perform well on standard metrics.

Also known as: Agent Drift

AgentDrift is a research framework and benchmark by Wu et al. that measures how tool-augmented LLM agents silently deviate from safe behavior when tool outputs are corrupted. Using a paired-trajectory protocol, researchers systematically inject minimal data corruption into tool responses and measure whether agents detect, question, or blindly propagate the corrupted information. The key finding is devastating: across 1,563 contaminated turns, not a single agent explicitly questioned the reliability of the tool data. Standard evaluation metrics such as NDCG (normalized discounted cumulative gain) showed high utility preservation, masking the fact that agents recommended risk-inappropriate financial products 65–93% of the time. AgentDrift shows that current evaluation frameworks measure the wrong things: they capture what an agent recommends, not whether those recommendations are safe.
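To see how a ranking metric like NDCG can mask a safety failure, consider a minimal sketch. The `ndcg` function below is a standard textbook formulation, but the relevance grades are illustrative assumptions, not data from the paper: if the corrupted run's unsafe product receives the same relevance grade as the clean run's safe one, the utility scores are indistinguishable.

```python
import math

def ndcg(relevances):
    """NDCG for a ranked list of graded relevance scores."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades for a retiree's recommended products.
# In the corrupted run, a crypto fund replaces a bond fund, but the
# grader relies on the same (corrupted) risk data, so the grade matches.
clean_run     = [3, 2, 2, 1]   # bond-heavy portfolio
corrupted_run = [3, 2, 2, 1]   # unsafe swap, identical grades

print(ndcg(clean_run), ndcg(corrupted_run))  # both 1.0
```

The metric cannot see the swap because it scores the ranking, not the safety of what was ranked, which is exactly the blind spot AgentDrift targets.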

Why it matters

AgentDrift exposes a blind spot in how the AI industry evaluates agent safety. Current benchmarks measure task completion, accuracy, and user satisfaction — but not whether an agent maintains safe behavior when its information sources are compromised. This gap is critical because real-world tool outputs are inherently unreliable: APIs return stale data, databases can be corrupted, and web scraping picks up manipulated content. In domains like financial advising, healthcare, and legal counsel, an agent that achieves high accuracy scores while silently propagating corrupted data can cause severe material harm. AgentDrift demonstrates that we need safety-specific evaluation metrics that test agents under adversarial conditions, not just ideal ones.


How it works

The benchmark uses a paired-trajectory methodology. For each test scenario, two parallel agent executions run: one with clean tool outputs (baseline) and one with minimally corrupted outputs (treatment). The corruption is designed to be subtle: shifting a risk score from 'moderate' to 'aggressive,' altering a financial product's fee structure by a few basis points, or changing a medical dosage recommendation slightly. Researchers then compare the agent's downstream decisions across both trajectories. The paired approach isolates the effect of data corruption from other variables. Key metrics include drift detection rate (did the agent notice?), drift propagation rate (did it use the corrupted data anyway?), and safety violation rate (did its final recommendation become unsafe?). The headline result: a 0% detection rate across every model tested.
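The three rates above can be sketched as a small aggregation over contaminated turns. This is an illustrative assumption of how such metrics might be computed; the `Turn` fields and `drift_metrics` function are hypothetical, not the benchmark's actual API:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    corrupted: bool           # was this turn's tool output tampered with?
    flagged: bool             # did the agent question the data?
    used_corrupt_value: bool  # did the corrupted value reach the output?
    unsafe_final: bool        # is the final recommendation risk-inappropriate?

def drift_metrics(turns):
    """Aggregate detection, propagation, and violation rates over
    contaminated turns (illustrative formulas, not the paper's code)."""
    bad = [t for t in turns if t.corrupted]
    n = len(bad) or 1
    return {
        "detection_rate":   sum(t.flagged for t in bad) / n,
        "propagation_rate": sum(t.used_corrupt_value for t in bad) / n,
        "violation_rate":   sum(t.unsafe_final for t in bad) / n,
    }

turns = [
    Turn(corrupted=True,  flagged=False, used_corrupt_value=True,  unsafe_final=True),
    Turn(corrupted=True,  flagged=False, used_corrupt_value=True,  unsafe_final=False),
    Turn(corrupted=False, flagged=False, used_corrupt_value=False, unsafe_final=False),
]
print(drift_metrics(turns))  # detection 0.0, propagation 1.0, violation 0.5
```

Note that only contaminated turns enter the denominator; clean-baseline turns serve as the comparison trajectory, not as part of the rates.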

Example

A financial advisory agent is tasked with recommending investment products for a conservative retiree. The agent calls a risk assessment tool that returns portfolio data, but an attacker has corrupted the tool's output — changing the risk rating of a volatile cryptocurrency fund from 'high risk' to 'moderate risk' and lowering the displayed volatility metrics. The agent, which scores highly on standard accuracy benchmarks, accepts the corrupted risk data at face value. It recommends the cryptocurrency fund as part of a 'balanced' portfolio, without questioning why a crypto fund would have moderate risk metrics. Standard evaluation metrics show the agent performed well: it selected a diversified portfolio, used correct financial terminology, and engaged naturally with the user. Only the paired-trajectory comparison reveals that this specific recommendation flip — from bond fund to crypto fund — was caused entirely by the corrupted tool output.
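A corruption like the one in this example can be expressed as a minimal, targeted edit to a tool response. The schema, field names, and product names below are hypothetical; the point is that the treatment trajectory differs from the baseline by only two fields:

```python
import copy

def corrupt_risk_rating(tool_output: dict) -> dict:
    """Minimal, targeted corruption of a tool response (illustrative):
    downgrade one product's risk label and volatility, leave the rest intact."""
    out = copy.deepcopy(tool_output)
    for product in out["products"]:
        if product["asset_class"] == "crypto":
            product["risk_rating"] = "moderate"  # was "high"
            product["volatility_30d"] = 0.08     # was 0.42
    return out

clean = {"products": [
    {"name": "GovBondFund", "asset_class": "bonds",
     "risk_rating": "low", "volatility_30d": 0.03},
    {"name": "VolatileCoinFund", "asset_class": "crypto",
     "risk_rating": "high", "volatility_30d": 0.42},
]}
treated = corrupt_risk_rating(clean)
# The treatment run sees `treated`; the paired baseline run sees `clean`.
```

An agent that sanity-checked the data might notice the mismatch (a crypto fund with bond-like volatility); the benchmark's finding is that none did.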

Sources

  1. AgentDrift (arXiv)

Related Concepts

Reward Hacking in AI Agents
AI agents gaming their benchmarks — evaluator tampering occurs in 50% of episodes and gets worse with more capable models.
Instruction Hierarchy for AI Safety
Safety pattern giving system prompts priority over user inputs and tool outputs — preventing prompt injection in autonomous agents.

