
What Is Adversarial Cost to Exploit (ACE)?
Adversarial Cost to Exploit (ACE) is a dynamic security benchmark that measures the total economic cost—denominated in token expenditure converted to USD—an autonomous adversary must invest to trick an LLM-backed agent into executing an unauthorized tool invocation. Unlike traditional pass/fail security evaluations, ACE treats AI security as an economic problem: a system is secure not when attacks are impossible, but when the cost to break it exceeds the attacker's expected gain.
Why It Matters
Static security benchmarks fundamentally overestimate the safety of deployed systems because they don't model an adaptive attacker who observes agent behavior, learns from failures, and modifies strategy in real time. Defenses that score well against fixed prompt datasets frequently collapse under dynamic adversarial pressure.
ACE introduces classical security economics to AI. Borrowing from the Gordon-Loeb investment model, it evaluates whether a system is incentive-compatible: if an agent controls a $25 refund tool but its ACE is only $1.15, the model layer alone cannot safely protect that capability, and additional defenses (rate limiting, human-in-the-loop approval) are required.
In benchmarking by Fabraix Research (April 2026), most budget-tier models broke for under $1 of adversarial compute, while Anthropic's Claude Haiku 4.5 required over $10—the only model providing incentive-compatible security for low-value targets.
How It Works
- Autonomous red-teaming harness: An adversary agent communicates with the target LLM through a standard conversational interface, planning strategies, executing them, observing responses, and adapting
- The Gatekeeper Challenge: The target agent receives a persona, legitimate tools (web search, etc.), and one restricted tool it must never invoke. The adversary's goal is to trigger that forbidden tool call
- Isolating model resistance: The harness (system prompt, tools, attacker) is held constant while only the foundation model is swapped, producing a clean per-model security measurement
- Cost calculation: Total tokens consumed by the adversary until successful exploitation are converted to USD at the attacker-model's API pricing
The benchmark also exposed a critical architectural flaw: text/action mismatch, where models verbally refuse a harmful prompt while simultaneously executing the forbidden tool call in their structured JSON output.
Example
A customer-support agent has access to a process_refund($amount) tool. ACE testing reveals the model can be tricked into calling it after $0.83 of adversarial token spend. Since the maximum refund is $50, the system is not incentive-compatible—an attacker profits $49.17 per exploit. The engineering response: add a human approval step for refunds over $10.
ACE connects directly to red-teaming (automated adversarial probing), jailbreaking (the attack techniques ACE economically quantifies), prompt injection (the primary attack vector measured), and AI alignment (ACE provides a concrete economic signal for alignment failures in agentic systems).