
What is ITBench-AA?
ITBench-AA (IT Benchmark - Autonomous Agents) is an evaluation framework that tests AI agents on complex, multi-step Kubernetes incident response tasks using real-world production infrastructure. It measures their ability to diagnose, troubleshoot, and resolve live system failures.
Why It Matters
Most AI benchmarks test isolated reasoning or coding skills. ITBench-AA evaluates end-to-end operational reliability in high-stakes infrastructure scenarios where mistakes have real costs.
Initial results (May 2026) show:
- Claude Opus 4.7: 43% incident resolution rate
- GPT-5.5: 38% incident resolution rate
- Open-source models: <20% resolution rate
These results suggest that even frontier models struggle with the complex, ambiguous, multi-domain reasoning that production operations require.
How It Works
1. Realistic Incident Scenarios
ITBench-AA presents agents with authentic failure modes:
- Application crashes ("Pod repeatedly restarting with OOMKilled status")
- Network issues ("Service A can't reach Service B across namespaces")
- Resource exhaustion ("Node experiencing disk pressure, evicting pods")
- Configuration errors ("Deployment manifest has incorrect environment variable")
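To make the first failure mode concrete, here is a minimal sketch of how a tool might spot an OOMKilled container in the JSON that `kubectl get pod <name> -o json` returns. The field names (`status.containerStatuses`, `state`, `lastState`, `terminated.reason`) follow the Kubernetes Pod API; the sample pod, its name, and the helper function are hypothetical and are not part of ITBench-AA.

```python
import json


def oom_killed_containers(pod: dict) -> list[str]:
    """Return names of containers whose current or previous state ended in OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # A crash-looping container usually reports the kill under lastState,
        # while its current state is "waiting" (CrashLoopBackOff).
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break
    return names


# Hypothetical, heavily trimmed pod object used only for illustration.
SAMPLE_POD = json.loads("""
{
  "metadata": {"name": "checkout-7d9f", "namespace": "shop"},
  "status": {
    "containerStatuses": [
      {"name": "app", "restartCount": 7,
       "state": {"waiting": {"reason": "CrashLoopBackOff"}},
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
      {"name": "sidecar", "restartCount": 0,
       "state": {"running": {}},
       "lastState": {}}
    ]
  }
}
""")

print(oom_killed_containers(SAMPLE_POD))  # ['app']
```

Finding the OOMKilled reason only identifies the symptom. Whether the root cause is a memory leak or an undersized limit still has to be diagnosed.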
2. Evaluation Criteria
Success requires:
- Correct diagnosis (identify root cause, not just symptoms)
- Minimal blast radius (don't break other services while investigating)
- Complete resolution (system returns to healthy state)
- Appropriate tool use (kubectl, logs, metrics, config inspection)
- Time efficiency (resolve within SLA constraints)
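The criteria above can be read as a conjunctive rubric: an incident counts as resolved only if every criterion holds. The sketch below encodes that reading. It is an assumption, since the source does not say whether ITBench-AA scores criteria all-or-nothing or with weights, and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EpisodeResult:
    """One agent run on one incident; each field mirrors a criterion listed above."""
    correct_diagnosis: bool
    minimal_blast_radius: bool
    complete_resolution: bool
    appropriate_tool_use: bool
    within_sla: bool

    @property
    def resolved(self) -> bool:
        # Assumed all-or-nothing scoring: every criterion must be satisfied.
        return all(getattr(self, f.name) for f in fields(self))


def resolution_rate(results: list[EpisodeResult]) -> float:
    """Fraction of episodes that satisfy every criterion."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


# Made-up episodes for illustration only; these are not benchmark data.
results = [
    EpisodeResult(True, True, True, True, True),
    EpisodeResult(True, False, True, True, True),   # fixed it, but broke a neighbor
    EpisodeResult(False, True, False, True, True),  # treated a symptom only
    EpisodeResult(True, True, True, True, True),
]
print(resolution_rate(results))  # 0.5
```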
3. Autonomous Operation
Unlike supervised benchmarks, agents must:
- Choose which commands to run (no step-by-step guidance)
- Interpret ambiguous error messages
- Decide when to escalate vs. continue investigating
- Validate that their fix actually worked
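One way to picture this autonomy is as a loop in which the agent picks each command itself, may hand off at any point, and is judged by an independent health check rather than by its own claim of success. The following is a rough sketch under those assumptions. The harness shape, the step budget, and all names are hypothetical, and the `kubectl` strings are passed to a fake executor, not run.

```python
from typing import Callable, Optional

History = list[tuple[str, str]]  # (command, output) pairs seen so far


def run_episode(
    next_command: Callable[[History], Optional[str]],
    execute: Callable[[str], str],
    is_healthy: Callable[[], bool],
    max_steps: int = 20,
) -> str:
    """Drive one incident and return 'resolved' or 'escalated'."""
    history: History = []
    for _ in range(max_steps):
        command = next_command(history)   # the agent decides; no scripted guidance
        if command is None:               # the agent chose to hand off to a human
            return "escalated"
        history.append((command, execute(command)))
        if is_healthy():                  # validate the fix independently
            return "resolved"
    return "escalated"                    # step budget exhausted


# Usage with a scripted stand-in for the agent and a fake cluster.
state = {"healthy": False}
script = iter([
    "kubectl get pods -n payments",
    "kubectl apply -f allow-payments.yaml",  # hypothetical manifest file
])


def scripted_agent(history: History) -> Optional[str]:
    return next(script, None)


def fake_execute(command: str) -> str:
    if command.startswith("kubectl apply"):
        state["healthy"] = True
    return "ok"


outcome = run_episode(scripted_agent, fake_execute, lambda: state["healthy"])
print(outcome)  # resolved
```

Passing the executor and health check in as callables keeps the loop testable without a cluster. A real harness would replace them with sandboxed command execution and actual readiness probes.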
Real-World Example
Scenario: A critical payment processing service is down. Users report "502 Bad Gateway" errors.
Agent Task:
- Check pod status → pods are running but not ready
- Inspect pod logs → "connection refused" errors when connecting to the database
- Check the database service → it lives in a different namespace
- Review network policies → a policy blocks cross-namespace traffic
- Identify the root cause: a recent policy update blocked legitimate traffic
- Propose a fix: add a network policy exception for the payment service
- Apply the fix, validate that the pods become ready, and confirm that the 502 errors have stopped
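The proposed exception could look roughly like the manifest built below. Because Kubernetes NetworkPolicies are additive, an extra allow rule can sit alongside a restrictive policy without editing it. This sketch assumes the block is an ingress restriction on the database side. The namespace names (`payments`, `data`), the pod label, and port 5432 are hypothetical placeholders, as the scenario does not specify them. The `kubernetes.io/metadata.name` label is set automatically on namespaces in current Kubernetes versions.

```python
import json


def allow_ingress_from_namespace(
    name: str,
    namespace: str,
    pod_labels: dict[str, str],
    source_namespace: str,
    port: int,
) -> dict:
    """Build a NetworkPolicy that admits TCP traffic from one namespace to selected pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods this policy protects (here, the database pods).
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Admit any pod in the source namespace.
                    "from": [
                        {
                            "namespaceSelector": {
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": source_namespace
                                }
                            }
                        }
                    ],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# All concrete values below are illustrative assumptions.
policy = allow_ingress_from_namespace(
    name="allow-payments-to-db",
    namespace="data",
    pod_labels={"app": "postgres"},
    source_namespace="payments",
    port=5432,
)

# kubectl accepts JSON manifests, so this output could be saved and applied.
print(json.dumps(policy, indent=2))
```

If the blocking rule were instead an egress restriction in the payment service's own namespace, the exception would need a matching egress rule there.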
Typical frontier model failures:
- Misdiagnoses as application bug (ignores network layer)
- Breaks production by restarting all pods without investigation
- Gets stuck in analysis paralysis, never proposes a fix
- Fixes the symptom (restarts pods) but not the root cause
Related Concepts
ITBench-AA builds on Agent Evaluation, Infrastructure-as-Code, and Observability. It complements JobBench (knowledge work) and SWE-bench (software engineering) by focusing on infrastructure reliability and operations.
Sources
- Research community discussions (2026-05-25)