What is ITBench-AA?

ITBench-AA (IT Benchmark - Autonomous Agents) is an evaluation framework that tests AI agents on complex, multi-step Kubernetes incident response tasks using real-world production infrastructure. It measures their ability to diagnose, troubleshoot, and resolve live system failures.

Why It Matters

Most AI benchmarks test isolated reasoning or coding skills. ITBench-AA evaluates end-to-end operational reliability in high-stakes infrastructure scenarios where mistakes have real costs.

Initial results (May 2026) show:

Claude Opus 4.7: 43% incident resolution rate
GPT-5.5: 38% incident resolution rate
Open-source models: <20% resolution rate

These results suggest that even frontier models struggle with the ambiguous, multi-domain reasoning that production operations require.

How It Works

1. Realistic Incident Scenarios

ITBench-AA presents agents with authentic failure modes:

Application crashes ("Pod repeatedly restarting with OOMKilled status")
Network issues ("Service A can't reach Service B across namespaces")
Resource exhaustion ("Node experiencing disk pressure, evicting pods")
Configuration errors ("Deployment manifest has incorrect environment variable")
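
One way to picture these scenarios is as small records that pair a failure category with the symptom shown to the agent. The sketch below is illustrative only: the names FailureCategory, IncidentScenario, and SCENARIOS are hypothetical and are not taken from ITBench-AA itself; the symptom strings are the examples listed above.

```python
from dataclasses import dataclass
from enum import Enum


class FailureCategory(Enum):
    # Hypothetical labels mirroring the four failure modes listed above.
    APPLICATION_CRASH = "application crash"
    NETWORK_ISSUE = "network issue"
    RESOURCE_EXHAUSTION = "resource exhaustion"
    CONFIGURATION_ERROR = "configuration error"


@dataclass(frozen=True)
class IncidentScenario:
    category: FailureCategory
    symptom: str  # what the agent is told; the root cause is left for it to find


SCENARIOS = [
    IncidentScenario(FailureCategory.APPLICATION_CRASH,
                     "Pod repeatedly restarting with OOMKilled status"),
    IncidentScenario(FailureCategory.NETWORK_ISSUE,
                     "Service A can't reach Service B across namespaces"),
    IncidentScenario(FailureCategory.RESOURCE_EXHAUSTION,
                     "Node experiencing disk pressure, evicting pods"),
    IncidentScenario(FailureCategory.CONFIGURATION_ERROR,
                     "Deployment manifest has incorrect environment variable"),
]
```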

2. Evaluation Criteria

Success requires:

Correct diagnosis (identify root cause, not just symptoms)
Minimal blast radius (don't break other services while investigating)
Complete resolution (system returns to healthy state)
Appropriate tool use (kubectl, logs, metrics, config inspection)
Time efficiency (resolve within SLA constraints)
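
The five criteria can be read as a conjunction: a run counts as resolved only if every one of them holds. The sketch below encodes that reading. It is an assumption about how the criteria combine, not the benchmark's actual scoring code, and RunOutcome, failed_criteria, and is_resolved are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunOutcome:
    # Hypothetical summary of one agent run against one incident.
    root_cause_identified: bool
    other_services_disrupted: bool
    system_healthy_after_fix: bool
    used_appropriate_tools: bool
    elapsed_seconds: float
    sla_seconds: float


def failed_criteria(outcome: RunOutcome) -> List[str]:
    """Return the names of the criteria this run did not meet."""
    checks = {
        "correct diagnosis": outcome.root_cause_identified,
        "minimal blast radius": not outcome.other_services_disrupted,
        "complete resolution": outcome.system_healthy_after_fix,
        "appropriate tool use": outcome.used_appropriate_tools,
        "time efficiency": outcome.elapsed_seconds <= outcome.sla_seconds,
    }
    return [name for name, passed in checks.items() if not passed]


def is_resolved(outcome: RunOutcome) -> bool:
    # Assumed rule: success requires all five criteria at once.
    return not failed_criteria(outcome)
```

Under this reading, a run that restores health by restarting pods without finding the cause would still fail on "correct diagnosis".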

3. Autonomous Operation

Unlike supervised benchmarks, agents must:

Choose which commands to run (no step-by-step guidance)
Interpret ambiguous error messages
Decide when to escalate vs. continue investigating
Validate that their fix actually worked
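
A minimal sketch of such an autonomous loop follows, assuming the agent is a function that looks at the history so far and picks the next action. The function run_episode, the action kinds "run", "fix", and "escalate", and the step budget are all hypothetical; the benchmark's real harness may be structured quite differently.

```python
from typing import Callable, List, Tuple

# An action is (kind, payload); kind is "run", "fix", or "escalate".
Action = Tuple[str, str]
History = List[Tuple[str, str]]  # (command, output) pairs seen so far


def run_episode(
    choose_action: Callable[[History], Action],
    execute: Callable[[str], str],
    is_healthy: Callable[[], bool],
    max_steps: int = 20,
) -> str:
    """Drive one incident until it is resolved, escalated, or out of steps."""
    history: History = []
    for _ in range(max_steps):
        # The agent alone decides what to do next; no step-by-step guidance.
        kind, payload = choose_action(history)
        if kind == "escalate":
            return "escalated"
        output = execute(payload)
        history.append((payload, output))
        # A fix only counts once the system is verified healthy afterwards.
        if kind == "fix" and is_healthy():
            return "resolved"
    return "unresolved"
```

The step budget stands in for the time constraint: an agent that keeps investigating without ever proposing a fix ends the episode as "unresolved".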

Real-World Example

Scenario: A critical payment processing service is down. Users report "502 Bad Gateway" errors.

Agent Task:

Check pod status → discover pods are running but not ready
Inspect pod logs → see "connection refused" errors to the database
Check database service → discover it's in a different namespace
Review network policies → find a policy blocking cross-namespace traffic
Identify root cause: a recent policy update blocked legitimate traffic
Propose fix: add a network policy exception for the payment service
Apply fix, validate that pods become ready, and confirm the 502 errors have stopped
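
The proposed fix could take the form of an additional NetworkPolicy that allows ingress to the database pods from the payment service's namespace. The sketch below builds such a manifest as a Python dictionary. The namespace names "payments" and "databases", the policy name, and the pod label are hypothetical placeholders, since the scenario does not name them, and the sketch assumes the blocking rule is on the ingress side of the database namespace.

```python
import json
from typing import Dict


def allow_ingress_from_namespace(
    name: str,
    target_namespace: str,
    source_namespace: str,
    target_pod_labels: Dict[str, str],
) -> dict:
    """Build a NetworkPolicy allowing ingress from one namespace's pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": target_namespace},
        "spec": {
            # Pods in the target namespace that this policy applies to.
            "podSelector": {"matchLabels": target_pod_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [
                        {
                            # Label that recent Kubernetes versions set
                            # automatically on every namespace.
                            "namespaceSelector": {
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": source_namespace
                                }
                            }
                        }
                    ]
                }
            ],
        },
    }


# All names below are hypothetical placeholders for illustration.
policy = allow_ingress_from_namespace(
    name="allow-payments-to-db",
    target_namespace="databases",
    source_namespace="payments",
    target_pod_labels={"app": "payments-db"},
)
manifest_json = json.dumps(policy, indent=2)
print(manifest_json)
```

Because NetworkPolicies are additive, an allow rule like this widens access without editing the policy that caused the outage. The printed JSON could be piped to kubectl apply -f -, after which the agent would still need to confirm that the pods become ready. If the block came from an egress rule in the payment namespace instead, a matching egress policy would be needed there.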

Typical frontier model failures:

Misdiagnoses as application bug (ignores network layer)
Breaks production by restarting all pods without investigation
Gets stuck in analysis paralysis, never proposes a fix
Fixes the symptom (restarts pods) but not the root cause

Related Concepts

ITBench-AA builds on Agent Evaluation, Infrastructure-as-Code, and Observability. It complements JobBench (knowledge work) and SWE-bench (software engineering) by focusing on infrastructure operations and reliability.

Sources

Research community discussions (2026-05-25)