
What is JobBench?
JobBench is an AI agent evaluation benchmark that tests models on 130 realistic, high-stakes enterprise workflows that people say they most want to delegate. It moves beyond toy academic tasks to measure real-world agent reliability.
Why It Matters
Published in May 2026, JobBench addresses a critical gap: most AI benchmarks evaluate isolated capabilities (math, coding, reasoning) but ignore the workflows humans actually want agents to perform.
Key findings from the initial evaluation:
- Frontier models (GPT-5.5, Claude Opus 4.7) score below 50% on enterprise tasks
- Humans choose what to delegate based on how valuable it is to hand a task off, not on how difficult the task is as a benchmark
- Multi-step coherence matters more than peak intelligence for real work
Together, these findings reveal a dangerous disconnect between agent hype and what agents can actually execute.
How It Works
1. Human-Grounded Task Selection
JobBench derives its 130 tasks from surveys of more than 5,000 knowledge workers, who were asked:
- "What tasks would you most want an AI agent to handle?"
- "What's the cost of failure if the agent gets it wrong?"
Tasks include:
- Schedule a meeting across 5 timezones with calendar conflicts
- Triage customer support tickets by urgency and route to correct team
- Draft a quarterly report by synthesizing data from 6 different spreadsheets
- Research competitive pricing and update internal pricing models
2. Evaluation Methodology
Unlike benchmarks that test single-turn responses, JobBench measures:
- Task completion rate (did the agent fully finish the job?)
- Correctness (did it make errors in the process?)
- Efficiency (how many unnecessary steps did it take?)
- Human delegation willingness (would a human trust this result?)
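As a rough illustration of how these four measures could be computed for a single agent run, here is a minimal sketch. The `StepRecord` and `RunRecord` structures, the field names, and the formulas are all assumptions made for this example; JobBench's actual scoring rules are not described here.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    """One action the agent took (hypothetical log format)."""
    name: str
    required: bool  # was this step needed to finish the task?
    correct: bool   # did the step produce a correct result?


@dataclass
class RunRecord:
    """One agent attempt at one task (hypothetical)."""
    steps: list[StepRecord]
    required_step_names: set[str]  # steps the task cannot be finished without
    reviewer_would_delegate: bool  # a human reviewer's verdict on the result


def score_run(run: RunRecord) -> dict[str, float]:
    """Score one run on the four dimensions listed above (illustrative formulas)."""
    done = {s.name for s in run.steps if s.required and s.correct}
    completed = run.required_step_names <= done  # every required step done correctly
    total = len(run.steps)
    errors = sum(1 for s in run.steps if not s.correct)
    unnecessary = sum(1 for s in run.steps if not s.required)
    return {
        "completed": 1.0 if completed else 0.0,
        "correctness": (1.0 - errors / total) if total else 0.0,
        "efficiency": (1.0 - unnecessary / total) if total else 0.0,
        "delegation_willingness": 1.0 if run.reviewer_would_delegate else 0.0,
    }
```

Keeping the four numbers separate, rather than collapsing them into one score, makes it visible when an agent finishes a task but does so wastefully or in a way a reviewer would not trust.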
3. Real-World Constraints
Tasks include realistic complexity:
- Ambiguous instructions ("prepare for the Q2 review"—what does that mean?)
- Tool failures (API rate limits, authentication errors)
- Conflicting information (two spreadsheets with different revenue numbers)
- Human-in-the-loop requirements (when to ask for clarification vs. proceed)
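The conflicting-information and human-in-the-loop constraints can be pictured as one decision: proceed when sources agree, ask when they do not. The sketch below shows one possible rule. The `reconcile` function, the `Decision` type, the 1% tolerance, and the file names in the usage note are hypothetical and are not part of JobBench.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str  # "proceed" or "ask"
    detail: str  # message for the log or for the human


def reconcile(metric: str, readings: dict[str, float], rel_tol: float = 0.01) -> Decision:
    """Decide whether to proceed or ask, given each source's value for `metric`.

    `readings` maps a source name to the number that source reports.
    """
    if not readings:
        return Decision("ask", f"No source reports {metric}. Where should it come from?")
    low, high = min(readings.values()), max(readings.values())
    scale = max(abs(low), abs(high))
    # Treat the sources as consistent if their spread is within the tolerance.
    if scale == 0 or (high - low) / scale <= rel_tol:
        return Decision("proceed", f"Sources agree on {metric} within {rel_tol:.0%}.")
    listing = ", ".join(f"{name}: {value}" for name, value in sorted(readings.items()))
    return Decision("ask", f"Sources disagree on {metric} ({listing}). Which is authoritative?")
```

For example, `reconcile("Q2 revenue", {"finance.xlsx": 1_200_000.0, "sales.xlsx": 1_350_000.0})` returns an "ask" decision, because the two figures differ by far more than the tolerance.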
Real-World Example
Task: "Schedule next week's product sync"
What the benchmark tests:
1. Agent retrieves calendars for all attendees
2. Identifies free slots that work for everyone
3. Detects that the CEO has a "tentative" hold on one slot
4. Recognizes ambiguity: Does "product sync" mean just PMs or include engineering?
5. Asks clarifying question instead of guessing
6. Creates meeting with correct attendees, agenda, and Zoom link
7. Sends calendar invites
Typical frontier model performance: completes steps 1-3, skips steps 4-5, and creates the meeting with the wrong attendees (40% task success).
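The slot-finding and tentative-hold checks in this example can be sketched in a few lines. This is a simplified illustration, not JobBench's harness: the `Busy` type, the `common_slots` function, the one-hour granularity, and the single shared timezone are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Busy:
    start: int               # hour of day on a 24h clock (one shared timezone assumed)
    end: int                 # exclusive end hour
    tentative: bool = False  # True for a soft hold rather than a firm booking


def common_slots(
    calendars: dict[str, list[Busy]], day_start: int = 9, day_end: int = 17
) -> tuple[list[int], dict[int, list[str]]]:
    """Return (firm, needs_confirmation) for one-hour slots in the working day.

    firm: start hours where nobody has any booking.
    needs_confirmation: start hours blocked only by tentative holds,
    mapped to the people who hold them.
    """
    firm: list[int] = []
    needs_confirmation: dict[int, list[str]] = {}
    for hour in range(day_start, day_end):
        hard_conflict = False
        tentative_holders: list[str] = []
        for person, bookings in calendars.items():
            overlapping = [b for b in bookings if b.start <= hour < b.end]
            if any(not b.tentative for b in overlapping):
                hard_conflict = True
            elif overlapping:
                tentative_holders.append(person)
        if hard_conflict:
            continue
        if tentative_holders:
            needs_confirmation[hour] = tentative_holders
        else:
            firm.append(hour)
    return firm, needs_confirmation
```

Returning tentative slots separately, instead of treating them as either free or busy, lets the agent prefer a firm slot and fall back to asking the holder before booking over a soft hold.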
Related Concepts
JobBench builds on Agent Evaluation, Real-World AI Benchmarks, and Task Delegation. It complements technical benchmarks like ITBench-AA (infrastructure tasks) but focuses on white-collar knowledge work.