What is JobBench?

JobBench is an AI agent evaluation benchmark that tests models against 130 realistic, high-stakes enterprise workflows that humans actually prioritize for delegation. It moves beyond toy academic tasks to measure real-world agent reliability.

Why It Matters

Published in May 2026, JobBench addresses a critical gap: most AI benchmarks evaluate isolated capabilities (math, coding, reasoning) but ignore the workflows humans actually want agents to perform.

Key findings from the initial evaluation:

Frontier models (GPT-5.5, Claude Opus 4.7) score below 50% on enterprise tasks
Humans choose what to delegate based on a task's delegation value, not its benchmark difficulty
Multi-step coherence matters more than peak intelligence for real work

This reveals a dangerous disconnect between agent hype and actual execution capabilities.

How It Works

1. Human-Grounded Task Selection

JobBench derives its 130 tasks from surveys of 5,000+ knowledge workers asked:

"What tasks would you most want an AI agent to handle?"
"What's the cost of failure if the agent gets it wrong?"

Tasks include:

Schedule a meeting across 5 timezones with calendar conflicts
Triage customer support tickets by urgency and route to correct team
Draft a quarterly report by synthesizing data from 6 different spreadsheets
Research competitive pricing and update internal pricing models
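The two survey questions suggest that each candidate task carries two signals: how much people want to hand it off, and how costly a mistake would be. A minimal sketch of how such answers could be turned into a ranked shortlist follows. The selection procedure JobBench actually uses is not described here, so the SurveyedTask fields, the 0-1 scales, and the product scoring rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SurveyedTask:
    """One candidate task with averaged survey answers (hypothetical fields)."""
    description: str
    desire: float        # assumed 0-1: how much workers want an agent to do it
    failure_cost: float  # assumed 0-1: how costly it is if the agent gets it wrong


def rank_by_delegation_value(tasks: list[SurveyedTask], top_k: int) -> list[SurveyedTask]:
    """Sort by an assumed score (desire * failure_cost) and keep the top_k tasks."""
    ranked = sorted(tasks, key=lambda t: t.desire * t.failure_cost, reverse=True)
    return ranked[:top_k]


candidates = [
    SurveyedTask("Schedule a meeting across 5 timezones", desire=0.9, failure_cost=0.2),
    SurveyedTask("Draft a quarterly report from 6 spreadsheets", desire=0.9, failure_cost=0.8),
    SurveyedTask("Triage customer support tickets", desire=0.5, failure_cost=0.5),
]
shortlist = rank_by_delegation_value(candidates, top_k=2)
```

The numbers in candidates are made up. Multiplying the two signals is only one option; a weighted sum would rank tasks differently.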

2. Evaluation Methodology

Unlike benchmarks that test single-turn responses, JobBench measures:

Task completion rate (did the agent fully finish the job?)
Correctness (did it make errors in the process?)
Efficiency (how many unnecessary steps did it take?)
Human delegation willingness (would a human trust this result?)
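The four measurements above can be pictured as aggregates over a batch of agent runs. The following is a rough sketch under stated assumptions, not JobBench's scoring code: the RunRecord fields, the reference step count, and the exact formulas are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent attempt at one task (all field names are hypothetical)."""
    task_id: str
    completed: bool                # did the agent fully finish the job?
    errors: int                    # mistakes made during the process
    steps_taken: int               # actions the agent actually performed
    steps_needed: int              # length of an assumed reference solution
    reviewer_would_delegate: bool  # would a human trust this result?


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate the four measurements over a batch of runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "error_free_rate": sum(r.errors == 0 for r in runs) / n,
        "avg_unnecessary_steps": sum(max(0, r.steps_taken - r.steps_needed) for r in runs) / n,
        "delegation_willingness": sum(r.reviewer_would_delegate for r in runs) / n,
    }


runs = [
    RunRecord("schedule-sync", True, 0, 7, 7, True),
    RunRecord("quarterly-report", False, 2, 10, 6, False),
]
summary = summarize(runs)
```

Keeping the four numbers separate, rather than folding them into one score, makes it visible when an agent finishes tasks but produces results a reviewer would not trust.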

3. Real-World Constraints

Tasks include realistic complexity:

Ambiguous instructions ("prepare for the Q2 review"—what does that mean?)
Tool failures (API rate limits, authentication errors)
Conflicting information (two spreadsheets with different revenue numbers)
Human-in-the-loop requirements (when to ask for clarification vs. proceed)
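The conflicting-information and human-in-the-loop constraints can be illustrated together: when two sources disagree on a figure, a careful agent asks rather than picking one silently. The sketch below assumes a hypothetical reconcile helper and made-up file names; it is not part of any JobBench harness.

```python
def reconcile(values_by_source: dict[str, float], tolerance: float = 0.0) -> tuple[str, object]:
    """Return ("proceed", value) when sources agree, else ("ask", question)."""
    if not values_by_source:
        return ("ask", "No data sources were provided. Which one should I use?")
    lowest = min(values_by_source.values())
    highest = max(values_by_source.values())
    if highest - lowest <= tolerance:
        # Sources agree within tolerance, so it is safe to continue.
        return ("proceed", lowest)
    details = ", ".join(f"{name}={value}" for name, value in sorted(values_by_source.items()))
    return ("ask", f"Revenue figures disagree ({details}). Which source is authoritative?")


agreeing = reconcile({"finance.xlsx": 100.0, "sales.xlsx": 100.0})
conflicting = reconcile({"finance.xlsx": 100.0, "sales.xlsx": 120.0})
```

Here agreeing lets the agent continue with the shared value, while conflicting carries a question to put to the human before any report is drafted.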

Real-World Example

Task: "Schedule next week's product sync"

What the benchmark tests:

1. Agent retrieves calendars for all attendees
2. Identifies free slots that work for everyone
3. Detects that the CEO has a "tentative" hold on one slot
4. Recognizes ambiguity: Does "product sync" mean just PMs or include engineering?
5. Asks clarifying question instead of guessing
6. Creates meeting with correct attendees, agenda, and Zoom link
7. Sends calendar invites

Typical frontier model performance: completes steps 1-3, skips steps 4-5, and creates the meeting with the wrong attendees (40% task success).
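One way to picture how a run like this could be checked is a simple step checklist applied to the agent's action trace. This is a sketch under assumptions: the step identifiers and the grade_trace function are invented for illustration, and the real grading logic may differ.

```python
# Hypothetical identifiers for the seven steps listed above, in order.
REQUIRED_STEPS = [
    "retrieve_calendars",
    "find_common_free_slots",
    "detect_tentative_hold",
    "recognize_ambiguity",
    "ask_clarifying_question",
    "create_meeting",
    "send_invites",
]


def grade_trace(performed: list[str]) -> dict:
    """Report which required steps were skipped and whether the agent guessed."""
    done = set(performed)
    skipped = [number for number, step in enumerate(REQUIRED_STEPS, start=1) if step not in done]
    return {
        "skipped_steps": skipped,
        # Creating the meeting without asking means the attendee list was a guess.
        "guessed_attendees": "create_meeting" in done and "ask_clarifying_question" not in done,
        "passed": not skipped,
    }


# Trace matching the typical failure described above.
typical_trace = [
    "retrieve_calendars",
    "find_common_free_slots",
    "detect_tentative_hold",
    "create_meeting",
    "send_invites",
]
result = grade_trace(typical_trace)
```

For the typical trace, the checklist flags steps 4 and 5 as skipped and marks the attendee list as a guess, which matches the failure mode described in the example.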

Related Concepts

JobBench builds on Agent Evaluation, Real-World AI Benchmarks, and Task Delegation. It complements technical benchmarks like ITBench-AA (infrastructure tasks) but focuses on white-collar knowledge work.

Sources

arXiv: JobBench - Aligning Agent Work With Human Will (2026-05-25)
