Safety & Ethics
Intermediate
2026-W22

What is JobBench?

An AI agent benchmark that tests 130 real enterprise workflows that people actually want to delegate, and reveals that frontier models score below 50% on tasks such as meeting scheduling and report generation.

Also known as:
Job Bench
Human-Grounded Agent Evaluation
AI Intel Pipeline

JobBench is an AI agent evaluation benchmark that tests models against 130 realistic, high-stakes enterprise workflows that people actually prioritize for delegation. It moves beyond toy academic tasks to measure real-world agent reliability.

Why It Matters

Published in May 2026, JobBench addresses a critical gap: most AI benchmarks evaluate isolated capabilities (math, coding, reasoning) but ignore the workflows humans actually want agents to perform.

Key findings from the initial evaluation:

  • Frontier models (GPT-5.5, Claude Opus 4.7) score below 50% on enterprise tasks
  • Humans choose what to delegate based on how valuable it is to hand a task off, not on how difficult the task is for a benchmark
  • Multi-step coherence matters more than peak intelligence for real work

This reveals a dangerous disconnect between agent hype and actual execution capabilities.

How It Works

1. Human-Grounded Task Selection

JobBench derives its 130 tasks from surveys of 5,000+ knowledge workers asked:

  • "What tasks would you most want an AI agent to handle?"
  • "What's the cost of failure if the agent gets it wrong?"

Tasks include:

  • Schedule a meeting across 5 timezones with calendar conflicts
  • Triage customer support tickets by urgency and route to correct team
  • Draft a quarterly report by synthesizing data from 6 different spreadsheets
  • Research competitive pricing and update internal pricing models

2. Evaluation Methodology

Unlike benchmarks that test single-turn responses, JobBench measures:

  • Task completion rate (did the agent fully finish the job?)
  • Correctness (did it make errors in the process?)
  • Efficiency (how many unnecessary steps did it take?)
  • Human delegation willingness (would a human trust this result?)
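
As a rough illustration, the four measurements above could be aggregated over a set of agent runs as follows. This is a minimal sketch, not JobBench's actual scoring code: the `RunResult` record, its field names, and the choice to report efficiency as mean extra steps over a reference solution are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One agent attempt at one task (hypothetical record layout)."""
    task_id: str
    completed: bool       # did the agent fully finish the job?
    errors: int           # mistakes made along the way
    steps_taken: int      # actions the agent actually performed
    steps_needed: int     # length of a reference solution (assumed available)
    would_delegate: bool  # a human reviewer's trust judgment


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Aggregate the four measurements over a set of runs."""
    if not runs:
        raise ValueError("no runs to summarize")
    n = len(runs)
    return {
        "completion_rate": sum(r.completed for r in runs) / n,
        "error_free_rate": sum(r.errors == 0 for r in runs) / n,
        # Steps beyond the reference solution count as unnecessary.
        "mean_extra_steps": sum(max(r.steps_taken - r.steps_needed, 0) for r in runs) / n,
        "delegation_rate": sum(r.would_delegate for r in runs) / n,
    }
```

Keeping the four numbers separate, rather than folding them into one score, makes it visible when an agent finishes tasks but does so with errors or in a way a reviewer would not trust.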

3. Real-World Constraints

Tasks include realistic complexity:

  • Ambiguous instructions ("prepare for the Q2 review": what does that mean?)
  • Tool failures (API rate limits, authentication errors)
  • Conflicting information (two spreadsheets with different revenue numbers)
  • Human-in-the-loop requirements (when to ask for clarification vs. proceed)
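
Two of these constraints lend themselves to a short sketch: retrying a tool call that fails transiently, and deciding whether to proceed or ask when sources disagree. The code below is illustrative only. `ToolError`, `call_with_retry`, and `reconcile` are hypothetical names invented for this example, and nothing here is taken from the benchmark's harness.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ToolError(Exception):
    """Stand-in for a transient tool failure such as a rate limit (hypothetical)."""


def call_with_retry(tool: Callable[[], T], attempts: int = 3, base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `tool`, retrying with exponential backoff when it raises ToolError."""
    for attempt in range(attempts):
        try:
            return tool()
        except ToolError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def reconcile(values: dict[str, float], tolerance: float = 0.0) -> tuple[str, object]:
    """Return ("proceed", value) if all sources agree, else ("ask", question)."""
    if not values:
        raise ValueError("no sources given")
    lo, hi = min(values.values()), max(values.values())
    if hi - lo <= tolerance:
        return "proceed", lo
    details = ", ".join(f"{name}={value}" for name, value in sorted(values.items()))
    return "ask", f"Sources disagree ({details}). Which one should I use?"
```

For the two-spreadsheets case, `reconcile({"a.xlsx": 100.0, "b.xlsx": 120.0})` returns an `"ask"` action with a question for the user instead of silently picking one figure.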

Real-World Example

Task: "Schedule next week's product sync"

What the benchmark tests:

  1. Agent retrieves calendars for all attendees
  2. Identifies free slots that work for everyone
  3. Detects that the CEO has a "tentative" hold on one slot
  4. Recognizes ambiguity: Does "product sync" mean just PMs or include engineering?
  5. Asks clarifying question instead of guessing
  6. Creates meeting with correct attendees, agenda, and Zoom link
  7. Sends calendar invites
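
Steps 1 to 3 can be sketched as an interval-overlap search. The following is a minimal sketch under stated assumptions: the `Busy` record and `free_slots` function are hypothetical, calendars are given as in-memory lists rather than fetched from a real calendar API, and all datetimes are assumed to be in one common timezone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Busy:
    """A calendar entry (hypothetical layout)."""
    start: datetime
    end: datetime
    tentative: bool = False  # e.g. a "tentative" hold


def free_slots(calendars: dict[str, list[Busy]], window_start: datetime,
               window_end: datetime, duration: timedelta,
               step: timedelta = timedelta(minutes=30)) -> list[tuple[datetime, list[str]]]:
    """Return candidate start times with no firm conflict for any attendee.

    Each candidate is paired with the attendees who only have a tentative
    hold in that slot, so the caller can flag it instead of silently booking.
    """
    slots = []
    start = window_start
    while start + duration <= window_end:
        end = start + duration
        firm_conflict = False
        tentative_for = []
        for person, events in calendars.items():
            # Two intervals overlap when each starts before the other ends.
            overlapping = [e for e in events if e.start < end and start < e.end]
            if any(not e.tentative for e in overlapping):
                firm_conflict = True
                break
            if overlapping:
                tentative_for.append(person)
        if not firm_conflict:
            slots.append((start, tentative_for))
        start += step
    return slots
```

Returning the tentative attendees alongside each slot keeps the decision from step 3 explicit: the agent can prefer fully free slots, or raise the tentative hold with the user before booking over it.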

Typical frontier model performance: Completes steps 1-3, skips steps 4-5, and creates the meeting with the wrong attendees (40% task success).

Related Concepts

JobBench builds on Agent Evaluation, Real-World AI Benchmarks, and Task Delegation. It complements technical benchmarks like ITBench-AA (infrastructure tasks) but focuses on white-collar knowledge work.

Sources

  • arXiv: JobBench - Aligning Agent Work With Human Will (2026-05-25)
