AgenticVBench

May 2026

Can AI agents do real-world post-production work?

We gave the 7 best frontier models 100 expert-authored tasks across the four stages of post-production. The best agent barely crosses 30%. Human experts scored 89%.

100 tasks · 20 industry experts · 7 frontier models · 4 task families

Why this benchmark exists

Verification does not come for free.

RLVR (reinforcement learning with verifiable rewards) works in math and code because centuries of humanistic work built the verifiers; the bill was paid before we got there. Creative work hasn't paid that bill. AgenticVBench is what paying it looks like in film.


Rank	Agent	Avg	Repurpose	Seq	Repair	Assembly
·	Human experts (reference)	88.5%	95%	90%	88%	81%
1	GPT-5.5 · Codex	31.0% ± 4.0	30%	26%	30%	38%
2	GPT-5.5 · OpenCode	27.4% ± 3.5	27%	20%	27%	37%
4	Claude Opus 4.7 · Claude Code	22.1% ± 3.5	30%	20%	17%	22%
5	GPT-5.5 · OpenClaw	21.9% ± 2.9	20%	29%	21%	18%
6	Claude Opus 4.7 · OpenClaw	21.1% ± 3.4	18%	19%	25%	22%
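The page does not say how the Avg column is computed. The sketch below checks one plausible reading, an unweighted (macro) mean over the four task families, against the rows shown above. The macro-average interpretation is an assumption, and `macro_average` is a hypothetical helper name; small differences are expected because the per-family scores are displayed as rounded percentages.

```python
# Per-family scores (percent) copied from the leaderboard, in column order:
# Repurpose, Seq, Repair, Assembly. The second element is the reported Avg.
LEADERBOARD = {
    "Human experts (reference)": ([95, 90, 88, 81], 88.5),
    "GPT-5.5 · Codex": ([30, 26, 30, 38], 31.0),
    "GPT-5.5 · OpenCode": ([27, 20, 27, 37], 27.4),
    "Claude Opus 4.7 · Claude Code": ([30, 20, 17, 22], 22.1),
    "GPT-5.5 · OpenClaw": ([20, 29, 21, 18], 21.9),
    "Claude Opus 4.7 · OpenClaw": ([18, 19, 25, 22], 21.1),
}


def macro_average(family_scores):
    """Unweighted mean over task families (assumed, not confirmed by the source)."""
    return sum(family_scores) / len(family_scores)


# Difference between the macro mean of the displayed scores and the reported Avg.
differences = {
    agent: macro_average(scores) - reported
    for agent, (scores, reported) in LEADERBOARD.items()
}

for agent, diff in differences.items():
    scores, reported = LEADERBOARD[agent]
    print(f"{agent:32s} macro={macro_average(scores):6.2f} reported={reported:5.1f} diff={diff:+.2f}")
```

Under this reading, every row lands within half a point of its reported Avg, and the human reference and the top agent match exactly. A mean weighted by task count (36, 28, 18, 18) would give the human reference about 89.8% rather than 88.5%, so the unweighted reading fits the displayed numbers better.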

What the bench tests

Four task families spanning the real-world post-production workflow.

Authored by 20 industry experts averaging 6 years of post-production experience. Tasks span 30 minutes to one week of human work.

Assembly

18 tasks

43 pp gap

Given a storyboard with 3–6 slots and a shuffled pool of candidate clips, select the clip that matches each slot.

Best agent 38% · Human 81%

Repair

18 tasks

58 pp gap

Given a video with defects (frozen scene, scene swap, color drift, or audio noise), localize them and produce a fixed cut.

Best agent 30% · Human 88%

Sequencing

28 tasks

61 pp gap

Given a brief story overview and a shuffled set of clips, recover the correct narrative order.

Best agent 29% · Human 90%

Repurpose

36 tasks

65 pp gap

Given 4–150 minutes of source video and a creative brief, repurpose it into a short deliverable that follows the brief and preserves the story.

Best agent 30% · Human 95%

The harness finding

The harness matters as much as the model.

Holding the model fixed and varying the harness shifts GPT-5.5's Assembly score by 20 percentage points, comparable to the gap between adjacent models on the leaderboard.

Most benchmarks today still report scores by model alone. The data here says that's wrong. Agent performance is determined by both the model and the scaffolding around it. Reporting only the model misses the larger story.

Agent = model × harness.

GPT-5.5 on Assembly · score by harness

Harness	Score
Codex	38%
OpenCode	37%
OpenClaw	18%

Same model. 20-point swing.
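The same swing can be computed for every task family from the three GPT-5.5 rows of the leaderboard. This is a minimal sketch using the displayed (rounded) percentages; `harness_swing` is a hypothetical helper, and "swing" is taken here to mean the best harness score minus the worst for a fixed model.

```python
# GPT-5.5 per-family scores (percent) by harness, copied from the leaderboard.
GPT_5_5 = {
    "Codex": {"Repurpose": 30, "Seq": 26, "Repair": 30, "Assembly": 38},
    "OpenCode": {"Repurpose": 27, "Seq": 20, "Repair": 27, "Assembly": 37},
    "OpenClaw": {"Repurpose": 20, "Seq": 29, "Repair": 21, "Assembly": 18},
}


def harness_swing(scores_by_harness):
    """For each task family, return max minus min score across harnesses."""
    families = next(iter(scores_by_harness.values())).keys()
    swings = {}
    for family in families:
        values = [scores[family] for scores in scores_by_harness.values()]
        swings[family] = max(values) - min(values)
    return swings


swings = harness_swing(GPT_5_5)
for family, points in swings.items():
    print(f"{family:10s} swing = {points} pp")
```

On the displayed numbers, Assembly shows the largest swing (20 points), with Repurpose at 10 and Seq and Repair at 9 each.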
