May 2026
We gave the 7 best frontier models 100 expert-authored tasks across the four stages of post-production. The best agent barely crosses 30%, averaging 31.0%. Human experts average 88.5%.
100 tasks · 20 industry experts · 7 frontier models · 4 task families
Why this benchmark exists
RLVR (reinforcement learning with verifiable rewards) works in math and code because centuries of humanistic work built the verifiers. The bill was paid before we got there. Creative work hasn't paid that bill. AgenticVBench is what paying it looks like in film.
| Rank | Agent | Avg | Repurpose | Sequencing | Repair | Assembly |
|---|---|---|---|---|---|---|
| · | Human experts (reference) | 88.5% | 95% | 90% | 88% | 81% |
| 1 | GPT-5.5 · Codex | 31.0% ± 4.0 | 30% | 26% | 30% | 38% |
| 2 | GPT-5.5 · OpenCode | 27.4% ± 3.5 | 27% | 20% | 27% | 37% |
| 4 | Claude Opus 4.7 · Claude Code | 22.1% ± 3.5 | 30% | 20% | 17% | 22% |
| 5 | GPT-5.5 · OpenClaw | 21.9% ± 2.9 | 20% | 29% | 21% | 18% |
| 6 | Claude Opus 4.7 · OpenClaw | 21.1% ± 3.4 | 18% | 19% | 25% | 22% |
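The table does not say how the Avg column is computed. A quick check, under the assumption that it is the unweighted mean of the four task-family scores, reproduces the reported values to within about half a point (the residual is plausibly rounding in the per-family figures). The sketch below uses only numbers copied from the table; the names `rows` and `macro_average` are illustrative, not part of the benchmark.

```python
# Per-family scores (percent) copied from the leaderboard table.
# Column order: Repurpose, Sequencing, Repair, Assembly.
# Each entry is (reported Avg, [family scores]).
rows = {
    "Human experts": (88.5, [95, 90, 88, 81]),
    "GPT-5.5 · Codex": (31.0, [30, 26, 30, 38]),
    "GPT-5.5 · OpenCode": (27.4, [27, 20, 27, 37]),
    "Claude Opus 4.7 · Claude Code": (22.1, [30, 20, 17, 22]),
    "GPT-5.5 · OpenClaw": (21.9, [20, 29, 21, 18]),
    "Claude Opus 4.7 · OpenClaw": (21.1, [18, 19, 25, 22]),
}


def macro_average(scores):
    """Unweighted mean across task families (an assumed definition of Avg)."""
    return sum(scores) / len(scores)


recomputed = {name: macro_average(scores) for name, (_, scores) in rows.items()}

# Largest disagreement between the reported Avg and the recomputed mean.
max_abs_diff = max(abs(recomputed[name] - rows[name][0]) for name in rows)

for name, (reported, _) in rows.items():
    print(f"{name}: reported {reported:.1f}, recomputed {recomputed[name]:.2f}")
print(f"largest difference: {max_abs_diff:.2f} points")
```

If the assumption holds, each family counts equally toward the average regardless of how many tasks it contains.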
What the bench tests
Tasks were authored by 20 industry experts averaging 6 years of post-production experience. They span 30 minutes to one week of human work.
Assembly
18 tasks · 43 pp gap
Given a storyboard with 3–6 slots and a shuffled pool of candidate clips, select the clip that matches each slot.
Best agent 38% · Human 81%
Repair
18 tasks · 59 pp gap
Given a video with defects (frozen scene, scene swap, color drift, or audio noise), localize them and produce a fixed cut.
Best agent 30% · Human 88%
Sequencing
28 tasks · 61 pp gap
Given a brief story overview and a shuffled set of clips, recover the correct narrative order.
Best agent 29% · Human 90%
Repurpose
36 tasks · 65 pp gap
Given 4–150 minutes of source video and a creative brief, repurpose it into a short deliverable that follows the brief and preserves the story.
Best agent 30% · Human 95%
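The percentage-point gaps on the four cards above can be recomputed as human score minus best-agent score. The sketch below does that from the rounded figures shown on the cards; the dictionary layout and names are illustrative. Recomputing from rounded inputs gives 58 for Repair where the card reports 59, which would be consistent with the published gap being computed from unrounded scores (an assumption, since the unrounded values are not shown).

```python
# Figures copied from the four task-family cards (percent, rounded as published).
families = {
    "Assembly": {"tasks": 18, "best_agent": 38, "human": 81, "reported_gap": 43},
    "Repair": {"tasks": 18, "best_agent": 30, "human": 88, "reported_gap": 59},
    "Sequencing": {"tasks": 28, "best_agent": 29, "human": 90, "reported_gap": 61},
    "Repurpose": {"tasks": 36, "best_agent": 30, "human": 95, "reported_gap": 65},
}

# Gap in percentage points: human score minus best-agent score.
gaps = {name: f["human"] - f["best_agent"] for name, f in families.items()}

# The four families together should account for every task in the benchmark.
total_tasks = sum(f["tasks"] for f in families.values())

for name, f in families.items():
    print(f"{name}: recomputed {gaps[name]} pp, reported {f['reported_gap']} pp")
print(f"total tasks: {total_tasks}")
```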
The harness finding
Holding the model fixed and varying the harness shifts GPT-5.5's Assembly score by 20 percentage points, comparable to the gap between adjacent models on the leaderboard.
Most benchmarks today still rank models, not agents. The data here says that's wrong. Agent performance is determined by both the model and the scaffolding around it. Reporting only the model misses the larger story.
Agent = model × harness.
GPT-5.5 on Assembly · score by harness
Codex: 38%
OpenCode: 37%
OpenClaw: 18%
Same model. 20-point swing.
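The swing quoted above is the difference between the best and worst harness for a fixed model. A minimal sketch of that arithmetic, using the three GPT-5.5 Assembly scores from the chart (the function name `harness_swing` is illustrative):

```python
# GPT-5.5 Assembly scores (percent) by harness, copied from the chart.
assembly_by_harness = {"Codex": 38, "OpenCode": 37, "OpenClaw": 18}


def harness_swing(by_harness):
    """Return (swing in points, best harness, worst harness) for one model."""
    best = max(by_harness, key=by_harness.get)
    worst = min(by_harness, key=by_harness.get)
    return by_harness[best] - by_harness[worst], best, worst


swing, best, worst = harness_swing(assembly_by_harness)
print(f"GPT-5.5 on Assembly: {swing}-point swing ({best} vs {worst})")
```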