Benchmark · Reasoning · Video Generation

WorldReasonBench: Human-Aligned Stress Testing of Video Generators
as Future World-State Predictors

Keming Wu1,* Yijing Cui1,* Wenhan Xue1 Qijie Wang1 Xuan Luo1 Zhiyuan Feng1 Zuhao Yang2 Sudong Wang4 Sicong Jiang5 Zihan Wang5 Ping Nie3 Wenhu Chen3 Bin Wang1,✉
1 Tsinghua University  2 Nanyang Technological University  3 University of Waterloo  4 Hong Kong University of Science and Technology (Guangzhou)  5 2077 AI
*Equal contribution  ✉Corresponding author

Can a video generator reason about how the world should evolve — not just render it? 436 cases · 4 reasoning dimensions · 11 generators · ~6K expert preference pairs

WorldReasonBench overview
Abstract

What this benchmark measures

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into “world simulators.” Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time.

We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? It contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories.

We further release WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation.

436
Test Cases
22
Subcategories
11
Generators
~6K
Preference Pairs
15
Annotators
0.955
ScorePR — Human ρ
01 / Method

Construction & Evaluation Pipelines

A three-stage VLM-assisted construction pipeline produces structured ground-truth QA pairs, and a two-part evaluation methodology turns generated videos into human-aligned scores.

Construction
Data construction pipeline
WorldReasonBench & WorldRewardBench construction. Taxonomy-aware captioning → reasoning-aware prompt generation → structured QA generation, with expert scoring and preference-pair construction for the reward bench.
Evaluation
Evaluation pipeline
Two complementary components. Process-aware Reasoning Verification turns structured QA into reasoning-phase diagnostics; Multi-dimensional Quality Assessment scores each video on reasoning quality, temporal consistency, and visual aesthetics.
ScorePR = AccQA^0.8 · s_dyn^0.2
Process-aware reasoning score — outcome accuracy (AccQA) penalised on dynamic-phase failures via s_dyn.
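The formula above is a weighted geometric mean, so a weak dynamic phase drags the score below even a high QA accuracy. A minimal sketch (the function name and example values are ours, not the paper's):

```python
def score_pr(acc_qa: float, s_dyn: float) -> float:
    """Process-aware reasoning score: AccQA^0.8 * s_dyn^0.2.

    acc_qa -- fraction of ground-truth QA checks the video passes, in [0, 1]
    s_dyn  -- dynamic-phase score penalising mid-trajectory failures, in [0, 1]
    """
    return (acc_qa ** 0.8) * (s_dyn ** 0.2)

# Correct outcome but weak dynamics: the score falls below the raw accuracy.
assert score_pr(0.9, 0.4) < 0.9
# Perfect on both components gives the maximum score of 1.0.
assert score_pr(1.0, 1.0) == 1.0
```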
S(v) = 0.4·s_r + 0.3·s_c + 0.3·s_a
Multi-dimensional quality score — reasoning (s_r), temporal consistency (s_c), and visual aesthetics (s_a).
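The overall quality score is a plain weighted sum of the three sub-scores; a sketch assuming all three are normalised to [0, 1] (the function name is ours):

```python
def quality_score(s_r: float, s_c: float, s_a: float) -> float:
    """S(v) = 0.4*s_r + 0.3*s_c + 0.3*s_a.

    s_r -- reasoning quality, s_c -- temporal consistency,
    s_a -- visual aesthetics; all assumed normalised to [0, 1].
    """
    return 0.4 * s_r + 0.3 * s_c + 0.3 * s_a

# Reasoning carries the largest weight: 0.4*0.5 + 0.3*0.8 + 0.3*0.7 = 0.65
s = quality_score(0.5, 0.8, 0.7)
```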
02 / Leaderboard

Main results across reasoning dimensions

All 11 generators evaluated on a shared evaluation set for fully controlled cross-model comparison. Sort any column, filter by family, or switch the headline metric.

# | Model | Family | Overall | World Knowledge | Human-Centric | Logic Reasoning | Information-Based

Higher is better. Bold marks the best across all 11 models; underlined, the second-best. Source: Table 2 of the paper.

03 / Validation

Metrics align with human ranking

Pairwise expert evaluation gives each model a Human Elo. Our automated ScorePR reproduces the human ranking with absolute rank displacement |Δr| ≤ 1 on 10 of 11 models (raw AccQA manages 8 of 11), while a generic Qwen3.5-Thinking judge drifts by up to four positions.

# | Model | Human Elo | Judge Elo | Judge Rank | AccQA (%) | |Δr| | ScorePR (%) | |Δr| | s_dyn/AccQA
1 | Seedance2.0 | 1471 | 1183 | 3 | 41.2 | 0 | 39.8 | 0 | 0.84
2 | Veo3.1-Fast | 1253 | 1151 | 4 | 36.0 | 0 | 35.3 | 0 | 0.91
3 | Kling | 1240 | 1142 | 5 | 34.0 | 2 | 32.7 | 1 | 0.82
4 | Wan2.6 | 1211 | 1130 | 6 | 34.7 | 0 | 32.4 | 1 | 0.71
5 | Sora2-8s | 1118 | 1222 | 1 | 35.3 | 2 | 34.3 | 2 | 0.86
6 | Sora2-12s | 1109 | 1217 | 2 | 33.5 | 0 | 32.4 | 0 | 0.84
7 | Wan2.2-14B | 953 | 913 | 7 | 19.6 | 2 | 17.5 | 1 | 0.57
8 | HunyuanVideo-1.5 | 911 | 841 | 9 | 20.2 | 1 | 17.9 | 1 | 0.56
9 | LongCat-Video | 904 | 876 | 8 | 19.7 | 1 | 17.4 | 0 | 0.54
10 | UniVideo | 665 | 737 | 11 | 16.2 | 1 | 14.4 | 1 | 0.56
11 | LTX2.3 | 587 | 802 | 10 | 18.5 | 1 | 16.8 | 1 | 0.63

|Δr| legend: 0 = exact match · 1–2 = small drift · 3+ = large drift · bold marks the column best. A dashed line separates closed-source from open-source generators.
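The Human Elo column above is derived from pairwise preferences; the standard base-10 logistic Elo update that produces such ratings can be sketched as follows (the k-factor and model names are our choices, not the paper's):

```python
def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one Elo update for a single pairwise preference."""
    # Expected win probability under the base-10 logistic model.
    expected_win = 1.0 / (1.0 + 10.0 ** ((ratings[loser] - ratings[winner]) / 400.0))
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta   # the winner gains exactly what the loser gives up
    ratings[loser] -= delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, "model_a", "model_b")
# Equal ratings imply expected_win = 0.5, so each side moves by k/2 = 16.
```

Because the update is zero-sum, total rating mass is conserved across the model pool no matter how many comparisons are processed.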

04 / Qualitative

Generated videos across the reasoning taxonomy

Visually plausible generations can still fail process-level world reasoning. Browse representative cases from each dimension — videos coming soon.

Qualitative comparison
Qualitative comparison on representative reasoning cases. Higher-scoring models better preserve the intended state transition and temporal dynamics.
05 / Human study

Expert annotation for WorldRewardBench

Fifteen trained annotators rate each video on reasoning quality, temporal consistency, and visual aesthetics. Their ratings are aggregated into per-video scores and pairwise preferences for reward-model calibration.
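One simple way to turn per-annotator 1–5 ratings into per-video scores and preference pairs is to average the ratings and compare means; a sketch with hypothetical data (this is an illustration, not the paper's aggregation code):

```python
from itertools import combinations
from statistics import mean

def aggregate(ratings: dict) -> tuple:
    """Average each video's annotator ratings, then derive preference pairs.

    ratings maps a video id to its list of 1-5 scores from individual
    annotators. Videos with equal means are ties and emit no pair.
    """
    scores = {video: mean(rs) for video, rs in ratings.items()}
    prefs = []
    for a, b in combinations(scores, 2):
        if scores[a] > scores[b]:
            prefs.append((a, b))   # (preferred, rejected)
        elif scores[b] > scores[a]:
            prefs.append((b, a))
    return scores, prefs

scores, prefs = aggregate({"v1": [4, 5, 4], "v2": [3, 3, 4]})
# v1 averages higher than v2, so a single pair (v1, v2) is emitted.
```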

Annotation platform
Annotation interface: input image, prompt, eight anonymised generations, and a 1–5 rubric over three dimensions.
06 / Reward Models

Reward-model alignment on WorldRewardBench

Pair-wise: agreement (%) with ties (w/) and without ties (w/o). Point-wise: induced pairwise accuracy / Spearman ρ. The open-source Qwen3.5-Thinking matches GPT-5.4 on three of four reasoning dimensions.
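The pair-wise protocol's two agreement numbers differ only in whether human-tie pairs are counted; a sketch with illustrative labels (not the paper's evaluation harness):

```python
def pairwise_agreement(judge: list, human: list, count_ties: bool = True) -> float:
    """Fraction of preference pairs on which the judge matches the human label.

    judge, human -- parallel lists over pairs, labels in {"A", "B", "tie"}.
    With count_ties=False, pairs the humans labelled a tie are dropped first.
    """
    pairs = list(zip(judge, human))
    if not count_ties:
        pairs = [(j, h) for j, h in pairs if h != "tie"]
    return sum(j == h for j, h in pairs) / len(pairs)

judge = ["A", "B", "tie", "A"]
human = ["A", "B", "tie", "B"]
w_ties = pairwise_agreement(judge, human)           # 3 of 4 pairs agree
wo_ties = pairwise_agreement(judge, human, False)   # 2 of 3 after dropping the tie
```

Dropping ties usually raises the agreement number, which is why the w/o-ties columns in the table run several points above their w/-ties counterparts.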

Dimension | Protocol | GPT-5.4 (closed) | Gemini-3.1-Flash (closed) | Qwen3.5-9B Thinking (open) | Qwen3.5-27B Instruct (open) | Qwen3.5-27B Thinking (open) | Qwen3.5-27B Thinking · 4 FPS (open)
Frames used | — | 8 | 1 FPS | ~10 | ~10 | ~10 | 4 FPS
World Knowledge | Pair w/ / w/o | 60.77 / 67.84 | 51.50 / 60.44 | 70.81 / 76.19 | 69.37 / 74.16 | 69.94 / 74.64 | 69.51 / 74.23
World Knowledge | Point Acc / ρ | 54.55 / 0.592 | 59.86 / 0.582 | 60.70 / 0.720 | 54.01 / 0.658 | 60.57 / 0.687 | 62.09 / 0.711
Human-Centric | Pair w/ / w/o | 68.37 / 76.80 | 58.22 / 66.27 | 71.71 / 77.52 | 71.25 / 76.05 | 72.61 / 77.81 | 69.08 / 74.41
Human-Centric | Point Acc / ρ | 59.14 / 0.626 | 60.06 / 0.675 | 59.54 / 0.702 | 55.94 / 0.682 | 62.81 / 0.713 | 60.49 / 0.703
Logic Reasoning | Pair w/ / w/o | 67.41 / 78.43 | 58.23 / 67.68 | 69.33 / 77.13 | 68.46 / 74.51 | 70.16 / 76.23 | 68.53 / 74.97
Logic Reasoning | Point Acc / ρ | 53.42 / 0.523 | 57.65 / 0.562 | 57.50 / 0.617 | 55.71 / 0.573 | 60.17 / 0.606 | 58.40 / 0.597
Information-Based | Pair w/ / w/o | 56.95 / 63.68 | 50.21 / 58.10 | 52.45 / 61.76 | 60.44 / 65.22 | 60.24 / 65.32 | 61.50 / 66.39
Information-Based | Point Acc / ρ | 48.15 / 0.484 | 47.89 / 0.432 | 53.59 / 0.471 | 47.95 / 0.408 | 50.15 / 0.445 | 52.41 / 0.526
Overall | Pair w/ / w/o | 63.04 / 71.36 | 54.39 / 62.99 | 67.14 / 74.35 | 66.89 / 72.07 | 67.74 / 73.05 | 66.90 / 72.30
Overall | Point Acc / ρ | 53.43 / 0.565 | 55.84 / 0.568 | 57.76 / 0.655 | 53.15 / 0.591 | 57.85 / 0.626 | 57.83 / 0.644

Bold best across all six reward models  ·  Underlined second-best. Source: Table 4 of the paper.

Cite

BibTeX

@article{wu2026worldreasonbench,
  title={WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors},
  author={Wu, Keming and Cui, Yijing and Xue, Wenhan and Wang, Qijie and Luo, Xuan and Feng, Zhiyuan and Yang, Zuhao and Wang, Sudong and Jiang, Sicong and Zhu, Haowei and Wang, Zihan and Nie, Ping and Chen, Wenhu and Wang, Bin},
  journal={arXiv preprint arXiv:2605.10434},
  year={2026}
}