Benchmark · Reasoning · Video Generation

WorldReasonBench: Human-Aligned Stress Testing of Video Generators
as Future World-State Predictors

Keming Wu1,* Yijing Cui1,* Wenhan Xue1 Qijie Wang1 Xuan Luo1 Zhiyuan Feng1 Zuhao Yang2 Sudong Wang4 Sicong Jiang5 Zihan Wang5 Ping Nie3 Wenhu Chen3 Bin Wang1,✉
1 Tsinghua University  2 Nanyang Technological University  3 University of Waterloo  4 Hong Kong University of Science and Technology (Guangzhou)  5 2077 AI
*Equal contribution  ✉Corresponding author

Can a video generator reason about how the world should evolve — not just render it? 436 cases · 4 reasoning dimensions · 11 generators · ~6K expert preference pairs

WorldReasonBench overview
Abstract

What this benchmark measures

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into “world simulators.” Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time.

We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? It contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories.

We further release WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation.

436
Test Cases
22
Subcategories
11
Generators
~6K
Preference Pairs
15
Annotators
0.955
ScorePR — Human ρ
01 / Method

Construction & Evaluation Pipelines

A three-stage VLM-assisted construction pipeline produces structured ground-truth QA pairs, and a two-part evaluation methodology turns generated videos into human-aligned scores.

Construction
Data construction pipeline
WorldReasonBench & WorldRewardBench construction. Taxonomy-aware captioning → reasoning-aware prompt generation → structured QA generation, with expert scoring and preference-pair construction for the reward bench.
Evaluation
Evaluation pipeline
Two complementary components. Process-aware Reasoning Verification turns structured QA into reasoning-phase diagnostics; Multi-dimensional Quality Assessment scores each video on reasoning quality, temporal consistency, and visual aesthetics.
ScorePR = AccQA^0.8 · s_dyn^0.2
Process-aware reasoning score — outcome accuracy (AccQA) penalised on dynamic-phase failures via s_dyn.
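The formula above is a weighted geometric mean, so a weak dynamic phase drags the score below even a high QA accuracy. A minimal sketch (the function name and example values are ours, not the paper's):

```python
def score_pr(acc_qa: float, s_dyn: float) -> float:
    """Process-aware reasoning score: AccQA^0.8 * s_dyn^0.2.

    acc_qa -- fraction of ground-truth QA checks the video passes, in [0, 1]
    s_dyn  -- dynamic-phase score penalising mid-trajectory failures, in [0, 1]
    """
    return (acc_qa ** 0.8) * (s_dyn ** 0.2)

# Correct outcome but weak dynamics: the score falls below the raw accuracy.
assert score_pr(0.9, 0.4) < 0.9
# Perfect on both components gives the maximum score of 1.0.
assert score_pr(1.0, 1.0) == 1.0
```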
S(v) = 0.4·s_r + 0.3·s_c + 0.3·s_a
Multi-dimensional quality score — reasoning (s_r), temporal consistency (s_c), and visual aesthetics (s_a).
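The overall quality score is a plain weighted sum of the three sub-scores; a sketch assuming all three are normalised to [0, 1] (the function name is ours):

```python
def quality_score(s_r: float, s_c: float, s_a: float) -> float:
    """S(v) = 0.4*s_r + 0.3*s_c + 0.3*s_a.

    s_r -- reasoning quality, s_c -- temporal consistency,
    s_a -- visual aesthetics; all assumed normalised to [0, 1].
    """
    return 0.4 * s_r + 0.3 * s_c + 0.3 * s_a

# Reasoning carries the largest weight: 0.4*0.5 + 0.3*0.8 + 0.3*0.7 = 0.65
s = quality_score(0.5, 0.8, 0.7)
```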
02 / Leaderboard

Main results across reasoning dimensions

All 11 generators evaluated on a shared evaluation set for fully controlled cross-model comparison. Sort any column, filter by family, or switch the headline metric.

# | Model | Family | Overall | World Knowledge | Human-Centric | Logic Reasoning | Information-Based

Higher is better. Bold marks the best across all 11 models; underlined, the second-best. Source: Table 2 of the paper.

03 / Validation

Metrics align with human ranking

Pairwise expert evaluation gives each model a Human Elo. Our automated ScorePR reproduces the human ranking with absolute rank displacement |Δr| ≤ 1 on 10 of 11 models (raw AccQA manages 8 of 11), while a generic Qwen3.5-Thinking judge drifts by up to four positions.

# | Model | Human Elo | Judge Elo | Judge Rank | AccQA (%) | |Δr| | ScorePR (%) | |Δr| | s_dyn/AccQA
1 | Seedance2.0 | 1471 | 1183 | 3 | 41.2 | 0 | 39.8 | 0 | 0.84
2 | Veo3.1-Fast | 1253 | 1151 | 4 | 36.0 | 0 | 35.3 | 0 | 0.91
3 | Kling | 1240 | 1142 | 5 | 34.0 | 2 | 32.7 | 1 | 0.82
4 | Wan2.6 | 1211 | 1130 | 6 | 34.7 | 0 | 32.4 | 1 | 0.71
5 | Sora2-8s | 1118 | 1222 | 1 | 35.3 | 2 | 34.3 | 2 | 0.86
6 | Sora2-12s | 1109 | 1217 | 2 | 33.5 | 0 | 32.4 | 0 | 0.84
7 | Wan2.2-14B | 953 | 913 | 7 | 19.6 | 2 | 17.5 | 1 | 0.57
8 | HunyuanVideo-1.5 | 911 | 841 | 9 | 20.2 | 1 | 17.9 | 1 | 0.56
9 | LongCat-Video | 904 | 876 | 8 | 19.7 | 1 | 17.4 | 0 | 0.54
10 | UniVideo | 665 | 737 | 11 | 16.2 | 1 | 14.4 | 1 | 0.56
11 | LTX2.3 | 587 | 802 | 10 | 18.5 | 1 | 16.8 | 1 | 0.63

|Δr| legend: 0 = exact match · 1–2 = small drift · 3+ = large drift · bold marks the column best. A dashed line separates closed-source from open-source generators.
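The Human Elo column above is derived from pairwise preferences; the standard base-10 logistic Elo update that produces such ratings can be sketched as follows (the k-factor and model names are our choices, not the paper's):

```python
def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one Elo update for a single pairwise preference."""
    # Expected win probability under the base-10 logistic model.
    expected_win = 1.0 / (1.0 + 10.0 ** ((ratings[loser] - ratings[winner]) / 400.0))
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta   # the winner gains exactly what the loser gives up
    ratings[loser] -= delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, "model_a", "model_b")
# Equal ratings imply expected_win = 0.5, so each side moves by k/2 = 16.
```

Because the update is zero-sum, total rating mass is conserved across the model pool no matter how many comparisons are processed.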

04 / Qualitative

Generated videos across the reasoning taxonomy

Visually plausible generations can still fail process-level world reasoning. Browse representative cases from each dimension — videos coming soon.

Qualitative comparison
Qualitative comparison on representative reasoning cases. Higher-scoring models better preserve the intended state transition and temporal dynamics.
05 / Human study

Expert annotation for WorldRewardBench

Fifteen trained annotators rate each video on reasoning quality, temporal consistency, and visual aesthetics. Their ratings are aggregated into per-video scores and pairwise preferences for reward-model calibration.
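One simple way to turn per-annotator 1–5 ratings into per-video scores and preference pairs is to average the ratings and compare means; a sketch with hypothetical data (this is an illustration, not the paper's aggregation code):

```python
from itertools import combinations
from statistics import mean

def aggregate(ratings: dict) -> tuple:
    """Average each video's annotator ratings, then derive preference pairs.

    ratings maps a video id to its list of 1-5 scores from individual
    annotators. Videos with equal means are ties and emit no pair.
    """
    scores = {video: mean(rs) for video, rs in ratings.items()}
    prefs = []
    for a, b in combinations(scores, 2):
        if scores[a] > scores[b]:
            prefs.append((a, b))   # (preferred, rejected)
        elif scores[b] > scores[a]:
            prefs.append((b, a))
    return scores, prefs

scores, prefs = aggregate({"v1": [4, 5, 4], "v2": [3, 3, 4]})
# v1 averages higher than v2, so a single pair (v1, v2) is emitted.
```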

Annotation platform
Annotation interface: input image, prompt, eight anonymised generations, and a 1–5 rubric over three dimensions.
06 / Reward Models

Reward-model alignment on WorldRewardBench

Pair-wise: agreement (%) with ties (w/) and without ties (w/o). Point-wise: induced pairwise accuracy / Spearman ρ. The open-source Qwen3.5-Thinking matches GPT-5.4 on three of four reasoning dimensions.
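The pair-wise protocol's two agreement numbers differ only in whether human-tie pairs are counted; a sketch with illustrative labels (not the paper's evaluation harness):

```python
def pairwise_agreement(judge: list, human: list, count_ties: bool = True) -> float:
    """Fraction of preference pairs on which the judge matches the human label.

    judge, human -- parallel lists over pairs, labels in {"A", "B", "tie"}.
    With count_ties=False, pairs the humans labelled a tie are dropped first.
    """
    pairs = list(zip(judge, human))
    if not count_ties:
        pairs = [(j, h) for j, h in pairs if h != "tie"]
    return sum(j == h for j, h in pairs) / len(pairs)

judge = ["A", "B", "tie", "A"]
human = ["A", "B", "tie", "B"]
w_ties = pairwise_agreement(judge, human)           # 3 of 4 pairs agree
wo_ties = pairwise_agreement(judge, human, False)   # 2 of 3 after dropping the tie
```

Dropping ties usually raises the agreement number, which is why the w/o-ties columns in the table run several points above their w/-ties counterparts.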

Dimension | Protocol | GPT-5.4 (closed) | Gemini-3.1-Flash (closed) | Qwen3.5-9B Thinking (open) | Qwen3.5-27B Instruct (open) | Qwen3.5-27B Thinking (open) | Qwen3.5-27B Thinking · 4 FPS (open)
Frames used | — | 8 | 1 FPS | ~10 | ~10 | ~10 | 4 FPS
World Knowledge | Pair w/ / w/o | 60.77 / 67.84 | 51.50 / 60.44 | 70.81 / 76.19 | 69.37 / 74.16 | 69.94 / 74.64 | 69.51 / 74.23
World Knowledge | Point Acc / ρ | 54.55 / 0.592 | 59.86 / 0.582 | 60.70 / 0.720 | 54.01 / 0.658 | 60.57 / 0.687 | 62.09 / 0.711
Human-Centric | Pair w/ / w/o | 68.37 / 76.80 | 58.22 / 66.27 | 71.71 / 77.52 | 71.25 / 76.05 | 72.61 / 77.81 | 69.08 / 74.41
Human-Centric | Point Acc / ρ | 59.14 / 0.626 | 60.06 / 0.675 | 59.54 / 0.702 | 55.94 / 0.682 | 62.81 / 0.713 | 60.49 / 0.703
Logic Reasoning | Pair w/ / w/o | 67.41 / 78.43 | 58.23 / 67.68 | 69.33 / 77.13 | 68.46 / 74.51 | 70.16 / 76.23 | 68.53 / 74.97
Logic Reasoning | Point Acc / ρ | 53.42 / 0.523 | 57.65 / 0.562 | 57.50 / 0.617 | 55.71 / 0.573 | 60.17 / 0.606 | 58.40 / 0.597
Information-Based | Pair w/ / w/o | 56.95 / 63.68 | 50.21 / 58.10 | 52.45 / 61.76 | 60.44 / 65.22 | 60.24 / 65.32 | 61.50 / 66.39
Information-Based | Point Acc / ρ | 48.15 / 0.484 | 47.89 / 0.432 | 53.59 / 0.471 | 47.95 / 0.408 | 50.15 / 0.445 | 52.41 / 0.526
Overall | Pair w/ / w/o | 63.04 / 71.36 | 54.39 / 62.99 | 67.14 / 74.35 | 66.89 / 72.07 | 67.74 / 73.05 | 66.90 / 72.30
Overall | Point Acc / ρ | 53.43 / 0.565 | 55.84 / 0.568 | 57.76 / 0.655 | 53.15 / 0.591 | 57.85 / 0.626 | 57.83 / 0.644

Bold best across all six reward models  ·  Underlined second-best. Source: Table 4 of the paper.

Cite

BibTeX

@article{wu2026worldreasonbench,
  title={WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors},
  author={Wu, Keming and Cui, Yijing and Xue, Wenhan and Wang, Qijie and Luo, Xuan and Feng, Zhiyuan and Yang, Zuhao and Wang, Sudong and Jiang, Sicong and Zhu, Haowei and Wang, Zihan and Nie, Ping and Chen, Wenhu and Wang, Bin},
  journal={arXiv preprint arXiv:2605.10434},
  year={2026}
}