Can a video generator reason about how the world should evolve — not just render it?
436 cases · 4 reasoning dimensions · 11 generators · ~6K expert preference pairs
Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into “world simulators.” Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time.
We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? It contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories.
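To make the task format concrete, here is a minimal sketch of how one such test case could be represented in Python. The `WorldReasonCase` name and its fields are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class WorldReasonCase:
    """One test case: an observed initial state, an action, and the
    structured QA that pins down how the world should evolve.
    (Illustrative layout; field names are assumptions.)"""
    case_id: str
    dimension: str       # "world_knowledge" | "human_centric" |
                         # "logic_reasoning" | "information_based"
    subcategory: str     # one of the 22 subcategories
    initial_state: str   # path to the conditioning image / first frame
    action_prompt: str   # the action applied to the observed world
    qa_pairs: list[dict] = field(default_factory=list)
    # Each QA pair checks one aspect of the expected state evolution, e.g.
    # {"question": "Does the water level rise after the stone is dropped?",
    #  "answer": "yes"}
```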
We further release WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation.
A three-stage VLM-assisted construction pipeline produces structured ground-truth QA pairs, and a two-part evaluation methodology turns generated videos into human-aligned scores.
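As a rough illustration of the QA half of that methodology, the sketch below scores a generated video by the fraction of ground-truth QA pairs a VLM judge answers correctly. The `judge` callable, the `acc_qa` name, and exact-match answer comparison are simplifying assumptions rather than the paper's exact protocol.

```python
def acc_qa(judge, video_frames, qa_pairs):
    """Fraction of ground-truth QA pairs a VLM judge answers correctly
    from the generated video (an AccQA-style accuracy).

    `judge` is assumed to be any callable mapping (frames, question)
    to a short answer string; exact-match comparison is a simplification.
    """
    correct = sum(
        judge(video_frames, qa["question"]).strip().lower()
        == qa["answer"].strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)
```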
All 11 generators are evaluated on a shared evaluation set for fully controlled cross-model comparison.
| # | Model | Family | Overall | World Knowledge | Human-Centric | Logic Reasoning | Information-Based |
|---|---|---|---|---|---|---|---|
Higher is better. Bold marks the best across all 11 models; underline marks second-best. Source: Table 2 of the paper.
Pairwise expert evaluation gives each model a Human Elo. Our automated metrics track the human ranking closely: AccQA stays within an absolute rank displacement |Δr| ≤ 1 on 8 of 11 models and ScorePR on 10 of 11, while a generic Qwen3.5-Thinking judge drifts by up to four positions (a sketch of the Elo and |Δr| bookkeeping follows the table).
| # | Model | Human Elo | Judge Elo | Judge Rank | AccQA (%) | \|Δr\| | ScorePR (%) | \|Δr\| | s_dyn/AccQA |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Seedance2.0 | 1471 | 1183 | 3 | 41.2 | 0 | 39.8 | 0 | 0.84 |
| 2 | Veo3.1-Fast | 1253 | 1151 | 4 | 36.0 | 0 | 35.3 | 0 | 0.91 |
| 3 | Kling | 1240 | 1142 | 5 | 34.0 | 2 | 32.7 | 1 | 0.82 |
| 4 | Wan2.6 | 1211 | 1130 | 6 | 34.7 | 0 | 32.4 | 1 | 0.71 |
| 5 | Sora2-8s | 1118 | 1222 | 1 | 35.3 | 2 | 34.3 | 2 | 0.86 |
| 6 | Sora2-12s | 1109 | 1217 | 2 | 33.5 | 0 | 32.4 | 0 | 0.84 |
| 7 | Wan2.2-14B | 953 | 913 | 7 | 19.6 | 2 | 17.5 | 1 | 0.57 |
| 8 | HunyuanVideo-1.5 | 911 | 841 | 9 | 20.2 | 1 | 17.9 | 1 | 0.56 |
| 9 | LongCat-Video | 904 | 876 | 8 | 19.7 | 1 | 17.4 | 0 | 0.54 |
| 10 | UniVideo | 665 | 737 | 11 | 16.2 | 1 | 14.4 | 1 | 0.56 |
| 11 | LTX2.3 | 587 | 802 | 10 | 18.5 | 1 | 16.8 | 1 | 0.63 |
|Δr|: 0 = exact match, 1–2 = small drift, 3+ = large drift. Bold marks the best in each column. Rows 1–6 are closed-source generators; rows 7–11 are open-source.
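For readers who want the bookkeeping behind this table, the sketch below implements a standard sequential Elo update over preference pairs and the per-model rank displacement |Δr|. The update constants (K = 32, base rating 1000) and the sequential fitting are conventional choices, not necessarily the paper's exact procedure.

```python
def elo_ratings(matches, k=32.0, base=1000.0):
    """Standard sequential Elo over preference pairs.
    `matches` is an iterable of (winner_model, loser_model)."""
    ratings = {}
    for winner, loser in matches:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        # Expected score of the winner under the logistic Elo model.
        expected_w = 1.0 / (1.0 + 10.0 ** ((rl - rw) / 400.0))
        ratings[winner] = rw + k * (1.0 - expected_w)
        ratings[loser] = rl - k * (1.0 - expected_w)
    return ratings

def rank_displacement(human_scores, metric_scores):
    """Per-model |Δr|: how far a model's rank under an automated metric
    sits from its rank under human scores (higher score = rank 1).
    Ties break arbitrarily in this sketch."""
    def ranks(scores):
        order = sorted(scores, key=scores.get, reverse=True)
        return {m: i + 1 for i, m in enumerate(order)}
    hr, mr = ranks(human_scores), ranks(metric_scores)
    return {m: abs(hr[m] - mr[m]) for m in hr}
```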
Visually plausible generations can still fail process-level world reasoning. Browse representative cases from each dimension — videos coming soon.
Fifteen trained annotators rate each video on reasoning quality, temporal consistency, and visual aesthetics; ratings are aggregated into per-video scores and pairwise preferences for reward-model calibration.
Pair-wise protocol: agreement (%) with ties / without ties. Point-wise protocol: induced pairwise accuracy / Spearman ρ (both are sketched in code after the table). The open Qwen3.5-Thinking matches GPT-5.4 on three of four reasoning dimensions.
| Dimension | Protocol | GPT-5.4 (closed) | Gemini-3.1-Flash (closed) | Qwen3.5-9B Thinking (open) | Qwen3.5-27B Instruct (open) | Qwen3.5-27B Thinking (open) | Qwen3.5-27B Thinking · 4 FPS (open) |
|---|---|---|---|---|---|---|---|
| Frames used | | 8 | 1 FPS | ~10 | ~10 | ~10 | 4 FPS |
| World Knowledge | Pair w/ / w/o | 60.77 / 67.84 | 51.50 / 60.44 | 70.81 / 76.19 | 69.37 / 74.16 | 69.94 / 74.64 | 69.51 / 74.23 |
| | Point Acc / ρ | 54.55 / 0.592 | 59.86 / 0.582 | 60.70 / 0.720 | 54.01 / 0.658 | 60.57 / 0.687 | 62.09 / 0.711 |
| Human-Centric | Pair w/ / w/o | 68.37 / 76.80 | 58.22 / 66.27 | 71.71 / 77.52 | 71.25 / 76.05 | 72.61 / 77.81 | 69.08 / 74.41 |
| | Point Acc / ρ | 59.14 / 0.626 | 60.06 / 0.675 | 59.54 / 0.702 | 55.94 / 0.682 | 62.81 / 0.713 | 60.49 / 0.703 |
| Logic Reasoning | Pair w/ / w/o | 67.41 / 78.43 | 58.23 / 67.68 | 69.33 / 77.13 | 68.46 / 74.51 | 70.16 / 76.23 | 68.53 / 74.97 |
| | Point Acc / ρ | 53.42 / 0.523 | 57.65 / 0.562 | 57.50 / 0.617 | 55.71 / 0.573 | 60.17 / 0.606 | 58.40 / 0.597 |
| Information-Based | Pair w/ / w/o | 56.95 / 63.68 | 50.21 / 58.10 | 52.45 / 61.76 | 60.44 / 65.22 | 60.24 / 65.32 | 61.50 / 66.39 |
| | Point Acc / ρ | 48.15 / 0.484 | 47.89 / 0.432 | 53.59 / 0.471 | 47.95 / 0.408 | 50.15 / 0.445 | 52.41 / 0.526 |
| Overall | Pair w/ / w/o | 63.04 / 71.36 | 54.39 / 62.99 | 67.14 / 74.35 | 66.89 / 72.07 | 67.74 / 73.05 | 66.90 / 72.30 |
| | Point Acc / ρ | 53.43 / 0.565 | 55.84 / 0.568 | 57.76 / 0.655 | 53.15 / 0.591 | 57.85 / 0.626 | 57.83 / 0.644 |
Bold marks the best across all six reward models; underline marks second-best. Source: Table 4 of the paper.
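The two protocols in this table reduce to a few lines of code. The sketch below assumes preference labels stored as dicts keyed by video pair (values 'a', 'b', or 'tie') and per-video scores stored as dicts; that layout is an illustrative assumption, not the released format.

```python
from itertools import combinations
from scipy.stats import spearmanr

def pairwise_agreement(human_prefs, model_prefs, drop_ties=False):
    """Agreement (%) between human and reward-model preferences.
    With drop_ties=True, pairs the humans marked 'tie' are excluded
    (the 'w/o Ties' column)."""
    keys = [k for k in human_prefs
            if not (drop_ties and human_prefs[k] == "tie")]
    hits = sum(human_prefs[k] == model_prefs[k] for k in keys)
    return 100.0 * hits / len(keys)

def pointwise_eval(human_scores, model_scores):
    """Point-wise protocol: induced pairwise accuracy (does the model
    order each human-distinguished pair the same way?) and Spearman rho
    against per-video human scores."""
    videos = list(human_scores)
    hits, total = 0, 0
    for a, b in combinations(videos, 2):
        if human_scores[a] == human_scores[b]:
            continue  # only pairs the humans actually ordered count
        total += 1
        hits += int((human_scores[a] - human_scores[b])
                    * (model_scores[a] - model_scores[b]) > 0)
    rho, _ = spearmanr([human_scores[v] for v in videos],
                       [model_scores[v] for v in videos])
    return 100.0 * hits / total, rho
```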
```bibtex
@article{wu2026worldreasonbench,
  title={WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors},
  author={Wu, Keming and Cui, Yijing and Xue, Wenhan and Wang, Qijie and Luo, Xuan and Feng, Zhiyuan and Yang, Zuhao and Wang, Sudong and Jiang, Sicong and Zhu, Haowei and Wang, Zihan and Nie, Ping and Chen, Wenhu and Wang, Bin},
  journal={arXiv preprint arXiv:2605.10434},
  year={2026}
}
```