- Introduction
- Why Video Future Prediction?
- Dataset & Download
- Data Construction
- On-Going Work
- Citation
Recent Large Reasoning Models (LRMs) such as DeepSeek-R1 have demonstrated impressive reasoning abilities; however, these abilities are largely confined to text. By reasoning over text alone, current models capture only a small part of the rich multimodal information, such as what we see and hear, that humans naturally draw on, and this limits progress toward AGI.
To advance multimodal reasoning, we introduce a future prediction task and a corresponding dataset. Predicting the future is a highly desirable ability, yet forecasting upcoming events from previously observed video remains a significant challenge for current Multi-modal Large Models (MLMs). Our task requires these models to infer future events from the first part of a video, while the second part serves as open-ended ground truth for evaluation.
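The observe-then-predict setup can be illustrated with a minimal sketch that cuts a video's frame sequence into an observed prefix and a held-out suffix. This is not the dataset's construction code: the `split_video` function, the `FuturePredictionSample` container, and the `split_ratio` parameter are all hypothetical, and the real dataset may place the cut differently or represent the ground truth as an annotation rather than raw frames.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class FuturePredictionSample:
    # Hypothetical container: frames the model is allowed to observe.
    context: List[Any]
    # Hypothetical container: held-out frames standing in for the ground truth.
    future: List[Any]


def split_video(frames: Sequence[Any], split_ratio: float = 0.5) -> FuturePredictionSample:
    """Split a frame sequence into an observed first part and a held-out second part.

    split_ratio is an assumed parameter; the dataset does not state where videos are cut.
    """
    if not 0.0 < split_ratio < 1.0:
        raise ValueError("split_ratio must be strictly between 0 and 1")
    if len(frames) < 2:
        raise ValueError("need at least two frames to form a context and a future")
    cut = int(len(frames) * split_ratio)
    # Clamp the cut so that both parts always contain at least one frame.
    cut = min(max(cut, 1), len(frames) - 1)
    return FuturePredictionSample(context=list(frames[:cut]), future=list(frames[cut:]))
```

For example, `split_video(list(range(10)), 0.5)` keeps frames 0 to 4 as the model's input and holds out frames 5 to 9 for evaluation. Clamping the cut guarantees that a model always has some context and that there is always something left to predict.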
🤔