
V1: Toward Multimodal Reasoning by Designing Auxiliary Tasks

🚀 Toward Multimodal Reasoning via an Unsupervised Task -- Future Prediction 🌟

Github | Notion | Twitter | Hugging Face Collection

Multimodal Reasoning

Recent Large Reasoning Models (LRMs) such as DeepSeek-R1 have demonstrated impressive reasoning abilities; however, they operate on text alone. As a result, current models capture only a small part of the rich multimodal information that humans naturally use, which limits progress toward AGI.

Future Prediction Task and Dataset

To advance multimodal reasoning, we introduce a future prediction task and its corresponding dataset. Predicting the future is a deeply desired ability, yet forecasting upcoming events from historical video data remains challenging for current Multimodal Large Models (MLMs). Our task pushes these models to infer future events from the first part of a video, with the second part serving as open-ended ground truth for evaluation, as sketched below.
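To make the setup concrete, here is a minimal sketch of how a future-prediction evaluation example could be assembled: a video's frames are split into an observed prefix and a held-out continuation, and the model is prompted to describe what happens next. The field names, split ratio, and prompt wording are illustrative assumptions, not the repository's actual data schema.

```python
# A minimal sketch of a future-prediction example, under assumed field names.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FuturePredictionExample:
    video_id: str
    observed_frames: List[str]   # frame paths from the first part of the video
    future_summary: str          # open-ended ground-truth description of the second part


def split_video_frames(frames: List[str], observed_ratio: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split a frame sequence into an observed prefix and a held-out future part."""
    cut = max(1, int(len(frames) * observed_ratio))
    return frames[:cut], frames[cut:]


def build_prompt(example: FuturePredictionExample) -> str:
    """Turn the observed prefix into an open-ended prediction query for an MLM."""
    return (
        f"You are shown the first part of video {example.video_id} "
        f"({len(example.observed_frames)} frames). "
        "Describe the events most likely to happen next."
    )


if __name__ == "__main__":
    frames = [f"frame_{i:04d}.jpg" for i in range(16)]
    observed, future = split_video_frames(frames)
    ex = FuturePredictionExample("demo_clip", observed, "the cyclist turns left at the junction")
    print(build_prompt(ex))
```

Because the ground truth is an open-ended description of the held-out continuation, evaluation compares the model's predicted events against that reference rather than against a fixed label set.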

