
V1: Toward Multimodal Reasoning by Designing Auxiliary Tasks

🚀 Toward Multimodal Reasoning via an Unsupervised Task -- Future Prediction 🌟

Github | Notion | Twitter | Hugging Face Collection

Multimodal Reasoning

Recent Large Reasoning Models (LRMs) such as DeepSeek-R1 have demonstrated impressive reasoning abilities; however, they operate on text alone. As a result, current models capture only a small part of the rich multimodal information that humans naturally use, which limits progress toward AGI.

Future Prediction Task and Dataset

To advance multimodal reasoning, we introduce a future prediction task and its corresponding dataset. Predicting the future is a deeply desired ability, yet forecasting upcoming events from historical video data remains challenging for current Multimodal Large Models (MLMs). Our task pushes these models to infer future events from the first part of a video, with the second part serving as open-ended ground truth for evaluation, as sketched below.
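To make the setup concrete, here is a minimal sketch of how a future-prediction evaluation example could be assembled: a video's frames are split into an observed prefix and a held-out continuation, and the model is prompted to describe what happens next. The field names, split ratio, and prompt wording are illustrative assumptions, not the repository's actual data schema.

```python
# A minimal sketch of a future-prediction example, under assumed field names.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FuturePredictionExample:
    video_id: str
    observed_frames: List[str]   # frame paths from the first part of the video
    future_summary: str          # open-ended ground-truth description of the second part


def split_video_frames(frames: List[str], observed_ratio: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split a frame sequence into an observed prefix and a held-out future part."""
    cut = max(1, int(len(frames) * observed_ratio))
    return frames[:cut], frames[cut:]


def build_prompt(example: FuturePredictionExample) -> str:
    """Turn the observed prefix into an open-ended prediction query for an MLM."""
    return (
        f"You are shown the first part of video {example.video_id} "
        f"({len(example.observed_frames)} frames). "
        "Describe the events most likely to happen next."
    )


if __name__ == "__main__":
    frames = [f"frame_{i:04d}.jpg" for i in range(16)]
    observed, future = split_video_frames(frames)
    ex = FuturePredictionExample("demo_clip", observed, "the cyclist turns left at the junction")
    print(build_prompt(ex))
```

Because the ground truth is an open-ended description of the held-out continuation, evaluation compares the model's predicted events against that reference rather than against a fixed label set.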

