MindCube: Spatial Mental Modeling from Limited Views

Key Takeaway: Guiding VLMs to first generate cognitive maps and then reason over them is an effective way to approximate spatial mental modeling from limited views.



Overview: Spatial Mental Modeling Challenge · Three Cognitive Scaffold Data Structures · Elicitation Methods and Performance Overview

Baiqiao Yin1,3*, Qineng Wang1*‡, Pingyue Zhang1, Jianshu Zhang1

Kangrui Wang1, Zihan Wang1, Jieyu Zhang4, Keshigeyan Chandrasegaran2

Han Liu1, Ranjay Krishna4, Saining Xie3

Manling Li1†, Jiajun Wu2†, Li Fei-Fei2†

*Equal contribution, ‡Project lead, †Equal advising

1Northwestern University, 2Stanford University, 3New York University, 4University of Washington

Dataset Viewer

Explore the MindCube-tinybench version of our dataset.

  • Total samples: 21K
  • Settings: 3
  • Views per sample: 2-4
  • Annotated: 100%

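For quick exploration, a minimal loading sketch with the Hugging Face datasets library is shown below; the repository id and field names are illustrative assumptions, not the dataset's actual identifiers.

from datasets import load_dataset

# The repository id and field names below are assumptions for
# illustration; check the project page for the actual identifiers.
ds = load_dataset("MindCube/MindCube-tinybench", split="test")

sample = ds[0]
print(sample["question"])  # spatial question over the views (assumed field)
print(sample["answer"])    # annotated ground-truth option (assumed field)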

Leaderboard

Performance comparison of different models and approaches on our spatial reasoning benchmark.


Method                          Overall   Rotation   Among   Around

Baseline
Random (chance)                 32.35     36.36      32.29   30.66
Random (frequency)              33.02     38.30      32.66   35.79

Open-Weight Multi-Image Models
LLaVA-Onevision-7B              47.43     36.45      48.42   44.09
DeepSeek-VL2-Small              47.62     37.00      50.38   26.91
Gemma-3-12B-it                  46.67     38.39      48.38   34.63
mPLUG-Owl3-7B-241101            44.85     37.84      47.11   26.91
LLaVA-Video-Qwen-7B             41.96     35.71      43.55   30.12
Mantis-8B (SigLip)              41.05     37.65      40.23   50.99
Idefics-8B-Llama3               35.86     35.15      35.94   35.49
Qwen2.5-VL-3B-Instruct          33.21     37.37      33.26   30.34
LongVA-7B                       29.46     35.89      29.55   24.88
Qwen2.5-VL-7B-Instruct          29.26     38.76      29.50   21.35
InternVL2.5-8B                  18.68     36.45      18.20   13.11

Proprietary Models
GPT-4o                          38.81     32.65      40.17   29.16
Claude-4-Sonnet-20250514        44.75     48.42      44.21   47.62

Spatial Models
RoboBrain                       37.38     35.80      38.28   29.53
Space-Qwen                      33.28     38.02      33.71   26.32
Spatial-MLLM                    32.06     38.39      20.92   32.82
VLM-3R                          42.09     36.73      44.22   24.45
SpaceMantis                     22.81     37.65      21.26   29.32

Key Findings

Our research reveals three critical insights about teaching VLMs spatial reasoning through structured scaffolding.

3.1 Scaffolding Spatial Reasoning in Frozen VLMs

💡 Key Takeaways: Scaffolding Spatial Reasoning in Frozen VLMs

  • Explicit reasoning is crucial for improving performance.
  • Cognitive maps can help guide the reasoning process.
  • Passive structures (e.g., cognitive maps provided as input) and visual continuity alone offer little benefit.

Performance Analysis: Limited Gains from External Scaffolds

  • Raw-QA baseline: 37.81%
  • Best with reasoning: 41.33%
  • Maximum gain: +3.52%

Critical Finding: Structure alone fails. View interpolation shows no meaningful gains (+0.09%), while providing pre-computed maps actually degrades performance (-5.81%). Only explicit reasoning yields consistent improvements.
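To make the scaffold settings concrete, below is a minimal sketch of the map-then-reason elicitation for a frozen VLM. The generate callable and the prompt wording are illustrative placeholders, not the prompts used in the paper.

from typing import Callable, List

# Prompt wording below is illustrative, not the paper's exact prompts.
MAP_PROMPT = (
    "From the given views, build a cognitive map of the scene as JSON: "
    "list each object with approximate (x, y) grid coordinates and the "
    "camera position and facing for every view."
)

REASON_PROMPT = (
    "Using the cognitive map below, reason step by step about the spatial "
    "question, then answer with a single option letter.\n\n"
    "Cognitive map:\n{cogmap}\n\nQuestion:\n{question}"
)

def map_then_reason(
    generate: Callable[[str, List[str]], str],  # (prompt, image paths) -> text
    images: List[str],
    question: str,
) -> str:
    """Stage 1: elicit a cognitive map from the views.
    Stage 2: reason over the generated map to answer the question."""
    cogmap = generate(MAP_PROMPT, images)
    return generate(REASON_PROMPT.format(cogmap=cogmap, question=question), images)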

3.2 Teaching VLMs to Reason Spatially

💡 Key Takeaways: Teaching VLMs to Reason Spatially

  • The joint cogmap-and-reasoning setting yields the best performance through synergistic effects.
  • Reasoning shapes spatial representations for functional utility, not just structural perfection.
  • Neither map generation alone nor reasoning alone substantially outperforms the SFT QA baseline.

SFT Performance: The Power of Joint Training

  • Plain-CGMap-FFR-Out: 60.76%
  • Map generation only: 54.38%
  • Reasoning only: 53.52%

Breakthrough: Combining map generation with reasoning achieves a +8.48% improvement over the baseline. The synergy forces models to build spatial representations that are functionally effective, not merely structurally perfect.
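As a rough illustration of the joint setting, the sketch below assembles a single SFT target that supervises the model to emit the map before the reasoning and the answer. The tag names and JSON schema are assumptions for illustration, not the exact Plain-CGMap-FFR-Out format.

import json

def build_sft_target(cogmap: dict, reasoning: str, answer: str) -> str:
    """Supervise the model to emit the map first, then the reasoning,
    then the final answer, in one target string."""
    return (
        "<cogmap>\n" + json.dumps(cogmap) + "\n</cogmap>\n"
        "<reasoning>\n" + reasoning + "\n</reasoning>\n"
        "<answer>" + answer + "</answer>"
    )

# Toy example; contents are illustrative, not real annotations.
print(build_sft_target(
    cogmap={"objects": {"chair": [2, 3], "table": [4, 3]},
            "views": [{"id": 1, "facing": "north"}]},
    reasoning="From view 1 the chair is left of the table, so ...",
    answer="B",
))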

Training Dynamics: Structure vs. Function Trade-off


Key Insight: Models trained only on map generation learn structure rapidly (91.73% similarity, 89.05% isomorphism) but plateau in QA performance. Joint training learns structure more slowly but achieves superior functional utility.
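For intuition about how such similarity and isomorphism numbers could be computed, here is a sketch that treats each cognitive map as a graph of objects and qualitative relations; networkx.is_isomorphic is a real library call, but the map schema and relation extraction are illustrative assumptions.

import networkx as nx

def map_to_graph(cogmap: dict) -> nx.DiGraph:
    """Turn a cognitive map (assumed {object: (x, y)} schema) into a
    directed graph of qualitative left-of relations."""
    g = nx.DiGraph()
    objects = cogmap["objects"]
    g.add_nodes_from(objects)
    for a, (ax, _) in objects.items():
        for b, (bx, _) in objects.items():
            if ax < bx:
                g.add_edge(a, b, rel="left_of")
    return g

def similarity(pred: dict, gold: dict) -> float:
    """Fraction of gold objects the predicted map also contains."""
    shared = set(pred["objects"]) & set(gold["objects"])
    return len(shared) / len(gold["objects"])

def isomorphic(pred: dict, gold: dict) -> bool:
    """Whether the two relation graphs share the same structure."""
    return nx.is_isomorphic(map_to_graph(pred), map_to_graph(gold))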

3.3 Reinforcement Learning for Spatial Reasoning

💡 Key Takeaways: Reinforcement Learning for Spatial Reasoning

  • Combining cognitive maps with reasoning consistently improves all learning outcomes.
  • Starting from scratch, RL provides only marginal gains for spatial reasoning; its true power is unlocked when building upon a strong SFT foundation.

RL Performance: Building on Strong Foundations

  • RL from SFT: 70.67% (+9.91% vs. SFT)
  • RL from scratch: 53.71% (limited gains)
  • Map quality: 85.79% (Plain-CGMap superior)

Critical Discovery: RL's power is unlocked when building upon strong SFT foundations. While both Plain and Augmented variants reach identical 70.67% QA accuracy, Plain-CGMap maintains superior geometric quality (71.52% vs 58.86% isomorphism).
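As one way to picture the RL stage, the sketch below is a rule-based reward that combines answer correctness with a well-formedness bonus for the emitted map; the tags, weights, and parsing are illustrative assumptions, not the paper's actual reward.

import json
import re

def spatial_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: answer correctness dominates; a parseable
    cognitive map earns a small format bonus (weights are illustrative)."""
    reward = 0.0
    ans = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if ans and ans.group(1).strip() == gold_answer:
        reward += 1.0
    cmap = re.search(r"<cogmap>(.*?)</cogmap>", completion, re.DOTALL)
    if cmap:
        try:
            json.loads(cmap.group(1))  # map must be valid JSON to score
            reward += 0.2
        except json.JSONDecodeError:
            pass
    return reward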

Citation

If you find our work useful in your research, please cite:

@misc{yin2025spatialmentalmodelinglimited,
      title={Spatial Mental Modeling from Limited Views}, 
      author={Baiqiao Yin and Qineng Wang and Pingyue Zhang and Jianshu Zhang and Kangrui Wang and Zihan Wang and Jieyu Zhang and Keshigeyan Chandrasegaran and Han Liu and Ranjay Krishna and Saining Xie and Manling Li and Jiajun Wu and Li Fei-Fei},
      year={2025},
      eprint={2506.21458},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.21458}, 
}