
Spatial Mental Modeling from Limited Views
Key Takeaway: Guiding VLMs to first generate cognitive maps and then reason over them is an effective way to approximate spatial mental modeling from limited views.
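As a rough illustration of this two-stage scaffold, the sketch below prompts a VLM to first externalize a cognitive map and then answer conditioned on it. The helper `query_vlm` and the prompt wording are illustrative assumptions, not the paper's exact prompts.

```python
# Two-stage scaffold: (1) build a cognitive map from the views, (2) reason over it.
# `query_vlm` is a hypothetical wrapper around any chat-style VLM API.

def query_vlm(images, prompt):
    """Placeholder for an actual VLM call (e.g., an OpenAI- or HF-style client)."""
    raise NotImplementedError

def answer_with_cogmap(images, question):
    # Stage 1: ask the model to externalize a cognitive map of the scene
    # as structured text (objects with rough top-down positions).
    map_prompt = (
        "From these views, list every object and its approximate position on a "
        'top-down grid as JSON: {"objects": [{"name": ..., "position": [x, y]}]}.'
    )
    cogmap = query_vlm(images, map_prompt)

    # Stage 2: condition the reasoning on the self-generated map before answering.
    reason_prompt = (
        f"Cognitive map of the scene:\n{cogmap}\n\n"
        f"Using this map, reason step by step, then answer: {question}"
    )
    return query_vlm(images, reason_prompt)
```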
Dataset Viewer
Explore the tinybench version of our MindCube dataset.
Access Dataset
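For programmatic access, a minimal loading sketch using the Hugging Face `datasets` library is shown below; the repository id is a placeholder for illustration, so substitute the id linked from this page, and field names may differ from those shown.

```python
# Sketch: load the benchmark with the Hugging Face `datasets` library.
# NOTE: "MLL-Lab/MindCube-tinybench" is a placeholder repo id; use the id
# linked from this page's dataset viewer.
from datasets import load_dataset

ds = load_dataset("MLL-Lab/MindCube-tinybench", split="test")
print(ds[0])  # one QA item: images, question, choices, answer (field names may differ)
```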
Leaderboard
Performance comparison (accuracy, %) of different models and approaches on our spatial reasoning benchmark, broken down by task setting (Rotation, Among, Around).
Method | Overall | Rotation | Among | Around |
---|---|---|---|---|
Random (chance) | 32.35 | 36.36 | 32.29 | 30.66 |
Random (frequency) | 33.02 | 38.30 | 32.66 | 35.79 |
LLaVA-Onevision-7B | 47.43 | 36.45 | 48.42 | 44.09 |
DeepSeek-VL2-Small | 47.62 | 37.00 | 50.38 | 26.91 |
Gemma-3-12B-it | 46.67 | 38.39 | 48.38 | 34.63 |
mPLUG-Owl3-7B-241101 | 44.85 | 37.84 | 47.11 | 26.91 |
LLaVA-Video-Qwen-7B | 41.96 | 35.71 | 43.55 | 30.12 |
Mantis-8B (SigLip) | 41.05 | 37.65 | 40.23 | 50.99 |
Idefics-8B-Llama3 | 35.86 | 35.15 | 35.94 | 35.49 |
Qwen2.5-VL-3B-Instruct | 33.21 | 37.37 | 33.26 | 30.34 |
LongVA-7B | 29.46 | 35.89 | 29.55 | 24.88 |
Qwen2.5-VL-7B-Instruct | 29.26 | 38.76 | 29.50 | 21.35 |
InternVL2.5-8B | 18.68 | 36.45 | 18.20 | 13.11 |
GPT-4o | 38.81 | 32.65 | 40.17 | 29.16 |
Claude-4-Sonnet-20250514 | 44.75 | 48.42 | 44.21 | 47.62 |
RoboBrain | 37.38 | 35.80 | 38.28 | 29.53 |
Space-Qwen | 33.28 | 38.02 | 33.71 | 26.32 |
Spatial-MLLM | 32.06 | 38.39 | 20.92 | 32.82 |
VLM-3R | 42.09 | 36.73 | 44.22 | 24.45 |
SpaceMantis | 22.81 | 37.65 | 21.26 | 29.32 |
Key Findings
Our research reveals three critical insights about teaching VLMs spatial reasoning through structured scaffolding.
3.1 Scaffolding Spatial Reasoning in Frozen VLMs
Key Takeaways: Scaffolding Spatial Reasoning in Frozen VLMs
- Explicit reasoning is crucial for improving performance.
- Cognitive maps can help guide the reasoning process.
- Passive scaffolds alone, such as providing maps as input or adding visual continuity via interpolated views, offer little benefit.
Performance Analysis: Limited Gains from External Scaffolds
Critical Finding: Structure alone fails. View interpolation shows no meaningful gains (+0.09%), while providing pre-computed maps actually degrades performance (-5.81%). Only explicit reasoning yields consistent improvements.
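To make these conditions concrete, the sketch below shows how the inputs to a frozen VLM might differ across scaffolds. The prompt wording and the helpers `query_vlm` and `interpolate_views` are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative prompt scaffolds for a frozen VLM; wording is not the paper's verbatim prompts.

def build_inputs(images, question, condition, precomputed_map=None, interpolate_views=None):
    if condition == "raw_qa":
        return images, question
    if condition == "view_interpolation":
        # Passive visual continuity: insert synthesized intermediate frames.
        return interpolate_views(images), question
    if condition == "map_as_input":
        # Passive structure: hand the model a pre-computed cognitive map.
        return images, f"Scene map:\n{precomputed_map}\n\n{question}"
    if condition == "explicit_reasoning":
        # Active scaffold: require step-by-step reasoning before the answer.
        return images, f"{question}\nThink step by step about the spatial layout, then answer."
    raise ValueError(f"unknown condition: {condition}")
```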
3.2 Teaching VLMs to Reason Spatially
Key Takeaways: Teaching VLMs to Reason Spatially
- Joint cogmap and reasoning setting yields optimal performance through synergistic effects.
- Reasoning shapes spatial representations for functional utility, not just structural perfection.
- Neither map generation alone nor reasoning alone substantially outperforms the SFT QA baseline.
SFT Performance: The Power of Joint Training
Breakthrough: Combining map generation with reasoning achieves a +8.48% improvement over the SFT QA baseline. The synergy forces models to build functionally effective spatial representations, not just structurally perfect ones.
Training Dynamics: Structure vs. Function Trade-off
Key Insight: Models trained only on map generation learn structure rapidly (91.73% similarity, 89.05% isomorphism) but plateau in QA performance. Joint training learns structure more slowly but achieves superior functional utility.
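A minimal sketch of what a joint-training target could look like for one QA item is shown below; the schema, field names, and segment headers are illustrative assumptions rather than the exact released format.

```python
# Illustrative SFT target for the joint "cogmap + reasoning" setting:
# the supervision string asks the model to emit a map, then reasoning, then the answer.
import json

def build_joint_target(cogmap_dict, reasoning_text, answer_letter):
    # cogmap_dict: e.g. {"objects": [{"name": "chair", "position": [2, 5]}, ...]}
    return (
        "Cognitive map:\n" + json.dumps(cogmap_dict) + "\n"
        "Reasoning:\n" + reasoning_text + "\n"
        "Answer: " + answer_letter
    )

# The map-only and reasoning-only ablations keep just one of the two middle segments,
# which is the comparison behind the joint-training gain reported above.
```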
3.3 Reinforcement Learning for Spatial Reasoning
Key Takeaways: Reinforcement Learning for Spatial Reasoning
- Combining cognitive maps with reasoning consistently improves all learning outcomes.
- Starting from scratch, RL provides only marginal gains for spatial reasoning; its true power is unlocked when building upon a strong SFT foundation.
RL Performance: Building on Strong Foundations
Critical Discovery: RL's power is unlocked when building upon strong SFT foundations. While both Plain and Augmented variants reach identical 70.67% QA accuracy, Plain-CGMap maintains superior geometric quality (71.52% vs 58.86% isomorphism).
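For intuition, below is a hedged sketch of a rule-based reward of the kind such an RL stage might use on top of the SFT checkpoint; the weights, parsing rules, and answer format are assumptions, not the paper's exact reward.

```python
# Illustrative composite reward for RL on spatial QA; weights and parsing are assumptions.
import re

def spatial_reward(response: str, gold_answer: str) -> float:
    reward = 0.0
    # Format reward: the response should contain a parsable answer tag.
    match = re.search(r"Answer:\s*([A-D])", response)
    if match:
        reward += 0.1
        # Accuracy reward: the correct final choice dominates.
        if match.group(1) == gold_answer:
            reward += 1.0
    # Optional map reward: presence of a structured cognitive-map block.
    if "Cognitive map:" in response:
        reward += 0.1
    return reward
```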
Citation
If you find our work useful in your research, please cite:
@misc{yin2025spatialmentalmodelinglimited,
title={Spatial Mental Modeling from Limited Views},
author={Baiqiao Yin and Qineng Wang and Pingyue Zhang and Jianshu Zhang and Kangrui Wang and Zihan Wang and Jieyu Zhang and Keshigeyan Chandrasegaran and Han Liu and Ranjay Krishna and Saining Xie and Manling Li and Jiajun Wu and Li Fei-Fei},
year={2025},
eprint={2506.21458},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.21458},
}