MindCube: Spatial Mental Modeling from Limited Views

Key Takeaway: Guiding VLMs to first generate cognitive maps and then reason over them is an effective way to approximate spatial mental modeling from limited views.



Overview: Spatial Mental Modeling Challenge · Three Cognitive Scaffold Data Structures · Elicitation Methods and Performance Overview

Baiqiao Yin1,3*, Qineng Wang1*‡, Pingyue Zhang1, Jianshu Zhang1

Kangrui Wang1, Zihan Wang1, Jieyu Zhang4, Keshigeyan Chandrasegaran2

Han Liu1, Ranjay Krishna4, Saining Xie3

Manling Li1†, Jiajun Wu2†, Li Fei-Fei2†

*Equal contribution, ‡Project lead, †Equal advising

1Northwestern University, 2Stanford University, 3New York University, 4University of Washington

Dataset Viewer

Explore the MindCube-tinybench version of our dataset.

  • Total samples: 21K
  • Settings: 3
  • Views per sample: 2-4
  • Annotated: 100%

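For quick exploration, a minimal loading sketch with the Hugging Face datasets library is shown below; the repository id and field names are illustrative assumptions, not the dataset's actual identifiers.

from datasets import load_dataset

# The repository id and field names below are assumptions for
# illustration; check the project page for the actual identifiers.
ds = load_dataset("MindCube/MindCube-tinybench", split="test")

sample = ds[0]
print(sample["question"])  # spatial question over the views (assumed field)
print(sample["answer"])    # annotated ground-truth option (assumed field)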

Leaderboard

Performance comparison of different models and approaches on our spatial reasoning benchmark.


Method                          Overall   Rotation   Among   Around

Baseline
Random (chance)                 32.35     36.36      32.29   30.66
Random (frequency)              33.02     38.30      32.66   35.79

Open-Weight Multi-Image Models
LLaVA-Onevision-7B              47.43     36.45      48.42   44.09
DeepSeek-VL2-Small              47.62     37.00      50.38   26.91
Gemma-3-12B-it                  46.67     38.39      48.38   34.63
mPLUG-Owl3-7B-241101            44.85     37.84      47.11   26.91
LLaVA-Video-Qwen-7B             41.96     35.71      43.55   30.12
Mantis-8B (SigLip)              41.05     37.65      40.23   50.99
Idefics-8B-Llama3               35.86     35.15      35.94   35.49
Qwen2.5-VL-3B-Instruct          33.21     37.37      33.26   30.34
LongVA-7B                       29.46     35.89      29.55   24.88
Qwen2.5-VL-7B-Instruct          29.26     38.76      29.50   21.35
InternVL2.5-8B                  18.68     36.45      18.20   13.11

Proprietary Models
GPT-4o                          38.81     32.65      40.17   29.16
Claude-4-Sonnet-20250514        44.75     48.42      44.21   47.62

Spatial Models
RoboBrain                       37.38     35.80      38.28   29.53
Space-Qwen                      33.28     38.02      33.71   26.32
Spatial-MLLM                    32.06     38.39      20.92   32.82
VLM-3R                          42.09     36.73      44.22   24.45
SpaceMantis                     22.81     37.65      21.26   29.32

Key Findings

Our research reveals three critical insights about teaching VLMs spatial reasoning through structured scaffolding.

3.1 Scaffolding Spatial Reasoning in Frozen VLMs

💡 Key Takeaways: Scaffolding Spatial Reasoning in Frozen VLMs

  • Explicit reasoning is crucial for improving performance.
  • Cognitive maps can help guide the reasoning process.
  • Passive structures (e.g., cognitive maps provided as input) and visual continuity alone offer little benefit.

Performance Analysis: Limited Gains from External Scaffolds

  • Raw-QA baseline: 37.81%
  • Best with reasoning: 41.33%
  • Maximum gain: +3.52%

Critical Finding: Structure alone fails. View interpolation shows no meaningful gains (+0.09%), while providing pre-computed maps actually degrades performance (-5.81%). Only explicit reasoning yields consistent improvements.
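To make the scaffold settings concrete, below is a minimal sketch of the map-then-reason elicitation for a frozen VLM. The generate callable and the prompt wording are illustrative placeholders, not the prompts used in the paper.

from typing import Callable, List

# Prompt wording below is illustrative, not the paper's exact prompts.
MAP_PROMPT = (
    "From the given views, build a cognitive map of the scene as JSON: "
    "list each object with approximate (x, y) grid coordinates and the "
    "camera position and facing for every view."
)

REASON_PROMPT = (
    "Using the cognitive map below, reason step by step about the spatial "
    "question, then answer with a single option letter.\n\n"
    "Cognitive map:\n{cogmap}\n\nQuestion:\n{question}"
)

def map_then_reason(
    generate: Callable[[str, List[str]], str],  # (prompt, image paths) -> text
    images: List[str],
    question: str,
) -> str:
    """Stage 1: elicit a cognitive map from the views.
    Stage 2: reason over the generated map to answer the question."""
    cogmap = generate(MAP_PROMPT, images)
    return generate(REASON_PROMPT.format(cogmap=cogmap, question=question), images)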

3.2 Teaching VLMs to Reason Spatially

💡 Key Takeaways: Teaching VLMs to Reason Spatially

  • The joint cogmap-and-reasoning setting yields the best performance through synergistic effects.
  • Reasoning shapes spatial representations for functional utility, not just structural perfection.
  • Neither map generation alone nor reasoning alone substantially outperforms the SFT QA baseline.

SFT Performance: The Power of Joint Training

  • Plain-CGMap-FFR-Out: 60.76%
  • Map generation only: 54.38%
  • Reasoning only: 53.52%

Breakthrough: Combining map generation with reasoning achieves a +8.48% improvement over the baseline. The synergy forces models to build spatial representations that are functionally effective, not merely structurally perfect.
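As a rough illustration of the joint setting, the sketch below assembles a single SFT target that supervises the model to emit the map before the reasoning and the answer. The tag names and JSON schema are assumptions for illustration, not the exact Plain-CGMap-FFR-Out format.

import json

def build_sft_target(cogmap: dict, reasoning: str, answer: str) -> str:
    """Supervise the model to emit the map first, then the reasoning,
    then the final answer, in one target string."""
    return (
        "<cogmap>\n" + json.dumps(cogmap) + "\n</cogmap>\n"
        "<reasoning>\n" + reasoning + "\n</reasoning>\n"
        "<answer>" + answer + "</answer>"
    )

# Toy example; contents are illustrative, not real annotations.
print(build_sft_target(
    cogmap={"objects": {"chair": [2, 3], "table": [4, 3]},
            "views": [{"id": 1, "facing": "north"}]},
    reasoning="From view 1 the chair is left of the table, so ...",
    answer="B",
))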

Training Dynamics: Structure vs. Function Trade-off


Key Insight: Models trained only on map generation learn structure rapidly (91.73% similarity, 89.05% isomorphism) but plateau in QA performance. Joint training learns structure more slowly but achieves superior functional utility.
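For intuition about how such similarity and isomorphism numbers could be computed, here is a sketch that treats each cognitive map as a graph of objects and qualitative relations; networkx.is_isomorphic is a real library call, but the map schema and relation extraction are illustrative assumptions.

import networkx as nx

def map_to_graph(cogmap: dict) -> nx.DiGraph:
    """Turn a cognitive map (assumed {object: (x, y)} schema) into a
    directed graph of qualitative left-of relations."""
    g = nx.DiGraph()
    objects = cogmap["objects"]
    g.add_nodes_from(objects)
    for a, (ax, _) in objects.items():
        for b, (bx, _) in objects.items():
            if ax < bx:
                g.add_edge(a, b, rel="left_of")
    return g

def similarity(pred: dict, gold: dict) -> float:
    """Fraction of gold objects the predicted map also contains."""
    shared = set(pred["objects"]) & set(gold["objects"])
    return len(shared) / len(gold["objects"])

def isomorphic(pred: dict, gold: dict) -> bool:
    """Whether the two relation graphs share the same structure."""
    return nx.is_isomorphic(map_to_graph(pred), map_to_graph(gold))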

3.3 Reinforcement Learning for Spatial Reasoning

💡 Key Takeaways: Reinforcement Learning for Spatial Reasoning

  • Combining cognitive maps with reasoning consistently improves all learning outcomes.
  • Starting from scratch, RL provides only marginal gains for spatial reasoning; its true power is unlocked when building upon a strong SFT foundation.

RL Performance: Building on Strong Foundations

  • RL from SFT: 70.67% (+9.91% vs. SFT)
  • RL from scratch: 53.71% (limited gains)
  • Map quality: 85.79% (Plain-CGMap superior)

Critical Discovery: RL's power is unlocked when building upon strong SFT foundations. While both Plain and Augmented variants reach identical 70.67% QA accuracy, Plain-CGMap maintains superior geometric quality (71.52% vs 58.86% isomorphism).
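As one way to picture the RL stage, the sketch below is a rule-based reward that combines answer correctness with a well-formedness bonus for the emitted map; the tags, weights, and parsing are illustrative assumptions, not the paper's actual reward.

import json
import re

def spatial_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: answer correctness dominates; a parseable
    cognitive map earns a small format bonus (weights are illustrative)."""
    reward = 0.0
    ans = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if ans and ans.group(1).strip() == gold_answer:
        reward += 1.0
    cmap = re.search(r"<cogmap>(.*?)</cogmap>", completion, re.DOTALL)
    if cmap:
        try:
            json.loads(cmap.group(1))  # map must be valid JSON to score
            reward += 0.2
        except json.JSONDecodeError:
            pass
    return reward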

Citation

If you find our work useful in your research, please cite:

@misc{yin2025spatialmentalmodelinglimited,
      title={Spatial Mental Modeling from Limited Views}, 
      author={Baiqiao Yin and Qineng Wang and Pingyue Zhang and Jianshu Zhang and Kangrui Wang and Zihan Wang and Jieyu Zhang and Keshigeyan Chandrasegaran and Han Liu and Ranjay Krishna and Saining Xie and Manling Li and Jiajun Wu and Li Fei-Fei},
      year={2025},
      eprint={2506.21458},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.21458}, 
}