VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation

Sihao Lin1,2, Zerui Li1,2, Xunyi Zhao1,2, Gengze Zhou1, Liuyi Wang3, Rong Wei4, Rui Tang5, Juncheng Li5, Hanqing Wang6, Jiangmiao Pang6, Anton van den Hengel1,2, Jiajun Liu7, Qi Wu1,2
1Adelaide University, 2Responsible AI Research Centre, Australian Institute for Machine Learning, 3Tongji University, 4ManyCore, 5Zhejiang University, 6Shanghai AI Lab, 7CSIRO Data61

Qualitative Examples

We visualize the agent's trajectories.

Abstract

Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight the benchmarks provide into sim-to-real generalization and leave a significant research gap. Furthermore, task fragmentation prevents unified progress in the area, while the limited data scale fails to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible, teleporting "ghost" agents to support full-kinematic agents in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to advance research towards scalable, general-purpose embodied locomotion agents.

Trajectory Visualization

Instruction:

[Instruction placeholder - to be filled with actual navigation instructions]

🎮 Interactive Playground

Do you think you are a good navigator? We provide a human-in-the-loop demo in which you act as the robot: navigate through our realistic scenes by following the instructions, and our system scores your trajectory in real time.

Screen recording of a user navigating the "Living Room" scene.

🚀 Quick Start

# 1. Clone the repository and enter it
git clone https://github.com/william13077/IAmGoodNavigator
cd IAmGoodNavigator

# 2. Download the benchmark data
bash download.sh

# 3. Launch the demo
python demo.py --task fine --index 0
🏆 Real-time Scoring
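How might the score be computed? The scoring code is not shown here, but standard VLN evaluation reports Navigation Error, Success Rate, and SPL (Success weighted by Path Length). The sketch below is a minimal, self-contained illustration of those metrics under stated assumptions: the function names, the (x, y, z) trajectory format, and the conventional 3 m success radius are our own choices for the example, not VLNVerse's actual API.

import math

def path_length(points):
    # Total Euclidean length of a trajectory given as (x, y, z) waypoints.
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def score_trajectory(trajectory, goal, shortest_path_length, success_radius=3.0):
    # trajectory: list of (x, y, z) positions visited by the agent.
    # goal: (x, y, z) target position; shortest_path_length: ground-truth geodesic length.
    # success_radius: 3 m is the conventional VLN threshold (an assumption here).
    nav_error = math.dist(trajectory[-1], goal)      # distance to goal when the agent stops
    success = nav_error <= success_radius            # Success Rate component
    taken = path_length(trajectory)                  # length of the path actually taken
    spl = float(success) * shortest_path_length / max(taken, shortest_path_length)
    return {"navigation_error": nav_error, "success": success, "spl": spl}

# Example: the agent walks two legs of a right triangle instead of the 5 m shortest path.
print(score_trajectory([(0, 0, 0), (3, 0, 0), (3, 4, 0)],
                       goal=(3, 4, 0), shortest_path_length=5.0))
# -> {'navigation_error': 0.0, 'success': True, 'spl': 0.714...}

SPL rewards stopping at the goal efficiently: a successful but roundabout path is penalized in proportion to its extra length, so a real-time score can drop even while you are still heading toward the goal.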

Benchmark Statistics

Statistics panels: navigable area, navigation instructions, instruction hierarchy, landmarks, and instruction length.