UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
The benchmark evaluates whether video large language models (Video-LLMs) can, like humans, naturally process continuous first-person visual observations to support recall, perception, reasoning, and navigation.
- Arxiv: https://arxiv.org/pdf/2503.06157
- Project: https://embodiedcity.github.io/UrbanVideo-Bench/
- Dataset: https://huggingface.co/datasets/EmbodiedCity/UrbanVideo-Bench
🎉 Accepted as an oral presentation at ACL 2025!
✅ Dataset Upload
✅ Dataset generation code
✅ Example code for running the benchmark with Video-LLMs
The pipeline includes four steps: video curation, MCQ generation, blind filtering, and human refinement.
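As an illustration of the blind-filtering step, the sketch below interprets it as dropping MCQs that a text-only model can answer correctly without seeing the video; the model choice, prompt, and record fields (`question`, `options`, `answer`) are placeholders, not the project's actual pipeline code.

```python
import google.generativeai as genai

genai.configure(api_key="SET_YOUR_API_KEY_HERE")
text_only_model = genai.GenerativeModel("gemini-1.5-flash")

def passes_blind_filter(mcq: dict) -> bool:
    """Keep an MCQ only if a text-only model cannot answer it without the video."""
    prompt = (
        "Answer with a single option letter.\n"
        f"Question: {mcq['question']}\n"
        f"Options: {mcq['options']}"
    )
    guess = text_only_model.generate_content(prompt).text.strip()
    # If the question is answerable from text alone, it does not test video understanding.
    return not guess.startswith(mcq["answer"])
```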
The dataset statistics are shown in panels (b)-(f) of the following figure.
We provide three separate scripts, Basic, Goal, and Route, for generating questions with Gemini; their workflows differ slightly due to differences in the form of the input data and the characteristics of the tasks:
- For data consisting of videos with destinations, we use the Goal script for generation. This script can generate questions in the following categories:

  question_categories = ["Trajectory Captioning", "Landmark Position", "Goal Detection", "Association", "Cognitive Map", "High-level Planning", "Action Generation"]
- For data that include videos collected with specific movement instructions, we use the Route script to generate the questions. This script can generate questions in the following categories:

  question_categories = ["Progress Evaluation", "Landmark Position", "Action Generation"]
- For some questions in the Recall and Perception categories, details from the videos are important, so we introduce an extra chain of thought in the Basic script: objects and movements are first extracted from the videos and then fed into the model for the final generation (see the sketch after this list). This script can generate questions in the following categories:

  question_categories = ["Trajectory Captioning", "Start/End Position", "Object Recall", "Sequence Recall", "Scene Recall", "Proximity", "Duration", "Causal", "Counterfactual"]
Follow the steps below to configure and execute the script `question_generation/MCQ_generation_basic.py` (the other two scripts are similar):
- Set your Gemini API key and select the appropriate model version:

  model = "gemini-1.5-flash"
  genai.configure(api_key="SET_YOUR_API_KEY_HERE")
- Configure the input and output paths:

  - Input Path: Specify the folder path containing `video_list.json` and the `.mp4` videos to be processed.

    video_path = rf"DIRECT\PATH\TO\YOUR\VIDEO"  # Replace with your video path. This folder should contain the video files recorded in video_list.json.

    DIRECT\PATH\TO\YOUR\VIDEO/
    ├── video_list.json
    ├── video_1.mp4
    ├── video_2.mp4
    └── ...   # All the videos recorded in video_list.json

  - Output Path: Specify the path to the `.csv` file where the results will be saved.

    MCQ_PATH = rf"DIRECT\PATH\TO\YOUR\MCQ\FILE.csv"  # Set your output MCQ file here.
- Finally, once you have set the paths to your input and output files, execute the script by running the following command in the terminal:
python question_generation/MCQ_generation_basic.py
The results will be saved to the specified file.
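As a quick sanity check on the generated questions, you can load the output CSV with pandas; this is just a sketch, and the file's actual columns should be inspected rather than assumed:

```python
import pandas as pd

mcq_df = pd.read_csv(r"DIRECT\PATH\TO\YOUR\MCQ\FILE.csv")  # same path as MCQ_PATH above
print(len(mcq_df), "generated questions")
print(mcq_df.columns.tolist())  # inspect the actual schema before post-processing
print(mcq_df.head())
```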
To get started, download the dataset from [Hugging Face](https://huggingface.co/datasets/EmbodiedCity/UrbanVideo-Bench) and place it in the `dataset` folder within the project directory. After downloading, ensure the folder structure matches the one described below.
UrbanVideo-Bench.code/
├── dataset/
│ ├── videos/ # Contains video files used as input for the model
│ ├── MCQ.parquet # Contains multiple-choice questions
│ └── ...
├── run.py # Script for running the model and generating predictions
├── eval.py # Script for evaluating the model's predictions
├── README.md # Documentation for the project
└── ... # Other potential files or subdirectories
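To verify the download, a minimal sketch for loading the benchmark files is shown below; it only prints counts and column names, since the exact schema of `MCQ.parquet` should be read from the file itself.

```python
import os
import pandas as pd  # requires a parquet engine such as pyarrow

mcq = pd.read_parquet("dataset/MCQ.parquet")
print(len(mcq), "multiple-choice questions")
print(mcq.columns.tolist())  # check the real column names before relying on them

videos = [f for f in os.listdir("dataset/videos") if f.lower().endswith(".mp4")]
print(len(videos), "video files found")
```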
We provide a sample script, `run.py`, to run the dataset using an OpenAI-style API. Follow the steps below to configure and execute the script:
- Set the model name in `run.py`:

  model = "your_model_name"
- Configure the OpenAI API credentials:

  client = OpenAI(
      api_key='your_api_key',
      base_url='your_base_url'
  )
- Run the script:

  python run.py
Results will be saved to `result/%s_output.csv`.
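`run.py` handles frame sampling and prompting internally; the sketch below only illustrates how an OpenAI-compatible endpoint is typically queried with sampled video frames plus a question. The OpenCV frame sampling, base64 image format, and prompt wording here are assumptions for illustration, not the script's actual logic.

```python
import base64
import cv2  # opencv-python, assumed here for frame sampling
from openai import OpenAI

client = OpenAI(api_key="your_api_key", base_url="your_base_url")

def sample_frames(video_path, num_frames=8):
    """Uniformly sample frames and return them as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * max(total // num_frames, 1))
        ok, frame = cap.read()
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf).decode("utf-8"))
    cap.release()
    return frames

frames = sample_frames("dataset/videos/video_1.mp4")
content = [{"type": "text",
            "text": "Answer the multiple-choice question with a single option letter.\nQuestion: ..."}]
content += [{"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b}"}} for b in frames]

response = client.chat.completions.create(
    model="your_model_name",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```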
The `eval.py` script is provided to evaluate the model's predictions. It extracts the selected options from the model's output and calculates accuracy by comparing them to the ground truth.
- Modify the file path in `eval.py` to match the output file from `run.py`:

  file_path = 'result/gpt-4o_output.csv'  # Replace with your output file path
- Run the script:

  python eval.py
- The script compares predictions to the ground truth and calculates accuracy. Results are saved to:

  result/%s_acc.xlsx
Note: the option extraction here uses the simplest regular-expression matching. The output of small models often does not follow the required format, so the extraction may need to be adjusted for such models.
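As a starting point for adjusting the extraction, here is a minimal sketch of regex-based option matching and per-category accuracy; the column names (`output`, `answer`, `question_category`) are assumptions, so align them with your actual result CSV.

```python
import re
import pandas as pd  # writing .xlsx additionally requires openpyxl

df = pd.read_csv("result/gpt-4o_output.csv")  # column names below are assumptions

def extract_option(text):
    """Pull the first standalone option letter (A-D) out of a model response."""
    if not isinstance(text, str):
        return None
    m = re.search(r"\b([A-D])\b", text.strip())
    return m.group(1) if m else None

df["pred"] = df["output"].apply(extract_option)
df["correct"] = df["pred"] == df["answer"]

acc = df.groupby("question_category")["correct"].mean()
acc.to_frame("accuracy").to_excel("result/gpt-4o_acc.xlsx")
print("overall accuracy:", df["correct"].mean())
print(acc)
```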
We would like to express our sincere gratitude to ZhanxyR for the valuable contribution of VLMEvalKit support for this project, which has further improved its usability. Thank you for your efforts and dedication!
If you use this project in your research, please cite the following paper:
@misc{zhao2025urbanvideobench,
title={UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces},
author={Baining Zhao and Jianjie Fang and Zichao Dai and Ziyou Wang and Jirong Zha and Weichen Zhang and Chen Gao and Yue Wang and Jinqiang Cui and Xinlei Chen and Yong Li},
year={2025},
eprint={2503.06157},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.06157},
}