📢 We are currently organizing the code for IS-Bench. If you are interested in our work, please star ⭐ our project.
📆[2025-07-07] 🎈 Our paper, code and dataset are released! 🎈
Existing static, non-interactive evaluation paradigms fail to adequately assess risks in interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps.
Our experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion.
- OS: Linux (Ubuntu 20.04+), Windows 10+
- RAM: 32GB+ recommended
- VRAM: 8GB+
- GPU: NVIDIA RTX 2080+
[Optional] If your environment cannot meet the requirements of the OmniGibson simulator, we provide scene images captured from the golden-plan execution process. You can use these images to construct QA pairs for VLM evaluation.
wget https://huggingface.co/datasets/Ursulalala/IS_Bench_dataset/resolve/main/scene_images.tar.gz
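After downloading, unpack the archive; a minimal sketch mirroring the scene-dataset steps below (the extraction target is whatever directory your evaluation setup expects):

```bash
tar -xzvf scene_images.tar.gz   # unpack the golden-plan scene images
rm scene_images.tar.gz          # optionally drop the archive afterwards
```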
- Install OmniGibson
conda create -n isbench python=3.10 pytorch torchvision torchaudio pytorch-cuda=12.1 "numpy<2" -c pytorch -c nvidia
conda activate isbench
pip install omnigibson==1.1.1
python -m omnigibson.install # install omnigibson assets and datasets
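As an optional sanity check, verify that the package imports cleanly in the activated environment (assuming `omnigibson` exposes `__version__`, as recent releases do):

```bash
# Should print the installed OmniGibson version (e.g., 1.1.1) without errors
python -c "import omnigibson; print(omnigibson.__version__)"
```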
If you want to install OmniGibson in Docker, please see this document.
- Download Source Code and BDDL of IS-Bench
git clone https://github.com/AI45Lab/IS-Bench
cd IS-Bench
pip install -r requirements.txt
cd bddl
pip install -e .
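Optionally, confirm that the editable BDDL install succeeded (the module name `bddl` is an assumption based on the package directory):

```bash
# Exits silently if the IS-Bench BDDL fork is importable
python -c "import bddl"
```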
- Download Scene Dataset
cd ../data
wget https://huggingface.co/datasets/Ursulalala/IS_Bench_scenes/resolve/main/scenes.tar.gz
tar -xzvf scenes.tar.gz
rm scenes.tar.gz
If you are running IS-Bench with Slurm, please first adapt the benchmark launcher at scripts/launcher.sh to your cluster.
Our code supports API-based models that follow the openai or google-genai interface.
- Configure `api_base` and `api_key` in `entrypoints/env.sh` (see the sketch after these steps).
- Add a proxy in `og_ego_prim/models/server_inference.py` if needed.
- Execute the following script:
bash entrypoints/eval_close.sh $MODEL_NAME $DATA_PARALLEL
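For reference, a minimal `entrypoints/env.sh` might look like the following; the exact variable names and endpoint are assumptions, so match them to the shipped template:

```bash
# Hypothetical sketch of entrypoints/env.sh -- names and endpoint are assumptions
export api_base="https://api.openai.com/v1"  # base URL of your API provider
export api_key="YOUR_API_KEY"                # keep real keys out of version control
```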
- Execute `entrypoints/vllm_serve.sh` to deploy a server for the evaluated model, and note the server IP.
bash entrypoints/vllm_serve.sh $LOCAL_MODEL_PATH $GPUS
- Execute the following script:
bash entrypoints/eval_open.sh $MODEL_NAME_OR_PATH $SERVER_IP $DATA_PARALLEL
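For example (the checkpoint path, server address, and parallelism below are illustrative, not defaults of the scripts; the exact form of `$SERVER_IP` should follow what `vllm_serve.sh` reports):

```bash
# Serve a local VLM on 4 GPUs, then evaluate it with data parallelism 2
bash entrypoints/vllm_serve.sh /path/to/your-vlm-checkpoint 4
bash entrypoints/eval_open.sh /path/to/your-vlm-checkpoint 127.0.0.1:8000 2
```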
- Revise `entrypoints/task_list.txt` to specify the tasks to be evaluated.
- Revise `prompt_setting` to change the safety reminder:
- v0: no safety reminder.
- v1: implicit safety reminder.
- v2: safety Chain-of-Thought (CoT) reminder.
- v3: explicit safety reminder.
- Set the following parameters for optional scene information:
- draw_bbox_2d
- use_initial_setup
- use_self_caption
- Set the following parameters for partial evaluation:
- not_eval_process_safety
- not_eval_termination_safety
- not_eval_awareness
- not_eval_execution
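How these switches are wired in is defined by the entrypoint scripts; as a purely illustrative sketch (the assignment mechanism below is an assumption, not the scripts' actual interface), a run combining the options above might be configured as:

```bash
# Illustrative sketch only -- check the entrypoint scripts for the real mechanism
prompt_setting=v2        # safety Chain-of-Thought (CoT) reminder
draw_bbox_2d=true        # include 2D bounding boxes as scene information
use_initial_setup=false
use_self_caption=false
not_eval_awareness=true  # skip the awareness evaluation, keep the rest
```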
Since OmniGibson's behavior may vary with the hardware environment, you can run the following script to check whether the IS-Bench tasks execute successfully in your environment.
bash entrypoints/validate_gt.sh
We leverage part of the data and code framework from the BEHAVIOR-1K dataset and the OmniGibson simulator.
@misc{lu2025isbench,
title={IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks},
author={Xiaoya Lu and Zeren Chen and Xuhao Hu and Yijin Zhou and Weichen Zhang and Dongrui Liu and Lu Sheng and Jing Shao},
year={2025},
eprint={2506.16402},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.16402},
}