IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

📢 We are currently organizing the code for IS-Bench. If you are interested in our work, please star ⭐ our project.

🔥 Updates

📆 [2025-07-07] 🎈 Our paper, code, and dataset are released! 🎈

🎉 Introduction


Embodied agents driven by VLMs increasingly operate in interactive household environments, yet existing static, non-interactive evaluation paradigms fail to adequately assess the risks that arise there: they cannot simulate the dynamic risks that emerge from an agent's own actions, and they rely on unreliable post-hoc judgments that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk-mitigation actions are performed before or after specific risk-prone steps.

📍 Results of IS-Bench


Our experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion.

⚙️ Installation

System Requirements for OmniGibson

OS: Linux (Ubuntu 20.04+), Windows 10+

RAM: 32GB+ recommended

VRAM: 8GB+

GPU: NVIDIA RTX 2080+

[Optional] If your machine cannot support the environment required by the OmniGibson simulator, we provide scene images captured from the golden-plan execution process. You can use these images to construct QA pairs for VLM evaluation, as shown below.

wget https://huggingface.co/datasets/Ursulalala/IS_Bench_dataset/resolve/main/scene_images.tar.gz
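
A minimal sketch for unpacking the archive (we assume it extracts into a scene_images/ directory; check the actual layout after extraction):

tar -xzvf scene_images.tar.gz
rm scene_images.tar.gz
ls scene_images | head    # inspect a few of the extracted images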

  1. Install OmniGibson
conda create -n isbench python=3.10 pytorch torchvision torchaudio pytorch-cuda=12.1 "numpy<2" -c pytorch -c nvidia
conda activate isbench
pip install omnigibson==1.1.1
python -m omnigibson.install    # install omnigibson assets and datasets
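
As a quick sanity check of the installation (assuming the package exposes __version__, as recent releases do), you can run:

python -c "import omnigibson; print(omnigibson.__version__)"    # should print 1.1.1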

If you want to install OmniGibson in Docker, please see this document.

  2. Download the Source Code and BDDL of IS-Bench
git clone https://github.com/AI45Lab/IS-Bench
cd IS-Bench
pip install -r requirements.txt
cd bddl
pip install -e .
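
Similarly, a quick check that the editable bddl install succeeded (assuming the package is importable as bddl):

python -c "import bddl"    # exits silently if the install worked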

  3. Download the Scene Dataset
cd ../data
wget https://huggingface.co/datasets/Ursulalala/IS_Bench_scenes/resolve/main/scenes.tar.gz
tar -xzvf scenes.tar.gz
rm scenes.tar.gz

🚀 Usage

If you are using Slurm to run IS-Bench, please first adapt the benchmark launcher at scripts/launcher.sh to your cluster.

Evaluate Closed-Source Models

Our code supports API-based models that follow the openai or google-genai request format.

  1. Configure api_base and api_key in entrypoints/env.sh (see the sketch after these steps).
  2. If needed, add a proxy in og_ego_prim/models/server_inference.py.
  3. Execute the following script:
bash entrypoints/eval_close.sh $MODEL_NAME $DATA_PARALLEL
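
A hypothetical configuration and run, assuming env.sh exports variables named api_base and api_key as in step 1; the model name and data-parallel degree below are placeholders (check entrypoints/env.sh for the exact variable names it reads):

# in entrypoints/env.sh (hypothetical values)
export api_base="https://api.openai.com/v1"
export api_key="sk-..."    # your API key

# then launch the evaluation, e.g. GPT-4o with 4 parallel workers
bash entrypoints/eval_close.sh gpt-4o 4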

Evaluate Open-Source Models

  1. Execute entrypoints/vllm_serve.sh to deploy a server for the model under evaluation, and note the server IP.
bash entrypoints/vllm_serve.sh $LOCAL_MODEL_PATH $GPUS
  2. Execute the following script:
bash entrypoints/eval_open.sh $MODEL_NAME_OR_PATH $SERVER_IP $DATA_PARALLEL
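
For example, a hypothetical end-to-end run with a locally downloaded model (the model path, GPU argument format, and parallelism value are placeholders; adjust them to your setup and to how vllm_serve.sh parses $GPUS):

# deploy the model server, then note the IP it reports
bash entrypoints/vllm_serve.sh /path/to/Qwen2.5-VL-7B-Instruct 0,1,2,3
# run the evaluation against that server
bash entrypoints/eval_open.sh /path/to/Qwen2.5-VL-7B-Instruct $SERVER_IP 4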

Advanced Configuration

  1. Revise entrypoints/task_list.txt to specify the tasks that need to be evaluated.

  2. Revise prompt_setting to change the safety reminder (see the sketch after this list):

  • v0: no safety reminder.
  • v1: implicit safety reminder.
  • v2: safety Chain-of-Thought (CoT) reminder.
  • v3: explicit safety reminder.
  3. Set the following parameters to include optional scene information:
  • draw_bbox_2d
  • use_initial_setup
  • use_self_caption
  4. Set the following parameters for partial evaluation:
  • not_eval_process_safety
  • not_eval_termination_safety
  • not_eval_awareness
  • not_eval_execution
  5. Since OmniGibson's performance may vary depending on the hardware environment, you can run the following script to check whether the tasks in IS-Bench execute successfully in your environment:
bash entrypoints/validate_gt.sh
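
As a hypothetical example of combining these options, assuming the settings above are passed as flags to the eval scripts (inspect the entrypoint scripts for the actual mechanism; they may be environment variables instead):

# hypothetical flags; check the scripts for the real interface
bash entrypoints/eval_close.sh gpt-4o 4 \
    --prompt_setting v2 \
    --draw_bbox_2d \
    --not_eval_awareness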

🙏 Acknowledgements

We leverage part of the data and code framework from the BEHAVIOR-1K dataset and the OmniGibson simulator.

📑 Citation

@misc{lu2025isbench,
      title={IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks}, 
      author={Xiaoya Lu and Zeren Chen and Xuhao Hu and Yijin Zhou and Weichen Zhang and Dongrui Liu and Lu Sheng and Jing Shao},
      year={2025},
      eprint={2506.16402},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.16402}, 
}
