IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

📢 We are currently organizing the code for IS-Bench. If you are interested in our work, please star ⭐ our project.

🔥 Updates

📆 [2025-07-07] 🎈 Our paper, code, and dataset are released! 🎈

🎉 Introduction


Embodied agents driven by VLMs increasingly operate in interactive household environments, yet existing static, non-interactive evaluation paradigms fail to adequately assess the risks that arise there: they cannot simulate the dynamic risks that emerge from an agent's own actions, and they rely on unreliable post-hoc judgments that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk-mitigation actions are performed before or after specific risk-prone steps.

📍 Results of IS-Bench


Our experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion.

⚙️ Installation

System Requirements for OmniGibson

OS: Linux (Ubuntu 20.04+), Windows 10+

RAM: 32GB+ recommended

VRAM: 8GB+

GPU: NVIDIA RTX 2080+

[Optional] If your machine cannot support the environment required by the OmniGibson simulator, we provide scene images captured from the golden-plan execution process. You can use these images to construct QA pairs for VLM evaluation, as shown below.

wget https://huggingface.co/datasets/Ursulalala/IS_Bench_dataset/resolve/main/scene_images.tar.gz
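
A minimal sketch for unpacking the archive (we assume it extracts into a scene_images/ directory; check the actual layout after extraction):

tar -xzvf scene_images.tar.gz
rm scene_images.tar.gz
ls scene_images | head    # inspect a few of the extracted images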

  1. Install OmniGibson
conda create -n isbench python=3.10 pytorch torchvision torchaudio pytorch-cuda=12.1 "numpy<2" -c pytorch -c nvidia
conda activate isbench
pip install omnigibson==1.1.1
python -m omnigibson.install    # install omnigibson assets and datasets
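
As a quick sanity check of the installation (assuming the package exposes __version__, as recent releases do), you can run:

python -c "import omnigibson; print(omnigibson.__version__)"    # should print 1.1.1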

If you want to install OmniGibson in Docker, please see this document.

  2. Download the Source Code and BDDL of IS-Bench
git clone https://github.com/AI45Lab/IS-Bench
cd IS-Bench
pip install -r requirements.txt
cd bddl
pip install -e .
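
Similarly, a quick check that the editable bddl install succeeded (assuming the package is importable as bddl):

python -c "import bddl"    # exits silently if the install worked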

  3. Download the Scene Dataset
cd ../data
wget https://huggingface.co/datasets/Ursulalala/IS_Bench_scenes/resolve/main/scenes.tar.gz
tar -xzvf scenes.tar.gz
rm scenes.tar.gz

🚀 Usage

If you are using Slurm to run IS-Bench, please first adapt the benchmark launcher at scripts/launcher.sh to your cluster.

Evaluate Closed-Source Models

Our code supports API-based models that follow the openai or google-genai request format.

  1. Configure api_base and api_key in entrypoints/env.sh (see the sketch after these steps).
  2. If needed, add a proxy in og_ego_prim/models/server_inference.py.
  3. Execute the following script:
bash entrypoints/eval_close.sh $MODEL_NAME $DATA_PARALLEL
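
A hypothetical configuration and run, assuming env.sh exports variables named api_base and api_key as in step 1; the model name and data-parallel degree below are placeholders (check entrypoints/env.sh for the exact variable names it reads):

# in entrypoints/env.sh (hypothetical values)
export api_base="https://api.openai.com/v1"
export api_key="sk-..."    # your API key

# then launch the evaluation, e.g. GPT-4o with 4 parallel workers
bash entrypoints/eval_close.sh gpt-4o 4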

Evaluate Open-Source Models

  1. Execute entrypoints/vllm_serve.sh to deploy a server for the model under evaluation, and note the server IP.
bash entrypoints/vllm_serve.sh $LOCAL_MODEL_PATH $GPUS
  2. Execute the following script:
bash entrypoints/eval_open.sh $MODEL_NAME_OR_PATH $SERVER_IP $DATA_PARALLEL
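
For example, a hypothetical end-to-end run with a locally downloaded model (the model path, GPU argument format, and parallelism value are placeholders; adjust them to your setup and to how vllm_serve.sh parses $GPUS):

# deploy the model server, then note the IP it reports
bash entrypoints/vllm_serve.sh /path/to/Qwen2.5-VL-7B-Instruct 0,1,2,3
# run the evaluation against that server
bash entrypoints/eval_open.sh /path/to/Qwen2.5-VL-7B-Instruct $SERVER_IP 4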

Advanced Configuration

  1. Revise entrypoints/task_list.txt to specify the tasks that need to be evaluated.

  2. Revise prompt_setting to change the safety reminder (see the sketch after this list):

  • v0: no safety reminder.
  • v1: implicit safety reminder.
  • v2: safety Chain-of-Thought (CoT) reminder.
  • v3: explicit safety reminder.
  3. Set the following parameters to include optional scene information:
  • draw_bbox_2d
  • use_initial_setup
  • use_self_caption
  4. Set the following parameters for partial evaluation:
  • not_eval_process_safety
  • not_eval_termination_safety
  • not_eval_awareness
  • not_eval_execution
  5. Since OmniGibson's performance may vary depending on the hardware environment, you can run the following script to check whether the tasks in IS-Bench execute successfully in your environment:
bash entrypoints/validate_gt.sh
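
As a hypothetical example of combining these options, assuming the settings above are passed as flags to the eval scripts (inspect the entrypoint scripts for the actual mechanism; they may be environment variables instead):

# hypothetical flags; check the scripts for the real interface
bash entrypoints/eval_close.sh gpt-4o 4 \
    --prompt_setting v2 \
    --draw_bbox_2d \
    --not_eval_awareness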

🙏 Acknowledgements

We leverage part of the data and code framework from the BEHAVIOR-1K dataset and the OmniGibson simulator.

📑 Citation

@misc{lu2025isbench,
      title={IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks}, 
      author={Xiaoya Lu and Zeren Chen and Xuhao Hu and Yijin Zhou and Weichen Zhang and Dongrui Liu and Lu Sheng and Jing Shao},
      year={2025},
      eprint={2506.16402},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.16402}, 
}
