🎯 DICEPTION: A Generalist Diffusion Model for Vision Perception

📖 Project Page | 📄 Paper Link | 🤗 Huggingface Demo

A single model solves multiple perception tasks, on par with SOTA!

📰 News

  • 2025-09-21: 🚀 Model and inference code released
  • 2025-09-19: 🌟 Accepted as NeurIPS 2025 Spotlight
  • 2025-02-25: 📝 Paper released

🛠️ Installation

conda create -n diception python=3.10 -y

conda activate diception

pip install -r requirements.txt

👾 Inference

⚡ Quick Start

🧩 Model Setup

  1. Download the SD3 Base Model: Get the Stable Diffusion 3 Medium model from https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers

  2. Download Trained Weights: Please download the model from Hugging Face: https://huggingface.co/Canyu/DICEPTION

  3. Update Paths: Set --pretrained_model_path to your SD3 path, and set --diception_path to the local path of the downloaded DICEPTION_v1.pth.

  4. Sample JSON for Batch Inference: We provide several JSON examples for batch inference in the DATA/jsons/evaluate directory.

▶️ Option 1: Simple Inference Script

For single image inference:

python inference.py \
    --image path/to/your/image.jpg \
    --prompt "[[image2depth]]" \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --output_dir ./outputs \
    --guidance_scale 2 \
    --num_inference_steps 28

With coordinate points (for interactive segmentation):

python inference.py \
    --image path/to/your/image.jpg \
    --prompt "[[image2segmentation]]" \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --output_dir ./outputs \
    --guidance_scale 2 \
    --num_inference_steps 28 \
    --points "0.3,0.5;0.7,0.2"

The --points parameter accepts coordinates in the format "y1,x1;y2,x2;y3,x3", where:

  • Coordinates are normalized to the [0,1] range
  • Each point is given as (y,x), with y = row / image_height and x = column / image_width
  • Multiple points are separated by semicolons
  • At most 5 points are supported
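The rules above can be sketched as a small parser. This is a hypothetical helper illustrating the documented format, not the repository's actual implementation:

```python
def parse_points(spec, max_points=5):
    """Parse a "y1,x1;y2,x2" string into a list of (y, x) floats in [0, 1]."""
    points = []
    for pair in spec.split(";"):
        # Each pair is "y,x" with coordinates normalized by image size.
        y, x = (float(v) for v in pair.split(","))
        if not (0.0 <= y <= 1.0 and 0.0 <= x <= 1.0):
            raise ValueError(f"coordinates must be normalized to [0, 1]: {pair}")
        points.append((y, x))
    if len(points) > max_points:
        raise ValueError(f"at most {max_points} points are supported")
    return points

print(parse_points("0.3,0.5;0.7,0.2"))  # [(0.3, 0.5), (0.7, 0.2)]
```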

📦 Option 2: Batch Inference

For batch processing with a JSON dataset:

python batch_inference.py \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --input_path example_batch.json \
    --data_root_path ./ \
    --save_path ./batch_results \
    --batch_size 4 \
    --guidance_scale 2 \
    --num_inference_steps 28
    # --save_npy (optional: save raw depth/normal values)

JSON Format for Batch Inference: The input JSON file should contain a list of tasks in the following format:

[
  {
    "input": "path/to/image1.jpg",
    "caption": "[[image2segmentation]]"
  },
  {
    "input": "path/to/image2.jpg", 
    "caption": "[[image2depth]]"
  },
  {
    "input": "path/to/image3.jpg",
    "caption": "[[image2segmentation]]",
    "target": {
      "path": "path/to/sa1b.json"
    }
  }
]

Note: when a "target" GT JSON is provided (as in the third entry), a region is randomly selected from it as the point prompt, for convenience.
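A batch file in this format can be generated programmatically. A minimal sketch, with placeholder image paths:

```python
import json

# Build a list of tasks matching the batch-inference JSON schema shown above.
tasks = [
    {"input": "path/to/image1.jpg", "caption": "[[image2segmentation]]"},
    {"input": "path/to/image2.jpg", "caption": "[[image2depth]]"},
]

with open("example_batch.json", "w") as f:
    json.dump(tasks, f, indent=2)
```

Pass the resulting file to batch_inference.py via --input_path.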

📋 Supported Tasks

DICEPTION supports various vision perception tasks:

  • Depth Estimation: [[image2depth]]
  • Surface Normal Estimation: [[image2normal]]
  • Pose Estimation: [[image2pose]]
  • Interactive Segmentation: [[image2segmentation]]
  • Semantic Segmentation: [[image2semantic]] followed by a COCO category, e.g. [[image2semantic]] person
  • Entity Segmentation: [[image2entity]]
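When scripting runs across several tasks, the prompt tokens above can be collected in a mapping. The helper below is a hypothetical convenience, not part of the repository:

```python
# Prompt tokens for each task, taken from the supported-tasks list above.
TASK_PROMPTS = {
    "depth": "[[image2depth]]",
    "normal": "[[image2normal]]",
    "pose": "[[image2pose]]",
    "interactive_seg": "[[image2segmentation]]",
    "semantic_seg": "[[image2semantic]] person",  # append a COCO category
    "entity_seg": "[[image2entity]]",
}

def build_command(image, task, sd3_path, diception_path):
    """Assemble an inference.py invocation for the given task."""
    return [
        "python", "inference.py",
        "--image", image,
        "--prompt", TASK_PROMPTS[task],
        "--pretrained_model_path", sd3_path,
        "--diception_path", diception_path,
    ]
```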

💡 Inference Tips

  • General settings: For best overall results, use --num_inference_steps 28 and --guidance_scale 2.0.
  • 1-step/few-step inference: We found that flow-matching diffusion models naturally support few-step inference, especially for tasks like depth and surface normals. DICEPTION can run with --num_inference_steps 1 and --guidance_scale 1.0 with barely any loss in quality. If you prioritize speed, consider this setting. We provide a detailed analysis in our NeurIPS paper.

🗺️ Plan

  • Release inference code and pretrained model v1
  • Release training code
  • Release few-shot finetuning code

🎫 License

For academic use, this project is licensed under the 2-clause BSD License. For commercial use, please contact Chunhua Shen.

🖊️ Citation

@article{zhao2025diception,
  title={Diception: A generalist diffusion model for visual perceptual tasks},
  author={Zhao, Canyu and Liu, Mingyu and Zheng, Huanyi and Zhu, Muzhi and Zhao, Zhiyue and Chen, Hao and He, Tong and Shen, Chunhua},
  journal={arXiv preprint arXiv:2502.17157},
  year={2025}
}

