A single model solves multiple perception tasks, on par with SOTA!
- 2025-09-21: 🚀 Model and inference code released
- 2025-09-19: 🌟 Accepted as NeurIPS 2025 Spotlight
- 2025-02-25: 📝 Paper released
```bash
conda create -n diception python=3.10 -y
conda activate diception
pip install -r requirements.txt
```
- Download SD3 Base Model: Download the Stable Diffusion 3 medium model from https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers (a download snippet follows this list).
- Download Trained Weights: Download the DICEPTION model from Hugging Face: https://huggingface.co/Canyu/DICEPTION
- Update Paths: Set `--pretrained_model_path` to your SD3 path, and set `--diception_path` to the local path of the downloaded `DICEPTION_v1.pth`.
- Sample JSON for Batch Inference: We provide several JSON examples for batch inference in the `DATA/jsons/evaluate` directory.
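One possible way to fetch both checkpoints is sketched below, assuming the Hugging Face CLI (`pip install -U "huggingface_hub[cli]"`); the `--local-dir` targets are arbitrary examples. SD3 medium is a gated repository, so accept its license on the model page and log in first.

```bash
# Sketch: fetch both models with the Hugging Face CLI.
# SD3 medium is gated -- accept the license on its model page, then log in.
huggingface-cli login
huggingface-cli download stabilityai/stable-diffusion-3-medium-diffusers \
    --local-dir ./checkpoints/sd3-medium
huggingface-cli download Canyu/DICEPTION DICEPTION_v1.pth \
    --local-dir ./checkpoints
```

With these example paths, you would pass `--pretrained_model_path ./checkpoints/sd3-medium` and `--diception_path ./checkpoints/DICEPTION_v1.pth` in the commands below.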
For single image inference:
```bash
python inference.py \
    --image path/to/your/image.jpg \
    --prompt "[[image2depth]]" \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --output_dir ./outputs \
    --guidance_scale 2 \
    --num_inference_steps 28
```
With coordinate points (for interactive segmentation):
```bash
python inference.py \
    --image path/to/your/image.jpg \
    --prompt "[[image2segmentation]]" \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --output_dir ./outputs \
    --guidance_scale 2 \
    --num_inference_steps 28 \
    --points "0.3,0.5;0.7,0.2"
```
The `--points` parameter accepts coordinates in the format `"y1,x1;y2,x2;y3,x3"`, where (a worked example follows the list):
- Coordinates are normalized to the [0, 1] range
- The format is (y, x), with y = pixel_row / image_height and x = pixel_col / image_width
- Multiple points are separated by semicolons
- At most 5 points are supported
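As a worked example of the normalization: in a 640x480 image (width x height), clicks at pixels (x=320, y=240) and (x=480, y=120) become (y, x) = (240/480, 320/640) = (0.5, 0.5) and (120/480, 480/640) = (0.25, 0.75):

```bash
# Two point prompts: pixel clicks normalized to (y,x) in [0,1]
python inference.py \
    --image path/to/your/image.jpg \
    --prompt "[[image2segmentation]]" \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --output_dir ./outputs \
    --guidance_scale 2 \
    --num_inference_steps 28 \
    --points "0.5,0.5;0.25,0.75"
```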
For batch processing with a JSON dataset:
```bash
python batch_inference.py \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --input_path example_batch.json \
    --data_root_path ./ \
    --save_path ./batch_results \
    --batch_size 4 \
    --guidance_scale 2 \
    --num_inference_steps 28
# add --save_npy to also save the raw depth/normal values as .npy files
```
JSON Format for Batch Inference: The input JSON file should contain a list of tasks in the following format:
```json
[
    {
        "input": "path/to/image1.jpg",
        "caption": "[[image2segmentation]]"
    },
    {
        "input": "path/to/image2.jpg",
        "caption": "[[image2depth]]"
    },
    {
        "input": "path/to/image3.jpg",
        "caption": "[[image2segmentation]]",
        "target": {
            "path": "path/to/sa1b.json"
        }
    }
]
```
For convenience, when a segmentation entry provides a `target`, a region is randomly selected from the ground-truth JSON to serve as the point prompt.
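As a quick way to author such a file by hand, the heredoc below writes a minimal two-task batch file (the image path is a placeholder; the task tokens are listed in the next section):

```bash
# Sketch: write a minimal batch file for two tasks on one image.
cat > example_batch.json <<'EOF'
[
    {"input": "path/to/image1.jpg", "caption": "[[image2depth]]"},
    {"input": "path/to/image1.jpg", "caption": "[[image2normal]]"}
]
EOF
```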
DICEPTION supports various vision perception tasks (a sketch that runs them all follows this list):
- Depth Estimation: `[[image2depth]]`
- Surface Normal Estimation: `[[image2normal]]`
- Pose Estimation: `[[image2pose]]`
- Interactive Segmentation: `[[image2segmentation]]`
- Semantic Segmentation: `[[image2semantic]]` followed by a COCO category, e.g. `[[image2semantic]] person`
- Entity Segmentation: `[[image2entity]]`
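For illustration, the loop below runs each prompt-only task on one image; the task tokens come from the list above, and all other flags mirror the single-image command documented earlier:

```bash
# Sketch: run all prompt-only tasks on a single image.
for task in image2depth image2normal image2pose image2segmentation image2entity; do
    python inference.py \
        --image path/to/your/image.jpg \
        --prompt "[[${task}]]" \
        --pretrained_model_path PATH_TO_SD3 \
        --diception_path PATH_TO_DICEPTION_v1.PTH \
        --output_dir ./outputs/${task} \
        --guidance_scale 2 \
        --num_inference_steps 28
done
# Interactive segmentation can additionally take --points (see above), and
# semantic segmentation takes a COCO category after the token:
# ... --prompt "[[image2semantic]] person"
```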
- General settings: For best overall results, use `--num_inference_steps 28` and `--guidance_scale 2.0`.
- 1-step/few-step inference: We found that flow-matching diffusion models naturally support few-step inference, especially for tasks such as depth and surface normal estimation. DICEPTION can run with `--num_inference_steps 1` and `--guidance_scale 1.0` with barely any loss in quality. If you prioritize speed, consider this setting (see the command below). We provide a detailed analysis in our NeurIPS paper.
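For example, a speed-oriented single-step run of the depth task would look like:

```bash
# Fast path: one denoising step, guidance disabled (scale 1.0).
python inference.py \
    --image path/to/your/image.jpg \
    --prompt "[[image2depth]]" \
    --pretrained_model_path PATH_TO_SD3 \
    --diception_path PATH_TO_DICEPTION_v1.PTH \
    --output_dir ./outputs \
    --guidance_scale 1.0 \
    --num_inference_steps 1
```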
- Release inference code and pretrained model v1
- Release training code
- Release few-shot finetuning code
For academic use, this project is licensed under the BSD 2-Clause License. For commercial use, please contact Chunhua Shen.
```bibtex
@article{zhao2025diception,
    title={Diception: A generalist diffusion model for visual perceptual tasks},
    author={Zhao, Canyu and Liu, Mingyu and Zheng, Huanyi and Zhu, Muzhi and Zhao, Zhiyue and Chen, Hao and He, Tong and Shen, Chunhua},
    journal={arXiv preprint arXiv:2502.17157},
    year={2025}
}
```