Sihui Ji
·
Hao Luo
·
Xi Chen
·
Yuanpeng Tu
·
Yiyang Wang
·
Hengshuang Zhao
The University of Hong Kong | DAMO Academy, Alibaba Group | Hupan Lab
- [2025.06.17]: Release the inference code.
- [2025.06.04]: Release the project page and the arxiv paper.
- [2025.03.29]: LayerFlow is accepted by SIGGRAPH 2025 🎉🎉🎉.
TL;DR: We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, the clean background, and the blended scene. It also supports versatile variants, such as decomposing a blended video, or generating the background for a given foreground and vice versa.
- Inference code
- Model checkpoints
- Training code
Begin by cloning the repository:
git clone https://github.com/SihuiJi/LayerFlow.git
cd LayerFlow

Our project is developed based on the SAT version of the CogVideoX code. You can follow the CogVideoX instructions to install the dependencies, or:
conda create -n layer python==3.10
conda activate layer
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0
pip install -r requirements.txt
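Optionally, verify the install with a generic PyTorch sanity check (assumes a CUDA-capable GPU; not part of the original setup steps):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"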
Model checkpoints are available at the following links:

| Models | Download Link (RGB version) | Download Link (RGBA version) |
|---|---|---|
| Multi-layer generation | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Multi-layer decomposition | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Foreground-conditioned generation | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Background-conditioned generation | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
💡Note:
- All models are fine-tuned from CogVideoX-2B.
- The RGB version denotes models that generate the foreground layer without an alpha matte, while the RGBA version simultaneously generates the foreground video and its alpha matte, which can be combined into an RGBA video. However, due to the difficulty of cross-domain generation and channel alignment, its results are generally less stable than those of the RGB version (one way to composite the two outputs is sketched below).
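As an illustration, here is one possible way to merge a foreground video with its alpha matte into a single transparent video using ffmpeg's alphamerge filter; the file names foreground.mp4 and alpha.mp4 are hypothetical placeholders, not outputs the scripts are guaranteed to produce:

# Merge foreground + grayscale matte into a transparent WebM (hypothetical file names)
ffmpeg -i foreground.mp4 -i alpha.mp4 -filter_complex "[0:v][1:v]alphamerge" -c:v libvpx-vp9 -pix_fmt yuva420p foreground_rgba.webm

Here alphamerge uses the second input as a grayscale alpha plane, and VP9 in a WebM container is one of the few widely supported combinations for video with an alpha channel.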
Download models using huggingface-cli:
pip install "huggingface_hub[cli]"
huggingface-cli download zjuJish/LayerFlow --local-dir ./sat/ckpts_2b_lora

or using git:
git lfs install
git clone https://huggingface.co/zjuJish/LayerFlow
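If you prefer scripting the download, a minimal Python sketch using huggingface_hub's snapshot_download (equivalent to the CLI command above; the target directory mirrors the one used there):

python - <<'EOF'
from huggingface_hub import snapshot_download

# Fetch the LayerFlow checkpoints into the folder the inference scripts expect
snapshot_download(repo_id="zjuJish/LayerFlow", local_dir="./sat/ckpts_2b_lora")
EOF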
For the pretrained VAE from the CogVideoX-2B model, download it as follows:
mkdir CogVideoX-2b-sat
cd CogVideoX-2b-sat
wget https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
mv 'index.html?dl=1' vae.zip
unzip vae.zip
Since model weight files are large, it’s recommended to use git lfs.
See here for git lfs installation.
git lfs install
Next, clone the T5 model, which is used as an encoder and doesn’t require training or fine-tuning.
git clone https://huggingface.co/THUDM/CogVideoX-2b.git # Download model from Huggingface
# git clone https://www.modelscope.cn/ZhipuAI/CogVideoX-2b.git # Download from Modelscope
mkdir CogVideoX-2b-sat/t5-v1_1-xxl
mv CogVideoX-2b/text_encoder/* CogVideoX-2b/tokenizer/* CogVideoX-2b-sat/t5-v1_1-xxl
Alternatively, you can download the same files from ModelScope (see the commented command above).
Arrange the above model files in the following structure:
CogVideoX-2b-sat
│
├── t5-v1_1-xxl
│ ├── added_tokens.json
│ ├── config.json
│ ├── model-00001-of-00002.safetensors
│ ├── model-00002-of-00002.safetensors
│ ├── model.safetensors.index.json
│ ├── special_tokens_map.json
│ ├── spiece.model
│ └── tokenizer_config.json
└── vae
└── 3d-vae.pt
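To confirm the files landed where expected, an optional generic check:

find CogVideoX-2b-sat -type f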
The LayerFlow checkpoints downloaded into sat/ckpts_2b_lora should be arranged as follows:

sat
│
├── ckpts_2b_lora
│   ├── multi-layer-generation
│   │   ├── 1000
│   │   │   └── mp_rank_00_model_states.pt
│   │   └── latest
│   ├── multi-layer-decomposition
│   │   ├── 1000
│   │   │   └── mp_rank_00_model_states.pt
│   │   └── latest
│   ├── foreground-conditioned-generation
│   │   ├── 1000
│   │   │   └── mp_rank_00_model_states.pt
│   │   └── latest
│   └── background-conditioned-generation
│       ├── 1000
│       │   └── mp_rank_00_model_states.pt
│       └── latest
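Likewise, an optional check that each task folder contains its checkpoint:

find sat/ckpts_2b_lora -name mp_rank_00_model_states.pt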
cd sat
Run multi-layer generation (RGB version):
bash inference_stage2_gen_rgb.sh
Run multi-layer generation (RGBA version):
bash inference_stage2_gen_rgba.sh
Run multi-layer decomposition (RGB version):
bash inference_stage2_seg_rgb.sh
Run multi-layer decomposition (RGBA version):
bash inference_stage2_seg_rgba.sh
Run foreground-conditioned generation (RGB version):
bash inference_stage2_fg2bg_rgb.sh
Run foreground-conditioned generation (RGBA version):
bash inference_stage2_fg2bg_rgba.sh
Run background-conditioned generation (RGB version):
bash inference_stage2_bg2fg_rgb.sh
Run background-conditioned generation (RGBA version):
bash inference_stage2_bg2fg_rgba.sh
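To keep a log of a run for later inspection, you can tee the output (generic shell usage, not specific to these scripts):

bash inference_stage2_gen_rgb.sh 2>&1 | tee gen_rgb.log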
This project is developed on the codebase of CogVideoX. We appreciate this great work!
Please leave us a star 🌟 and cite our paper if you find our work helpful.
@article{ji2025layerflow,
title={LayerFlow: A Unified Model for Layer-aware Video Generation},
author={Ji, Sihui and Luo, Hao and Chen, Xi and Tu, Yuanpeng and Wang, Yiyang and Zhao, Hengshuang},
year={2025},
journal={arXiv preprint arXiv:2506.04228},
}
