feat: Qwen-Image-Edit inference and training #473
Conversation
Token max length calculation is fixed. Please re-run Text Encoder output caching.
Thank you for testing! The model weights consume just under 40GB, and the tensor sequence length is twice that of Qwen-Image training, so memory consumption is quite high. Although there are quality issues, specifying …
By removing unnecessary variables, peak memory was reduced by about 1GB when training 1328x1328 Qwen-Image-Edit.
The original Diffusers implementation required the generated image and the control image to have the same resolution, but this appears to have been a bug in Diffusers (huggingface/diffusers#12188). Allowing control images of any size, as in FLUX.1 Kontext training, may reduce memory consumption.
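The memory impact of the control-image size can be estimated from the transformer sequence length. Assuming the usual 8x VAE downscale plus 2x2 patchification (so one token per 16x16 pixel block; this factor is an assumption, not taken from the PR), each image contributes roughly (H/16)·(W/16) tokens, which is why a same-size control image doubles the sequence and a smaller one trims it:

```python
def image_tokens(height: int, width: int, vae_scale: int = 8, patch: int = 2) -> int:
    """Approximate transformer tokens contributed by one image.

    Assumes 8x VAE downscaling and 2x2 patchification (one token per
    16x16 pixel block) -- an illustrative assumption, not the exact
    musubi-tuner implementation.
    """
    factor = vae_scale * patch  # 16 pixels per token side
    return (height // factor) * (width // factor)

target = image_tokens(1328, 1328)           # 6889 tokens
same_size_control = image_tokens(1328, 1328)
smaller_control = image_tokens(1024, 1024)  # 4096 tokens

print(target + same_size_control)  # same-size control: 13778 tokens
print(target + smaller_control)    # smaller control:   10985 tokens
```

Under this estimate, a 1024x1024 control image cuts roughly 20% of the tokens versus a same-size 1328x1328 control, and attention cost grows superlinearly with sequence length.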
I feel that Qwen-Image-Edit learns very slowly, and I'm not sure whether it's a problem with the model itself... |
modelscope/DiffSynth-Studio#814: I checked the DiffSynth code, and it seems they modified the TE template and the drop index when using Qwen-Image-Edit.
We are already doing that 😄:
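The drop-index idea referenced above can be sketched as follows: the chat-template prefix tokens are run through the text encoder along with the prompt, and the corresponding leading positions are then sliced off the encoder outputs before they reach the diffusion model. All names and the drop count here are illustrative, not the actual musubi-tuner code:

```python
def drop_template_tokens(hidden_states, attention_mask, drop_idx):
    """Remove the first `drop_idx` positions (the chat-template prefix)
    from text-encoder outputs. The correct `drop_idx` depends on the
    template in use; the value below is illustrative only."""
    return hidden_states[drop_idx:], attention_mask[drop_idx:]

# Toy example with lists standing in for tensors:
hs = ["<im_start>", "system", "...", "prompt_tok_1", "prompt_tok_2"]
mask = [1, 1, 1, 1, 1]
hs2, mask2 = drop_template_tokens(hs, mask, drop_idx=3)
print(hs2)  # ['prompt_tok_1', 'prompt_tok_2']
```

If the template changes (as in edit mode, where image placeholders are added), the drop index must change with it, which is what the DiffSynth modification addresses.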
…some difficulties in the same training.
If neither option is present, the control image will be resized to the same size as the target image. Please re-run latent and Text Encoder caching if these options are changed.
FLUX.1 Kontext caching/training are also updated, so those caches must be re-created.
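The default behavior described above (resizing the control image to the target size) is typically implemented as an aspect-preserving resize that covers the target, followed by a center crop. The geometry can be sketched as below; this is an illustration of the described behavior, not the exact musubi-tuner resizing code:

```python
def fit_resize_crop(src_w, src_h, dst_w, dst_h):
    """Compute an aspect-preserving resize size and a center-crop box that
    map a control image onto the target size (a sketch of 'resize to the
    same size as the target image'; the real pipeline may differ).
    Returns ((resized_w, resized_h), (left, top, right, bottom))."""
    scale = max(dst_w / src_w, dst_h / src_h)   # scale up enough to cover the target
    rw, rh = round(src_w * scale), round(src_h * scale)
    left = (rw - dst_w) // 2
    top = (rh - dst_h) // 2
    return (rw, rh), (left, top, left + dst_w, top + dst_h)

# A 1920x1080 control image mapped onto a 1328x1328 target:
size, box = fit_resize_crop(1920, 1080, 1328, 1328)
print(size, box)  # (2361, 1328) (516, 0, 1844, 1328)
```

Because the cached control latents depend on this geometry, changing the resizing options invalidates the caches, hence the instruction to re-run latent and Text Encoder caching.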
Pull Request Overview
This PR adds support for Qwen-Image-Edit inference and training, extending the existing Qwen-Image support to include image editing capabilities with control images. The implementation mirrors the structure of existing architectures but adds support for conditioning on input images for editing tasks.
- Adds `--edit` flag to enable Qwen-Image-Edit mode with control image support
- Extends existing Qwen-Image utilities to handle vision-language processing with images
- Updates dataset handling to support control images for both training and inference
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/musubi_tuner/utils/sai_model_spec.py | Fixed architecture constant references for metadata generation |
| src/musubi_tuner/utils/safetensors_utils.py | Added utility function to find keys in safetensors files |
| src/musubi_tuner/utils/image_utils.py | Corrected documentation for image preprocessing return format |
| src/musubi_tuner/qwen_image_train_network.py | Added edit mode support with control image processing and VL encoding |
| src/musubi_tuner/qwen_image_generate_image.py | Added edit mode inference with control image handling and CLI options |
| src/musubi_tuner/qwen_image_cache_text_encoder_outputs.py | Extended to support image-conditioned text encoding for edit mode |
| src/musubi_tuner/qwen_image_cache_latents.py | Added control latent caching support for edit mode |
| src/musubi_tuner/qwen_image/qwen_image_utils.py | Added VL processor loading and image-conditioned prompt encoding functions |
| src/musubi_tuner/qwen_image/qwen_image_model.py | Optimized attention implementation and fixed RoPE computation for better memory usage |
| src/musubi_tuner/flux_kontext_train_network.py | Standardized control latent handling to match other architectures |
| src/musubi_tuner/flux_kontext_cache_latents.py | Unified control image processing and latent batching |
| src/musubi_tuner/dataset/image_video_dataset.py | Added edit mode configuration options and control image resizing logic |
| src/musubi_tuner/dataset/config_utils.py | Added configuration schema for edit mode parameters |
| src/musubi_tuner/cache_text_encoder_outputs.py | Extended to support content-requiring encoders for VL models |
| .ai/context/overview.md | Updated documentation to include Qwen-Image support |
Co-authored-by: Copilot <[email protected]>
Documentation has been added.
After testing, inputting [1024,1024] seems to result in significantly better training performance. Perhaps it needs to match the official input of 1M pixels...
By specifying … However, it seems it would also be a good idea to set the training images to [1024,1024].
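Matching the model's apparent ~1M-pixel training distribution amounts to rescaling any resolution to roughly 1024x1024 pixels of area while preserving aspect ratio. A sketch, where snapping to multiples of 32 is an assumption (different pipelines use different granularities):

```python
import math

def to_one_megapixel(w: int, h: int, target_area: int = 1024 * 1024, step: int = 32):
    """Rescale (w, h) to roughly `target_area` pixels, preserving aspect
    ratio and snapping each side to a multiple of `step`.
    step=32 is an illustrative assumption, not a documented requirement."""
    scale = math.sqrt(target_area / (w * h))
    return (round(w * scale / step) * step, round(h * scale / step) * step)

print(to_one_megapixel(1328, 1328))  # (1024, 1024)
print(to_one_megapixel(1920, 1080))  # (1376, 768)
```

This is consistent with the observation above: 1328x1328 inputs sit well above 1M pixels, while [1024,1024] lands exactly on the presumed official input area.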
Thank you! I think the default should be a conservative value, and I also use a lower learning rate than for other models, so I changed it to 5e-5. I also changed the rank (dim) to 16 because the LoRA becomes huge at 32.
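The file-size effect of halving the rank can be estimated directly: a LoRA pair for one linear layer adds rank·(in_features + out_features) parameters, so the total size scales linearly with rank. The layer dimensions and count below are illustrative, not the actual Qwen-Image-Edit architecture:

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameter count of one LoRA pair (A: in_features x rank, B: rank x out_features)."""
    return rank * (in_features + out_features)

# Illustrative layer sizes and count (not the real Qwen-Image-Edit dims):
layers = [(3072, 3072)] * 60
for r in (32, 16):
    total = sum(lora_params(i, o, r) for i, o in layers)
    print(f"rank {r}: ~{total * 2 / 1e6:.0f} MB at bf16")
```

Since size is linear in rank, dropping from 32 to 16 halves the saved LoRA regardless of which layers are adapted.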
@kohya-ss
Thank you! Hmm... It would probably be better to make this available as an option. It might be a good idea to add the …


- `flux_kontext_no_resize_control` is not supported. Control images are resized/cropped to the target image size.
- …`00001` to `--dit`) or ComfyUI repackaged weights (use bf16; not tested).
- `--edit` option added to Text Encoder output caching, inference and training scripts.
- `--control_image_path` option to specify the control image for the inference script, like FLUX.1 Kontext.
- `--ci path/to/control.png` (or jpg etc.) option to specify the control image.