
Conversation

@kohya-ss
Owner

@kohya-ss kohya-ss commented Aug 19, 2025

@kohya-ss
Owner Author

Token max length calculation is fixed. Please re-run Text Encoder output caching.

@sdbds
Contributor

sdbds commented Aug 20, 2025

After testing, normal training works, but it seems that with the [1328,1328] bucket, 50GB of VRAM is required.

@kohya-ss
Owner Author

> After testing, normal training works, but it seems that with the [1328,1328] bucket, 50GB of VRAM is required.

Thank you for testing!

The model weights consume just under 40GB, and the tensor sequence length is twice that of Qwen-Image training, so memory consumption is quite high. Although there are quality issues, specifying --fp8_base and --fp8_scaled may be more practical.
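
For reference, a minimal invocation sketch with these options. The script path comes from this PR's file list and --edit from the PR overview below; --dit and --dataset_config are assumed to follow musubi-tuner's usual conventions, paths are placeholders, and other required arguments are omitted:

```bash
# Sketch only: --fp8_base/--fp8_scaled are the fp8 options discussed above,
# --edit enables Qwen-Image-Edit mode. Other required arguments (VAE, text
# encoder, network options, etc.) are omitted here.
accelerate launch src/musubi_tuner/qwen_image_train_network.py \
  --dit path/to/qwen_image_edit.safetensors \
  --dataset_config dataset.toml \
  --edit \
  --fp8_base --fp8_scaled
```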

@kohya-ss
Owner Author

By removing unnecessary variables, the peak memory was reduced by about 1GB when training 1328x1328 Qwen-Image-Edit.

@kohya-ss
Owner Author

The original Diffusers implementation required the generated image and the control image to have the same resolution, but this appears to have been a bug in Diffusers.

huggingface/diffusers#12188
huggingface/diffusers#12190

Allowing control images of any size, similar to training FLUX.1 Kontext, may reduce memory consumption.

@sdbds
Contributor

sdbds commented Aug 20, 2025

I feel that Qwen-Image-Edit learns very slowly, and I'm not sure if it's a problem with the model itself...

@sdbds
Contributor

sdbds commented Aug 20, 2025


modelscope/DiffSynth-Studio#814

I checked the DiffSynth code, and it seems they modified the TE template and the drop index when using Qwen-Image-Edit.

@kohya-ss
Owner Author

> modelscope/DiffSynth-Studio#814
>
> I checked the DiffSynth code, and it seems they modified the TE template and the drop index when using Qwen-Image-Edit.

We are already doing that 😄:

def get_qwen_prompt_embeds_with_image(
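
For context, a schematic sketch of the "TE template + drop index" idea: the prompt (with image placeholder tokens) is wrapped in a VL chat template, encoded, and the fixed template-prefix tokens are then dropped from the hidden states. The drop indices below (34 for text-to-image, 64 for edit) reflect my reading of the official pipelines and are assumptions, not a quote of this repo's code:

```python
import torch

# Assumed drop indices: the edit-mode template includes vision tokens,
# so a longer prefix is dropped than in text-to-image mode.
T2I_DROP_IDX = 34
EDIT_DROP_IDX = 64

def drop_template_prefix(
    hidden_states: torch.Tensor,   # (batch, seq_len, dim) encoder outputs
    attention_mask: torch.Tensor,  # (batch, seq_len)
    drop_idx: int,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Remove the chat-template prefix so only prompt/image tokens condition the DiT."""
    return hidden_states[:, drop_idx:], attention_mask[:, drop_idx:]

# Usage with dummy tensors:
h = torch.randn(1, 128, 3584)
m = torch.ones(1, 128, dtype=torch.long)
h_edit, m_edit = drop_template_prefix(h, m, EDIT_DROP_IDX)  # seq 128 -> 64
```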

@mliand

mliand commented Aug 21, 2025

I'm running into some difficulties with the same training.

@kohya-ss
Owner Author

dataset_config.md now has the following two new options for Qwen-Image-Edit.

  • qwen_image_edit_no_resize_control: Same as FLUX.1 Kontext; the control image is not resized.
  • qwen_image_edit_control_resolution: Resizes the control image to the specified bucket size. To resize the control images the same way as the official code, specify [1024,1024].

If neither option is present, the control image is resized to the same size as the target image. Please re-run latent and Text Encoder caching if these options are changed.
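
For example, a minimal dataset config sketch. Only the two options above come from this PR; the section layout and the control_directory key are assumed to follow musubi-tuner's usual dataset config, and paths are placeholders:

```toml
[[datasets]]
image_directory = "path/to/target_images"     # placeholder
control_directory = "path/to/control_images"  # placeholder; assumed key name

# Resize control images to the ~1M-pixel bucket, like the official code:
qwen_image_edit_control_resolution = [1024, 1024]

# Or keep control images at their original size, as with FLUX.1 Kontext:
# qwen_image_edit_no_resize_control = true
```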

qwen_image_generate_image.py has the following two new options.

  • --resize_control_to_image_size: Resizes the control image to the generation size.
  • --resize_control_to_official_size: Resizes the control image to match the official size (1M pixels, keeping the aspect ratio).
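
A hedged inference sketch: only --edit and the two resize options come from this PR; the control-image and prompt argument names are assumptions, and other required arguments are omitted:

```bash
# --control_image_path and --prompt are assumed argument names.
python src/musubi_tuner/qwen_image_generate_image.py \
  --dit path/to/qwen_image_edit.safetensors \
  --edit \
  --control_image_path input.png \
  --prompt "change the background to a beach" \
  --resize_control_to_official_size
```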

FLUX.1 Kontext caching/training are also updated, so the caches must be re-created.

Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds support for Qwen-Image-Edit inference and training, extending the existing Qwen-Image support to include image editing capabilities with control images. The implementation mirrors the structure of existing architectures but adds support for conditioning on input images for editing tasks.

  • Adds --edit flag to enable Qwen-Image-Edit mode with control image support
  • Extends existing Qwen-Image utilities to handle vision-language processing with images
  • Updates dataset handling to support control images for both training and inference

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/musubi_tuner/utils/sai_model_spec.py | Fixed architecture constant references for metadata generation |
| src/musubi_tuner/utils/safetensors_utils.py | Added utility function to find keys in safetensors files |
| src/musubi_tuner/utils/image_utils.py | Corrected documentation for image preprocessing return format |
| src/musubi_tuner/qwen_image_train_network.py | Added edit mode support with control image processing and VL encoding |
| src/musubi_tuner/qwen_image_generate_image.py | Added edit mode inference with control image handling and CLI options |
| src/musubi_tuner/qwen_image_cache_text_encoder_outputs.py | Extended to support image-conditioned text encoding for edit mode |
| src/musubi_tuner/qwen_image_cache_latents.py | Added control latent caching support for edit mode |
| src/musubi_tuner/qwen_image/qwen_image_utils.py | Added VL processor loading and image-conditioned prompt encoding functions |
| src/musubi_tuner/qwen_image/qwen_image_model.py | Optimized attention implementation and fixed RoPE computation for better memory usage |
| src/musubi_tuner/flux_kontext_train_network.py | Standardized control latent handling to match other architectures |
| src/musubi_tuner/flux_kontext_cache_latents.py | Unified control image processing and latent batching |
| src/musubi_tuner/dataset/image_video_dataset.py | Added edit mode configuration options and control image resizing logic |
| src/musubi_tuner/dataset/config_utils.py | Added configuration schema for edit mode parameters |
| src/musubi_tuner/cache_text_encoder_outputs.py | Extended to support content-requiring encoders for VL models |
| .ai/context/overview.md | Updated documentation to include Qwen-Image support |


@kohya-ss
Owner Author

Documentation has been added.

@sdbds
Contributor

sdbds commented Aug 21, 2025

After testing, inputting [1024,1024] seems to result in significantly better training performance. Perhaps it needs to match the official input of 1M pixels...
Also, the default learning rate of 1E-4 seems a bit high for this model? It feels like it's easily overfitting.

@kohya-ss
Owner Author

> After testing, inputting [1024,1024] seems to result in significantly better training performance. Perhaps it needs to match the official input of 1M pixels...

By specifying qwen_image_edit_control_resolution = [1024, 1024] in the dataset config, the control images are fixed to 1M pixels regardless of the resolution of the training images.

However, it seems like it would also be a good idea to set the training images to [1024,1024].

> Also, the default learning rate of 1E-4 seems a bit high for this model? It feels like it's easily overfitting.

Thank you! I think the default should be a conservative value, and I also use a lower learning rate than for other models, so I changed it to 5e-5. I also changed the rank (dim) to 16 because the LoRA becomes huge at 32.

@kohya-ss kohya-ss marked this pull request as ready for review August 21, 2025 14:21
@kohya-ss kohya-ss merged commit fe99c71 into main Aug 21, 2025
@kohya-ss kohya-ss deleted the feat-qwen-image-edit-support branch August 21, 2025 22:29
@sdbds
Contributor

sdbds commented Aug 25, 2025

@kohya-ss
Because the effect is very strange, I specifically asked the official technical staff, and they told me that they use specialized bucket training for the edit model.
I think we could consider setting up these specialized buckets, like Wan 2.1.

1024, 1024
1168, 864
864, 1168
1280, 720
720, 1280
896, 1120
1120, 896
800, 1200
1200, 800

@kohya-ss
Owner Author

> Because the effect is very strange, I specifically asked the official technical staff, and they told me that they use specialized bucket training for the edit model.
> I think we could consider setting up these specialized buckets, like Wan 2.1.

Thank you! Hmm... It would probably be better to make this available as an option.

It might be a good idea to add the use_qwen_image_default_buckets option in the dataset config, or to allow users to specify any resolution using the custom_bucket_resolutions option, etc.
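
To make this concrete, a sketch of what the proposal could look like in the dataset config (both option names are only proposals from this comment, not implemented options):

```toml
[[datasets]]
image_directory = "path/to/images"  # placeholder

# Proposed: opt into the Qwen-Image-Edit bucket set listed above
use_qwen_image_default_buckets = true

# Or, proposed: specify arbitrary bucket resolutions directly
# custom_bucket_resolutions = [[1024, 1024], [1168, 864], [864, 1168]]
```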
