Describe the bug
I am trying to run the Flux LoRA quantization example from
https://github.com/huggingface/diffusers/tree/main/examples/research_projects/flux_lora_quantization
but training fails with: RuntimeError: The size of tensor a (4173) must match the size of tensor b (16461) at non-singleton dimension 2
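Judging by the traceback below, the failure happens inside diffusers.models.embeddings.apply_rotary_emb, where the rotary cos/sin tables are broadcast against the attention query. A minimal sketch that raises the same error with assumed shapes (24 heads, head dim 128 are guesses for illustration; only the two sequence lengths are taken from the error message):

import torch

query = torch.randn(1, 24, 4173, 128)  # (batch, heads, text + packed image tokens, head_dim)
cos = torch.randn(16461, 128)           # rotary table built for a different token count
out = query.float() * cos               # broadcast fails on dimension 2: 4173 vs 16461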
Reproduction
Steps to reproduce:
python compute_embeddings.py

accelerate launch --config_file=accelerate.yaml \
train_dreambooth_lora_flux_miniature.py \
--pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
--data_df_path="embeddings.parquet" \
--output_dir="yarn_art_lora_flux_nf4" \
--mixed_precision="fp16" \
--use_8bit_adam \
--weighting_scheme="none" \
--resolution=1024 \
--train_batch_size=1 \
--repeats=1 \
--learning_rate=1e-4 \
--guidance_scale=1 \
--report_to="wandb" \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--cache_latents \
--rank=4 \
--max_train_steps=700 \
--seed="0"

(--pretrained_model_name_or_path points to a locally downloaded copy of FLUX.1-dev.)
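For what it's worth, the two sizes in the error differ by the same 77 tokens, and the remainder matches the packed vs. unpacked latent grid for --resolution=1024 (128x128 latents from the 8x VAE, packed 2x2 into 64x64 tokens). A quick arithmetic check, assuming a 77-token text sequence from compute_embeddings.py:

latent_h, latent_w = 128, 128                            # 1024 / 8 (VAE downsampling factor)
text_tokens = 77                                         # assumed prompt length used for the cached embeddings
packed_image_tokens = (latent_h // 2) * (latent_w // 2)  # 2x2 patch packing -> 4096

print(packed_image_tokens + text_tokens)                 # 4173  -> size of "tensor a"
print(latent_h * latent_w + text_tokens)                 # 16461 -> size of "tensor b"

So the query appears to carry the expected packed token count, while the rotary embedding looks like it was built for the unpacked latent grid.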
Logs
(env) root:~/tharun/Flux-HF# accelerate launch --config_file=accelerate.yaml \
train_dreambooth_lora_flux_miniature.py \
--pretrained_model_name_or_path="/root/tharun/black-forest-labs/FLUX.1-dev" \
--data_df_path="embeddings.parquet" \
--output_dir="yarn_art_lora_flux_nf4" \
--mixed_precision="fp16" \
--use_8bit_adam \
--weighting_scheme="none" \
--resolution=1024 \
--train_batch_size=1 \
--repeats=1 \
--learning_rate=1e-4 \
--guidance_scale=1 \
--report_to="wandb" \
--gradient_accumulation_steps=4 \
--gradient_checkpointing \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--cache_latents \
--rank=4 \
--max_train_steps=700 \
--seed="0"
10/29/2024 16:58:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
Merged sharded checkpoints as `hf_quantizer` is not None.
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Caching latents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:01<00:00, 10.35it/s]
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: tharunsivamani (tharunsivamani-student). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.5
wandb: Run data is saved locally in /root/tharun/Flux-HF/wandb/run-20241029_165827-66cke5nx
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run feasible-bird-3
wandb: ⭐️ View project at https://wandb.ai/tharunsivamani-student/dreambooth-flux-dev-lora-nf4
wandb: 🚀 View run at https://wandb.ai/tharunsivamani-student/dreambooth-flux-dev-lora-nf4/runs/66cke5nx
10/29/2024 16:58:28 - INFO - __main__ - ***** Running training *****
10/29/2024 16:58:28 - INFO - __main__ - Num examples = 18
10/29/2024 16:58:28 - INFO - __main__ - Num batches each epoch = 18
10/29/2024 16:58:28 - INFO - __main__ - Num Epochs = 140
10/29/2024 16:58:28 - INFO - __main__ - Instantaneous batch size per device = 1
10/29/2024 16:58:28 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
10/29/2024 16:58:28 - INFO - __main__ - Gradient Accumulation steps = 4
10/29/2024 16:58:28 - INFO - __main__ - Total optimization steps = 700
Steps: 0%| | 0/700 [00:00<?, ?it/s]Traceback (most recent call last):
File "/root/tharun/Flux-HF/train_dreambooth_lora_flux_miniature.py", line 1183, in <module>
main(args)
File "/root/tharun/Flux-HF/train_dreambooth_lora_flux_miniature.py", line 1072, in main
model_pred = transformer(
File "/root/tharun/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 820, in forward
return model_forward(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 808, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/root/tharun/env/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
return func(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 490, in forward
encoder_hidden_states, hidden_states = torch.utils.checkpoint.checkpoint(
File "/root/tharun/env/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
return disable_fn(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
return fn(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 496, in checkpoint
ret = function(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 485, in custom_forward
return module(*inputs)
File "/root/tharun/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 175, in forward
attn_output, context_attn_output = self.attn(
File "/root/tharun/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/root/tharun/env/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 495, in forward
return self.processor(
File "/root/tharun/env/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 1872, in __call__
query = apply_rotary_emb(query, image_rotary_emb)
File "/root/tharun/env/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 770, in apply_rotary_emb
out = (x.float() * cos + x_rotated.float() * sin).to(x.dtype)
RuntimeError: The size of tensor a (4173) must match the size of tensor b (16461) at non-singleton dimension 2
wandb: 🚀 View run feasible-bird-3 at: https://wandb.ai/tharunsivamani-student/dreambooth-flux-dev-lora-nf4/runs/66cke5nx
wandb: Find logs at: wandb/run-20241029_165827-66cke5nx/logs
Traceback (most recent call last):
File "/root/tharun/env/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/root/tharun/env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/root/tharun/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1168, in launch_command
simple_launcher(args)
File "/root/tharun/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 763, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/tharun/env/bin/python', 'train_dreambooth_lora_flux_miniature.py', '--pretrained_model_name_or_path=/root/tharun/black-forest-labs/FLUX.1-dev', '--data_df_path=embeddings.parquet', '--output_dir=yarn_art_lora_flux_nf4', '--mixed_precision=fp16', '--use_8bit_adam', '--weighting_scheme=none', '--resolution=1024', '--train_batch_size=1', '--repeats=1', '--learning_rate=1e-4', '--guidance_scale=1', '--report_to=wandb', '--gradient_accumulation_steps=4', '--gradient_checkpointing', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--cache_latents', '--rank=4', '--max_train_steps=700', '--seed=0']' returned non-zero exit status 1.
System Info
- 🤗 Diffusers version: 0.32.0.dev0
- Platform: Linux-5.15.0-119-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.10.12
- PyTorch version (GPU?): 2.5.0+cu124 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.24.7
- Transformers version: 4.46.1
- Accelerate version: 1.0.1
- PEFT version: 0.13.2
- Bitsandbytes version: 0.44.1
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA L40S, 46068 MiB
NVIDIA L40S, 46068 MiB
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response