Conversation

@kyleengan

Hello! Hopefully I'm not being a pain in the neck here 😅

While trying to plan out a different possible change, I noticed a couple of ways that one might be able to speed up lora loading a bit when loading >1 lora in a single call to load_lora_for_models.

Specifically:

  1. Extract the unet and clip keys from the model once per overall lora load operation (rather than re-extracting them each time an individual lora is loaded)
  2. Attempt to maintain a list of "known modified" unet and clip keys. When loading loras, search this list of keys first, and only walk the full model's unet or clip key set if the lora contains keys that aren't present in the "known modified" list.

In the PR below, the first item is an unconditional change. I've moved the unet and clip key extraction to networks.py -> load_networks() and now pass those two sets along when calling load_lora_for_models (sketched below). This one's a fairly small change in terms of lines of code affected.
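
Roughly, the shape of that change is the following. This is only a minimal sketch with placeholder names and a simplified key lookup (extract_key_sets and the dict-shaped model/lora arguments are illustrative, not the actual code in the diff):

```python
# Sketch of item 1: build the unet/clip key sets once per load_networks()
# call and reuse them for every lora, instead of re-extracting them per lora.
# All names here are placeholders, not the exact ones in the PR.

def extract_key_sets(model):
    """Collect the model's unet and clip module keys once per load operation."""
    unet_keys = set(model["unet"])
    clip_keys = set(model["clip"])
    return unet_keys, clip_keys

def load_lora_for_models(model, lora, unet_keys, clip_keys):
    """Apply one lora, matching its keys against the pre-extracted sets."""
    return {k for k in lora["keys"] if k in unet_keys or k in clip_keys}

def load_networks(model, compiled_lora_targets):
    # Extract once for the whole batch, then pass the sets along.
    unet_keys, clip_keys = extract_key_sets(model)
    for lora in compiled_lora_targets:
        load_lora_for_models(model, lora, unet_keys, clip_keys)
```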

The second item is larger in scale and gives an approximately equal performance boost in my limited testing. First, it sets do_referential_load to True if there is more than one lora in the compiled_lora_targets list, and passes that value along in the load_lora_for_models call(s).

If that value is True in load_lora_for_models, it maintains a separate set() each for the unet and clip keys that the loras loaded so far have modified, and searches those keys first in the load_lora() function. If all the keys a given lora modifies are found in that pass, it won't loop over the full set of unet/clip keys for that lora. If any keys remain, it does the full loop and adds any newly matched keys to the "known modified" set() of that type.
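
A rough sketch of that flow (again with placeholder names and a simplified matching step; the real key matching in load_lora() does more than a plain membership test):

```python
# Sketch of item 2: check the "known modified" sets first, and only fall back
# to walking the full model key sets when some lora keys are still unmatched.

def load_lora(lora_keys, model_unet_keys, model_clip_keys,
              known_unet_keys, known_clip_keys, do_referential_load):
    matched = set()
    remaining = set(lora_keys)

    if do_referential_load:
        # Pass 1: keys that earlier loras in this batch already resolved.
        for key in list(remaining):
            if key in known_unet_keys or key in known_clip_keys:
                matched.add(key)
                remaining.discard(key)
        if not remaining:
            return matched  # every key was already known; skip the full loop

    # Pass 2: the original full loop over the model's key sets, recording any
    # newly matched keys so later loras in the batch can reuse them.
    for model_key in model_unet_keys:
        if model_key in remaining:
            matched.add(model_key)
            remaining.discard(model_key)
            known_unet_keys.add(model_key)
    for model_key in model_clip_keys:
        if model_key in remaining:
            matched.add(model_key)
            remaining.discard(model_key)
            known_clip_keys.add(model_key)

    return matched
```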


As far as testing goes, I mostly tested with both of these modifications present. All tests were performed by executing lines 74-99 of load_networks() 100 times.
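
Roughly speaking, the timing was done along these lines (a hypothetical harness for illustration; time_lora_load and load_block are stand-ins, not the actual test script):

```python
import time

def time_lora_load(load_block, repeats=100):
    """Hypothetical timing harness: run the lora-loading block (the equivalent
    of lines 74-99 of load_networks()) repeatedly and report the average."""
    start = time.perf_counter()
    for _ in range(repeats):
        load_block()
    elapsed = time.perf_counter() - start
    print(f"avg: {elapsed / repeats * 1000:.2f} ms per load")
```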

I noticed that if I applied item #2 when only 1 lora is being loaded, it actually suffered about a 3-5% performance hit (hence the "do_referential_load" logic).

With that logic in place, I saw no performance hit for 1 lora, and roughly a 15-25% performance boost for each lora loaded beyond the first (depending on the types of loras being loaded and in what order).

My test sets were as follows:

  • DMD2 speed lora, 2 unet sdxl loras, 1 tiny sdxl lora, 1 sd1.5 mismatch
  • 2 unet sdxl loras, 1 tiny sdxl lora, 1 sd1.5 mismatch
  • DMD2 speed lora, 2 full sdxl loras, 1 tiny sdxl lora
  • 2 unet sdxl loras, 1 tiny sdxl lora
  • 2 unet sdxl loras
  • 2 unet sdxl loras, 1 full sdxl lora
  • 1 unet sdxl lora, 1 full sdxl lora
  • 1 full sdxl lora

the "unet" loras had 788 unet keys and 0 clip keys each. the "full" lora had 722 unet and 288 clip keys. The "tiny" lora had 4 unet and 0 clip keys. The DMD2 speed lora had 788 unet keys that were mostly or entirely different from the other loras.


Feel free to drop me a line if you have any questions, concerns, or corrections.

Thanks, and sorry again for blindsiding you. 😅

@Haoming02 force-pushed the classic branch 2 times, most recently from 8706acd to 44e47b5 on August 12, 2025 05:52
@Haoming02 marked this pull request as draft on August 18, 2025 01:39
@Haoming02 closed this on Aug 27, 2025