Proposed lora loading optimizations #120
Closed
Hello! Hopefully I'm not being a pain in the neck here 😅
While trying to plan out a different possible change, I noticed a couple of ways to speed up lora loading a bit when loading more than one lora in a single call to `load_lora_for_models`. Specifically:
In the PR below, the first item is an unconditional change: I've moved the unet and clip key extraction up into `networks.py` -> `load_networks()`, and those two sets get passed along with each call to `load_lora_for_models`. This one's a fairly small change in terms of lines of code affected; a rough sketch of the idea follows.
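To illustrate the shape of it (all names here are illustrative stand-ins, not the PR's exact code or signatures):

```python
# Sketch of item #1: hoist the unet/clip key extraction out of the per-lora
# loop so it happens once per load_networks() call instead of once per lora.

def extract_unet_keys(model):
    # Stand-in for whatever builds the lora-key -> unet-weight mapping.
    return set(model["unet_weights"])

def extract_clip_keys(clip):
    # Stand-in for the equivalent clip/text-encoder key extraction.
    return set(clip["clip_weights"])

def load_lora_for_models(model, clip, lora, unet_keys, clip_keys):
    # Stand-in: apply one lora using the precomputed key sets.
    ...

def load_networks(model, clip, compiled_lora_targets):
    unet_keys = extract_unet_keys(model)   # computed once, up front
    clip_keys = extract_clip_keys(clip)
    for lora in compiled_lora_targets:
        # Pass the precomputed sets instead of re-deriving them per lora.
        load_lora_for_models(model, clip, lora, unet_keys, clip_keys)
```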
The second item is larger in scale and gives an approximately equal performance boost in my limited testing. First, it sets `do_referential_load` to True if there is more than one lora in the `compiled_lora_targets` list, and passes that value along in the call(s) to `load_lora_for_models`. When that value is True, `load_lora_for_models` maintains a separate set() each of the unet and clip keys that the loras loaded so far have modified, and the `load_lora()` function searches those keys first. If all the keys a given lora modifies are found in that pass, it never has to loop over the full set of unet/clip keys for that lora. If any keys remain, it does the full loop and adds any new keys to the known-modified set() of that type. A sketch of that fast path follows.
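Roughly, the fast path looks like this (names are illustrative; in the PR there is one such set for unet keys and one for clip keys):

```python
# Sketch of item #2: a "referential" first pass over keys that earlier loras
# in the batch already resolved, falling back to the full scan only if needed.

def resolve_lora_keys(lora_keys, all_model_keys, seen_keys, do_referential_load):
    """lora_keys: keys this lora modifies; all_model_keys: the full (large)
    collection we'd otherwise scan; seen_keys: keys earlier loras matched."""
    resolved = set()
    if do_referential_load:
        # Cheap first pass: set membership against previously-matched keys.
        resolved = {k for k in lora_keys if k in seen_keys}
        if len(resolved) == len(lora_keys):
            return resolved  # every key was already known; skip the full scan
    # Full pass over the model's keys for whatever is still unresolved,
    # recording new hits so the next lora can take the fast path.
    remaining = set(lora_keys) - resolved
    for key in all_model_keys:  # the expensive loop this change avoids
        if key in remaining:
            resolved.add(key)
            seen_keys.add(key)
    return resolved
```

With a single lora, `seen_keys` is always empty, so the first pass is pure overhead; that's why the whole thing is gated behind `do_referential_load`.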
Insofar as testing goes, I mostly tested with both of these modifications present. All tests were performed by executing lines 74-99 of `load_networks()` 100 times. I noticed that if I attempted item #2 when only one lora was being loaded, it actually suffered about a 3-5% performance hit (thus the `do_referential_load` logic). With that gating in place, I saw zero performance hit for one lora, and roughly a 15-25% performance boost for each lora loaded beyond the first (depending on the types of loras being loaded and in what order).
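A harness along these lines is enough to reproduce that kind of measurement (illustrative; `run_key_matching` would be a stand-in wrapping lines 74-99 of `load_networks()`):

```python
import time

def benchmark(fn, runs=100):
    # Average wall-clock time per run over `runs` executions.
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

# e.g. benchmark(run_key_matching)
```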
My test sets were as follows:

- The "unet" loras had 788 unet keys and 0 clip keys each.
- The "full" lora had 722 unet and 288 clip keys.
- The "tiny" lora had 4 unet and 0 clip keys.
- The DMD2 speed lora had 788 unet keys that were mostly or entirely different from those of the other loras.
Feel free to drop me a line if you have any questions, concerns, or corrections.
Thanks, and sorry again for blindsiding you. 😅