convert : parse safetensors directly #15667
base: master
Conversation
Force-pushed from 85edafe to 786b32d
I can confirm that this helped me convert GLM 4.5 Air, whereas current …
Force-pushed from 786b32d to e582f1a
Is there anything preventing this PR from being merged or taken out of draft? It's impossible for me to convert GLM Air reliably without this PR, so I think it's quite useful to have.
This comment was marked as off-topic.
@whatever1983 this has nothing to do with fp8 conversion. This is simply a more memory-efficient way of performing the GGUF conversion that prevents OOMs/crashes during the conversion process, which I need in order to convert GLM Air. As for politics, I can't advise on that. I just want to successfully convert my models, hence my bumping the issue.
Applies to both local and remote safetensors custom parsing. This matches the behavior of the official safetensors implementation.

* convert : rename `from_safetensors_meta` to `from_local_tensor`

  For consistency with `from_remote_tensor`
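Since the local header is now parsed by hand, the on-disk layout being read is worth sketching: an 8-byte little-endian unsigned integer giving the JSON header length, the JSON header itself, then raw tensor data. This is a minimal illustration of that layout, not the PR's actual code:

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    # The safetensors format starts with an 8-byte little-endian
    # unsigned integer giving the length of a JSON header; the header
    # is followed by the raw tensor data.
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # Each entry maps a tensor name to its dtype, shape, and
    # "data_offsets" (byte range relative to the end of the header).
    return header
```

Parsing only this header is cheap, and the `data_offsets` entries are what make lazy, per-tensor reads possible without memory-mapping the whole file.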
Force-pushed from e582f1a to e996f3a
Should fix #15623
(originally targeted #14810, but was rebased)
This replaces the approach from #8482 to avoid using `get_slice`, because it turns out it eagerly memmaps tensors, which uses a lot of memory on Windows and inflates the resident set size on Linux. Safetensors files are now parsed directly, since the format is simple enough. This will also eventually allow tracking the file ranges of tensors, to maybe use `os.copy_file_range` when possible to make conversion on COW filesystems very fast (in #15727). On Linux, when using
`memray` (a memory profiler), this change reduces the peak heap memory usage by quite a lot, and with GNU `time`, it also reduces the peak resident set size. The previous behavior observed with `memray` seems to be that `safe_open` puts all of the model into the heap (likely memmapped, since the resident set size is smaller and grows). The new behavior observed with `memray` is closer to what I thought happened in the first place: bumps of memory usage at each processed tensor, which go back down between each. Here's a table of the "Maximum resident set size (kbytes)" from
`time -v` (when using GNU `time`) on a few models:

```console
$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/model_dir --outfile /path/to/model.gguf --outtype f16
```

(Results table comparing `master` and this PR, in kbytes, not preserved here.)

Safetensors files are already directly parsed since #12820 for remote models. This is similar, but for local models.
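The `os.copy_file_range` idea mentioned above can be sketched as follows. This is a hypothetical helper, not code from the PR; it assumes Linux, where the syscall is available:

```python
import os

def copy_range(src_fd: int, dst_fd: int, offset: int, count: int) -> int:
    # Hypothetical sketch: copy `count` bytes starting at `offset` in the
    # source file to the destination's current position. os.copy_file_range
    # (Linux-only) lets the kernel move the bytes without a userspace
    # round-trip, and on reflink-capable COW filesystems (e.g. Btrfs, XFS)
    # it may share extents instead of physically copying.
    copied = 0
    while copied < count:
        n = os.copy_file_range(src_fd, dst_fd, count - copied,
                               offset_src=offset + copied)
        if n == 0:  # unexpected EOF in the source
            break
        copied += n
    return copied
```

Knowing each tensor's byte range in the source file is exactly what would let the converter hand such ranges to the kernel instead of reading them into Python.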
TODO:

- The `safetensors` library automatically byteswaps when running on a big-endian platform (since the format is always little-endian), but `GGUFWriter` byteswaps unconditionally when the target endianness is big, so this never really worked anyway? (Double-byteswapping in this case would produce little-endian tensors...)
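The double-byteswap concern above can be seen with NumPy (a standalone illustration, not the converter's code): a loader swap on a big-endian host followed by an unconditional writer swap cancels out, leaving little-endian bytes in a file labeled big-endian.

```python
import numpy as np

t = np.array([1, 2, 3], dtype="<i4")   # little-endian, as stored on disk
native_be = t.astype(">i4")            # loader byteswap on a big-endian host
double_swapped = native_be.byteswap()  # writer byteswaps again for a BE target
# The two swaps cancel: the output bytes are little-endian again,
# even though the tensor is declared big-endian.
assert double_swapped.tobytes() == t.tobytes()
```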