
Conversation

@compilade (Collaborator) commented Aug 29, 2025

Should fix #15623
(originally targeted #14810, but was rebased)

This replaces the approach from #8482 to avoid using get_slice, because it turns out that get_slice eagerly memory-maps tensors, which means on Windows this uses a lot of memory, and on Linux it inflates the resident set size.
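For context, the replaced approach looked roughly like this (a minimal sketch of #8482-style usage, not the exact removed code):

```python
from safetensors import safe_open

# Sketch of the previous approach: safe_open() eagerly memory-maps the whole
# file, which is what inflated memory usage on Windows and the RSS on Linux.
with safe_open("model-00001-of-00002.safetensors", framework="np") as f:
    for name in f.keys():
        st_slice = f.get_slice(name)  # handle into the eagerly-created memmap
        shape = st_slice.get_shape()  # metadata alone is cheap to read
        data = st_slice[:]            # materializes the tensor data
```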

Safetensors files are now parsed directly, since the format is simple enough (see the sketch below). This will also eventually allow tracking the file ranges of tensors, to maybe use os.copy_file_range when possible and make conversion on copy-on-write (CoW) filesystems very fast (in #15727).
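For reference, the on-disk layout being parsed is just a little-endian u64 header size followed by a JSON header mapping tensor names to their dtype, shape, and byte range. A minimal sketch of such a parser (hypothetical helper name, not this PR's actual code):

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    # The file starts with a little-endian u64 giving the JSON header size,
    # followed by the JSON header itself; raw tensor data comes right after.
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size))
    data_start = 8 + header_size
    tensors = {}
    for name, meta in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        begin, end = meta["data_offsets"]  # relative to the end of the header
        tensors[name] = (meta["dtype"], meta["shape"], data_start + begin, data_start + end)
    return tensors
```

Keeping the absolute byte ranges around like this is also what would make an os.copy_file_range fast path possible later.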

On Linux, profiling with memray (a memory profiler) shows that this change reduces the peak heap usage by quite a lot, and GNU time shows that it also reduces the peak resident set size.

The previous behavior observed with memray seems to be that safe_open puts the whole model into the heap (likely memory-mapped, since the resident set size stays smaller and only grows gradually). The new behavior observed with memray is closer to what I thought happened in the first place: bumps of memory usage as each tensor is processed, dropping back down between tensors.
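For reproducing the heap profile, memray's CLI (memray run, then memray flamegraph on the capture) is the simplest route; programmatically, its Tracker API can wrap a workload (a generic usage sketch, not part of this PR):

```python
import memray

# Writes a capture file readable by `memray flamegraph` / `memray stats`.
with memray.Tracker("profile.bin"):
    # Stand-in workload; in practice this would wrap the conversion loop.
    buffers = [bytearray(1024 * 1024) for _ in range(16)]
    del buffers
```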

Here's a table of the "Maximum resident set size (kbytes)" reported by time -v (when using GNU time) for a few models:

```console
$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/model_dir --outfile /path/to/model.gguf --outtype f16
```

| Model | Target type | master (kbytes) | This PR (kbytes) |
| --- | --- | ---: | ---: |
| [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | F16 | 10 334 248 | 1 129 248 |
| [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | F16 | 3 023 112 | 2 104 256 |
| [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct) | F16 | 9 165 048 | 2 680 124 |

Safetensors files are already parsed directly since #12820 for remote models. This is similar, but for local models.


TODO:

  • Handle byteswapping on big-endian platforms?
    • The safetensors library automatically byteswaps when running on a big-endian platform (since the format is always little-endian), but GGUFWriter byteswaps unconditionally when the target endianness is big, so this never really worked anyway? Double-byteswapping in that case would produce little-endian tensors (see the toy illustration below).
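To make the double-byteswap concern concrete, here is a toy numpy illustration of the sequence described above (an assumption about the interaction, not actual converter code):

```python
import numpy as np

# Tensor bytes on disk are little-endian, as the safetensors format requires.
on_disk = np.array([1, 2, 3], dtype="<u4")

# On a big-endian host, the safetensors library byteswaps to native order...
native = on_disk.byteswap()

# ...and GGUFWriter byteswaps again for a big-endian target, which restores
# the original little-endian bytes despite the big-endian target.
written = native.byteswap()
assert written.tobytes() == on_disk.tobytes()
```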


@LostRuins (Collaborator)

I can confirm that this helped me convert GLM 4.5 Air, whereas current main fails.

@compilade compilade force-pushed the compilade/convert-safetensors-parse branch from 786b32d to e582f1a Compare September 9, 2025 18:49
@LostRuins (Collaborator)

Is there anything preventing this PR from being merged or taken out of draft?

It's impossible for me to convert GLM Air reliably without this PR, so I think it's quite useful to have.

@whatever1983

This comment was marked as off-topic.

@LostRuins (Collaborator)

@whatever1983 this has nothing to do with fp8 conversion. This is simply a more memory-efficient way of performing the GGUF conversion that prevents OOMs/crashes during the conversion process, which I need in order to convert GLM Air.

As for politics, I can't advise on that. I just want to successfully convert my models, hence my bumping the issue.

Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.

* convert : rename from_safetensors_meta to from_local_tensor, for consistency with from_remote_tensor
@compilade compilade force-pushed the compilade/convert-safetensors-parse branch from e582f1a to e996f3a Compare November 7, 2025 03:39
@compilade compilade changed the base branch from compilade/convert-prequant to master November 7, 2025 03:39
@compilade compilade marked this pull request as ready for review November 7, 2025 04:11
@compilade compilade requested a review from CISC as a code owner November 7, 2025 04:11

Labels

python (python script changes)

Development

Successfully merging this pull request may close these issues: Misc. bug: convert_hf_to_gguf.py runs out of memory