
Conversation

@DajanaV (Contributor) commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#15667

Should fix #15623
(originally targeted #14810, but was rebased)

This replaces the approach from #8482 to avoid using get_slice, because it turns out get_slice eagerly memmaps tensors, which uses a lot of memory on Windows and inflates the resident set size on Linux.

Safetensors files are now parsed directly, since the format is simple enough. This will also eventually allow tracking the file ranges of tensors, so that os.copy_file_range can be used when possible to make conversion on COW filesystems very fast (in #15727).
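
For illustration, here is a minimal sketch of the direct-parsing idea, assuming a local file named model.safetensors (this is not the PR's code; DTYPE_MAP, read_safetensors_header and load_tensor are hypothetical names, and only a few dtypes are mapped). A safetensors file starts with an 8-byte little-endian header length, followed by a JSON header that records each tensor's dtype, shape, and byte range; tensors can then be read one at a time from those ranges instead of memmapping the whole file:

```python
# Minimal sketch of parsing a safetensors file directly (not the PR's code).
import json
import struct
import numpy as np

# Hypothetical dtype map; a real converter supports more types.
DTYPE_MAP = {"F32": np.float32, "F16": np.float16, "I64": np.int64}

def read_safetensors_header(path):
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # little-endian u64 header size
        header = json.loads(f.read(header_len))          # name -> dtype/shape/data_offsets
    header.pop("__metadata__", None)                     # optional metadata entry
    return header, 8 + header_len                        # tensor data starts after the header

def load_tensor(path, info, data_start):
    begin, end = info["data_offsets"]                    # byte range relative to data_start
    with open(path, "rb") as f:
        f.seek(data_start + begin)
        raw = f.read(end - begin)
    return np.frombuffer(raw, dtype=DTYPE_MAP[info["dtype"]]).reshape(info["shape"])

header, data_start = read_safetensors_header("model.safetensors")
for name, info in header.items():
    tensor = load_tensor("model.safetensors", info, data_start)  # one tensor at a time
    # ... convert/write the tensor, then let it be freed before the next one ...
```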

On Linux, this change reduces the peak heap memory usage considerably when measured with memray (a memory profiler), and it also reduces the peak resident set size as measured with GNU time.

The previous behavior, as observed with memray, seems to be that safe_open puts the whole model onto the heap (likely memmapped, since the resident set size is smaller and grows gradually). The new behavior, as observed with memray, is closer to what I expected in the first place: a bump of memory usage for each processed tensor that goes back down before the next one.
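
As an aside, here is a rough sketch of how such measurements can be reproduced in-process (an assumption, not part of the PR; do_conversion_work is a placeholder for the actual conversion run): memray's Tracker context manager records heap allocations to a capture file, and resource.getrusage reports the peak resident set size, matching the "Maximum resident set size" line from GNU time.

```python
# Sketch of measuring peak heap (memray) and peak RSS (getrusage) on Linux.
import resource
from memray import Tracker

def do_conversion_work():
    # Placeholder for the real conversion; allocate something measurable.
    buf = bytearray(64 * 1024 * 1024)
    return len(buf)

with Tracker("conversion_profile.bin"):      # heap allocations recorded to this file
    do_conversion_work()

peak_rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # kbytes on Linux
print(f"Maximum resident set size: {peak_rss_kb} kbytes")
# Afterwards: `memray flamegraph conversion_profile.bin` renders the heap profile.
```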

Here's a table of the "Maximum resident set size (kbytes)" from time -v (when using GNU time) on a few models:

$ $(which time) -v python3 convert_hf_to_gguf.py /path/to/model_dir --outfile /path/to/model.gguf --outtype f16
| Model | Target type | master (kbytes) | This PR (kbytes) |
| --- | --- | --- | --- |
| https://huggingface.co/mistralai/Mistral-7B-v0.1 | F16 | 10 334 248 | 1 129 248 |
| https://huggingface.co/meta-llama/Llama-3.2-1B | F16 | 3 023 112 | 2 104 256 |
| https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct | F16 | 9 165 048 | 2 680 124 |

Safetensors files have already been parsed directly for remote models since #12820. This does the same for local models.


TODO:

  • Handle byteswapping on big-endian platforms?
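
One possible way to handle that TODO, sketched under the assumption that tensor data ends up in NumPy arrays (load_little_endian is a hypothetical helper, not code from this PR): safetensors stores tensor data little-endian, so the buffer can be read with an explicit little-endian dtype and converted to the host byte order on big-endian platforms.

```python
# Sketch (assumption, not this PR's code): read safetensors data as explicit
# little-endian values and convert to native byte order on big-endian hosts.
import sys
import numpy as np

def load_little_endian(raw: bytes, dtype) -> np.ndarray:
    le = np.frombuffer(raw, dtype=np.dtype(dtype).newbyteorder("<"))
    if sys.byteorder == "big":
        le = le.astype(le.dtype.newbyteorder("="))  # byte-swapping copy to native order
    return le

# Example: the little-endian fp16 bytes 00 3C decode to 1.0 on any platform.
print(load_little_endian(b"\x00\x3c", np.float16))  # -> [1.]
```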

Make sure to read the contributing guidelines before submitting a PR

Applies to both local and remote safetensors custom parsing.
This matches the behavior of the official safetensors implementation.

* convert : rename from_safetensors_meta to from_local_tensor

For consistency with from_remote_tensor
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 128d5cb8-641f-44a1-af71-4e4bf67bde8a compared to baseline 52cd5469-e814-4a51-818a-57d1618fc442 reveals minimal performance variations within measurement precision limits. The changes are confined to Python conversion scripts and do not affect C++ runtime performance.

Key Findings

Performance Metrics:

  • Highest Response Time change: llama_supports_rpc (+0.08%, +0.024 ns)
  • Highest Throughput change: std::make_unique<llm_graph_input_pos_bucket> (+0.12%, +0.12 ns)
  • Both functions are non-core utility functions unrelated to inference performance

Core Function Impact:
No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize). The measured variations do not affect tokenization or inference pathways, therefore no impact on tokens per second performance is expected.

Power Consumption Analysis:
All binaries maintain identical power consumption profiles with 0.0% change across:

  • libllama.so, libggml.so (core inference libraries)
  • Command-line tools and utilities
  • Total estimated power consumption remains stable at ~1.77 million nanojoules

Flame Graph and CFG Analysis:

  • llama_supports_rpc shows identical assembly code and control flow structure between versions
  • Linear execution pattern with single external dependency (ggml_backend_reg_by_name@plt)
  • 0.02 ns timing difference attributed to micro-architectural variations rather than code changes

GitHub Code Review:
The PR introduces Python-only changes for safetensors parsing optimization, achieving significant memory usage reductions (up to 89% for large models) during conversion. No C++ runtime code modifications were identified.

Conclusion:
The analysis confirms no meaningful performance impact on the LLaMA.cpp inference engine. Observed timing variations represent measurement precision limits rather than functional changes. The Python conversion improvements enhance memory efficiency without affecting runtime performance.

@DajanaV force-pushed the main branch 20 times, most recently from 96c975c to aa2fc28 on November 9, 2025 at 16:08