Yeah, this comes up a lot, especially if you're comparing the Python wrapper directly against the raw llama.cpp binary.

The wrapper adds overhead simply by living in Python. Every crossing of the Python/C++ boundary costs a little (per-token callbacks, detokenizing each streamed chunk, tokenization done on the Python side), and when the prompt is short that fixed cost is a noticeable fraction of total latency. The GIL can also stall things if you mix threading or async carelessly. You can measure the Python-side pieces separately, as in the sketch below.
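A minimal sketch of timing one of those boundary crossings in isolation, assuming llama-cpp-python's `Llama` class (the model path is a placeholder; point it at your own GGUF file):

```python
import time
from llama_cpp import Llama

# Placeholder path -- substitute your own quantized GGUF model.
llm = Llama(model_path="./model.gguf", verbose=False)

prompt = b"Short prompt for timing."

# Tokenization is one Python -> C crossing; time it on its own.
t0 = time.perf_counter()
tokens = llm.tokenize(prompt)
t1 = time.perf_counter()
print(f"tokenize: {(t1 - t0) * 1e3:.2f} ms for {len(tokens)} tokens")
```

If a few milliseconds of tokenization and per-chunk detokenization sit on top of a prompt that only takes tens of milliseconds to process, that ratio is your wrapper tax.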

Also, the first forward pass is always the heaviest: it has to fill the KV cache for the entire prompt, possibly dequantize weights, and on GPU it often pays one-time kernel launch and warm-up costs. That's why time-to-first-token (TTFT) feels so much higher than per-token latency. The C++ CLI pays the same costs, it just doesn't have a wrapper layer on top, so in Python more of that raw startup cost is visible. You can split the one-time cost from steady-state decoding with the measurement below.
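To separate TTFT from steady-state decoding speed, stream the output and timestamp the first chunk. A rough sketch, again assuming llama-cpp-python and a placeholder model path:

```python
import time
from llama_cpp import Llama

# Placeholder path -- substitute your own GGUF model.
llm = Llama(model_path="./model.gguf", verbose=False)

t0 = time.perf_counter()
first = None
n = 0
for chunk in llm("Explain KV caching in one sentence.",
                 max_tokens=64, stream=True):
    if first is None:
        first = time.perf_counter()  # first chunk arrives -> TTFT
    n += 1
t_end = time.perf_counter()

print(f"TTFT: {first - t0:.3f} s")
print(f"steady state: {(t_end - first) / max(n - 1, 1) * 1e3:.1f} ms/token")
```

Run it twice in the same process: if the second TTFT is much lower, most of the gap was one-time warm-up (kernel launches, cache setup) rather than wrapper overhead.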
