Diagnosing Latency in llama.cpp Python Wrapper for Short Prompts #2073
-
When using the llama.cpp Python wrapper, I've noticed that inference performance can be unexpectedly slow for very short prompts (under around 50 tokens). Specifically, the time to first token (TTFT) seems worse than when using the plain C++ CLI, and in some cases even slower than when running longer prompts. Why might this be happening? I'm particularly interested in how Python overhead, backend execution, tokenization, and caching behavior could be contributing to the slowdown. Also, if you were trying to debug and optimize this—ideally without hurting overall throughput—how would you go about identifying the bottleneck and improving TTFT for short prompts?
-
Yeah, this happens a lot, especially if you've only compared the Python wrapper against the raw llama.cpp binary.

The wrapper adds extra cost just for being in Python. If it crosses into C++ too often (say, per-token callbacks, or too much tokenization done on the Python side), that overhead is noticeable when the prompt is short, because there's very little native work to hide it behind. The GIL can also add delays if you mix threading or async badly.

On top of that, the first forward pass is always the heaviest. It has to populate the KV cache, maybe dequantize weights, and on GPU it often launches kernels for the first time. That's why TTFT feels much higher than per-token latency. The C++ CLI handles some of this a bit more directly, so the Python path just exposes more of the raw cost.

Tokenization matters too: if the wrapper does it in Python, it can dominate when the prompt is short. And if you aren't using prompt caching, you pay the full prefill cost every time, even when the prefix is repeated.

How I'd debug it:

* Time the stages separately (tokenization, prompt eval, first sampled token) instead of looking at one end-to-end number.
* Compare against the C++ CLI with the same model, quantization, context size, thread count, and build flags, so you're measuring the wrapper and not the build.
* Throw away the first request, so model load and warm-up aren't counted as TTFT.
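Here's a minimal timing sketch, assuming the llama-cpp-python high-level `Llama` API; the model path, prompt, and constructor arguments are placeholders and the exact kwargs vary a bit between versions. The point is just to split tokenization time from time-to-first-token:

```python
# Rough TTFT breakdown for llama-cpp-python (paths/params are placeholders).
import time
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=2048, verbose=True)

prompt = "Write a haiku about latency."

# 1. How much wall time goes into tokenization on the Python side?
t0 = time.perf_counter()
tokens = llm.tokenize(prompt.encode("utf-8"))
t1 = time.perf_counter()
print(f"tokenize: {(t1 - t0) * 1e3:.2f} ms for {len(tokens)} tokens")

# 2. Time to first token via the streaming API.
t2 = time.perf_counter()
stream = llm(prompt, max_tokens=32, stream=True)
first_chunk = next(stream)            # blocks until the prompt is evaluated
t3 = time.perf_counter()
print(f"TTFT: {(t3 - t2) * 1e3:.2f} ms")

for _ in stream:                      # drain the rest for steady-state speed
    pass
t4 = time.perf_counter()
print(f"total generation: {(t4 - t2) * 1e3:.2f} ms")
```

With `verbose=True` the underlying llama.cpp should also print its own timing breakdown (prompt eval vs. per-token eval) to stderr, which is a quick way to see whether the time is going into prefill or into wrapper overhead.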
How I'd fix it:

* Do a throwaway warm-up generation right after loading the model, so the first real request doesn't pay the cold-start cost.
* Turn on prompt/prefix caching so a repeated prefix (system prompt, template boilerplate) isn't re-evaluated on every call.
* Keep per-token work out of Python: avoid per-token callbacks where you can and let the native code handle tokenization and sampling.
* Re-measure long prompts afterwards; none of the above should hurt throughput, but it's cheap to confirm.
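A sketch of the warm-up + cache idea, again assuming llama-cpp-python; `LlamaRAMCache` / `set_cache` are the wrapper's built-in prefix cache in recent versions, and the capacity, model path, and prompts here are made up:

```python
# Warm-up plus prompt-prefix caching (assumes llama-cpp-python with LlamaRAMCache).
from llama_cpp import Llama, LlamaRAMCache

llm = Llama(model_path="model.gguf", n_ctx=2048, verbose=False)

# Reuse already-evaluated prompt prefixes across calls instead of
# paying the full prefill cost every time. Capacity is illustrative.
llm.set_cache(LlamaRAMCache(capacity_bytes=256 * 1024 * 1024))

# One throwaway call to absorb cold-start costs (first kernel launches,
# lazy allocations, etc.) before any latency-sensitive request arrives.
llm("Hello", max_tokens=1)

# Keeping a shared system prompt identical between requests means the
# cached prefix matches, so short prompts mostly pay for their own tokens.
SYSTEM = "You are a terse assistant.\n"
out = llm(SYSTEM + "Summarize: short prompts feel slow.", max_tokens=32)
print(out["choices"][0]["text"])
```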
Long story short: Python overhead + cold-start cost + tokenization are the usual suspects. The fix is mostly warm-up and pushing more work into native code.