Just a draft PR for my own personal use. I'm terrible at C++, so no one should trust this. Upstream card is #15049
## What?

This PR adds `GGML_CUDA_ALLOW_LARGE_TENSORS`. When enabled, it allows 64-bit sizes in the CUDA copy routines.

Q. What is the difference between `INT_MAX` and `SIZE_MAX / 4`? How much larger of a tensor will this accommodate?

A. The difference between `INT_MAX` and `SIZE_MAX / 4` is enormous (assuming a 64-bit `size_t`):

- `INT_MAX`: 2,147,483,647 bytes ≈ 2 GiB
- `SIZE_MAX / 4`: 4,611,686,018,427,387,903 bytes ≈ 4,294,967,296 GiB ≈ 4 EiB

That is roughly a factor of 2^31 (about 2.1 billion) more headroom.
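The PR diff is the source of truth; as a rough illustration of the kind of change involved, here is a minimal sketch of a copy kernel indexed with `int64_t` instead of `int`, with a guard against `SIZE_MAX / 4`. The names and structure here are illustrative, not the actual ggml symbols:

```cuda
#include <cassert>
#include <cstdint>

// Illustrative only: ggml's real copy routines differ. The point is that
// the element index and byte size are 64-bit, so a single tensor larger
// than INT_MAX bytes can be copied in one launch.
static __global__ void cpy_f32_large(const float * src, float * dst, int64_t ne) {
    // Widen the per-thread index math to 64 bits before multiplying;
    // blockIdx.x * blockDim.x can overflow a 32-bit int.
    const int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ne) {
        dst[i] = src[i];
    }
}

static void cpy_large(const float * src, float * dst, int64_t ne) {
    const size_t nbytes = (size_t)ne * sizeof(float);
    // Hypothetical guard mirroring the SIZE_MAX/4 limit discussed above.
    assert(nbytes <= SIZE_MAX / 4);

    const int block = 256;
    // Grid math is also 64-bit; gridDim.x is capped at 2^31 - 1 blocks,
    // which still covers ~2^39 f32 elements at 256 threads per block.
    const int64_t grid = (ne + block - 1) / block;
    cpy_f32_large<<<(unsigned)grid, block>>>(src, dst, ne);
}
```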
## How?

Build with `GGML_CUDA_ALLOW_LARGE_TENSORS` enabled, then run with a large context, for example:
```sh
./build/bin/llama-server \
    --model /data/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-Q4_K_XL/Qwen3-Coder-480B-A35B-Instruct-1M-UD-Q4_K_XL-00001-of-00006.gguf \
    --alias Qwen3-Coder-480B-A35B-Instruct-GGUF:UD-Q4_K_XL \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 400000 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    --seed 3407 \
    --prio 3 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05 \
    --min-p 0.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --port 11434
```
## Why?

Cards with a lot of VRAM, like the RTX PRO 6000 Blackwell, may enable larger in-GPU context lengths than the `INT_MAX` byte limit allows.
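Some back-of-the-envelope arithmetic makes this concrete. A minimal sketch, assuming a hypothetical 8192-wide f16 KV buffer (not the real Qwen3-Coder geometry), showing that a single contiguous cache tensor at a 400k context already blows past `INT_MAX` bytes:

```cpp
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative dimensions only (not the real model geometry):
    // a cache buffer holding n_ctx tokens of n_embd_kv f16 values.
    const int64_t n_ctx     = 400000; // --ctx-size from the command above
    const int64_t n_embd_kv = 8192;   // assumed KV width, for illustration
    const int64_t bytes_f16 = 2;

    const int64_t nbytes = n_ctx * n_embd_kv * bytes_f16; // 6,553,600,000
    std::printf("cache tensor: %lld bytes\n", (long long)nbytes);
    // ~6.1 GiB, well over the ~2 GiB INT_MAX limit.
    std::printf("fits in int?  %s\n", nbytes <= INT_MAX ? "yes" : "no");
    return 0;
}
```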