Just a draft PR for my own personal use. I'm terrible at C++, so no one should trust this. Upstream card is #15049
## What?

This PR adds `GGML_CUDA_ALLOW_LARGE_TENSORS`. When enabled, it allows 64-bit sizes in the CUDA copy routines.

Q. What is the difference between `INT_MAX` and `SIZE_MAX / 4`? How much larger of a tensor will this accommodate?

A. The difference between `INT_MAX` and `SIZE_MAX / 4` is enormous (assuming a 64-bit `size_t`):

- `INT_MAX`: 2,147,483,647 bytes ≈ 2 GiB
- `SIZE_MAX / 4`: 4,611,686,018,427,387,903 bytes ≈ 4,294,967,296 GiB ≈ 4 EiB

That is roughly a factor of 2^31 (about 2.1 billion) more headroom.
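The PR diff is the source of truth; as a rough illustration of the kind of change involved, here is a minimal sketch of a copy kernel indexed with `int64_t` instead of `int`, with a guard against `SIZE_MAX / 4`. The names and structure here are illustrative, not the actual ggml symbols:

```cuda
#include <cassert>
#include <cstdint>

// Illustrative only: ggml's real copy routines differ. The point is that
// the element index and byte size are 64-bit, so a single tensor larger
// than INT_MAX bytes can be copied in one launch.
static __global__ void cpy_f32_large(const float * src, float * dst, int64_t ne) {
    // Widen the per-thread index math to 64 bits before multiplying;
    // blockIdx.x * blockDim.x can overflow a 32-bit int.
    const int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ne) {
        dst[i] = src[i];
    }
}

static void cpy_large(const float * src, float * dst, int64_t ne) {
    const size_t nbytes = (size_t)ne * sizeof(float);
    // Hypothetical guard mirroring the SIZE_MAX/4 limit discussed above.
    assert(nbytes <= SIZE_MAX / 4);

    const int block = 256;
    // Grid math is also 64-bit; gridDim.x is capped at 2^31 - 1 blocks,
    // which still covers ~2^39 f32 elements at 256 threads per block.
    const int64_t grid = (ne + block - 1) / block;
    cpy_f32_large<<<(unsigned)grid, block>>>(src, dst, ne);
}
```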
## How?

Build with `GGML_CUDA_ALLOW_LARGE_TENSORS` enabled, then run with a large context, for example:
```sh
./build/bin/llama-server \
    --model /data/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-Q4_K_XL/Qwen3-Coder-480B-A35B-Instruct-1M-UD-Q4_K_XL-00001-of-00006.gguf \
    --alias Qwen3-Coder-480B-A35B-Instruct-GGUF:UD-Q4_K_XL \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 400000 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    --seed 3407 \
    --prio 3 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05 \
    --min-p 0.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --port 11434
```
## Why?

Cards with a lot of VRAM, like the RTX PRO 6000 Blackwell, may enable larger in-GPU context lengths than the `INT_MAX` byte limit allows.
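Some back-of-the-envelope arithmetic makes this concrete. A minimal sketch, assuming a hypothetical 8192-wide f16 KV buffer (not the real Qwen3-Coder geometry), showing that a single contiguous cache tensor at a 400k context already blows past `INT_MAX` bytes:

```cpp
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative dimensions only (not the real model geometry):
    // a cache buffer holding n_ctx tokens of n_embd_kv f16 values.
    const int64_t n_ctx     = 400000; // --ctx-size from the command above
    const int64_t n_embd_kv = 8192;   // assumed KV width, for illustration
    const int64_t bytes_f16 = 2;

    const int64_t nbytes = n_ctx * n_embd_kv * bytes_f16; // 6,553,600,000
    std::printf("cache tensor: %lld bytes\n", (long long)nbytes);
    // ~6.1 GiB, well over the ~2 GiB INT_MAX limit.
    std::printf("fits in int?  %s\n", nbytes <= INT_MAX ? "yes" : "no");
    return 0;
}
```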