Fix 131k context ggml assert #3

Closed
wants to merge 4 commits into from

createthis
Owner

@createthis commented Aug 13, 2025

Just a draft PR for my own personal use. I'm terrible at C++, so no one should trust this. Upstream card is #15049.

What?

This PR adds a GGML_CUDA_ALLOW_LARGE_TENSORS switch. When it is enabled, the CUDA copy routines accept 64-bit sizes instead of being limited to INT_MAX bytes.
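A minimal sketch of how such a switch could gate the size check, assuming GGML_CUDA_ALLOW_LARGE_TENSORS is a plain preprocessor macro tested at compile time (the actual patch may wire it differently). `max_copy_bytes` and `copy_size_ok` are hypothetical names: the default branch keeps the INT_MAX byte limit and the enabled branch uses the SIZE_MAX / 4 limit discussed in the Q&A.

```cpp
#include <climits>   // INT_MAX
#include <cstddef>   // std::size_t
#include <cstdint>   // SIZE_MAX

// Assumption: GGML_CUDA_ALLOW_LARGE_TENSORS is a plain preprocessor macro,
// e.g. passed as -DGGML_CUDA_ALLOW_LARGE_TENSORS. Helper names are hypothetical.
constexpr std::size_t max_copy_bytes() {
#ifdef GGML_CUDA_ALLOW_LARGE_TENSORS
    return SIZE_MAX / 4;                       // 64-bit sizes accepted
#else
    return static_cast<std::size_t>(INT_MAX);  // size must fit in a signed 32-bit int
#endif
}

// Stand-in for an assert of the form "nbytes <= limit" in the copy path.
constexpr bool copy_size_ok(std::size_t nbytes) {
    return nbytes <= max_copy_bytes();
}
```

Built without the macro, `copy_size_ok(INT_MAX)` is true and `copy_size_ok(std::size_t(INT_MAX) + 1)` is false; with the macro defined, both are true on a 64-bit build.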

Q. What is the difference between INT_MAX and SIZE_MAX / 4? How much larger a tensor will this accommodate?

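A. The difference between INT_MAX and SIZE_MAX / 4 is enormous. On a 64-bit build the new limit is roughly 2.1 billion times the old one:

INT_MAX: 2,147,483,647 bytes ≈ 2 GiB (≈ 2.15 GB)
SIZE_MAX/4: 4,611,686,018,427,387,903 bytes ≈ 4,294,967,296 GiB ≈ 4 EiB (≈ 4.6 EB)

In practice, the ceiling on tensor size then becomes available VRAM rather than the integer type.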
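The figures in this answer can be checked at compile time. This sketch assumes a 64-bit size_t, where SIZE_MAX / 4 is 2^62 - 1; because (2^31 - 1)(2^31 + 1) = 2^62 - 1, the new limit is exactly 2,147,483,649 times INT_MAX. The constant and function names are illustrative, not taken from the PR.

```cpp
#include <climits>   // INT_MAX
#include <cstdint>   // SIZE_MAX, UINT64_MAX, std::uint64_t

// Assumption: 64-bit size_t (typical for CUDA builds).
static_assert(SIZE_MAX == UINT64_MAX, "sketch assumes a 64-bit size_t");

// Illustrative names for the two limits.
constexpr std::uint64_t kOldLimit = static_cast<std::uint64_t>(INT_MAX);  // 2^31 - 1
constexpr std::uint64_t kNewLimit = SIZE_MAX / 4;                         // 2^62 - 1

// Convert a byte count to binary units (GiB = 2^30 bytes, EiB = 2^60 bytes).
constexpr double to_gib(std::uint64_t bytes) { return static_cast<double>(bytes) / (1ULL << 30); }
constexpr double to_eib(std::uint64_t bytes) { return static_cast<double>(bytes) / (1ULL << 60); }

// Exact integer ratio: (2^62 - 1) / (2^31 - 1) == 2^31 + 1.
constexpr std::uint64_t kRatio = kNewLimit / kOldLimit;
```

`to_gib(kOldLimit)` is just under 2.0, and `to_eib(kNewLimit)` evaluates to 4.0 (the double nearest to 2^62 - 1 is 2^62), i.e. about 4 EiB.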

How?

```bash
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release
```

Then:

```bash
./build/bin/llama-server \
    --model /data/Qwen3-Coder-480B-A35B-Instruct-1M-GGUF/UD-Q4_K_XL/Qwen3-Coder-480B-A35B-Instruct-1M-UD-Q4_K_XL-00001-of-00006.gguf \
    --alias Qwen3-Coder-480B-A35B-Instruct-GGUF:UD-Q4_K_XL \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 400000 \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --cache-type-k q4_1 \
    --cache-type-v q4_1 \
    --seed 3407 \
    --prio 3 \
    --temp 0.7 \
    --top-p 0.8 \
    --top-k 20 \
    --repeat-penalty 1.05 \
    --min-p 0.0 \
    --log-colors \
    --flash-attn \
    --host 0.0.0.0 \
    --jinja \
    --port 11434
```
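For sizing the quantized cache selected with `--cache-type-k q4_1` and `--cache-type-v q4_1`, note that ggml's Q4_1 format stores 32 values in a 20-byte block (an fp16 scale, an fp16 minimum, and 16 bytes of packed 4-bit values). The sketch below computes the bytes of one per-layer cache tensor; the width of 1024 values per token is an assumption for illustration, not a number taken from this PR or the model.

```cpp
#include <cstdint>  // std::uint64_t

// ggml Q4_1 block: fp16 scale + fp16 min + 16 bytes of packed 4-bit values = 20 bytes per 32 values.
constexpr std::uint64_t kQ4_1BlockElems = 32;
constexpr std::uint64_t kQ4_1BlockBytes = 20;

// Bytes of a Q4_1 tensor with n_ctx rows of n_embd values each.
// Returns 0 when n_embd is not a multiple of the block size, since Q4_1 rows must be.
constexpr std::uint64_t q4_1_cache_bytes(std::uint64_t n_ctx, std::uint64_t n_embd) {
    return (n_embd % kQ4_1BlockElems != 0)
               ? 0
               : n_ctx * (n_embd / kQ4_1BlockElems) * kQ4_1BlockBytes;
}

// Illustrative: --ctx-size 400000 with an ASSUMED width of 1024 values per token.
constexpr std::uint64_t kExampleLayerBytes = q4_1_cache_bytes(400000, 1024);  // 256,000,000 bytes
```

Under that assumption a single layer's K (or V) tensor is about 244 MiB, well below INT_MAX.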

Why?

Cards with a lot of VRAM, such as the RTX PRO 6000 Blackwell, may let us keep longer contexts on the GPU than an INT_MAX byte limit on tensor size allows.
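A worked example of how a long context can reach the INT_MAX limit: the byte size of a contiguous, non-quantized 2-D tensor is ne0 × ne1 × element size. A hypothetical F32 tensor with one dimension equal to a 131,072-token context and the other equal to the 4096 micro-batch set by `-ub 4096` comes to exactly 2^31 bytes, one byte past INT_MAX, while the same shape in F16 fits. The tensor and helper names are illustrative; which tensor actually triggers the assert is not stated in this PR.

```cpp
#include <climits>  // INT_MAX
#include <cstdint>  // std::uint64_t

// Byte size of a contiguous, non-quantized 2-D tensor (hypothetical helper).
constexpr std::uint64_t nbytes_2d(std::uint64_t ne0, std::uint64_t ne1, std::uint64_t elem_size) {
    return ne0 * ne1 * elem_size;
}

// True if a byte count still fits under the old signed 32-bit limit.
constexpr bool fits_int_max(std::uint64_t nbytes) {
    return nbytes <= static_cast<std::uint64_t>(INT_MAX);
}

// Hypothetical shape: 131072 (context) x 4096 (ubatch); not claimed to be the failing tensor.
constexpr std::uint64_t kF32Bytes = nbytes_2d(131072, 4096, 4);  // 2^31 = 2,147,483,648
constexpr std::uint64_t kF16Bytes = nbytes_2d(131072, 4096, 2);  // 2^30 = 1,073,741,824
```

The F32 case fails `fits_int_max` by a single byte, which is why a 64-bit size path matters once contexts reach the 131k range.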

@createthis self-assigned this Aug 13, 2025
… CUDA large tensor support

This change was written by gpt-oss-120b-mxfp4.
@createthis
Owner Author

Closing in favor of ggml-org#15298

@createthis closed this Aug 13, 2025