@dranger003
Contributor

This replaces PR #7033 as a result of merging PR #6511.

Closes #7030 and #7040.

@github-actions
Contributor

github-actions bot commented May 3, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 536 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8756.89ms p(95)=21734.99ms fails=, finish reason: stop=469 truncated=67
  • Prompt processing (pp): avg=103.54tk/s p(95)=469.12tk/s
  • Token generation (tg): avg=32.32tk/s p(95)=46.55tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=bpe-pretok-command-r-2 commit=f5806b2d09ba2dcf60d8d66046ed5853234f28de

[Charts omitted: time series of llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing for bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 536 iterations.]

@slaren
Member

slaren commented May 3, 2024

This PR and #7041 use different regexes. Which one is correct?

@dranger003
Contributor Author

> also has 'Digits' and individual_digits=True, so making an assumption there now.

@slaren There is mention of an assumption about digits, which I haven't included but I can include if needed. The regex in this PR has been tested with test-tokenizer-0 which I presume does not cover all scenarios?

@araleza

araleza commented May 3, 2024

Hi, does this mean that Command-R was always running at reduced quality, and we just didn't know until recently? Or have the recent Llama 3 changes to the llama.cpp tokenizer resulted in this update being needed to get it back to where it was before the Llama 3 changes went in?

@eskeletor97

> There is mention of an assumption about digits, which I haven't included but I can include if needed. The regex in this PR has been tested with test-tokenizer-0 which I presume does not cover all scenarios?

I haven't really tested command-r before with any math or numbers, but isn't it a similar issue to llama3 where digits were grouped and tokenized incorrectly?
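
For readers following the digits question, here is a minimal Python sketch of what splitting digits individually looks like during pre-tokenization. It assumes that `individual_digits=True` means each digit becomes its own piece. The helper name `split_individual_digits` and the use of `\d` are illustrative assumptions; this is not the regex from this PR, nor llama.cpp's implementation.

```python
import re


def split_individual_digits(text: str) -> list[str]:
    """Hypothetical helper: split text so every digit is its own piece."""
    # \d matches exactly one digit; [^\d]+ matches a maximal run of non-digits.
    return re.findall(r"\d|[^\d]+", text)


if __name__ == "__main__":
    # With grouping, "2024" could stay together as one piece;
    # with individual digits it is split into four pieces.
    print(split_individual_digits("call 2024 now"))
```

If digits were grouped instead, a number such as `2024` could reach the BPE merge step as one chunk, which would produce different token IDs than the ones the model was trained on.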

@ggerganov
Member

I had to update to new transformers:

diff --git a/requirements/requirements-convert.txt b/requirements/requirements-convert.txt
index a3d6ecec..5520ba73 100644
--- a/requirements/requirements-convert.txt
+++ b/requirements/requirements-convert.txt
@@ -1,5 +1,5 @@
 numpy~=1.24.4
 sentencepiece~=0.1.98
-transformers>=4.35.2,<5.0.0
+transformers>=4.40.1,<5.0.0
 gguf>=0.1.0
 protobuf>=4.21.0,<5.0.0

Otherwise, I get this error:

python3 convert-hf-to-gguf-update.py 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
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/Users/ggerganov/development/github/llama.cpp/convert-hf-to-gguf-update.py", line 135, in <module>
    tokenizer = AutoTokenizer.from_pretrained(f"models/tokenizers/{name}")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 784, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class CohereTokenizer does not exist or is not currently imported.
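
A small pre-flight check can surface this before the conversion script fails. The sketch below only compares the installed `transformers` version against the 4.40.1 minimum from the requirements change above. The helper names `parse_version` and `meets_minimum` are hypothetical, and the version parsing is deliberately simplistic (it keeps only the leading numeric part of the first three components).

```python
from importlib import metadata

# Minimum taken from the requirements change in this thread.
MIN_TRANSFORMERS = (4, 40, 1)


def parse_version(version: str) -> tuple[int, ...]:
    """Hypothetical helper: turn '4.41.0.dev0' into (4, 41, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '4.40' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str) -> bool:
    """Return True if the installed version is at least MIN_TRANSFORMERS."""
    return parse_version(installed) >= MIN_TRANSFORMERS


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "ok" if meets_minimum(installed) else "too old for this conversion"
        print(f"transformers {installed}: {status}")
```

This checks only the lower bound; the `<5.0.0` upper bound in the requirements file is left to pip.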

@ggerganov
Member

Let's rebase on the latest master, and I will run some extra tests to check whether the regexes are correct.

@dranger003 dranger003 force-pushed the bpe-pretok-command-r-2 branch from 7bfc01b to d5d6731 Compare May 4, 2024 10:42
@dranger003
Contributor Author

@ggerganov Thanks, the PR has been rebased and I added the transformers change.

@ggerganov ggerganov merged commit 889bdd7 into ggml-org:master May 5, 2024
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
* Add BPE pre-tokenization for Command-R/R+.

* Bump transformers convert requirement.

* command-r : add individual digits regex

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@maziyarpanahi maziyarpanahi mentioned this pull request Oct 24, 2024
@dranger003 dranger003 deleted the bpe-pretok-command-r-2 branch January 3, 2025 13:46
Development

Successfully merging this pull request may close these issues.

Command-R GGUF conversion no longer working

5 participants