server: benchmark: chat/completions scenario and other llm servers comparison #5941
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Changes from 2 of 15 commits.

Commits:
- `68d1d8f` server: bench: Init a bench scenario with K6 (phymbert)
- `0b822b6` server: bench: EOL EOF (phymbert)
- `548bc96` server: bench: PR feedback and improved k6 script configuration (phymbert)
- `ab0a59d` server: bench: remove llamacpp_completions_tokens_seconds as it inclu… (phymbert)
- `f425240` server: bench: fix doc (phymbert)
- `bed1cdd` server: bench: change gauge custom metrics to trend (phymbert)
- `572758a` server: bench: change gauge custom metrics to trend (phymbert)
- `06e225f` server: bench: doc add an option to debug http request (phymbert)
- `a4b0d10` server: bench: filter dataset too short and too long sequences (phymbert)
- `29c635b` server: bench: allow to filter out conversation in the dataset based … (phymbert)
- `ba7114c` server: bench: fix assistant message sent instead of user message (phymbert)
- `c4d1b5a` server: bench: fix assistant message sent instead of user message (phymbert)
- `5d25f74` Merge branch 'master' into hp/server/bench/init (ggerganov)
- `52c76d5` server : add defrag thold parameter (ggerganov)
- `6bfb80e` server: bench: select prompts based on the current iteration id not r… (phymbert)
@@ -0,0 +1,64 @@

### Server benchmark tools

The benchmark uses [k6](https://k6.io/).

#### Install k6 - Ubuntu

```shell
snap install k6
```

#### Downloading the ShareGPT dataset

```shell
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```

#### Download a model

Example for Phi-2:

```shell
../../../scripts/hf.sh --repo ggml-org/models --file phi-2/ggml-model-q4_0.gguf
```

#### Start the server

The server must listen on `localhost:8080`.

Example:
```shell
server --host localhost --port 8080 \
    --model ggml-model-q4_0.gguf \
    --cont-batching \
    --metrics \
    --parallel 8 \
    --batch-size 512 \
    --ctx-size 4096 \
    --log-format text \
    -ngl 33
```

Note that `--parallel 8` matches the peak of 8 virtual users in the default `script.js` scenario.

#### Run the bench

```shell
k6 run script.js
```

To inspect the raw HTTP requests and responses during a run, k6 can also be started with its `--http-debug` option.

#### Change the number of concurrent users

In `script.js`, adjust the ramping stages according to the number of server slots, as in the sketch below.
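
For example, a hypothetical configuration for a server started with `--parallel 16` (the 16-VU target is an assumption for illustration, not a value from this PR) would raise the stage targets accordingly:

```js
// Sketch: ramp up to 16 virtual users so all 16 server slots stay busy,
// hold the load, then ramp down.
export const options = {
    scenarios: {
        completions: {
            executor: 'ramping-vus',
            startVUs: 1,
            stages: [
                {duration: '1m', target: 16}, // ramp up
                {duration: '3m', target: 16}, // steady load
                {duration: '1m', target: 0},  // ramp down
            ],
            gracefulRampDown: '30s',
        },
    },
};
```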

#### Metrics

The following metrics are available:
- `llamacpp_prompt_tokens` Gauge of OAI response `usage.prompt_tokens`
- `llamacpp_prompt_tokens_total_counter` Counter of OAI response `usage.prompt_tokens`
- `llamacpp_completion_tokens` Gauge of OAI response `usage.completion_tokens`
- `llamacpp_completion_tokens_total_counter` Counter of OAI response `usage.completion_tokens`
- `llamacpp_completions_tokens_seconds` Gauge of `usage.completion_tokens` divided by the request duration in seconds
- `llamacpp_completions_truncated_rate` Rate of truncated completions, i.e. `finish_reason === 'length'`
- `llamacpp_completions_stop_rate` Rate of completions that stopped naturally, i.e. `finish_reason === 'stop'`

The script will fail if too many completions are truncated; see `llamacpp_completions_truncated_rate` and the threshold excerpt below.
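
For reference, the abort condition is expressed as a k6 threshold in `script.js`; this excerpt repeats the values from the full script shown further below:

```js
// Excerpt from script.js: abort the whole test if more than 10% of
// completions are truncated, evaluated after a 1-minute delay.
export const options = {
    thresholds: {
        llamacpp_completions_truncated_rate: [
            { threshold: 'rate < 0.1', abortOnFail: true, delayAbortEval: '1m' },
        ],
    },
};
```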

k6 metrics can be compared against the [server metrics](../README.md) with:

```shell
curl http://localhost:8080/metrics
```

@@ -0,0 +1,84 @@
```js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { SharedArray } from 'k6/data';
import { Counter, Gauge, Rate } from 'k6/metrics';

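// A SharedArray keeps a single read-only copy of the dataset in memory,
// shared across all virtual users, instead of one copy per VU.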
const data = new SharedArray('conversations', function () {
    return JSON.parse(open('./ShareGPT_V3_unfiltered_cleaned_split.json'))
        // Filter out the conversations with less than 2 turns.
        .filter(data => data["conversations"].length >= 2)
        // Only keep the first two turns of each conversation.
        .map(data => Array(data["conversations"][0]["value"], data["conversations"][1]["value"]));
});

const llamacpp_prompt_tokens = new Gauge('llamacpp_prompt_tokens');
const llamacpp_completion_tokens = new Gauge('llamacpp_completion_tokens');

const llamacpp_completions_tokens_seconds = new Gauge('llamacpp_completions_tokens_seconds');

const llamacpp_prompt_tokens_total_counter = new Counter('llamacpp_prompt_tokens_total_counter');
const llamacpp_completion_tokens_total_counter = new Counter('llamacpp_completion_tokens_total_counter');

const llamacpp_completions_truncated_rate = new Rate('llamacpp_completions_truncated_rate');
const llamacpp_completions_stop_rate = new Rate('llamacpp_completions_stop_rate');

export const options = {
    thresholds: {
        llamacpp_completions_truncated_rate: [
            // more than 10% of truncated completions will abort the test
            { threshold: 'rate < 0.1', abortOnFail: true, delayAbortEval: '1m' },
        ],
    },
    scenarios: {
        completions: {
            executor: 'ramping-vus',
            startVUs: 1,
            stages: [
                {duration: '1m', target: 8},
                {duration: '3m', target: 8},
                {duration: '1m', target: 0},
            ],
            gracefulRampDown: '30s',
        },
    },
};

export default function () {
    const conversation = data[0]
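    // Note: this revision always benchmarks the first conversation in the
    // dataset; commit 6bfb80e later in this PR selects the prompt from the
    // current iteration id instead.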

    const payload = {
        "messages": [
            {
                "role": "system",
                "content": conversation[0],
            },
            {
                "role": "user",
                "content": conversation[1],
            }
        ],
        "model": "model",
        "stream": false,
    }
    let res = http.post('http://localhost:8080/v1/chat/completions', JSON.stringify(payload), {
        headers: { 'Content-Type': 'application/json' },
    })

    check(res, {'success completion': (r) => r.status === 200})

    const completions = res.json()

    llamacpp_prompt_tokens.add(completions.usage.prompt_tokens)
    llamacpp_prompt_tokens_total_counter.add(completions.usage.prompt_tokens)

    llamacpp_completion_tokens.add(completions.usage.completion_tokens)
    llamacpp_completion_tokens_total_counter.add(completions.usage.completion_tokens)

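    // res.timings.duration is in milliseconds, so multiply by 1e3
    // to express completion tokens per second.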
    llamacpp_completions_tokens_seconds.add(completions.usage.completion_tokens / res.timings.duration * 1e3)

    llamacpp_completions_truncated_rate.add(completions.choices[0].finish_reason === 'length')
    llamacpp_completions_stop_rate.add(completions.choices[0].finish_reason === 'stop')

    sleep(0.3)
}
```
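
For completeness, a minimal sketch of the prompt-selection fix referenced in commit 6bfb80e ("select prompts based on the current iteration id"), using k6's execution API; the exact code in the later commit may differ:

```js
import exec from 'k6/execution';

// Inside the default function, pick a conversation deterministically from
// the current iteration id instead of always benchmarking data[0]:
export default function () {
    const conversation = data[exec.scenario.iterationInTest % data.length];
    // ... build the payload and post it as in script.js above ...
}
```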