Awarno/haproxy #241 (Draft)
AWarno wants to merge 27 commits into main from awarno/haproxy
Commits
- Add total stats; add reasoning-token stats when provided ("reasoning_tokens" in usage, under completion_tokens_details / output_tokens_details; see https://platform.openai.com/docs/guides/reasoning); make stats cache-resistant, i.e. omit stats when the response is served from cache.
- Add checkbox.
- Drop the `nv-eval`/`nv_eval` alias usage; this unblocks using the new Eval Factory containers in the launcher, which no longer ship the alias.
- Migrate the README content into the docs and add a minimal toctree to the home index page so the Sphinx site produces a sidebar. The sidebar will later break out into sections such as About, Get Started, etc., and more sections/cards will be added, so the page becomes a proper docs site home page rather than a direct copy of the README.
- Docs update.
Enabling Multi-Instance Deployment with HAProxy
Why HAProxy
HAProxy is a lightweight, reliable, and widely used load balancer. Because it balances plain HTTP traffic, the same setup works regardless of the inference server behind it (vLLM, SGLang, etc.). Using an external load balancer is officially recommended in the vLLM documentation (see vLLM Data Parallel Deployment); the documentation shows an NGINX example, but HAProxy works the same way.
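For illustration, a minimal haproxy.cfg sketch might look like the following. This is not the config shipped in this PR; the two instances, their ports, and the server names are assumptions made for the example (vLLM does expose a GET /health endpoint):

```
# Minimal sketch: round-robin over two hypothetical vLLM instances
# assumed to listen on localhost ports 8001 and 8002.
defaults
    mode http
    timeout connect 5s
    timeout client  300s   # long generations need generous timeouts
    timeout server  300s

frontend llm_in
    bind *:8000            # clients talk to the balancer, not the servers
    default_backend llm_servers

backend llm_servers
    balance roundrobin
    option httpchk GET /health   # only route to instances that pass the check
    server vllm1 127.0.0.1:8001 check
    server vllm2 127.0.0.1:8002 check
```

Clients then send their OpenAI-compatible requests to port 8000, and HAProxy spreads them across whichever instances pass the health check; adding another instance is one more `server` line.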
Alternative Solutions
Ray
Ray is useful for multi-node deployments where a model is too large for a single node. It can also handle multi-instance setups, but it requires knowing how to launch and manage each server type individually (vLLM and SGLang may take different CLI arguments for this), so it does not generalize as well as an external load balancer. We may still want to provide an example of using it for multi-node deployment of large models.
LiteLLM
Offers backend orchestration but is generally overkill for simple load balancing. The project evolves quickly, which may affect stability.
NGINX
NGINX is very similar to HAProxy for this use case and is the balancer shown in the vLLM documentation (vLLM Data Parallel Deployment). In my experience, however, HAProxy is slightly simpler and more pleasant to use in practice.
Literature
TODO
Next Steps
examples/