@AWarno AWarno commented Sep 29, 2025

Enabling Multi-Instance Deployment with HAProxy

Why HAProxy

HAProxy is a lightweight, reliable, and widely used load balancer that generalizes well to all server types. The vLLM documentation officially recommends using an external load balancer (see vLLM Data Parallel Deployment); its example uses NGINX, but HAProxy should work similarly.
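
To illustrate the idea, here is a minimal sketch of rendering an HAProxy configuration that round-robins requests across several server instances. It is not the template added in this PR: the function name `render_haproxy_cfg`, the frontend/backend names, the ports, the timeouts, and the `/health` check path are all assumptions chosen for illustration, and a real deployment would likely need more settings.

```python
from typing import List, Tuple


def render_haproxy_cfg(backends: List[Tuple[str, int]], listen_port: int = 8000) -> str:
    """Render a minimal haproxy.cfg that round-robins HTTP requests.

    `backends` is a list of (host, port) pairs, one per server instance.
    All names and values below are illustrative assumptions.
    """
    if not backends:
        raise ValueError("at least one backend is required")

    lines = [
        "defaults",
        "    mode http",
        "    timeout connect 10s",
        # Generous timeouts, since long generations can keep a request open.
        "    timeout client 1h",
        "    timeout server 1h",
        "",
        "frontend llm_frontend",
        f"    bind *:{listen_port}",
        "    default_backend llm_backends",
        "",
        "backend llm_backends",
        "    balance roundrobin",
        # Assumed health endpoint; adjust to whatever the server exposes.
        "    option httpchk GET /health",
    ]
    # One `server` line per instance; `check` enables health checking.
    for i, (host, port) in enumerate(backends):
        lines.append(f"    server instance{i} {host}:{port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_haproxy_cfg([("127.0.0.1", 8001), ("127.0.0.1", 8002)]))
```

Because the load balancer only needs a host and port per instance, the same sketch applies regardless of which server type is running behind it.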

Alternative Solutions

  • Ray
    Useful for multi-node deployments when a model is too large for a single node. It can also be used for multi-instance setups, but that requires knowing how to launch and manage each server type individually (for example, vLLM and SGLang may use different CLI arguments for this), so it does not generalize as well as an external load balancer. However, we may want to provide an example of using it for multi-node deployment of large models.

  • LiteLLM
    Offers backend orchestration but is generally overkill for simple load balancing. The project evolves quickly, which may affect stability.

  • NGINX
    Very similar to HAProxy for this use case and officially recommended in the vLLM documentation (vLLM Data Parallel Deployment).
    In my experience, however, HAProxy is slightly simpler and nicer to use in practice.

Literature

TODO

  • Run on longer tasks to validate stability and performance (only ifeval has been checked so far).
  • Check whether the HAProxy template is correctly included in the pip wheel (consider renaming it).
  • Documentation
  • Fix the dataclass in types.

Next Steps

  • Add a multi-node deployment example using Ray server. This will likely just require creating one example configuration file under examples/.

@AWarno AWarno requested review from a team and agronskiy as code owners September 29, 2025 11:45

copy-pr-bot bot commented Sep 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ko3n1g and others added 24 commits September 29, 2025 13:56
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
1. Add total stats.
2. Add reasoning token stats (if provided), read from "reasoning_tokens"
in usage or in its completion_tokens_details / output_tokens_details
(see https://platform.openai.com/docs/guides/reasoning).
3. Make stats cache-resistant: do not include stats if the response is
from cache.
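
A rough sketch of what the three points in this commit message could look like in code. This is not the code from the commit: the function names, the `from_cache` flag, and the response layout are assumptions for illustration. Only the `reasoning_tokens`, `completion_tokens_details`, and `output_tokens_details` field names come from the message itself.

```python
from typing import Any, Dict, Iterable


def reasoning_tokens(usage: Dict[str, Any]) -> int:
    """Return the reasoning token count from a usage dict, or 0 if absent."""
    if "reasoning_tokens" in usage:
        return int(usage["reasoning_tokens"])
    # Otherwise look in the nested detail objects named in the commit message.
    for key in ("completion_tokens_details", "output_tokens_details"):
        details = usage.get(key) or {}
        if "reasoning_tokens" in details:
            return int(details["reasoning_tokens"])
    return 0


def total_stats(responses: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Sum token stats over responses, skipping those served from cache."""
    totals = {"responses": 0, "completion_tokens": 0, "reasoning_tokens": 0}
    for resp in responses:
        # `from_cache` is a hypothetical marker for cached responses.
        if resp.get("from_cache"):
            continue
        usage = resp.get("usage") or {}
        totals["responses"] += 1
        # Accept either naming of the generated-token count.
        totals["completion_tokens"] += int(
            usage.get("completion_tokens", usage.get("output_tokens", 0))
        )
        totals["reasoning_tokens"] += reasoning_tokens(usage)
    return totals
```

Skipping cached responses before touching the counters keeps the totals independent of how often an evaluation is rerun against a warm cache.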

---------

Signed-off-by: Anna Warno <[email protected]>
checkbox added

Signed-off-by: AWarno <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
This unblocks using the new Eval Factory containers in the launcher,
which no longer have the `nv-eval`/`nv_eval` alias.

Signed-off-by: Piotr Januszewski <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Wojciech Prazuch <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
This is a very basic migration of the readme content + adding a minimal
toctree to the home index page so that the sphinx site produces a
sidebar. The sidebar will mature and break out in the future into
sections such as About, Get Started, etc.

We will also add more sections/cards to this page after all other basic
edits have been checked in, so it won't be a direct copy of the README,
instead it will become a proper docs site home page.

---------

Signed-off-by: Lawrence Lane <[email protected]>
Signed-off-by: L.B. <[email protected]>
Co-authored-by: jgerh <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Wojciech Prazuch <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Docs update

---------

Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Wojciech Prazuch <[email protected]>
Signed-off-by: AWarno <[email protected]>
Co-authored-by: Oliver Koenig <[email protected]>
Co-authored-by: Alexey Gronskiy <[email protected]>
Co-authored-by: Wojciech Prazuch <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Wojciech Prazuch <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
Signed-off-by: Marta Stepniewska-Dziubinska <[email protected]>
Signed-off-by: Anna Warno <[email protected]>
@AWarno AWarno marked this pull request as draft September 29, 2025 13:19
@AWarno AWarno marked this pull request as ready for review October 1, 2025 12:12

AWarno commented Oct 1, 2025

/ok to test 06e6a85

@AWarno AWarno marked this pull request as draft October 2, 2025 17:21
7 participants