A lightweight load balancer and reverse proxy for OpenAI-compatible API servers, well-suited for experimentation. It monitors multiple backend LLM servers, performs model-name-based routing, and selects per-model instances (model, model-2, ...). It supports sticky sessions to reuse prompt caches by routing repeat requests to the same instance, enforces per-server concurrency limits, exposes its own load based on GPU utilization, and ships with a simple monitoring UI.
- Model-name routing: Distributes requests to server groups using regex patterns defined in `server-list.json`.
- Per-instance selection: Prefers available instances among `model`, `model-2`, `model-3`, ...
- Sticky sessions: Keeps routing pinned per client (IP or username in the system message) × model for a limited time (default 3 minutes).
- Concurrency limits: Selects backends so that no server exceeds its `request-max`.
- Health monitoring + UI: Polls each backend's `/llmhealth` every second; view status at `/llmhealth-monitor`.
- GPU utilization on Windows/NVML: Measures local GPU load via `win32pdh` or `pynvml`.
- OpenAI-compatible proxy: Routes `/v1/chat/completions` by model; all other requests are proxied to the fallback server.
- Tested on Windows.
- Not tested on Linux (GPU metrics collection may not work).
- Implemented with Flask; not intended for high-volume production traffic.
Run this single command to download and execute the setup script automatically. No prerequisites are needed, not even Git or Python; portable versions of everything are downloaded and set up in `./workspace`:
curl -L -o "run-llama-balancer.bat" "https://raw.githubusercontent.com/issixx/llama-balancer/main/run-llama-balancer.bat" && call run-llama-balancer.batgit clone https://github.com/issixx/llama-balancer.git
cd llama-balancer
pip install -r requirements.txt
python llama-balancer-server.py- The default port is
18000. - Please create a server-list.json in this directory, using the example below as a reference.
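Once the balancer is running, clients point at it instead of an individual backend. Below is a minimal client sketch; the host, port, model name, and the way a username is embedded in the system message are illustrative assumptions (port 18000 is the default, and the model name matches the example config in the next section):

```python
# Minimal client sketch: send an OpenAI-compatible chat completion through the balancer.
import requests

resp = requests.post(
    "http://localhost:18000/v1/chat/completions",
    json={
        "model": "gpt-oss:20b-64k",  # matched against the regex patterns in server-list.json
        "messages": [
            # A username in the system message can serve as the sticky-session identity;
            # the exact format the balancer parses is not specified here.
            {"role": "system", "content": "username: alice"},
            {"role": "user", "content": "Hello!"},
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```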
Define backend servers and model routing rules in `server-list.json`.
```json
{
  "servers": {
    "PC1": { "addr": "http://192.168.1.20", "health-port": 18000, "model-port": 8081, "request-max": 1 },
    "PC2": { "addr": "http://192.168.1.21", "health-port": 18000, "model-port": 8081 }
  },
  "models": {
    "gpt-oss:20b-64k.*": [
      "PC1",
      "PC2"
    ],
    "gpt-oss:20b-128k.*": [
      "PC1"
    ]
  },
  "fallback_server": "PC2"
}
```

- servers: For each server, specify `addr` (base URL including scheme), `health-port` (health endpoint), `model-port` (model API), and optional `request-max` (max concurrent in-flight requests).
- models: Regex pattern → list of eligible server names. Evaluated in order. If all attempts fail, the first server is used.
- fallback_server: Server name to use when no pattern matches.
You can override the config file path via the `SERVER_LIST_JSON` environment variable (default: `server-list.json`).
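The sketch below illustrates how a model name resolves to its eligible servers under these rules (ordered regex evaluation with `fullmatch`, falling back to `fallback_server`). It is an illustration of the behavior, not the balancer's actual code:

```python
# Illustrative sketch: load server-list.json (honoring the SERVER_LIST_JSON
# override) and resolve a model name to its eligible servers.
import json
import os
import re

config_path = os.environ.get("SERVER_LIST_JSON", "server-list.json")
with open(config_path, "r", encoding="utf-8") as f:
    config = json.load(f)

def eligible_servers(model_name: str) -> list[str]:
    # Patterns are evaluated in order; the first fullmatch wins.
    for pattern, servers in config["models"].items():
        if re.fullmatch(pattern, model_name):
            return servers
    # No pattern matched: use the fallback server.
    return [config["fallback_server"]]

print(eligible_servers("gpt-oss:20b-64k"))  # ["PC1", "PC2"] with the example config above
print(eligible_servers("unknown-model"))    # ["PC2"] (fallback_server)
```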
- OpenAI-compatible API endpoints
- Use this in place of llama-server or llama-swap
- `GET /llmhealth` - Returns the balancer’s own health (idle/busy based on local GPU utilization).
- `GET /llmhealth-snapshot` - Returns a JSON snapshot of recent backend states, in-flight counts, sticky entries, etc. (see the polling sketch after this list).
- `GET /llmhealth-monitor` - Minimal dashboard viewable in a browser.
- `GET /v1/models` - Returns a merged list of models across all backends (excludes hyphen-numbered variants like `-2`, `-3`).
- `/*` (everything else) - Reverse proxy. Removes hop-by-hop headers; request/response bodies are largely passed through.
  - Only for `/v1/chat/completions`, the model name may be rewritten when selecting an instance.
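For scripting or debugging, the snapshot endpoint can be polled directly. A minimal sketch, assuming the balancer runs on `localhost:18000`; the snapshot's exact field names are not documented here, so the JSON is simply pretty-printed:

```python
# Poll the balancer's snapshot endpoint and pretty-print whatever it returns.
import json
import time

import requests

while True:
    snapshot = requests.get("http://localhost:18000/llmhealth-snapshot", timeout=5).json()
    print(json.dumps(snapshot, indent=2, ensure_ascii=False))
    time.sleep(5)
```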
- Health monitoring: Polls each backend at `addr:health-port/llmhealth` every second. Uses a conservative 5-second sliding window to judge state (idle/busy/invalid).
- GPU load threshold: The balancer is considered busy if the maximum GPU utilization over the last 5 seconds is ≥ 50%.
- Sticky sessions: Keyed by client identifier (IP or username in the system message) × model. Default TTL is 3 minutes (see the sketch after this list).
- Concurrency: When `request-max` is set, new requests are avoided once the total in-flight count across all models on that server reaches the limit.
- Instance selection: Prefer available instances among `model`, `model-2`, ... If none are free, prefer backends that are currently `idle`.
- Per-model rules: Regex patterns in `models` are evaluated with `fullmatch`.
- Monitor request status at `/llmhealth-monitor` (auto-refresh every 5 seconds).
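As a reading aid, the sticky-session and instance-selection rules can be modeled with a small sketch. This is a simplified illustration of the behavior described above, not the balancer's actual data structures; names such as `pick_server` and `pick_instance` are made up for this example:

```python
# Simplified model of the routing rules described above (illustrative only).
import time

STICKY_TTL = 180.0  # seconds; default sticky-session lifetime is 3 minutes

sticky: dict[tuple[str, str], str] = {}           # (client_id, model) -> pinned server
sticky_expiry: dict[tuple[str, str], float] = {}  # (client_id, model) -> expiry time

def pick_server(client_id: str, model: str, candidates: list[str]) -> str:
    """Reuse the pinned server while the sticky entry is fresh, else pin a new one."""
    key = (client_id, model)
    if key in sticky and time.time() < sticky_expiry[key]:
        return sticky[key]
    # The real balancer prefers servers with a free instance, skips servers that
    # are at their request-max, and prefers backends whose recent health is idle;
    # here we simply take the first candidate.
    chosen = candidates[0]
    sticky[key] = chosen
    sticky_expiry[key] = time.time() + STICKY_TTL
    return chosen

def pick_instance(model: str, in_flight: dict[str, int], num_instances: int) -> str:
    """Prefer a free instance among model, model-2, model-3, ..."""
    names = [model] + [f"{model}-{i}" for i in range(2, num_instances + 1)]
    for name in names:
        if in_flight.get(name, 0) == 0:
            return name
    return names[0]  # none free: fall back to the base model name
```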
- This project is intended for evaluation within a private network.
- Not intended for large groups or exposure to public networks.
- Contributions (issues, PRs, improvement proposals) are welcome. Please follow the standard GitHub flow.
- Bug fixes, optimizations, and sharing benchmark results are also welcome.
MIT - See LICENSE.
- Q: Are endpoints other than `/v1/chat/completions` routed per model?
  - A: No. Model-aware routing/optimization applies only to `POST /v1/chat/completions`. Other endpoints are proxied to the fallback server.
- Q: Does it run on non-Windows platforms?
  - A: Likely, but we have not tested on Linux. If `win32pdh` is unavailable, `pynvml` is used; if neither is available, GPU utilization is treated as 0%.
- Q: What are the timeouts?
  - A: 5s connect (2s read for health checks). For upstream proxying, the default connect timeout is 300s.
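For reference, the sketch below expresses these values as `requests`-style timeouts; the addresses are examples taken from the config section above, and this is not the balancer's own code:

```python
# Sketch of the timeout values described in the FAQ, using requests-style
# (connect, read) tuples. Addresses are examples from the config section.
import requests

# Health check: 5s to connect, 2s to read.
requests.get("http://192.168.1.20:18000/llmhealth", timeout=(5, 2))

# Upstream proxying: 300s default, long enough for slow generations.
requests.post(
    "http://192.168.1.20:8081/v1/chat/completions",
    json={"model": "gpt-oss:20b-64k", "messages": [{"role": "user", "content": "ping"}]},
    timeout=300,
)
```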
