Model swapping for llama.cpp (or any local OpenAI API-compatible server)
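The core trick behind a swapping proxy like this is small: inspect the `model` field of each OpenAI-style request, start the matching llama.cpp instance (stopping the previous one to free VRAM), then forward the request. Below is a minimal Go sketch of that idea; the model names, file paths, single upstream port, and sleep-based readiness wait are illustrative assumptions, not llama-swap's actual configuration or code.

```go
// Minimal sketch of the swapping idea: read the "model" field from an
// OpenAI-style request, lazily (re)start the matching llama.cpp server,
// then reverse-proxy the request. Model names, paths, the single upstream
// port, and the sleep-based readiness wait are illustrative assumptions.
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os/exec"
	"sync"
	"time"
)

// models maps a model name to the command that serves it (assumed paths).
var models = map[string][]string{
	"llama3":  {"llama-server", "-m", "/models/llama3.gguf", "--port", "9001"},
	"qwen2.5": {"llama-server", "-m", "/models/qwen2.5.gguf", "--port", "9001"},
}

var (
	mu      sync.Mutex
	current string    // model currently loaded
	proc    *exec.Cmd // running llama.cpp process
)

// ensure starts the requested model's server, swapping out the old one.
func ensure(model string) error {
	mu.Lock()
	defer mu.Unlock()
	if model == current {
		return nil
	}
	if proc != nil {
		proc.Process.Kill() // stop the previous model to free VRAM
		proc.Wait()
	}
	cmd := exec.Command(models[model][0], models[model][1:]...)
	if err := cmd.Start(); err != nil {
		return err
	}
	proc, current = cmd, model
	time.Sleep(3 * time.Second) // crude; real proxies poll /health instead
	return nil
}

func main() {
	backend, _ := url.Parse("http://127.0.0.1:9001")
	rp := httputil.NewSingleHostReverseProxy(backend)

	http.HandleFunc("/v1/", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		var req struct {
			Model string `json:"model"`
		}
		json.Unmarshal(body, &req)
		if _, ok := models[req.Model]; !ok {
			http.Error(w, "unknown model", http.StatusBadRequest)
			return
		}
		if err := ensure(req.Model); err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		r.Body = io.NopCloser(bytes.NewReader(body)) // restore body for proxying
		rp.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

A production proxy would poll the backend's health endpoint instead of sleeping, and could keep several models resident when VRAM allows.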
Intelligent Mixture-of-Models Router for Efficient LLM Inference
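A mixture-of-models router of this kind sits in front of several backends and picks a model per request, trading answer quality against cost. The gate below is a deliberately crude heuristic based on prompt length and a keyword; the model names and URLs are invented for illustration and are not any particular router's policy.

```go
// Minimal sketch of mixture-of-models routing: a gate picks a backend per
// request, sending easy prompts to a small, cheap model and harder ones to
// a large one. The heuristic, model names, and URLs are illustrative
// assumptions, not any particular router's policy.
package main

import (
	"fmt"
	"strings"
)

type route struct{ model, baseURL string }

// gate is a stand-in for a learned or heuristic router.
func gate(prompt string) route {
	words := len(strings.Fields(prompt))
	hard := words > 40 || strings.Contains(strings.ToLower(prompt), "refactor")
	if hard {
		return route{"large-70b", "http://large:8000/v1"}
	}
	return route{"small-7b", "http://small:8000/v1"}
}

func main() {
	for _, p := range []string{
		"What is the capital of France?",
		"Refactor this service to use connection pooling.",
	} {
		r := gate(p)
		fmt.Printf("%q -> %s via %s\n", p, r.model, r.baseURL)
	}
}
```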
🔒 Enterprise-grade API gateway that helps you monitor and impose cost or rate limits per API key. Get fine-grained access control and monitoring per user, application, or environment. Supports OpenAI, Azure OpenAI, Anthropic, vLLM, and open-source LLMs.
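The per-key rate limiting such a gateway provides can be sketched as ordinary HTTP middleware. The Go sketch below keeps one token-bucket limiter per API key using golang.org/x/time/rate; the limits, the Authorization-header keying, and the in-memory store are assumptions for illustration, and a real gateway would add cost accounting, persistence, and per-user/application/environment policies.

```go
// Minimal sketch of per-API-key rate limiting, the core mechanism such a
// gateway is built around. Limits, header keying, and the in-memory store
// are illustrative assumptions.
package main

import (
	"log"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

// limiterFor returns the token-bucket limiter for a key, creating it on first use.
func limiterFor(key string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[key]
	if !ok {
		l = rate.NewLimiter(rate.Limit(5), 10) // 5 req/s, burst 10, per key
		limiters[key] = l
	}
	return l
}

// limitByKey rejects requests whose API key has exhausted its budget.
func limitByKey(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		key := r.Header.Get("Authorization")
		if key == "" {
			http.Error(w, "missing API key", http.StatusUnauthorized)
			return
		}
		if !limiterFor(key).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	upstream := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n")) // stand-in for the proxied LLM backend
	})
	log.Fatal(http.ListenAndServe(":8080", limitByKey(upstream)))
}
```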
AI Inference Operator for Kubernetes. The easiest way to serve ML models in production. Supports VLMs, LLMs, embeddings, and speech-to-text.
☸️ Easy-to-use, advanced inference platform for large language models on Kubernetes.
Lightweight & fast AI inference proxy for self-hosted LLM backends like Ollama, LM Studio, and others. Designed for speed, simplicity, and local-first deployments.
Extensible generative AI platform on Kubernetes with OpenAI-compatible APIs.
Arks is a cloud-native inference framework running on Kubernetes.
Carbon Limiting Auto Tuning for Kubernetes
Unified management and routing for llama.cpp, MLX, and vLLM models, with a web dashboard.
🚀🛸 Easily boost the speed of pulling your models and datasets from various inference runtimes (e.g., 🤗 HuggingFace, 🐫 Ollama, vLLM, and more).
Call many AIs from a single API.
Production-ready AI for Kubernetes. Run cutting-edge LLMs on NVIDIA GPUs with vLLM. Use Ollama for embeddings and vision. Access securely through OpenWebUI. Scalable, high-performance, and fully self-hosted.
A sample architecture that mimics MoE (Mixture of Experts) using Go.
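The MoE-mimicking pattern generally has three parts: a gate that scores candidate experts for a prompt, concurrent fan-out to the top-scoring ones, and a merge of their answers. Here is a small self-contained Go sketch of that shape; the experts are plain functions standing in for model backends, and the scoring and merging rules are invented for illustration rather than taken from the repo.

```go
// Minimal sketch of the MoE-mimicking pattern: a gate scores "experts" for
// a prompt, the top-K run concurrently, and their answers are merged.
// Experts here are plain functions standing in for model backends; the
// scoring and merging rules are invented for illustration.
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

type expert struct {
	name  string
	score func(prompt string) float64 // gating score
	run   func(prompt string) string  // the "model" call
}

// answer gates, fans out to the top-K experts concurrently, and merges.
func answer(prompt string, experts []expert, topK int) []string {
	sort.Slice(experts, func(i, j int) bool {
		return experts[i].score(prompt) > experts[j].score(prompt)
	})
	if topK > len(experts) {
		topK = len(experts)
	}
	out := make([]string, topK)
	var wg sync.WaitGroup
	for i, e := range experts[:topK] {
		wg.Add(1)
		go func(i int, e expert) {
			defer wg.Done()
			out[i] = fmt.Sprintf("[%s] %s", e.name, e.run(prompt))
		}(i, e)
	}
	wg.Wait()
	return out
}

func main() {
	experts := []expert{
		{
			"code",
			func(p string) float64 {
				if strings.Contains(p, "func") {
					return 1.0
				}
				return 0.1
			},
			func(p string) string { return "looks like Go code" },
		},
		{
			"chat",
			func(p string) float64 { return 0.5 },
			func(p string) string { return "general answer" },
		},
	}
	for _, part := range answer("explain this func", experts, 2) {
		fmt.Println(part)
	}
}
```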
nfrx is an inference exchange gateway.