SpaceServe is a vLLM‑based inference runtime for multimodal (especially vision–language) models. It decouples the vision encoder from LLM decoding and runs them in parallel, improving throughput and concurrency while reducing end‑to‑end latency. The system keeps an OpenAI‑compatible API for seamless integration.
- Parallel encoder–decoder execution: encoder and decoder run as cooperating processes with batched, pipelined computation.
- Encoder‑aware scheduling: decoding advances only when the required encoder features are ready, avoiding wasted work and smoothing latency.
- Lightweight encoder cache: an inter‑process cache for encoder outputs, released automatically once the decoder has consumed them (see the sketch below).
- OpenAI‑compatible serving: keep your existing client and tooling; streaming and tensor parallelism remain supported by vLLM.
- Optional GPU SM partitioning (experimental): reduce interference between encoder and decoder kernels on the same GPU.
   
  
*High‑level dataflow and scheduling in SpaceServe.*
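A purely illustrative Python sketch of the scheduling and cache behaviour described above: the decoder pulls encoder features only once they are ready, and each cache entry is released as soon as it has been consumed. All names here (`EncoderCache`, `put`, `consume`) are hypothetical and do not reflect SpaceServe's actual inter‑process implementation.

```python
# Illustrative sketch only -- NOT SpaceServe's real code or API.
import threading
from dataclasses import dataclass, field


@dataclass
class EncoderCache:
    """Hypothetical cache: holds encoder outputs until the decoder consumes them."""
    _entries: dict = field(default_factory=dict)
    _ready: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def put(self, request_id: str, features) -> None:
        """Encoder side: publish features for a request and mark them ready."""
        with self._lock:
            self._entries[request_id] = features
            self._ready.setdefault(request_id, threading.Event()).set()

    def consume(self, request_id: str, timeout: float | None = None):
        """Decoder side: wait until features are ready, take them, and free the entry."""
        with self._lock:
            event = self._ready.setdefault(request_id, threading.Event())
        if not event.wait(timeout):
            # Encoder-aware scheduling: if features are not ready yet, the scheduler
            # can skip this request instead of stalling the whole decode batch.
            return None
        with self._lock:
            features = self._entries.pop(request_id)
            del self._ready[request_id]
        return features
```

In SpaceServe itself this role is played by the inter‑process cache shared between the encoder and decoder processes, as described in the feature list above.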
## Prerequisites
- Python 3.10 or 3.11
- NVIDIA GPU with CUDA recommended
- PyTorch 2.5.1 (pinned in this repo)
## Install (CUDA)

```bash
pip install -r requirements-cuda.txt
pip install -e .
```

CPU or alternative devices: see the corresponding `requirements-*.txt` files.
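After installation, a quick sanity check of the environment (illustrative snippet; the expected values mirror the prerequisites above):

```python
# Illustrative post-install check against the prerequisites listed above.
import sys

import torch
import vllm  # installed by `pip install -e .` above

assert sys.version_info[:2] in {(3, 10), (3, 11)}, "expected Python 3.10 or 3.11"
print("PyTorch:", torch.__version__)              # pinned to 2.5.1 in this repo
print("CUDA available:", torch.cuda.is_available())
print("vLLM:", vllm.__version__)
```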
Enable the vLLM V1 inner loop (required):

```bash
export VLLM_USE_V1=1
```

Start the OpenAI‑compatible server (example with Qwen2.5‑VL‑3B):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-3B-Instruct \
  --trust-remote-code \
  --max-num-batched-tokens 2048 \
  --max-num-seqs 256
```

Quick health check:
```bash
curl http://127.0.0.1:8000/v1/models
```
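Any OpenAI‑compatible client also works against this endpoint. For example, a minimal streaming request with the `openai` Python package (the model name matches the server command above; the image URL is a placeholder):

```python
# Minimal sketch using the standard `openai` client against the local server.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    }],
    stream=True,  # streaming is supported by the OpenAI-compatible server
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```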
Optional local benchmark clients:

```bash
bash ./client_qwen2dot5vl_3b.sh
```

- Default throughput knobs: `--max-num-batched-tokens` and `--max-num-seqs`.
- Multimodal preprocessing and per‑prompt limits are configured via standard vLLM flags (e.g., `--limit-mm-per-prompt`, `--mm-processor-kwargs`); see the Python sketch after this list.
- SM partitioning is experimental and requires a local `libsmctrl` installation.
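For reference, a hedged sketch of the same knobs through vLLM's offline Python API (the CLI flags above map to these engine arguments; the values and the `max_pixels` processor kwarg are illustrative, not recommendations):

```python
# Illustrative only: the server flags above map to these engine arguments.
import os

os.environ["VLLM_USE_V1"] = "1"  # required by this repo (see above); set before importing vllm

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    trust_remote_code=True,
    max_num_batched_tokens=2048,          # --max-num-batched-tokens
    max_num_seqs=256,                     # --max-num-seqs
    limit_mm_per_prompt={"image": 1},     # --limit-mm-per-prompt
    mm_processor_kwargs={"max_pixels": 1280 * 28 * 28},  # --mm-processor-kwargs (Qwen2.5-VL example)
)
```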