Commits (88)
f36c4f9
Remove guardrails that prevent V1 from trying to run embedding models
maxdebayser Mar 24, 2025
acf4638
hack v1 flash_attn to support encoder_only
maxdebayser Apr 3, 2025
b13bbc0
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 3, 2025
8debea0
Revert changes to disable kv caching for encoder-only models
maxdebayser Apr 3, 2025
8d97b9c
Add pooling support in v1
maxdebayser Apr 5, 2025
d60b22b
First end-to-end working version of Bert embeddings in V1
maxdebayser Apr 7, 2025
6bebbb8
Support warmup for pooling models in V1
maxdebayser Apr 7, 2025
6dafd71
address review comments
maxdebayser Apr 7, 2025
e2724a2
address review comments
maxdebayser Apr 7, 2025
56ff6cd
remove debug prints
maxdebayser Apr 7, 2025
fc57edd
address review comments
maxdebayser Apr 7, 2025
64a0e62
Fix cross encoder models in V1 and enable tests for pooling models
maxdebayser Apr 8, 2025
4014d41
address review comments
maxdebayser Apr 8, 2025
87a95a8
Merge branch 'main' into v1_embeddings
maxdebayser Apr 8, 2025
902c129
address review comments
maxdebayser Apr 8, 2025
2c68855
re-enable large embedding models
maxdebayser Apr 8, 2025
8afd8f5
address review comments
maxdebayser Apr 8, 2025
7762976
Merge branch 'main' into v1_embeddings
maxdebayser Apr 8, 2025
d7537ae
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 8, 2025
a9e7747
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 9, 2025
17520bd
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 14, 2025
90c611a
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 15, 2025
dec2441
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 17, 2025
a5e83f4
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 23, 2025
187f69b
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 24, 2025
69a0332
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 29, 2025
a9f1721
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 29, 2025
4b066a3
fix merge problems
maxdebayser Apr 30, 2025
43a26dc
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 30, 2025
ca34513
Merge branch 'upstream_main' into v1_embeddings
maxdebayser Apr 30, 2025
bf3033d
Fix missing qwen embedding model param
maxdebayser Apr 30, 2025
67bf727
Make pooling params reach the pooling in V1
maxdebayser May 1, 2025
93b6361
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 1, 2025
d916b88
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 10, 2025
bad4211
fix merge problems
maxdebayser May 10, 2025
35d9bd9
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 11, 2025
dcc6100
Merge branch 'upstream_main' into v1_embeddings
maxdebayser May 12, 2025
6d56271
PoC of a separated Pooling Model Runner
maxdebayser May 12, 2025
140583f
fix small problems
maxdebayser May 12, 2025
f6531b9
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser May 12, 2025
804646d
fix merge problems
maxdebayser May 12, 2025
c89fefc
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser May 13, 2025
481b1c9
fix typings.List
maxdebayser May 13, 2025
df36c50
refactor a few hacks
maxdebayser May 13, 2025
7fa642e
small fixes
maxdebayser May 13, 2025
29fb611
Disable chunked prefill for pooling models in V1
maxdebayser May 13, 2025
63c20b0
Remove KV cache for pooling model runner
maxdebayser May 14, 2025
07e8e9a
Remove cascade attention from pooling runner as it only applies to de…
maxdebayser May 14, 2025
afdb8f9
Remove handling of running requests in pooling model runner
maxdebayser May 14, 2025
b9f366f
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser May 15, 2025
a0ec55d
Allow the gpu_model_runner to run with cuda graphs without crashing
maxdebayser May 19, 2025
108cd7c
Revert "Remove handling of running requests in pooling model runner"
maxdebayser May 20, 2025
fa394f8
Revert "Remove KV cache for pooling model runner"
maxdebayser May 20, 2025
afc0454
Add support for prefix caching and chunked prefill
maxdebayser May 20, 2025
32c6eeb
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser May 21, 2025
cb58510
appease linter
maxdebayser May 21, 2025
c8b7e72
Refactor input batch
maxdebayser May 21, 2025
339ff9a
The python version in CI doesn't support @dataclass(kw_only=True) yet
maxdebayser May 21, 2025
5d72489
fix broken import
maxdebayser May 21, 2025
1fce390
fix small errors
maxdebayser May 21, 2025
a96115e
fix small errors
maxdebayser May 21, 2025
5c050bb
fix silly bug
maxdebayser May 21, 2025
b7cd175
Refactor gpu model runner
maxdebayser May 21, 2025
842d8fd
fix small mistake
maxdebayser May 22, 2025
2954b22
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser May 27, 2025
ed05f96
disable cuda graphs for pooling
maxdebayser May 27, 2025
c42ec28
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser May 28, 2025
403a143
revert debugging change
maxdebayser May 28, 2025
f07ff33
First pass on review comments
maxdebayser May 28, 2025
b2ba922
rename pooling input batch
maxdebayser May 28, 2025
f0a180f
remove duplicated class
maxdebayser May 28, 2025
f001ed9
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser May 28, 2025
df87da3
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser May 28, 2025
ba72032
fix encoding test and activate v0 and v1
maxdebayser May 28, 2025
eb73c02
fix ordering of operations
maxdebayser May 28, 2025
d917aaf
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser May 29, 2025
d1c740d
trigger ci
maxdebayser May 29, 2025
868059e
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser Jun 1, 2025
e3f4bf5
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser Jun 2, 2025
4f64ee2
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser Jun 3, 2025
77f7056
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser Jun 3, 2025
364ec25
free blocks for finished pooling requests
maxdebayser Jun 3, 2025
3ae735d
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser Jun 4, 2025
ff796e9
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser Jun 4, 2025
a4181ba
sync prs
maxdebayser Jun 6, 2025
4c8dc44
revert unnecessary change
maxdebayser Jun 6, 2025
f7687e0
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser Jun 6, 2025
6c3b032
Merge branch 'upstream_main' into v1_embeddings_runner
maxdebayser Jun 6, 2025
3 changes: 3 additions & 0 deletions tests/conftest.py
@@ -145,13 +145,16 @@ def run_with_both_engines(request, monkeypatch):
    # Automatically runs tests twice, once with V1 and once without
    use_v1 = request.param
    # Tests decorated with `@skip_v1` are only run without v1,
    # and tests decorated with `@skip_v0` are only run with v1.
    skip_v0 = request.node.get_closest_marker("skip_v0")
    skip_v1 = request.node.get_closest_marker("skip_v1")

    if use_v1:
        if skip_v1:
            pytest.skip("Skipping test on vllm V1")
        monkeypatch.setenv('VLLM_USE_V1', '1')
    else:
        if skip_v0:
            pytest.skip("Skipping test on vllm V0")
        monkeypatch.setenv('VLLM_USE_V1', '0')

    yield
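For orientation, this fixture is what the new pooling tests hook into: an autouse wrapper runs every test in a module once per engine, and the new skip_v0 marker (mirroring the existing skip_v1) opts individual tests out of the V0 run. A minimal usage sketch; the test name and body below are illustrative, not part of this PR:

import pytest


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
    # Runs each test in this module twice: once with VLLM_USE_V1=1,
    # once with VLLM_USE_V1=0.
    pass


@pytest.mark.skip_v0  # hypothetical test that is only meaningful on V1
def test_pooling_only_on_v1():
    assert True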
24 changes: 20 additions & 4 deletions tests/entrypoints/llm/test_encode.py
@@ -8,6 +8,8 @@
from vllm import LLM, PoolingParams, PoolingRequestOutput
from vllm.distributed import cleanup_dist_env_and_memory

from ...models.utils import check_embeddings_close

MODEL_NAME = "intfloat/multilingual-e5-small"

PROMPTS = [
@@ -27,6 +29,14 @@
]


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
    # Simple autouse wrapper to run both engines for each test
    # This can be promoted up to conftest.py to run for every
    # test in a package
    pass


@pytest.fixture(scope="module")
def llm():
    # pytest caches the fixture so we use weakref.proxy to
@@ -46,9 +56,15 @@ def llm():
    cleanup_dist_env_and_memory()


-def assert_outputs_equal(o1: list[PoolingRequestOutput],
+def assert_outputs_match(o1: list[PoolingRequestOutput],
                          o2: list[PoolingRequestOutput]):
-    assert [o.outputs for o in o1] == [o.outputs for o in o2]
+    check_embeddings_close(
+        embeddings_0_lst=[o.outputs.data for o in o1],
+        embeddings_1_lst=[o.outputs.data for o in o2],
+        name_0="hf",
+        name_1="vllm",
+        tol=1e-2,
+    )


@pytest.mark.skip_global_cleanup
@@ -63,7 +79,7 @@ def test_v1_v2_api_consistency_single_prompt_tokens(llm: LLM,

    v2_output = llm.encode({"prompt_token_ids": prompt_token_ids},
                           pooling_params=pooling_params)
-    assert_outputs_equal(v1_output, v2_output)
+    assert_outputs_match(v1_output, v2_output)


@pytest.mark.skip_global_cleanup
@@ -80,7 +96,7 @@ def test_v1_v2_api_consistency_multi_prompt_tokens(llm: LLM):
        } for p in TOKEN_IDS],
        pooling_params=pooling_params,
    )
-    assert_outputs_equal(v1_output, v2_output)
+    assert_outputs_match(v1_output, v2_output)


@pytest.mark.skip_global_cleanup
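The rename from assert_outputs_equal to assert_outputs_match reflects that V0 and V1 are not expected to produce bit-identical pooling outputs, only numerically close ones, so the test now compares embeddings within a tolerance instead of asserting exact equality. A rough sketch of that kind of tolerance check, assuming a cosine-similarity criterion; it illustrates the idea and is not the actual implementation of check_embeddings_close:

import torch


def embeddings_close(a: list[torch.Tensor],
                     b: list[torch.Tensor],
                     tol: float = 1e-2) -> None:
    # Compare per-prompt embeddings by cosine similarity rather than exact
    # equality, so small kernel and scheduling differences still pass.
    for x, y in zip(a, b):
        sim = torch.nn.functional.cosine_similarity(x.double(), y.double(),
                                                    dim=0)
        assert 1.0 - sim.item() <= tol, f"1 - cos = {1.0 - sim.item():.4f}"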
8 changes: 8 additions & 0 deletions tests/entrypoints/openai/test_embedding.py
@@ -21,6 +21,14 @@
DTYPE = "bfloat16"


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
    # Simple autouse wrapper to run both engines for each test
    # This can be promoted up to conftest.py to run for every
    # test in a package
    pass


@pytest.fixture(scope="module")
def server():
    args = [
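These server tests exercise the OpenAI-compatible embeddings route on both engines. For reference, a hedged sketch of the kind of request they issue, using the openai client against a locally started vLLM server; the URL, port and model name are illustrative:

from openai import OpenAI

# Assumes a server was started along the lines of the tests above,
# e.g. with VLLM_USE_V1=1 and an embedding model loaded.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.embeddings.create(
    model="intfloat/multilingual-e5-small",
    input=["vLLM can now serve embedding models on the V1 engine."],
)
print(len(resp.data[0].embedding))  # embedding dimension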
8 changes: 8 additions & 0 deletions tests/entrypoints/openai/test_rerank.py
@@ -12,6 +12,14 @@
DTYPE = "bfloat16"


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
    # Simple autouse wrapper to run both engines for each test
    # This can be promoted up to conftest.py to run for every
    # test in a package
    pass


@pytest.fixture(scope="module")
def server():
args = ["--enforce-eager", "--max-model-len", "100", "--dtype", DTYPE]
9 changes: 9 additions & 0 deletions tests/entrypoints/openai/test_score.py
@@ -11,6 +11,15 @@

from ...utils import RemoteOpenAIServer


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
    # Simple autouse wrapper to run both engines for each test
    # This can be promoted up to conftest.py to run for every
    # test in a package
    pass


MODELS = [
    {
        "name": "BAAI/bge-reranker-v2-m3",
8 changes: 8 additions & 0 deletions tests/models/language/pooling/test_classification.py
@@ -6,6 +6,14 @@

from vllm.platforms import current_platform

# TODO: enable when float32 is supported by V1
# @pytest.fixture(autouse=True)
# def v1(run_with_both_engines):
# # Simple autouse wrapper to run both engines for each test
# # This can be promoted up to conftest.py to run for every
# # test in a package
# pass


@pytest.mark.parametrize(
"model",
18 changes: 16 additions & 2 deletions tests/models/language/pooling/test_embedding.py
@@ -8,6 +8,14 @@
from ...utils import check_embeddings_close


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
    # Simple autouse wrapper to run both engines for each test
    # This can be promoted up to conftest.py to run for every
    # test in a package
    pass


@pytest.mark.parametrize(
"model",
[
Expand All @@ -20,13 +28,19 @@
marks=[pytest.mark.core_model]),
pytest.param("intfloat/e5-mistral-7b-instruct",
marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
pytest.param("ssmits/Qwen2-7B-Instruct-embed-base"),
# the qwen models interfere with each other (see PR
# https://github.com/vllm-project/vllm/pull/18720).
# To avoid this problem, for now we skip v0 since it will be
# deprecated anyway.
pytest.param("ssmits/Qwen2-7B-Instruct-embed-base",
marks=[pytest.mark.skip_v0]),
# [Encoder-only]
pytest.param("BAAI/bge-base-en-v1.5",
marks=[pytest.mark.core_model, pytest.mark.cpu_model]),
pytest.param("sentence-transformers/all-MiniLM-L12-v2"),
pytest.param("intfloat/multilingual-e5-small"),
pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct"),
pytest.param("Alibaba-NLP/gte-Qwen2-1.5B-instruct",
marks=[pytest.mark.skip_v0]),
# [Cross-Encoder]
pytest.param("sentence-transformers/stsb-roberta-base-v2"),
],
8 changes: 8 additions & 0 deletions tests/models/language/pooling/test_jina.py
@@ -36,6 +36,14 @@
]


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
    # Simple autouse wrapper to run both engines for each test
    # This can be promoted up to conftest.py to run for every
    # test in a package
    pass


@pytest.fixture(scope="module", params=SCORING_MODELS)
def model_name(request):
    yield request.param
9 changes: 9 additions & 0 deletions tests/models/language/pooling/test_scoring.py
@@ -23,6 +23,15 @@
"The capital of Germany is Berlin.",
]


@pytest.fixture(autouse=True)
def v1(run_with_both_engines):
    # Simple autouse wrapper to run both engines for each test
    # This can be promoted up to conftest.py to run for every
    # test in a package
    pass


DTYPE = "half"


38 changes: 36 additions & 2 deletions vllm/config.py
@@ -706,6 +706,9 @@ def _init_pooler_config(self) -> Optional["PoolerConfig"]:
            if isinstance(self.override_pooler_config, dict):
                self.override_pooler_config = PoolerConfig(
                    **self.override_pooler_config)
            logger.warning("CUDA graph is not supported for pooling yet, "
                           "fallback to the eager mode.")
            self.enforce_eager = True

            pooler_config = self.override_pooler_config or PoolerConfig()

@@ -4450,14 +4453,45 @@ def __post_init__(self):
"Disabling `torch.compile`.")
self.compilation_config.level = CompilationLevel.NO_COMPILATION

disable_cascade_reasons: list[str] = []

if self.compilation_config.full_cuda_graph and \
not self.model_config.disable_cascade_attn:
logger.warning_once(
disable_cascade_reasons.append(
"full_cuda_graph is not supported with "
"cascade attention. Disabling cascade attention.")
self.model_config.disable_cascade_attn = True
self.cache_config.enable_prefix_caching = False

disable_chunked_prefill_reasons: list[str] = []

if self.model_config and self.model_config.pooler_config:
pooling_type = self.model_config.pooler_config.pooling_type
if pooling_type is None or pooling_type.lower() != "last":
disable_chunked_prefill_reasons.append(
"Only \"last\" pooling supports chunked "
"prefill and prefix caching; disabling both.")

disable_cascade_reasons.append(
"Loaded model for pooling; disabling cascade attention.")

if disable_chunked_prefill_reasons:
for reason in disable_chunked_prefill_reasons:
logger.info(reason)
self.scheduler_config.enable_chunked_prefill = False
self.scheduler_config.chunked_prefill_enabled = False
self.scheduler_config.long_prefill_token_threshold = 0
self.scheduler_config.max_num_batched_tokens = max(
self.scheduler_config.max_model_len,
DEFAULT_MAX_NUM_BATCHED_TOKENS)

if self.cache_config is not None:
self.cache_config.enable_prefix_caching = False

if disable_cascade_reasons:
for reason in disable_cascade_reasons:
logger.info(reason)
self.model_config.disable_cascade_attn = True

if (self.kv_events_config is not None
and self.kv_events_config.enable_kv_cache_events
and not self.cache_config.enable_prefix_caching):
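Together with the enforce_eager fallback added in _init_pooler_config above, loading a pooling model on V1 now transparently turns off CUDA graphs and cascade attention, and, unless the model uses "last" pooling, chunked prefill and prefix caching as well. A hedged usage sketch of what this enables; the model name and printed field are illustrative:

import os

os.environ["VLLM_USE_V1"] = "1"  # opt into the V1 engine explicitly

from vllm import LLM

# Pooling models run in eager mode on V1; the config logic above disables
# the unsupported features and logs each reason at INFO level.
llm = LLM(model="BAAI/bge-base-en-v1.5", task="embed")
outputs = llm.embed(["The capital of France is Paris."])
print(len(outputs[0].outputs.embedding))  # embedding dimension, e.g. 768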
6 changes: 0 additions & 6 deletions vllm/engine/arg_utils.py
@@ -1344,12 +1344,6 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool:
                                recommend_to_remove=False)
            return False

-       # No Embedding Models so far.
-       if model_config.task not in ["generate"]:
-           _raise_or_fallback(feature_name=f"--task {model_config.task}",
-                              recommend_to_remove=False)
-           return False

        # No Mamba or Encoder-Decoder so far.
        if not model_config.is_v1_compatible:
            _raise_or_fallback(feature_name=model_config.architectures,
2 changes: 1 addition & 1 deletion vllm/entrypoints/llm.py
@@ -1230,7 +1230,7 @@ def score(
        # the tokenizer for models such as
        # "cross-encoder/ms-marco-MiniLM-L-6-v2" doesn't support passing
        # lists of tokens to the `text` and `text_pair` kwargs
-       tokenizer = self.llm_engine.get_tokenizer()
+       tokenizer = self.get_tokenizer()

        def ensure_str(prompt: SingletonPrompt):
            if isinstance(prompt, dict):
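The tokenizer is now obtained through the LLM-level get_tokenizer() helper rather than by reaching into llm_engine directly, so the scoring path no longer depends on which engine implementation sits underneath. For reference, a hedged usage sketch of the cross-encoder scoring API this touches; the model choice is illustrative:

from vllm import LLM

# Cross-encoder scoring: each candidate text is scored against the query.
llm = LLM(model="cross-encoder/ms-marco-MiniLM-L-6-v2", task="score")
outputs = llm.score(
    "What is the capital of France?",
    ["Paris is the capital of France.", "The sky is blue."],
)
print([round(o.outputs.score, 3) for o in outputs])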