
Commit 68752d0

Isotr0py authored and albertoperdomo2 committed

[Bugfix] Disable FlexAttention direct block mask building for encoder-only models (vllm-project#27344)

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Alberto Perdomo <[email protected]>

1 parent 6b638d1 commit 68752d0

File tree

1 file changed: +4 −1 lines changed


vllm/v1/attention/backends/flex_attention.py

Lines changed: 4 additions & 1 deletion

@@ -658,7 +658,10 @@ def build(
             total_cache_tokens=total_cache_tokens,
             decode_offset=offset_tensor,
             num_blocks_per_seq=num_blocks_per_seq,
-            direct_build=self.direct_build,
+            # FIXME(Isotr0py): direct build has issue to build bidirectional
+            # attention block mask for encoder-only models, disable it temporarily.
+            # see: https://github.com/vllm-project/vllm/pull/27329#issuecomment-3431484053
+            direct_build=(self.direct_build and common_attn_metadata.causal),
             q_block_size=self.q_block_size,
             kv_block_size=self.kv_block_size,
         )
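The one-line fix above can be read as: keep FlexAttention's fast "direct" block-mask construction only when the attention is causal, and let encoder-only (bidirectional) models fall back to the generic mask build. A minimal illustrative sketch of that gating, plus the two mask patterns involved, is below; `resolve_direct_build`, `causal_mask_mod`, and `bidirectional_mask_mod` are hypothetical names for illustration, not vLLM code, though the `(b, h, q_idx, kv_idx)` signature follows FlexAttention's `mask_mod` convention.

```python
# Illustrative sketch only (not vLLM's actual implementation).

def resolve_direct_build(direct_build_enabled: bool, causal: bool) -> bool:
    # Mirrors the patched expression:
    #   direct_build=(self.direct_build and common_attn_metadata.causal)
    # Direct block-mask building stays on only for causal attention.
    return direct_build_enabled and causal


# FlexAttention-style mask_mod callables take (batch, head, q_idx, kv_idx)
# and return whether that query/key pair may attend.

def causal_mask_mod(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
    # Decoder self-attention: a query attends only to equal or earlier positions.
    return q_idx >= kv_idx


def bidirectional_mask_mod(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
    # Encoder-only self-attention: every position attends to every position.
    # It is this pattern that the direct build mishandled, per the FIXME.
    return True
```

With this gating, decoder workloads keep the fast path (`resolve_direct_build(True, True)` is `True`) while encoder-only workloads disable it (`resolve_direct_build(True, False)` is `False`).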
