Open
Labels: bug (Something isn't working)
Description
Your current environment
The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.7.1+cpu
Is debug build: False
OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14+deb12u1) 12.2.0
Clang version: Could not collect
CMake version: version 3.25.1
Libc version: glibc-2.36
Python version: 3.11.9 (main, Dec 26 2024, 11:08:39) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-5.10.135.bsk.6-amd64-x86_64-with-glibc2.36
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 180
On-line CPU(s) list: 0-179
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8457C
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 45
Socket(s): 2
Stepping: 8
BogoMIPS: 5199.53
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 4.2 MiB (90 instances)
L1i cache: 2.8 MiB (90 instances)
L2 cache: 180 MiB (90 instances)
L3 cache: 195 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-89
NUMA node1 CPU(s): 90-179
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1+cpu
[pip3] torch_npu==2.7.1
[pip3] transformers==4.57.1
[conda] Could not collect
vLLM Version: 0.11.0
vLLM Ascend Version: 0.11.0rc1.dev198+gd913f9474.d20251110 (git sha: d913f9474, date: 20251110)
ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ASCEND_VISIBLE_DEVICES=1,2,12,3,10,5,14,15,4,13,8,7,0,6,9,11
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ASCEND_RUNTIME_OPTIONS=
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ASCEND_PROCESS_LOG_PATH=/var/log/tiger/ascend_diag_logs/run_0/process_log
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/x86_64:/opt/tiger/native_libhdfs/lib/native:/opt/tiger/jdk/jdk8u265-b01/jre/lib/amd64/server:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native/ufs:/opt/tiger/yarn_deploy/hadoop/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lzo/lib:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver::/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/toolbox/latest/Ascend-DMI/lib64
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2 Version: 24.1.rc2 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B2C | OK | 90.4 44 0 / 0 |
| 0 | 0000:75:01.0 | 0 0 / 0 3446 / 65536 |
+===========================+===============+====================================================+
| 1 910B2C | OK | 93.1 45 0 / 0 |
| 0 | 0000:6F:01.0 | 0 0 / 0 3426 / 65536 |
+===========================+===============+====================================================+
| 2 910B2C | OK | 89.1 42 0 / 0 |
| 0 | 0000:71:01.0 | 0 0 / 0 3425 / 65536 |
+===========================+===============+====================================================+
| 3 910B2C | OK | 93.3 45 0 / 0 |
| 0 | 0000:6B:01.0 | 0 0 / 0 3425 / 65536 |
+===========================+===============+====================================================+
| 4 910B2C | OK | 99.3 44 0 / 0 |
| 0 | 0000:69:01.0 | 0 0 / 0 3426 / 65536 |
+===========================+===============+====================================================+
| 5 910B2C | OK | 94.8 46 0 / 0 |
| 0 | 0000:67:01.0 | 0 0 / 0 3425 / 65536 |
+===========================+===============+====================================================+
| 6 910B2C | OK | 99.7 46 0 / 0 |
| 0 | 0000:65:01.0 | 0 0 / 0 3426 / 65536 |
+===========================+===============+====================================================+
| 7 910B2C | OK | 92.3 45 0 / 0 |
| 0 | 0000:73:01.0 | 0 0 / 0 3425 / 65536 |
+===========================+===============+====================================================+
| 8 910B2C | OK | 93.4 45 0 / 0 |
| 0 | 0000:75:02.0 | 0 0 / 0 3437 / 65536 |
+===========================+===============+====================================================+
| 9 910B2C | OK | 94.5 45 0 / 0 |
| 0 | 0000:6F:02.0 | 0 0 / 0 3424 / 65536 |
+===========================+===============+====================================================+
| 10 910B2C | OK | 94.4 43 0 / 0 |
| 0 | 0000:71:02.0 | 0 0 / 0 3425 / 65536 |
+===========================+===============+====================================================+
| 11 910B2C | OK | 94.5 46 0 / 0 |
| 0 | 0000:6B:02.0 | 0 0 / 0 3420 / 65536 |
+===========================+===============+====================================================+
| 12 910B2C | OK | 93.2 45 0 / 0 |
| 0 | 0000:69:02.0 | 0 0 / 0 3424 / 65536 |
+===========================+===============+====================================================+
| 13 910B2C | OK | 97.7 45 0 / 0 |
| 0 | 0000:67:02.0 | 0 0 / 0 3424 / 65536 |
+===========================+===============+====================================================+
| 14 910B2C | OK | 91.4 43 0 / 0 |
| 0 | 0000:65:02.0 | 0 0 / 0 3425 / 65536 |
+===========================+===============+====================================================+
| 15 910B2C | OK | 97.6 45 0 / 0 |
| 0 | 0000:73:02.0 | 0 0 / 0 3423 / 65536 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 0 |
+===========================+===============+====================================================+
| No running processes found in NPU 1 |
+===========================+===============+====================================================+
| No running processes found in NPU 2 |
+===========================+===============+====================================================+
| No running processes found in NPU 3 |
+===========================+===============+====================================================+
| No running processes found in NPU 4 |
+===========================+===============+====================================================+
| No running processes found in NPU 5 |
+===========================+===============+====================================================+
| No running processes found in NPU 6 |
+===========================+===============+====================================================+
| No running processes found in NPU 7 |
+===========================+===============+====================================================+
| No running processes found in NPU 8 |
+===========================+===============+====================================================+
| No running processes found in NPU 9 |
+===========================+===============+====================================================+
| No running processes found in NPU 10 |
+===========================+===============+====================================================+
| No running processes found in NPU 11 |
+===========================+===============+====================================================+
| No running processes found in NPU 12 |
+===========================+===============+====================================================+
| No running processes found in NPU 13 |
+===========================+===============+====================================================+
| No running processes found in NPU 14 |
+===========================+===============+====================================================+
| No running processes found in NPU 15 |
+===========================+===============+====================================================+
CANN:
package_name=Ascend-cann-toolkit
version=8.3.RC1
innerversion=V100R001C23SPC001B235
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=x86_64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.3.RC1/x86_64-linux
🐛 Describe the bug
Error messages from the Decode node:
(Worker_TP1 pid=113078) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-b51d45c6-33bf-4b76-98c5-fcbcc2f72334-0: Mooncake transfer failed, ret: -1
(Worker_TP4 pid=113591) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-b51d45c6-33bf-4b76-98c5-fcbcc2f72334-0
(Worker_TP4 pid=113591) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-b51d45c6-33bf-4b76-98c5-fcbcc2f72334-0: Mooncake transfer failed, ret: -1
E20251110 20:25:57.753341 124570 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:16051, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
E20251110 20:25:57.753441 124565 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:15393, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP0 pid=112945) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
E20251110 20:25:57.753445 124569 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:16791, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
E20251110 20:25:57.753521 124563 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:15961, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP3 pid=113415) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
E20251110 20:25:57.753563 124564 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:16038, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
E20251110 20:25:57.753571 124566 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:15920, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP0 pid=112945) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP3 pid=113415) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP2 pid=113223) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP6 pid=113982) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP6 pid=113982) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP7 pid=114181) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP2 pid=113223) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP5 pid=113752) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP7 pid=114181) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP5 pid=113752) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
E20251110 20:25:57.753978 124567 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:15758, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP1 pid=113078) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
E20251110 20:25:57.754192 124568 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:16119, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP1 pid=113078) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP4 pid=113591) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP4 pid=113591) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(APIServer pid=112643) INFO 11-10 20:26:04 [loggers.py:127] Engine 000: Avg prompt throughput: 2100.0 tokens/s, Avg generation throughput: 146.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.4%, Prefix cache hit rate: 0.0%
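One possible reading of the `SocketHandShakePlugin` errors above is that the peer endpoint is an IPv6 address with the port appended directly (`...:c354:16051`), with no brackets around the address. The sketch below only illustrates why that string is not itself a valid IP literal and how a bracketed form removes the ambiguity. Whether Mooncake parses endpoints this way is an assumption, and `split_endpoint`, `format_endpoint`, and `is_ip_literal` are hypothetical helper names, not part of Mooncake or vLLM.

```python
import ipaddress


def is_ip_literal(text: str) -> bool:
    """Return True if text is a valid IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def split_endpoint(endpoint: str) -> tuple[str, int]:
    """Split 'host:port' or '[ipv6]:port' into (host, port).

    For unbracketed input the last colon is treated as the port separator.
    This is a heuristic: it is ambiguous for compressed IPv6 addresses.
    """
    if endpoint.startswith("["):
        host, sep, port = endpoint[1:].partition("]:")
        if not sep:
            raise ValueError(f"malformed bracketed endpoint: {endpoint!r}")
    else:
        host, sep, port = endpoint.rpartition(":")
        if not sep:
            raise ValueError(f"missing port in endpoint: {endpoint!r}")
    return host, int(port)


def format_endpoint(host: str, port: int) -> str:
    """Render an endpoint, bracketing the host when it is an IPv6 literal."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_v6 = False  # not an IP literal, e.g. a DNS name
    return f"[{host}]:{port}" if is_v6 else f"{host}:{port}"


# Peer string copied from the first SocketHandShakePlugin error above.
peer = "2605:340:cd51:4900:3b92:43bb:ec8:c354:16051"

print(is_ip_literal(peer))          # False: nine colon-separated groups
host, port = split_endpoint(peer)
print(is_ip_literal(host), port)    # True 16051
print(format_endpoint(host, port))  # [2605:340:cd51:4900:3b92:43bb:ec8:c354]:16051
```

If the resolver is handed the full unbracketed string as a host name, a lookup failure would be consistent with the log, although the reported `Resource temporarily unavailable [11]` could also have an unrelated cause. The log's own suggestion to use an IPv4 address avoids the ambiguity entirely.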