[Bug]: PD separation (Mooncake) doesn't take effect because of IPv6 host IP #4103

@mitseng

Your current environment

The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.7.1+cpu
Is debug build: False

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14+deb12u1) 12.2.0
Clang version: Could not collect
CMake version: version 3.25.1
Libc version: glibc-2.36

Python version: 3.11.9 (main, Dec 26 2024, 11:08:39) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-5.10.135.bsk.6-amd64-x86_64-with-glibc2.36

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          180
On-line CPU(s) list:             0-179
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Platinum 8457C
CPU family:                      6
Model:                           143
Thread(s) per core:              2
Core(s) per socket:              45
Socket(s):                       2
Stepping:                        8
BogoMIPS:                        5199.53
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       4.2 MiB (90 instances)
L1i cache:                       2.8 MiB (90 instances)
L2 cache:                        180 MiB (90 instances)
L3 cache:                        195 MiB (2 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-89
NUMA node1 CPU(s):               90-179
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.7.1+cpu
[pip3] torch_npu==2.7.1
[pip3] transformers==4.57.1
[conda] Could not collect
vLLM Version: 0.11.0
vLLM Ascend Version: 0.11.0rc1.dev198+gd913f9474.d20251110 (git sha: d913f9474, date: 20251110)

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=1
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ASCEND_VISIBLE_DEVICES=1,2,12,3,10,5,14,15,4,13,8,7,0,6,9,11
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ASCEND_RUNTIME_OPTIONS=
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ASCEND_PROCESS_LOG_PATH=/var/log/tiger/ascend_diag_logs/run_0/process_log
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/x86_64:/opt/tiger/native_libhdfs/lib/native:/opt/tiger/jdk/jdk8u265-b01/jre/lib/amd64/server:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native/ufs:/opt/tiger/yarn_deploy/hadoop/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lzo/lib:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver::/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/toolbox/latest/Ascend-DMI/lib64
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B2C              | OK            | 90.4        44                0    / 0             |
| 0                         | 0000:75:01.0  | 0           0    / 0          3446 / 65536         |
+===========================+===============+====================================================+
| 1     910B2C              | OK            | 93.1        45                0    / 0             |
| 0                         | 0000:6F:01.0  | 0           0    / 0          3426 / 65536         |
+===========================+===============+====================================================+
| 2     910B2C              | OK            | 89.1        42                0    / 0             |
| 0                         | 0000:71:01.0  | 0           0    / 0          3425 / 65536         |
+===========================+===============+====================================================+
| 3     910B2C              | OK            | 93.3        45                0    / 0             |
| 0                         | 0000:6B:01.0  | 0           0    / 0          3425 / 65536         |
+===========================+===============+====================================================+
| 4     910B2C              | OK            | 99.3        44                0    / 0             |
| 0                         | 0000:69:01.0  | 0           0    / 0          3426 / 65536         |
+===========================+===============+====================================================+
| 5     910B2C              | OK            | 94.8        46                0    / 0             |
| 0                         | 0000:67:01.0  | 0           0    / 0          3425 / 65536         |
+===========================+===============+====================================================+
| 6     910B2C              | OK            | 99.7        46                0    / 0             |
| 0                         | 0000:65:01.0  | 0           0    / 0          3426 / 65536         |
+===========================+===============+====================================================+
| 7     910B2C              | OK            | 92.3        45                0    / 0             |
| 0                         | 0000:73:01.0  | 0           0    / 0          3425 / 65536         |
+===========================+===============+====================================================+
| 8     910B2C              | OK            | 93.4        45                0    / 0             |
| 0                         | 0000:75:02.0  | 0           0    / 0          3437 / 65536         |
+===========================+===============+====================================================+
| 9     910B2C              | OK            | 94.5        45                0    / 0             |
| 0                         | 0000:6F:02.0  | 0           0    / 0          3424 / 65536         |
+===========================+===============+====================================================+
| 10    910B2C              | OK            | 94.4        43                0    / 0             |
| 0                         | 0000:71:02.0  | 0           0    / 0          3425 / 65536         |
+===========================+===============+====================================================+
| 11    910B2C              | OK            | 94.5        46                0    / 0             |
| 0                         | 0000:6B:02.0  | 0           0    / 0          3420 / 65536         |
+===========================+===============+====================================================+
| 12    910B2C              | OK            | 93.2        45                0    / 0             |
| 0                         | 0000:69:02.0  | 0           0    / 0          3424 / 65536         |
+===========================+===============+====================================================+
| 13    910B2C              | OK            | 97.7        45                0    / 0             |
| 0                         | 0000:67:02.0  | 0           0    / 0          3424 / 65536         |
+===========================+===============+====================================================+
| 14    910B2C              | OK            | 91.4        43                0    / 0             |
| 0                         | 0000:65:02.0  | 0           0    / 0          3425 / 65536         |
+===========================+===============+====================================================+
| 15    910B2C              | OK            | 97.6        45                0    / 0             |
| 0                         | 0000:73:02.0  | 0           0    / 0          3423 / 65536         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 8                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 9                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 10                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 11                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 12                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 13                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 14                                                           |
+===========================+===============+====================================================+
| No running processes found in NPU 15                                                           |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.3.RC1
innerversion=V100R001C23SPC001B235
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=x86_64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.3.RC1/x86_64-linux

🐛 Describe the bug

Error messages from the decode side:

(Worker_TP1 pid=113078) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-b51d45c6-33bf-4b76-98c5-fcbcc2f72334-0: Mooncake transfer failed, ret: -1
(Worker_TP4 pid=113591) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-b51d45c6-33bf-4b76-98c5-fcbcc2f72334-0
(Worker_TP4 pid=113591) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-b51d45c6-33bf-4b76-98c5-fcbcc2f72334-0: Mooncake transfer failed, ret: -1
E20251110 20:25:57.753341 124570 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:16051, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
E20251110 20:25:57.753441 124565 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:15393, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP0 pid=112945) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
E20251110 20:25:57.753445 124569 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:16791, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
E20251110 20:25:57.753521 124563 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:15961, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP3 pid=113415) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
E20251110 20:25:57.753563 124564 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:16038, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
E20251110 20:25:57.753571 124566 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:15920, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP0 pid=112945) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP3 pid=113415) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP2 pid=113223) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP6 pid=113982) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP6 pid=113982) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP7 pid=114181) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP2 pid=113223) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP5 pid=113752) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP7 pid=114181) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP5 pid=113752) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
E20251110 20:25:57.753978 124567 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:15758, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP1 pid=113078) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
E20251110 20:25:57.754192 124568 transfer_metadata_plugin.cpp:969] SocketHandShakePlugin: failed to get IP address of peer server 2605:340:cd51:4900:3b92:43bb:ec8:c354:16119, check DNS and /etc/hosts, or use IPv4 address instead: Resource temporarily unavailable [11]
(Worker_TP1 pid=113078) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(Worker_TP4 pid=113591) ERROR 11-10 20:25:57 [mooncake_connector.py:410] Mooncake transfer failed for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0
(Worker_TP4 pid=113591) ERROR 11-10 20:25:57 [mooncake_connector.py:329] Failed to transfer KV cache for request cmpl-5de53eb9-c853-451a-91c5-a4f0e822e5ca-0: Mooncake transfer failed, ret: -1
(APIServer pid=112643) INFO 11-10 20:26:04 [loggers.py:127] Engine 000: Avg prompt throughput: 2100.0 tokens/s, Avg generation throughput: 146.8 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 5.4%, Prefix cache hit rate: 0.0%
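
In the log above, the peer server is printed as `2605:340:cd51:4900:3b92:43bb:ec8:c354:16051`, which looks like an IPv6 literal joined to a port with a plain `:`. In that form the port cannot be told apart from the last address group, which may be why the lookup fails. The following is a minimal Python sketch of the usual bracketed convention (`[addr]:port`). The helper names `format_endpoint` and `split_endpoint` are hypothetical and exist only for illustration; they are not part of vLLM Ascend or Mooncake. Whether Mooncake accepts the bracketed form is also an assumption that this issue does not confirm.

```python
import ipaddress


def format_endpoint(host: str, port: int) -> str:
    """Join host and port, wrapping IPv6 literals in brackets."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_v6 = False  # not an IP literal, e.g. a hostname
    return f"[{host}]:{port}" if is_v6 else f"{host}:{port}"


def split_endpoint(endpoint: str) -> tuple[str, int]:
    """Split 'host:port' or '[v6addr]:port' back into (host, port)."""
    if endpoint.startswith("["):
        # Bracketed IPv6 literal: everything up to ']:' is the address.
        host, _, port = endpoint[1:].partition("]:")
    else:
        # IPv4 literal or hostname: the last ':' separates the port.
        host, _, port = endpoint.rpartition(":")
    return host, int(port)


# Address and port taken from the first SocketHandShakePlugin error line.
endpoint = format_endpoint("2605:340:cd51:4900:3b92:43bb:ec8:c354", 16051)
print(endpoint)
print(split_endpoint(endpoint))
```

With the brackets in place, the address and port can be recovered unambiguously. An IPv4 literal or hostname passes through unchanged as `host:port`.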
