-
Notifications
You must be signed in to change notification settings - Fork 433
feat: add PCIe Relaxed Ordering (RO) support and RDMA traffic class (… #1076
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: add PCIe Relaxed Ordering (RO) support and RDMA traffic class (… #1076
Conversation
Summary of ChangesHello @1998zxn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly optimizes Mooncake's performance, particularly in 2P1D deployment scenarios, by targeting bottlenecks in KV Cache transfers. It achieves this through two key enhancements: enabling PCIe Relaxed Ordering to improve data transfer flexibility and implementing RDMA traffic class control for better queue-level QoS under high-traffic conditions. These changes are designed to collectively reduce Time-To-First-Token (TTFT) latency by optimizing the underlying data transport mechanisms. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces valuable performance optimizations by adding support for PCIe Relaxed Ordering and RDMA traffic class control. The implementation is generally well-done, with good use of logging for better observability. I have identified a few areas for improvement to enhance code robustness and correctness: a potential data race in a multi-threaded context, unsafe parsing of an environment variable, and a minor code style issue for better encapsulation. Addressing these points will make the new features more solid.
| if (access_rights == 0) { | ||
| access_rights = kBaseAccessRights; | ||
| if (MCIbRelaxedOrderingEnabled) { | ||
| access_rights |= IBV_ACCESS_RELAXED_ORDERING; | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one-time initialization of access_rights is not thread-safe. registerLocalMemory can be called concurrently (e.g., from registerLocalMemoryBatch), creating a data race when checking if (access_rights == 0) and then modifying it. This is undefined behavior. Please use std::call_once to ensure this initialization is performed atomically and exactly once across all threads.
| if (access_rights == 0) { | |
| access_rights = kBaseAccessRights; | |
| if (MCIbRelaxedOrderingEnabled) { | |
| access_rights |= IBV_ACCESS_RELAXED_ORDERING; | |
| } | |
| } | |
| static std::once_flag access_rights_flag; | |
| std::call_once(access_rights_flag, [&]() { | |
| access_rights = kBaseAccessRights; | |
| if (MCIbRelaxedOrderingEnabled) { | |
| access_rights |= IBV_ACCESS_RELAXED_ORDERING; | |
| } | |
| }); |
| int val = atoi(traffic_class_env); | ||
| if (val >= 0 && val <= 255) | ||
| config.traffic_class = val; | ||
| else | ||
| LOG(WARNING) << "Ignore value from environment variable " | ||
| "MoonCake_TC_DEFAULT, it should be 0-255"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using atoi for string-to-integer conversion is unsafe as it doesn't handle errors. For instance, if MoonCake_TC_DEFAULT is set to a non-numeric string like "abc", atoi returns 0, which is a valid traffic class, leading to silent misconfiguration. It's better to use std::stoi with a try-catch block for robust error handling, which is also consistent with other parts of this file (e.g., line 191).
try {
int val = std::stoi(traffic_class_env);
if (val >= 0 && val <= 255)
config.traffic_class = val;
else
LOG(WARNING) << "Ignore value from environment variable "
"MoonCake_TC_DEFAULT, it should be 0-255";
} catch (const std::exception &) {
LOG(WARNING) << "Invalid value for environment variable MoonCake_TC_DEFAULT: '"
<< traffic_class_env << "'. It should be an integer between 0-255.";
}| static bool MCIbRelaxedOrderingEnabled = false; | ||
|
|
||
| // Determine whether RELAXED_ORDERING is enabled and possible | ||
| bool has_ibv_reg_mr_iova2(void) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This helper function has_ibv_reg_mr_iova2 is only used within this file. To limit its scope to this translation unit and prevent potential linkage conflicts, it's good practice to declare it as static. This aligns with the declaration of MCIbRelaxedOrderingEnabled.
| bool has_ibv_reg_mr_iova2(void) { | |
| static bool has_ibv_reg_mr_iova2(void) { |
| } | ||
| } | ||
|
|
||
| const char *traffic_class_env = std::getenv("MoonCake_TC_DEFAULT"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MC_IB_TC
Aleda
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
split this commit into two atomic commits.
a0a8ae2 to
923a4b4
Compare
| #include <set> | ||
| #include <thread> | ||
|
|
||
| #include <dlfcn.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure these two files are common-used in all types RDMA lib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dlfcn.h is a POSIX standard header that provides dynamic loading functions (dlopen, dlsym, etc.).
It is part of glibc on Linux and is not tied to any specific RDMA implementation.
Also, the previous #include <infiniband/verbs.h> has been removed in the latest revision.
|
|
||
| // Determine whether RELAXED_ORDERING is enabled and possible | ||
| bool has_ibv_reg_mr_iova2(void) { | ||
| void *handle = dlopen("libibverbs.so", RTLD_NOW); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is too tricky for detecting the status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR introduces Relaxed Ordering (RO) support with two mechanisms:
1. Capability detection
RO is detected by checking for ibv_reg_mr_iova2 in the RDMA dynamic library, following NCCL’s approach. NCCL notes that static-lin can check the function directly, but MoonCake relies on dynamic linking, so the dynamic-link path is sufficient.
2. Explicit control (New)
An environment variable is added to allow users to explicitly enable or disable RO.
This approach is lightweight, reliable for our dynamic-linking model, and consistent with industry practice.
923a4b4 to
c2245f6
Compare
|
@1998zxn Is there any performance result about how both options affect TTFT? |
| } | ||
| } | ||
|
|
||
| const char *traffic_class_env = std::getenv("MC_IB_TC"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the recommended setting for this value?
| return; | ||
| } | ||
|
|
||
| MCIbRelaxedOrderingEnabled = has_ibv_reg_mr_iova2(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use this implementation as a reference?
|
When IBV_ACCESS_RELAXED_ORDERING is set, RDMA write-after-write message order is no longer guaranteed, I'm not sure if it is not impact on us. |
@staryxchen And I’d suggest adding a configuration option that allows us to explicitly enable or disable this relax ordering behavior, or to let it be automatically enabled depending on the environment. |
A configuration option is better, and I prefer to set it to disable by default. So users who are not concerned about these settings will not be affected in any way; performance-focused users can enable them as needed (They are more likely to fully grasp the implications of this feature). |
…TC) control to improve ordering flexibility and queue-level QoS
Description
This PR optimizes Mooncake’s performance in the 2P1D scenario by introducing two main improvements:
These changes effectively reduce KV Cache transfer time, thereby lowering overall TTFT (Time-To-First-Token) latency.
Background
In our deployment scenario using SGLang DeepSeek v3 with 2P1D configuration:
We observed that KV Cache transfers could account for up to 23% of the total TTFT. The reasons are:
This also addresses the issue discussed in #39.
By enabling these two features, we reduced TTFT from 650ms to ~585ms, and KV Cache transfer time dropped to 15% of TTFT, showing significant performance improvements.
Implementation Details
Fully backward compatible; scenarios not using these new features remain unaffected with no regression risk.
Type of Change
Checklist