@1998zxn 1998zxn commented Nov 18, 2025

…TC) control to improve ordering flexibility and queue-level QoS

Description

This PR optimizes Mooncake’s performance in the 2P1D scenario by introducing two main improvements:

  1. Relaxed Ordering (RO) support to improve PCIe out-of-order handling
  2. RDMA queue selection via environment variable to improve queue-level QoS under burst traffic

These changes effectively reduce KV Cache transfer time, thereby lowering overall TTFT (Time-To-First-Token) latency.

Background

In our deployment, which runs DeepSeek v3 on SGLang in a 2P1D configuration (two prefill nodes, one decode node):

  • P nodes use tp8, pp2 parallel strategy
  • D nodes use tp8 parallel strategy

We observed that KV Cache transfers could account for up to 23% of the total TTFT. The reasons are:

  1. Mooncake does not enable Relaxed Ordering by default, reducing PCIe out-of-order packet handling efficiency.
    This also addresses the issue discussed in #39.
  2. In 2P1D burst traffic scenarios, RDMA queue scheduling can cause congestion, affecting transfer performance.

By enabling these two features, we reduced TTFT from 650ms to ~585ms, and KV Cache transfer time dropped to 15% of TTFT, showing significant performance improvements.

Implementation Details

  1. Relaxed Ordering (RO) Support
  • Detects whether the hardware supports RO and enables it
  • Adds logging to indicate whether RO is enabled at runtime, improving observability
  2. RDMA Queue Selection and QoS
  • Adds an environment variable to select RDMA queues
  • Adds logging to show the details of traffic classes, helping verify correct functionality

Fully backward compatible; scenarios not using these new features remain unaffected with no regression risk.

Type of Change

  • Types
    • Bug fix
    • [*] New feature
      • [*] Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

Checklist

  • [*] I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@gemini-code-assist
Contributor

Summary of Changes

Hello @1998zxn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes Mooncake's performance, particularly in 2P1D deployment scenarios, by targeting bottlenecks in KV Cache transfers. It achieves this through two key enhancements: enabling PCIe Relaxed Ordering to improve data transfer flexibility and implementing RDMA traffic class control for better queue-level QoS under high-traffic conditions. These changes are designed to collectively reduce Time-To-First-Token (TTFT) latency by optimizing the underlying data transport mechanisms.

Highlights

  • PCIe Relaxed Ordering (RO) Support: This PR introduces support for PCIe Relaxed Ordering, which dynamically detects hardware capabilities and enables RO to improve PCIe out-of-order packet handling efficiency. This optimization is crucial for reducing KV Cache transfer times.
  • RDMA Traffic Class (TC) Control: A new mechanism has been added to allow users to control RDMA queue selection via the MoonCake_TC_DEFAULT environment variable. This enables queue-level Quality of Service (QoS) to mitigate congestion during burst traffic scenarios, further enhancing transfer performance.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces valuable performance optimizations by adding support for PCIe Relaxed Ordering and RDMA traffic class control. The implementation is generally well-done, with good use of logging for better observability. I have identified a few areas for improvement to enhance code robustness and correctness: a potential data race in a multi-threaded context, unsafe parsing of an environment variable, and a minor code style issue for better encapsulation. Addressing these points will make the new features more solid.

Comment on lines +166 to +177
    if (access_rights == 0) {
        access_rights = kBaseAccessRights;
        if (MCIbRelaxedOrderingEnabled) {
            access_rights |= IBV_ACCESS_RELAXED_ORDERING;
        }
    }
Contributor

high

The one-time initialization of access_rights is not thread-safe. registerLocalMemory can be called concurrently (e.g., from registerLocalMemoryBatch), creating a data race when checking if (access_rights == 0) and then modifying it. This is undefined behavior. Please use std::call_once to ensure this initialization is performed atomically and exactly once across all threads.

Suggested change
-    if (access_rights == 0) {
-        access_rights = kBaseAccessRights;
-        if (MCIbRelaxedOrderingEnabled) {
-            access_rights |= IBV_ACCESS_RELAXED_ORDERING;
-        }
-    }
+    static std::once_flag access_rights_flag;
+    std::call_once(access_rights_flag, [&]() {
+        access_rights = kBaseAccessRights;
+        if (MCIbRelaxedOrderingEnabled) {
+            access_rights |= IBV_ACCESS_RELAXED_ORDERING;
+        }
+    });
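
For comparison with the std::call_once suggestion above, here is a minimal sketch of a different technique with the same once-only guarantee: a function-local static, whose initialization C++11 makes thread-safe. It is not the PR's code. The constants and the names g_relaxed_ordering_enabled, computeAccessRights and accessRights are placeholders invented for illustration; the real values would come from Mooncake's kBaseAccessRights and from <infiniband/verbs.h>.

```cpp
#include <atomic>
#include <cstdint>

// Placeholder values for illustration only; the real constants come from
// Mooncake (kBaseAccessRights) and <infiniband/verbs.h>.
constexpr std::uint32_t kDemoBaseAccessRights = 0x7;
constexpr std::uint32_t kDemoRelaxedOrderingBit = 1u << 20;

// Stand-in for MCIbRelaxedOrderingEnabled; assumed to be set before first use.
static bool g_relaxed_ordering_enabled = true;

// Counts how often the initializer runs (demonstration only).
static std::atomic<int> g_init_runs{0};

static std::uint32_t computeAccessRights() {
    ++g_init_runs;
    std::uint32_t rights = kDemoBaseAccessRights;
    if (g_relaxed_ordering_enabled) rights |= kDemoRelaxedOrderingBit;
    return rights;
}

// Thread-safe lazy initialization: the initializer of a function-local
// static runs exactly once, even when several threads call concurrently.
std::uint32_t accessRights() {
    static const std::uint32_t rights = computeAccessRights();
    return rights;
}
```

Callers would read accessRights() instead of checking a shared variable against 0, so no thread ever observes a half-initialized value.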

Comment on lines 299 to 304
    int val = atoi(traffic_class_env);
    if (val >= 0 && val <= 255)
        config.traffic_class = val;
    else
        LOG(WARNING) << "Ignore value from environment variable "
                        "MoonCake_TC_DEFAULT, it should be 0-255";
Contributor

medium

Using atoi for string-to-integer conversion is unsafe as it doesn't handle errors. For instance, if MoonCake_TC_DEFAULT is set to a non-numeric string like "abc", atoi returns 0, which is a valid traffic class, leading to silent misconfiguration. It's better to use std::stoi with a try-catch block for robust error handling, which is also consistent with other parts of this file (e.g., line 191).

        try {
            int val = std::stoi(traffic_class_env);
            if (val >= 0 && val <= 255)
                config.traffic_class = val;
            else
                LOG(WARNING) << "Ignore value from environment variable "
                                "MoonCake_TC_DEFAULT, it should be 0-255";
        } catch (const std::exception &) {
            LOG(WARNING) << "Invalid value for environment variable MoonCake_TC_DEFAULT: '"
                         << traffic_class_env << "'. It should be an integer between 0-255.";
        }
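
One caveat with the std::stoi version: it stops at the first non-digit, so a value such as "12abc" is accepted as 12 without an exception. The sketch below shows a stricter alternative using std::from_chars (C++17), which rejects anything that is not a complete decimal integer in range. The function name parseTrafficClass is hypothetical and not part of the PR.

```cpp
#include <charconv>
#include <cstring>
#include <optional>
#include <system_error>

// Parses a traffic class from a C string. Returns std::nullopt unless the
// whole string is a decimal integer in [0, 255].
std::optional<int> parseTrafficClass(const char *text) {
    if (text == nullptr || *text == '\0') return std::nullopt;
    const char *end = text + std::strlen(text);
    int value = 0;
    auto [ptr, ec] = std::from_chars(text, end, value);
    // Reject parse errors, overflow, and trailing characters.
    if (ec != std::errc{} || ptr != end) return std::nullopt;
    // Reject values outside the 8-bit traffic class range.
    if (value < 0 || value > 255) return std::nullopt;
    return value;
}
```

A caller could pass the result of std::getenv directly and log a warning whenever std::nullopt comes back, keeping the default traffic class in that case.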

static bool MCIbRelaxedOrderingEnabled = false;

// Determine whether RELAXED_ORDERING is enabled and possible
bool has_ibv_reg_mr_iova2(void) {
Contributor

medium

This helper function has_ibv_reg_mr_iova2 is only used within this file. To limit its scope to this translation unit and prevent potential linkage conflicts, it's good practice to declare it as static. This aligns with the declaration of MCIbRelaxedOrderingEnabled.

Suggested change
- bool has_ibv_reg_mr_iova2(void) {
+ static bool has_ibv_reg_mr_iova2(void) {

}
}

const char *traffic_class_env = std::getenv("MoonCake_TC_DEFAULT");

MC_IB_TC

@Aleda Aleda left a comment

Split this commit into two atomic commits.

@1998zxn 1998zxn force-pushed the feature/relaxed-ordering-and-rdma-tc branch from a0a8ae2 to 923a4b4 Compare November 18, 2025 16:17
#include <set>
#include <thread>

#include <dlfcn.h>
Collaborator

Make sure these two header files are available with all types of RDMA libraries.

Author

dlfcn.h is a POSIX standard header that provides dynamic loading functions (dlopen, dlsym, etc.).
It is part of glibc on Linux and is not tied to any specific RDMA implementation.

Also, the previous #include <infiniband/verbs.h> has been removed in the latest revision.


// Determine whether RELAXED_ORDERING is enabled and possible
bool has_ibv_reg_mr_iova2(void) {
void *handle = dlopen("libibverbs.so", RTLD_NOW);
Collaborator

This is too tricky a way to detect the status.

Author

This PR introduces Relaxed Ordering (RO) support with two mechanisms:

1. Capability detection
RO is detected by checking for ibv_reg_mr_iova2 in the RDMA dynamic library, following NCCL’s approach. NCCL notes that statically linked builds can check for the function directly, but Mooncake relies on dynamic linking, so the dynamic-link path is sufficient.
2. Explicit control (New)
An environment variable is added to allow users to explicitly enable or disable RO.

This approach is lightweight, reliable for our dynamic-linking model, and consistent with industry practice.
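
The capability check described above can be sketched roughly as follows. This is an illustration of the dlopen/dlsym pattern, not the PR's exact code: the helper names librarySymbolAvailable and relaxedOrderingSymbolPresent are hypothetical, and only the library name "libibverbs.so" and the symbol ibv_reg_mr_iova2 are taken from the thread.

```cpp
#include <dlfcn.h>

// Returns true if `symbol` can be resolved in `library`.
// Passing nullptr as `library` searches the running program and the
// libraries it has already loaded.
bool librarySymbolAvailable(const char *library, const char *symbol) {
    void *handle = dlopen(library, RTLD_NOW);
    if (handle == nullptr) return false;  // library missing or not loadable
    dlerror();                            // clear any stale error state
    void *address = dlsym(handle, symbol);
    const bool found = (address != nullptr);
    dlclose(handle);
    return found;
}

// Hypothetical wrapper mirroring the check discussed in this thread.
bool relaxedOrderingSymbolPresent() {
    return librarySymbolAvailable("libibverbs.so", "ibv_reg_mr_iova2");
}
```

Finding the symbol only shows that the installed library exposes the newer registration entry point; it does not by itself prove that the NIC or the platform honors relaxed ordering. Note also that on many distributions the unversioned .so name is installed only by the development package, so a deployment may need to try a versioned name as well.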

@1998zxn 1998zxn force-pushed the feature/relaxed-ordering-and-rdma-tc branch from 923a4b4 to c2245f6 Compare November 18, 2025 17:01
@alogfans
Collaborator

@1998zxn Are there any performance results showing how each of the two options affects TTFT?

}
}

const char *traffic_class_env = std::getenv("MC_IB_TC");
Collaborator

What is the recommended setting for this value?

return;
}

MCIbRelaxedOrderingEnabled = has_ibv_reg_mr_iova2();
Collaborator

Can we use this implementation as a reference?

@staryxchen
Collaborator

When IBV_ACCESS_RELAXED_ORDERING is set, RDMA write-after-write message order is no longer guaranteed. I'm not sure whether this affects us.

Aleda commented Nov 20, 2025

@staryxchen
Based on our observations, many vendors forcibly enable force_relax at the NIC FW level to improve performance. From our current benchmarking results, it appears that this has no negative impact on Mooncake, while the performance gains can be quite significant.
https://docs.nvidia.com/grace-perf-tuning-guide/optimizing-io.html

I’d also suggest adding a configuration option that lets us explicitly enable or disable relaxed ordering, or have it enabled automatically depending on the environment.

@staryxchen
Collaborator

A configuration option is better, and I would prefer it to be disabled by default. That way, users who are not concerned with these settings are not affected in any way, while performance-focused users can enable them as needed (they are more likely to fully grasp the implications of this feature).
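
A disabled-by-default policy of this kind could look roughly like the sketch below. It is an assumption about how such a switch might be wired, not the merged behavior: the function name shouldUseRelaxedOrdering and the convention that only the value "1" opts in are hypothetical, and the thread does not name the environment variable, so the raw value is passed in as a parameter.

```cpp
#include <cstring>

// Hypothetical helper: decides whether to request relaxed ordering.
// `env_value` is the raw value of an opt-in environment variable (nullptr
// if unset); `library_supports_ro` is the result of capability detection.
bool shouldUseRelaxedOrdering(const char *env_value, bool library_supports_ro) {
    // Disabled by default: only an explicit "1" opts in.
    const bool requested =
        env_value != nullptr && std::strcmp(env_value, "1") == 0;
    // Even when requested, keep strict ordering if the capability is missing.
    return requested && library_supports_ro;
}
```

With this shape, users who never set the variable keep strict ordering, and an automatic mode could be added later as a further accepted value without changing the default.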
