feat: add PCIe Relaxed Ordering (RO) support and RDMA traffic class (… #1076

1998zxn · 2025-11-18T11:34:46Z

…TC) control to improve ordering flexibility and queue-level QoS

Description

This PR optimizes Mooncake’s performance in the 2P1D scenario by introducing two main improvements:

- Relaxed Ordering (RO) support to improve PCIe out-of-order handling
- RDMA queue selection via environment variable to improve queue-level QoS under burst traffic

These changes effectively reduce KV Cache transfer time, thereby lowering overall TTFT (Time-To-First-Token) latency.

Background

In our deployment scenario using SGLang DeepSeek v3 with 2P1D configuration:

- P nodes use tp8, pp2 parallel strategy
- D nodes use tp8 parallel strategy

We observed that KV Cache transfers could account for up to 23% of the total TTFT. The reasons are:

- Mooncake does not enable Relaxed Ordering by default, reducing PCIe out-of-order packet handling efficiency. This also addresses the issue discussed in #39.
- In 2P1D burst traffic scenarios, RDMA queue scheduling can cause congestion, affecting transfer performance.

By enabling these two features, we reduced TTFT from 650ms to ~585ms, and KV Cache transfer time dropped to 15% of TTFT, showing significant performance improvements.
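
As a rough consistency check on these figures (a back-of-the-envelope estimate, assuming the 23% share applies to the 650 ms baseline and the 15% share to the ~585 ms result, which the "up to" wording does not strictly guarantee):

```latex
\begin{aligned}
T_{\mathrm{KV,before}} &\approx 0.23 \times 650\,\mathrm{ms} = 149.5\,\mathrm{ms} \\
T_{\mathrm{KV,after}}  &\approx 0.15 \times 585\,\mathrm{ms} = 87.75\,\mathrm{ms} \\
\Delta T_{\mathrm{KV}} &\approx 149.5 - 87.75 = 61.75\,\mathrm{ms} \\
\Delta \mathrm{TTFT}   &= 650 - 585 = 65\,\mathrm{ms} \quad (10\%)
\end{aligned}
```

Under these assumptions, the reduction in KV Cache transfer time accounts for most of the TTFT improvement.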

Implementation Details

Relaxed Ordering (RO) Support

- Detects whether the hardware supports RO and enables it when available
- Adds logging to indicate whether RO is enabled at runtime, improving observability

RDMA Queue Selection and QoS

- Adds an environment variable to select RDMA queues
- Adds logging to show the details of traffic classes, helping verify correct functionality

Fully backward compatible; scenarios not using these new features remain unaffected with no regression risk.
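
For readers choosing a traffic class value, the sketch below shows how a 0-255 traffic class byte relates to DSCP and ECN on RoCE v2 networks, where the byte occupies the IP ToS / Traffic Class field (upper six bits DSCP, lower two bits ECN). The helper names are hypothetical and not part of this PR, and it is an assumption here that the configured value is passed unchanged into the queue pair's GRH traffic class.

```cpp
#include <cstdint>

// Hypothetical helpers (not from this PR): convert between a DSCP code point
// and the 8-bit traffic class byte. Layout: [ DSCP: 6 bits | ECN: 2 bits ].
constexpr std::uint8_t trafficClassFromDscp(std::uint8_t dscp,
                                            std::uint8_t ecn = 0) {
    return static_cast<std::uint8_t>(((dscp & 0x3F) << 2) | (ecn & 0x3));
}

constexpr std::uint8_t dscpFromTrafficClass(std::uint8_t traffic_class) {
    return static_cast<std::uint8_t>(traffic_class >> 2);
}
```

For example, DSCP 26 with ECN bits cleared corresponds to a traffic class of 104. Which DSCP maps to which switch or NIC queue depends on the network's own QoS configuration.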

Type of Change

Types
- Bug fix
- [*] New feature
  - [*] Transfer Engine
  - Mooncake Store
  - Mooncake EP
  - Integration
  - P2P Store
  - Python Wheel
- Breaking change
- CI/CD
- Documentation update
- Other

Checklist

- [*] I have performed a self-review of my own code.
- I have updated the documentation.
- I have added tests to prove my changes are effective.

gemini-code-assist · 2025-11-18T11:35:03Z

Summary of Changes

Hello @1998zxn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes Mooncake's performance, particularly in 2P1D deployment scenarios, by targeting bottlenecks in KV Cache transfers. It achieves this through two key enhancements: enabling PCIe Relaxed Ordering to improve data transfer flexibility and implementing RDMA traffic class control for better queue-level QoS under high-traffic conditions. These changes are designed to collectively reduce Time-To-First-Token (TTFT) latency by optimizing the underlying data transport mechanisms.

Highlights

PCIe Relaxed Ordering (RO) Support: This PR introduces support for PCIe Relaxed Ordering, which dynamically detects hardware capabilities and enables RO to improve PCIe out-of-order packet handling efficiency. This optimization is crucial for reducing KV Cache transfer times.
RDMA Traffic Class (TC) Control: A new mechanism has been added to allow users to control RDMA queue selection via the MoonCake_TC_DEFAULT environment variable. This enables queue-level Quality of Service (QoS) to mitigate congestion during burst traffic scenarios, further enhancing transfer performance.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces valuable performance optimizations by adding support for PCIe Relaxed Ordering and RDMA traffic class control. The implementation is generally well-done, with good use of logging for better observability. I have identified a few areas for improvement to enhance code robustness and correctness: a potential data race in a multi-threaded context, unsafe parsing of an environment variable, and a minor code style issue for better encapsulation. Addressing these points will make the new features more solid.

gemini-code-assist · 2025-11-18T11:36:34Z

mooncake-transfer-engine/src/transport/rdma_transport/rdma_transport.cpp

+    if (access_rights == 0) {
+        access_rights = kBaseAccessRights;
+        if (MCIbRelaxedOrderingEnabled) {
+            access_rights |= IBV_ACCESS_RELAXED_ORDERING;
+        }
+    }


The one-time initialization of access_rights is not thread-safe. registerLocalMemory can be called concurrently (e.g., from registerLocalMemoryBatch), creating a data race when checking if (access_rights == 0) and then modifying it. This is undefined behavior. Please use std::call_once to ensure this initialization is performed atomically and exactly once across all threads.

Suggested change

-    if (access_rights == 0) {
-        access_rights = kBaseAccessRights;
-        if (MCIbRelaxedOrderingEnabled) {
-            access_rights |= IBV_ACCESS_RELAXED_ORDERING;
-        }
-    }
+    static std::once_flag access_rights_flag;
+    std::call_once(access_rights_flag, [&]() {
+        access_rights = kBaseAccessRights;
+        if (MCIbRelaxedOrderingEnabled) {
+            access_rights |= IBV_ACCESS_RELAXED_ORDERING;
+        }
+    });
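
As an alternative to std::call_once, a function-local static gives the same once-only guarantee, because C++11 makes the initialization of block-scope statics thread-safe. The sketch below is illustrative only: the constants are placeholders rather than the real ibverbs flag values, and accessRights and computeAccessRights are hypothetical names not taken from this PR.

```cpp
#include <cstdint>

// Placeholder values for illustration; real code would use the IBV_ACCESS_*
// flags from <infiniband/verbs.h>.
constexpr std::uint32_t kBaseAccessRights = 0x7;
constexpr std::uint32_t kRelaxedOrderingBit = 1u << 20;

// Pure helper: builds the access mask from the relaxed-ordering decision.
constexpr std::uint32_t computeAccessRights(bool relaxed_ordering_enabled) {
    return relaxed_ordering_enabled ? (kBaseAccessRights | kRelaxedOrderingBit)
                                    : kBaseAccessRights;
}

// The static is initialized exactly once, even if several threads make the
// first call concurrently. Later calls return the cached value and ignore
// their argument.
std::uint32_t accessRights(bool relaxed_ordering_enabled) {
    static const std::uint32_t rights =
        computeAccessRights(relaxed_ordering_enabled);
    return rights;
}
```

One caveat of this form is that the value is fixed by whichever call happens first, so the relaxed-ordering decision has to be final before any memory registration runs.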

gemini-code-assist · 2025-11-18T11:36:34Z

mooncake-transfer-engine/src/config.cpp

+        int val = atoi(traffic_class_env);
+        if (val >= 0 && val <= 255)
+            config.traffic_class = val;
+        else
+            LOG(WARNING) << "Ignore value from environment variable "
+                            "MoonCake_TC_DEFAULT, it should be 0-255";


Using atoi for string-to-integer conversion is unsafe as it doesn't handle errors. For instance, if MoonCake_TC_DEFAULT is set to a non-numeric string like "abc", atoi returns 0, which is a valid traffic class, leading to silent misconfiguration. It's better to use std::stoi with a try-catch block for robust error handling, which is also consistent with other parts of this file (e.g., line 191).

        try {
            int val = std::stoi(traffic_class_env);
            if (val >= 0 && val <= 255)
                config.traffic_class = val;
            else
                LOG(WARNING) << "Ignore value from environment variable "
                                "MoonCake_TC_DEFAULT, it should be 0-255";
        } catch (const std::exception &) {
            LOG(WARNING)
                << "Invalid value for environment variable MoonCake_TC_DEFAULT: '"
                << traffic_class_env
                << "'. It should be an integer between 0-255.";
        }
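
Note that std::stoi still accepts leading whitespace and trailing characters (for example, "12abc" parses as 12). If stricter validation is wanted, a minimal sketch using std::from_chars (C++17) could look like the following. The function name parseTrafficClass is hypothetical and not part of this PR.

```cpp
#include <charconv>
#include <cstring>
#include <optional>
#include <system_error>

// Hypothetical helper: returns the parsed value only if the entire string is
// a base-10 integer in [0, 255]; otherwise returns std::nullopt.
std::optional<int> parseTrafficClass(const char *text) {
    if (text == nullptr) return std::nullopt;
    const char *end = text + std::strlen(text);
    int value = 0;
    auto [ptr, ec] = std::from_chars(text, end, value);
    // Reject parse errors, overflow, and trailing characters such as "12abc".
    if (ec != std::errc() || ptr != end) return std::nullopt;
    if (value < 0 || value > 255) return std::nullopt;
    return value;
}
```

The caller can then log a warning and keep the default whenever the result is empty, which avoids exceptions on this path.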

gemini-code-assist · 2025-11-18T11:36:34Z

mooncake-transfer-engine/src/transport/rdma_transport/rdma_transport.cpp

+static bool MCIbRelaxedOrderingEnabled = false;
+
+// Determine whether RELAXED_ORDERING is enabled and possible
+bool has_ibv_reg_mr_iova2(void) {


This helper function has_ibv_reg_mr_iova2 is only used within this file. To limit its scope to this translation unit and prevent potential linkage conflicts, it's good practice to declare it as static. This aligns with the declaration of MCIbRelaxedOrderingEnabled.

Suggested change

-bool has_ibv_reg_mr_iova2(void) {
+static bool has_ibv_reg_mr_iova2(void) {

Aleda · 2025-11-18T12:14:02Z

mooncake-transfer-engine/src/config.cpp

        }
    }
+
+    const char *traffic_class_env = std::getenv("MoonCake_TC_DEFAULT");


Split this commit into two atomic commits.

stmatengss · 2025-11-18T16:43:29Z

mooncake-transfer-engine/src/transport/rdma_transport/rdma_transport.cpp

 #include <set>
 #include <thread>

+#include <dlfcn.h>


Make sure these two files are commonly available across all types of RDMA libraries.

1998zxn · 2025-11-19

dlfcn.h is a POSIX standard header that provides dynamic loading functions (dlopen, dlsym, etc.).
It is part of glibc on Linux and is not tied to any specific RDMA implementation.

Also, the previous #include <infiniband/verbs.h> has been removed in the latest revision.

stmatengss · 2025-11-18T16:44:38Z

mooncake-transfer-engine/src/transport/rdma_transport/rdma_transport.cpp

+
+// Determine whether RELAXED_ORDERING is enabled and possible
+bool has_ibv_reg_mr_iova2(void) {
+    void *handle = dlopen("libibverbs.so", RTLD_NOW);


This is too tricky a way to detect the status.

1998zxn · 2025-11-19

This PR introduces Relaxed Ordering (RO) support with two mechanisms:

1. Capability detection
RO is detected by checking for ibv_reg_mr_iova2 in the RDMA dynamic library, following NCCL’s approach. NCCL notes that statically linked builds can check the function directly, but MoonCake relies on dynamic linking, so the dynamic-link path is sufficient.
2. Explicit control (New)
An environment variable is added to allow users to explicitly enable or disable RO.

This approach is lightweight, reliable for our dynamic-linking model, and consistent with industry practice.
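
The capability-detection step described above can be sketched as follows. This is a minimal illustration with a hypothetical helper name, sharedLibraryHasSymbol, and is not the PR's actual code. It only checks that the symbol resolves in the loaded library; it does not prove that a given device honors relaxed ordering.

```cpp
#include <dlfcn.h>

// Hypothetical helper: returns true if `symbol` can be resolved in the shared
// library `library`. Returns false if the library cannot be loaded at all.
bool sharedLibraryHasSymbol(const char *library, const char *symbol) {
    void *handle = dlopen(library, RTLD_NOW);
    if (handle == nullptr) return false;
    dlerror();  // clear any stale error state before the lookup
    void *address = dlsym(handle, symbol);
    dlclose(handle);  // drop the reference taken by dlopen above
    return address != nullptr;
}
```

A call such as sharedLibraryHasSymbol("libibverbs.so", "ibv_reg_mr_iova2") mirrors the check in this PR. On hosts without the development package, the unversioned libibverbs.so name may be absent, so trying libibverbs.so.1 as a fallback may be worth considering.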

alogfans · 2025-11-19T12:52:24Z

@1998zxn Is there any performance result about how both options affect TTFT?

staryxchen · 2025-11-19T15:16:11Z

mooncake-transfer-engine/src/config.cpp

        }
    }
+
+    const char *traffic_class_env = std::getenv("MC_IB_TC");


What is the recommended setting for this value?

staryxchen · 2025-11-19T15:18:41Z

mooncake-transfer-engine/src/transport/rdma_transport/rdma_transport.cpp

+        return;
+    }
+
+    MCIbRelaxedOrderingEnabled = has_ibv_reg_mr_iova2();


Can we use this implementation as a reference?

staryxchen · 2025-11-19T15:20:31Z

When IBV_ACCESS_RELAXED_ORDERING is set, RDMA write-after-write message ordering is no longer guaranteed. I'm not sure whether this affects us.

Aleda · 2025-11-20T09:50:34Z

> When IBV_ACCESS_RELAXED_ORDERING is set, RDMA write-after-write message ordering is no longer guaranteed. I'm not sure whether this affects us.

@staryxchen
Based on our observations, many vendors forcibly enable force_relax at the NIC FW level to improve performance. From our current benchmarking results, it appears that this has no negative impact on Mooncake, while the performance gains can be quite significant.
https://docs.nvidia.com/grace-perf-tuning-guide/optimizing-io.html

I’d also suggest adding a configuration option that allows us to explicitly enable or disable this relaxed ordering behavior, or to let it be enabled automatically depending on the environment.

staryxchen · 2025-11-20T10:54:34Z

> > When IBV_ACCESS_RELAXED_ORDERING is set, RDMA write-after-write message ordering is no longer guaranteed. I'm not sure whether this affects us.
>
> @staryxchen Based on our observations, many vendors forcibly enable force_relax at the NIC FW level to improve performance. From our current benchmarking results, it appears that this has no negative impact on Mooncake, while the performance gains can be quite significant. https://docs.nvidia.com/grace-perf-tuning-guide/optimizing-io.html
>
> I’d also suggest adding a configuration option that allows us to explicitly enable or disable this relaxed ordering behavior, or to let it be enabled automatically depending on the environment.

A configuration option is better, and I would prefer it to be disabled by default. That way, users who are not concerned with these settings are not affected in any way, while performance-focused users, who are more likely to fully grasp the implications of this feature, can enable it as needed.
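
The opt-in policy suggested here could be sketched as below. This is only an illustration under stated assumptions: the function name resolveRelaxedOrdering is hypothetical, the thread does not name the controlling environment variable, and treating "1" as the only enabling value is an assumption.

```cpp
#include <cstring>

// Hypothetical policy helper: relaxed ordering is off unless the user
// explicitly requests it (env_value == "1") AND the runtime check reports
// that the capability is available.
bool resolveRelaxedOrdering(const char *env_value, bool hardware_capable) {
    const bool requested =
        env_value != nullptr && std::strcmp(env_value, "1") == 0;
    return requested && hardware_capable;
}
```

With this shape, an unset variable leaves existing deployments unchanged, and a request on hardware without the capability falls back to strict ordering rather than failing.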

gemini-code-assist bot reviewed Nov 18, 2025

Aleda reviewed Nov 18, 2025

mooncake-transfer-engine/src/config.cpp Outdated

        }
    }
+
+    const char *traffic_class_env = std::getenv("MoonCake_TC_DEFAULT");

Aleda Nov 18, 2025

MC_IB_TC

Aleda reviewed Nov 18, 2025

1998zxn force-pushed the feature/relaxed-ordering-and-rdma-tc branch from a0a8ae2 to 923a4b4 Compare November 18, 2025 16:17

stmatengss reviewed Nov 18, 2025

1998zxn added 2 commits November 19, 2025 00:54

add PCIe Relaxed Ordering (RO) support.

9b81dc7

add RDMA traffic class contrl

c2245f6

1998zxn force-pushed the feature/relaxed-ordering-and-rdma-tc branch from 923a4b4 to c2245f6 Compare November 18, 2025 17:01

fix: add env variable to control Relaxed Ordering (RO)

2b6bad9

staryxchen reviewed Nov 19, 2025


5 participants