-
Notifications
You must be signed in to change notification settings - Fork 436
[TransferEngine][WIP]heterogeneous-tcp #1070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @AscendTransport, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new TCP-based transport for heterogeneous communication involving Ascend NPUs. The changes are extensive, touching CMake build scripts, integration code, and the core transfer engine. While the feature addition is significant, the current implementation has several critical issues that must be addressed. These include compilation errors from incorrect include paths and non-standard function usage, potential data corruption and race conditions due to incorrect types and synchronization, and performance bottlenecks from busy-waiting. I've provided specific comments and suggestions to resolve these problems.
| #include "transport/ascend_transport/heterogeneous_transport/heterogeneous_rdma_transport.h" | ||
| #endif | ||
| #ifdef USE_ASCEND_HETEROGENEOUS_TCP | ||
| #include "transport/ascend_transport/heterogeneous_transport/heterogeneous_tcp_transport.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The include path for heterogeneous_tcp_transport.h appears to be incorrect. Based on the file's location (mooncake-transfer-engine/include/transport/ascend_transport/heterogeneous_tcp_transport.h) and the project's include paths, the extra heterogeneous_transport/ directory in the path is likely a copy-paste error and will cause a compilation failure.
| #include "transport/ascend_transport/heterogeneous_transport/heterogeneous_tcp_transport.h" | |
| #include "transport/ascend_transport/heterogeneous_tcp_transport.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The include path for
heterogeneous_tcp_transport.happears to be incorrect. Based on the file's location (mooncake-transfer-engine/include/transport/ascend_transport/heterogeneous_tcp_transport.h) and the project's include paths, the extraheterogeneous_transport/directory in the path is likely a copy-paste error and will cause a compilation failure.
fixed
| #ifdef USE_ASCEND_HETEROGENEOUS_TCP | ||
| else if (std::string(proto) == "ascend") { | ||
| transport = new HeterogeneousTcpTransport(); | ||
| } | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This else if block is conditional on USE_ASCEND_HETEROGENEOUS_TCP, but it's syntactically attached to the preceding if-else if chain. If USE_ASCEND_HETEROGENEOUS is disabled, this else if will be dangling and cause a compilation error. The entire block for selecting an ascend transport should be restructured to handle the mutually exclusive options correctly, for example by using a single if-else if chain with #if/#elif/#endif preprocessor directives to ensure correctness regardless of which flags are enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This
else ifblock is conditional onUSE_ASCEND_HETEROGENEOUS_TCP, but it's syntactically attached to the precedingif-else ifchain. IfUSE_ASCEND_HETEROGENEOUSis disabled, thiselse ifwill be dangling and cause a compilation error. The entire block for selecting anascendtransport should be restructured to handle the mutually exclusive options correctly, for example by using a singleif-else ifchain with#if/#elif/#endifpreprocessor directives to ensure correctness regardless of which flags are enabled.
fixed
| return ret; | ||
| } | ||
|
|
||
| hostAddr_ = std::aligned_alloc(64, HUGE_HOST_SIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::aligned_alloc is not part of the C++ standard; it's a C11 function. Using std::aligned_alloc will likely cause compilation errors on standard-compliant compilers. You should include <cstdlib> and use aligned_alloc. The memory is correctly freed with free() in the destructor.
| hostAddr_ = std::aligned_alloc(64, HUGE_HOST_SIZE); | |
| hostAddr_ = static_cast<char*>(aligned_alloc(64, HUGE_HOST_SIZE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (transferQueues_.empty()) { | ||
| transfer_cond_.wait(lock); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wait on the condition variable should be done in a loop or with a predicate to handle spurious wakeups. The current implementation might cause the thread to wake up, find the queue empty, and then crash when trying to access transferQueues_.front(). A predicate in the wait call is the standard way to solve this.
| if (transferQueues_.empty()) { | |
| transfer_cond_.wait(lock); | |
| } | |
| transfer_cond_.wait(lock, [this] { return !transferQueues_.empty() || !running_; }); | |
| if (transferQueues_.empty()) { | |
| continue; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
waiton the condition variable should be done in a loop or with a predicate to handle spurious wakeups. The current implementation might cause the thread to wake up, find the queue empty, and then crash when trying to accesstransferQueues_.front(). A predicate in thewaitcall is the standard way to solve this.
fixed
| << "HeterogeneousTcpTransport: empty transfer task batch"; | ||
| continue; | ||
| } | ||
| int total_length = pkg.total_length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a potential data truncation here. pkg.total_length is a uint64_t, but it's being assigned to an int. On a typical 64-bit system, int is 32 bits, which can lead to incorrect length if the transfer size exceeds 2GB. The total_length variable should be uint64_t to match the type in the TransferTaskTCP struct.
| int total_length = pkg.total_length; | |
| uint64_t total_length = pkg.total_length; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a potential data truncation here.
pkg.total_lengthis auint64_t, but it's being assigned to anint. On a typical 64-bit system,intis 32 bits, which can lead to incorrect length if the transfer size exceeds 2GB. Thetotal_lengthvariable should beuint64_tto match the type in theTransferTaskTCPstruct.
fixed
| #define HUGE_HOST_SIZE 3ULL * 1024 * 1024 * 1024 | ||
| #define HUGE_DEVICE_SIZE 8 * 1024 * 1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good practice to wrap macro definitions containing expressions in parentheses to avoid potential operator precedence issues when they are used in more complex expressions. While it may not be causing a bug here, it improves code robustness.
| #define HUGE_HOST_SIZE 3ULL * 1024 * 1024 * 1024 | |
| #define HUGE_DEVICE_SIZE 8 * 1024 * 1024 | |
| #define HUGE_HOST_SIZE (3ULL * 1024 * 1024 * 1024) | |
| #define HUGE_DEVICE_SIZE (8 * 1024 * 1024) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Status getTransferStatus(BatchID batch_id, | ||
| std::vector<TransferStatus> &status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| s = transport_->submitTransferTask(task_list); | ||
| if (!s.ok()) { | ||
| LOG(ERROR) | ||
| << "HeterogeneousTcpTransport: Rdma submitTransferTask error"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message mentions Rdma, which seems to be a copy-paste error from another transport implementation. This should be corrected to Tcp or HeterogeneousTcp for accurate logging.
| << "HeterogeneousTcpTransport: Rdma submitTransferTask error"; | |
| << "HeterogeneousTcpTransport: Tcp submitTransferTask error"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| if (transport_ == nullptr) { | ||
| LOG(ERROR) << "HeterogeneousTcpTransport:transport is null"; | ||
| return ret; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If transport_ is null, this function returns ret, which holds the value from aclrtGetDevice. This is likely not the intended error code. It would be better to return a specific error code, like -1, to indicate the failure due to the null transport.
| if (transport_ == nullptr) { | |
| LOG(ERROR) << "HeterogeneousTcpTransport:transport is null"; | |
| return ret; | |
| } | |
| if (transport_ == nullptr) { | |
| LOG(ERROR) << "HeterogeneousTcpTransport:transport is null"; | |
| return -1; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
|
||
| ret = transport_->install(local_server_name_, meta, topo); | ||
| if (ret) { | ||
| LOG(ERROR) << "HeterogeneousTcpTransport::RdmaTransport install error, ret: " << ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message mentions RdmaTransport, which seems to be a copy-paste error from another transport implementation. This should be corrected to TcpTransport for accurate logging.
| LOG(ERROR) << "HeterogeneousTcpTransport::RdmaTransport install error, ret: " << ret; | |
| LOG(ERROR) << "HeterogeneousTcpTransport::TcpTransport install error, ret: " << ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
fe45a11 to
f0daa9e
Compare
Description
Type of Change
How Has This Been Tested?
Checklist