-
Notifications
You must be signed in to change notification settings - Fork 35
Draft: Add DDLB workload #711
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution!
Please also:
- Extend
test_acceptance.pyto cover sbatch generation logic. - Add documentation page for this workload, see
doc/workloadsfor examples. And link this page to the main one.
|
|
||
| def generate_test_command(self) -> List[str]: | ||
| tdef: DDLBTestDefinition = cast(DDLBTestDefinition, self.test_run.test.test_definition) | ||
| srun_command_parts = ["python scripts/run_benchmark.py"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to use relative path? We can introduce a field in the test definition for this workload to hold path_to_script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the container the default path has this relative path available. I think it's ok, I can update it if the default path changes in the container
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6 files reviewed, no comments
|
Thanks for the review. I will update with these changes. In the mean time, when I tried running this change, I found that Here the output of a manual run: I figured since the container is ~9 GB, I should wait a little bit. But it's been about 4 hours, so I think it's safe to assume it's a hang. |
Depends on the system it can take some time, but 4h for 9GB is too much. Have you tried enabling local caching in system with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The most recent changes address previously raised issues about copyright dates and commented code. The developer has updated copyright headers in newly added files to use only "2025" (instead of "2024-2025") and removed commented pre_test and post_test lines from the test scenario configuration, streamlining the DDLB integration. These are minor cleanup changes that improve code consistency with project conventions.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| conf/common/test_scenario/ddlb_test.toml | 5/5 | Removed commented pre_test and post_test hook lines, leaving clean minimal configuration |
| src/cloudai/workloads/ddlb/init.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/registration.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| conf/common/test/ddlb_test.toml | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
Confidence score: 5/5
- These changes are safe to merge as they only address formatting and consistency issues raised in previous reviews
- The score reflects that these are purely cosmetic/metadata changes with no functional impact on code behavior
- No files require special attention; all changes are straightforward corrections to copyright headers and removal of commented placeholder code
6 files reviewed, 7 comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This review covers changes made to the DDLB workload since the last review, not the entire PR. The developer addressed most of the critical feedback from prior reviews by fixing the unreachable code bug, removing dead configuration comments, simplifying validation logic, and standardizing copyright headers to "2025" for newly added files. The key fix removes the duplicate "Error" check in ddlb.py lines 58/68 that made success validation unreachable, and eliminates the unused missing_indicators list. PEP 8 formatting was also corrected. The test scenario timeout was extended from 10 to 30 minutes to allow DDLB benchmarks to complete. These changes clean up the DDLB integration while addressing previously flagged code quality issues.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/init.py | 5/5 | Updated copyright year from "2024-2025" to "2025" (administrative only) |
| conf/common/test_scenario/ddlb_test.toml | 4.5/5 | Extended test time limit from 10 to 30 minutes and removed dead commented-out fields |
| src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Fixed critical duplicate error check bug making validation unreachable; removed unused missing_indicators list |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 4/5 | Simplified command generation by removing intermediate variable; copyright updated |
Confidence score: 4/5
- This PR addresses critical bugs but one code smell remains that should be resolved before merging
- Score reflects that the duplicate error check bug was fixed and copyright headers were standardized, but the unused
tdefvariable inslurm_command_gen_strategy.pystill exists from prior reviews, and concerns about the relative path safety raised in previous review ("Is it safe to use relative path? We can introduce a field in the test definition for this workload to holdpath_to_script.") remain unaddressed - Review
slurm_command_gen_strategy.pycarefully—the unusedtdefvariable suggests the test definition may need to be used for configuration in the future, and the hardcoded relative path "scripts/run_benchmark.py" may cause failures if executed from unexpected working directories
4 files reviewed, no comments
Thanks. I opted to try it on another cluster, and it failed there too with It seems like the container is too big to convert to a .sqsh file with the scratch space available in enroot's |
If this image is too big, how will you create |
| test_template_name = "DDLBTest" | ||
|
|
||
| [cmd_args] | ||
| docker_image_url = "gitlab-master.nvidia.com/nsarkauskas/ddlb:latest" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't use non-public images for common/ configs:
- If possible, let's use a public image
- If there is no public image, we can either move this config to another folder or to CloudAIx repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in the near future we'll make the image public. Which folder should I move it to until then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use conf/experimental for now. Or create a new one under conf/.
|
@nsarka please merge your PR with the latest main branch to align check list. |
1171554 to
2bf56e4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI. The implementation follows established patterns from other workloads like NCCL and ChakraReplay.
Key Changes:
- New test definition (
DDLBTestDefinition) with Docker image support - Slurm command generation strategy that executes
python scripts/run_benchmark.py - Configuration files for test setup (single node, 30-minute timeout)
- Success validation checking for "Benchmark Results" in stdout
- Registration in the main registry alongside other test definitions
Observations:
- The error detection uses a generic
"Error"string check which may produce false positives - The implementation is minimal but functional, delegating most logic to the container's benchmark script
- No unit tests included for the new workload (though other workloads have test coverage)
Confidence Score: 4/5
- This PR is safe to merge with minor refinements recommended
- The implementation follows existing patterns closely (NCCL, ChakraReplay) and integrates cleanly into the registry. The main concern is the generic error detection string which could cause false positives. The code is well-structured and mirrors established workload patterns, making it maintainable. No breaking changes or security issues identified.
- Primary attention needed on
src/cloudai/workloads/ddlb/ddlb.pyfor error detection refinement
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/ddlb.py | 4/5 | Core DDLB test definition with generic error detection ('Error' string may match false positives), success validation checks for 'Benchmark Results' |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Slurm command generation for DDLB, returns static test command, success check validates 'Benchmark Results' in output |
Sequence Diagram
sequenceDiagram
participant User
participant Registry
participant DDLBTestDefinition
participant DDLBTestSlurmCommandGenStrategy
participant SlurmSystem
participant DockerImage
participant OutputFile
User->>Registry: Register DDLB workload
Registry->>Registry: Add DDLBTestDefinition
Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy
User->>DDLBTestDefinition: Create test with docker_image_url
DDLBTestDefinition->>DockerImage: Initialize DockerImage(url)
User->>DDLBTestSlurmCommandGenStrategy: Generate test command
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return image path
DDLBTestSlurmCommandGenStrategy->>SlurmSystem: Generate srun command with container
DDLBTestSlurmCommandGenStrategy-->>User: Return ["python scripts/run_benchmark.py"]
User->>SlurmSystem: Execute test via Slurm
SlurmSystem->>OutputFile: Write stdout.txt
User->>DDLBTestDefinition: Check was_run_successful()
DDLBTestDefinition->>OutputFile: Read stdout.txt
alt Contains "Error"
DDLBTestDefinition-->>User: JobStatusResult(False, error details)
else Missing "Benchmark Results"
DDLBTestDefinition-->>User: JobStatusResult(False, missing indicators)
else Success
DDLBTestDefinition-->>User: JobStatusResult(True)
end
6 files reviewed, 1 comment
| content = file.read() | ||
|
|
||
| # Check for specific error patterns | ||
| if "Error" in content: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: generic "Error" check may produce false positives (e.g., "Error rate: 0%", "Error handling initialized"). Consider more specific pattern like "Error:" or actual DDLB error messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
adds DDLB (Distributed Deep Learning Benchmark) workload support following the existing CloudAI workload pattern with test definition, Slurm command generation strategy, and configuration files.
Key changes:
- registered DDLB workload in
src/cloudai/registration.py - created
DDLBTestDefinitionwith Docker image management and success validation based on "Benchmark Results" pattern - implemented
DDLBTestSlurmCommandGenStrategyto generate mpirun commands - added test configuration (
conf/common/test/ddlb_test.toml) and test scenario
Issues found:
- critical command generation bug in
slurm_command_gen_strategy.py:36that produces malformed commands - unused imports in
ddlb.py
Confidence Score: 2/5
- critical bug in command generation will cause runtime failures when executing DDLB tests
- the
generate_test_commandmethod inslurm_command_gen_strategy.py:36constructs a malformed command list with"mpirun -np "(trailing space) as a single element, which when joined with spaces produces"mpirun -np 8 python..."(double space). This breaks command parsing and will cause test execution failures src/cloudai/workloads/ddlb/slurm_command_gen_strategy.pyrequires immediate fix to command generation logic
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| conf/common/test/ddlb_test.toml | 4/5 | configuration file with hardcoded path, standard structure matches other test configs |
| src/cloudai/workloads/ddlb/ddlb.py | 3/5 | test definition with unused imports, follows established patterns for workload definitions |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | command generation with critical bug in generate_test_command list structure (line 36) that will produce malformed command |
Sequence Diagram
sequenceDiagram
participant User
participant Registry
participant DDLBTestDefinition
participant SlurmCommandGenStrategy
participant DockerImage
participant SlurmSystem
User->>Registry: register DDLB workload
Registry->>DDLBTestDefinition: register test definition
Registry->>SlurmCommandGenStrategy: register command gen strategy
User->>DDLBTestDefinition: load test config (ddlb_test.toml)
DDLBTestDefinition->>DockerImage: initialize docker_image from docker_image_url
User->>SlurmCommandGenStrategy: generate execution command
SlurmCommandGenStrategy->>DDLBTestDefinition: get test definition
DDLBTestDefinition->>DockerImage: get installed_path
SlurmCommandGenStrategy->>SlurmCommandGenStrategy: generate_test_command() -> ["mpirun -np ", "8", "python scripts/run_benchmark.py"]
SlurmCommandGenStrategy->>SlurmSystem: create sbatch script with srun command
User->>SlurmSystem: execute job
SlurmSystem->>SlurmSystem: run mpirun with DDLB benchmark
SlurmSystem-->>User: output to stdout.txt
User->>DDLBTestDefinition: was_run_successful(test_run)
DDLBTestDefinition->>DDLBTestDefinition: check stdout.txt for "Error" or "Benchmark Results"
DDLBTestDefinition-->>User: JobStatusResult
3 files reviewed, 2 comments
| cmd = ["mpirun -np "] | ||
| cmd.append(str(tdef.cmd_args.np)) | ||
| cmd.append("python scripts/run_benchmark.py") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: incorrect command list structure—"mpirun -np " with trailing space will produce malformed command when joined. When " ".join() is applied, this becomes "mpirun -np 8 python scripts/run_benchmark.py" (double space). Each element should be a complete token.
| cmd = ["mpirun -np "] | |
| cmd.append(str(tdef.cmd_args.np)) | |
| cmd.append("python scripts/run_benchmark.py") | |
| cmd = ["mpirun", "-np"] | |
| cmd.append(str(tdef.cmd_args.np)) |
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| from typing import Literal, Optional, Union |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Literal and Union are imported but never used
| from typing import Literal, Optional, Union | |
| from typing import Optional |
This reverts commit eda5d0e.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI, following the existing pattern for workload registration with Slurm systems.
Key Changes
- Added DDLB test definition with Docker image support and job success validation
- Implemented Slurm command generation strategy for DDLB workloads
- Created configuration files for test and test scenario definitions
- Registered DDLB workload in the global registry alongside other workloads like NCCL and UCC
Issues Identified
- Critical: Duplicate error checking logic in
ddlb.py:59-68makes second condition unreachable - Unused imports (
Literal,Union) inddlb.py:17 - Unused
tdefvariable inslurm_command_gen_strategy.py:35 - Generic "Error" pattern may cause false positives
- Potential
Nonehandling issue inimage_path()wheninstalled_pathisNone
Confidence Score: 2/5
- Not safe to merge - contains critical logic bug that prevents proper error detection
- The duplicate error check at lines 59-68 in
ddlb.pycreates unreachable code that will prevent the success indicator check from ever executing. This is a critical bug that breaks the test validation logic. Additionally, the generic "Error" pattern is prone to false positives, and theimage_path()method may return string"None"instead of handlingNoneproperly. src/cloudai/workloads/ddlb/ddlb.pyrequires immediate attention due to unreachable code, andslurm_command_gen_strategy.pyneeds review for None handling
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/cloudai/workloads/ddlb/ddlb.py | 2/5 | DDLB test definition with critical logic error in duplicate error checking (lines 59-68 unreachable), unused imports, and overly generic error pattern matching |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 3/5 | Command generation strategy with unused tdef variable and potential None handling issue in image_path() method |
Sequence Diagram
sequenceDiagram
participant User
participant Registry
participant TestRunner
participant DDLBTestDefinition
participant DDLBTestSlurmCommandGenStrategy
participant SlurmSystem
participant DockerImage
User->>Registry: Register DDLB workload
Registry->>Registry: Add DDLBTestDefinition
Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy
User->>TestRunner: Execute DDLB test
TestRunner->>DDLBTestDefinition: Load test configuration
DDLBTestDefinition->>DockerImage: Initialize docker_image from URL
TestRunner->>DDLBTestSlurmCommandGenStrategy: Generate command
DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Generate test command
DDLBTestSlurmCommandGenStrategy-->>TestRunner: Return ["python scripts/run_benchmark.py"]
TestRunner->>SlurmSystem: Submit job with srun command
SlurmSystem-->>TestRunner: Job execution
TestRunner->>DDLBTestDefinition: was_run_successful()
DDLBTestDefinition->>DDLBTestDefinition: Read stdout.txt
DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" pattern
DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
DDLBTestDefinition-->>TestRunner: Return JobStatusResult
3 files reviewed, no comments
DDLB workload integration