
Conversation

@nsarka (Member) commented Oct 16, 2025

DDLB workload integration

@amaslenn (Contributor) left a comment


Thank you for your contribution!

Please also:

  1. Extend test_acceptance.py to cover the sbatch generation logic.
  2. Add a documentation page for this workload (see doc/workloads for examples) and link it from the main page.


    def generate_test_command(self) -> List[str]:
        tdef: DDLBTestDefinition = cast(DDLBTestDefinition, self.test_run.test.test_definition)
        srun_command_parts = ["python scripts/run_benchmark.py"]
Contributor:

Is it safe to use a relative path? We could introduce a field in the test definition for this workload to hold path_to_script.
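A minimal sketch of that suggestion (a plain dataclass stands in for the real test-definition model; the field name follows the comment and the default mirrors the currently hardcoded path):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DDLBCmdArgs:
    # Hypothetical field per the review suggestion; defaults to the
    # relative path the command generator currently hardcodes.
    path_to_script: str = "scripts/run_benchmark.py"

def generate_test_command(args: DDLBCmdArgs) -> List[str]:
    # The script path now comes from the test definition instead of a literal.
    return [f"python {args.path_to_script}"]
```

An absolute path could then be set per-system in the test TOML without touching the strategy code.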

Member Author:
In the container, this relative path is available from the default working directory. I think it's OK; I can update it if the default path in the container changes.

@greptile-apps (bot) left a comment

6 files reviewed, no comments


@nsarka (Member, Author) commented Oct 22, 2025

Thanks for the review. I will update with these changes.

In the meantime, when I tried running this change, I found that srun hangs when I use --container-image. Do you have any ideas on how to troubleshoot this? @amaslenn

Here is the output of a manual run:

    srun --export=ALL --mpi=pmix --container-image=gitlab-master.nvidia.com/dl/pytorch/update-scripts:pjnl-latest hostname
    pyxis: importing docker image ...
    pyxis: importing docker image ...

I figured since the container is ~9 GB, I should wait a little bit. But it's been about 4 hours, so I think it's safe to assume it's a hang.

@amaslenn (Contributor) replied:

> Thanks for the review. I will update with these changes.
>
> In the meantime, when I tried running this change, I found that srun hangs when I use --container-image. Do you have any ideas on how to troubleshoot this? @amaslenn
>
> Here is the output of a manual run:
>
>     srun --export=ALL --mpi=pmix --container-image=gitlab-master.nvidia.com/dl/pytorch/update-scripts:pjnl-latest hostname
>     pyxis: importing docker image ...
>     pyxis: importing docker image ...
>
> I figured since the container is ~9 GB, I should wait a little bit. But it's been about 4 hours, so I think it's safe to assume it's a hang.

Depending on the system it can take some time, but 4h for 9 GB is too much.

Have you tried enabling local caching in system with cache_docker_images_locally = true (https://nvidia.github.io/cloudai/USER_GUIDE.html#step-4-system-configuration) and running cloudai install? This will run srun ... enroot ... to cache the image explicitly. Once done, subsequent runs will use the .sqsh file instead of pulling the image every time.
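For reference, a sketch of the system-config toggle mentioned above (only cache_docker_images_locally is taken from the linked user guide; the surrounding fields are illustrative placeholders):

```toml
# System config sketch; fields other than cache_docker_images_locally
# are illustrative placeholders.
name = "my-cluster"
scheduler = "slurm"
cache_docker_images_locally = true
```

With this set, cloudai install caches each image as a .sqsh once, and later runs reuse it instead of pulling.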

@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent changes address previously raised issues about copyright dates and commented code. The developer has updated copyright headers in newly added files to use only "2025" (instead of "2024-2025") and removed commented pre_test and post_test lines from the test scenario configuration, streamlining the DDLB integration. These are minor cleanup changes that improve code consistency with project conventions.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| conf/common/test_scenario/ddlb_test.toml | 5/5 | Removed commented pre_test and post_test hook lines, leaving a clean minimal configuration |
| src/cloudai/workloads/ddlb/__init__.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/registration.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| conf/common/test/ddlb_test.toml | 5/5 | Updated copyright year from "2024-2025" to "2025" only |

Confidence score: 5/5

  • These changes are safe to merge as they only address formatting and consistency issues raised in previous reviews
  • The score reflects that these are purely cosmetic/metadata changes with no functional impact on code behavior
  • No files require special attention; all changes are straightforward corrections to copyright headers and removal of commented placeholder code

6 files reviewed, 7 comments


@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

This review covers changes made to the DDLB workload since the last review, not the entire PR. The developer addressed most of the critical feedback from prior reviews by fixing the unreachable code bug, removing dead configuration comments, simplifying validation logic, and standardizing copyright headers to "2025" for newly added files. The key fix removes the duplicate "Error" check in ddlb.py lines 58/68 that made success validation unreachable, and eliminates the unused missing_indicators list. PEP 8 formatting was also corrected. The test scenario timeout was extended from 10 to 30 minutes to allow DDLB benchmarks to complete. These changes clean up the DDLB integration while addressing previously flagged code quality issues.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| src/cloudai/workloads/ddlb/__init__.py | 5/5 | Updated copyright year from "2024-2025" to "2025" (administrative only) |
| conf/common/test_scenario/ddlb_test.toml | 4.5/5 | Extended test time limit from 10 to 30 minutes and removed dead commented-out fields |
| src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Fixed critical duplicate error check bug making validation unreachable; removed unused missing_indicators list |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 4/5 | Simplified command generation by removing intermediate variable; copyright updated |

Confidence score: 4/5

  • This PR addresses critical bugs but one code smell remains that should be resolved before merging
  • Score reflects that the duplicate error check bug was fixed and copyright headers were standardized, but the unused tdef variable in slurm_command_gen_strategy.py still exists from prior reviews, and concerns about the relative path safety raised in previous review ("Is it safe to use relative path? We can introduce a field in the test definition for this workload to hold path_to_script.") remain unaddressed
  • Review slurm_command_gen_strategy.py carefully—the unused tdef variable suggests the test definition may need to be used for configuration in the future, and the hardcoded relative path "scripts/run_benchmark.py" may cause failures if executed from unexpected working directories

4 files reviewed, no comments


@nsarka (Member, Author) commented Oct 27, 2025

> > Thanks for the review. I will update with these changes.
> > In the meantime, when I tried running this change, I found that srun hangs when I use --container-image. Do you have any ideas on how to troubleshoot this? @amaslenn
> > Here is the output of a manual run:
> >
> >     srun --export=ALL --mpi=pmix --container-image=gitlab-master.nvidia.com/dl/pytorch/update-scripts:pjnl-latest hostname
> >     pyxis: importing docker image ...
> >     pyxis: importing docker image ...
> >
> > I figured since the container is ~9 GB, I should wait a little bit. But it's been about 4 hours, so I think it's safe to assume it's a hang.
>
> Depending on the system it can take some time, but 4h for 9 GB is too much.
>
> Have you tried enabling local caching in system with cache_docker_images_locally = true (https://nvidia.github.io/cloudai/USER_GUIDE.html#step-4-system-configuration) and running cloudai install? This will run srun ... enroot ... to cache the image explicitly. Once done, subsequent runs will use the .sqsh file instead of pulling the image every time.

Thanks. I opted to try it on another cluster, and it failed there too with:

    slurmstepd: error: pyxis:     [INFO] Creating squashfs filesystem...
    slurmstepd: error: pyxis:     Write failed because No space left on device
    slurmstepd: error: pyxis:     FATAL ERROR: Failed to write to output filesystem
    slurmstepd: error: pyxis:     Parallel mksquashfs: Using 32 processors
    slurmstepd: error: pyxis:     Creating 4.0 filesystem on /run/pyxis/47469/846367.2.squashfs, block size 131072.

It seems like the container is too big to convert to a .sqsh file with the scratch space available in enroot's tmpfs. Is there a way to pass the .sqsh file directly to CloudAI without caching? I want to make the .sqsh file and copy it over to the machine I'm testing on.

@amaslenn (Contributor) replied:

> Thanks. I opted to try it on another cluster, and it failed there too with:
>
>     slurmstepd: error: pyxis:     [INFO] Creating squashfs filesystem...
>     slurmstepd: error: pyxis:     Write failed because No space left on device
>     slurmstepd: error: pyxis:     FATAL ERROR: Failed to write to output filesystem
>     slurmstepd: error: pyxis:     Parallel mksquashfs: Using 32 processors
>     slurmstepd: error: pyxis:     Creating 4.0 filesystem on /run/pyxis/47469/846367.2.squashfs, block size 131072.
>
> It seems like the container is too big to convert to a .sqsh file with the scratch space available in enroot's tmpfs. Is there a way to pass the .sqsh file directly to CloudAI without caching? I want to make the .sqsh file and copy it over to the machine I'm testing on.

If this image is too big, how will you create the .sqsh file?
You can specify a local file in the docker_image_url field; it will bypass enroot.
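As a concrete sketch (the local path is hypothetical, and the enroot import invocation shows one way to build the .sqsh on a machine with enough scratch space; neither is from this PR):

```toml
# Sketch: point docker_image_url at a pre-built .sqsh file to bypass enroot.
# The path below is a hypothetical location on the test machine.
test_template_name = "DDLBTest"

[cmd_args]
docker_image_url = "/scratch/images/ddlb.sqsh"
```

The .sqsh itself could be built elsewhere, e.g. with enroot import -o ddlb.sqsh docker://gitlab-master.nvidia.com/nsarkauskas/ddlb:latest, and copied over.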

    test_template_name = "DDLBTest"

    [cmd_args]
    docker_image_url = "gitlab-master.nvidia.com/nsarkauskas/ddlb:latest"
Contributor:

We can't use non-public images for common/ configs:

  1. If possible, let's use a public image.
  2. If there is no public image, we can either move this config to another folder or to the CloudAIx repo.

Member Author:

I think in the near future we'll make the image public. Which folder should I move it to until then?

Contributor:

You can use conf/experimental for now. Or create a new one under conf/.

@amaslenn (Contributor) commented:

@nsarka please merge your PR with the latest main branch to align check list.

@nsarka force-pushed the nsarka/ddlb-integration branch from 1171554 to 2bf56e4 on October 31, 2025 at 15:12
@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI. The implementation follows established patterns from other workloads like NCCL and ChakraReplay.

Key Changes:

  • New test definition (DDLBTestDefinition) with Docker image support
  • Slurm command generation strategy that executes python scripts/run_benchmark.py
  • Configuration files for test setup (single node, 30-minute timeout)
  • Success validation checking for "Benchmark Results" in stdout
  • Registration in the main registry alongside other test definitions

Observations:

  • The error detection uses a generic "Error" string check which may produce false positives
  • The implementation is minimal but functional, delegating most logic to the container's benchmark script
  • No unit tests included for the new workload (though other workloads have test coverage)

Confidence Score: 4/5

  • This PR is safe to merge with minor refinements recommended
  • The implementation follows existing patterns closely (NCCL, ChakraReplay) and integrates cleanly into the registry. The main concern is the generic error detection string which could cause false positives. The code is well-structured and mirrors established workload patterns, making it maintainable. No breaking changes or security issues identified.
  • Primary attention needed on src/cloudai/workloads/ddlb/ddlb.py for error detection refinement

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/cloudai/workloads/ddlb/ddlb.py | 4/5 | Core DDLB test definition with generic error detection ("Error" string may match false positives); success validation checks for "Benchmark Results" |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Slurm command generation for DDLB; returns static test command; success check validates "Benchmark Results" in output |

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant SlurmSystem
    participant DockerImage
    participant OutputFile

    User->>Registry: Register DDLB workload
    Registry->>Registry: Add DDLBTestDefinition
    Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy
    
    User->>DDLBTestDefinition: Create test with docker_image_url
    DDLBTestDefinition->>DockerImage: Initialize DockerImage(url)
    
    User->>DDLBTestSlurmCommandGenStrategy: Generate test command
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
    DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return image path
    DDLBTestSlurmCommandGenStrategy->>SlurmSystem: Generate srun command with container
    DDLBTestSlurmCommandGenStrategy-->>User: Return ["python scripts/run_benchmark.py"]
    
    User->>SlurmSystem: Execute test via Slurm
    SlurmSystem->>OutputFile: Write stdout.txt
    
    User->>DDLBTestDefinition: Check was_run_successful()
    DDLBTestDefinition->>OutputFile: Read stdout.txt
    alt Contains "Error"
        DDLBTestDefinition-->>User: JobStatusResult(False, error details)
    else Missing "Benchmark Results"
        DDLBTestDefinition-->>User: JobStatusResult(False, missing indicators)
    else Success
        DDLBTestDefinition-->>User: JobStatusResult(True)
    end

6 files reviewed, 1 comment


content = file.read()

# Check for specific error patterns
if "Error" in content:

style: the generic "Error" check may produce false positives (e.g., "Error rate: 0%", "Error handling initialized"). Consider a more specific pattern like "Error:" or actual DDLB error messages.
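One way to tighten the check, sketched here with illustrative patterns (the real DDLB error strings would need to be confirmed from the benchmark's output):

```python
import re

# Match an anchored "Error:"/"FATAL ERROR:" marker at the start of a line,
# optionally after a single "prefix:" token, instead of the bare substring
# "Error". The patterns are illustrative, not taken from DDLB output.
ERROR_RE = re.compile(r"^(?:\S+:\s*)?(?:Error|FATAL ERROR):", re.MULTILINE)

def has_error(content: str) -> bool:
    return ERROR_RE.search(content) is not None
```

This still flags lines like "Error: out of memory" but ignores "Error rate: 0%" and "Error handling initialized".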

@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

adds DDLB (Distributed Deep Learning Benchmark) workload support following the existing CloudAI workload pattern with test definition, Slurm command generation strategy, and configuration files.

Key changes:

  • registered DDLB workload in src/cloudai/registration.py
  • created DDLBTestDefinition with Docker image management and success validation based on "Benchmark Results" pattern
  • implemented DDLBTestSlurmCommandGenStrategy to generate mpirun commands
  • added test configuration (conf/common/test/ddlb_test.toml) and test scenario

Issues found:

  • critical command generation bug in slurm_command_gen_strategy.py:36 that produces malformed commands
  • unused imports in ddlb.py

Confidence Score: 2/5

  • critical bug in command generation will cause runtime failures when executing DDLB tests
  • the generate_test_command method in slurm_command_gen_strategy.py:36 constructs a malformed command list with "mpirun -np " (trailing space) as a single element, which when joined with spaces produces "mpirun -np  8 python..." (note the double space). This breaks command parsing and will cause test execution failures
  • src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate fix to command generation logic

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| conf/common/test/ddlb_test.toml | 4/5 | configuration file with hardcoded path; standard structure matches other test configs |
| src/cloudai/workloads/ddlb/ddlb.py | 3/5 | test definition with unused imports; follows established patterns for workload definitions |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | command generation with critical bug in generate_test_command list structure (line 36) that will produce a malformed command |

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant DDLBTestDefinition
    participant SlurmCommandGenStrategy
    participant DockerImage
    participant SlurmSystem

    User->>Registry: register DDLB workload
    Registry->>DDLBTestDefinition: register test definition
    Registry->>SlurmCommandGenStrategy: register command gen strategy
    
    User->>DDLBTestDefinition: load test config (ddlb_test.toml)
    DDLBTestDefinition->>DockerImage: initialize docker_image from docker_image_url
    
    User->>SlurmCommandGenStrategy: generate execution command
    SlurmCommandGenStrategy->>DDLBTestDefinition: get test definition
    DDLBTestDefinition->>DockerImage: get installed_path
    SlurmCommandGenStrategy->>SlurmCommandGenStrategy: generate_test_command() -> ["mpirun -np ", "8", "python scripts/run_benchmark.py"]
    SlurmCommandGenStrategy->>SlurmSystem: create sbatch script with srun command
    
    User->>SlurmSystem: execute job
    SlurmSystem->>SlurmSystem: run mpirun with DDLB benchmark
    SlurmSystem-->>User: output to stdout.txt
    
    User->>DDLBTestDefinition: was_run_successful(test_run)
    DDLBTestDefinition->>DDLBTestDefinition: check stdout.txt for "Error" or "Benchmark Results"
    DDLBTestDefinition-->>User: JobStatusResult

3 files reviewed, 2 comments


Comment on lines 36 to 38

    cmd = ["mpirun -np "]
    cmd.append(str(tdef.cmd_args.np))
    cmd.append("python scripts/run_benchmark.py")

logic: incorrect command list structure. "mpirun -np " with a trailing space will produce a malformed command when joined: applying " ".join() yields "mpirun -np  8 python scripts/run_benchmark.py" (note the double space). Each element should be a complete token.

Suggested change:

    cmd = ["mpirun", "-np"]
    cmd.append(str(tdef.cmd_args.np))
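The effect is easy to reproduce in isolation (the np value of 8 is hypothetical):

```python
# Broken: "mpirun -np " is one element with a trailing space, so joining
# with " " leaves a double space before the process count.
broken = ["mpirun -np ", str(8), "python scripts/run_benchmark.py"]
assert " ".join(broken) == "mpirun -np  8 python scripts/run_benchmark.py"

# Fixed: each element is a complete token, so the join is well-formed.
fixed = ["mpirun", "-np", str(8), "python", "scripts/run_benchmark.py"]
assert " ".join(fixed) == "mpirun -np 8 python scripts/run_benchmark.py"
```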

    # See the License for the specific language governing permissions and
    # limitations under the License.

    from typing import Literal, Optional, Union

style: Literal and Union are imported but never used.

Suggested change:

    from typing import Optional

This reverts commit eda5d0e.
@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI, following the existing pattern for workload registration with Slurm systems.

Key Changes

  • Added DDLB test definition with Docker image support and job success validation
  • Implemented Slurm command generation strategy for DDLB workloads
  • Created configuration files for test and test scenario definitions
  • Registered DDLB workload in the global registry alongside other workloads like NCCL and UCC

Issues Identified

  • Critical: Duplicate error checking logic in ddlb.py:59-68 makes second condition unreachable
  • Unused imports (Literal, Union) in ddlb.py:17
  • Unused tdef variable in slurm_command_gen_strategy.py:35
  • Generic "Error" pattern may cause false positives
  • Potential None handling issue in image_path() when installed_path is None

Confidence Score: 2/5

  • Not safe to merge - contains critical logic bug that prevents proper error detection
  • The duplicate error check at lines 59-68 in ddlb.py creates unreachable code that will prevent the success indicator check from ever executing. This is a critical bug that breaks the test validation logic. Additionally, the generic "Error" pattern is prone to false positives, and the image_path() method may return string "None" instead of handling None properly.
  • src/cloudai/workloads/ddlb/ddlb.py requires immediate attention due to unreachable code, and slurm_command_gen_strategy.py needs review for None handling

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/cloudai/workloads/ddlb/ddlb.py | 2/5 | DDLB test definition with critical logic error in duplicate error checking (lines 59-68 unreachable), unused imports, and overly generic error pattern matching |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 3/5 | Command generation strategy with unused tdef variable and potential None handling issue in image_path() method |

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant TestRunner
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant SlurmSystem
    participant DockerImage

    User->>Registry: Register DDLB workload
    Registry->>Registry: Add DDLBTestDefinition
    Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy

    User->>TestRunner: Execute DDLB test
    TestRunner->>DDLBTestDefinition: Load test configuration
    DDLBTestDefinition->>DockerImage: Initialize docker_image from URL
    
    TestRunner->>DDLBTestSlurmCommandGenStrategy: Generate command
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
    DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Generate test command
    DDLBTestSlurmCommandGenStrategy-->>TestRunner: Return ["python scripts/run_benchmark.py"]
    
    TestRunner->>SlurmSystem: Submit job with srun command
    SlurmSystem-->>TestRunner: Job execution
    
    TestRunner->>DDLBTestDefinition: was_run_successful()
    DDLBTestDefinition->>DDLBTestDefinition: Read stdout.txt
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" pattern
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
    DDLBTestDefinition-->>TestRunner: Return JobStatusResult

3 files reviewed, no comments

