
Conversation

@nsarka (Member) commented Oct 16, 2025

DDLB workload integration

@amaslenn (Contributor) left a comment


Thank you for your contribution!

Please also:

  1. Extend test_acceptance.py to cover the sbatch generation logic.
  2. Add a documentation page for this workload (see doc/workloads for examples) and link it from the main page.


    def generate_test_command(self) -> List[str]:
        tdef: DDLBTestDefinition = cast(DDLBTestDefinition, self.test_run.test.test_definition)
        srun_command_parts = ["python scripts/run_benchmark.py"]
Contributor:

Is it safe to use a relative path? We could introduce a field in the test definition for this workload to hold path_to_script.
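A minimal sketch of that suggestion (a plain dataclass stands in for the real test-definition model; the field name follows the comment and the default mirrors the currently hardcoded path):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DDLBCmdArgs:
    # Hypothetical field per the review suggestion; defaults to the
    # relative path the command generator currently hardcodes.
    path_to_script: str = "scripts/run_benchmark.py"

def generate_test_command(args: DDLBCmdArgs) -> List[str]:
    # The script path now comes from the test definition instead of a literal.
    return [f"python {args.path_to_script}"]
```

An absolute path could then be set per-system in the test TOML without touching the strategy code.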

Member Author:
In the container, this relative path is available from the default working directory. I think it's OK; I can update it if the default path in the container changes.

@greptile-apps (bot) left a comment

6 files reviewed, no comments


@nsarka (Member, Author) commented Oct 22, 2025

Thanks for the review. I will update with these changes.

In the meantime, when I tried running this change, I found that srun hangs when I use --container-image. Do you have any ideas on how to troubleshoot this? @amaslenn

Here is the output of a manual run:

    srun --export=ALL --mpi=pmix --container-image=gitlab-master.nvidia.com/dl/pytorch/update-scripts:pjnl-latest hostname
    pyxis: importing docker image ...
    pyxis: importing docker image ...

I figured since the container is ~9 GB, I should wait a little bit. But it's been about 4 hours, so I think it's safe to assume it's a hang.

@amaslenn (Contributor) replied:

> Thanks for the review. I will update with these changes.
>
> In the meantime, when I tried running this change, I found that srun hangs when I use --container-image. Do you have any ideas on how to troubleshoot this? @amaslenn
>
> Here is the output of a manual run:
>
>     srun --export=ALL --mpi=pmix --container-image=gitlab-master.nvidia.com/dl/pytorch/update-scripts:pjnl-latest hostname
>     pyxis: importing docker image ...
>     pyxis: importing docker image ...
>
> I figured since the container is ~9 GB, I should wait a little bit. But it's been about 4 hours, so I think it's safe to assume it's a hang.

Depending on the system it can take some time, but 4h for 9 GB is too much.

Have you tried enabling local caching in system with cache_docker_images_locally = true (https://nvidia.github.io/cloudai/USER_GUIDE.html#step-4-system-configuration) and running cloudai install? This will run srun ... enroot ... to cache the image explicitly. Once done, subsequent runs will use the .sqsh file instead of pulling the image every time.
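For reference, a sketch of the system-config toggle mentioned above (only cache_docker_images_locally is taken from the linked user guide; the surrounding fields are illustrative placeholders):

```toml
# System config sketch; fields other than cache_docker_images_locally
# are illustrative placeholders.
name = "my-cluster"
scheduler = "slurm"
cache_docker_images_locally = true
```

With this set, cloudai install caches each image as a .sqsh once, and later runs reuse it instead of pulling.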

@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent changes address previously raised issues about copyright dates and commented code. The developer has updated copyright headers in newly added files to use only "2025" (instead of "2024-2025") and removed commented pre_test and post_test lines from the test scenario configuration, streamlining the DDLB integration. These are minor cleanup changes that improve code consistency with project conventions.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| conf/common/test_scenario/ddlb_test.toml | 5/5 | Removed commented pre_test and post_test hook lines, leaving a clean minimal configuration |
| src/cloudai/workloads/ddlb/__init__.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/registration.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Updated copyright year from "2024-2025" to "2025" only |
| conf/common/test/ddlb_test.toml | 5/5 | Updated copyright year from "2024-2025" to "2025" only |

Confidence score: 5/5

  • These changes are safe to merge as they only address formatting and consistency issues raised in previous reviews
  • The score reflects that these are purely cosmetic/metadata changes with no functional impact on code behavior
  • No files require special attention; all changes are straightforward corrections to copyright headers and removal of commented placeholder code

6 files reviewed, 7 comments


@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

This review covers changes made to the DDLB workload since the last review, not the entire PR. The developer addressed most of the critical feedback from prior reviews by fixing the unreachable code bug, removing dead configuration comments, simplifying validation logic, and standardizing copyright headers to "2025" for newly added files. The key fix removes the duplicate "Error" check in ddlb.py lines 58/68 that made success validation unreachable, and eliminates the unused missing_indicators list. PEP 8 formatting was also corrected. The test scenario timeout was extended from 10 to 30 minutes to allow DDLB benchmarks to complete. These changes clean up the DDLB integration while addressing previously flagged code quality issues.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| src/cloudai/workloads/ddlb/__init__.py | 5/5 | Updated copyright year from "2024-2025" to "2025" (administrative only) |
| conf/common/test_scenario/ddlb_test.toml | 4.5/5 | Extended test time limit from 10 to 30 minutes and removed dead commented-out fields |
| src/cloudai/workloads/ddlb/ddlb.py | 5/5 | Fixed critical duplicate error check bug making validation unreachable; removed unused missing_indicators list |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 4/5 | Simplified command generation by removing intermediate variable; copyright updated |

Confidence score: 4/5

  • This PR addresses critical bugs but one code smell remains that should be resolved before merging
  • Score reflects that the duplicate error check bug was fixed and copyright headers were standardized, but the unused tdef variable in slurm_command_gen_strategy.py still exists from prior reviews, and concerns about the relative path safety raised in previous review ("Is it safe to use relative path? We can introduce a field in the test definition for this workload to hold path_to_script.") remain unaddressed
  • Review slurm_command_gen_strategy.py carefully—the unused tdef variable suggests the test definition may need to be used for configuration in the future, and the hardcoded relative path "scripts/run_benchmark.py" may cause failures if executed from unexpected working directories

4 files reviewed, no comments


@nsarka (Member, Author) commented Oct 27, 2025

> > Thanks for the review. I will update with these changes.
> > In the meantime, when I tried running this change, I found that srun hangs when I use --container-image. Do you have any ideas on how to troubleshoot this? @amaslenn
> > Here is the output of a manual run:
> >
> >     srun --export=ALL --mpi=pmix --container-image=gitlab-master.nvidia.com/dl/pytorch/update-scripts:pjnl-latest hostname
> >     pyxis: importing docker image ...
> >     pyxis: importing docker image ...
> >
> > I figured since the container is ~9 GB, I should wait a little bit. But it's been about 4 hours, so I think it's safe to assume it's a hang.
>
> Depending on the system it can take some time, but 4h for 9 GB is too much.
>
> Have you tried enabling local caching in system with cache_docker_images_locally = true (https://nvidia.github.io/cloudai/USER_GUIDE.html#step-4-system-configuration) and running cloudai install? This will run srun ... enroot ... to cache the image explicitly. Once done, subsequent runs will use the .sqsh file instead of pulling the image every time.

Thanks. I opted to try it on another cluster, and it failed there too with:

    slurmstepd: error: pyxis:     [INFO] Creating squashfs filesystem...
    slurmstepd: error: pyxis:     Write failed because No space left on device
    slurmstepd: error: pyxis:     FATAL ERROR: Failed to write to output filesystem
    slurmstepd: error: pyxis:     Parallel mksquashfs: Using 32 processors
    slurmstepd: error: pyxis:     Creating 4.0 filesystem on /run/pyxis/47469/846367.2.squashfs, block size 131072.

It seems like the container is too big to convert to a .sqsh file with the scratch space available in enroot's tmpfs. Is there a way to pass the .sqsh file directly to CloudAI without caching? I want to make the .sqsh file and copy it over to the machine I'm testing on.

@amaslenn (Contributor) replied:

> Thanks. I opted to try it on another cluster, and it failed there too with:
>
>     slurmstepd: error: pyxis:     [INFO] Creating squashfs filesystem...
>     slurmstepd: error: pyxis:     Write failed because No space left on device
>     slurmstepd: error: pyxis:     FATAL ERROR: Failed to write to output filesystem
>     slurmstepd: error: pyxis:     Parallel mksquashfs: Using 32 processors
>     slurmstepd: error: pyxis:     Creating 4.0 filesystem on /run/pyxis/47469/846367.2.squashfs, block size 131072.
>
> It seems like the container is too big to convert to a .sqsh file with the scratch space available in enroot's tmpfs. Is there a way to pass the .sqsh file directly to CloudAI without caching? I want to make the .sqsh file and copy it over to the machine I'm testing on.

If this image is too big, how will you create the .sqsh file?
You can specify a local file in the docker_image_url field; it will bypass enroot.
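As a concrete sketch (the local path is hypothetical, and the enroot import invocation shows one way to build the .sqsh on a machine with enough scratch space; neither is from this PR):

```toml
# Sketch: point docker_image_url at a pre-built .sqsh file to bypass enroot.
# The path below is a hypothetical location on the test machine.
test_template_name = "DDLBTest"

[cmd_args]
docker_image_url = "/scratch/images/ddlb.sqsh"
```

The .sqsh itself could be built elsewhere, e.g. with enroot import -o ddlb.sqsh docker://gitlab-master.nvidia.com/nsarkauskas/ddlb:latest, and copied over.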

    test_template_name = "DDLBTest"

    [cmd_args]
    docker_image_url = "gitlab-master.nvidia.com/nsarkauskas/ddlb:latest"
Contributor:

We can't use non-public images for common/ configs:

  1. If possible, let's use a public image.
  2. If there is no public image, we can either move this config to another folder or to the CloudAIx repo.

Member Author:

I think in the near future we'll make the image public. Which folder should I move it to until then?

Contributor:

You can use conf/experimental for now. Or create a new one under conf/.

@amaslenn (Contributor) commented:

@nsarka please merge your PR with the latest main branch to align check list.

@nsarka force-pushed the nsarka/ddlb-integration branch from 1171554 to 2bf56e4 on October 31, 2025 at 15:12
@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI. The implementation follows established patterns from other workloads like NCCL and ChakraReplay.

Key Changes:

  • New test definition (DDLBTestDefinition) with Docker image support
  • Slurm command generation strategy that executes python scripts/run_benchmark.py
  • Configuration files for test setup (single node, 30-minute timeout)
  • Success validation checking for "Benchmark Results" in stdout
  • Registration in the main registry alongside other test definitions

Observations:

  • The error detection uses a generic "Error" string check which may produce false positives
  • The implementation is minimal but functional, delegating most logic to the container's benchmark script
  • No unit tests included for the new workload (though other workloads have test coverage)

Confidence Score: 4/5

  • This PR is safe to merge with minor refinements recommended
  • The implementation follows existing patterns closely (NCCL, ChakraReplay) and integrates cleanly into the registry. The main concern is the generic error detection string which could cause false positives. The code is well-structured and mirrors established workload patterns, making it maintainable. No breaking changes or security issues identified.
  • Primary attention needed on src/cloudai/workloads/ddlb/ddlb.py for error detection refinement

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/cloudai/workloads/ddlb/ddlb.py | 4/5 | Core DDLB test definition with generic error detection ("Error" string may match false positives); success validation checks for "Benchmark Results" |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 5/5 | Slurm command generation for DDLB; returns static test command; success check validates "Benchmark Results" in output |

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant SlurmSystem
    participant DockerImage
    participant OutputFile

    User->>Registry: Register DDLB workload
    Registry->>Registry: Add DDLBTestDefinition
    Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy
    
    User->>DDLBTestDefinition: Create test with docker_image_url
    DDLBTestDefinition->>DockerImage: Initialize DockerImage(url)
    
    User->>DDLBTestSlurmCommandGenStrategy: Generate test command
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
    DDLBTestDefinition-->>DDLBTestSlurmCommandGenStrategy: Return image path
    DDLBTestSlurmCommandGenStrategy->>SlurmSystem: Generate srun command with container
    DDLBTestSlurmCommandGenStrategy-->>User: Return ["python scripts/run_benchmark.py"]
    
    User->>SlurmSystem: Execute test via Slurm
    SlurmSystem->>OutputFile: Write stdout.txt
    
    User->>DDLBTestDefinition: Check was_run_successful()
    DDLBTestDefinition->>OutputFile: Read stdout.txt
    alt Contains "Error"
        DDLBTestDefinition-->>User: JobStatusResult(False, error details)
    else Missing "Benchmark Results"
        DDLBTestDefinition-->>User: JobStatusResult(False, missing indicators)
    else Success
        DDLBTestDefinition-->>User: JobStatusResult(True)
    end

6 files reviewed, 1 comment


content = file.read()

# Check for specific error patterns
if "Error" in content:

style: the generic "Error" check may produce false positives (e.g., "Error rate: 0%", "Error handling initialized"). Consider a more specific pattern like "Error:" or actual DDLB error messages.
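One way to tighten the check, sketched here with illustrative patterns (the real DDLB error strings would need to be confirmed from the benchmark's output):

```python
import re

# Match an anchored "Error:"/"FATAL ERROR:" marker at the start of a line,
# optionally after a single "prefix:" token, instead of the bare substring
# "Error". The patterns are illustrative, not taken from DDLB output.
ERROR_RE = re.compile(r"^(?:\S+:\s*)?(?:Error|FATAL ERROR):", re.MULTILINE)

def has_error(content: str) -> bool:
    return ERROR_RE.search(content) is not None
```

This still flags lines like "Error: out of memory" but ignores "Error rate: 0%" and "Error handling initialized".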

@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

adds DDLB (Distributed Deep Learning Benchmark) workload support following the existing CloudAI workload pattern with test definition, Slurm command generation strategy, and configuration files.

Key changes:

  • registered DDLB workload in src/cloudai/registration.py
  • created DDLBTestDefinition with Docker image management and success validation based on "Benchmark Results" pattern
  • implemented DDLBTestSlurmCommandGenStrategy to generate mpirun commands
  • added test configuration (conf/common/test/ddlb_test.toml) and test scenario

Issues found:

  • critical command generation bug in slurm_command_gen_strategy.py:36 that produces malformed commands
  • unused imports in ddlb.py

Confidence Score: 2/5

  • critical bug in command generation will cause runtime failures when executing DDLB tests
  • the generate_test_command method in slurm_command_gen_strategy.py:36 constructs a malformed command list with "mpirun -np " (trailing space) as a single element, which when joined with spaces produces "mpirun -np  8 python..." (note the double space). This breaks command parsing and will cause test execution failures
  • src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py requires immediate fix to command generation logic

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| conf/common/test/ddlb_test.toml | 4/5 | configuration file with hardcoded path; standard structure matches other test configs |
| src/cloudai/workloads/ddlb/ddlb.py | 3/5 | test definition with unused imports; follows established patterns for workload definitions |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 2/5 | command generation with critical bug in generate_test_command list structure (line 36) that will produce a malformed command |

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant DDLBTestDefinition
    participant SlurmCommandGenStrategy
    participant DockerImage
    participant SlurmSystem

    User->>Registry: register DDLB workload
    Registry->>DDLBTestDefinition: register test definition
    Registry->>SlurmCommandGenStrategy: register command gen strategy
    
    User->>DDLBTestDefinition: load test config (ddlb_test.toml)
    DDLBTestDefinition->>DockerImage: initialize docker_image from docker_image_url
    
    User->>SlurmCommandGenStrategy: generate execution command
    SlurmCommandGenStrategy->>DDLBTestDefinition: get test definition
    DDLBTestDefinition->>DockerImage: get installed_path
    SlurmCommandGenStrategy->>SlurmCommandGenStrategy: generate_test_command() -> ["mpirun -np ", "8", "python scripts/run_benchmark.py"]
    SlurmCommandGenStrategy->>SlurmSystem: create sbatch script with srun command
    
    User->>SlurmSystem: execute job
    SlurmSystem->>SlurmSystem: run mpirun with DDLB benchmark
    SlurmSystem-->>User: output to stdout.txt
    
    User->>DDLBTestDefinition: was_run_successful(test_run)
    DDLBTestDefinition->>DDLBTestDefinition: check stdout.txt for "Error" or "Benchmark Results"
    DDLBTestDefinition-->>User: JobStatusResult

3 files reviewed, 2 comments


Comment on lines 36 to 38

    cmd = ["mpirun -np "]
    cmd.append(str(tdef.cmd_args.np))
    cmd.append("python scripts/run_benchmark.py")

logic: incorrect command list structure. "mpirun -np " with a trailing space will produce a malformed command when joined: applying " ".join() yields "mpirun -np  8 python scripts/run_benchmark.py" (note the double space). Each element should be a complete token.

Suggested change:

    cmd = ["mpirun", "-np"]
    cmd.append(str(tdef.cmd_args.np))
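The effect is easy to reproduce in isolation (the np value of 8 is hypothetical):

```python
# Broken: "mpirun -np " is one element with a trailing space, so joining
# with " " leaves a double space before the process count.
broken = ["mpirun -np ", str(8), "python scripts/run_benchmark.py"]
assert " ".join(broken) == "mpirun -np  8 python scripts/run_benchmark.py"

# Fixed: each element is a complete token, so the join is well-formed.
fixed = ["mpirun", "-np", str(8), "python", "scripts/run_benchmark.py"]
assert " ".join(fixed) == "mpirun -np 8 python scripts/run_benchmark.py"
```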

    # See the License for the specific language governing permissions and
    # limitations under the License.

    from typing import Literal, Optional, Union

style: Literal and Union are imported but never used.

Suggested change:

    from typing import Optional

This reverts commit eda5d0e.
@greptile-apps (bot) left a comment

Greptile Overview

Greptile Summary

Adds DDLB (Distributed Deep Learning Benchmark) workload integration to CloudAI, following the existing pattern for workload registration with Slurm systems.

Key Changes

  • Added DDLB test definition with Docker image support and job success validation
  • Implemented Slurm command generation strategy for DDLB workloads
  • Created configuration files for test and test scenario definitions
  • Registered DDLB workload in the global registry alongside other workloads like NCCL and UCC

Issues Identified

  • Critical: Duplicate error checking logic in ddlb.py:59-68 makes second condition unreachable
  • Unused imports (Literal, Union) in ddlb.py:17
  • Unused tdef variable in slurm_command_gen_strategy.py:35
  • Generic "Error" pattern may cause false positives
  • Potential None handling issue in image_path() when installed_path is None

Confidence Score: 2/5

  • Not safe to merge - contains critical logic bug that prevents proper error detection
  • The duplicate error check at lines 59-68 in ddlb.py creates unreachable code that will prevent the success indicator check from ever executing. This is a critical bug that breaks the test validation logic. Additionally, the generic "Error" pattern is prone to false positives, and the image_path() method may return string "None" instead of handling None properly.
  • src/cloudai/workloads/ddlb/ddlb.py requires immediate attention due to unreachable code, and slurm_command_gen_strategy.py needs review for None handling

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/cloudai/workloads/ddlb/ddlb.py | 2/5 | DDLB test definition with critical logic error in duplicate error checking (lines 59-68 unreachable), unused imports, and overly generic error pattern matching |
| src/cloudai/workloads/ddlb/slurm_command_gen_strategy.py | 3/5 | Command generation strategy with unused tdef variable and potential None handling issue in image_path() method |

Sequence Diagram

sequenceDiagram
    participant User
    participant Registry
    participant TestRunner
    participant DDLBTestDefinition
    participant DDLBTestSlurmCommandGenStrategy
    participant SlurmSystem
    participant DockerImage

    User->>Registry: Register DDLB workload
    Registry->>Registry: Add DDLBTestDefinition
    Registry->>Registry: Add DDLBTestSlurmCommandGenStrategy

    User->>TestRunner: Execute DDLB test
    TestRunner->>DDLBTestDefinition: Load test configuration
    DDLBTestDefinition->>DockerImage: Initialize docker_image from URL
    
    TestRunner->>DDLBTestSlurmCommandGenStrategy: Generate command
    DDLBTestSlurmCommandGenStrategy->>DDLBTestDefinition: Get docker_image.installed_path
    DDLBTestSlurmCommandGenStrategy->>DDLBTestSlurmCommandGenStrategy: Generate test command
    DDLBTestSlurmCommandGenStrategy-->>TestRunner: Return ["python scripts/run_benchmark.py"]
    
    TestRunner->>SlurmSystem: Submit job with srun command
    SlurmSystem-->>TestRunner: Job execution
    
    TestRunner->>DDLBTestDefinition: was_run_successful()
    DDLBTestDefinition->>DDLBTestDefinition: Read stdout.txt
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Error" pattern
    DDLBTestDefinition->>DDLBTestDefinition: Check for "Benchmark Results"
    DDLBTestDefinition-->>TestRunner: Return JobStatusResult

3 files reviewed, no comments

