[DRAFT][RAY] Add slurm ray tests #409
Conversation
@AgrawalAmey thanks a lot for your contribution! And sorry for the late feedback.
@@ -0,0 +1,23 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
For new files, please set a single year value (the diagnostic we have today is misleading).
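For illustration only (not part of the diff), a single-year header on a new file could look like the following; the year itself is illustrative:

```python
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```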
def _get_sbatch_directives(self, args: Dict[str, Any], output_path: Path) -> Dict[str, str]:
    sbatch_directives = super()._get_sbatch_directives(args, output_path)
    # TODO(Amey): We probably need to figure out what to do with cpus-per-task, mem-per-cpu
This can be set with SlurmSystem.extra_sbatch_args. The downside is that it is set per System, so all tests in a scenario will have it.
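As a rough editor's sketch (not from the PR) of why the per-System setting is static: whatever is configured there is emitted into every generated sbatch header, so it cannot vary per test or track node resources. The list format of `extra_sbatch_args` shown here is an assumption.

```python
# Assumed shape of SlurmSystem.extra_sbatch_args; the values are made up.
extra_sbatch_args = ["--cpus-per-task=16", "--mem-per-cpu=4G"]

# Every test in the scenario would get the same header lines:
print("\n".join(f"#SBATCH {arg}" for arg in extra_sbatch_args))
```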
Basically, I want this to be dynamic, as a fraction of the total resources.
Since we have to set tasks per worker to 1 for Ray, we need to ensure that all the resources are made available to the process.
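A minimal sketch of the dynamic behaviour being asked for, assuming the node's CPU and memory totals could be queried from the system; these inputs are hypothetical, not existing CloudAI fields:

```python
from typing import Dict


def ray_resource_directives(cpus_per_node: int, mem_per_node_mb: int, fraction: float = 1.0) -> Dict[str, str]:
    """Give the single Ray task on each node a fraction of that node's resources."""
    return {
        "cpus-per-task": str(int(cpus_per_node * fraction)),
        "mem": f"{int(mem_per_node_mb * fraction)}M",
    }


# With ntasks-per-node=1, hand the whole node to the Ray worker process.
print(ray_resource_directives(cpus_per_node=64, mem_per_node_mb=512_000))
```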
template_path = script_dir / "slurm_ray_container_template.sh.jinja"
template = Template(template_path.read_text())

conda_activate_command = f"conda activate {tdef.cmd_args.conda_env} && " if tdef.cmd_args.conda_env else ""
Please help me understand this part: isn't the environment for Ray already available inside the container? Why is this extra env needed?
In CloudAI we have a concept of installables: items that should be "installed" before a run (done with cloudai install ...). Examples: docker images, git repos with Python scripts (in this case we can create a venv for them), etc. Repos can be mounted into a container to make their files available.
Essentially, this is supposed to be an optional parameter to activate a specific environment if required. For instance, in the Vajra nightly perf test container, we have multiple envs for vllm, vajra, sglang etc.
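A worked example of how the optional prefix from the diff above composes; the env name and launch command below are made up:

```python
# Stand-in for tdef.cmd_args.conda_env; an empty string means "no activation".
conda_env = "vajra"
conda_activate_command = f"conda activate {conda_env} && " if conda_env else ""
print(conda_activate_command + "python run_benchmark.py")
# -> conda activate vajra && python run_benchmark.py
```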
I'm concerned that SlurmRayContainer becomes too Vajra-specific. This shouldn't be a blocker, but if we can generalize it, that would be great. I don't have a good idea so far.
src/cloudai/workloads/slurm_ray_container/slurm_ray_container_template.sh.jinja
),
SlurmContainerCommandGenStrategy,
),
"slurm_ray_container": lambda: create_test_run(
Please also update fixture.params for this one, otherwise this case will not run.
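A sketch of what this means, assuming the fixture is parametrized by workload name; the fixture name and the other params here are placeholders, not the real test-suite code:

```python
import pytest


@pytest.fixture(
    params=[
        "slurm_container",
        "slurm_ray_container",  # without this entry the new factory above is never exercised
    ]
)
def test_run(request):
    return request.param
```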
Summary
This PR adds a new test category that runs Ray applications with Slurm.
Tested with:
Generated sbatch file: