Conversation

@AgrawalAmey commented Mar 12, 2025

Summary

This PR adds a new test category that runs Ray applications with Slurm.

Tested with:

cloudai dry-run \
    --system-config conf/common/system/example_slurm_cluster.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/common/test_scenario/slurm_ray_container.toml

Generated sbatch file:

#!/bin/bash
#SBATCH --job-name=TestTemplate_20250312_013021
#SBATCH -N 2
#SBATCH --output=results/slurm_ray_container_example_2025-03-12_01-30-21/Tests.1/0/stdout.txt
#SBATCH --error=results/slurm_ray_container_example_2025-03-12_01-30-21/Tests.1/0/stderr.txt
#SBATCH --partition=partition_1
#SBATCH --gpus-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:00:00
#SBATCH --tasks-per-node=2
#SBATCH --exclusive

export SLURM_JOB_MASTER_NODE=$(scontrol show hostname $SLURM_JOB_NODELIST | head -n 1)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export MELLANOX_VISIBLE_DEVICES=0,3,4,5,6,9,10,11
export NCCL_IB_GID_INDEX=3
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_IB_TIMEOUT=20
srun --mpi=pmix --container-image=vllm/vllm-openai:latest --container-mounts=/home/aagrawal360/repos/cloudai/results/slurm_ray_container_example_2025-03-12_01-30-21/Tests.1/0:/cloudai_run_results --no-container-mount-home --output=/home/aagrawal360/repos/cloudai/results/slurm_ray_container_example_2025-03-12_01-30-21/Tests.1/0/mapping-stdout.txt --error=/home/aagrawal360/repos/cloudai/results/slurm_ray_container_example_2025-03-12_01-30-21/Tests.1/0/mapping-stderr.txt bash -c "echo \$(date): \$(hostname):node \${SLURM_NODEID}:rank \${SLURM_PROCID}."
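
# Node discovery (assumed here; this follows the standard Ray-on-Slurm pattern
# and defines nodes_array, head_node, and head_node_ip used below):
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)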

port=6379
ip_head=$head_node_ip:$port
export ip_head
echo "IP Head: $ip_head"

echo "Starting HEAD at $head_node"
srun --mpi=pmix --container-image=vllm/vllm-openai:latest --container-mounts=/home/aagrawal360/repos/cloudai/results/slurm_ray_container_example_2025-03-12_01-30-21/Tests.1/0:/cloudai_run_results --no-container-mount-home --nodes=1 --ntasks=1 -w "$head_node" \
     \
    ray start --head --node-ip-address="$head_node_ip" --port=$port \
    --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block &

# optional, though may be useful in certain versions of Ray < 1.0.
sleep 10

# number of nodes other than the head node
worker_num=$((SLURM_JOB_NUM_NODES - 1))

for ((i = 1; i <= worker_num; i++)); do
    node_i=${nodes_array[$i]}
    echo "Starting WORKER $i at $node_i"
    srun --mpi=pmix --container-image=vllm/vllm-openai:latest --container-mounts=/home/aagrawal360/repos/cloudai/results/slurm_ray_container_example_2025-03-12_01-30-21/Tests.1/0:/cloudai_run_results --no-container-mount-home --nodes=1 --ntasks=1 -w "$node_i" \
         \
        ray start --address "$ip_head" \
        --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_TASK}" --block &
    sleep 5
done

srun --mpi=pmix --container-image=vllm/vllm-openai:latest --container-mounts=/home/aagrawal360/repos/cloudai/results/slurm_ray_container_example_2025-03-12_01-30-21/Tests.1/0:/cloudai_run_results --no-container-mount-home --nodes=1 --ntasks=1 \
  -w "$head_node" --gpus-per-node=0 \
   \
  python3 examples/offline_inference/llm_engine_example.py -tp 8 -pp 2

@TaekyungHeo marked this pull request as draft March 13, 2025 00:44

Contributor
@amaslenn left a comment

@AgrawalAmey thanks a lot for your contribution! And sorry for the late feedback.

@@ -0,0 +1,23 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Contributor

For new files, please set a single year value (the diagnostic we have today is misleading).


def _get_sbatch_directives(self, args: Dict[str, Any], output_path: Path) -> Dict[str, str]:
sbatch_directives = super()._get_sbatch_directives(args, output_path)
# TODO(Amey): We probably need to figure out what to do with cpus-per-task, mem-per-cpu

Contributor

This can be set with SlurmSystem.extra_sbatch_args. The downside is that it is set per System, so all tests in a scenario will have it.

Author

Basically, I want this to be dynamic, as a fraction of the total resources.

Author

Since we have to set tasks per worker to 1 for Ray, we need to ensure that all the resources are made available to that process.
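
To make "dynamic, as a fraction of total resources" concrete, here is a minimal standalone sketch of deriving the directives from the node's totals instead of hard-coding cpus-per-task/mem-per-cpu; the function name and the resource numbers are illustrative, not part of this PR:

from typing import Dict


def ray_sbatch_directives(cpus_per_node: int, mem_per_node_mb: int, base: Dict[str, str]) -> Dict[str, str]:
    """Give Ray's single task per node the whole node's resources."""
    directives = dict(base)
    # Ray manages workers itself, so Slurm should schedule exactly one task
    # per node and hand that task every CPU and all of the memory.
    directives["ntasks-per-node"] = "1"
    directives["cpus-per-task"] = str(cpus_per_node)
    directives["mem"] = f"{mem_per_node_mb}M"
    return directives


# Example with illustrative per-node totals (128 CPUs, ~1 TB RAM):
print(ray_sbatch_directives(128, 1_000_000, {"partition": "partition_1"}))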

template_path = script_dir / "slurm_ray_container_template.sh.jinja"
template = Template(template_path.read_text())

conda_activate_command = f"conda activate {tdef.cmd_args.conda_env} && " if tdef.cmd_args.conda_env else ""

Contributor

Please help me understand this part. Isn't the environment for Ray already available inside the container? Why is this extra env needed?

In CloudAI we have a concept of installables: items that should be "installed" before a run (done with cloudai install ...). Examples: docker images, git repos with Python scripts (in this case we can create a venv for them), etc. Repos can be mounted into a container to make their files available.

Author

Essentially, this is supposed to be an optional parameter to activate a specific environment if required. For instance, in the Vajra nightly perf test container, we have multiple envs for vllm, vajra, sglang, etc.
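
For context, a minimal sketch of how such an optional parameter could be modeled and applied; the class and field names are illustrative, not the PR's actual definitions:

from typing import Optional

from pydantic import BaseModel


class RayContainerCmdArgs(BaseModel):
    """Illustrative cmd_args model with an optional conda environment."""

    conda_env: Optional[str] = None  # e.g. "vajra", "vllm", "sglang"


def launch_prefix(cmd_args: RayContainerCmdArgs) -> str:
    # Only prepend the activation when an environment was requested.
    return f"conda activate {cmd_args.conda_env} && " if cmd_args.conda_env else ""


print(launch_prefix(RayContainerCmdArgs(conda_env="vajra")))  # "conda activate vajra && "
print(launch_prefix(RayContainerCmdArgs()))                   # ""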

Contributor

I'm concerned that SlurmRayContainer is becoming too Vajra-specific. This shouldn't be a blocker, but it would be great if we can generalize it. I don't have a good idea so far.

),
SlurmContainerCommandGenStrategy,
),
"slurm_ray_container": lambda: create_test_run(

Contributor

Please also update fixture.params for this one, otherwise this case will not run.
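
For reference, a minimal sketch of why this matters; the factory dict and test below are placeholders, not the repository's actual test code:

import pytest

# Placeholder factories standing in for the real create_test_run(...) lambdas.
_FACTORIES = {
    "slurm_container": lambda: "slurm_container test run",
    "slurm_ray_container": lambda: "slurm_ray_container test run",
}


# A new key in the dict alone is never exercised; it must also appear in the
# fixture's params for pytest to generate a test case for it.
@pytest.fixture(params=list(_FACTORIES))
def test_run(request):
    return _FACTORIES[request.param]()


def test_case_is_parametrized(test_run):
    assert "test run" in test_run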
