Merged

Changes from 6 commits
3 changes: 3 additions & 0 deletions docker/.env.base
@@ -17,6 +17,9 @@ DOCKER_USER_HOME=/root
# Cluster specific settings
###

# Job scheduler used by the cluster.
# Currently supports PBS and SLURM.
CLUSTER_JOB_SCHEDULER=PBS
# Docker cache dir for Isaac Sim (has to end on docker-isaac-sim)
# e.g. /cluster/scratch/$USER/docker-isaac-sim
CLUSTER_ISAAC_SIM_CACHE_DIR=/some/path/on/cluster/docker-isaac-sim
45 changes: 37 additions & 8 deletions docker/cluster/submit_job.sh
@@ -1,11 +1,14 @@
#!/usr/bin/env bash

# in the case you need to load specific modules on the cluster, add them here
# e.g., `module load eth_proxy`
# In case you need to load specific modules on the cluster, add them here
# e.g., `module load eth_proxy` or `ml go-1.19.4/apptainer-1.1.8`

# create job script with compute demands
### MODIFY HERE FOR YOUR JOB ###
cat <<EOT > job.sh
scheduler="$1"
cluster_isaaclab_dir="$2"
container_profile="$3"

if [ "$scheduler" == "SLURM" ]; then
cat <<'EOT' >> job.sh
#!/bin/bash

#SBATCH -n 1
@@ -15,11 +18,37 @@ cat <<EOT > job.sh
#SBATCH --mem-per-cpu=4048
#SBATCH --mail-type=END
#SBATCH --mail-user=name@mail
#SBATCH --job-name="training-$(date +"%Y-%m-%dT%H:%M")"
#SBATCH --job-name=training-$(date +"%Y-%m-%dT%H:%M")
EOT
elif [ "$scheduler" == "PBS" ]; then
cat <<'EOT' >> job.sh
#!/bin/bash

#PBS -l select=1:ncpus=8:mpiprocs=1:ngpus=1
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -q gpu
#PBS -N isaaclab
#PBS -m bea -M "user@mail"
EOT
fi

cat <<EOT >> job.sh

# Pass the container profile first to run_singularity.sh, then all arguments intended for the executed script
sh "$1/docker/cluster/run_singularity.sh" "$2" "${@:3}"
sh "$cluster_isaaclab_dir/docker/cluster/run_singularity.sh" "$container_profile" "${@:4}"
EOT

sbatch < job.sh
# Submit the job
if [ "$scheduler" == "SLURM" ]; then
sbatch < job.sh
elif [ "$scheduler" == "PBS" ]; then
qsub job.sh
else
echo "Invalid job scheduler. Available options [SLURM/PBS]."
rm job.sh
exit 1
fi

# Clean up the job script after submission
rm job.sh
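Note that the script above mixes heredoc delimiter styles deliberately: the quoted `<<'EOT'` blocks write the scheduler directives into `job.sh` verbatim, while the final unquoted `<<EOT` block expands `$cluster_isaaclab_dir`, `$container_profile`, and `${@:4}` at the moment `job.sh` is generated. A minimal sketch of the difference (file names are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: quoted vs. unquoted heredoc delimiters, as used in submit_job.sh.
name="demo"

# Quoted delimiter: the body is written out literally, no expansion.
cat <<'EOT' > literal.txt
$name
EOT

# Unquoted delimiter: variables are expanded when the heredoc is written.
cat <<EOT > expanded.txt
$name
EOT

cat literal.txt    # prints: $name
cat expanded.txt   # prints: demo
```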
29 changes: 22 additions & 7 deletions docker/container.sh
@@ -231,6 +231,25 @@ x11_cleanup() {
fi
}

submit_job() {

case $CLUSTER_JOB_SCHEDULER in
"SLURM")
CMD="sbatch"
;;
"PBS")
CMD="bash"
;;
*)
echo "[ERROR] Unsupported job scheduler specified: $CLUSTER_JOB_SCHEDULER"
exit 1
;;
esac

echo "[INFO] Arguments passed to job script ${@}"
ssh $CLUSTER_LOGIN "cd $CLUSTER_ISAACLAB_DIR && $CMD $CLUSTER_ISAACLAB_DIR/docker/cluster/submit_job.sh \"$CLUSTER_JOB_SCHEDULER\" \"$CLUSTER_ISAACLAB_DIR\" \"isaac-lab-$container_profile\" ${@}"
}
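The case-based dispatch inside `submit_job()` can be exercised on its own; the following is a local sketch of that mapping (a hypothetical helper, not Isaac Lab's API — for PBS the job script is run with `bash` because it calls `qsub` itself):

```shell
#!/usr/bin/env bash
# Sketch of the dispatch in submit_job(): map the scheduler name to the
# command used to launch the generated job script on the login node.
scheduler_cmd() {
    case "$1" in
        SLURM) echo "sbatch" ;;
        PBS)   echo "bash" ;;
        *)     echo "[ERROR] Unsupported job scheduler specified: $1" >&2
               return 1 ;;
    esac
}

scheduler_cmd SLURM   # prints: sbatch
scheduler_cmd PBS     # prints: bash
```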

#==
# Main
#==
@@ -366,18 +385,14 @@ case $mode in
ssh $CLUSTER_LOGIN "mkdir -p $CLUSTER_ISAACLAB_DIR"
# Sync Isaac Lab code
echo "[INFO] Syncing Isaac Lab code..."
rsync -rh --exclude="*.git*" --filter=':- .dockerignore' /$SCRIPT_DIR/.. $CLUSTER_LOGIN:$CLUSTER_ISAACLAB_DIR
rsync -rh --exclude="*.git*" --exclude="wandb/" --filter=':- .dockerignore' /$SCRIPT_DIR/.. $CLUSTER_LOGIN:$CLUSTER_ISAACLAB_DIR
# execute job script
echo "[INFO] Executing job script..."
# check whether the second argument is a profile or a job argument
if [ "$profile_arg" == "$container_profile" ] ; then
# if the second argument is a profile, we have to shift the arguments
echo "[INFO] Arguments passed to job script ${@:3}"
ssh $CLUSTER_LOGIN "cd $CLUSTER_ISAACLAB_DIR && sbatch $CLUSTER_ISAACLAB_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ISAACLAB_DIR" "isaac-lab-$container_profile" "${@:3}"
submit_job "${@:3}"
else
# if the second argument is a job argument, we have to shift only one argument
echo "[INFO] Arguments passed to job script ${@:2}"
ssh $CLUSTER_LOGIN "cd $CLUSTER_ISAACLAB_DIR && sbatch $CLUSTER_ISAACLAB_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ISAACLAB_DIR" "isaac-lab-$container_profile" "${@:2}"
submit_job "${@:2}"
fi
;;
config)
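The `--exclude` and `--filter` flags control what reaches the cluster; the newly added `--exclude="wandb/"` keeps experiment logs out of the sync. A local sketch of the same exclude behaviour, on a throwaway tree rather than the actual Isaac Lab paths:

```shell
#!/usr/bin/env bash
# Sketch: replicate the exclude behaviour of the cluster sync step.
# Directory names are illustrative, not Isaac Lab paths.
mkdir -p src/.git src/wandb src/code
echo "keep" > src/code/file.txt
echo "skip" > src/wandb/run.log

# -r: recurse, -h: human-readable sizes; skip git metadata and wandb logs.
rsync -rh --exclude="*.git*" --exclude="wandb/" src/ dst/

# dst/ now contains only code/, with .git and wandb/ filtered out.
```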
32 changes: 26 additions & 6 deletions docs/source/deployment/cluster.rst
@@ -17,7 +17,8 @@ convert the Isaac Lab Docker image into a singularity image and use it to submit
.. attention::

Cluster setup varies across different institutions. The following instructions have been
tested on the `ETH Zurich Euler`_ cluster (which uses the SLURM workload manager) and the
IIT Genoa Franklin cluster (which uses the PBS workload manager).

The instructions may need to be adapted for other clusters. If you have successfully
adapted the instructions for another cluster, please consider contributing to the
@@ -59,7 +60,9 @@ Configuring the cluster parameters

First, you need to configure the cluster-specific parameters in ``docker/.env.base`` file.
The following describes the parameters that need to be configured:

- ``CLUSTER_JOB_SCHEDULER``:
The job scheduler/workload manager used by your cluster. Currently, the SLURM and PBS
workload managers are supported (valid values: ``SLURM`` or ``PBS``).
- ``CLUSTER_ISAAC_SIM_CACHE_DIR``:
The directory on the cluster where the Isaac Sim cache is stored. This directory
has to end on ``docker-isaac-sim``. This directory will be copied to the compute node
@@ -108,8 +111,8 @@ specified, the default profile ``base`` will be used.
Job Submission and Execution
----------------------------

Defining the job parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Defining the job parameters (SLURM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The job parameters are defined inside the ``docker/cluster/submit_job.sh``.
A typical SLURM operation requires specifying the number of CPUs and GPUs, the memory, and
@@ -119,9 +122,9 @@ The default configuration is as follows:

.. literalinclude:: ../../../docker/cluster/submit_job.sh
:language: bash
:lines: 12-19
:lines: 14-22
:linenos:
:lineno-start: 12
:lineno-start: 14

An essential requirement for the cluster is that the compute node has access to the internet at all times.
This is required to load assets from the Nucleus server. For some cluster architectures, extra modules
@@ -136,6 +139,22 @@ by adding the following line to the ``submit_job.sh`` script:
:linenos:
:lineno-start: 3

Defining the job parameters (PBS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The job parameters are defined inside the ``docker/cluster/submit_job.sh``.
A typical PBS submission requires specifying the number of CPUs and GPUs, the queue, and the
walltime limit. For more information, please check the `PBS Official Site`_.

The default configuration is as follows:

.. literalinclude:: ../../../docker/cluster/submit_job.sh
:language: bash
:lines: 27-33
:linenos:
:lineno-start: 27
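As with the SLURM block, these directives can be adapted to your cluster. For example, a job requesting two GPUs and a longer walltime might look like the following (the queue name and resource-selection syntax are illustrative and cluster-specific):

```shell
#PBS -l select=1:ncpus=16:mpiprocs=1:ngpus=2
#PBS -l walltime=04:00:00
#PBS -j oe
#PBS -q gpu
#PBS -N isaaclab-long
#PBS -m bea -M "user@mail"
```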


Submitting a job
~~~~~~~~~~~~~~~~

@@ -173,6 +192,7 @@ The above will, in addition, also render videos of the training progress and sto

.. _Singularity: https://docs.sylabs.io/guides/2.6/user-guide/index.html
.. _ETH Zurich Euler: https://scicomp.ethz.ch/wiki/Euler
.. _PBS Official Site: https://openpbs.org/
.. _apptainer: https://apptainer.org/
.. _documentation: https://www.apptainer.org/docs/admin/main/installation.html#install-ubuntu-packages
.. _SLURM documentation: https://www.slurm.schedmd.com/sbatch.html