Commit dfaeab6

pascal-roth, Ltesfaye, and Mayankm96 authored
Fixes cluster workflow to work with different container profiles (isaac-sim#486)
# Description

Cluster workflow did not work with the different profiles and introduced names. This PR fixes the workflow and in addition, introduces additional checks that the profile can be selected. In detail:

- checks whether a profile can be selected depending on whether a `.env.$container_profile` exists
- allows for `job` to have multiple arguments, also without a profile, for all other options, the second argument has to be the profile
- check if a docker image exists before building the singularity image
- check if the path for the singularity image exists on the cluster, otherwise create it
- check if the path for orbit exists on the cluster, otherwise create it

## Type of change

- Bug fix (non-breaking change which fixes an issue)

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./orbit.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have run all the tests with `./orbit.sh --test` and they pass
- [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

---------

Co-authored-by: Leul Tesfaye <[email protected]>
Co-authored-by: Mayank Mittal <[email protected]>
1 parent 51a5e7c commit dfaeab6

4 files changed: +120 -54 lines changed

docker/cluster/run_singularity.sh

Lines changed: 8 additions & 13 deletions
@@ -1,6 +1,6 @@
 #!/bin/bash

-echo "(run_singularity.py): Called on compute node with arguments $@"
+echo "(run_singularity.py): Called on compute node with container profile $1 and arguments ${@:2}"

 #==
 # Helper functions
@@ -46,19 +46,14 @@ mkdir -p "$CLUSTER_ORBIT_DIR/logs"
 touch "$CLUSTER_ORBIT_DIR/logs/.keep"
 cp -r $CLUSTER_ORBIT_DIR $TMPDIR

-# copy singulary image to the compute node
-folder="$TMPDIR/isaac-sim.sif"
-
-# Check if the folder exists
-if [ -d "$folder" ]; then
-    echo "1 (run_singularity.py): Folder was already copied to local SSD."
-else
-    tar -xf $CLUSTER_SIF_PATH/orbit.tar -C $TMPDIR
-fi
+# copy container to the compute node
+tar -xf $CLUSTER_SIF_PATH/$1.tar -C $TMPDIR

 # execute command in singularity container
+# NOTE: ORBIT_PATH is normally set in `orbit.sh` but we directly call the isaac-sim python because we sync the entire
+# orbit directory to the compute node and remove the symbolic link to isaac-sim
 singularity exec \
-    -B $TMPDIR/docker-isaac-sim/cache/kit:${DOCKER_ISAACSIM_PATH}/kit/cache:rw \
+    -B $TMPDIR/docker-isaac-sim/cache/kit:${DOCKER_ISAACSIM_ROOT_PATH}/kit/cache:rw \
     -B $TMPDIR/docker-isaac-sim/cache/ov:${DOCKER_USER_HOME}/.cache/ov:rw \
     -B $TMPDIR/docker-isaac-sim/cache/pip:${DOCKER_USER_HOME}/.cache/pip:rw \
     -B $TMPDIR/docker-isaac-sim/cache/glcache:${DOCKER_USER_HOME}/.cache/nvidia/GLCache:rw \
@@ -68,8 +63,8 @@ singularity exec \
     -B $TMPDIR/docker-isaac-sim/documents:${DOCKER_USER_HOME}/Documents:rw \
     -B $TMPDIR/orbit:/workspace/orbit:rw \
     -B $CLUSTER_ORBIT_DIR/logs:/workspace/orbit/logs:rw \
-    --nv --writable --containall $TMPDIR/orbit.sif \
-    bash -c "cd /workspace/orbit && /isaac-sim/python.sh ${CLUSTER_PYTHON_EXECUTABLE} $@"
+    --nv --writable --containall $TMPDIR/$1.sif \
+    bash -c "export ORBIT_PATH=/workspace/orbit && cd /workspace/orbit && /isaac-sim/python.sh ${CLUSTER_PYTHON_EXECUTABLE} ${@:2}"

 # copy resulting cache files back to host
 cp -r $TMPDIR/docker-isaac-sim $CLUSTER_ISAAC_SIM_CACHE_DIR/..
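
The change above makes the first positional argument select both the archive to unpack and the image to run. A minimal sketch of that convention follows, assuming the profile arrives as a name such as `orbit-base`. The function `split_profile_and_args` and the variables `SIF_DIR` and `NODE_TMP` (standing in for `CLUSTER_SIF_PATH` and `TMPDIR`) are hypothetical, and `shift` is used as a portable stand-in for the `${@:2}` slice in the diff.

```shell
#!/bin/sh
# Hypothetical helper, not part of the commit: mirrors how run_singularity.sh
# treats "$1" as the container profile and everything after it as job arguments.
split_profile_and_args() {
    profile="$1"   # e.g. "orbit-base"; selects <profile>.tar and <profile>.sif
    shift          # drop the profile; "$@" now matches bash's "${@:2}"
    archive="${SIF_DIR}/${profile}.tar"
    image="${NODE_TMP}/${profile}.sif"
    job_args="$*"
}

# Sample values (assumptions for illustration only).
SIF_DIR="/cluster/sif"
NODE_TMP="/tmp/job"

split_profile_and_args orbit-base --task Example --headless
printf 'archive:  %s\n' "$archive"
printf 'image:    %s\n' "$image"
printf 'job args: %s\n' "$job_args"
```

Note that joining the arguments into one string, as both this sketch and the `bash -c "..."` line in the diff do, does not preserve quoting of arguments that contain spaces.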

docker/cluster/submit_job.sh

Lines changed: 2 additions & 1 deletion
@@ -17,7 +17,8 @@ cat <<EOT > job.sh
 #SBATCH --mail-user=name@mail
 #SBATCH --job-name="training-$(date +"%Y-%m-%dT%H:%M")"

-sh "$1/docker/cluster/run_singularity.sh" "${@:2}"
+# Pass the container profile first to run_singularity.sh, then all arguments intended for the executed script
+sh "$1/docker/cluster/run_singularity.sh" "$2" "${@:3}"
 EOT

 sbatch < job.sh
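
With this change the argument order is fixed across the scripts: the Orbit directory on the cluster comes first, the container profile second, and the arguments for the executed script last. The sketch below models only that forwarding. The functions `submit_job` and `run_singularity` are hypothetical stand-ins for the two scripts, and the sample arguments are assumptions.

```shell
#!/bin/sh
# Hypothetical stand-ins for the two scripts; only the argument handling is modelled.

# run_singularity.sh: $1 = container profile, remaining = script arguments
run_singularity() {
    profile="$1"
    shift
    printf 'profile=%s args=%s\n' "$profile" "$*"
}

# submit_job.sh: $1 = orbit directory on the cluster, $2 = container profile,
# remaining = script arguments (the diff writes this as "$2" "${@:3}")
submit_job() {
    orbit_dir="$1"   # only used to locate run_singularity.sh in the real script
    profile="$2"
    shift 2
    run_singularity "$profile" "$@"
}

submit_job /cluster/orbit orbit-base --task Example --headless
# prints: profile=orbit-base args=--task Example --headless
```

Because the heredoc delimiter `EOT` in `submit_job.sh` is unquoted, `$1`, `$2`, and `${@:3}` are expanded while `job.sh` is being written, so the generated file contains the literal values.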

docker/container.sh

Lines changed: 100 additions & 37 deletions
@@ -21,13 +21,16 @@ SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
 print_help () {
     echo -e "\nusage: $(basename "$0") [-h] [run] [start] [stop] -- Utility for handling docker in Orbit."
     echo -e "\noptional arguments:"
-    echo -e "\t-h, --help Display the help content."
-    echo -e "\tstart Build the docker image and create the container in detached mode."
-    echo -e "\tenter Begin a new bash process within an existing orbit container."
-    echo -e "\tcopy Copy build and logs artifacts from the container to the host machine."
-    echo -e "\tstop Stop the docker container and remove it."
-    echo -e "\tpush Push the docker image to the cluster."
-    echo -e "\tjob Submit a job to the cluster."
+    echo -e "\t-h, --help Display the help content."
+    echo -e "\tstart [profile] Build the docker image and create the container in detached mode."
+    echo -e "\tenter [profile] Begin a new bash process within an existing orbit container."
+    echo -e "\tcopy [profile] Copy build and logs artifacts from the container to the host machine."
+    echo -e "\tstop [profile] Stop the docker container and remove it."
+    echo -e "\tpush [profile] Push the docker image to the cluster."
+    echo -e "\tjob [profile] [job_args] Submit a job to the cluster."
+    echo -e "\n"
+    echo -e "[profile] is the optional container profile specification and [job_args] optional arguments specific"
+    echo -e "to the executed script"
     echo -e "\n" >&2
 }

@@ -64,12 +67,24 @@ check_docker_version() {
 resolve_image_extension() {
     # If no profile was passed, we default to 'base'
     container_profile=${1:-"base"}
+    # check if the second argument has to be a profile or can be a job argument instead
+    necessary_profile=${2:-true}

     # We also default to 'base' if "orbit" is passed
     if [ "$1" == "orbit" ]; then
         container_profile="base"
     fi

+    # check if a .env.$container_profile file exists
+    # if the argument must be a profile, then the file must exist; otherwise an info message is printed
+    if [ "$necessary_profile" = true ] && [ ! -f $SCRIPT_DIR/.env.$container_profile ]; then
+        echo "[Error] The profile '$container_profile' has no .env.$container_profile file!" >&2;
+        exit 1
+    elif [ ! -f $SCRIPT_DIR/.env.$container_profile ]; then
+        echo "[INFO] No .env.$container_profile found, assume second argument is no profile! Will use default container!" >&2;
+        container_profile="base"
+    fi
+
     add_profiles="--profile $container_profile"
     # We will need .env.base regardless of profile
     add_envs="--env-file .env.base"
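
The lookup added to `resolve_image_extension` can be read as a small decision procedure: default to `base`, map `orbit` to `base`, then either fail or fall back when no matching `.env.<profile>` file exists. The sketch below restates that logic under a few assumptions: the function `resolve_profile`, its `env_dir` parameter, and the profile name `ros2` are hypothetical, and it returns a status instead of calling `exit 1` so it can be tried in an interactive shell.

```shell
#!/bin/sh
# Sketch of the profile lookup; resolve_profile and env_dir are hypothetical names.
# Usage: resolve_profile <env_dir> [profile] [must_be_profile]
# Prints the resolved profile; returns 1 if a required profile has no .env file.
resolve_profile() {
    env_dir="$1"
    profile="${2:-base}"
    must_be_profile="${3:-true}"

    # "orbit" is treated as an alias for the default profile
    if [ "$profile" = "orbit" ]; then
        profile="base"
    fi

    if [ ! -f "$env_dir/.env.$profile" ]; then
        if [ "$must_be_profile" = true ]; then
            echo "[Error] The profile '$profile' has no .env.$profile file!" >&2
            return 1
        fi
        # In job mode the second argument may be a script argument instead
        profile="base"
    fi
    printf '%s\n' "$profile"
}

# Demo with a throwaway directory containing two profile files.
demo_dir="$(mktemp -d)"
touch "$demo_dir/.env.base" "$demo_dir/.env.ros2"
resolve_profile "$demo_dir"                   # prints "base" (no profile given)
resolve_profile "$demo_dir" ros2              # prints "ros2" (file exists)
resolve_profile "$demo_dir" --headless false  # prints "base" (job mode fallback)
```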
@@ -92,6 +107,24 @@ is_container_running() {
     fi
 }

+# Checks if a docker image exists, otherwise prints warning and exits
+check_image_exists() {
+    image_name="$1"
+    if ! docker image inspect $image_name &> /dev/null; then
+        echo "[Error] The '$image_name' image does not exist!" >&2;
+        exit 1
+    fi
+}
+
+# Check if the singularity image exists on the remote host, otherwise print warning and exit
+check_singularity_image_exists() {
+    image_name="$1"
+    if ! ssh "$CLUSTER_LOGIN" "[ -f $CLUSTER_SIF_PATH/$image_name.tar ]"; then
+        echo "[Error] The '$image_name' image does not exist on the remote host $CLUSTER_LOGIN!" >&2;
+        exit 1
+    fi
+}
+
 #==
 # Main
 #==
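
Both new checks follow the same pattern: run a probe command quietly and abort with a message when it fails. A minimal generalisation is sketched below. The helper `require` is hypothetical and not part of the commit, `true` and `false` stand in for the real probes, and the sketch returns a status rather than calling `exit 1` so that it can be exercised safely. It also uses `> /dev/null 2>&1`, the portable spelling of the `&>` redirection in the diff.

```shell
#!/bin/sh
# Hypothetical generalisation of the two new checks: run a probe command and
# report an error on stderr if it fails.
# Usage: require <error message> <command> [args...]
require() {
    description="$1"
    shift
    if ! "$@" > /dev/null 2>&1; then
        echo "[Error] $description" >&2
        return 1
    fi
}

# `true` and `false` stand in for probes such as
# `docker image inspect <image>` or `ssh <host> "[ -f <path> ]"`.
require "image is missing" true && echo "check passed"
require "image is missing" false 2> /dev/null || echo "check failed"
```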
@@ -111,7 +144,27 @@ fi

 # parse arguments
 mode="$1"
-resolve_image_extension $2
+profile_arg="$2" # Capture the second argument as the potential profile argument
+
+# Check mode argument and resolve the container profile
+case $mode in
+    build|start|enter|copy|stop|push)
+        resolve_image_extension "$profile_arg" true
+        ;;
+    job)
+        resolve_image_extension "$profile_arg" false
+        ;;
+    *)
+        # Not recognized mode
+        echo "[Error] Invalid command provided: $mode"
+        print_help
+        exit 1
+        ;;
+esac
+
+# Produces a nice print statement stating which container profile is being used
+echo "[INFO] Using container profile: $container_profile"
+
 # resolve mode
 case $mode in
     start)
@@ -169,43 +222,53 @@ case $mode in
         if ! command -v apptainer &> /dev/null; then
             install_apptainer
         fi
+        # Check if Docker image exists
+        check_image_exists orbit-$container_profile:latest
         # Check if Docker version is greater than 25
         check_docker_version
-        # Check if .env.base file exists
-        if [ -f $SCRIPT_DIR/.env.base ]; then
-            # source env file to get cluster login and path information
-            source $SCRIPT_DIR/.env.base
-            # clear old exports
-            rm -rf /$SCRIPT_DIR/exports
-            mkdir -p /$SCRIPT_DIR/exports
-            # create singularity image
-            # NOTE: we create the singularity image as non-root user to allow for more flexibility. If this causes
-            # issues, remove the --fakeroot flag and open an issue on the orbit repository.
-            cd /$SCRIPT_DIR/exports
-            APPTAINER_NOHTTPS=1 apptainer build --sandbox --fakeroot orbit.sif docker-daemon://orbit:latest
-            # tar image and send to cluster
-            tar -cvf /$SCRIPT_DIR/exports/orbit.tar orbit.sif
-            scp /$SCRIPT_DIR/exports/orbit.tar $CLUSTER_LOGIN:$CLUSTER_SIF_PATH/orbit.tar
-        else
-            echo "[Error]: ".env.base" file not found."
-        fi
+        # source env file to get cluster login and path information
+        source $SCRIPT_DIR/.env.base
+        # make sure exports directory exists
+        mkdir -p /$SCRIPT_DIR/exports
+        # clear old exports for selected profile
+        rm -rf /$SCRIPT_DIR/exports/orbit-$container_profile*
+        # create singularity image
+        # NOTE: we create the singularity image as non-root user to allow for more flexibility. If this causes
+        # issues, remove the --fakeroot flag and open an issue on the orbit repository.
+        cd /$SCRIPT_DIR/exports
+        APPTAINER_NOHTTPS=1 apptainer build --sandbox --fakeroot orbit-$container_profile.sif docker-daemon://orbit-$container_profile:latest
+        # tar image (faster to send single file as opposed to directory with many files)
+        tar -cvf /$SCRIPT_DIR/exports/orbit-$container_profile.tar orbit-$container_profile.sif
+        # make sure target directory exists
+        ssh $CLUSTER_LOGIN "mkdir -p $CLUSTER_SIF_PATH"
+        # send image to cluster
+        scp $SCRIPT_DIR/exports/orbit-$container_profile.tar $CLUSTER_LOGIN:$CLUSTER_SIF_PATH/orbit-$container_profile.tar
         ;;
     job)
-        # Check if .env file exists
-        if [ -f $SCRIPT_DIR/.env.base ]; then
-            # Sync orbit code
-            echo "[INFO] Syncing orbit code..."
-            source $SCRIPT_DIR/.env.base
-            rsync -rh --exclude="*.git*" --filter=':- .dockerignore' /$SCRIPT_DIR/.. $CLUSTER_LOGIN:$CLUSTER_ORBIT_DIR
-            # execute job script
-            echo "[INFO] Executing job script..."
-            ssh $CLUSTER_LOGIN "cd $CLUSTER_ORBIT_DIR && sbatch $CLUSTER_ORBIT_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ORBIT_DIR" "${@:2}"
+        source $SCRIPT_DIR/.env.base
+        # Check if singularity image exists on the remote host
+        check_singularity_image_exists orbit-$container_profile
+        # make sure target directory exists
+        ssh $CLUSTER_LOGIN "mkdir -p $CLUSTER_ORBIT_DIR"
+        # Sync orbit code
+        echo "[INFO] Syncing orbit code..."
+        rsync -rh --exclude="*.git*" --filter=':- .dockerignore' /$SCRIPT_DIR/.. $CLUSTER_LOGIN:$CLUSTER_ORBIT_DIR
+        # execute job script
+        echo "[INFO] Executing job script..."
+        # check whether the second argument is a profile or a job argument
+        if [ "$profile_arg" == "$container_profile" ] ; then
+            # if the second argument is a profile, we have to shift the arguments
+            echo "[INFO] Arguments passed to job script ${@:3}"
+            ssh $CLUSTER_LOGIN "cd $CLUSTER_ORBIT_DIR && sbatch $CLUSTER_ORBIT_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ORBIT_DIR" "orbit-$container_profile" "${@:3}"
         else
-            echo "[Error]: ".env.base" file not found."
+            # if the second argument is a job argument, we have to shift only one argument
+            echo "[INFO] Arguments passed to job script ${@:2}"
+            ssh $CLUSTER_LOGIN "cd $CLUSTER_ORBIT_DIR && sbatch $CLUSTER_ORBIT_DIR/docker/cluster/submit_job.sh" "$CLUSTER_ORBIT_DIR" "orbit-$container_profile" "${@:2}"
         fi
         ;;
     *)
-        echo "[Error] Invalid argument provided: $1"
+        # Not recognized mode
+        echo "[Error] Invalid command provided: $mode"
         print_help
         exit 1
         ;;
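
In the `job` branch, the number of arguments to skip depends on whether the second argument was consumed as a profile. The sketch below isolates that decision. The helper `job_script_args` is hypothetical, takes the already resolved profile as its first parameter, and uses `shift` in place of the `${@:2}` and `${@:3}` slices.

```shell
#!/bin/sh
# Hypothetical helper mirroring the branch in the `job` case.
# $1 = resolved container profile, remaining = the original command line
# ("job" followed by an optional profile and the script arguments).
job_script_args() {
    resolved="$1"
    shift            # drop the resolved profile
    shift            # drop the mode ("job")
    if [ "${1-}" = "$resolved" ]; then
        shift        # second argument was the profile: skip it as well
    fi
    printf '%s\n' "$*"
}

job_script_args base job base --task Example   # prints "--task Example"
job_script_args base job --task Example        # prints "--task Example"
job_script_args base job orbit --task Example  # prints "orbit --task Example"
```

The last call illustrates a consequence of comparing the raw second argument with the resolved profile: as far as the diff shows, `orbit` resolves to `base`, the comparison fails, and `orbit` is then forwarded as a script argument.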

docs/source/deployment/cluster.rst

Lines changed: 10 additions & 3 deletions
@@ -91,11 +91,13 @@ To export to a singularity image, execute the following command:

 .. code:: bash

-    ./docker/container.sh push
+    ./docker/container.sh push [profile]

 This command will create a singularity image under ``docker/exports`` directory and
 upload it to the defined location on the cluster. Be aware that creating the singularity
 image can take a while.
+``[profile]`` is an optional argument that specifies the container profile to be used. If no profile is
+specified, the default profile ``base`` will be used.

 .. note::
     By default, the singularity image is created without root access by providing the ``--fakeroot`` flag to
@@ -141,13 +143,18 @@ To submit a job on the cluster, the following command can be used:

 .. code:: bash

-    ./docker/container.sh job "argument1" "argument2" ...
+    ./docker/container.sh job [profile] "argument1" "argument2" ...

 This command will copy the latest changes in your code to the cluster and submit a job. Please ensure that
 your Python executable's output is stored under ``orbit/logs`` as this directory will be copied again
 from the compute node to ``CLUSTER_ORBIT_DIR``.

-The training arguments anove are passed to the Python executable. As an example, the standard
+``[profile]`` is an optional argument that specifies which singularity image corresponding to the container profile
+will be used. If no profile is specified, the default profile ``base`` will be used. The profile has to be defined
+directly after the ``job`` command. All other arguments are passed to the Python executable. If no profile is
+defined, all arguments are passed to the Python executable.
+
+The training arguments are passed to the Python executable. As an example, the standard
 ANYmal rough terrain locomotion training can be executed with the following command:

 .. code:: bash
