{ai}[foss/2023a] DeepSpeed v0.14.5, CUTLASS v3.4.0, DLPACK v0.8 w/ CUDA 12.1.1 #21438
…tches: DeepSpeed-0.14.5_pic-compile.patch, DeepSpeed-0.14.2_no-ninja-dep.patch
|
Will probably want to change the triton used to the one from #21318 |
|
DLPack |
|
CUTLASS |
|
DeepSpeed: typo in the test command after line breaks. |
|
Test report by @VRehnberg |
|
Latest failures have in common that they use the multi-node launcher. Unsure whether only the test is broken or something else. As an example of a failing command:
pdsh -S -f 1024 -w localhost export NCCL_IB_HCA=^mlx5_1; export PYTHONNOUSERSITE=1; export UCX_MODULE_DIR=[...]; export PYTHONPATH=[...]; /apps/Test2/software/Python/3.11.3-GCCcore-12.3.0/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --node_rank=%n --master_addr=127.0.0.1 --master_port=29500 /cephyr/NOBACKUP/priv/c3-staff/eb-tmp/eb-5t779o2m/pytest-of-c3-builder/pytest-0/test_user_args_True_I_m_going_0/user_arg_test.py --prompt "I\'m going to tell them \\"DeepSpeed is the best\\""\n'.decode
So it is probably just because LD_LIBRARY_PATH is not also exported. Looking for where this command is built... It seems an add_export for that is missing here https://github.com/microsoft/DeepSpeed/blob/v0.14.5/deepspeed/launcher/runner.py#L564-L578 so the fix should just be to add it to https://github.com/microsoft/DeepSpeed/blob/v0.14.5/deepspeed/launcher/runner.py#L34 Will try that. |
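To illustrate the suspected mechanism, here is a minimal sketch of prefix-based export filtering. It is not the DeepSpeed source: the function names are hypothetical, the prefix list is only inferred from the variables visible in the failing command above (NCCL, PYTHON, UCX), and the paths are made up.

```python
# Simplified sketch of prefix-based environment export filtering.
# NOT the DeepSpeed implementation; all names here are illustrative.

# Assumed prefixes, inferred from the variables in the failing pdsh command.
ASSUMED_PREFIXES = ["NCCL", "PYTHON", "UCX"]


def collect_exports(env, prefixes):
    """Return the variables whose names start with one of the prefixes."""
    return {
        name: value
        for name, value in env.items()
        if any(name.startswith(prefix) for prefix in prefixes)
    }


def build_export_string(exports):
    """Render the selected variables as 'export NAME=VALUE;' statements."""
    return " ".join(f"export {name}={value};" for name, value in sorted(exports.items()))


# Hypothetical environment of the launching process.
env = {
    "NCCL_IB_HCA": "^mlx5_1",
    "PYTHONNOUSERSITE": "1",
    "LD_LIBRARY_PATH": "/hypothetical/lib",
    "HOME": "/home/user",
}

before = collect_exports(env, ASSUMED_PREFIXES)
after = collect_exports(env, ASSUMED_PREFIXES + ["LD_LIBRARY_PATH"])

print(build_export_string(before))  # LD_LIBRARY_PATH is dropped
print(build_export_string(after))   # LD_LIBRARY_PATH is forwarded
```

With the assumed prefix list, LD_LIBRARY_PATH never reaches the remote shell, which would match the failure; adding it to the list forwards it along with the other variables.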
|
Test report by @VRehnberg |
|
Test report by @VRehnberg |
|
Test report by @VRehnberg |
|
Compared the environment variables before and after loading the DeepSpeed module. Will probably update the pdsh-env-vars patch with those. So remaining
|
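The before/after comparison of environment variables can be sketched as follows. This assumes the two environments were captured as plain NAME=VALUE text (for example from `env` before and after loading the module); the helper names and sample values are hypothetical, and multi-line values are not handled.

```python
# Sketch: diff two environment dumps captured as NAME=VALUE lines.
# Helper names and sample data are hypothetical.


def parse_env(text):
    """Parse NAME=VALUE lines into a dict (multi-line values not supported)."""
    env = {}
    for line in text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            env[name] = value
    return env


def env_diff(before, after):
    """Return (added, changed, removed) between two environment dicts."""
    added = {name: after[name] for name in after.keys() - before.keys()}
    changed = {
        name: (before[name], after[name])
        for name in before.keys() & after.keys()
        if before[name] != after[name]
    }
    removed = sorted(before.keys() - after.keys())
    return added, changed, removed


# Made-up dumps standing in for the real before/after captures.
before_env = parse_env("PATH=/usr/bin\nHOME=/home/user\nOLD_VAR=x")
after_env = parse_env("PATH=/opt/example/bin:/usr/bin\nHOME=/home/user\nEXAMPLE_HOME=/opt/example")

added, changed, removed = env_diff(before_env, after_env)
print("added:", sorted(added))
print("changed:", sorted(changed))
print("removed:", removed)
```

The added and changed names are the candidates for forwarding to the remote shell.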
|
Env report seems to indicate that pre-built ops are not being picked up properly: |
|
Test report by @VRehnberg |
|
Test report by @VRehnberg |
|
Three failing files in latest 4xA100 run:
I can't reproduce them when skipping the test step and running them manually (though the first one is unclear, as it fails for another reason). |
|
Test report by @VRehnberg |
|
Test report by @VRehnberg |
|
Test report by @VRehnberg Here's the build error again. Mostly confused about why it only appears sometimes. Perhaps I need to actually fix it. |
|
Test report by @VRehnberg Build failures mostly seem flaky. Only change with this one to the one before |
|
There still seems to be something flaky about the build step, but I have no idea what it could be. At this point I'd welcome others testing this to see if they experience it too.
…asyconfigs into 20240918144941_new_pr_DeepSpeed0145
|
@boegelbot please test @ jsc-zen3-a100 |
|
@pavelToman: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2577221419 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @boegelbot |
|
Seen those failed tests before. It's not finding the |
|
I think it wasn't using the easyblock in the test build, let me retry... |
|
@boegelbot please test @ jsc-zen3-a100 |
|
@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2607787939 processed Message to humans: this is just bookkeeping information for me, |
|
@boegelbot please test @ jsc-zen3-a100 |
|
@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2609989695 processed Message to humans: this is just bookkeeping information for me, |
Hm, the job is still running after more than a day:
$ squeue --all --long
Fri Jan 24 07:36:19 2025
JOBID PARTITION NAME     USER     STATE   TIME       TIME_LIMI  NODES NODELIST(REASON)
 5582 jsczen3g  test_PR_ boegelbo PENDING 0:00       4-04:00:00     1 (Resources)
 5588 jsczen3g  test_PR_ boegelbo PENDING 0:00       4-04:00:00     1 (Priority)
 5571 jsczen3g  test_PR_ boegelbo RUNNING 1-14:26:08 4-04:00:00     1 jsczen3g1
Maybe PyTorch is being built, which takes a while... |
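For reading the TIME and TIME_LIMI columns above, a small sketch that converts Slurm-style elapsed times into a timedelta. It assumes the values look like the ones in this output ([days-][hours:]minutes:seconds); the function name is hypothetical, and other Slurm time notations are not covered.

```python
from datetime import timedelta


def parse_slurm_time(text):
    """Parse '[days-][hours:]minutes:seconds' as printed by squeue above."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [int(field) for field in text.split(":")]
    # Right-align the fields so that '0:00' is read as minutes:seconds.
    hours, minutes, seconds = [0] * (3 - len(fields)) + fields
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


elapsed = parse_slurm_time("1-14:26:08")  # running job 5571
limit = parse_slurm_time("4-04:00:00")    # its time limit

print(elapsed > timedelta(days=1))  # True: running for more than a day
print(limit - elapsed)              # time left before the limit
```

Job 5571 had been running for 1 day, 14 hours and 26 minutes at the time of the listing, well within its limit of 4 days and 4 hours.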
|
Test report by @boegelbot |
|
@Thyre thanks for checking, it could have been PyTorch indeed... Anyway, the bot never reported back on |
|
Hmm... checking the logs, I see there was a lock file for DeepSpeed. Maybe |
|
@boegelbot please test @ jsc-zen3-a100 |
|
@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)... - notification for comment with ID 2619317477 processed Message to humans: this is just bookkeeping information for me, |
|
Did that test build ever finish? |
|
Trying to build this on an isolated cluster - is there a way of disabling the tests that (for example) try to download from HuggingFace? |
(created using eb --new-pr)
Requires:
Edit: It also patches an existing version of CUTLASS (instead of adding a new version as it did initially).