@VRehnberg
Contributor

@VRehnberg VRehnberg commented Sep 18, 2024

(created using eb --new-pr)
Requires:

Edit: It also patches an existing version of CUTLASS (instead of adding a new version as it did initially).

…tches: DeepSpeed-0.14.5_pic-compile.patch, DeepSpeed-0.14.2_no-ninja-dep.patch
@VRehnberg VRehnberg marked this pull request as draft September 18, 2024 14:52
@VRehnberg
Contributor Author

VRehnberg commented Sep 18, 2024

Will probably want to change the Triton used to the one from #21318.

@VRehnberg VRehnberg marked this pull request as ready for review September 23, 2024 14:24
@VRehnberg VRehnberg changed the title {ai}[foss/2023a] DeepSpeed v0.14.5 w/ CUDA 12.1.1 {ai}[foss/2023a] DeepSpeed v0.14.5, CUTLASS v3.5.0, DLPACK v0.8 w/ CUDA 12.1.1 Sep 23, 2024
@VRehnberg
Contributor Author

VRehnberg commented Sep 24, 2024

DLPack
Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-16 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 560.35.03, Python 3.6.8
See https://gist.github.com/VRehnberg/9e271e35ebb3964a7749770e14eb3c42 for a full test report.

@VRehnberg
Contributor Author

VRehnberg commented Sep 24, 2024

CUTLASS
Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-15 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 560.35.03, Python 3.6.8
See https://gist.github.com/VRehnberg/01f85607114e140d887898842effea9f for a full test report.

@VRehnberg
Contributor Author

VRehnberg commented Sep 24, 2024

DeepSpeed
Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
alvis1-06 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 560.35.03, Python 3.6.8
See https://gist.github.com/VRehnberg/5a708f22771e1d099be9f64a37d3eccb for a full test report.


There is a typo in the test command after the line breaks.

@VRehnberg
Contributor Author

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
alvis1-12 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 560.35.03, Python 3.6.8
See https://gist.github.com/VRehnberg/d519a58b098e8dcc932f3eaba43627cd for a full test report.

@VRehnberg
Contributor Author

Latest failures have in common that they use the multi-node launcher. Unsure if it's only the test that's broken or something else.

As an example of a failing command:

pdsh -S -f 1024 -w localhost export NCCL_IB_HCA=^mlx5_1; export PYTHONNOUSERSITE=1; export UCX_MODULE_DIR=[...]; export PYTHONPATH=[...]; /apps/Test2/software/Python/3.11.3-GCCcore-12.3.0/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --node_rank=%n --master_addr=127.0.0.1 --master_port=29500 /cephyr/NOBACKUP/priv/c3-staff/eb-tmp/eb-5t779o2m/pytest-of-c3-builder/pytest-0/test_user_args_True_I_m_going_0/user_arg_test.py --prompt "I\'m going to tell them \\"DeepSpeed is the best\\""\n'.decode

So this is probably just because LD_LIBRARY_PATH is not exported along with the other variables. Looking for where this command is built...

It seems like an add_export for it is missing here: https://github.com/microsoft/DeepSpeed/blob/v0.14.5/deepspeed/launcher/runner.py#L564-L578

So the fix should just be to add it to https://github.com/microsoft/DeepSpeed/blob/v0.14.5/deepspeed/launcher/runner.py#L34

Will try that.
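
To illustrate the suspected mechanism, here is a minimal standalone sketch, not DeepSpeed's actual code. It assumes the launcher only forwards variables whose names start with one of a fixed list of prefixes; `EXPORT_PREFIXES`, `build_exports` and the example values are hypothetical names chosen for illustration.

```python
import shlex

# Hypothetical prefix list; the real list in runner.py may differ.
EXPORT_PREFIXES = ["NCCL", "PYTHON", "UCX"]


def build_exports(env, prefixes):
    """Return 'export K=V;' fragments for variables matching a prefix."""
    parts = []
    for key in sorted(env):
        if any(key.startswith(prefix) for prefix in prefixes):
            parts.append(f"export {key}={shlex.quote(env[key])};")
    return " ".join(parts)


# Made-up environment for the example.
env = {
    "NCCL_IB_HCA": "^mlx5_1",
    "PYTHONNOUSERSITE": "1",
    "LD_LIBRARY_PATH": "/opt/example/lib",
    "HOME": "/home/user",
}

before = build_exports(env, EXPORT_PREFIXES)
after = build_exports(env, EXPORT_PREFIXES + ["LD_LIBRARY_PATH"])
print(before)
print(after)
```

Under that assumption, LD_LIBRARY_PATH only reaches the remote shell once it is added to the prefix list, which matches the exports seen in the failing pdsh command above.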

@VRehnberg
Contributor Author

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
alvis1-02 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 560.35.03, Python 3.6.8
See https://gist.github.com/VRehnberg/03bc8fc4dfc955cd4a20c9c4f68a7fc8 for a full test report.

@VRehnberg
Contributor Author

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
alvis1-02 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 560.35.03, Python 3.6.8
See https://gist.github.com/VRehnberg/7bcc5750617d9cc507edc761af91d199 for a full test report.

@VRehnberg
Contributor Author

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-01 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 560.35.03, Python 3.6.8
See https://gist.github.com/VRehnberg/1afe632871f634e216b2ee080529681f for a full test report.

@VRehnberg
Contributor Author

VRehnberg commented Oct 1, 2024

Compared the environment variables before and after loading the DeepSpeed module. Will probably update the pdsh-env-vars patch with those.

So, remaining:

  • Update pdsh-env-vars patch
  • Check deepspeed.env_report and see if everything was built alright
  • Run on some more hardware (especially Ampere or newer GPUs)
  • Get this reviewed by a maintainer
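
For reference, the before/after comparison of environment variables can be sketched roughly like this. This is a minimal example assuming the two environments were captured as dictionaries (for instance from `os.environ` in each shell); the function name and the sample values are made up.

```python
def env_diff(before, after):
    """Compare two environment snapshots given as dicts."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = sorted(before.keys() - after.keys())
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed


# Made-up snapshots, only to show the shape of the result.
env_before = {"PATH": "/usr/bin", "HOME": "/home/user"}
env_after = {
    "PATH": "/opt/example/bin:/usr/bin",
    "HOME": "/home/user",
    "EBROOTEXAMPLE": "/opt/example",
}

added, removed, changed = env_diff(env_before, env_after)
print("added:", sorted(added))
print("changed:", sorted(changed))
```

The variables that show up as added or changed are the candidates for the pdsh-env-vars patch.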

@VRehnberg
Contributor Author

Env report seems to indicate that pre-built ops are not being picked up properly:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  NVIDIA Inference is only supported on Ampere and newer architectures
 [WARNING]  FP Quantizer is using an untested triton version (2.1.0), only 2.3.0 and 2.3.1 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/apps/Test2/software/PyTorch/2.1.2-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/torch']
torch version .................... 2.1.2
deepspeed install path ........... ['/apps/Test2/software/DeepSpeed/0.14.5-foss-2023a-CUDA-12.1.1/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.14.5+unknown, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 377.03 GB
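
A small sketch for pulling the not-installed ops out of a report like the one above, so that runs can be compared. It assumes the plain-text layout shown here (no colour codes) and that an installed op would be marked `[YES]`; both are assumptions about the report format, and the sample is shortened from the output above.

```python
import re

# Matches lines such as "async_io ............... [NO] ....... [OKAY]".
OP_LINE = re.compile(r"^(\w+) \.+ \[(\w+)\] \.+ \[(\w+)\]$")


def parse_op_report(text):
    """Map op name -> (installed, compatible) as booleans."""
    ops = {}
    for line in text.splitlines():
        match = OP_LINE.match(line.strip())
        if match:
            name, installed, compatible = match.groups()
            # "[YES]" for installed ops is an assumption about the format.
            ops[name] = (installed == "YES", compatible == "OKAY")
    return ops


sample = """\
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
stochastic_transformer . [NO] ....... [OKAY]
"""

ops = parse_op_report(sample)
not_installed = sorted(name for name, (inst, _) in ops.items() if not inst)
print(not_installed)
```

Header, warning and ninja lines do not match the pattern and are skipped, so only the per-op rows end up in the result.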

@VRehnberg
Contributor Author

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
alvis6-11 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 1 x NVIDIA NVIDIA A40, 560.35.03, Python 3.6.8
See https://gist.github.com/VRehnberg/d35d54e5f5c7926d09cd494d1addf826 for a full test report.

@VRehnberg
Contributor Author

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
FAILED
Build succeeded for 4 out of 5 (3 easyconfigs in total)
alvis3-36 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/VRehnberg/f4f23e0b89d8a5b5304d352fe0f04964 for a full test report.

@VRehnberg
Contributor Author

Three failing files in latest 4xA100 run:

  1. tests/unit/inference/quantization/test_intX_quantization.py
  2. tests/unit/ops/aio/test_aio.py
  3. tests/unit/runtime/zero/test_nvme_checkpointing.py

Can't reproduce them when skipping the test step and running them manually (though the first one is unclear, as it fails for another reason).
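
A rough sketch of how the three files could be rerun manually a few times to probe for flakiness. It only builds the commands; the `repeats` parameter and the helper name are made up, and it assumes plain `python -m pytest <file>` from the DeepSpeed source directory is enough, which may not hold for the actual test setup.

```python
import sys

FAILING_FILES = [
    "tests/unit/inference/quantization/test_intX_quantization.py",
    "tests/unit/ops/aio/test_aio.py",
    "tests/unit/runtime/zero/test_nvme_checkpointing.py",
]


def pytest_commands(files, repeats=1):
    """Build one `python -m pytest <file>` command per file and repeat."""
    return [
        [sys.executable, "-m", "pytest", path]
        for path in files
        for _ in range(repeats)
    ]


commands = pytest_commands(FAILING_FILES, repeats=3)
for cmd in commands:
    print(" ".join(cmd))
```

Each command can then be passed to `subprocess.run(cmd, check=False)` and the return codes collected per file.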

@VRehnberg
Contributor Author

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis4-18 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/VRehnberg/123208d9718223547670a43a4832f542 for a full test report.

@VRehnberg
Contributor Author

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-09 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 2 x NVIDIA Tesla V100-SXM2-32GB, 560.35.03, Python 3.6.8
See https://gist.github.com/VRehnberg/e2dd6f6b643bc2a197e8472b0f48679b for a full test report.

@VRehnberg VRehnberg changed the title {ai}[foss/2023a] DeepSpeed v0.14.5, CUTLASS v3.5.1, DLPACK v0.8 w/ CUDA 12.1.1 {ai}[foss/2023a] DeepSpeed v0.14.5, CUTLASS v3.4.0, DLPACK v0.8 w/ CUDA 12.1.1 Nov 12, 2024
@VRehnberg
Contributor Author

VRehnberg commented Nov 12, 2024

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
alvis4-28 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/VRehnberg/246df5e396a2f7020bde0b5fb1c497dd for a full test report.


Here's the build error again. Mostly confused about why it only appears sometimes. Perhaps I need to actually fix it.

@VRehnberg
Contributor Author

VRehnberg commented Nov 12, 2024

Test report by @VRehnberg
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3450
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis4-28 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/VRehnberg/3b3629b684c865b5c5cb9e5c5ac6407f for a full test report.


Build failures mostly seem flaky. Only change with this one to the one before

@VRehnberg
Contributor Author

There still seems to be something flaky about the build step, but I have no idea what it could be.

At this point I'd welcome others testing this to see if they also experience it.

@pavelToman
Collaborator

@boegelbot please test @ jsc-zen3-a100

@boegelbot
Collaborator

@pavelToman: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=21438 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_21438 --ntasks=8 --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 5517

Test results coming soon (I hope)...

- notification for comment with ID 2577221419 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.5, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 555.42.06, Python 3.9.21
See https://gist.github.com/boegelbot/71562778980885b1d31c7b57dce9cc97 for a full test report.

@VRehnberg
Contributor Author

I've seen those failed tests before. It's not finding the deepspeed executable. IIRC either the update of the pythonpackage easyblock in easybuilders/easybuild-easyblocks#3450 was enough, or that together with https://github.com/easybuilders/easybuild-easyconfigs/pull/21438/files#diff-ae8697a8ff4d97c6e606b5836c5821350a1c2691516e06d799f8c5c466318dbc was what was needed for me. Either way, I don't understand why it is showing up again for boegelbot.

@casparvl
Contributor

I think it wasn't using the easyblock in the test build, let me retry...

@casparvl
Contributor

@boegelbot please test @ jsc-zen3-a100
EB_ARGS="--include-easyblocks-from-commit 1486d87f1f8076d006803fa5b7459f50a951049e"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=21438 EB_ARGS="--include-easyblocks-from-commit 1486d87f1f8076d006803fa5b7459f50a951049e" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_21438 --ntasks=8 --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 5571

Test results coming soon (I hope)...

- notification for comment with ID 2607787939 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@casparvl
Contributor

@boegelbot please test @ jsc-zen3-a100
EB_ARGS="--include-easyblocks-from-commit 1486d87f1f8076d006803fa5b7459f50a951049e"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=21438 EB_ARGS="--include-easyblocks-from-commit 1486d87f1f8076d006803fa5b7459f50a951049e" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_21438 --ntasks=8 --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 5582

Test results coming soon (I hope)...

- notification for comment with ID 2609989695 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Thyre
Collaborator

Thyre commented Jan 23, 2025

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=21438 EB_ARGS="--include-easyblocks-from-commit 1486d87f1f8076d006803fa5b7459f50a951049e" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_21438 --ntasks=8 --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

* exit code: 0

* output:
Submitted batch job 5571

Test results coming soon (I hope)...

Hm, the job is still running after more than a day

$ squeue --all --long
Fri Jan 24 07:36:19 2025
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
              5582  jsczen3g test_PR_ boegelbo  PENDING       0:00 4-04:00:00      1 (Resources)
              5588  jsczen3g test_PR_ boegelbo  PENDING       0:00 4-04:00:00      1 (Priority)
              5571  jsczen3g test_PR_ boegelbo  RUNNING 1-14:26:08 4-04:00:00      1 jsczen3g1

Maybe PyTorch is being built, which takes a while...
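
As a quick check on the elapsed times in the squeue output above, here is a small sketch that converts the TIME column to seconds. It assumes the `days-hours:minutes:seconds` layout seen above, with shorter forms such as `0:00` read as minutes and seconds; the function name is made up.

```python
def slurm_time_to_seconds(value):
    """Convert a squeue-style '[D-]HH:MM:SS' or 'M:SS' string to seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    fields = [int(field) for field in value.split(":")]
    # Pad on the left so the list is always [hours, minutes, seconds].
    while len(fields) < 3:
        fields.insert(0, 0)
    hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


elapsed = slurm_time_to_seconds("1-14:26:08")
limit = slurm_time_to_seconds("4-04:00:00")
print(elapsed, limit, elapsed > 24 * 3600)
```

With these values, job 5571 had been running for well over a day but was still far from its time limit.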

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from easybuilders/easybuild-easyblocks@1486d87
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.5, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 555.42.06, Python 3.9.21
See https://gist.github.com/boegelbot/3a44d585aa591865d2ab7351d7866688 for a full test report.

@casparvl
Contributor

@Thyre thanks for checking, it could have been PyTorch indeed... Anyway, the bot never reported back on 5571, so I'm not sure. It did report back on 5582, which is probably from my second 'submission'. That job was a lot faster. My guess would be that 5571 built PyTorch + everything in this PR, and 5582 'only' the things from this PR. Anyway, good enough to see the failure now :)

@casparvl
Contributor

Hmm... checking the logs, I see there was a lock file for DeepSpeed. Maybe 5582 timed out and didn't clean up the lock file? I'll try to rebuild, ignoring the locks - I believe this PR is the only one touching DLPack, CUTLASS and DeepSpeed anyway :)

@casparvl
Contributor

@boegelbot please test @ jsc-zen3-a100
EB_ARGS="--include-easyblocks-from-commit 1486d87f1f8076d006803fa5b7459f50a951049e --ignore-locks"

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=21438 EB_ARGS="--include-easyblocks-from-commit 1486d87f1f8076d006803fa5b7459f50a951049e --ignore-locks" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_21438 --ntasks=8 --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 5618

Test results coming soon (I hope)...

- notification for comment with ID 2619317477 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@verdurin
Member

Did that test build ever finish?
We've been asked for this, so I'm keen to help.
I don't currently have an A100 node set up to submit test results.

@github-actions github-actions bot removed the update label Jun 17, 2025
@Thyre Thyre added the 2023a label Aug 18, 2025
@verdurin
Member

verdurin commented Sep 5, 2025

Trying to build this on an isolated cluster - is there a way of disabling the tests that (for example) try to download from HuggingFace?


6 participants