Out-of-memory error when collecting gradients from large-scale LLMs

Hi, thanks for sharing such nice work!

I am trying to collect gradients for LLaMA2-13B with the released code, but I run into an out-of-memory (OOM) error.
I am currently using a single NVIDIA H100 GPU.
How many GPUs are needed for models with 13B parameters or more?
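
For context, here is my rough back-of-envelope estimate of the memory needed just to hold the weights plus one full set of gradients. It is only a sketch under assumptions that I have not checked against the released code: a nominal 13e9 parameters, an 80 GB card, and gradients stored at the same precision as the weights. Activations and any optimizer or projection buffers are ignored, so the real footprint would be higher. The function name `estimate_gb` is just something I made up for illustration.

```python
# Rough memory estimate: weights + one full gradient copy.
# Assumptions (not taken from the released code): 13e9 parameters,
# 80 GB of GPU memory, gradients kept at the same precision as weights.

def estimate_gb(n_params: float, bytes_per_param: int, n_copies: int = 2) -> float:
    """Return memory in decimal GB for n_copies tensors of n_params elements."""
    return n_params * bytes_per_param * n_copies / 1e9

N_PARAMS = 13e9        # nominal parameter count for a 13B model
GPU_MEMORY_GB = 80.0   # assumed capacity of a single GPU

for name, nbytes in [("fp16/bf16", 2), ("fp32", 4)]:
    need = estimate_gb(N_PARAMS, nbytes)
    fits = "fits" if need < GPU_MEMORY_GB else "does not fit"
    print(f"{name}: ~{need:.0f} GB for weights + gradients ({fits} in {GPU_MEMORY_GB:.0f} GB)")
```

If this estimate is roughly right, half precision leaves limited headroom for activations on one card, and full precision does not fit at all, which would match the OOM I am seeing.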

Thanks!