
Conversation

@thomasw21
Member

Syncing of torch random state.

@thomasw21 thomasw21 changed the base branch from main to olruwase/sync_layer_norms April 4, 2022 09:20
@thomasw21 thomasw21 requested a review from stas00 April 4, 2022 09:20
# We use rank 1 as source of truth and set the new
torch.distributed.broadcast(
    torch_rng_state,
    src=mpu.get_tensor_model_parallel_src_rank() + 1,
Contributor

may I ask why rank 1 and not rank 0?

(which would fail with tp=1, but I'm aware that this is a hack for our particular case with tp=4 so should be fine)

Member Author

Ranks 1, 2, and 3 were synchronized; rank 0 was out of sync. So I thought the path of least change was to force-match 0 to the rest, instead of matching everyone to 0.
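
For context, a minimal sketch of that pattern, assuming an NCCL backend (so the CPU RNG state is moved to GPU for the broadcast). `tp_group` and `tp_src_rank` are placeholders standing in for Megatron's `mpu.get_tensor_model_parallel_group()` and `mpu.get_tensor_model_parallel_src_rank()`; this is not the PR's exact code:

```python
import torch
import torch.distributed as dist


def force_sync_torch_rng_state(tp_group, tp_src_rank):
    """Make every rank in the tensor-parallel group adopt the torch RNG
    state held by ``tp_src_rank + 1`` (i.e. local rank 1 of the group)."""
    # torch.get_rng_state() returns a CPU ByteTensor; move it to GPU so the
    # NCCL broadcast can operate on it.
    rng_state = torch.get_rng_state().cuda()
    dist.broadcast(rng_state, src=tp_src_rank + 1, group=tp_group)
    # All ranks, including the previously out-of-sync rank 0, install the
    # broadcast state locally.
    torch.set_rng_state(rng_state.cpu())
```

Broadcasting from `tp_src_rank + 1` is what makes rank 0 adopt the state already shared by ranks 1-3.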

Contributor

gotcha - makes sense! thank you for explaining.

So is this fix related to #276?

Member Author
@thomasw21 Apr 5, 2022

Yes, sorry, this fix is going to be merged into the force-syncing branch of Meg-DS, olruwase/sync_layer_norms. It implements the force-syncing strategy for the current training run, which is out of sync. It only works because the sampling we use is not random (otherwise we couldn't guarantee that we don't see duplicated samples).

#276 is the general fix that should be integrated into master.

@thomasw21 thomasw21 merged commit d576775 into olruwase/sync_layer_norms Apr 6, 2022
@thomasw21 thomasw21 deleted the thomas/olruwase/sync_layer_norms branch April 6, 2022 16:42
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Dec 18, 2023
…p#277)

* Introduce LayerNorm optimization from NVIDIA/apex#1715

* Fix args call

* Ad-hoc apex version check

* Remove unnecessary TransformerConfig arg