
Conversation

@thomasw21
Member

Syncing of torch random state.

@thomasw21 thomasw21 changed the base branch from main to olruwase/sync_layer_norms April 4, 2022 09:20
@thomasw21 thomasw21 requested a review from stas00 April 4, 2022 09:20
# We use rank 1 as source of truth and set the new
torch.distributed.broadcast(
    torch_rng_state,
    src=mpu.get_tensor_model_parallel_src_rank() + 1,
Contributor

may I ask why rank 1 and not rank 0?

(which would fail with tp=1, but I'm aware that this is a hack for our particular case with tp=4 so should be fine)

Member Author

Ranks 1, 2, and 3 were synchronized; rank 0 was out of sync. So I thought the path of least change was to force-match 0 to the rest, instead of matching everyone to 0.
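
For context, a minimal sketch of that pattern, assuming an NCCL backend (so the CPU RNG state is moved to GPU for the broadcast). `tp_group` and `tp_src_rank` are placeholders standing in for Megatron's `mpu.get_tensor_model_parallel_group()` and `mpu.get_tensor_model_parallel_src_rank()`; this is not the PR's exact code:

```python
import torch
import torch.distributed as dist


def force_sync_torch_rng_state(tp_group, tp_src_rank):
    """Make every rank in the tensor-parallel group adopt the torch RNG
    state held by ``tp_src_rank + 1`` (i.e. local rank 1 of the group)."""
    # torch.get_rng_state() returns a CPU ByteTensor; move it to GPU so the
    # NCCL broadcast can operate on it.
    rng_state = torch.get_rng_state().cuda()
    dist.broadcast(rng_state, src=tp_src_rank + 1, group=tp_group)
    # All ranks, including the previously out-of-sync rank 0, install the
    # broadcast state locally.
    torch.set_rng_state(rng_state.cpu())
```

Broadcasting from `tp_src_rank + 1` is what makes rank 0 adopt the state already shared by ranks 1-3.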

Contributor

gotcha - makes sense! thank you for explaining.

So is this fix related to #276?

Member Author
@thomasw21 Apr 5, 2022

Yes, sorry, this fix is going to be merged into the force-syncing branch of Meg-DS, olruwase/sync_layer_norms. It implements the force-syncing strategy for the current training run, which is out of sync. It only works because the sampling we use is not random (otherwise we couldn't guarantee that we don't see duplicated samples).

#276 is the general fix that should be integrated into master.

@thomasw21 thomasw21 merged commit d576775 into olruwase/sync_layer_norms Apr 6, 2022
@thomasw21 thomasw21 deleted the thomas/olruwase/sync_layer_norms branch April 6, 2022 16:42
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Dec 18, 2023
…p#277)

* Introduce LayerNorm optimization from NVIDIA/apex#1715

* Fix args call

* Ad-hoc apex version check

* Remove unnecessary TransformerConfig arg