Implementation of layer-norm in the training script #3256
Conversation
No Taskcluster jobs started for this pull request. The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the default branch) does not allow starting tasks for this pull request.
Master moved to r2.3
Pretty sure we already have those.
Ok, so you just have a few files to add? Please make sure you:
Yes, these are new files that I need to check against r2.3. But I believe they will work just fine.
Do you mind sharing more on that?
The experiments we have done are a little old right now. I'm performing a new training benchmark; I'll let you know soon.
reuben left a comment:
The code changes look good to me. Thanks Bernardo!
Everything is working in the binaries; I've already created a PR for mozilla's tensorflow: mozilla/tensorflow#124
For training with already existing checkpoints you have to initialize the new layers first. This worked for me. I also ran a short test transfer-learning the English checkpoint to German with a small dataset (~32h), but layer-norm didn't help here.
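In case it helps others, a minimal sketch of that initialization step, assuming a TF1-style graph like the training script used at the time; the checkpoint path and the variable partitioning below are illustrative, not the exact code referenced above:

```python
import tensorflow.compat.v1 as tf

# Partition the graph's variables into those present in the old
# (pre-layer-norm) checkpoint and the new LayerNorm ones; restore the
# former and initialize the latter.
reader = tf.train.NewCheckpointReader('old_checkpoint/best_dev')  # illustrative path
ckpt_names = set(reader.get_variable_to_shape_map().keys())

old_vars = [v for v in tf.global_variables() if v.op.name in ckpt_names]
new_vars = [v for v in tf.global_variables() if v.op.name not in ckpt_names]

saver = tf.train.Saver(var_list=old_vars)
with tf.Session() as session:
    session.run(tf.variables_initializer(new_vars))  # fresh gamma/beta etc.
    saver.restore(session, 'old_checkpoint/best_dev')
```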
Maybe training only the reinitialized LayerNorm tensors, freezing the rest of the network (see #3247), and then training the complete network would help here.
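A rough sketch of that freezing idea, assuming a TF1-style graph (built below as a toy stand-in for the acoustic model) and that the LayerNorm variables can be selected by name; both are assumptions:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Toy stand-in for the acoustic model; in the real script `loss` would be
# the CTC loss over the network's outputs.
x = tf.placeholder(tf.float32, [None, 2048])
h = tf.layers.dense(x, 2048, name='dense1')
with tf.variable_scope('layer_norm_1'):
    gamma = tf.get_variable('gamma', [2048], initializer=tf.ones_initializer())
    beta = tf.get_variable('beta', [2048], initializer=tf.zeros_initializer())
    mean, var = tf.nn.moments(h, axes=[-1], keep_dims=True)
    h = gamma * (h - mean) / tf.sqrt(var + 1e-6) + beta
loss = tf.reduce_mean(tf.square(h))

# Train only the LayerNorm tensors by passing an explicit var_list;
# every other variable stays frozen. The name filter is an assumption.
ln_vars = [v for v in tf.trainable_variables() if 'layer_norm' in v.name]
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss, var_list=ln_vars)
```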
@DanBmh you cannot take the weights (checkpoint) from a model trained without layer-norm and use them to finetune/transfer-learn a model that uses layer-norm. The models' architectures are just different. For instance, while the 2nd dense layer from the current checkpoint (without LN) was trained to process tensors in a certain range, the 2nd dense layer in an architecture with LN will process tensors with a completely different range (and a different mean and variance as well). Also, that's why I've put the argument
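To illustrate the point with a quick check (a toy example, not the model's actual activations): layer-norm forces each feature vector to zero mean and unit variance before the learned scale/shift, so a downstream layer tuned to the un-normalized range sees very different inputs:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    # Normalize each feature vector to zero mean / unit variance,
    # then apply the learned scale (gamma) and shift (beta).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# Activations a pre-LN checkpoint's next layer might have been tuned for:
h = np.random.uniform(5.0, 15.0, size=(4, 2048))
print(h.mean(), h.std())            # roughly 10 and 2.9
ln = layer_norm(h)
print(ln.mean(), ln.std())          # roughly 0 and 1
```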
@bernardohenz you're right. I was just hoping that transfer-learning performance wasn't that bad, so I did a short test run.
Thanks @bernardohenz ! Can you send a PR against
This is required for us to be able to run your PR with all your changes and ensure nothing regresses.
@bernardohenz This is not complete, you have not updated
Yes, I was about to post asking for some help with this. What am I supposed to do? Just replace the old sha references ('4336a5b49fa6d650e24dbdba55bcef9581535244') with the new one ('6dc2a1becfd1316eb4d77240133a548e93dbff63')?
Just replace, that is the purpose of those references: our CI will check if the taskcluster index exists, and if not, it will build it.
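For anyone doing the same replacement, a small sketch of the swap; the `taskcluster/` directory and the `*.yml` glob are assumptions about where those references live:

```python
import pathlib

OLD_SHA = '4336a5b49fa6d650e24dbdba55bcef9581535244'
NEW_SHA = '6dc2a1becfd1316eb4d77240133a548e93dbff63'

# Rewrite every reference in the taskcluster definitions in one pass.
for path in pathlib.Path('taskcluster').rglob('*.yml'):
    text = path.read_text()
    if OLD_SHA in text:
        path.write_text(text.replace(OLD_SHA, NEW_SHA))
```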
@bernardohenz Please don't merge but rebase.
@bernardohenz Can you please clean the history? No merge, no "revert" of the previous commit. Force-push is fine, no worries.
Force-pushed from 5e78032 to cdcbfdf
Replacing old sha references ('4336a5b49fa6d650e24dbdba55bcef9581535244') with the new one ('6dc2a1becfd1316eb4d77240133a548e93dbff63')
Force-pushed from cdcbfdf to 77e6c93
I believe this is ok now. Sorry for the mess.
Thanks; one thing I forgot: can you please update
You can see the progress on the Community-TC link 😊 |
Nice :D |
macOS CI was a bit of a burden (it always is), but it's green in the end. I'm going to merge your TensorFlow part, then re-run this PR with the new sha1 and take care of the rest.
PR that allows the use of layer-norm in the model. For our experiments, it allows for training more epochs without overfitting.
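As an illustration of the kind of change involved (the actual diff is in the PR; the `use_layer_norm` switch below is a hypothetical stand-in for the command-line argument mentioned above):

```python
import tensorflow.compat.v1 as tf

def dense_block(x, units, use_layer_norm=False, name=None):
    # Dense layer optionally followed by layer normalization;
    # `use_layer_norm` mirrors the PR's opt-in flag so old configurations
    # keep the original architecture.
    h = tf.keras.layers.Dense(units, name=name)(x)
    if use_layer_norm:
        h = tf.keras.layers.LayerNormalization()(h)
    return tf.nn.relu(h)
```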
PS: I'll be creating a PR in your tensorflow repository, as layer-norm uses some dependencies that are not included in your build rule. Nonetheless, I'm sending the new rule here (which was found in `tensorflow/core/kernels/BUILD`). When compiling the 0.6.1 version, the following rule worked just fine. I'll be compiling the binaries on `master` to check if it still works.