Background:
I have a model that I'm trying to port to TF 2.0 to get some sweet eager execution, but I can't figure out how to do distributed training (4 GPUs) AND gradient accumulation at the same time.
Problem:
I need to use a custom training loop with a gradient tape because I have a complex multi-model problem (several input models and output models training together). I do not need second-order gradients.
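To make that concrete, here's a minimal sketch of the kind of loop I mean: one non-persistent tape over several sub-models, with all their trainable variables combined into a single first-order gradient step. The sub-models and loss below are just placeholders, not my actual architecture:

```python
import tensorflow as tf

# Placeholder sub-models; the real ones are more complex
encoder_a = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu")])
encoder_b = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu")])
head = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(xa, xb, y):
    # A single non-persistent tape is enough since I only need 1st-order grads
    with tf.GradientTape() as tape:
        h = tf.concat([encoder_a(xa, training=True),
                       encoder_b(xb, training=True)], axis=-1)
        pred = head(h, training=True)
        loss = tf.reduce_mean(tf.keras.losses.mean_squared_error(y, pred))
    # Combine variables from every sub-model for one joint update
    variables = (encoder_a.trainable_variables
                 + encoder_b.trainable_variables
                 + head.trainable_variables)
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```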
Given the size of my model (moderate, something like a medium-sized transformer), the largest batch size I can fit is ~32 across 4 GPUs, and that's the biggest instance I can get hold of. Sadly, these are really old 11 GB K80s, because Azure seems to think that GPUs Google doesn't even give away for free anymore are good enough...
My dataset is very imbalanced, so it requires very large batches to account for that (I'm also using class weighting and focal loss, of course); thus I need to perform 4-8 steps of gradient accumulation to smooth out the gradients.
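For context, the focal loss part is roughly the standard binary focal loss (Lin et al. 2017); the gamma/alpha defaults here are just illustrative, and the class weighting is folded in via alpha:

```python
import tensorflow as tf

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    # Binary focal loss: down-weights easy examples so the rare class
    # contributes more to the gradient. y_pred are probabilities.
    y_true = tf.cast(y_true, y_pred.dtype)
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    ce = -tf.math.log(tf.clip_by_value(p_t, 1e-7, 1.0))
    return alpha_t * tf.pow(1.0 - p_t, gamma) * ce  # per-example loss
```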
I've read the distributed training loops guide and managed to implement it: https://www.tensorflow.org/beta/tutorials/distribute/training_loops
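Boiled down, my distributed step looks roughly like the guide's (sketch with a placeholder model; `dist_inputs` comes from `strategy.experimental_distribute_dataset`; note that in the 2.0 beta the per-replica call is `strategy.experimental_run_v2`, later renamed `strategy.run`):

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 32
strategy = tf.distribute.MirroredStrategy()  # 4 GPUs -> 4 replicas

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])  # placeholder
    optimizer = tf.keras.optimizers.Adam()

@tf.function
def distributed_train_step(dist_inputs):
    def step_fn(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=y, logits=logits)
            # Scale by the *global* batch size so per-replica gradients
            # sum to the right thing after the all-reduce
            loss = tf.nn.compute_average_loss(
                per_example, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica = strategy.experimental_run_v2(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
```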
I've also implemented gradient accumulation in TF 2.0 for custom training loops and tf.keras: https://colab.research.google.com/drive/1yaeRMAwhGkm1voaPp7EtFpSLF33EKhTc
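The accumulation idea, as a minimal single-GPU sketch (simplified, not necessarily identical to the Colab): keep one accumulator variable per trainable variable, add scaled gradients for N micro-batches, then apply and reset. What I can't figure out is how to make this interact correctly with the replica context above.

```python
import tensorflow as tf

ACCUM_STEPS = 4  # 4-8 in my case

model = tf.keras.Sequential([tf.keras.layers.Dense(2)])  # placeholder
model.build((None, 16))  # build so trainable_variables exist
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# One non-trainable accumulator per trainable variable
accum = [tf.Variable(tf.zeros_like(v), trainable=False)
         for v in model.trainable_variables]

@tf.function
def accumulate(x, y):
    with tf.GradientTape() as tape:
        # Divide by ACCUM_STEPS so the applied update is the mean gradient
        loss = loss_fn(y, model(x, training=True)) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    for a, g in zip(accum, grads):
        a.assign_add(g)
    return loss

@tf.function
def apply_and_reset():
    optimizer.apply_gradients(zip(accum, model.trainable_variables))
    for a in accum:
        a.assign(tf.zeros_like(a))
```

The driver loop then calls `accumulate()` on ACCUM_STEPS micro-batches followed by one `apply_and_reset()`.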