
Background: I have a model that I'm trying to port to TF 2.0 to get some sweet eager execution, but I just can't figure out how to do distributed training (4 GPUs) AND gradient accumulation at the same time.

Problem:

  • I need to be able to use a custom training loop with gradient tape because I have a complex multi-model problem (several input models and output models training together); I do not need second-order gradients

  • With the size of my model (moderate, something like a medium-sized transformer) I can't get a batch size larger than ~32 on the 4-GPU instance, which is the largest I can get hold of. Sadly, these are really old 11 GB K80s, because Azure seems to think GPUs that Google doesn't even give away for free anymore are good enough

  • I have a dataset that requires very large batches because I have to account for a very big class imbalance (I'm also using class weighting and focal loss, of course; a rough sketch of the loss is right below this list), so I need to perform 4-8 steps of gradient accumulation to smooth out the gradients.
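
For context, by focal loss I mean the standard binary form from the RetinaNet paper, with the alpha term doubling as class weighting. A generic sketch (not my exact code); it returns per-example values with no reduction, so it can still be averaged over the global batch under a distribution strategy:

```python
import tensorflow as tf

def binary_focal_loss(labels, logits, alpha=0.25, gamma=2.0):
    """Per-example binary focal loss (Lin et al. 2017) on raw logits."""
    labels = tf.cast(labels, logits.dtype)
    ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    p = tf.sigmoid(logits)
    p_t = labels * p + (1.0 - labels) * (1.0 - p)               # probability of the true class
    alpha_t = labels * alpha + (1.0 - labels) * (1.0 - alpha)   # class weighting
    return alpha_t * tf.pow(1.0 - p_t, gamma) * ce              # no reduction here
```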

I've read the distributed training loops guide and managed to implement it: https://www.tensorflow.org/beta/tutorials/distribute/training_loops
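
The part I have working follows the guide's pattern; roughly this, where `build_model` and `train_dataset` stand in for my actual models and data (and on TF 2.0/2.1 `strategy.run` was still called `strategy.experimental_run_v2`):

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 32
EPOCHS = 10

strategy = tf.distribute.MirroredStrategy()    # picks up all 4 GPUs

with strategy.scope():
    model = build_model()                      # placeholder for my actual models
    optimizer = tf.keras.optimizers.Adam(1e-4)
    loss_obj = tf.keras.losses.BinaryCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(labels, logits):
    per_example = loss_obj(labels, logits)
    # divide by the *global* batch size, not the per-replica one
    return tf.nn.compute_average_loss(per_example,
                                      global_batch_size=GLOBAL_BATCH_SIZE)

def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        loss = compute_loss(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    # strategy.run is experimental_run_v2 on TF 2.0/2.1
    per_replica_losses = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_losses, axis=None)

dist_dataset = strategy.experimental_distribute_dataset(train_dataset)  # placeholder dataset
for epoch in range(EPOCHS):
    for batch in dist_dataset:
        loss = distributed_train_step(batch)
```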

I've also implemented gradient accumulation in TF 2.0 for custom training loops and tf.keras: https://colab.research.google.com/drive/1yaeRMAwhGkm1voaPp7EtFpSLF33EKhTc
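
The accumulation part on its own (single device) looks roughly like this; `build_model`, `loss_fn` and `dataset` are again placeholders:

```python
import tensorflow as tf

ACCUM_STEPS = 4

model = build_model()                          # placeholder
optimizer = tf.keras.optimizers.Adam(1e-4)
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# one non-trainable accumulator per trainable variable
accum_grads = [tf.Variable(tf.zeros_like(v), trainable=False)
               for v in model.trainable_variables]

@tf.function
def micro_step(features, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(features, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for acc, g in zip(accum_grads, grads):
        acc.assign_add(g)                      # accumulate, don't apply yet
    return loss

@tf.function
def apply_accumulated():
    optimizer.apply_gradients(
        [(acc / ACCUM_STEPS, var)              # average over the micro-batches
         for acc, var in zip(accum_grads, model.trainable_variables)])
    for acc in accum_grads:
        acc.assign(tf.zeros_like(acc))         # reset for the next effective batch

for step, (features, labels) in enumerate(dataset):   # placeholder dataset
    micro_step(features, labels)
    if (step + 1) % ACCUM_STEPS == 0:
        apply_accumulated()
```

What I can't work out is how to combine the two: whether the accumulator variables should be created under strategy.scope(), and how the per-replica assign_add updates are supposed to be reduced before optimizer.apply_gradients.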

  • Did you solve this issue? – Stefan Falk Nov 10 '20 at 16:22
  • If you had really small batches, would that make your model fit? Very big imbalances can be handled by only applying backprop on the worst samples (a rough sketch of that idea follows the comments). There is a very good paper on it that I cover in this video: https://www.youtube.com/watch?v=pglJizzJsD4; it handles your imbalance and batch size at the same time. – Anton Codes Nov 20 '20 at 04:14
  • Here is how Nvidia does it for BERT, if that is of any help: https://github.com/NVIDIA/DeepLearningExamples/blob/ae76b894b96c6102a7b53468bdbb6099c843382b/TensorFlow/LanguageModeling/BERT/optimization.py#L112 – y.selivonchyk Dec 18 '20 at 03:20
  • Have you read https://www.tensorflow.org/tutorials/distribute/custom_training? – DachuanZhao Mar 11 '21 at 02:12
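
A rough sketch of the hard-example idea from the comment above (keep only the highest-loss examples in the batch and backprop through those); keep_fraction and the single-logit-per-example setup are just for illustration:

```python
import tensorflow as tf

def worst_k_loss(labels, logits, keep_fraction=0.25):
    """Average loss over only the hardest examples in the batch."""
    labels = tf.cast(labels, logits.dtype)
    per_example = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    batch = tf.cast(tf.shape(per_example)[0], tf.float32)
    k = tf.maximum(1, tf.cast(batch * keep_fraction, tf.int32))
    worst, _ = tf.math.top_k(per_example, k=k)   # gradients only flow to the kept examples
    return tf.reduce_mean(worst)
```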

1 Answer


Read https://www.tensorflow.org/tutorials/distribute/custom_training and update your question if you still have any questions.

DachuanZhao
    This doesn't seem to be an answer to the Question. Please visit the [tour](https://stackoverflow.com/tour) and [how to answer](https://stackoverflow.com/help/how-to-answer) to see how Answers on Stack Overflow work. Also see [Your answer is in another castle: when is an answer not an answer?](https://meta.stackexchange.com/questions/225370/your-answer-is-in-another-castle-when-is-an-answer-not-an-answer) – Scratte Mar 11 '21 at 10:16