Background:
I have a model that I'm trying to port to TF 2.0 to get some sweet eager execution, but I can't figure out how to do distributed training (4 GPUs) AND gradient accumulation at the same time.
Problem:
I need to use a custom training loop with a gradient tape because I have a complex multi-model problem (several input models and output models training together). I do not need second-order gradients.
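To make that concrete, here's a minimal sketch of the kind of loop I mean: one non-persistent tape over several sub-models, with all their trainable variables combined into a single first-order gradient step. The sub-models and loss below are just placeholders, not my actual architecture:

```python
import tensorflow as tf

# Placeholder sub-models; the real ones are more complex
encoder_a = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu")])
encoder_b = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu")])
head = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(xa, xb, y):
    # A single non-persistent tape is enough since I only need 1st-order grads
    with tf.GradientTape() as tape:
        h = tf.concat([encoder_a(xa, training=True),
                       encoder_b(xb, training=True)], axis=-1)
        pred = head(h, training=True)
        loss = tf.reduce_mean(tf.keras.losses.mean_squared_error(y, pred))
    # Combine variables from every sub-model for one joint update
    variables = (encoder_a.trainable_variables
                 + encoder_b.trainable_variables
                 + head.trainable_variables)
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```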
Given the size of my model (moderate, something like a medium-sized transformer), the largest batch size I can fit is ~32 across 4 GPUs, and that's the biggest instance I can get hold of. Sadly, these are really old 11 GB K80s, because Azure seems to think that GPUs Google doesn't even give away for free anymore are good enough...
My dataset is very imbalanced, so it requires very large batches to account for that (I'm also using class weighting and focal loss, of course); thus I need to perform 4-8 steps of gradient accumulation to smooth out the gradients.
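For context, the focal loss part is roughly the standard binary focal loss (Lin et al. 2017); the gamma/alpha defaults here are just illustrative, and the class weighting is folded in via alpha:

```python
import tensorflow as tf

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    # Binary focal loss: down-weights easy examples so the rare class
    # contributes more to the gradient. y_pred are probabilities.
    y_true = tf.cast(y_true, y_pred.dtype)
    p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
    alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
    ce = -tf.math.log(tf.clip_by_value(p_t, 1e-7, 1.0))
    return alpha_t * tf.pow(1.0 - p_t, gamma) * ce  # per-example loss
```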
I've read the distributed training loops guide and managed to implement it: https://www.tensorflow.org/beta/tutorials/distribute/training_loops
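Boiled down, my distributed step looks roughly like the guide's (sketch with a placeholder model; `dist_inputs` comes from `strategy.experimental_distribute_dataset`; note that in the 2.0 beta the per-replica call is `strategy.experimental_run_v2`, later renamed `strategy.run`):

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 32
strategy = tf.distribute.MirroredStrategy()  # 4 GPUs -> 4 replicas

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])  # placeholder
    optimizer = tf.keras.optimizers.Adam()

@tf.function
def distributed_train_step(dist_inputs):
    def step_fn(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            logits = model(x, training=True)
            per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=y, logits=logits)
            # Scale by the *global* batch size so per-replica gradients
            # sum to the right thing after the all-reduce
            loss = tf.nn.compute_average_loss(
                per_example, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    per_replica = strategy.experimental_run_v2(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
```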
I've also implemented gradient accumulation in TF 2.0 for custom training loops and tf.keras: https://colab.research.google.com/drive/1yaeRMAwhGkm1voaPp7EtFpSLF33EKhTc
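The accumulation idea, as a minimal single-GPU sketch (simplified, not necessarily identical to the Colab): keep one accumulator variable per trainable variable, add scaled gradients for N micro-batches, then apply and reset. What I can't figure out is how to make this interact correctly with the replica context above.

```python
import tensorflow as tf

ACCUM_STEPS = 4  # 4-8 in my case

model = tf.keras.Sequential([tf.keras.layers.Dense(2)])  # placeholder
model.build((None, 16))  # build so trainable_variables exist
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# One non-trainable accumulator per trainable variable
accum = [tf.Variable(tf.zeros_like(v), trainable=False)
         for v in model.trainable_variables]

@tf.function
def accumulate(x, y):
    with tf.GradientTape() as tape:
        # Divide by ACCUM_STEPS so the applied update is the mean gradient
        loss = loss_fn(y, model(x, training=True)) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    for a, g in zip(accum, grads):
        a.assign_add(g)
    return loss

@tf.function
def apply_and_reset():
    optimizer.apply_gradients(zip(accum, model.trainable_variables))
    for a in accum:
        a.assign(tf.zeros_like(a))
```

The driver loop then calls `accumulate()` on ACCUM_STEPS micro-batches followed by one `apply_and_reset()`.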