I am trying to better understand how the YOLOv2 and YOLOv3 algorithms work. The network applies a series of convolutions until the feature map is reduced to a 13x13 grid. It then predicts, for each grid cell, the class of an object as well as the bounding box for that object.
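To make the setup concrete, here is a minimal sketch of how I understand the grid. It assumes a 416x416 input and a total downsampling stride of 32 (neither number is stated above; they are the values I believe are commonly used to arrive at 13x13), and the function name is my own.

```python
# Assumed values: 416x416 input and an overall stride of 32.
INPUT_SIZE = 416
STRIDE = 32
GRID = INPUT_SIZE // STRIDE  # number of cells per side

def cell_for_center(cx_px, cy_px, stride=STRIDE):
    """Return (col, row) of the grid cell containing a pixel-space point."""
    return int(cx_px // stride), int(cy_px // stride)

if __name__ == "__main__":
    print(GRID)
    print(cell_for_center(200, 100))
```

Under these assumptions each cell covers a 32x32 pixel patch of the input, which is why a box spanning several cells puzzles me.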
If you look at this picture, you see that the bounding box in red is larger than any individual grid cell. Also, the bounding box is centered on the object.
My questions have to do with how the predicted bounding boxes can exceed the size of a grid cell when the network activations are tied to individual grid cells. I mean, everything outside a grid cell should be unknown to the neurons predicting the bounding box for an object detected in that cell, right?
More precisely, here are my questions:
1. How does the algorithm predict bounding boxes that are larger than the grid cell?
2. How does the algorithm know in which cell the center of the object is located?
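
For reference, this is my understanding of the box parameterization described in the YOLOv2 and YOLOv3 papers, written as a rough sketch rather than the actual implementation. The function and argument names are my own, and all quantities are in grid-cell units: the center is a sigmoid offset inside the cell, while the width and height scale an anchor (prior) box and are not bounded by the cell.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, col, row, anchor_w, anchor_h):
    """Turn raw outputs (tx, ty, tw, th) for cell (col, row) into a box.

    Hypothetical helper: anchor_w and anchor_h are the prior box size
    in grid-cell units.
    """
    bx = col + sigmoid(tx)        # center x stays inside the cell
    by = row + sigmoid(ty)        # center y stays inside the cell
    bw = anchor_w * math.exp(tw)  # width can exceed one cell
    bh = anchor_h * math.exp(th)  # height can exceed one cell
    return bx, by, bw, bh

if __name__ == "__main__":
    # Zero raw outputs: center in the middle of cell (6, 3), size equal to the anchor.
    print(decode_box(0.0, 0.0, 0.0, 0.0, col=6, row=3, anchor_w=3.0, anchor_h=5.0))
```

If this reading is right, the size comes from the anchor and an exponential scale, but that still does not tell me how the neurons for one cell see enough of the image to choose that scale.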