
This question may already have been answered, but I couldn't find a simple answer to it. I created a convnet using Keras to classify The Simpsons characters (dataset here).
I have 20 classes, and given an image as input, the model returns the character's name. It's pretty simple. My dataset contains pictures with the main character in the frame, and each picture is labeled only with that character's name.
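
For context, here is a minimal sketch of the kind of model I mean (the layer sizes, the 128x128 input and the compile settings are simplified placeholders, not my actual code):

```python
# Minimal sketch of a 20-class character classifier in Keras.
# Layer sizes and the 128x128 input are placeholders, not the real model.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(20, activation="softmax"),  # one output per character
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```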

Now I would like to add an object detection task, i.e. draw a bounding box around the characters in a picture and predict which character it is. I don't want to use a sliding window because it's really slow, so I thought about using Faster R-CNN (github repo) or YOLO (github repo). Do I have to add bounding-box coordinates for every picture in my training set? Is there a way to do object detection (and get bounding boxes at test time) without providing coordinates for the training set?

In short, I would like to create a simple object detection model; I don't know whether it's possible to build a simpler version of YOLO or Faster R-CNN.

Thank you very much for any help.

A. Attia

2 Answers


The goal of YOLO or Faster R-CNN is to produce the bounding boxes, so in short: yes, you will need to label your data with bounding boxes in order to train them.
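
If you go the darknet-style YOLO route, for example, each training image foo.jpg gets a companion label file foo.txt with one line per box in the form `<class_id> <x_center> <y_center> <width> <height>`, all normalized to the image size. A made-up example with two boxes:

```
0 0.48 0.55 0.30 0.62
5 0.81 0.40 0.22 0.55
```

Faster R-CNN implementations typically expect Pascal VOC-style XML or a similar format instead, but the information you have to provide is the same: one class and one box per object.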

You can take a shortcut (a rough sketch of this loop follows the list):

  1) Label a handful of bounding boxes (let's say 5 per character).
  2) Train Faster R-CNN or YOLO on this very small dataset.
  3) Run the resulting model against the full dataset.
  4) It will get some detections right and a lot of them wrong.
  5) Retrain the detector on the examples that are correctly bounded; your training set should be much bigger now.
  6) Repeat until you get the result you want.
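
A rough sketch of that loop in Python, where `load_initial_labels`, `train` and `detect` are placeholders for whatever training/inference routines your Faster R-CNN or YOLO implementation exposes (they are not real library calls), and the paths and the 0.9 confidence threshold are made up:

```python
import glob

# Placeholders for your detector's own API -- not real library functions:
#   load_initial_labels: reads the ~5 hand-drawn boxes per character (step 1)
#   train:  trains the detector on a list of (image_path, box, class) labels
#   detect: returns (box, class, confidence) tuples for one image

hand_labeled = load_initial_labels("hand_labeled/")
unlabeled_images = glob.glob("simpsons_dataset/**/*.jpg", recursive=True)

labels = list(hand_labeled)
for _ in range(5):                            # step 6: repeat until results look good
    model = train(labels)                     # steps 2 and 5: (re)train the detector
    pseudo_labels = []
    for path in unlabeled_images:             # step 3: run on the full dataset
        for box, cls, conf in detect(model, path):
            if conf > 0.9:                    # steps 4-5: keep only detections you trust
                pseudo_labels.append((path, box, cls))
    labels = list(hand_labeled) + pseudo_labels
```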
Andrew Tu

You may already have a suitable architecture in mind: "Now I would like to add an object detection task, i.e. draw a bounding box around the characters in a picture and predict which character it is."

So you just split the task into two parts:
1. Add an object detector for person detection to return bounding boxes
2. Classify bounding boxes using the convnet you already trained

For part 1 you should be good to go by using a feature extractor (for example a convnet pretrained on COCO or ImageNet) with an object detector (again YOLO or Faster R-CNN) on top to detect people. However, you may find that people in "cartoons" (let's say the Simpsons are people) are not properly recognized, because the feature extractor was trained on real images rather than cartoon-style ones. In that case, you could retrain a few layers of the feature extractor on cartoon pictures so that it learns cartoon features, following the transfer learning methodology.
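
A rough sketch of that two-stage pipeline, using torchvision's COCO-pretrained Faster R-CNN purely because it ships with weights (any COCO-trained person detector would do); the saved-model path, the 224x224 classifier input size and the one-folder-per-character layout are assumptions:

```python
import os
import numpy as np
import torch
import torchvision
from PIL import Image
from tensorflow import keras

# Part 1: a COCO-pretrained detector proposes "person" boxes (COCO class 1 = person).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")  # torchvision >= 0.13
detector.eval()

# Part 2: the 20-class character classifier you already trained (path/layout assumed).
classifier = keras.models.load_model("simpsons_convnet.h5")
class_names = sorted(os.listdir("simpsons_dataset"))  # assumes one sub-folder per character

def detect_and_classify(image_path, score_thresh=0.7):
    image = Image.open(image_path).convert("RGB")
    tensor = torchvision.transforms.functional.to_tensor(image)
    with torch.no_grad():
        det = detector([tensor])[0]               # dict with "boxes", "labels", "scores"
    results = []
    for box, label, score in zip(det["boxes"], det["labels"], det["scores"]):
        if label.item() != 1 or score.item() < score_thresh:
            continue                              # keep only confident "person" boxes
        x1, y1, x2, y2 = map(int, box.tolist())
        crop = image.crop((x1, y1, x2, y2)).resize((224, 224))  # assumed classifier input
        probs = classifier.predict(np.asarray(crop)[None] / 255.0, verbose=0)[0]
        results.append(((x1, y1, x2, y2), class_names[int(np.argmax(probs))]))
    return results

# Example: print (box, character_name) pairs for one frame.
print(detect_and_classify("some_frame.jpg"))
```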

Michelagio