In machine learning, you usually need large amounts of data to train a model that generalizes well. If you’re working with image recognition, even tens of thousands of images may not be enough for your model to pick up all of the general traits.
For this reason, one of the biggest problems is a lack of data to train on. This can sometimes be solved with transfer learning, but that usually requires that the original model was trained on something similar. For example, if you want to train a model that recognizes different pieces of furniture, it would be unwise to start from a network that was trained for facial recognition.
For many purposes, you can use free datasets available online (like ImageNet or MNIST), but sometimes your data needs to be more specific, and that kind of dataset isn’t available anywhere.
For example, if you want to build an app that recognizes dogs and cats, there might be a suitable dataset out there to use. But let’s say you would like to detect different breeds of dogs. Then you have a slightly different problem, and your original model may no longer be able to differentiate between the types and give you meaningful results.
That’s where data augmentation comes in. Let’s say we have a hundred images of apples and bananas. We could take each one of those and mirror it horizontally, vertically, and both ways. We would then end up with a total of 400 images instead of 100. We could also modify the colors or add noise in different variations, and end up with perhaps 20× the amount of data.
Here is an example of what that could look like.
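To make the flipping step concrete, here is a minimal sketch using NumPy arrays as stand-ins for real images (the function name is my own, not from any particular library):

```python
import numpy as np

def augment_with_flips(image):
    """Return the original image plus its horizontal flip,
    vertical flip, and the combination of both (4 variants)."""
    return [
        image,
        np.fliplr(image),             # mirror left-right
        np.flipud(image),             # mirror top-bottom
        np.flipud(np.fliplr(image)),  # both
    ]

# A tiny 2x2 array stands in for a real photo.
img = np.array([[1, 2],
                [3, 4]])
variants = augment_with_flips(img)
print(len(variants))  # 4 variants per image, so 100 images become 400
```

With a real image library such as Pillow you would do the same thing with its flip operations, but the counting works out identically: four variants per source image.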
One issue with this approach is that the augmented data is still quite similar to the original data. If we rely too heavily on augmentation, the model could end up finding patterns in the background, or in some other feature that is not of interest, instead.
To fix this we could use an alternative approach: cut out the relevant piece from the image and paste it onto different backgrounds, so we end up with vastly different images. The purpose of this is to make the model ignore whatever is in the background and focus only on the object that we want it to identify.
It usually doesn’t matter whether the piece looks like it belongs in the scene, since the object is still visibly there and should be recognized (within reason, of course).
Below is an example with a piece from a chair.
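The cut-and-paste step itself is just array slicing. Here is a minimal sketch, again using small NumPy arrays in place of real images (the helper name is hypothetical):

```python
import numpy as np

def paste_cutout(cutout, background, top, left):
    """Paste a rectangular cutout onto a copy of the background
    at the given (top, left) position."""
    out = background.copy()
    h, w = cutout.shape[:2]
    out[top:top + h, left:left + w] = cutout
    return out

# Tiny arrays stand in for real images.
cutout = np.ones((2, 2), dtype=int)       # the object we cut out
background = np.zeros((4, 4), dtype=int)  # a new scene
composite = paste_cutout(cutout, background, 1, 1)
```

With real images you would use an alpha mask so only the object’s pixels are copied, but the idea is the same: the object lands in a scene it has never appeared in.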
Personally, when I tried this approach it increased my validation accuracy significantly.
End note: doing this work manually would still be tedious, so ideally you would cut out a few pieces, gather many backgrounds, and have a script that automatically pastes each cutout onto every background image.
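Such a script could be sketched as follows, assuming the cutout and backgrounds are already loaded as arrays (positions are chosen at random with a seeded generator so the run is reproducible; all names here are my own):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded for reproducibility

def generate_composites(cutout, backgrounds):
    """Paste the cutout at a random valid position onto a copy
    of every background, returning one composite per background."""
    h, w = cutout.shape[:2]
    samples = []
    for bg in backgrounds:
        top = rng.integers(0, bg.shape[0] - h + 1)
        left = rng.integers(0, bg.shape[1] - w + 1)
        sample = bg.copy()
        sample[top:top + h, left:left + w] = cutout
        samples.append(sample)
    return samples

# Tiny arrays stand in for real images: one cutout, three backgrounds.
cutout = np.full((2, 2), 9)
backgrounds = [np.zeros((5, 5), dtype=int) for _ in range(3)]
dataset = generate_composites(cutout, backgrounds)
```

With a few cutouts and a folder of background photos, the same loop run over every (cutout, background) pair multiplies a handful of source images into a sizeable training set.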