With regards to training, It is possible to perform this learning task without direct supervised (tagged data) training.
Imagine the following:
Take a trillions of images from the web, and use unsupervised, clustering methods to group images into groups of equivalence, given that you have great features that allow you to do that.
Then, given a cluster of millions of examples, take the surrounding text around the images source and try to find common denominators in the text. It's not far fetched to think that similar objects in the images will have similar words in the text.
Such "Big Data" research is now being done in various research facilities around the world.