Not as "advanced" in image recognition as advertised.
Basically they took the output of a common object classifier, and instead of just picking the single most likely object (which is what a typical classifier does), they leave it in a form where multiple objects are detected in various parts of the scene. Then they train a neural network to generate captions by feeding it training pictures with associated captions.
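Roughly, the "don't just pick the top class" part looks like this (a minimal sketch assuming PyTorch/torchvision as tooling; the paper used a GoogLeNet-style CNN, and resnet18 here is just a stand-in):

    import torch
    import torchvision.models as models

    # Keep the CNN's penultimate feature vector instead of the argmax class.
    # That dense vector still carries evidence about multiple objects in the
    # scene, and it is what the caption generator is trained against.
    cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    cnn.fc = torch.nn.Identity()  # drop the final classification layer
    cnn.eval()

    image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed photo
    with torch.no_grad():
        features = cnn(image)  # shape (1, 512): a scene description, not one label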
According to the paper, it sometimes generates a reasonable description. Other times it reads in a picture of a street sign covered with stickers and emits a caption like "refrigerator filled with lots of food and drink".
Actually the most interesting thing about it is the LSTM-based sentence generator used to turn the detected objects into a caption. LSTMs are notoriously hard to train, and they apparently borrow some results from language translation techniques to get it to form intelligible sentences.
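For the curious, an LSTM decoder of that flavor might look roughly like this (a hypothetical sketch: the class, names, and sizes below are mine, not the paper's; the idea of seeding the decoder with the image feature and then emitting one word per step is the part borrowed from sequence-to-sequence translation):

    import torch
    import torch.nn as nn

    class CaptionDecoder(nn.Module):
        def __init__(self, feat_dim=512, vocab_size=10000, embed_dim=256, hidden_dim=512):
            super().__init__()
            self.img_proj = nn.Linear(feat_dim, embed_dim)  # image feature acts as the first "word"
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, img_feats, captions):
            # Prepend the projected image feature to the embedded caption
            # tokens, then predict the next word at every step (teacher forcing).
            img_tok = self.img_proj(img_feats).unsqueeze(1)   # (B, 1, E)
            words = self.embed(captions)                      # (B, T, E)
            inputs = torch.cat([img_tok, words], dim=1)       # (B, T+1, E)
            hidden, _ = self.lstm(inputs)
            return self.out(hidden)                           # next-word logits

    decoder = CaptionDecoder()
    feats = torch.randn(4, 512)               # CNN features for 4 images
    caps = torch.randint(0, 10000, (4, 12))   # token ids of ground-truth captions
    logits = decoder(feats, caps)             # (4, 13, 10000)

At caption time you'd feed the model's own previous word back in and beam-search instead of teacher forcing, which is exactly the step the translation techniques help with.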
This is all very googly-researchy in that they want to see what the limits of pure data-driven machine learning are (without human tuning). This is not, however, so much an advance in image recognition as an advance in natural-language generation for caption construction.