As a researcher in this field not affiliated with this work, I'd say the merit is that most SLAM methods (SLAM is essentially mapping an environment while tracking your position within it) have generally had very little understanding of the map that results. The world is most commonly represented as a fuzzy blob of pixels or voxels.
In contrast, a human might "map" an environment in terms of salient objects, like "the potted plant" or "the set of office chairs." Such a semantic map has several possible advantages --- it could support more natural interactions with people, and it can serve as a powerful regularizer that prevents the robot from learning incorrect maps.
The particular method described in this paper is well executed, and building a system that runs in real time on such a large amount of data is not easy. Of course, many other researchers are also working on semantic mapping.