They should just devise a better contest, quite frankly, with combination categories or lists of "what's in the picture in relation to each other", like "wine in a glass" vs. "wine glass and a wine bottle".
Yes, they should 'just' create a better contest. The issue is that creating a contest (identifying objects, labeling, testing, error-correcting, etc.) is a slow, expensive, and unglamorous process. The ILSVRC is only a couple of years old, and it is already showing its age; I really don't think they expected it to be solved until much later.
So, what's next in terms of contests? Probably a multi-object challenge, where a picture can contain many objects; alternatively, the task would be to label not only the main object but also its parts. The previous contests were limited because each image had a single primary labeled object. ILSVRC doesn't even use a bounding box (which Pascal VOC did). So the next step is to create a data set with lots of objects, all of them labeled, where the computer has to draw the boundary (not just the bounding box) around each object.
Scoring performance is a pain in many of these contests, and eventually it becomes kind of arbitrary. How do you decide that a predicted bounding box correctly covers the ground truth bounding box? Any threshold (e.g. 50% overlap) is going to be arbitrary. Doing it for object boundaries is going to be even harder.
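To make the overlap question concrete, here is a minimal sketch of one common way to score it: intersection over union (IoU) between the predicted box and the ground truth box, compared against a cutoff. The box layout `(x_min, y_min, x_max, y_max)`, the function names, and the example coordinates are my own assumptions for illustration, not any contest's official scoring code.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    width = max(0.0, box[2] - box[0])
    height = max(0.0, box[3] - box[1])
    return width * height


def iou(pred: Box, truth: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if there is one).
    ix_min = max(pred[0], truth[0])
    iy_min = max(pred[1], truth[1])
    ix_max = min(pred[2], truth[2])
    iy_max = min(pred[3], truth[3])

    intersection = box_area((ix_min, iy_min, ix_max, iy_max))
    union = box_area(pred) + box_area(truth) - intersection
    return intersection / union if union > 0 else 0.0


def is_hit(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Count the prediction as correct if IoU exceeds the chosen cutoff."""
    return iou(pred, truth) > threshold


# Hypothetical example: a prediction shifted 2 pixels to the right.
truth_box: Box = (0.0, 0.0, 10.0, 10.0)
pred_box: Box = (2.0, 0.0, 12.0, 10.0)

score = iou(pred_box, truth_box)
hit_at_half = is_hit(pred_box, truth_box, threshold=0.5)
hit_at_strict = is_hit(pred_box, truth_box, threshold=0.7)
```

The same prediction counts as correct at a 0.5 cutoff and wrong at 0.7, which is the arbitrariness I mean. For object boundaries you would have to compare pixel regions or contours instead of four numbers, and the choice of measure gets even less obvious.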