A computer program called the Never Ending Image Learner (NEIL) is running 24 hours a day at Carnegie Mellon University, searching the Web for images, doing its best to understand them on its own and, as it builds a growing visual database, gathering common sense on a massive scale.
NEIL leverages recent advances in computer vision that enable computer programs to identify and label objects in images, to characterize scenes and to recognize attributes, such as colors, lighting and materials, all with a minimum of human supervision. In turn, the data it generates will further enhance the ability of computers to understand the visual world.
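NEIL trains its own detectors from the web data it gathers, but the generic building block it relies on can be sketched with an off-the-shelf model. Below is a minimal illustration of per-image object labeling using torchvision's pretrained Faster R-CNN detector; the image path is a placeholder, and this is a sketch of the general technique, not NEIL's actual pipeline.

```python
# Minimal sketch of per-image object labeling with a pretrained detector.
# Not NEIL's pipeline; "photo.jpg" is a placeholder path.
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)
from torchvision.transforms.functional import convert_image_dtype

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]  # COCO class names

img = convert_image_dtype(read_image("photo.jpg"), torch.float)
with torch.no_grad():
    det = model([img])[0]  # dict of "boxes", "labels", "scores"

for label, score in zip(det["labels"], det["scores"]):
    if score > 0.8:  # report confident detections only
        print(categories[label.item()], round(float(score), 2))
```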
While this is certainly a welcome advance, examining some of the not-so-accurate "attributes" and "relationships" that NEIL teases from the text associated with each image makes clear that such a system would be much better if it had true understanding of language rather than simple statistics. An IBM Watson-style crawler performing the same task would likely do a much better job, better still when crawling within a specific knowledge domain such as medicine.
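To make the failure mode concrete, here is a toy illustration (my assumption about the mechanism, not NEIL's actual algorithm) of what purely statistical mining of caption text looks like: word pairs that co-occur often get promoted to "relationships," with no understanding of what the words mean.

```python
# Toy co-occurrence mining over caption text: frequency, not meaning,
# decides what counts as an "attribute" or "relationship".
from collections import Counter
from itertools import combinations

captions = [
    "yellow taxi on a wet street",
    "yellow raincoat in the rain",
    "yellow warning sign near the street",
]

pairs = Counter()
for cap in captions:
    for a, b in combinations(sorted(set(cap.split())), 2):
        pairs[(a, b)] += 1

# Top pairs get reported as relationships whether or not they make sense
print(pairs.most_common(3))
```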
Google is involved in the project, so we can all imagine the day, perhaps after another year's worth of crawling/training, when we can send an image to the cloud service and have the service return text describing the objects, object attributes, relationships, and scene contained within the image. With a little further sophistication, it could return an image clip for each detected object, or, in reverse, render a scene given the textual descriptions: "OK, Google. Create image with [objects, attributes, etc.]."
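As a thought experiment, the imagined service might look something like the sketch below. Everything here is hypothetical: the endpoint, the request format, and the response shape are inventions for illustration, not a real Google API.

```python
# Hypothetical client for the imagined image-description service.
# The endpoint and the response shape are invented for illustration.
import json
import urllib.request

def describe_image(image_path: str, endpoint: str) -> dict:
    """POST raw image bytes; expect a JSON scene description back."""
    with open(image_path, "rb") as f:
        req = urllib.request.Request(
            endpoint,
            data=f.read(),
            headers={"Content-Type": "application/octet-stream"},
        )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Imagined response:
# {"scene": "city street",
#  "objects": [{"name": "taxi", "attributes": ["yellow"]}],
#  "relationships": [["taxi", "on", "street"]]}
# desc = describe_image("photo.jpg", "https://example.com/v1/describe")
```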
It's not hard to imagine the next step being video/audio crawling, where common motions (physics), relationships in motion, and the sounds of objects and contexts are learned. And when movie-scene synthesis arrives... "OK, Google. Create movie scene with [actors, objects, scene, motions, etc.] and send it to..."