I've done a little bit of work in a related area, so I skimmed the paper (at the bottom of the first link), and it's nowhere near as impressive and automagical as the video makes it seem. The user has to provide a mask separating the object they're manipulating from the rest of the image, and then the user also has to supply the 3D model for the object! The model is then deformed to better fit the photographed object, using the mask and the inferred illumination, textured using the image, and then popped out to be manipulated in 3D. Not to detract from how cool this all is, but the user is still doing a lot of the heavy lifting.
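To be concrete about what the system actually automates versus what the user hands it, here's a toy sketch of the pipeline as I read it. Everything here is made up for illustration (the function names, the crude centroid/scale alignment); the real system does an illumination-guided deformation of the supplied model, which I'm only stubbing out:

```python
import numpy as np

def fit_model_to_mask(model_pts_2d, mask):
    """Crudely align projected model points to the user-provided mask by
    matching centroids and spread -- a stand-in for the paper's
    illumination-guided smoothing/deformation step."""
    ys, xs = np.nonzero(mask)
    mask_pts = np.stack([xs, ys], axis=1).astype(float)
    mc, pc = mask_pts.mean(0), model_pts_2d.mean(0)
    scale = mask_pts.std() / max(model_pts_2d.std(), 1e-9)
    return (model_pts_2d - pc) * scale + mc

def texture_from_image(image, mask):
    """Grab the masked pixels as a crude texture for the fitted model."""
    return np.where(mask[..., None], image, 0)

# Toy inputs: both the mask and the model silhouette come from the user.
image = np.random.rand(100, 100, 3)
mask = np.zeros((100, 100), dtype=bool)
mask[30:70, 40:80] = True                            # user-drawn mask
model_silhouette = np.array([[0., 0.], [1., 0.],     # user-supplied model
                             [1., 1.], [0., 1.]])

aligned = fit_model_to_mask(model_silhouette, mask)  # system fits...
texture = texture_from_image(image, mask)            # ...and textures
print(aligned)
```

Note that the two inputs the system can't produce on its own (the mask and the model) are exactly the two things the user provides.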
I bet combining the techniques in this paper with multiple view geometry (which is where I've actually done a bit of work) would be considerably more impressive and automagical.
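For concreteness, here's what the multiple view geometry side buys you: with two or more calibrated views you can triangulate 3D structure directly, instead of asking the user to hand over a model. This is just textbook DLT triangulation with made-up camera matrices, not anything from the paper:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen in two views.
    P1, P2: 3x4 projection matrices; x1, x2: pixel coords (u, v)."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # null vector of A is the 3D point
    X = Vt[-1]
    return X[:3] / X[3]           # dehomogenize

# Toy setup: two cameras with the same intrinsics, one unit apart along x.
K = np.array([[800., 0., 320.], [0., 800., 240.], [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])

X_true = np.array([0.5, 0.2, 4.0, 1.0])     # ground-truth 3D point
x1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]   # its projection in view 1
x2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]   # its projection in view 2

print(triangulate_dlt(P1, P2, x1, x2))      # ~ [0.5, 0.2, 4.0]
```

Recover enough geometry this way and the user-supplied model (and maybe even the mask) could plausibly be replaced by an automatically reconstructed one.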