Researchers Teach Computers To Perceive 3D from 2D

hamilton76 writes to tell us that researchers at Carnegie Mellon have found a way to allow computers to extrapolate 3-dimensional models from 2-dimensional pictures. From the article: "Using machine learning techniques, Robotics Institute researchers Alexei Efros and Martial Hebert, along with graduate student Derek Hoiem, have taught computers how to spot the visual cues that differentiate between vertical surfaces and horizontal surfaces in photographs of outdoor scenes. They've even developed a program that allows the computer to automatically generate 3-D reconstructions of scenes based on a single image. [...] Identifying vertical and horizontal surfaces and the orientation of those surfaces provides much of the information necessary for understanding the geometric context of an entire scene. Only about three percent of surfaces in a typical photo are at an angle, they have found."
This discussion has been archived. No new comments can be posted.

  • Awesome! (Score:5, Funny)

    by rblum ( 211213 ) on Wednesday June 14, 2006 @03:19PM (#15534413)
    Now run it on an Escher picture!
  • leaning tower (Score:3, Interesting)

    by ZivZoolander ( 964472 ) on Wednesday June 14, 2006 @03:21PM (#15534434)
    Wonder how this will handle those optical illusion photos, like me knocking over the Leaning Tower of Pisa, or holding the Statue of Liberty.
  • ...challenge. I think Carnegie Mellon wants revenge against Stanford for beating them in the 2005 DARPA Grand Challenge. Maybe 2007 will be Carnegie Mellon's year to win the Grand Challenge. If this happens, we're only a hop, skip, and a jump from having these things drive us around (especially on freeways).
    • Bus-ted. (Score:1, Funny)

      by Anonymous Coward
      "If this happens, we're only a hop skip and a jump to having these things drive us around (esp on freeways)."

      Man, that would be a pretty neat invention [basintransit.com].
    • Granted you can extrapolate an estimate of the surroundings for a 3d scene from a single image.
      This is good when the source material doesn't exist.

      However, if I were in the grand challenge I wouldn't be swapping out the (minimum) stereo imaging most cars appear to have.

      1) it's an approximation and may not be applicable for different terrain or obstacles (a similar rock against a similar floor)
      2) it's harder to fool two cameras than a single one; glitches could send you off a cliff.
      3) with a stereo pair you can interpo
  • One could conceivably take pictures of a city, upload them to this program, stitch the pieces together, and then import them into a game world. How awesome would it be to actually be able to run around a city (say, Toronto) and do things you always wanted to do... (dropping a penny off of the CN Tower and having it hit someone :D)
    • The Getaway [gamerankings.com] already has a startlingly accurate virtual London.
    • Errr... (Score:5, Informative)

      by Ayanami Rei ( 621112 ) * <rayanami AT gmail DOT com> on Wednesday June 14, 2006 @03:26PM (#15534476) Journal
      You've always been able to do that.
      Cities aren't the kind of thing this is targeted at.
      You can get building plans and architectural drawings and everything from the city for free. There are algorithms that can easily map pictures to objects if you know ahead of time the shape of the things that "should" be there.

      This stuff is for deciding the shape of unknown things, and more importantly, to gain new heuristics for image searches.

      With this technology, you could ask for "things that are round, and have a box".

      More importantly, you could show the computer one picture of something, and have it attempt to find more pictures of it (from different angles, with different colors, etc.). Like you show it a Volvo C90, and it shows you any and all pictures of Volvo C90s by the shape.

      • How about building a 3D representation of a terrorism suspect?

        There's your grant money right there, boys!
      • With this technology, you could ask for "things that are round, and have a box"

        Really...

        hmm...

        I was thinking "things that are round, and have a nipple"
      • This is only for outdoor scenes and only extracts planar information. It isn't designed for objects at all. It provides general geometric context, i.e., this area is ground, this area is a left-facing wall, etc. That's not to say that a similar technique couldn't be used for identifying round objects, but that isn't what this is for.
      • Re:Errr... (Score:3, Funny)

        by jackbird ( 721605 )
        You can get building plans and architectural drawings and everything from the city for free. There are algorithms that can easily map pictures to objects if you know ahead of time the shape of the things that "should" be there.

        Dear Sir,

        ha ha ha.

        ha ha ha ha ha ha ha.

        ha.

        If only.

        Signed,

        every CAD operator in the world

        • This all presupposes you can translate the diagram accurately and position it in the 3D world. You'd probably need GPS readings at different points on the building, and on the camera, to get decent results.

          And you need a light model and surface texture models (or a lot of pictures from different angles).

          So this isn't trivial. But it's doable. Such techniques are used in film for scene composition and for texturing 3d representations of real-world objects.

          It's not like you can just take a picture of a buildi
          • Re:Well... (Score:3, Interesting)

            by jackbird ( 721605 )
            I've used Photomodeler and Canoma, and made camera-mapped environments in 3D software by hand for years. It is incredibly nontrivial: a lot of blood, sweat, tears, and hand-painting, for a not-so-terribly-good result. Some typical problems:
            • camera barrel distortion
            • chromatic aberrations
            • hot colors in high-contrast areas of digital photos
            • JPEG compression artifacts
            • specular highlights and reflections
            • lens flares and blooms from those specular highlights and reflections
            • clipped/out-of-gamut areas
            • occluding objec
  • X-Files quote:

    "Your scientists have yet to discover how neural networks create self-consciousness, let alone how the human brain processes two-dimensional retinal images into the three-dimensional phenomenon known as perception. Yet you somehow brazenly declare seeing is believing?"

    -- Jesse "The Body" Ventura as a Man In Black

  • Typical photos? (Score:3, Interesting)

    by doti ( 966971 ) on Wednesday June 14, 2006 @03:24PM (#15534456) Homepage
    Only about three percent of surfaces in a typical photo are at an angle

    What typical photos are those? No faces, people, trees or any organic thing?
    No cars? No roofs?
    • Obviously not myspace photos. Those are about 50% angle. Also, if a computer did read them it would have to kill a bunch of scene-agers (scenester + teenager) for being idiots.
    • by moultano ( 714440 ) on Wednesday June 14, 2006 @03:42PM (#15534595)
      The complexity of the models that the program is able to extract is similar to what you would see in a game like Doom. All "floors" are perfectly horizontal, all "walls" are perfectly vertical, and most objects (people, trees, cars) become small vertical walls. This doesn't attempt to capture surface geometry at all; it approximates things with large planes. What they are saying is that most things you see in pictures are very well approximated by these simple primitives, such that when they create a scene using them it provides convincing parallax as you move around it. It's a really neat effect.
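      For the curious, the "pop-up" step itself is easy to sketch once you have those labels. The Python toy below is my own illustration, not the researchers' code: the label values, horizon row, and depth formula are all assumptions. It folds each vertical region up at the line where it meets the ground:

        import numpy as np

        # Hypothetical per-pixel labels; the real system infers these from image cues.
        GROUND, VERTICAL, SKY = 0, 1, 2

        def popup_depth(labels, horizon_row):
            """Assign rough depths under the Doom-like model: ground depth
            grows toward the horizon; a vertical region inherits the depth
            of the ground pixel at its contact line. labels: (H, W) int array."""
            h, w = labels.shape
            depth = np.zeros((h, w))
            for col in range(w):
                column = labels[:, col]
                ground_rows = np.where(column == GROUND)[0]
                for r in ground_rows:  # rows nearer the horizon are farther away
                    depth[r, col] = 1.0 / max(r - horizon_row, 1)
                vert_rows = np.where(column == VERTICAL)[0]
                if len(vert_rows) and len(ground_rows):
                    base = min(vert_rows.max() + 1, h - 1)  # row just below the wall
                    depth[vert_rows, col] = depth[base, col]
            return depth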
    • Yes, pretty much post-neutron bomb pictures only, please.
    • From TFA:

      Hoiem found the computer often discerned which surfaces were vertical or horizontal, and whether a vertical surface faced left, right or toward the viewer.

      Faces have a number of vertical and horizontal surfaces, like the sides of your nose, bottom of your chin, cheeks, etc. And cars have plenty of horizontal and vertical sides. And not all roofs are peaked.

      As someone else commented, this will give you very blocky representations, but there is plenty of use to those blocky representations. Fo

  • Robot vision (Score:5, Insightful)

    by amightywind ( 691887 ) on Wednesday June 14, 2006 @03:26PM (#15534470) Journal

    They've even developed a program that allows the computer to automatically generate 3-D reconstructions of scenes based on a single image

    This is so not new [amazon.com]. These researchers may have advanced techniques in some areas, but shape-from-shading inversion problems like this have been worked successfully since the 1970s and earlier. The theory is well established. Horn's Robot Vision is a classic.

    • Shape from shading works only on a very narrow set of objects. If you are trying to recover the shape of a marble statue, use shape from shading. If your object has color, forget about it.

      What you are saying amounts to "People have done research into computer vision in the past, therefore any new research into computer vision is soooo not new."
      • Shape from shading works only on a very narrow set of objects. If you are trying to recover the shape of a marble statue, use shape from shading. If your object has color forget about it.

        Not true at all. If you understand the photometric function [psu.edu] of the materials in the scene, variation due to color can be separated from variation due to shading. Image classification techniques are useful for doing this. This is discussed in the book and elsewhere. We used the technique for Voyager II to measure topograp
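        For readers who haven't met shape from shading: its heart is a reflectance model tying surface orientation to brightness, which you then invert. A minimal Python sketch under a Lambertian assumption (my simplification for illustration; the photometric functions used for planetary work are far richer than this):

          import numpy as np

          def lambertian_intensity(normal, light_dir, albedo=1.0):
              """Forward model: brightness of a Lambertian surface patch.
              normal and light_dir are 3-vectors; albedo stands in for the
              material/photometric term the parent comment mentions."""
              n = normal / np.linalg.norm(normal)
              l = light_dir / np.linalg.norm(light_dir)
              return albedo * max(0.0, float(n @ l))

          # A patch tilted 45 degrees away from an overhead light appears
          # cos(45 deg) ~ 0.71 times as bright as one facing it head-on.
          print(lambertian_intensity(np.array([0, 1, 1.0]), np.array([0, 0, 1.0])))

        Inversion is then a search for the normals (and albedo) that reproduce the observed intensities, which is why extra color variation makes the problem so much harder.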

  • ...the CMU web site. My Commodore 64 would really like to sign up for this.
  • by Onimaru ( 773331 ) on Wednesday June 14, 2006 @03:29PM (#15534501)
    ...pr0n, of course. Now we can accurately predict and model the exact size and specularity of Lindsay Lohan's boobies, using this revolutionary new (wait for it) Mellon Engine. Truly, we live in the future.
  • by Rob T Firefly ( 844560 ) on Wednesday June 14, 2006 @03:31PM (#15534510) Homepage Journal
    So we're one step closer to actually being able to do the dramatic image-enhancing stuff that's routine in film and television crime drama? You know, where the brooding detective notices four interesting pixels in the background of a scratchy security video, strokes his chin thoughtfully, and says "enhance this bit" to the stereotypical computer geek. The geek types noisily, the computer zooms in on those four pixels, and clears it up into a detailed image of the bad guy, often moving other foreground stuff out of the way to do so.
    • by Jerf ( 17166 ) on Wednesday June 14, 2006 @03:50PM (#15534659) Journal
      It's worth pointing out that a lot of that stuff isn't, strictly speaking, impossible.

      What's impossible is to take a single photo out of the stream and "enhance" it to the n-th degree without using the rest of the video.

      And no matter how good your technique, you can't generate information, so there will be some limit to your zooming in.

      But the idea that if you consider the entire video stream, you can extract a lot more information is not impossible at all, and you'd probably be surprised by both what is in there and what isn't. Seeing "through" something probabilistically is possible if the object being "seen" was in video at some point. On the other hand, "zooming" in to something on the counter that has been there for the entire duration of the video and has never moved is impossible, because while you may have 15,000 pictures of the object, they're all the same pictures.

      Normally I don't bring this up when we're having one of our usual bitch-fests about CSI here on Slashdot because by and large the standard bitching is still correct. But as AI advances, some of the stuff that seems impossible now will become very possible.

      One early example I remember seeing is the demonstration of a system that could identify a person with about 15x15 pixel, high-temporal-resolution monochrome video of them walking, by comparing walking patterns. This was a while ago, and it's worth pointing out your brain can do a pretty decent job of the same task when shown the same video. I mention this because any given frame of the video is basically a random assortment of gray blobs, but in motion, not only is it "a person" but it's a specific person; making it a video adds a lot of information.
      • An excellent example; on Linux, do:

        mplayer somefile.avi -vo aa

        It's amazing how well you can make it out. But pause it and it's much more difficult.
      • I have seen an example of this video enhancement technology where they have some crappy video of a car leaving a parking garage and the front license plate is completely unreadable due to grainy pixelation. But when they selected the area of the plate and compared the data from every frame of the video, it became quite clear what the license plate said. It is very convincing.

        Ever since the 9/11 conspiracy theorists started posting captured stills of the airplane hitting the tower, pointing out unknown dev
      • Comment removed based on user account deletion
      • On the other hand, "zooming" in to something on the counter that has been there for the entire duration of the video and has never moved is impossible, because while you may have 15,000 pictures of the object, they're all the same pictures.

        Not true... the camera moves very slightly, but enough to change the value of certain pixels. This is how super-resolution is possible. You can extrapolate a 1600x1200 picture from an 800x600 source over time with a "stationary" camera. Everything moves (your camera includ
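        That multi-frame trick is the essence of "shift-and-add" super-resolution, and it fits in a few lines. A Python sketch, assuming the sub-pixel shifts between frames are already known (real systems have to estimate them from the frames themselves):

          import numpy as np

          def shift_and_add(frames, shifts, scale=2):
              """Naive super-resolution: frames is a list of HxW arrays, shifts a
              list of known (dy, dx) sub-pixel offsets. Each low-res sample lands
              on the nearest cell of a (scale*H, scale*W) grid; accumulated
              values are averaged."""
              h, w = frames[0].shape
              acc = np.zeros((h * scale, w * scale))
              cnt = np.zeros_like(acc)
              ys, xs = np.mgrid[0:h, 0:w]
              for frame, (dy, dx) in zip(frames, shifts):
                  hy = np.clip(np.round((ys + dy) * scale).astype(int), 0, h * scale - 1)
                  hx = np.clip(np.round((xs + dx) * scale).astype(int), 0, w * scale - 1)
                  np.add.at(acc, (hy, hx), frame)
                  np.add.at(cnt, (hy, hx), 1)
              return acc / np.maximum(cnt, 1)

        With truly identical frames every sample falls in the same cells and you gain nothing, which is exactly the parent's point about where the extra information comes from.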
      • > no matter how good your technique, you can't generate information

        Horsepucky. You can generate all the information you want. About half of it is wrong, in a two-symbol stream, if you just toss coins, but you can do a whole lot better than that without straining yourself, and an order of magnitude better if you are willing to burn the midnight oil. Being wrong is not a bad thing either; being credibly wrong is often better than being incredibly right.
  • I remember doing something similar to this while an undergrad at Penn State. It was just an undergraduate computer vision course, but one of our exercises involved identifying common reference points from two or more images of the same object. These points can then be used to estimate the parallax between the images. It is really fun to play with, since you can use a few still images to create the illusion that a camera is panning around the object. Of course, that example is quite simple. It i
    • Uhhh, what I'm trying to understand is how this routine is supposed to figure out what the other sides of all of those 3D objects look like. I grant you that some objects are uniform across their 3 dimensions, but most are not.

      Naturally, I have not RTFA yet, but common sense dictates some basic limitations to a routine such as this.
      • You are absolutely correct that it won't be able to tell what the 'reverse' side looks like, other than knowing that it has to be within certain size constraints.

        So if I'm looking at a football, I won't be able to tell what is behind it from a single picture. You would have a blind spot that would grow based on the vectors from the image aperture to the edges of the object.

        However, this could be a breakthrough for facial recognition. Given a facial photo, if they are able to extract the di
    • Oh, reading further, it says they are doing so from a single 2D image. In that case, this is even more interesting.
  • By 1980 most had concluded that the feat was either impossible or, if possible, computationally impractical.

    Nice to see we're doing things for shits & giggles. Is this some sort of practical joke?
  • by Penguinisto ( 415985 ) on Wednesday June 14, 2006 @03:35PM (#15534543) Journal
    It's called Canoma. Problem is, it's been limited in scope, and the original company that wrote it (MetaCreations) went out of business ages ago. It still exists as an orphan that Adobe has been sitting on, however [canoma.com].

    (MetaCreations also produced Poser, Bryce, and Carrara, all three of which are still alive and in use by the 3D hobbyist market.)

    /P

  • by ortholattice ( 175065 ) on Wednesday June 14, 2006 @03:36PM (#15534551)
    I wonder what the software would end up doing with this: M.C. Escher's Waterfall [techeblog.com]. Would the program self-destruct like that robot in Star Trek?
    • Imagine if it actually succeeded in modelling it in 3D. Now THAT would be an interesting (read: mind-bending) sight.
    • My mind practically self destructs when looking at that.

      Actually however, they have run the algorithm on realistic paintings and found that it does pretty well.
    • I think the computer would start claiming that the universe is a spheroid region, 705 meters in diameter. ^_^
    • You might have wanted to use the impossible triangle. The waterfall thing can exist in 3D space; this program probably doesn't care about the laws of gravity.

      Then again, it would be cool if all of (Insert name of cartel here (**AA, M$, etc))'s computers blew up whenever someone carried something illogical near a webcam!
  • by jsharkey ( 975973 ) on Wednesday June 14, 2006 @03:38PM (#15534561)
    Last year I worked on an Artificial Intelligence project [jsharkey.org] to recognize objects from several video angles. It takes 2D images (from camera video) and turns them into a 3D path.

    It uses a super-neat concept called "geometric hashing", which can be used to recognize an object regardless of size or rotation, even when regions of it are partially obscured.
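    Geometric hashing is worth sketching for anyone who hasn't seen it. The bare-bones 2D version below is my own illustration, not the parent's project code: express every feature point in a frame built from a pair of "basis" points, so the stored coordinates are invariant to translation, rotation, and scale; recognition recomputes the same coordinates from the scene and votes:

      import numpy as np
      from collections import defaultdict
      from itertools import permutations

      def basis_coords(points, b0, b1):
          """Express points (an (N, 2) array) in the frame defined by the
          basis pair (b0, b1); invariant to translation, rotation, scale."""
          origin = points[b0]
          axis = points[b1] - origin
          s = np.linalg.norm(axis)
          c, si = axis / s
          rot = np.array([[c, si], [-si, c]])  # rotates the axis onto +x
          return (points - origin) @ rot.T / s

      def build_table(model, grid=0.25):
          """Offline: hash quantized invariant coords of every model point
          under every ordered basis pair."""
          table = defaultdict(list)
          for b0, b1 in permutations(range(len(model)), 2):
              for coord in basis_coords(model, b0, b1):
                  table[tuple(np.round(coord / grid).astype(int))].append((b0, b1))
          return table

      def match_votes(table, scene, grid=0.25):
          """Online: vote for model bases consistent with scene features; a
          strong peak survives even when part of the object is occluded."""
          votes = defaultdict(int)
          for b0, b1 in permutations(range(len(scene)), 2):
              for coord in basis_coords(scene, b0, b1):
                  key = tuple(np.round(coord / grid).astype(int))
                  for basis in table.get(key, []):
                      votes[basis] += 1
          return max(votes.items(), key=lambda kv: kv[1]) if votes else None

    Occlusion tolerance falls out of the voting: missing points just mean fewer votes, not a failed match.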
    • by Anonymous Coward
      Actually, there is a technique called Scale Invariant Feature Transform (SIFT) that can do the same thing. I'm doing an undergraduate research project on it right now. The way it works is by taking an image, repeatedly convolving it with a Gaussian kernel, and differencing adjacent blurred images, which has the effect of a convolution with a second-derivative Gaussian kernel (the Mexican-hat function; kinda looks like a sombrero when you plot it). You do this throughout your "octave" (however many it is, I usually use n = 6), getting n+2 images,
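      A compact sketch of that octave construction, using scipy (the choices of n, the base sigma, and the blur ratio are conventional defaults I picked, not necessarily what the parent uses):

        import numpy as np
        from scipy.ndimage import gaussian_filter

        def dog_octave(image, n=6, sigma0=1.6):
            """Build one octave: n+2 progressively blurred images, then n+1
            difference-of-Gaussians, approximating the Mexican-hat
            (Laplacian-of-Gaussian) responses SIFT searches for extrema in."""
            k = 2 ** (1.0 / n)  # blur ratio between adjacent levels
            blurred = [gaussian_filter(image.astype(float), sigma0 * k ** i)
                       for i in range(n + 2)]
            return [b1 - b0 for b0, b1 in zip(blurred, blurred[1:])]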
      • There's a really easy way to code fast approximate (but *nice* approximate) Gaussian convolutions. Forget FFT. Take *any* filter all of whose kernel values are non-negative and repeatedly iterate it. The resulting image approaches a Gaussian convolution as you increase the number of iterations; this is just the central limit theorem. The easiest filter to iterate is the box filter using a summed-area table, giving you time O(N), where N is the number of pixels. Just three might be enough; you'll get a nice bicub
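        The parent's trick in a few lines of Python, with scipy's uniform_filter standing in for the summed-area-table box filter (same O(N)-per-pass behavior):

          import numpy as np
          from scipy.ndimage import uniform_filter

          def approx_gaussian(image, width=9, passes=3):
              """Approximate a Gaussian blur by iterating a box filter. By the
              central limit theorem, the effective kernel converges to a
              Gaussian as the number of passes grows; three passes already
              give a smooth, piecewise-polynomial kernel."""
              out = image.astype(float)
              for _ in range(passes):
                  out = uniform_filter(out, size=width)
              return out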
      • For FFT you should use www.fftw.org. Also, for image processing in C++, www.itk.org can be very helpful (even if it's just for file I/O). Coincidentally, I've implemented SIFT myself for an automated image-stacking application used to reassemble a volume of Transmission Electron Microscopy images.
  • I'd like to see this applied more directly to something like Google Earth. They already have the "show buildings" layer... this would be a great boon to that. It might need a different shading than the grey boxes used by Google Earth as it stands now, to show which structures are derived from the 2D images, but still, I think it'd be great.

    Google, you can send me my check now, please.
    • Of course this varies for different parts of the Google Earth imagery, but quite a lot of it is taken from a very steep angle. You can't tell the true height of the buildings from those pictures (maybe indirectly from shadows, but unless you know the time of day, latitude, and time of year, that's a guess based on some object you think you know the size of). This algorithm is similar in scope to what we do when we face a 2D image, deciding what structures indicate depth. It still needs depth cues, arguably more
      • Good points, but wouldn't the metadata (time of day and date) be embedded within the original image files? Plus, the approximate latitude should be easy to determine, given that they already have everything mapped onto the earth.

        I'm not arguing that everything would be able to be modeled, but every bit helps.
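        For what it's worth, the shadow geometry mentioned above is a single line of trig once the sun's elevation is known, which is exactly what date, time, and latitude buy you. A toy example (numbers invented for illustration):

          import math

          def height_from_shadow(shadow_len_m, sun_elevation_deg):
              """Building height from shadow length: h = L * tan(elevation).
              The sun's elevation would come from the photo's timestamp and
              the scene's latitude."""
              return shadow_len_m * math.tan(math.radians(sun_elevation_deg))

          # A 20 m shadow with the sun 40 degrees up implies a ~16.8 m building.
          print(height_from_shadow(20, 40))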
  • This could be a revolution in the CSI field. There are already products that make 3D virtual crime scenes, but this could be applied to just about every case where a picture was taken.
    • Of course, the CSI version will allow you to explore the crime scene, including things that were *behind* the camera when the picture was taken.
  • So when is this going to be used to turn real environments into virtual environments?

    Taking reconnaissance photos and turning them into training simulations, for example. Or, closer to my level, taking photos of public places and turning them into deathmatch levels. :)

    (Always wanted to make a Quake level of my high school, but then became worried people would think I'd be the source of the next Columbine. Then I wanted to do one of my college, but then 9/11 came along, and I was worried about being investigate
  • by Anonymous Coward
    Left 30 degrees

    click click click click click

    Up twenty degrees

    click click click click click

    Enhance

    click click click click click

    Zoom in on that

    click click click click click

    Enhance

    click click click click click

    OK, give me a hardcopy right there.

    "More human than human is oour motto"
  • by cranesan ( 526741 ) on Wednesday June 14, 2006 @04:41PM (#15535018)
    http://www.cs.cmu.edu/~dhoiem/projects/popup/index.html [cmu.edu]

    Looks like some of the software they wrote to do this has been GPL'ed.
  • researchers at Carnegie Mellon have found a way to allow computers to extrapolate 3 dimensional models

    I'd run it on a Victoria's Secret magazine. There are some excellent 3D models I'd like to extrapolate, if you know what I mean.
  • In the context of my stereoscopy hobby, for use with my eMagin Z800 VR visor, I discovered software that was able to detect some depth information from the frame-to-frame movement in a movie. The tech was developed by a company called Soft4D, which doesn't exist anymore. But it seems http://www.colorcode3d.com/ [colorcode3d.com] sells a version of the software for use with any normal 2D DVDs and their stereoscopic 50-eurocent glasses. It sure adds some depth to a 2D movie; no true 3D effect, but still remarkable and mo
  • Unfortunately this is done by neural learning techniques, "machine learning". So it is essentially randomly taught artificial neurons, and the researchers have no idea how the machine solves it. However, machine learning techniques, or Artificial Neural Networks (ANNs), have a lot of potential as custom ICs and computing power become better and better.
  • Now if only they could teach this to my dogs.
  • otherwise known as a Steinmetz solid [wolfram.com], which is often used as a demonstration for engineering drawing or architecture classes to show that a 3-D drawing of an object is not sufficient to determine its actual shape. A mouhefanggai in 3-D drawings looks like a sphere, but is actually a ridged object with a surface consisting entirely of flat-wrapped curves, rather than compound curves.
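    You can check the "looks like a sphere but isn't" claim numerically: the bicylinder's horizontal cross-sections are squares rather than circles, so its volume is 16r^3/3 instead of the sphere's (4/3)*pi*r^3 (the closed form is on the linked MathWorld page). A quick Python verification by summing cross-sectional slabs:

      import math

      r = 1.0
      # Cross-section at height z: a square of side 2*sqrt(r^2 - z^2) for the
      # bicylinder, vs. a circle of radius sqrt(r^2 - z^2) for the sphere.
      n = 100000
      dz = 2 * r / n
      bicyl = sum((2 * math.sqrt(r*r - z*z))**2 * dz
                  for z in (-r + (i + 0.5) * dz for i in range(n)))
      print(bicyl, 16 * r**3 / 3)      # both ~5.333
      print(4 * math.pi * r**3 / 3)    # sphere: ~4.189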
  • Hmm let me see here.. what could be considered prior art?

    Maybe Pablo Picasso's Guernica [wikipedia.org]?!?! Man, that Picasso was waaaay ahead of his time!

    *watches out for rotten tomatoes*

    SixD

  • How impressive this research really is won't be known until we can have a look at their methods, algorithms, and training data set. I have a feeling that the novel aspect of their work is not in the extraction of features, or the method used to determine whether a surface is vertical or horizontal. As others have already said, shape from shading (think shading a lit cube with a pencil on paper) and even geometric approaches can get you a 3D model from 2D images. It all depends on the assumptions you make bef
  • When it comes down to it, these men are shaking hands about teaching a computer to read Magic Eye pictures.

    Isn't that like a second-year problem at most universities?
  • "Only about three percent of surfaces in a typical photo are at an angle, they have found."

    Doesn't it depend on whether the photo is of a city and man-made objects, or of nature, trees, and mountains...
