The trick lies in the fact that the picture is a projection, not the scene. There do exist volumetric 3D displays, but a lightfield display doesn't replicate the objects, only the light passing through the screen. This is just like a hologram (although digital lightfield processing is far from the fidelity of chemical holography). The more commonly advertised "3D" screens approximate the effect for two points that represent your eyes, which breaks down in several ways. The points may be misplaced: the geometry only works if you look at the screen from dead center, at the right distance, and with the estimated interpupillary distance (yeah, that's not happening, particularly with multiple viewers); this is common for TVs and such. For HMDs and VR, a growing issue is that the points are not points at all; your pupils have a shape, and your eyes use dynamic optics to focus (accommodate). That's what these displays are designed to address. A related issue in turn is that cinematographers are used to using blurring effects to suggest focus, which will conflict with your own accommodation if you're not looking exactly where you were expected to.
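The misplaced-viewer problem above can be made concrete with a little flatland geometry. This is a minimal sketch, with made-up numbers: a stereo pair is rendered once for a centered viewer, and we triangulate where a viewer at a different position perceives the same scene point. The screen lies along z = 0 and the viewer looks from z = d.

```python
def screen_x(eye_x, eye_z, px, pz):
    """Where the ray from the eye to scene point (px, pz) crosses the screen (z = 0)."""
    t = eye_z / (eye_z - pz)
    return eye_x + t * (px - eye_x)

def perceived_point(left_eye, right_eye, sx_left, sx_right):
    """Intersect the two eye-to-screen-pixel rays to find the apparent position."""
    d = left_eye[1]                       # both eyes assumed at the same distance
    du = left_eye[0] - right_eye[0]       # = -IPD
    u = du / (du + (sx_right - sx_left))  # solve x_left(z) == x_right(z)
    z = d * (1.0 - u)
    x = left_eye[0] + u * (sx_left - left_eye[0])
    return x, z

ipd, d = 0.065, 1.0        # 65 mm interpupillary distance, 1 m viewing distance
point = (0.1, -0.5)        # a scene point half a metre behind the screen

# Render the pair for the intended, centered viewer.
sx_l = screen_x(-ipd / 2, d, *point)
sx_r = screen_x(+ipd / 2, d, *point)

# The centered viewer perceives the point where it belongs; a viewer shifted
# 20 cm sideways, looking at the same fixed pixels, perceives it elsewhere.
print(perceived_point((-ipd / 2, d), (+ipd / 2, d), sx_l, sx_r))
print(perceived_point((0.2 - ipd / 2, d), (0.2 + ipd / 2, d), sx_l, sx_r))
```

Running this, the centered viewer recovers the original point while the shifted viewer sees it displaced sideways at the same depth, which is exactly the kind of distortion the paragraph describes.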
Light field imaging really does operate in 4D: two dimensions of position and two dimensions of angle. Normal stereoscopic imagery means using two cameras, each of which takes a 2D angular image (e.g. the pixels represent a direction from the camera), placed a fixed distance apart; this gives you a single step of the third dimension, which is intended to exactly match the offset between your eyes. It's only an approximation, as eyes have more axes of adjustability, including vergence and accommodation, and the direction of your eyes affects your effective interpupillary distance for the same reason a panoramic camera setup needs a depth-offsetting gimbal: the front-end optics sit in front of the rotation axis. For common stereoscopic displays like TVs and cinema, however, this is one of the less inaccurate tradeoffs, as the mere fact that they don't know where you are (and there are frequently multiple viewers) means they can't show your perspective (if they did, you would see a wider field as you sat closer). A lightfield camera like a Lytro uses a microlens array to distinguish such positions on the lens itself. From that data you could render focused 2D images, but a true lightfield display (like this one from Stanford, the microlens projection system from MIT, or the very similar HMD shown by Nvidia) leaves that task to your eye's normal accommodation. Some lightfield systems simply use multiple cameras in an array; a few are designed for 3D viewing and thus only have a linear array. Since video transfer of true 4D lightfields remains an unsolved problem, this is the category most 3D panoramic content falls in, which restricts the viewer to panning only (no roll, little tilt, no translation) to avoid serious distortion.
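The refocusing a Lytro-style capture enables can be sketched in flatland with toy data. Here the light field L[u, s] stores, for each sample position u across the lens, a 1D sub-aperture image over sensor coordinate s; a scene plane at a given depth appears shifted by u times its disparity in each view, and undoing that hypothesised shift before averaging (the standard shift-and-add refocus) brings that plane into focus. All numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
scene = rng.random(64)       # 1D texture lying on a single scene plane
disparity = 3                # pixels of shift per aperture step; encodes the plane's depth
apertures = range(-2, 3)     # 5 sample positions across the main lens

# "Capture": each sub-aperture view sees the plane shifted proportionally to u.
lightfield = np.stack([np.roll(scene, u * disparity) for u in apertures])

def refocus(lf, d):
    """Shift-and-add refocusing: undo a hypothesised per-view shift, then average."""
    return np.mean([np.roll(view, -u * d) for u, view in zip(apertures, lf)], axis=0)

sharp = refocus(lightfield, disparity)  # matches the plane's depth: views align
blurred = refocus(lightfield, 0)        # wrong depth: views misalign, detail averages out
print(np.allclose(sharp, scene), blurred.std() < scene.std())
```

A true lightfield display is essentially the inverse: rather than collapsing the views into one refocused image, it emits them all and lets your eye's accommodation do the selecting.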
If you look at a stereoscopic image and move your head a little, you see the scene shear, with objects further away appearing to move in the same direction; this happens because the images shown to your eyes were made for a different perspective. An eye-tracking stereoscopic display could avoid this (sadly, the New 3DS does not), and a true light field display would not need to; it already displays different perspectives in different directions. In principle you'd require a capture array the size of your screen, but display prototypes avoid that simply by using CG, and it's also less of a problem for VR than for cinema. A common application has been lenticular 3D pictures, which frequently have 5 or more perspectives.
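The multiplexing behind those lenticular pictures can be sketched as simple column interleaving. This is a toy model with made-up dimensions: each lenticule covers k adjacent print columns, column c of the interleaved image is drawn from view (c mod k), and each viewing angle picks out one full perspective. Real prints also have to match lens pitch to pixel pitch, which is ignored here.

```python
import numpy as np

k, height, width = 5, 4, 20   # 5 perspectives, tiny toy resolution
# Each toy "view" is a constant image carrying its own index as pixel value.
views = [np.full((height, width // k), v, dtype=np.uint8) for v in range(k)]

# Interleave: lay each view's columns under the matching slice of every lenticule.
printed = np.zeros((height, width), dtype=np.uint8)
for v in range(k):
    printed[:, v::k] = views[v]

# An eye at viewing angle v sees only columns v::k, recovering perspective v whole.
for v in range(k):
    assert (printed[:, v::k] == views[v]).all()
print(printed[0])   # the first row cycles 0 1 2 3 4 across the lenticules
```

The same idea scales up: a display with enough views per lenticule (or per microlens, in two dimensions) approaches a light field display, which is why 5-view lenticular prints already show crude look-around.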