I've been working on a slightly more ambitious (but still a ways off!) similar project; see http://jonathanclark.com. Initially I tried using a Wiimote, but found it has an extremely limited coverage area and accuracy: move a few feet out of the sweet spot and it stops working. The Wiimote also has a lot of noise in its samples, so you end up having to smooth them - but smoothing introduces a lot of latency, which destroys the illusion. On the low-cost end, the TrackIR system works a lot better (faster, more accurate samples). I have a demo using TrackIR posted here:
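To make the smoothing-vs-latency tradeoff concrete, here's a minimal sketch of the kind of exponential smoothing you end up applying to noisy tracker samples. All names are illustrative, not from my actual code; the point is that the lower you set alpha (more smoothing), the more frames the output lags behind the real head position.

```python
# Exponential moving average over tracker samples (illustrative sketch).
# Lower alpha = smoother output but more lag, which is exactly the
# latency that destroys the head-tracking illusion.

def smooth(samples, alpha=0.2):
    """Return an EMA-smoothed copy of a sequence of 1-D positions."""
    out = []
    prev = samples[0]
    for s in samples:
        prev = alpha * s + (1 - alpha) * prev
        out.append(prev)
    return out

# A step input (head suddenly moves from 0 to 1) shows the lag:
# the smoothed position takes many frames to catch up.
step = [0.0] * 5 + [1.0] * 10
smoothed = smooth(step, alpha=0.2)
```

With alpha=0.2, the smoothed position is still roughly 10% short of the true position ten frames after the jump - at 60 Hz that's over 150 ms of visible lag.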
TrackIR also covers only a limited area, so I've now moved to OptiTrack, which gets pricier but can cover fairly large areas (at least a small room).
One other issue I found is that flat video doesn't look entirely convincing, because motion parallax should also occur within a frame: when you move left to right, the bridge and the water behind it should move at different speeds. To help address this, I'm currently creating a depth map per video frame and converting that depth map into a mesh onto which the video is mapped. For now I'm drawing the depth map by hand (which should be fine as long as objects don't move much), but I'd like to generate it automatically by filming from multiple angles and using feature-point extraction to estimate the depth for every frame.
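The depth-map-to-mesh step can be sketched roughly like this - a regular grid of vertices, one per depth sample, triangulated into two triangles per cell, with the video frame used as the texture. This is purely illustrative (a real pipeline would displace a grid on the GPU from a depth texture), and all the names here are my own, not from any particular engine:

```python
# Sketch: turn a per-frame depth map into a triangle mesh that the
# video frame can be texture-mapped onto, so near and far pixels
# shift at different rates as the viewer's head moves.

def depth_map_to_mesh(depth):
    """depth: 2-D list (rows x cols) of depth values, one per grid vertex.
    Returns (vertices, triangles): vertices as (u, v, z) tuples, where
    (u, v) doubles as the texture coordinate in [0, 1], and triangles
    as vertex-index triples (two per grid cell)."""
    rows, cols = len(depth), len(depth[0])
    vertices = []
    for r in range(rows):
        for c in range(cols):
            u = c / (cols - 1)   # horizontal texture coordinate
            v = r / (rows - 1)   # vertical texture coordinate
            vertices.append((u, v, depth[r][c]))
    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c
            triangles.append((i, i + 1, i + cols))              # upper-left
            triangles.append((i + 1, i + cols + 1, i + cols))   # lower-right
    return vertices, triangles

# Tiny hand-drawn depth map: far water (depth 5) behind a near bridge (depth 1).
depth = [[5, 5, 5],
         [1, 1, 1]]
verts, tris = depth_map_to_mesh(depth)
```

Rendering this mesh from a virtual camera driven by the tracked head position then gives within-frame parallax: the depth-1 bridge vertices translate more than the depth-5 water vertices.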