So my verbal description confused readers; I get that. I'll try again using examples.
This site hosts an Image-to-Triangle-Converter.
I invite you to play with it. You can see it's possible to convert any 2D image into a bunch of triangles. The more triangles one uses, the better the resolution. The defaults on this site are not high-resolution, but high-res can be achieved by using much smaller triangles. (The optimum number of vertices per polygon and the optimum polygon size are open R&D questions.)
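To make "polygonizable" concrete, here is a minimal sketch in pure Python. It is not the site's algorithm (which presumably places vertices adaptively); it just covers the image rectangle with a regular grid and splits each cell into two triangles, so you can see directly how smaller triangles mean more of them, i.e. higher resolution. The function name `polygonize` and the `cell` parameter are my own illustrative choices.

```python
def polygonize(width, height, cell):
    """Cover a width x height image with triangles by splitting a
    regular grid of square cells into two triangles each.  A real
    converter would place vertices adaptively, using smaller
    triangles where the image has more detail."""
    triangles = []
    for y in range(0, height, cell):
        for x in range(0, width, cell):
            x2, y2 = min(x + cell, width), min(y + cell, height)
            # Two triangles per cell, each a triple of (x, y) vertices.
            triangles.append(((x, y), (x2, y), (x, y2)))
            triangles.append(((x2, y), (x2, y2), (x, y2)))
    return triangles

# Halving the cell size quadruples the triangle count (higher resolution).
coarse = polygonize(640, 480, cell=80)  # 96 triangles
fine = polygonize(640, 480, cell=40)    # 384 triangles
```

The grid here is uniform only for simplicity; the resolution trade-off (triangle count vs. fidelity) is the same either way.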
So you agree any 2D image is "polygonizable"? Good.
Now extrapolate this idea to a movie. Rather than treating each frame as an independent triangle (polygon) set, an extrapolation algorithm connects similar polygons in "adjacent" frames. Think of the frames as stacked on top of each other like a card deck.
In most cases, Frame n + 1 will be very similar to Frame n, giving us gradually changing candidate connection lines. The extrapolation algorithm gives us a best fit, or the most "economical" fit in terms of vertex conservation (depending on the chosen resolution settings). The end result is a set of 3D polyhedra that together make up a giant cube: the Polycube. Our proverbial card deck is, in a sense, melted together into one big "card cube".
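A toy version of the frame-to-frame connection step might look like the sketch below. It greedily pairs each triangle in Frame n with the nearest unclaimed triangle in Frame n + 1 by centroid distance; each matched pair would become one 3D cell (a skewed prism) of the Polycube. The function names, the greedy strategy, and the `max_dist` threshold are all my own assumptions; a real matcher would also compare color and shape, and would optimize globally for vertex conservation.

```python
def centroid(tri):
    """Centroid of a triangle given as three (x, y) vertices."""
    xs, ys = zip(*tri)
    return (sum(xs) / 3, sum(ys) / 3)

def connect_frames(tris_a, tris_b, max_dist=10.0):
    """Greedily pair each triangle in frame n with the nearest
    unclaimed triangle in frame n+1, by centroid distance.
    Triangles left unmatched (e.g. across a scene cut) simply get
    capped off rather than connected."""
    pairs, used = [], set()
    for i, ta in enumerate(tris_a):
        ca = centroid(ta)
        best, best_d = None, max_dist
        for j, tb in enumerate(tris_b):
            if j in used:
                continue
            cb = centroid(tb)
            d = ((ca[0] - cb[0]) ** 2 + (ca[1] - cb[1]) ** 2) ** 0.5
            if d < best_d:
                best, best_d = j, d
        if best is not None:
            used.add(best)
            pairs.append((i, best))
    return pairs

# Example: one triangle that drifts slightly between adjacent frames.
frame_n = [((0, 0), (10, 0), (0, 10))]
frame_n1 = [((1, 0), (11, 0), (1, 10))]
links = connect_frames(frame_n, frame_n1)  # the two triangles get linked
```

Across a scene cut the centroid distances would mostly exceed the threshold, so few pairs form and the two shots stay essentially unconnected, which is the behavior the paragraph below describes.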
(The boundary between "cut scenes" won't end up sharing very many polygons, but this is not a problem.)
If one digitally takes a "slice" of this cube, one gets a frame of the movie. Note there are infinitely many slice points, so frames don't have to be displayed at fixed intervals. The slice spacing would be determined by the particular display device, since each is capable of different display rates.
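The slicing idea can be sketched as linear interpolation along the connected vertices: to cut the cube at an arbitrary time t, find the two original frames that bracket t and blend each vertex's position between them. The data layout (`vertex_tracks` mapping a vertex id to its per-frame positions) and the function name are illustrative assumptions, not a spec.

```python
import bisect

def slice_polycube(frame_times, vertex_tracks, t):
    """Cut the polycube at time t by linearly interpolating each
    connected vertex between the two original frames bracketing t.
    Because t is continuous, a display can sample frames at whatever
    rate it supports, not just the source frame rate."""
    # Locate the bracketing frame interval [t0, t1].
    i = bisect.bisect_right(frame_times, t) - 1
    i = max(0, min(i, len(frame_times) - 2))
    t0, t1 = frame_times[i], frame_times[i + 1]
    w = (t - t0) / (t1 - t0)  # blend weight within the interval
    frame = {}
    for vid, track in vertex_tracks.items():
        (x0, y0), (x1, y1) = track[i], track[i + 1]
        frame[vid] = (x0 + w * (x1 - x0), y0 + w * (y1 - y0))
    return frame

# One vertex moving right between two source frames; slice halfway.
tracks = {0: [(0.0, 0.0), (10.0, 0.0)]}
frame = slice_polycube([0.0, 1.0], tracks, 0.5)  # vertex lands at (5.0, 0.0)
```

A 120 Hz display would simply call this with t stepping in 1/120 s increments, while a 24 fps source contributes bracketing frames every 1/24 s; the interpolation fills the gaps.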
Make sense? If not, which phrase isn't clear?