I think the main reason it's not obvious is that the structure of the retina is quite a bit more complex than you make it out to be. First, there is essentially an exponential fall-off of receptor density as you move away from the fovea. Second, there are several horizontal channels in the layers of the retina that aggregate receptor inputs in a center-surround manner (e.g. on-center/off-surround, off-center/on-surround), and these horizontal channels are of differing lengths.
So it's not such an easy question which pieces of the circuit, if any, are privileged, or which spatial areas of the retina, if any, are privileged, since there are multiple spatial scales in the former and spatial-frequency gradients in the latter.
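To make the center-surround idea concrete, here's a minimal sketch of an on-center/off-surround receptive field modeled as a difference of Gaussians. The widths, spacing, and stimulus are my own illustration, not retinal measurements; the point is just that such a unit responds weakly to uniform light and strongly near an edge.

```python
import math

def gaussian(x, sigma):
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def dog_kernel(radius, sigma_center=1.0, sigma_surround=3.0):
    """On-center/off-surround weights: a narrow excitatory center
    minus a wide inhibitory surround (difference of Gaussians)."""
    return [gaussian(j, sigma_center) - gaussian(j, sigma_surround)
            for j in range(-radius, radius + 1)]

def respond(stimulus, kernel):
    """Slide the receptive field across a 1-D luminance profile."""
    r = len(kernel) // 2
    return [sum(kernel[j + r] * stimulus[i + j] for j in range(-r, r + 1))
            for i in range(r, len(stimulus) - r)]

# Step edge: dark on the left, bright on the right.
stimulus = [0.0] * 50 + [1.0] * 50
response = respond(stimulus, dog_kernel(radius=10))
# Near zero over the uniform regions, largest in magnitude near the edge.
```

Because the center and surround weights nearly cancel, uniform illumination produces almost no output; the unit acts as an edge detector at the spatial scale set by the two widths, which is why channels "of differing lengths" sample different spatial frequencies.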
There are also some complications around time averaging. The retina has both on and off channels: on channels give a fast transient response to increased light, then their activity decays back to baseline, while off channels have the opposite transient response. So you have asymmetric temporal responses between the channels (which is one of the reasons you have center-surround processing). There's also the detail that most neurons in the retina don't spike; they communicate via graded membrane potentials rather than action potentials, and the temporal resolution of many of these channels is still not fully understood.
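The on/off asymmetry can be caricatured as a pair of transient channels, each responding to the signed change in luminance and then decaying back to baseline. This is a toy model with an arbitrary time constant, not a fit to physiology:

```python
import math

def transient_responses(luminance, tau=5.0):
    """Toy on/off channels: 'on' accumulates positive luminance steps,
    'off' accumulates negative ones; both decay exponentially (time
    constant tau, in time-steps) back toward baseline."""
    on, off = [0.0], [0.0]
    decay = math.exp(-1.0 / tau)
    for t in range(1, len(luminance)):
        delta = luminance[t] - luminance[t - 1]
        on.append(on[-1] * decay + max(delta, 0.0))    # responds to increases
        off.append(off[-1] * decay + max(-delta, 0.0))  # responds to decreases
    return on, off

# A flash: luminance steps up at t=20 and back down at t=60.
lum = [0.0] * 20 + [1.0] * 40 + [0.0] * 40
on, off = transient_responses(lum)
# 'on' peaks at the onset (t=20), 'off' at the offset (t=60); both decay.
```

The two channels never report steady light, only changes, which is the sense in which the retina is doing temporal differentiation before anything reaches the brain.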
I think the reason your insight isn't obvious is that it's very difficult to translate it into a form understood by experts in the anatomy and physiology, so they can tell you whether your assumptions are consistent with the data.
Actually, if the on and off channels of the center-surround system spike, then during a microsaccade any edge crossing the boundary should cause a fast spike. If the motion is known, the difference in time between activation of the "on" channel and activation of the "off" channel (as the edge crosses the boundary) should give a finer-resolution location for the edge.
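Here's the arithmetic of that timing argument as a toy sketch. Assumptions (all mine, for illustration): the drift is linear, spikes are instantaneous, and the on/off sub-region boundaries sit at known positions. If the velocity is already known, one spike time suffices; with two, you can solve for the velocity as well:

```python
def edge_position_from_spikes(x_on, x_off, t_on, t_off):
    """Given the positions of the on and off sub-region boundaries and the
    spike times at which a drifting edge crossed each, recover the drift
    velocity and the edge's position at t = 0. A toy model of the timing
    argument, not retinal data."""
    v = (x_off - x_on) / (t_off - t_on)  # drift velocity from the two crossings
    x_edge = x_on - v * t_on             # extrapolate the edge back to t = 0
    return v, x_edge

# Simulate: edge starts at x = 0.37 (between receptors spaced 1 apart),
# drifting at v = 2.0 units per time-step; boundaries at x = 1.0 and x = 3.0.
v_true, x_true = 2.0, 0.37
t_on = (1.0 - x_true) / v_true
t_off = (3.0 - x_true) / v_true
v_est, x_est = edge_position_from_spikes(1.0, 3.0, t_on, t_off)
# Recovers the edge at 0.37 -- finer than the receptor spacing of 1.
```

The localization precision is limited by spike-timing jitter times drift speed, not by receptor spacing, which is the hyperacuity-style payoff of the scheme.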
Membrane potentials make sense for color processing; we already know that color is a lower-bandwidth channel. It's also fuzzy and separate from edge recognition (I know this from... er... some experiments I did on myself involving... oh heck, pretty damn pure MDMA). So when the edge-recognition triggers fire, they respond strongly to the extant color data (which is diffuse) and adopt the gross color of the area pretty strongly. It seems that even in the fovea, color fills in pretty wildly, which is consistent with a membrane-potential mechanism. That kind of slow response should also be related to exposure control.
[My specific test case here: reflected light of varying colors on a blank white surface. The edge-recognition triggers were misfiring, giving something akin to a rolling segmented LCD text display. The phantom text took the very light, diffuse, even color spread across the white surface and rendered it as a strong, vivid primary color. I'm pretty certain that color data is not spatially tightly encoded; edges are used to trigger the association of the two.
Other things I noticed during that experience were eigenfaces: apparently when the facial-recognition system breaks down and you look at someone, all you get is eigenfaces - or at least the low- and high-frequency recognition systems go out of sync, leading to something that looks a lot like them.
Another thing I noticed is that feature detection is rather interesting. Gross-feature detection is separate from texture determination. It's a bit like the way GPUs paint a scene: you have the Z-buffer providing depth, then the gross features (triangles, in the case of a GPU), and then the actual textures of the surfaces themselves. When the texture system misfires, you get interesting effects, ranging from something that looked like lots of little white bubbles mapped onto the surface to something that looked like a rolling set of five-pointed stars and linear ridges sweeping across the surface.
Anyway... that's a total aside. Anyone involved in that kind of research should at least try a little of that stuff... it's relatively safe, and to an even mildly trained eye it will provide a lot of insight into how the processing systems work.]