>(Same with the optimization issues we covered in that class - that it can make a real difference in runtime whether you iterate first over the rows and then over the columns of a 2-dimensional array or vice versa, depending on how your software stores arrays in memory, was a huge puzzle for minds far brighter than mine.)
If you are still curious, read the short article at http://en.wikipedia.org/wiki/Instruction_prefetch, and when you come to the bit about prefetching texels, think of those texels as data coming from certain rows/columns of your array. Then think about the way a 2-dimensional array is laid out in linear memory, and whether the next few texels (array cells) you are about to process are closer to the current one if they are from the same row or, instead, from the same column. In one case, they are packed tightly together, and so are more likely to all be prefetched into the cache; in the other case, they are spread out across memory addresses, and are less likely to all wind up in the cache.
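To make the row-versus-column point concrete, here is a rough sketch in C, which stores 2-dimensional arrays row by row (row-major). The names `grid`, `ROWS`, `COLS`, and both function names are made up for illustration. Both functions compute the same sum; only the order in which they touch memory differs.

```c
#include <stddef.h>

/* Hypothetical sizes, chosen only so the array is larger than a typical cache. */
enum { ROWS = 1024, COLS = 1024 };

/* C lays this out row-major: grid[r][c] and grid[r][c + 1] are neighbours in memory. */
static int grid[ROWS][COLS];

/* Inner loop walks along a row, so consecutive reads hit consecutive addresses. */
long sum_row_by_row(void)
{
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Inner loop walks down a column, so each read jumps COLS * sizeof(int) bytes ahead. */
long sum_column_by_column(void)
{
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}
```

On most hardware I would expect the row-by-row version to be the faster one for an array this size, but how big the gap is depends on the machine and the compiler settings, so time it yourself rather than take my word for it. In a language that stores arrays column by column (Fortran, for instance), the situation is reversed.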
As a game programmer, I attended a conference where one extremely knowledgeable fellow demonstrated a crazy thing: he could insert reads into array-processing loops where the read DID NOTHING with the single data element it had just read, yet the whole loop would run faster, because that 'useless' read caused a prefetch of data that would be used later. It was nuts, and it made no sense if you just looked at the code, but it was a significant, measurable speedup.
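I don't have the speaker's actual code, so the following is only my guess at the general shape of the trick, written in C. The function name `sum_with_touch` and the look-ahead distance `AHEAD` are invented for this sketch, and the right distance (if any) would have to be found by measurement on the target machine.

```c
#include <stddef.h>

/* Hypothetical look-ahead distance in elements; tune by measurement. */
enum { AHEAD = 16 };

long sum_with_touch(const int *data, size_t n)
{
    long total = 0;
    /* volatile so the compiler does not delete the "useless" read below. */
    volatile int sink = 0;

    for (size_t i = 0; i < n; i++) {
        /* Touch an element we will need later; the value itself is ignored.
           The hope is that this pulls its cache line in ahead of time. */
        if (i + AHEAD < n)
            sink = data[i + AHEAD];

        total += data[i];  /* the real work */
    }

    (void)sink;
    return total;
}
```

On current CPUs the hardware prefetcher already handles a simple sequential walk like this one, so the extra read may do nothing or even slow the loop down; the trick is more plausible for irregular access patterns. GCC and Clang also offer `__builtin_prefetch`, which requests the cache line without performing an actual load.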