If you understand scalar assembly, the basic "how" of vector/SIMD programming is conceptually similar.
Actually, if you think back to pre-32-bit x86 assembler, where the 16-bit registers (AX, BX) were addressable as half-registers (AH and AL being the high and low bytes of AX), you already understand SIMD to some extent.
SIMD just generalizes the idea that a register is very big (e.g. 512 bits), and the same operation is done in parallel to subregions of the register.
So, for instance, if you have a 512-bit vector register and you want to treat it as 64 separate 8-bit values, you could write code like the following:
C = A + B
If C, A, and B are all 512-bit registers partitioned into 64 8-bit values, the vector/SIMD code above logically does what this scalar code does:
for (i = 0; i < 64; i++) {
    c[i] = a[i] + b[i];
}
If the particular processor you are executing on has 64 parallel 8-bit adders, then the vector code
C = A + B
can run as one internal operation, utilizing all 64 adder units in parallel.
That's much better than the scalar version above, a loop that executes 64 passes.
A vector machine could actually be implemented with only 32 adders, taking 2 internal ops to implement a 64-element vector add... that's still a 32x speedup compared to the scalar, looping version.
The Cray-1 was an amazing machine. It ran at 80 MHz in 1976.
http://en.wikipedia.org/wiki/C...
According to WP, the only patent on the Cray 1 was for its cooling system...