Rendering a world in 3d requires *a lot* of very simple maths. OpenGL breaks the work down into two steps; DirectX is essentially the same.
"Per vertex": the 3d world is made out of triangles. The camera position and viewpoint are expressed as a matrix, which is then inverted (a simple mathematical transform). Every corner of every triangle (vertex) is expressed as a vector, and then multiplied by the inverted camera matrix. The product, once projection is applied, gives the (x,y) position on your screen, together with the depth (distance).
Matrix maths is easy: it's a series of multiplications followed by a series of sums. A general purpose CPU has to step through the instructions for each vertex in turn, which is time consuming. A specialised GPU has circuitry which does this maths, and only this, and thus can process many vertices at once at considerable speed.
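To make the "multiplications followed by sums" point concrete, here is a minimal sketch in plain Python. It makes several assumptions that aren't in the description above: the camera only moves (no rotation), it looks along +z, and the projection is a bare divide-by-depth. The names `mat_vec`, `inverse_translation` and `project` are made up for illustration and are not OpenGL or DirectX calls.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-component vector: products, then sums."""
    return [sum(m[row][col] * v[col] for col in range(4)) for row in range(4)]


def inverse_translation(cx, cy, cz):
    """Inverse of a camera matrix that only translates to (cx, cy, cz).

    Assumption: no rotation, so inverting just negates the offsets.
    """
    return [
        [1.0, 0.0, 0.0, -cx],
        [0.0, 1.0, 0.0, -cy],
        [0.0, 0.0, 1.0, -cz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def project(vertex, view, focal_length=1.0):
    """Return (x, y, depth) for one vertex, using a simple divide-by-depth."""
    x, y, z, _w = mat_vec(view, list(vertex) + [1.0])
    return (focal_length * x / z, focal_length * y / z, z)


# Camera sits 5 units back along z, looking towards +z (an assumed setup).
view = inverse_translation(0.0, 0.0, -5.0)
triangle = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 5.0)]

# Every vertex goes through exactly the same arithmetic, independently.
screen = [project(v, view) for v in triangle]
```

Each call to `project` needs nothing from the other vertices, which is why hardware can run lots of them side by side.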
"Per fragment": if you were rendering straight to the screen, you could think of this as per-pixel. But you might render each pixel several times to anti-alias, or render to a file, or do 3d tricks like Valve's Portal by rendering alternative views elsewhere first.
Once you've converted your triangles into what they look like on the screen, you need to colour them in. Old-school 3d graphics (late '80s) might use a single colour for each one, but we've come to expect more. Texture rendering is easy maths: load the texture from memory, interpolate how far along the texture the part of the triangle you want is, and put that colour of the texture on the fragment you're rendering. General purpose CPUs can do this, but rendering the entire world takes a lot of time. As a bonus, each "fragment" is a separate bit of maths, so lots of "pixel shaders" (specialised GPU circuits) can work on fragments separately, and pool their results for speed.
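The texture lookup can be sketched roughly like this, again in plain Python. Assumptions: the fragment's position inside the triangle is given as three weights that add up to 1 (barycentric weights), the texture is a tiny hand-written grid of (r, g, b) colours, and the lookup just grabs the nearest texel with no filtering. `interpolate_uv` and `sample_nearest` are illustrative names, not real graphics API functions.

```python
def interpolate_uv(weights, uvs):
    """Blend the three corner texture coordinates using per-corner weights."""
    u = sum(w * uv[0] for w, uv in zip(weights, uvs))
    v = sum(w * uv[1] for w, uv in zip(weights, uvs))
    return u, v


def sample_nearest(texture, u, v):
    """Look up the texel that contains (u, v), with u and v in the range 0..1."""
    height = len(texture)
    width = len(texture[0])
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return texture[row][col]


# A 2x2 texture of (r, g, b) colours: made-up data for illustration.
texture = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
# Texture coordinates attached to the triangle's three corners (assumed).
corner_uvs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# A fragment sitting mostly near the second corner of the triangle.
u, v = interpolate_uv((0.1, 0.8, 0.1), corner_uvs)
colour = sample_nearest(texture, u, v)
```

Nothing here depends on any neighbouring fragment, so the same two functions could be run for every fragment at once.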
GPUs have really, really fast local memory compared to CPUs. It's optimised for reading, not writing, as textures don't need to be changed that often. General purpose CPUs need to check whether multiple cores are using the same memory, as they are quite likely to write and change it in general operation, which slows things down. Also, GPUs have specialised circuits for texture maths, and only this maths, which lets them do it faster.
In addition, some other fragment tricks (e.g. if you add a bit of grey with increasing distance, you get a "fog" effect; if you add some white depending on the angle of the triangle, it looks shiny and reflective) can have specialised bits of circuitry on the GPU. Not hard maths, but a circuit can do it faster if that's all it has to do.
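The fog trick, for instance, might look something like the sketch below. It assumes a straight-line blend towards mid-grey between two distances; `apply_fog`, `fog_start`, `fog_end` and `fog_colour` are made-up names and values, and real hardware or shaders may well use a different falloff curve.

```python
def apply_fog(colour, depth, fog_start=10.0, fog_end=50.0,
              fog_colour=(128, 128, 128)):
    """Blend a fragment's (r, g, b) colour towards grey as depth increases."""
    # Fraction of fog: 0 before fog_start, 1 beyond fog_end, linear between.
    t = (depth - fog_start) / (fog_end - fog_start)
    t = max(0.0, min(1.0, t))
    return tuple(round(c * (1.0 - t) + f * t)
                 for c, f in zip(colour, fog_colour))


near = apply_fog((200, 0, 0), depth=5.0)     # closer than fog_start: unchanged
middle = apply_fog((200, 0, 0), depth=30.0)  # halfway: half colour, half grey
far = apply_fog((200, 0, 0), depth=80.0)     # past fog_end: fully grey
```

It's one subtraction, one division, a clamp and a handful of multiply-adds per fragment, which is the sort of thing that is cheap to bake into dedicated circuitry.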
GPUs now also tend to have provision for managing your triangle lists and whatnot which lets them crank out the maths faster than if they had to wait on the CPU.
That clear things up?