Here I go feeding the trolls...
So just so I can get this right, a printer driver is so complex a feat of engineering that it is analogous to a skyscraper? A printer driver takes input data in the form of text, fonts, images, and formatting, and translates it into a format compatible with the printer in question. Entire operating systems have been written in less than 1/10th the size of some of HP's modern shipping print drivers. I never said it has to be 254 bytes, but the current level of bloat is absolutely insane and I have no idea how they even reach it. Do you know how many lines of C++ code 500MB compiled and compressed is actually equivalent to? Billions. Literally billions. I just want to understand: do you actually believe it takes a billion lines of code to transform input document data into a format readable by a laser printer?
Why, yes I am! I've written home-brew Xbox games that included graphics and animations, sounds and music (<10MB), and reasonably complicated network software that runs as a service on many of my servers at work (<100KB). I've dabbled in writing demo code as well, writing a complex synthesizer with DSP effects and tons of music content in 64KB.
If you're actually defending the need to ship printer drivers that are literally over 500MB, I would really love to hear your logic.
Sigh. Here I go feeding the trolls.
I'm not sure what point you're trying to make here, since MY main point in the rest of this topic was that modern GPUs are mostly limited by memory bandwidth, which makes the development in TFA pretty pointless. You're right! 32GB/s isn't enough to make the most of the computing resources available on a modern GPU! That was my point. How exactly would the GPU accessing main memory directly help? The fastest system RAM currently available in consumer markets, in the fastest possible configuration, can barely reach 30GB/s. In order for GPUs to confer a computational advantage they need to be doing heavy lifting on GDDR RAM, which could deliver over 160GB/s on cards that are over 4 years old.
The trouble is that GPUs suck. They have teeny amounts of local memory and a slow interconnect to main memory. They also suck at certain things, and batting data between the fast (for some things) GPU and fast (for other things) CPU is a real drag because of the latency. This limits the applicability of GPUs.
The "slow interconnect" to main memory you're talking about, PCI Express v3.0, has an effective bandwidth of 32GB/s, which actually exceeds the best main memory bandwidth you'd get out of an Ivy Bridge CPU with very fast memory. So no, that's not a bottleneck for bandwidth, though yes, there is some latency there.
I don't know why everyone (TFA included) seems to forget that GPUs aren't fast just because they have a lot of ALUs. They are fast because of the highly specialized GDDR memory they are attached to. One would be completely useless without the other. Even the lowly GTX 285 from 4 years ago was pushing 160GB/s for memory bandwidth.
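For anyone who wants to check the arithmetic, here's a rough sketch of where these numbers come from. These are theoretical peaks, not measurements, and the configurations are my assumptions: dual-channel DDR3-1600 as a stock Ivy Bridge setup (faster DIMMs would land closer to the 30GB/s mentioned above), a 512-bit bus at 2484 MT/s effective for the GTX 285, and a x16 PCIe 3.0 slot. The function names are made up for illustration.

```python
# Back-of-the-envelope peak bandwidth figures (theoretical, not measured).

def memory_bandwidth_gb_s(transfers_per_sec: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s = transfers per second * bytes per transfer."""
    return transfers_per_sec * (bus_width_bits / 8) / 1e9

def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """PCIe 3.0 per-direction peak: 8 GT/s per lane, 128b/130b encoding."""
    bytes_per_sec_per_lane = 8e9 * (128 / 130) / 8
    return bytes_per_sec_per_lane * lanes / 1e9

# Assumed configurations (not taken from TFA):
ddr3_1600_dual = memory_bandwidth_gb_s(1600e6, 2 * 64)  # two 64-bit channels
gtx285 = memory_bandwidth_gb_s(2484e6, 512)             # 512-bit GDDR3 bus
pcie3_x16 = pcie3_bandwidth_gb_s(16)                    # one direction only

print(f"Dual-channel DDR3-1600: {ddr3_1600_dual:.1f} GB/s")
print(f"GTX 285 GDDR3:          {gtx285:.1f} GB/s")
print(f"PCIe 3.0 x16 (one way): {pcie3_x16:.1f} GB/s")
print(f"PCIe 3.0 x16 (both):    {2 * pcie3_x16:.1f} GB/s")
```

That works out to roughly 25.6 GB/s for system RAM against about 159 GB/s for the GTX 285's local memory. The PCIe figure comes to about 15.8 GB/s each way, so the 32GB/s number only holds if you add both directions together.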
Real Programmers don't eat quiche. They eat Twinkies and Szechwan food.