Of course general-purpose CPUs exist, simply because that is what we call them. But it is also true that each design has its own strengths, and "dark silicon" is one driver for special-purpose hardware. Efficiency is another. Andrew Chien has published some interesting research on this subject. In his 10x10 approach he suggests using 10 different types of domain-specific compute units (e.g. for n-body, graphics, tree-walking...), each of which is 10x more efficient than a "general-purpose CPU" in its domain (YMMV). Those compute units, bundled together, make up one core of the 10x10 design. Multiple cores can be connected via a network-on-chip (NoC).
Let's see how software will cope with this development...
ps: can special purpose hardware exist if general purpose hardware doesn't?
One reason might be that railways are more efficient in densely populated areas. There, express trains can even compete with airplanes. Yesterday we went from Tokyo to Osaka. Flight time would have been ~1h, plus 1h check-in and transfer to/from the airport (~45 min each). The Nozomi Shinkansen took us there in 2:30, and both stations were right in the center of the cities.
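For what it's worth, here is the door-to-door arithmetic as a quick sketch. It uses only the rough figures from the comment above and assumes the ~45 min airport transfer applies once at each end; the variable names are just for illustration.

```python
# Door-to-door estimate in minutes, using the rough figures quoted above.
FLIGHT_MIN = 60              # ~1h in the air
CHECKIN_MIN = 60             # ~1h check-in
AIRPORT_TRANSFER_MIN = 45    # assumption: ~45 min each way, so counted twice
SHINKANSEN_MIN = 150         # Nozomi, 2:30, city center to city center


def fmt(minutes):
    """Format a duration in minutes as h:mm."""
    return f"{minutes // 60}:{minutes % 60:02d}"


plane_total = FLIGHT_MIN + CHECKIN_MIN + 2 * AIRPORT_TRANSFER_MIN
train_total = SHINKANSEN_MIN

print("plane:", fmt(plane_total))  # plane: 3:30
print("train:", fmt(train_total))  # train: 2:30
```

Under these assumptions the train wins by about an hour, even though the flight itself is much shorter.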
Most of Japan's population lives in coastal regions, so just a handful of routes can serve all major cities. Imagine how many connections you'd need in the US...
It's not that specialized. It's just plenty of DSPs strapped together on a torus.
Actually, Anton uses ASICs whose cores are geared specifically toward MD codes. This goes way beyond just "strapping together DSPs". IIRC they have ~70 hardware engineers on site. (Source: I visited DE Shaw Research last year.)
Unlike what Wikipedia claims, you could probably achieve comparable performance using a more classical and general-purpose supercomputer setup with GPU or Xeon Phi accelerators, provided the network topology is well tuned to address this sort of communication scheme.
No, you can't, and here is why: Anton is built for strong scaling of smallish, long-running simulations. If you ran the same simulations on an "x86 + accelerator" system (think ORNL's Titan), you'd observe two effects:
- The GPU itself might idle a lot as each timestep only involves few computations, leaving many shaders idle or waiting for the DRAM.
- Anton's network is insanely efficient for this use case. IIRC it's got a mechanism equivalent to Active Messages, so when data arrives, the CPU can immediately forward it to the computation which is waiting for it. That leads to a very low latency compared to a mainstream "InfiniBand + GPU" setup.
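To illustrate why latency matters so much for strong scaling, here is a toy model (a sketch, not a measurement). It assumes the compute per timestep divides perfectly across nodes and that each step pays one fixed message latency. All numbers are hypothetical placeholders, not actual figures for Anton or any InfiniBand system.

```python
def step_time_us(work_us, nodes, latency_us):
    """Toy model: perfectly divided compute plus one fixed message latency per step."""
    return work_us / nodes + latency_us


def parallel_efficiency(work_us, nodes, latency_us):
    """Serial time divided by (nodes * parallel time); 1.0 means perfect scaling."""
    return work_us / (nodes * step_time_us(work_us, nodes, latency_us))


WORK_US = 1000.0   # hypothetical serial compute per timestep, in microseconds
NODES = 512        # hypothetical node count

# Hypothetical per-step latencies, chosen only to show the trend.
for name, latency_us in [("low-latency network", 0.1), ("commodity network", 5.0)]:
    t = step_time_us(WORK_US, NODES, latency_us)
    e = parallel_efficiency(WORK_US, NODES, latency_us)
    print(f"{name}: {t:.2f} us/step, efficiency {e:.0%}")
```

With so little work left per node and per step, even a few microseconds of latency dominate the step time. That is the regime a small, long-running simulation ends up in once you scale it out.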
(most recent supercomputers don't use tori)
Let's take a look at the top 10 of the current TOP500 list:
- #1 Tianhe-2: Fat Tree
- #2 Titan: 3D Torus
- #3 Sequoia: 5D Torus
- #4 K Computer: 6D Torus
- #5 Mira: 5D Torus
- #6 Piz Daint: 3D Torus
- #7 Stampede: Fat Tree
- #8 JUQUEEN: 5D Torus
- #9 Vulcan: 5D Torus
- #10 nn: 3D Torus
So torus networks are the predominant topology among the current top systems: 8 of the 10 listed above use one.
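If you want to check the tally yourself, here is a small sketch that counts the topologies exactly as listed above. The `family` helper is only for illustration; it lumps 3D/5D/6D tori into one bucket.

```python
from collections import Counter

# Systems and topologies exactly as listed in the comment above.
top10 = [
    ("Tianhe-2", "Fat Tree"),
    ("Titan", "3D Torus"),
    ("Sequoia", "5D Torus"),
    ("K Computer", "6D Torus"),
    ("Mira", "5D Torus"),
    ("Piz Daint", "3D Torus"),
    ("Stampede", "Fat Tree"),
    ("JUQUEEN", "5D Torus"),
    ("Vulcan", "5D Torus"),
    ("nn", "3D Torus"),
]


def family(topology):
    """Collapse '3D Torus', '5D Torus', ... into a single 'Torus' bucket."""
    return "Torus" if topology.endswith("Torus") else topology


counts = Counter(family(topology) for _, topology in top10)
print(counts.most_common())  # [('Torus', 8), ('Fat Tree', 2)]
```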
Computational drug design and bitcoin mining have in common that both run best on custom hardware. The catch is that they require very different types of hardware. As an example, take Anton, designed by DE Shaw Research specifically for molecular dynamics (MD) codes.
Bitcoin mining is a so-called embarrassingly parallel problem, while MD is a tightly coupled one. Hence MD codes are much harder to parallelize efficiently: communication gets in the way, and communication is ultimately bounded by the speed of light.
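A minimal sketch of the difference, with heavy simplifications. `search_nonce` is a toy stand-in for mining (a single SHA-256 over made-up bytes with a hex-prefix check, not the real Bitcoin header format or difficulty target). `md_step` is a toy 1D all-pairs force loop, not a real MD integrator. Both names are hypothetical.

```python
import hashlib


def search_nonce(header, start, stop, prefix="00"):
    """Embarrassingly parallel: a nonce range needs no data from any other range."""
    for nonce in range(start, stop):
        digest = hashlib.sha256(header + str(nonce).encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
    return None


def md_step(positions, velocities, dt=0.01):
    """Tightly coupled: each force needs the current positions of all other particles."""
    forces = []
    for i, xi in enumerate(positions):
        f = 0.0
        for j, xj in enumerate(positions):
            if i != j:
                r = xj - xi
                f += r / (abs(r) ** 3 + 1e-9)  # toy 1D inverse-square attraction
        forces.append(f)
    new_v = [v + f * dt for v, f in zip(velocities, forces)]
    new_x = [x + v * dt for x, v in zip(positions, new_v)]
    return new_x, new_v


# Mining: the chunks could run on different machines that never talk to each other.
chunks = [(0, 1000), (1000, 2000)]
hits = [search_nonce(b"toy-header", a, b) for a, b in chunks]

# MD: if particles live on different machines, every single timestep
# has to exchange positions before the forces can be computed.
x, v = [0.0, 1.0, 2.5], [0.0, 0.0, 0.0]
for _ in range(3):
    x, v = md_step(x, v)
```

In the mining case, the only communication is reporting a hit at the end. In the MD case, the inner loop over `positions` is exactly the data that would have to cross the network on every step.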
ps: fun fact: bitcoin mining and MD can be carried out (at least somewhat) efficiently on GPUs.
History is full of tragedies facilitated by people "just doing their job".
Source: I'm from Germany.