All of this backs up what I'm saying... they've optimized for small(ish) systems that have to run very quickly, with a heavy emphasis on internal "routing fabric". This makes them hard to program, as they are heterogeneous as all get out.
Imagine a grid of logic cells: a nice big, homogeneous, symmetric grid. You could route programs in it almost instantly, and there'd be no need for custom tools to program it.
The problem IS the routing grid... it's a premature optimization. And for big scale stuff it definitely gets in the way.
I would use a lookup table with 4 bits in and 4 bits out as the basis of this, and I call it the "bitgrid"... I've been writing about it for years. Feel free to make the chip and send me an email (or preferably a sample, please), because that puppy is disclosed as far as patents go... I have no patents on it, and can't get any now.
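To make the cell idea concrete, here's a rough sketch in Python of what a 4-in, 4-out lookup table cell and a uniform grid of them might look like. Treat the details as assumptions for illustration only: the wiring (one input bit from each of the four neighbors, one output bit to each), the wraparound at the edges, the synchronous update, and all the names (`make_cell`, `step`, `pass_through`, and so on) are hypothetical and not taken from any real chip or tool.

```python
# Hypothetical bitgrid sketch: every cell is a 16-entry table of 4-bit outputs.
# Assumed wiring: bit d of a cell's input comes from its neighbor in direction d,
# and bit d of its output goes to the neighbor in direction d.
N, E, S, W = 0, 1, 2, 3
OFFSETS = {N: (-1, 0), E: (0, 1), S: (1, 0), W: (0, -1)}


def make_cell(fn):
    """Build a 16-entry lookup table from a function of 4 input bits.

    fn(n, e, s, w) must return a tuple of 4 output bits in (N, E, S, W) order.
    """
    table = []
    for i in range(16):
        bits = [(i >> k) & 1 for k in range(4)]
        out = fn(*bits)
        table.append(sum(b << k for k, b in enumerate(out)))
    return table


def parity(n, e, s, w):
    """Send the XOR of all four inputs out on every side."""
    p = n ^ e ^ s ^ w
    return (p, p, p, p)


def pass_through(n, e, s, w):
    """Forward each input to the opposite side (W -> E, N -> S, and so on)."""
    return (s, w, n, e)


def step(tables, outputs):
    """One synchronous update of the whole grid (edges wrap around)."""
    rows, cols = len(tables), len(tables[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            inp = 0
            for d, (dr, dc) in OFFSETS.items():
                nr, nc = (r + dr) % rows, (c + dc) % cols
                # The neighbor in direction d drives us with its opposite-side output.
                bit = (outputs[nr][nc] >> ((d + 2) % 4)) & 1
                inp |= bit << d
            new[r][c] = tables[r][c][inp]
    return new


PARITY_CELL = make_cell(parity)
WIRE_CELL = make_cell(pass_through)

# A 3x3 grid of plain "wire" cells, with one bit heading east from the middle row.
grid = [[WIRE_CELL] * 3 for _ in range(3)]
state = [[0, 0, 0], [1 << E, 0, 0], [0, 0, 0]]
after_one = step(grid, state)
```

Since every cell has the same shape, "programming" the grid in this sketch is just filling in tables, and routing is just configuring some cells as wires. Each table is 16 entries of 4 bits, so 64 bits of configuration per cell.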
You should be able to get a 64k x 64k grid (about 4.3 billion cells) on a chip for a few bucks, in any kind of quantity. It should do Exaflops, or consume almost nothing if you idle it.