I'm familiar with Dr. Baas' older work (ASaP and ASaP2). He presented his work to a team of processor architects I was a part of several years ago.
At least at that time (which, as I said, was several years ago), one class of algorithms they were looking at was signal processing chains, where the processing steps could be described as a directed graph of processing steps. The ASaP compiler would then decompose the computational kernels so that the compute / storage / bandwidth requirements were roughly equal in each subdivision, and then allocate nodes in the resulting, reduced graphs to processors in the array.
(By roughly equal, I mean that each core would hit its bottleneck at roughly the same time as the others whenever possible, whether it be compute or bandwidth. For storage, you were limited to the tiny memory on each processor, unless you grabbed a neighbor and used it solely for its memory.)
The actual array had a straightforward Manhattan routing scheme, where each node could talk to its neighbors, or bypass a neighbor and reach two nodes away (IIRC), with a small latency penalty. Communication was scoreboarded, so each processor ran when it had data and room in its output buffer, and would locally stall if it couldn't input or output. The graph mapping scheme was pretty flexible, and it could account for heterogenous core mixes. For example, you could have a few cores with "more expensive" operations only needed by a few stages of the algorithm. Or, interestingly, avoid bad cores, routing around them.
It was a GALS design (Globally Asynchronous, Locally Synchronous), meaning that each of the cores were running slightly different frequencies. That alone makes the cores slightly heterogeneous. IIRC, the mapping algorithm could take that into account as well. In fact, as I recall, you pretty much needed to remap your algorithm to the specific chip you had in-hand to ensure best operation.
The examples we saw included stuff familiar to the business I was in—DSP—and included stuff like WiFi router stacks, various kinds of modem processing pipelines, and I believe some video processing pipelines. The processors themselves had very little memory, and in fact some algorithms would borrow a neighboring core just for its RAM, if it needed it for intermediate results or lookup tables. I think FFT was one example, where the sine tables ended up stored in the neighbor.
That mapping technology reminds me quite a lot of synthesis technologies for FPGAs, or maybe the mapping technologies they use to compile a large design for simulation on a box like Cadence's Palladium. The big difference is granularity. Instead of lookup-table (LUT) cells, and gate-level mapping, you're operating at the level of a simple loop kernel.
Lots of interesting workloads could run on such a device, particularly if they have heterogenous compute stages. Large matrix computations aren't as interesting. They need to touch a lot of data, and they're doing the same basic operations across all the elements. So, it doesn't serve the lower levels of the machine learning/machine vision stacks well. But the middle layer, which focuses on decision-guided computation, may benefit from large numbers of nimble cores that can dynamically load balance a little better across the whole net.
I haven't read the KiloCore paper yet, but I suspect it draws on the ASaP/ASaP2 legacy. The blurb certainly reminds me of that work.
And what's funny, is about 2 days before they announced KiloCore, I was just describing Dr. Baas' work to someone else. I shouldn't have been surprised he was working on something interesting.