I found this extremely intriguing, as I am currently writing up my dissertation on high-GFLOPS/W 3D-layered reconfigurable architectures. I am also of the opinion that memory handling is the key, as it is the only way to resolve the von Neumann bottleneck: many processing elements with no means to feed them are useless. In my design I am using reconfigurability and flexibility to gain energy efficiency (my architectural range allows 111 GFLOPS/W in some configurations).
I am also concentrating on dense linear algebra kernels, as they are an ideal challenge: variable computation-to-data ratios, varied and complex memory access patterns, and regularity.
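To make that computation:data spread concrete, here is a back-of-the-envelope sketch of arithmetic intensity (flops per byte moved) across the three BLAS levels. The formulas are textbook operation and traffic counts for representative kernels, not numbers from my architecture:

```python
# Arithmetic intensity (flops per byte) for representative BLAS kernels,
# assuming 8-byte doubles and square n x n matrices. Illustrative
# textbook counts only -- not measurements from my design.
def arithmetic_intensity(n: int) -> dict[str, float]:
    word = 8  # bytes per double-precision operand
    return {
        # BLAS1, AXPY (y = a*x + y): 2n flops, ~3n words of traffic
        "BLAS1/AXPY": (2 * n) / (3 * n * word),
        # BLAS2, GEMV (y = A*x + y): 2n^2 flops, ~n^2 + 3n words
        "BLAS2/GEMV": (2 * n**2) / ((n**2 + 3 * n) * word),
        # BLAS3, GEMM (C = A*B + C): 2n^3 flops, ~4n^2 words
        "BLAS3/GEMM": (2 * n**3) / (4 * n**2 * word),
    }

print(arithmetic_intensity(1024))
# BLAS1/2 intensity stays near a small constant (memory-bound), while
# BLAS3 intensity grows linearly with n -- exactly the spread that makes
# these kernels a good stress test for memory handling.
```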
In my approach, I consider forcing an application mapping onto a fixed architecture via a compiler to be inefficient. Instead, I exploit the architectural flexibility of coarse-grained reconfigurable structures to adapt the architecture to an optimal ASAP/ALAP schedule, effectively constructing an architecture that matches an optimal mapping. Basically, the goal is to keep all processing elements busy all the time, which leads to large energy gains.
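For readers unfamiliar with the scheduling side, here is a minimal sketch of ASAP/ALAP levelling on a dataflow DAG, assuming unit-latency operations. The graph encoding and names are illustrative stand-ins, not my actual scheduler; the mobility (ALAP minus ASAP level) of each operation is the slack the mapper can exploit when shaping the architecture around the schedule:

```python
# Minimal ASAP/ALAP levelling on a dataflow DAG with unit-latency ops.
def asap(deps: dict[str, list[str]]) -> dict[str, int]:
    """deps[op] lists the ops whose results op consumes."""
    level: dict[str, int] = {}
    def visit(op):
        if op not in level:
            level[op] = 1 + max((visit(d) for d in deps[op]), default=0)
        return level[op]
    for op in deps:
        visit(op)
    return level

def alap(deps: dict[str, list[str]], depth: int) -> dict[str, int]:
    # Invert the edges: succs[op] lists the consumers of op's result.
    succs: dict[str, list[str]] = {op: [] for op in deps}
    for op, ds in deps.items():
        for d in ds:
            succs[d].append(op)
    level: dict[str, int] = {}
    def visit(op):
        if op not in level:
            level[op] = min((visit(s) - 1 for s in succs[op]), default=depth)
        return level[op]
    for op in deps:
        visit(op)
    return level

# Example: a tiny multiply-accumulate chain with one slack-bearing input.
g = {"a": [], "b": [], "c": [], "mul": ["a", "b"],
     "acc": ["mul"], "add": ["acc", "c"]}
s = asap(g)
l = alap(g, depth=max(s.values()))
mobility = {op: l[op] - s[op] for op in g}  # "c" has slack 2 here
```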
The way this is done is a bit unusual, as my architecture has a function set as opposed to an instruction set, custom-definable and run-time reconfigurable to suit an application. The function set is constructed by composing elementary hardware functions based on meaning, a concept close to John Backus's functional programming ideas. Programming is meaning-based: the required functions are constructed efficiently and then exposed at the assembly level.
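As a toy illustration of the composition idea (not my actual toolflow), a new function-set entry could be built from elementary functions in the spirit of Backus's FP combining forms; all names below are hypothetical:

```python
from functools import reduce

def compose(*fs):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), fs)

# Elementary functions, standing in for elementary hardware functions.
def scale(a):            # x -> a*x
    return lambda x: a * x
def shift(b):            # x -> x + b
    return lambda x: x + b

# Composing a new "function set" entry from the elements: an affine
# transform x -> a*x + b, analogous to fusing a multiplier and an adder
# into a single reconfigured datapath function.
affine = compose(shift(3.0), scale(2.0))
assert affine(4.0) == 11.0
```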
Several kernels have been implemented this way, and programming stays easy thanks to this functional reconfiguration (the longest so far being TRSM, at 112 assembly lines). I reached 21-25 GFLOPS/W pre-layout on 65 nm technology for 10 BLAS1-3 kernels.
I am now finishing a 3D via-last physical layout in 40 nm technology, which has already doubled my energy efficiency. (Why 3D? That's another story -- I think the division of computation, memory access, and communication (intra-kernel data movement, sharing, broadcasting) calls for custom hardware structures optimized for each task, which can run in parallel. That maps natively onto 3D silicon: each class on its own die.)
I will be reading your papers as soon as possible to see how you deal with the von Neumann bottleneck.