I'm not saying that lagging software is a problem: it's not. The problem is that there are so few real needs that justify the top, say, 10 computers. Most of the top500 machines are large not because they need to be - that is, not because they'll ever run one whole-machine job - but because it makes you look cool if you have a big computer/cock.
Most science is done at very modest sizes (relative to the top of the list): say, under a few hundred cores. OK, maybe a few thousand. These days, a thousand cores will take less than 32U, and yes, could stand beside your desk, though you'd need more than one normal office circuit and some pretty decent airflow. I think people lose touch with the fact that we can build very big machines, cheaply, filled with extremely fast cores. You read all that whinging about how we hit the clock-scaling (Dennard) wall around the P4 and how life has been hell ever since - bullshit! Today's cores are lots faster, and you get a boatload more of them for the same dollar and watt. And that's if you stick with completely conventional x86_64/OpenMP/MPI tech, without delving into proprietary stuff like CUDA.
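A quick back-of-envelope sketch of that sizing claim. The node spec here is a hypothetical commodity box I've assumed for illustration (2U dual-socket, 32 cores per socket, roughly 800 W under load), not anything from the text:

```python
# Back-of-envelope: does a thousand cores really fit beside your desk?
# Assumed node spec (hypothetical, commodity-ish): 2U chassis,
# dual-socket, 32 cores per socket, ~800 W under full load.
cores_per_node = 2 * 32          # dual-socket, 32 cores/socket = 64
node_height_u = 2                # rack units per node
node_power_w = 800               # rough full-load draw per node

target_cores = 1000
nodes = -(-target_cores // cores_per_node)   # ceiling division

height_u = nodes * node_height_u             # total rack units
power_kw = nodes * node_power_w / 1000       # total power draw

print(f"{nodes} nodes, {nodes * cores_per_node} cores, "
      f"{height_u}U, {power_kw:.1f} kW")

# A single 15 A / 120 V office circuit tops out around 1.8 kW
# continuous, hence "more than one normal office circuit" -
# and a fair bit of airflow to move ~13 kW of heat.
```

With these assumptions it comes out to 16 nodes, 1024 cores, 32U, and about 12.8 kW: half a standard rack, well beyond one wall outlet.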
People who watch the top of the top500 closely are addicts of hero numbers and hero facilities. The fact is you can buy whatever position you want: just pay up. Certainly it's impressive how much effort goes into a top-10 facility, but we should always be asking: what whole-machine job is going to run on it? IMO, the sweet spot for HPC is a few tens of racks - easy to find space for, easy to manage, and enough resources for hundreds of researchers.