There is no good universal unit; it all depends on your application. For some applications FLOPS is what matters, for some bandwidth is what matters, and for some the size of the memory is what matters. There is no simple performance metric that accounts for every single application.
I don't know how familiar you are with the computations you are mentioning, but large-scale scientific (read: physics/engineering) applications are essentially composed of BLAS routines. I have worked with physicists where 50% of their application's time was spent eigensolving a large sparse matrix (and the other 50% was spent building that sparse matrix), and with nuclear engineers where 80% of their application's time was spent solving dense linear systems. If BLAS routines (or BLAS-like routines) are so popular, it is precisely because they make up a significant part of some scientific applications.
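To make that concrete, here is a minimal sketch of the kind of kernel that dominates such codes: a dense linear solve. NumPy is used purely for illustration; it dispatches this call to LAPACK/BLAS, so the wall-clock time of a loop over such solves is essentially BLAS time. The matrix here is a made-up random system, not any particular application's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))  # hypothetical dense system matrix
b = rng.standard_normal(n)

# NumPy hands this to LAPACK (dgesv), which in turn calls BLAS kernels.
x = np.linalg.solve(A, b)

print(np.allclose(A @ x, b))  # residual check
```

Profiling an application like the nuclear-engineering one above would show most of its time inside calls of exactly this shape.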
FLOPS is a good measure of performance for SOME applications, namely those that are compute bound. For instance, many combinatorial optimization algorithms rely heavily on solving dense linear systems, and for those FLOPS is the meaningful performance metric. That is why the Top500 uses LINPACK as its benchmark: it is meaningful for many applications. Currently the Top500 benchmark is being reassessed to take sparsity into account, and the new version should include a conjugate gradient algorithm on stencils, because that is more representative of current scientific applications (e.g., solving the heat equation on 3D objects).
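A rough sketch of what "conjugate gradient on a stencil" means, using a matrix-free 1D Laplacian for simplicity (real benchmarks use 3D stencils); the operator and problem size here are illustrative assumptions, not the actual benchmark:

```python
import numpy as np

def laplacian(v):
    # Matrix-free 1D stencil with Dirichlet boundaries:
    # (A v)_i = 2*v_i - v_{i-1} - v_{i+1}
    out = 2.0 * v
    out[1:] -= v[:-1]
    out[:-1] -= v[1:]
    return out

def cg(apply_A, b, tol=1e-10, max_iter=1000):
    # Textbook conjugate gradient for a symmetric positive definite operator.
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

b = np.ones(100)
x = cg(laplacian, b)
print(np.allclose(laplacian(x), b))
```

Unlike the dense LINPACK solve, each iteration here streams a sparse stencil over memory, so the kernel is bandwidth bound rather than compute bound, which is exactly why a FLOPS-only ranking misrepresents this workload.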
I have also worked in graph analytics, where we measure things like edges traversed per second (the Graph500 benchmark does that, for instance). Here it is meaningful because most of these applications boil down to graph traversals. But that metric is quite controversial, since the performance you get depends highly on the structure of the graph/matrix, and you need benchmark instances that represent your target application. For instance, road networks typically get terrible performance on a GPU because they have high diameter, so little parallelism can be exposed, and a CPU typically does better there. On social networks the diameter of the graph is much smaller and there is more parallelism, so a GPU can be utilized closer to its peak performance.
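The "edges per second" idea can be sketched as timing a BFS and dividing edge visits by elapsed time (Graph500 calls this TEPS); the graph and function name here are made-up toy examples, not the benchmark's actual harness:

```python
from collections import deque
import time

def bfs_teps(adj, source):
    # Breadth-first search that counts every edge it inspects.
    seen = {source}
    q = deque([source])
    edges_visited = 0
    t0 = time.perf_counter()
    while q:
        u = q.popleft()
        for v in adj[u]:
            edges_visited += 1
            if v not in seen:
                seen.add(v)
                q.append(v)
    elapsed = max(time.perf_counter() - t0, 1e-9)  # guard against zero on tiny graphs
    return edges_visited, edges_visited / elapsed

# Tiny toy graph as undirected adjacency lists: 0-1, 0-2, 1-3
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
edges, teps = bfs_teps(adj, 0)
print(edges)  # 6 directed edge visits in this toy graph
```

The same code run on a road network versus a social network would report very different TEPS on the same hardware, which is the controversy mentioned above: the number characterizes the (machine, graph) pair, not the machine alone.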
tl;dr: there is no one metric that represents "real applications". Each application has different requirements, and each needs to be investigated separately. FLOPS is meaningful for many "real applications".