So are we discussing scientific code? General-purpose code won't see huge advantages from advanced inlining and the like.
I'm assuming we're talking about something where performance actually matters, for sure. If the problem doesn't require particularly efficient code given the speed of the system it's going to run on, it's probably not very useful to drop down to assembly anyway, nor to consider how well the code generator for a high level language optimises its output.
I don't know what you mean by fusion - the only thing that comes to mind is loop fusion.
It's a general category of optimisations used when you're composing multiple operations over the same stream, data structure, etc. A typical example might involve a programmer writing some list processing code as filtering with one function, composed with mapping with another function, composed with reducing using a third function to get the final answer. An optimising compiler might merge those operations into one space-efficient loop that calculates the final answer without ever generating the intermediate lists.
Put another way, fusion is similar in effect to applying some combination of inlining and loop-based optimisations in situations where you're composing multiple operations over data sets, with the goal of eliminating the storage of unnecessary intermediate values and the overheads of passing them around. It's particularly relevant with higher-level languages that describe their data crunching in functional terms, where a naive implementation is much slower than the fused version.
If you know assembly programmers who would routinely apply that degree of tight cross-function optimisation (and maintain the code well as the underlying functions evolved later) then I'll be both genuinely impressed at their diligence and somewhat disturbed at how much redundancy they must have in their code base.
I don't think there would be a hierarchy in the optimized case. IME compilers are very bad at handling the register-stack hybrid, while assembly programmers are capable of handling it after a learning period.
I'm not sure that's entirely correct. Even when I did more work on these things a few years ago, compilers were already doing cross-function optimisation right down a call stack to optimise the use of the floating point register stack.
I brought this one up as another example where if you were writing the functions manually in assembly, you'd have to either devise your own custom calling conventions for every case (and so potentially reimplement the same functions multiple times) or accept less than optimal performance. As you point out, it's probably not the best example in the context of current CPUs, though.
Most code isn't compiled with whole-program optimization. In fact, a huge amount of software is compiled with little optimization done at all.
Maybe, but then most code isn't developed with hand-tuned assembly for its hot spots either. I'm assuming that we're talking about performance-sensitive cases where that kind of effort would be justified, and that we're interested in which strategy is likely to give the best results in practice. My contention is that, in 2016 and on most modern CPUs, it is likely that using a high level language and a good optimising compiler will give better results than most people would achieve by dropping to assembly.