I'm French. France is definitely socialist, and so-called innovations happened in spite of that, not because of it. True capitalism is the best thing that could happen to France.
Northern Europe is indeed more friendly towards the US than Southern or even Western Europe.
The optimizations I highlighted all will affect the data access pattern
That's not quite true. Only some of the highlighted optimizations will; the others keep the same general access pattern (unrolling in particular, which isn't so high-level).
As I said in another part of the thread, those high-level transformations you speak of don't really happen in the real world and are not reliable (hence the "I don't know what world you live in"). Compiler authors can claim all they want, but having spent a lot of time looking at assembly and optimizing manually, I can tell you it's still quite far from what an expert can do. With the Intel compiler, I've even had issues with scalar replacement as soon as there is enough object encapsulation.
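To make the scalar-replacement point concrete, here is a minimal C sketch (my own illustrative example, not the Intel-compiler case above) of the transformation that tends to fail once a field is hidden behind enough encapsulation:

```c
#include <assert.h>

struct accum { float sum; };

/* Naive version: every iteration loads and stores s->sum through the
 * pointer. If the compiler can't prove the store doesn't alias v, it
 * must keep that per-iteration memory traffic. */
float sum_naive(struct accum *s, const float *v, int n) {
    for (int i = 0; i < n; i++)
        s->sum += v[i];
    return s->sum;
}

/* Manually scalar-replaced version: the field lives in a local (i.e.
 * a register) for the whole loop and is written back exactly once. */
float sum_replaced(struct accum *s, const float *v, int n) {
    float acc = s->sum;
    for (int i = 0; i < n; i++)
        acc += v[i];
    s->sum = acc;
    return acc;
}
```

Both versions compute the same result; the second is what you end up writing by hand when the optimizer gives up.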
Compiler authors and researchers like to claim their software performs all kinds of optimizations automatically. I remember being told of a conference not so long ago where a colleague presented several transformations he made to the C code of a simple filter algorithm that improved performance by orders of magnitude on POWER, and an IBM guy said their compiler should take care of all those transformations automatically. Well, clearly, it didn't.
I'm quite aware of the state of the art; I have in particular seen the work of the researchers behind the polyhedral model for loop optimization, which was relatively recently added to LLVM and GCC. It can indeed do tiling or loop fusion, or even split your loops into several to identify the parallel sections, assuming you have full alias-analysis information, your loops have simple boundaries and regular control flow, and you have enough time to find the right transformation among the whole space of possible ones.
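For reference, here is what one of those transformations, loop fusion, looks like when done by hand in C (a minimal sketch; the function names are mine):

```c
#include <assert.h>

/* Two separate passes over a: the intermediate values of a[i] go out
 * to memory after the first loop and come back in for the second. */
void scale_then_add(float *a, const float *b, int n, float s) {
    for (int i = 0; i < n; i++) a[i] *= s;
    for (int i = 0; i < n; i++) a[i] += b[i];
}

/* Fused: one pass, and a[i] stays in a register between the multiply
 * and the add. Same result, half the memory traffic on a. */
void scale_add_fused(float *a, const float *b, int n, float s) {
    for (int i = 0; i < n; i++) a[i] = a[i] * s + b[i];
}
```

The fusion is only legal here because the two loops have no cross-iteration dependency, which is exactly the kind of fact the compiler needs alias analysis to establish.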
I also know that in the real world, it doesn't work so well unless you explicitly program for it or guide it with pragmas. Automatic vectorization isn't reliable, compilers even fail at unrolling most of the time. Why would you rely on much more complex transformations to occur magically?
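A typical example of "explicitly programming for it": without aliasing guarantees, many compilers refuse to vectorize even trivial loops, so you end up writing something like this (a sketch; the function name is mine):

```c
#include <assert.h>

/* Without the restrict qualifiers the compiler must assume dst and
 * src may overlap and will often refuse to vectorize this loop. The
 * qualifiers are a promise from the programmer, not something the
 * compiler inferred. */
void axpy(float *restrict dst, const float *restrict src, float a, int n) {
    for (int i = 0; i < n; i++)
        dst[i] += a * src[i];
}
```

In other words, the "automatic" vectorization only happens after the developer has done the hard part of the alias analysis for the compiler.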
Sorry, but in your example, whether dst and src overlap is irrelevant to vectorization. Even for dst[i+j] = src[i] with j small enough, it's irrelevant.
Sure, vectorization and other optimizations change memory access ordering (loadx4 computex4 storex4 instead of (load compute store)x4), but I was talking about the pattern, which doesn't change.
But that's a change of algorithm, not something a compiler can do.
A manual optimization would easily yield a 2 times improvement on that.
IANAL (I'm an entrepreneur who has been a subcontractor and has subcontracted in the software business), but I know that you can set pretty much whatever terms you want in a contract. In particular, you can require that an incident you file with a subcontractor be investigated by said subcontractor within a given time frame.
In truth, it is up to the buyer and the seller to negotiate the terms of the contract, the lawyers are just there to advise both parties.
The seller can agree to provide minimal service for a minimal price, but the buyer can ask for better service, which he will probably have to pay extra for.
How is Floating Point Operations Per Second not a real unit?
Because 1) not all operations are equal and 2) the clock rate isn't constant.
It's not very useful information, unless what you really want to measure is the amount of data processed per second, which is better expressed as a simple B/s bandwidth figure.
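As a rough worked example of why bandwidth is the more telling number: a SAXPY-style iteration performs 2 flops (one multiply, one add) but moves 12 bytes (load x, load y, store y, 4 bytes each), so on a machine with, say, 24 GB/s of memory bandwidth (an illustrative figure), the attainable rate is capped at 4 GFLOP/s no matter what the peak FLOPS rating says. The arithmetic, as a tiny sketch:

```c
#include <assert.h>

/* Attainable flop rate for a streaming (bandwidth-bound) kernel:
 * bandwidth (GB/s) * flops per iteration / bytes per iteration. */
double attainable_gflops(double bandwidth_gbs, double flops_per_iter,
                         double bytes_per_iter) {
    return bandwidth_gbs * flops_per_iter / bytes_per_iter;
}
```

With bandwidth_gbs = 24, flops_per_iter = 2 and bytes_per_iter = 12, this gives 4 GFLOP/s.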
And, I would argue that cache is extremely important when considering vectorization, especially when considering loop nests. I might get much more impressive vectorization if I execute a loop nest in a particular order. But, if I get better cache locality by interchanging two loops, I may see much better performance in the second case. Matrix multiply is a poster child for this.
So if you're looking at the output of the compiler's optimizer and saying "compiler A is better than compiler B at vectorizing" looking only at the instruction sequence, and ignoring the actual memory access pattern and the effects of the cache, you might draw the wrong conclusion.
Cache is of course of utmost importance for performance, but you fail to see it's a different problem entirely. Vectorization happens at a different level.
I don't know what world you live in, but no optimizer of any C compiler changes the memory access pattern of your code. If your code is bad from that point of view, there is nothing the compiler can do; only the developer can optimize that. What the compiler can do is schedule your instructions better, so as to keep the pipeline as full as possible while minimizing register usage, taking into account the fact that scalar and SIMD code use different registers and pipelines (though if you want that done well, you're better off doing it by hand while reading the processor specs).
AFAIK the only negative effect compiler optimizations can have on the cache is that they can generate code bloat by unrolling or inlining too much. That's unlikely to prevent any vectorization though. Conservative inlining or unrolling policies in the middle-end can, however, prevent vectorization, because they don't take into account the gains associated with switching to the SIMD ISA.
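To illustrate the matrix-multiply point from the parent comment: loop interchange changes the access pattern, and that is exactly the kind of change only the developer (or a polyhedral optimizer, on a good day) will make. A minimal C sketch:

```c
#include <assert.h>

#define N 64

/* i-j-k order: the inner loop reads b[k][j] with k varying, i.e. it
 * strides through b column-wise with stride N floats per step. */
void matmul_ijk(float c[N][N], float a[N][N], float b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float s = 0.0f;
            for (int k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

/* i-k-j order: b[k][j] and c[i][j] are now both accessed with stride
 * 1 in the inner loop, which is far friendlier to the cache (and to
 * the vectorizer). Same result, different access pattern. */
void matmul_ikj(float c[N][N], float a[N][N], float b[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0f;
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            float aik = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += aik * b[k][j];
        }
}
```

Both functions compute the same product; only the order in which memory is touched differs, which is the whole point of the disagreement above.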
Depending on the support contract you negotiated with them, they shouldn't need to admit it.
I recommend you tell management to improve their legal department.
What's so expensive?
Doesn't a real laptop cost at least $1,500? This is pretty cheap.
Here is an example in GCC: the optimizer assumes the SSE minps instruction is commutative. It isn't.
As a result, you can get unexpected results, depending on the optimizer's mood, when you call this instruction with a NaN and a non-NaN value.
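A pure-C model of the per-lane behaviour makes the asymmetry visible: minps computes (a < b) ? a : b, and since any comparison involving a NaN is false, it always returns its second operand in that case. (This is a model of the documented instruction semantics, not the intrinsic itself.)

```c
#include <assert.h>
#include <math.h>

/* Per-lane model of SSE minps: return the second operand whenever
 * the comparison is false, which includes every NaN case. */
float minps_lane(float a, float b) {
    return (a < b) ? a : b;
}
```

So min(NaN, x) yields x while min(x, NaN) yields NaN: swapping the operands, as an optimizer assuming commutativity might, changes the result.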
Actually, no. Computers are not getting faster.
Microprocessors stopped getting faster a few years ago, now we just get more of them. Supercomputers have mostly reached the limits of scalability, so there is a limit to that too.
Because the slowest part of the computer is memory, and vector notation leads to more cache misses.
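What I mean by vector notation costing cache misses: whole-array expressions are typically evaluated one operation at a time through temporary arrays, which multiplies the passes over memory. A minimal C sketch of the difference (function names are mine):

```c
#include <assert.h>

/* Whole-array notation like d = a*b + c is typically evaluated one
 * operation at a time through a temporary array: two full passes
 * over memory instead of one. */
void elementwise_unfused(float *d, float *tmp, const float *a,
                         const float *b, const float *c, int n) {
    for (int i = 0; i < n; i++) tmp[i] = a[i] * b[i];
    for (int i = 0; i < n; i++) d[i] = tmp[i] + c[i];
}

/* The hand-written scalar loop fuses the two operations; the value
 * of a[i] * b[i] never touches memory at all. */
void elementwise_fused(float *d, const float *a, const float *b,
                       const float *c, int n) {
    for (int i = 0; i < n; i++) d[i] = a[i] * b[i] + c[i];
}
```

Once n exceeds the cache size, the temporary in the first version is written out and read back from DRAM, which is where the extra misses come from.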
Is that some sort of joke? Surely you can tell this is not the optimal assembly code at all.