What you say is pretty much correct.
The reason out-of-order execution is faster is that in most cases the compiler doesn't know the full pipeline delay of each instruction.
This helps with making binaries compatible with different processors implementing the same instruction set.
For instance, the compiler might assume that `+`, `-`, `*`, and `/` all take the same number of cycles, when in reality some of those will be faster than others.
A calculation like `A = B*C; B = A+D; C = C+1` could then be reordered to `A = B*C; C = C+1; B = A+D` (assuming that multiplication takes longer than addition).
In this case `B = A+D` has to wait for the result of `A = B*C`, but `C = C+1` doesn't have to wait, so it can fill the gap.
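To make the saving concrete, here is a toy model (entirely hypothetical, not how any real scheduler is implemented) that issues one instruction per cycle in program order, stalling an instruction until its source operands are ready, with an assumed latency of 3 cycles for multiply and 1 for add:

```python
def finish_time(program, latency):
    """Cycle at which the last result becomes available, for a simple
    in-order issue model: one instruction issues per cycle, but an
    instruction stalls until all of its source values are ready."""
    ready = {}   # cycle at which each named value becomes available
    cycle = 0    # earliest cycle the next instruction may issue
    for dest, op, srcs in program:
        # stall until every source operand has been produced
        issue = max([cycle] + [ready.get(s, 0) for s in srcs])
        ready[dest] = issue + latency[op]
        cycle = issue + 1
    return max(ready.values())

LAT = {"mul": 3, "add": 1}  # assumed latencies, in cycles

# Original order: B = A+D stalls waiting for the multiply.
naive = [("A", "mul", ["B", "C"]),   # A = B*C
         ("B", "add", ["A", "D"]),   # B = A+D  (depends on A)
         ("C", "add", ["C"])]        # C = C+1  (independent)

# Reordered: the independent C = C+1 fills the multiply's delay.
reordered = [("A", "mul", ["B", "C"]),
             ("C", "add", ["C"]),
             ("B", "add", ["A", "D"])]

print(finish_time(naive, LAT))      # 5 cycles
print(finish_time(reordered, LAT))  # 4 cycles
```

The reordered version finishes one cycle earlier because the addition that doesn't depend on the multiply executes during the multiply's latency instead of after the stall.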
If the compiler were fully aware of the pipeline delays of each instruction, it could have scheduled the instructions like that in the first place.
But then your code would only run optimally on processors with those exact pipeline delays. It would also make scheduling more complicated.