What's the justification for the compilation unit boundary? It seems like you could expose the layout of the struct (and therefore any compiler shenanigans) through other means within a compilation unit. offsetof comes to mind. :-)
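Something like this minimal sketch (hypothetical struct, not from the discussion) would make any field reordering observable:

```cpp
#include <cstddef>
#include <cstdio>

struct Point {
    char   tag;
    double x;    // any padding the compiler inserts before this is observable
    int    id;
};

int main() {
    // If the compiler reordered or repacked the fields, the offsets
    // would change, so the representation is effectively exposed.
    std::printf("tag: %zu, x: %zu, id: %zu\n",
                offsetof(Point, tag),
                offsetof(Point, x),
                offsetof(Point, id));
}
```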
That's the granularity at which you can do escape analysis accurately. One thing that my student explored was using different representations for the internal and public versions of the structure. Unless the pointer is marked volatile, or atomic operations occur that establish happens-before relationships affecting it (and you have to assume that any function whose body you can't see contains such operations), C allows you to do a deep copy, work on the copy, and then copy the result back. He tried this to transform between column-major and row-major order for some image processing workloads. He got a speedup for the computation step, but the cost of the copying outweighed it (a programmable virtualised DMA controller might change this).
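A rough sketch of the copy / transform / copy-back idea, with hypothetical names (his version was a compiler transformation, not hand-written code):

```cpp
#include <cstddef>
#include <vector>

// Public representation: a row-major image buffer.
// Internal copy: column-major, which suits a column-wise kernel better.
void process_image(float* img, std::size_t rows, std::size_t cols) {
    // Deep copy into a column-major layout. This is legal as long as
    // nothing else can observe the original buffer in this region.
    std::vector<float> colMajor(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            colMajor[c * rows + r] = img[r * cols + c];

    // ... run the computation on colMajor, now with unit-stride
    // accesses down each column ...

    // Copy the result back into the public row-major layout.
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            img[r * cols + c] = colMajor[c * rows + r];
}
```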
I suppose you could do that in C++ with template specialization. In fact, doesn't that happen today in C++11 and later, with movable types vs. copyable types in certain containers? Otherwise you couldn't have std::vector&lt;std::unique_ptr&lt;T&gt;&gt;. Granted, that specialization is based on a very specific trait, and without it the particular combination wouldn't even work.
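For instance (a minimal sketch):

```cpp
#include <memory>
#include <vector>

int main() {
    // The element type is movable but not copyable.
    std::vector<std::unique_ptr<int>> v;
    v.push_back(std::make_unique<int>(42));  // moved into place
    v.push_back(std::make_unique<int>(7));   // reallocation moves, never copies
    // std::vector<std::unique_ptr<int>> w = v;  // won't compile: copying is deleted
}
```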
The problem with C++ is that these decisions are made early. The fields of a collection are all visible (so that you can allocate it on the stack) and the algorithms are as well (so that you can inline them). These have nice properties for micro-optimisation, but they mean that you miss macro-optimisation opportunities.
To give a simple example, libstdc++ and libc++ use very different representations for std::string. The implementation in libstdc++ uses reference counting and lazy copying for the data. This made a lot of sense when most code was single-threaded and caches were very small, but it is now far from optimal. The libc++ implementation (and possibly the new libstdc++ one; they're breaking the ABI at the moment) uses the short-string optimisation, where small strings are embedded in the object (so they fit in a single cache line), and doesn't bother with the CoW trick (which costs cache-coherency bus traffic and doesn't buy much saving any more, especially now that people use std::move or std::shared_ptr in the places where the optimisation would matter).
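A simplified sketch of an SSO layout (illustrative only; the real libc++ representation packs the discriminating flag into spare bits of the size/capacity fields rather than using a separate bool):

```cpp
#include <cstddef>

// Simplified short-string-optimised string. On a typical 64-bit
// platform both forms below occupy the same 24 bytes.
struct sso_string {
    struct long_form {            // heap-allocated representation
        char*       data;
        std::size_t size;
        std::size_t capacity;
    };
    struct short_form {           // data embedded in the object itself,
        char buffer[23];          // so a small string fits in a cache line
        unsigned char size;
    };
    union {
        long_form  heap;          // used when the string is long
        short_form small;         // used when it fits inline
    } repr;
    bool is_short;                // discriminates the two forms
};
```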
In Objective-C (and other late-bound languages) this optimisation can be done at run time. For example, if you use NSRegularExpression with GNUstep, it uses ICU to implement it. ICU has a UText object that implements an abstract text interface and has a callback to fill a buffer with a run of characters. We have a custom NSString subclass and a custom UText callback that do the bridging. The abstract NSString class has a method for getting a range of characters. The default implementation gets them one at a time, but most subclasses can get a run at once. The version that wraps a UText does this by invoking the callback to fill the UText buffer and then copying. The version that wraps in the other direction just uses this method to fill the UText buffer. This ends up being a lot more efficient than if we'd had to copy between two entirely different implementations of a string.
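The shape of that pattern, sketched in C++ for illustration (the real code is Objective-C plus ICU's C callback API; all names here are hypothetical):

```cpp
#include <cstddef>
#include <cstring>

// Abstract string: the only primitive subclasses must provide is
// per-character access; bulk access has a slow default implementation.
class AbstractString {
public:
    virtual ~AbstractString() = default;
    virtual char characterAt(std::size_t index) const = 0;
    virtual std::size_t length() const = 0;

    // Default: fetch the requested range one character at a time.
    virtual void getCharacters(char* buffer, std::size_t start,
                               std::size_t count) const {
        for (std::size_t i = 0; i < count; ++i)
            buffer[i] = characterAt(start + i);
    }
};

// A subclass with contiguous storage satisfies a bulk request with a
// single copy, analogous to the UText-backed NSString filling ICU's
// buffer through the callback in one go.
class ContiguousString : public AbstractString {
    const char* data_;
    std::size_t len_;
public:
    ContiguousString(const char* d, std::size_t n) : data_(d), len_(n) {}
    char characterAt(std::size_t i) const override { return data_[i]; }
    std::size_t length() const override { return len_; }
    void getCharacters(char* buffer, std::size_t start,
                       std::size_t count) const override {
        std::memcpy(buffer, data_ + start, count);  // one run, one copy
    }
};
```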