I'm sure you're trolling, but just in case you're not: X11 was designed to solve the problem of connecting a screen on one machine to a server on another, because when it was designed the model was very much big multi-user servers, and the research question was how to migrate from dumb text terminals to dumb graphics terminals.
The result is a well-designed stream-oriented protocol, built so that high-latency links wouldn't cause a bad user experience. That stream protocol doesn't need to run over a network - it operates just as easily over a local UNIX domain socket, and in fact that's how almost every distro sets it up by default nowadays. It could just as easily run over a serial line.
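To make that concrete, here's a minimal Xlib sketch showing that the transport is invisible to the client: the same call connects over a local UNIX domain socket or over TCP depending only on what $DISPLAY says. Build with cc x.c -lX11.

    #include <X11/Xlib.h>
    #include <stdio.h>

    int main(void)
    {
        /* NULL means "use $DISPLAY". ":0" picks the local transport
         * (a UNIX domain socket, typically /tmp/.X11-unix/X0);
         * "somehost:0" would pick TCP. The client code is identical. */
        Display *dpy = XOpenDisplay(NULL);
        if (!dpy) {
            fprintf(stderr, "cannot open display\n");
            return 1;
        }
        printf("connected to %s\n", DisplayString(dpy));
        XCloseDisplay(dpy);
        return 0;
    }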
As local machines became more powerful (a trend that started in the 90s), it's now often the case that the multi-user server and the X display run on the same machine, and various optional extensions (MIT-SHM being the obvious example) have been introduced to pass data through shared memory instead of the socket.
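The shared-memory path is opt-in, and a client is supposed to probe for it. Here's a rough sketch of just that detection step (not a full XShmPutImage round trip), built with cc shm.c -lX11 -lXext:

    #include <X11/Xlib.h>
    #include <X11/extensions/XShm.h>
    #include <stdio.h>

    int main(void)
    {
        Display *dpy = XOpenDisplay(NULL);
        if (!dpy) return 1;
        /* MIT-SHM only helps when client and server share a machine --
         * the "local display" case described above. Remote clients just
         * keep using the wire protocol. */
        if (XShmQueryExtension(dpy))
            printf("MIT-SHM available: images can bypass the socket\n");
        else
            printf("MIT-SHM unavailable: pixels go over the stream\n");
        XCloseDisplay(dpy);
        return 0;
    }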
However, even though the approach taken by Windows is to lump everything together in one monolithic block, that doesn't mean it's the right way to go. If you actually look at how a modern GPU is designed, the model is far closer to X than it is to the Windows API. You put commands into a FIFO command buffer and at some point later the GPU executes them, and you want to minimise the synchronisation points between the CPU and GPU because every sync requires one to wait for the other. This is exactly analogous to how X commands are put into a serial stream and executed at some point later. You'd notice, if you actually looked into it, that even GL is designed this way.
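If you want to see that buffering in GL itself, here's a rough sketch (it assumes a current GL context; window and context setup are elided): the draw calls return immediately, and the only point where the CPU actually waits for the GPU is the explicit glFinish().

    #include <GL/gl.h>

    void draw_frame(void)
    {
        /* These return immediately: the commands are appended to the
         * driver's buffer for the GPU to execute later, exactly like
         * X requests queuing up in the stream. */
        glClear(GL_COLOR_BUFFER_BIT);
        glDrawArrays(GL_TRIANGLES, 0, 3);

        /* glFlush() pushes the buffer towards the GPU and returns;
         * still no waiting. */
        glFlush();

        /* glFinish() is a synchronisation point: it blocks until the
         * GPU has executed everything. This is the expensive wait
         * described above, so you avoid it where you can. */
        glFinish();
    }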
The X model makes sense. The GL model makes sense. Tightly coupled frameworks don't, because they're inherently tied to the technology of the moment. See how Windows has migrated from GDI through the various versions of DirectX until it finally ended up close to the GL model - the model SGI introduced 20 years ago, which has changed little since.