There was a time when displays did everything by passing around rendering primitives -- lines, rectangles, black-and-white bitmap pattern tiles. At that time it made a lot of sense to integrate networking at a low level, because you had to figure out how to send and decode all those drawing primitives over the wire.
Display technology moved on. Displays became rich and complex and colourful, different applications had very different needs, and the applications took on more and more of the rendering task themselves and simply pushed bitmap buffers to the display system. Now the task of the display system was to request, manage, and mediate bitmap buffers from the various clients.
At this point remote display was a matter of having the client send (potentially compressed) bitmap buffers -- let the clients do their own rendering. This is how most remote display systems written in the last 15 years do it. Indeed, that is how X does it these days for most applications: the applications do their own rendering via GTK or Qt and Cairo, and X pushes the pixmaps down the wire.
If all you are doing is throwing around bitmap buffers, and the display software is simply mediating and displaying those, then remote display doesn't need a whole lot of thought at the display level -- all it has to do is mediate and display the bitmaps it gets from clients. Now, providing a remoting system to let remote clients get their bitmap buffers to the display when requested ... well, that's still a thing that needs to be done, but the base display software doesn't have to care too much about how it gets done.
Think of it as teasing out the layers in the software. The base layer is pushing pixels to the screen (no matter where the data for those pixels came from, remote or local). That's one job: pixels on screen. Focus on that and do it well. Another job is delivering to the base layer the data it is going to display, and that is the layer where you worry about remote/local differences.
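The layering described above can be sketched as an interface boundary. This is a toy model, not any real compositor's API -- the class and method names are my own invention -- but it shows the separation: the display layer presents whatever buffer it is handed, and only the source layer knows (or cares) whether that buffer came from shared local memory or off a socket:

```python
from abc import ABC, abstractmethod

class BufferSource(ABC):
    """Second layer: produce pixel data. The display never asks where it came from."""
    @abstractmethod
    def next_buffer(self) -> bytes: ...

class LocalSource(BufferSource):
    """A client on the same machine, e.g. handing over a shared buffer."""
    def __init__(self, buf: bytes):
        self._buf = buf
    def next_buffer(self) -> bytes:
        return self._buf

class RemoteSource(BufferSource):
    """A client elsewhere; all transport detail lives here, not in the display."""
    def __init__(self, recv_fn):
        # recv_fn would, say, read and decompress one frame off a socket.
        self._recv = recv_fn
    def next_buffer(self) -> bytes:
        return self._recv()

class Display:
    """Base layer: pixels on screen, nothing else."""
    def __init__(self):
        self.framebuffer = b""
    def present(self, source: BufferSource):
        self.framebuffer = source.next_buffer()
```

Because `Display.present` takes any `BufferSource`, remoting becomes a feature of the source layer alone; the base layer's code is identical for local and remote clients, which is exactly the property the layering is meant to buy.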