How much OS-specific work needs to happen, and how is it distributed?
I'm assuming that the HDMI-in is fairly normal, unless they really broke the EDID/DDC or something (though it's obviously not going to be very pleasant unless the application drawing to the '1920x1080 monitor' knows that each of my eyes only gets half of it).
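For what it's worth, checking whether the EDID side is "fairly normal" on Linux doesn't need special tooling: the kernel exposes each connector's EDID in sysfs. A minimal sketch, with the caveat that the `card0-HDMI-A-1` path is an assumption - connector names vary per machine:

```c
/* Sketch: sanity-check the EDID a display exposes over DDC, as read by the
 * kernel. The sysfs path below is an assumption; adjust card/connector names. */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/sys/class/drm/card0-HDMI-A-1/edid", "rb");
    if (!f) { perror("open edid"); return 1; }

    unsigned char edid[128];
    size_t n = fread(edid, 1, sizeof edid, f);
    fclose(f);

    /* Every valid EDID block starts with this fixed 8-byte header. */
    static const unsigned char hdr[8] = {0x00,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0x00};
    if (n < sizeof edid || memcmp(edid, hdr, sizeof hdr) != 0) {
        fprintf(stderr, "no valid EDID block\n");
        return 1;
    }
    /* Bytes 8-9 pack the manufacturer ID as three 5-bit letters (1 = 'A'). */
    printf("manufacturer: %c%c%c\n",
           '@' + ((edid[8] >> 2) & 0x1F),
           '@' + (((edid[8] & 0x03) << 3) | ((edid[9] >> 5) & 0x07)),
           '@' + (edid[9] & 0x1F));
    return 0;
}
```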
Barring very good reasons (probably involving latency), I'd assume that the camera is just a UVC device, but that actually using it as anything but an expensive webcam requires the OR-specific head-tracking software to have access to it (the meat of which is presumably cross-platform; but the DirectShow-vs.-V4L2 and other interacting-with-the-system plumbing won't be).
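To illustrate what "just a UVC device" would mean in practice: if that holds, the camera should answer a standard V4L2 capability query like any webcam. A quick sketch, assuming the camera enumerates as `/dev/video0`:

```c
/* Sketch: confirm the tracking camera looks like a normal V4L2/UVC capture
 * device. The /dev/video0 node is an assumption; the number varies. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

int main(void) {
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct v4l2_capability cap;
    memset(&cap, 0, sizeof cap);
    if (ioctl(fd, VIDIOC_QUERYCAP, &cap) < 0) {
        perror("VIDIOC_QUERYCAP");
        close(fd);
        return 1;
    }

    printf("driver: %s, card: %s\n", cap.driver, cap.card);
    if (cap.capabilities & V4L2_CAP_VIDEO_CAPTURE)
        puts("standard capture interface - plain UVC as far as V4L2 is concerned");
    close(fd);
    return 0;
}
```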
The headset's USB interface presumably needs a specific driver, since 'read the outputs of a bunch of sensors, plus firmware updates' isn't exactly a USB device class; but it would presumably be a comparatively lightweight 'wrap the sensor outputs and get them to the host as quickly as possible' affair, with the bulk of the motion- and position-tracking logic being mostly OS-independent, apart from the layers it has to touch to get headset and camera data.
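On Linux, that 'wrap the sensor outputs' host side could be as thin as reading raw reports off a hidraw node. A hypothetical sketch - the `/dev/hidraw0` node and the 64-byte report size are illustrative assumptions, not the Rift's actual protocol:

```c
/* Sketch: pull raw sensor reports off a headset's HID interface as fast as
 * they arrive. Node path and report size are assumptions for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/hidraw0", O_RDONLY);
    if (fd < 0) { perror("open hidraw"); return 1; }

    unsigned char report[64];
    for (;;) {
        /* Each read() returns one interrupt-IN report; real tracking code
         * would timestamp it and hand it to the OS-independent fusion. */
        ssize_t n = read(fd, report, sizeof report);
        if (n <= 0) break;
        printf("report: %zd bytes, first byte %u (report ID, if the device "
               "uses numbered reports)\n", n, report[0]);
    }
    close(fd);
    return 0;
}
```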
Is this largely the extent of it (two mostly standard interfaces, one device-specific driver, plus getting the motion- and position-tracking software running on Linux and talking to the OS-specific interfaces to those drivers)? Do I fundamentally misunderstand how work is broken up within the Oculus system? Or do I basically understand it, but it turns out that latency demands are so stringent that a variety of brutal modifications to the typical OS graphics stack and GPU drivers are also required?
The problem isn't the OS support - in fact, it's quite possible the OR uses entirely standard USB interfaces.
E.g., you say the sensors and firmware update aren't standard - in fact, they are. The sensors are typically just USB HID devices (HID covers far more than mice, keyboards and joysticks - it includes UPSes, sensors and many other devices; basically, all a device needs to do is send a "report" on its state, which sensors can do easily).
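HID devices are self-describing, which is why this works: the report descriptor tells the host what each report contains, and sensors even get their own usage page (0x20) in the HID usage tables. A sketch that dumps a device's descriptor on Linux, assuming it's bound to `/dev/hidraw0`:

```c
/* Sketch: dump a HID device's report descriptor - the part of the spec that
 * lets a "sensor" declare itself without a custom driver. Node is assumed. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/hidraw.h>

int main(void) {
    int fd = open("/dev/hidraw0", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    int desc_size = 0;
    if (ioctl(fd, HIDIOCGRDESCSIZE, &desc_size) < 0) {
        perror("HIDIOCGRDESCSIZE");
        return 1;
    }

    struct hidraw_report_descriptor desc;
    desc.size = desc_size;
    if (ioctl(fd, HIDIOCGRDESC, &desc) < 0) { perror("HIDIOCGRDESC"); return 1; }

    /* An item "05 20" early on would mean Usage Page = Sensors (0x20). */
    for (int i = 0; i < desc_size; i++)
        printf("%02x%c", desc.value[i], (i % 16 == 15) ? '\n' : ' ');
    putchar('\n');
    close(fd);
    return 0;
}
```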
Firmware update has the DFU class - Device Firmware Upgrade. You may remember "DFU mode" on an Apple iDevice, which is exactly the same thing - it's a USB device class simple enough to be implemented in a boot ROM (so you can never really "brick" the device).
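Entering DFU mode is itself just one class-standard control request (DFU_DETACH, from the USB DFU 1.1 spec). A libusb sketch - the vendor/product IDs here are placeholders, not the Rift's:

```c
/* Sketch: ask a DFU-capable device to detach into its bootloader using the
 * class-standard DFU_DETACH request. IDs below are placeholders. */
#include <stdio.h>
#include <libusb-1.0/libusb.h>

#define DFU_DETACH 0  /* bRequest per the DFU 1.1 spec */

int main(void) {
    libusb_context *ctx;
    if (libusb_init(&ctx)) return 1;

    /* Placeholder vendor/product IDs - substitute the real device's. */
    libusb_device_handle *h = libusb_open_device_with_vid_pid(ctx, 0x1234, 0x5678);
    if (!h) { fprintf(stderr, "device not found\n"); libusb_exit(ctx); return 1; }

    /* bmRequestType 0x21 = host-to-device, class request, to an interface.
     * wValue is the detach timeout in ms, wIndex the DFU interface number. */
    int r = libusb_control_transfer(h, 0x21, DFU_DETACH,
                                    1000 /* wValue: timeout ms */,
                                    0    /* wIndex: interface  */,
                                    NULL, 0, 5000);
    printf("DFU_DETACH: %s\n", r < 0 ? libusb_error_name(r) : "ok");

    libusb_close(h);
    libusb_exit(ctx);
    return 0;
}
```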
The problem, I think, is that Linux's media handling just hasn't been up to par. Sure, you need to port your software to use V4L2 (which can be a challenge to begin with). But the OR prides itself on low-latency handling and all that, so I think a lot of optimization had to be done specifically for Linux to reach the latency they require. The OR software was heavily optimized on Windows to get that low latency, and all of that has to be redone on Linux. Quite possibly it bypasses parts of the Linux stack just to cut out abstraction layers and shave off even more latency - maybe even doing some of the work in kernel mode.
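To give a flavor of why "port to V4L2" is real work: low-latency capture means memory-mapped driver buffers and a tight dequeue loop, not naive read() calls. A condensed sketch with error handling mostly omitted; `/dev/video0` and the two-buffer ring are assumptions (fewer queued buffers means less buffering latency, at the cost of more risk of dropped frames):

```c
/* Sketch: minimal low-latency V4L2 capture loop using mmap'd buffers. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/videodev2.h>

int main(void) {
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Ask the driver for a small ring of mmap'able buffers. */
    struct v4l2_requestbuffers req = { .count = 2,
        .type = V4L2_BUF_TYPE_VIDEO_CAPTURE, .memory = V4L2_MEMORY_MMAP };
    ioctl(fd, VIDIOC_REQBUFS, &req);

    void *bufs[2];
    for (unsigned i = 0; i < req.count; i++) {
        struct v4l2_buffer b = { .index = i,
            .type = V4L2_BUF_TYPE_VIDEO_CAPTURE, .memory = V4L2_MEMORY_MMAP };
        ioctl(fd, VIDIOC_QUERYBUF, &b);
        bufs[i] = mmap(NULL, b.length, PROT_READ, MAP_SHARED, fd, b.m.offset);
        ioctl(fd, VIDIOC_QBUF, &b);  /* hand the buffer to the driver */
    }

    enum v4l2_buf_type t = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    ioctl(fd, VIDIOC_STREAMON, &t);

    for (int frame = 0; frame < 100; frame++) {
        struct v4l2_buffer b = { .type = V4L2_BUF_TYPE_VIDEO_CAPTURE,
                                 .memory = V4L2_MEMORY_MMAP };
        ioctl(fd, VIDIOC_DQBUF, &b);  /* blocks until a frame lands */
        /* ... run the pose-tracking image processing on bufs[b.index] ... */
        ioctl(fd, VIDIOC_QBUF, &b);   /* recycle the buffer immediately */
    }

    ioctl(fd, VIDIOC_STREAMOFF, &t);
    close(fd);
    return 0;
}
```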
It's not easy at all. Getting it "to work" initially is trivially easy; but the OR works because the latency between the sensors and the display updates is low, and that's the hard part. You don't just want it to work, you want it to work well.