I don't see how you can implement a lower-level protocol (e.g. raw Thunderbolt DMA) using a higher-level abstraction of that protocol (e.g. PCIe traffic). That's like saying you'll implement Internet-layer packets using only TCP. Similarly, I don't see how you can expose something that doesn't conform to anything remotely like PCIe as a hot-plug PCIe device: the latency tolerances to remain in spec are way different, for a start.
I too have implemented a driver, from a high-end FPGA to the Mac, and the OS X kernel does not get involved unless you're traversing controllers within that Mac, the route cannot be expressed within a single transaction, or the destination is local. It just doesn't. To my knowledge, these are the only 3 reasons for the local CPU to get involved:
[1] If you have a machine with devices (1, 2, ...) on multiple Thunderbolt controllers (say A and B), it's possible to have a route like A2 -> A1 -> A0 -/-> B0 -> B1, and of course the kernel is involved then, because the individual controller chips A and B are not bridged together in any other way. The kernel has to route between A0 (local) and B0 (also local).
[2] The initial spec for Thunderbolt allowed a lot of flexibility with source-defined routing tables, but it wasn't taken advantage of, and the later chips from Intel removed some of that functionality (or, more likely, just reassigned the chip real estate to something more useful). There are now potentially valid routes that can't be expressed within a single frame, and the kernel has to be involved at that point as well, to make sure packets get to their correct destination. It is, however, unlikely that users will see these routing issues in real-world scenarios; you have to have a lot of devices on multiple buses before it's an issue.
[3] The destination is the local machine. Of course, the kernel has to get involved then.
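To make the three cases concrete, here is a rough sketch in Python. It is a toy model, not driver code: the `Hop` type, the `kernel_reasons` function and the `max_hops_per_frame` limit are all made up for illustration, and the hop-count check is only a stand-in for "cannot be expressed within a single frame", since the real constraint depends on the controller chip. Port 0 stands for the local (host) end of a controller, matching the A0/B0 notation above.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Reason(Enum):
    CROSSES_CONTROLLERS = auto()  # case [1]
    NOT_SINGLE_FRAME = auto()     # case [2]
    LOCAL_DESTINATION = auto()    # case [3]


@dataclass(frozen=True)
class Hop:
    controller: str  # hypothetical controller label, e.g. "A" or "B"
    port: int        # 0 is the local (host) end, as in A0 / B0


def kernel_reasons(route, max_hops_per_frame):
    """Return the set of reasons the local kernel would be involved.

    An empty set means the traffic stays on the Thunderbolt fabric.
    max_hops_per_frame is an assumed limit, not a documented value.
    """
    if not route:
        raise ValueError("route must contain at least one hop")

    reasons = set()

    # Case [1]: the route touches more than one controller chip,
    # so the kernel has to bridge between the two local ends.
    if len({hop.controller for hop in route}) > 1:
        reasons.add(Reason.CROSSES_CONTROLLERS)

    # Case [2]: stand-in check for a route that does not fit in one frame.
    if len(route) - 1 > max_hops_per_frame:
        reasons.add(Reason.NOT_SINGLE_FRAME)

    # Case [3]: the final hop is the local machine itself.
    if route[-1].port == 0:
        reasons.add(Reason.LOCAL_DESTINATION)

    return reasons


# The A2 -> A1 -> A0 -/-> B0 -> B1 route from case [1].
cross = [Hop("A", 2), Hop("A", 1), Hop("A", 0), Hop("B", 0), Hop("B", 1)]
# Device to device on the same controller: no kernel involvement.
peer = [Hop("A", 2), Hop("A", 1)]
# Device to the local machine.
to_host = [Hop("A", 2), Hop("A", 1), Hop("A", 0)]

for name, route in (("cross", cross), ("peer", peer), ("to_host", to_host)):
    print(name, sorted(r.name for r in kernel_reasons(route, max_hops_per_frame=8)))
```

With the assumed limit of 8 hops, only the cross-controller route and the local-destination route report a reason; the peer-to-peer route on a single controller reports none.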
I have a lot of diagnostic code that monitors bandwidth, packet lifetime, routing, and latencies. I've run massive stress tests on multiple machines and devices connected via Thunderbolt, and so far the 3 reasons above are the only Thunderbolt-related causes for which an OS X machine enters the kernel. It is quite clear when the kernel does get involved compared to when it doesn't, so I'm confident that if it doesn't have to get involved, there is no interaction.