Note: I'm no expert in this area; this is just some stuff I've picked up, along with a basic understanding of how these techniques are employed. There may be inaccuracies or incomplete information - corrections welcome.
OIT is one area that modern graphics hardware really struggles with. A software renderer can just allocate memory dynamically and keep, for each pixel, a list of the depth and colour of every fragment that contributes to that pixel's final colour. On a 'traditional' GPU, the big problem is that you have no easy way to store anything more than a single 'current' colour per pixel, which gets irreversibly blended or overwritten by fragments with a lower depth value - and even if you could keep a list of colours, you would have no associated depth values, nor a simple way to sort them on the GPU. However, there is some clever trickery, detailed below:
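For comparison, here is roughly what that per-pixel bookkeeping looks like in software - a minimal sketch with invented names, assuming fragments carry premultiplied-alpha colours:

```python
# Sketch of what a software renderer can do per pixel: keep every
# transparent fragment (depth, premultiplied RGBA) in a list, sort by
# depth, then composite front-to-back with the 'over' operator.
# All names and fragment values here are made up for illustration.

def composite_pixel(fragments, background):
    """fragments: list of (depth, (r, g, b, a)) with premultiplied alpha."""
    colour = [0.0, 0.0, 0.0]
    transmittance = 1.0  # how much of what lies behind is still visible
    # Nearest fragment first: front-to-back 'over' compositing.
    for depth, (r, g, b, a) in sorted(fragments, key=lambda f: f[0]):
        colour[0] += transmittance * r
        colour[1] += transmittance * g
        colour[2] += transmittance * b
        transmittance *= (1.0 - a)
    # Whatever light is left comes from the opaque background.
    return tuple(c + transmittance * bg for c, bg in zip(colour, background))

frags = [(0.7, (0.0, 0.5, 0.0, 0.5)),   # green glass, further away
         (0.3, (0.5, 0.0, 0.0, 0.5))]   # red glass, nearer
print(composite_pixel(frags, (1.0, 1.0, 1.0)))   # → (0.75, 0.5, 0.25)
```

A nice property of going front-to-back is that you can stop early once the accumulated transmittance is near zero, since anything further back is invisible anyway.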
Realtime OIT has been researched and published on (notably by Nvidia and Microsoft) for over a decade.
Here's the basic technique - 'Depth Peeling', from 2001:
Depth peeling renders the scene multiple times with successive layers of transparent geometry removed, front to back, to build up an ordered set of buffers which can be combined to give a final pixel value.
This technique has severe performance penalties, but the alternative (z-sorting all transparent polygons every frame, which still gives wrong results for intersecting or cyclically overlapping polygons) is much, much worse.
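The peeling loop can be sketched on the CPU by treating a pixel's scene as a plain list of (depth, colour) fragments - names here are illustrative; in the real technique each iteration is a full render pass that uses the previous pass's depth buffer to reject already-peeled fragments:

```python
# CPU-side sketch of the depth-peeling loop for a single pixel.
# Each 'pass' keeps only fragments strictly behind the last peeled depth,
# and the normal depth test then selects the nearest of those.

def depth_peel(fragments, max_peels):
    peeled = []                    # layers extracted nearest-first
    last_depth = -float('inf')
    for _ in range(max_peels):
        candidates = [f for f in fragments if f[0] > last_depth]
        if not candidates:
            break                  # every layer has been peeled
        nearest = min(candidates, key=lambda f: f[0])
        peeled.append(nearest)
        last_depth = nearest[0]
    return peeled

scene = [(0.7, (0.0, 0.5, 0.0, 0.5)),
         (0.3, (0.5, 0.0, 0.0, 0.5)),
         (0.5, (0.0, 0.0, 0.5, 0.5))]
print([d for d, _ in depth_peel(scene, 4)])   # → [0.3, 0.5, 0.7]
```

The ordered layers that come out can then be composited front-to-back; the cost is one full scene render per layer, which is where the performance penalty comes from.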
'Dual Depth Peeling' - from 2008:
This works in much the same way, but is able to store samples from multiple layers of geometry in each rendering pass - peeling from the front and the back simultaneously, roughly halving the number of passes needed.
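As a rough sketch of the idea (a CPU simulation with made-up names, not the real shader code), each pass shrinks a depth window from both ends at once:

```python
# CPU sketch of dual depth peeling for one pixel: each pass extracts both
# the nearest and the farthest layer still inside a shrinking depth
# window, so n layers need about n/2 passes instead of n.
# Fragments are (depth, colour) pairs; the values are invented.

def dual_depth_peel(fragments, max_passes):
    front, back = -float('inf'), float('inf')
    front_layers, back_layers = [], []   # peeled nearest-first / farthest-first
    for _ in range(max_passes):
        remaining = [f for f in fragments if front < f[0] < back]
        if not remaining:
            break
        nearest = min(remaining, key=lambda f: f[0])
        farthest = max(remaining, key=lambda f: f[0])
        front_layers.append(nearest)
        front = nearest[0]
        if farthest is not nearest:      # more than one layer in the window
            back_layers.append(farthest)
            back = farthest[0]
    return front_layers, back_layers

layers = [(0.2, 'green'), (0.4, 'blue'), (0.1, 'red'), (0.3, 'yellow')]
fl, bl = dual_depth_peel(layers, 2)
print([c for _, c in fl], [c for _, c in bl])
# → ['red', 'green'] ['blue', 'yellow']
```

The front list blends front-to-back and the back list back-to-front, and the two halves are combined at the end.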
Refinements to the DDP technique, cutting out another pass - from 2010:
Reverse depth peeling was developed for situations where memory is at a premium: it extracts the layers back to front, blending each one immediately into an output buffer instead of extracting, sorting and then blending. It is also possible to abuse the hardware used for antialiasing to store multiple samples per output pixel.
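A CPU sketch of the reverse-peeling idea, again with invented names and premultiplied-alpha fragments - the point is that each peeled layer is blended away immediately, so only a single colour buffer is ever needed:

```python
# Reverse depth peeling for one pixel: peel the FARTHEST remaining layer
# each pass and 'over'-blend it straight onto the output colour.
# No per-layer storage and no final sorting step.

def reverse_peel_and_blend(fragments, background):
    colour = list(background)
    last_depth = float('inf')
    while True:
        remaining = [f for f in fragments if f[0] < last_depth]
        if not remaining:
            return tuple(colour)
        depth, (r, g, b, a) = max(remaining, key=lambda f: f[0])
        # Back-to-front 'over': new = src + (1 - src_alpha) * dst.
        colour = [s + (1.0 - a) * d for s, d in zip((r, g, b), colour)]
        last_depth = depth

panes = [(0.7, (0.0, 0.5, 0.0, 0.5)),   # green pane, further away
         (0.3, (0.5, 0.0, 0.0, 0.5))]   # red pane, nearer
print(reverse_peel_and_blend(panes, (1.0, 1.0, 1.0)))   # → (0.75, 0.5, 0.25)
```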
Depth peeling really only works well for a few layers of transparent geometry, unless you can afford a lot of passes per frame - but in many scenes, surfaces behind the first four or so transparent layers contribute very little to visual quality anyway.
AMD's 'new' approach implements a full linked-list style A-buffer, with a separate sorting pass, on the GPU. This has only become possible with fairly recent hardware, and I guess it is 'the right way' to do OIT - very much the same as a software renderer on a CPU would do it.
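The per-pixel linked list is easy to picture in a CPU sketch (class and field names here are my own invention; on the GPU the node pool is one big buffer, an atomic counter hands out node indices, and fragment shaders atomically swap themselves into a head-pointer image):

```python
# CPU sketch of a linked-list A-buffer. Fragments for every pixel share
# one node pool; each pixel stores only a head index into it. A later
# 'resolve' pass walks each pixel's chain and sorts it by depth, ready
# for blending.

NIL = -1

class ABuffer:
    def __init__(self, width, height):
        self.width = width
        self.heads = [NIL] * (width * height)   # per-pixel head-pointer 'image'
        self.nodes = []                         # shared pool: (depth, colour, next)

    def insert(self, x, y, depth, colour):
        # GPU equivalent: atomic node allocation + atomic head swap.
        idx = y * self.width + x
        self.nodes.append((depth, colour, self.heads[idx]))
        self.heads[idx] = len(self.nodes) - 1

    def resolve(self, x, y):
        # Walk the chain, then sort by depth - the separate sorting pass.
        node, frags = self.heads[y * self.width + x], []
        while node != NIL:
            depth, colour, node = self.nodes[node]
            frags.append((depth, colour))
        return sorted(frags)

buf = ABuffer(2, 2)
buf.insert(0, 0, 0.7, 'green')
buf.insert(0, 0, 0.3, 'red')
buf.insert(1, 0, 0.5, 'blue')
print(buf.resolve(0, 0))   # → [(0.3, 'red'), (0.7, 'green')]
```

Because insertion is just an append plus a head swap, fragments can arrive in any order - exactly what an unsynchronised swarm of fragment shader invocations produces.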
Here's some discussion and implementation of these techniques:
This really isn't anything new - single-pass OIT using CUDA for fragment accumulation and sorting was presented at Siggraph 2009 - nor is it something PTS can claim as their own. It's possible AMD's FirePros have special support for A-buffer creation and sorting, which is why they run fast, and AMD in general has a pretty big advantage in raw GPGPU speed for many operations (let down by their awful driver support on non-Windows platforms, of course). But really, any GPU that can define and access custom-structured buffers will be able to perform this kind of task, and given Nvidia's long history of researching and publishing on this subject, it's pretty laughable that AMD and PTS can claim it as their new hotness.