Boost UltraSPARC T1 Floating Point w/ a Graphics Card? 71

alxtoth asks: "All over the web, Sun's UltraSPARC T1 is described as 'not fit for floating point calculations'. Somebody has benchmarked it for HPC applications and got results that weren't that bad. What if one of the threads could do the floating point in the GPU, as suggested here? Even if the factory setup does not expect a video card, could you insert a low-profile PCI-E video card, boot Ubuntu, and expect decent performance?"
  • No, you cannot (Score:5, Insightful)

    by keesh ( 202812 ) on Saturday April 22, 2006 @05:09PM (#15181949) Homepage
    Sun SPARC kit doesn't use a BIOS. Unfortunately, nearly all modern graphics cards that haven't been specifically designed to work on non-x86 kit rely upon the BIOS to initialise the card. This massively limits the hardware availability. PCI, sadly, is only a hardware standard.

    There's been some work by David S Miller on getting BIOS emulation into the Linux kernel so that regular cards can be fooled into working, but it's not there yet and will probably fall foul of Debian's firmware loading policy (does that apply to Ubuntu too?).
  • by pedantic bore ( 740196 ) on Saturday April 22, 2006 @05:29PM (#15181998)
    I remember when it was common practice to buy extra hardware to add to your system to implement fast floating point ops. First it was a box (FPS), then a few cards (Sky), then a card (Mercury), then a daughterboard (everyone), then a chip (Weitek)... and then it was on the CPU and everyone expected it to be there.

    But Sun realized that the more things change, the more they stay the same: the reason vendors got away with making floating point an expensive option was that there are lots of workloads where floating point performance is unimportant. So they applied the RISC principle and chose not to waste silicon on the T1 implementing instructions that aren't needed in its target workload, spending it instead on lots of concurrent threads.

    Trying to improve floating point perf on a T1 by adding another card is like trying to figure out how to put wheels on a fish. It might be a cool hack and it might solve some particular problem but it doesn't generalize.

    If you want floating point perf and tons of threads, wait for the Rock chip from Sun (and hope that Sun stays afloat long enough to ship it). It's like a T1, only more so, with floating point for each thread.

  • Feh (Score:3, Insightful)

    by NitsujTPU ( 19263 ) on Saturday April 22, 2006 @05:45PM (#15182043)
    At that point, you're bound by the bandwidth between the graphics card and the CPU. Why not just purchase hardware that works for what you want to use it for in the first place?
  • by PaulBu ( 473180 ) on Saturday April 22, 2006 @08:43PM (#15182599) Homepage
    Most real-life CAD software (as in, what is used to build the chips inside your little computer box or your cellphone) used to run (~8 years ago) on Solaris, with occasional HP-UX, AIX, and Linux. Now it's Linux and Solaris; the rest are somewhat supported, but not exactly healthy... You can get some FPGA/PCB/solid 3D CAD on Windows, but it is nowhere near true industrial-strength quality. Think about it this way: if you pay $100,000 for a seat, it does not really matter how much the hardware costs, and Sun was winning due to general stability/availability. IBM (the big Cadence shop) pushed Cadence to release the Linux version of their software simultaneously with the Solaris version about 5 years ago; since then, Linux has been gaining popularity...

    There are no good technical reasons not to recompile something like this for OS X, but if you can imagine porting a package which comes as a bookshelf of CDs from UN*X to the Win API, I'd like some of the stuff you are smoking! ;-)

    Paul
  • by mosel-saar-ruwer ( 732341 ) on Saturday April 22, 2006 @10:18PM (#15182875)

    nVidia & IBM/Sony/Cell/Playstation can perform only 32-bit single-precision floating point calculations in hardware. [IBM/Sony can, at least in theory, perform 64-bit double-precision floating point calculations, but the implementation involves some weird software emulation thingamabob which invokes a massive performance penalty.]

    ATi is even worse - last I checked, they could perform only 24-bit "three-quarters"-precision floating point calculations in hardware.

    And just in case you aren't aware, 32-bit single-precision floats are essentially worthless for anyone doing even the simplest mathematical calculations; for instance, with 32-bit single-precision floats, integer granularity is lost at 2^24 = 16M (a short C demonstration follows the table), i.e.

    16777216 + 0 = 16777216
    16777216 + 1 = 16777216
    16777216 + 2 = 16777218
    16777216 + 3 = 16777220
    16777216 + 4 = 16777220
    16777216 + 5 = 16777220
    16777216 + 6 = 16777222
    16777216 + 7 = 16777224
    16777216 + 8 = 16777224
    16777216 + 9 = 16777224
    16777216 + 10 = 16777226
    16777216 + 11 = 16777228
    16777216 + 12 = 16777228
    16777216 + 13 = 16777228
    16777216 + 14 = 16777230
    16777216 + 15 = 16777232
    16777216 + 16 = 16777232
    etc
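
    A minimal C sketch of the granularity loss tabulated above (my illustration, not from the original post; it assumes IEEE-754 single precision with round-to-nearest-even, which is what essentially all commodity hardware implements):

        #include <stdio.h>

        int main(void)
        {
            /* 2^24 is the threshold past which not every integer is
               exactly representable as a 32-bit float. */
            float base = 16777216.0f;

            for (int i = 0; i <= 16; i++) {
                /* Above 2^24 consecutive floats are 2 apart, so each
                   exact sum is rounded to the nearest even integer,
                   reproducing the table above. */
                float sum = base + (float)i;
                printf("16777216 + %2d = %.0f\n", i, sum);
            }
            return 0;
        }
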
    Now while 64-bit double-precision floats [or "doubles"] are probably accurate enough for most financial calculations, where, generally speaking, accuracy is only needed to the nearest 1/100th [i.e. to the nearest cent], 64-bit doubles are still more or less worthless to the mathematician, physicist, and engineer.

    For instance, consider the work of Professor Kahan at UC-Berkeley:

    William Kahan [berkeley.edu]
    In particular, read a few of his papers from the late nineties. At the time, Kahan was arguing in favor of using the full power of the Intel/AMD 80-bit extended-precision format [i.e. embedding 64-bit doubles in an 80-bit space, performing the intermediate calculations with the greater accuracy afforded therein, and then rounding the result back down to 64 bits and returning that as your answer]. But, truth be told, the sine qua non of hardware-based calculation is true 128-bit "quad-precision" floating point performed in hardware.
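
    To make that concrete, here is a minimal sketch of the idiom Kahan advocated (my illustration, not Kahan's code; it assumes an x86 compiler where long double is the 80-bit x87 extended format): carry the intermediates in extended precision, then round the final result back down to a 64-bit double:

        #include <stdio.h>

        int main(void)
        {
            int i;

            /* Accumulate in 80-bit extended precision (~64 mantissa
               bits on x86)... */
            long double acc = 0.0L;
            for (i = 0; i < 10000000; i++)
                acc += 0.1L;

            /* ...then round back down to a 64-bit double at the end. */
            double refined = (double)acc;

            /* For comparison: the same sum carried entirely in doubles
               accumulates rounding error at every one of the 10^7 steps. */
            double plain = 0.0;
            for (i = 0; i < 10000000; i++)
                plain += 0.1;

            printf("extended, then rounded: %.17g\n", refined); /* much closer to 1000000 */
            printf("double throughout:      %.17g\n", plain);
            return 0;
        }
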

    Sun has a "quad-precision" floating point type for Solaris/SPARC, but, sadly, it's implemented in software, and, like IBM/Sony/Cell/Playstation, far too slow to be used in practice.

    I believe that IBM makes a chip for the z-Series mainframe which can perform 128-bit floating point in hardware, but I imagine that it's prohibitively expensive [if you could even convince IBM to sell it to you in the first place].

    The best configuration here would probably look like a fancy-schmancy digital signal processor [DSP] chipset, from someone like Texas Instruments, capable of 128-bit hardware calculations, mounted onto a card that would plug into something very fast, like a 16x PCIe bus, which in turn would be connected to a HyperTransport bus [but boy, wouldn't it be really cool if the DSP sat directly on the HyperTransport bus itself?].

    By the way, if anyone knows of a company that's making such a card, with stable drivers [or, God forbid, a motherboard with a socket for a 128-bit DSP on the HyperTransport bus], then please tell me about it, 'cause I'd be very interested in purchasing such a thing.

  • by Anonymous Coward on Sunday April 23, 2006 @12:29PM (#15185113)
    IBM/Sony can, at least in theory, perform 64-bit double-precision floating point calculations, but the implementation involves some weird software emulation thingamabob which invokes a massive performance penalty.

    Just for the record: Cell uses no "software emulation" for its double-precision calculations. It's a 7-cycle latency to do two DP multiply-adds, which is certainly not slow. The "slow" part is that the throughput is also 7 cycles, meaning that back-to-back DP MADDs don't pipeline. So, while this cuts the theoretical maximum GFLOPS down significantly (SP MADDs can issue one per cycle, in addition to a non-FP instruction), the in-practice performance is much closer...

    and we're still talking (4 flops / 7 cycles) * (8 SPEs) * x GHz => ~18.3 DP GFLOPS @ 4.0 GHz (pretty freaking fast!)
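
    (A trivial back-of-the-envelope check of that figure in C, using only the numbers quoted above:)

        #include <stdio.h>

        int main(void)
        {
            double flops_per_issue  = 4.0; /* two DP multiply-adds = 4 flops */
            double cycles_per_issue = 7.0; /* non-pipelined: one issue per 7 cycles */
            double num_spes         = 8.0;
            double clock_ghz        = 4.0;

            /* (4 / 7) flops/cycle * 8 SPEs * 4e9 cycles/s */
            double gflops = flops_per_issue / cycles_per_issue * num_spes * clock_ghz;
            printf("peak DP: %.1f GFLOPS\n", gflops); /* prints 18.3 */
            return 0;
        }
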

    Oh, and GPUs aren't viable as FPUs because the latency sucks so hard.
