
Boost UltraSPARC T1 Floating Point w/ a Graphics Card?

Posted by Cliff
from the computed-outside-of-the-box dept.
alxtoth asks: "All over the web, Sun's UltraSPARC T1 is described as 'not fit for floating point calculations'. Somebody has benchmarked it for HPC applications, and got results that weren't that bad. What if one of the threads could do the floating point in the GPU, as suggested here? Even if the factory setup does not expect an video card, could you insert a low profile PCI-E video card, boot Ubuntu and expect decent performance?"

  • No, you cannot (Score:5, Insightful)

    by keesh (202812) on Saturday April 22, 2006 @05:09PM (#15181949) Homepage
    Sun SPARC kit doesn't use a BIOS. Unfortunately, nearly all modern graphics cards that haven't been specifically designed to work on non-x86 kit rely upon the BIOS to initialise the card. This massively limits hardware availability. PCI, sadly, is only a hardware standard.

    There's been some work by David S Miller on getting BIOS emulation into the Linux kernel so that regular cards can be fooled into working, but it's not there yet and will probably fall foul of Debian's firmware loading policy (does that apply to Ubuntu too?).
    • by NekoXP (67564) on Saturday April 22, 2006 @05:17PM (#15181967) Homepage
      We produce an Open Firmware solution which includes an x86 emulator to bootstrap x86 hardware, specifically graphics cards and the like.

      PowerPC boards, PC graphics chips with x86 BIOS, no driver edits required on the OS side... it's there just like it would be on a PC.

      http://metadistribution.org/blog/Blog/78A3C88E-1CE7-45B8-9C79-420134DD9B8E.html [metadistribution.org]
      http://www.genesippc.com/ [genesippc.com]
    • Re:No, you cannot (Score:3, Informative)

      by Jeff DeMaagd (2015)
      That problem had been solved for Alpha computers around 1992. I was able to choose from any standard PCI video card, though driver support in the OS was a different issue. There may be some patent issues though, so the approach might need to be different.
    • Re:No, you cannot (Score:1, Interesting)

      by Anonymous Coward
      [...]will probably fall foul of Debian's firmware loading policy

      No, it won't. The firmware won't be shipped with Debian; it would be run directly from the ROM on the very card that is to be initialized. Debian has shipped XFree86 for a long time, and it supports a similar method to initialize secondary graphics cards that require their BIOS to set them up to function properly (though that probably only works on x86 CPUs).
    • Lack of a BIOS can be worked around (e.g. the Pegasos boards [pegasosppc.com] have some sort of emulation built into their firmware that allows you to use normal PC graphics cards despite being PPC- and OpenFirmware-based), but without drivers you ain't doing jack shit. And that's a very big problem if you're not using an x86 CPU. The open-source r300 driver is making progress but is not near production quality, and AFAIK nothing similar exists for nVidia chips yet, so unless you can convince ATI and nVidia to port their drivers
    • I have yet to see a low-profile version; however, I have seen V210s and V240s with this card in them. It can only be a matter of time.
  • by the_humeister (922869) on Saturday April 22, 2006 @05:12PM (#15181955)
    Especially since current GPUs don't implement double-precision floating point math. Heh, in that vein you could add a dual Opteron single-board computer into one of the expansion slots...
  • by pedantic bore (740196) on Saturday April 22, 2006 @05:29PM (#15181998)
    I remember when it was common practice to buy extra hardware to add to your system to implement fast floating point ops. First it was a box (FPS), then a few cards (Sky), then a card (Mercury), then a daughterboard (everyone), then a chip (Weitek)... and then it was on the CPU and everyone expected it to be there.

    But Sun realized that the more things change, the more they stay the same: the reason vendors got away with making floating point an expensive option was that there are lots of workloads where floating point performance is unimportant. So they applied the RISC principle and chose not to waste a lot of silicon on the T1 implementing instructions that are not needed in their target workload, instead figuring out how to run lots of concurrent threads.

    Trying to improve floating point perf on a T1 by adding another card is like trying to figure out how to put wheels on a fish. It might be a cool hack and it might solve some particular problem but it doesn't generalize.

    If you want floating point perf and tons of threads, wait for the Rock chip from Sun (and hope that Sun stays afloat long enough to ship it). It's like a T1, only more so, with floating point for each thread.

    • It might be a cool hack and it might solve some particular problem but ...
      There's no "but" here. Cool hacks don't happen because they're useful, they happen because they're cool.
    • Meanwhile, GPU developers have created a component that processes floating point math very quickly, sold for much less $:FLOPS than Sparcs (or any other CPU). Combining a T1 and GPGPU offers "best of breed" economies of scale appropriate to each component, like installing 3rd party memory and HD rather than the expensive Sun brands.

      That's why GPGPU is an interesting strategy. GPU APIs offer parallelism, too. When those APIs can be harnessed with bus signalling that's high-enough level symbolically to exploi
      • by Anonymous Coward
        >Combining a T1 and GPGPU offers "best of breed" economies of scale appropriate to each component, like installing 3rd party memory and HD rather than the expensive Sun brands.

        Combining a T1 and a GPU offers you jack, since GPUs use single-precision arithmetic.
      • Well, no, if you want flops/$ then the signal processing chips used in cell phones and MP3 players are the clear winners. There are some real screamers here. But they're a bit complicated to program and don't function well as general purpose processors, which is why they're primarily used in systems where they can be programmed once and then shipped by the million.

        As I wrote before, I'm sure there's some workload where it makes sense to mate a T1 and a GPU (besides the obvious one, i.e., rendering grap

        • by Doc Ruby (173196) on Sunday April 23, 2006 @02:05AM (#15183468) Homepage Journal
          Those DSPs you mention aren't CPUs, and they're not available on PCI cards - and there are the programmability issues you mention.

          The way to think about the use of GPGPU in a host with its own (GP) CPU is client/server computing. I put together such a system in 1990, a 12MHz 80286, with 4 12.5MFLOPS DSPs (AT&T DSP32c) and an FPGA "scheduler" on the ISA card. The 286 ran a loop sending data and commands to a memory mapped page on the card's SRAM, and copying the page when a status register was set. I had realtime 24bit VGA renderings of megapolygons at 30FPS, all processed on the DSPs. The systems have all scaled up, but the price improvement per FLOPS of the GPUs over the CPU is even better now than then.

          As you say, the key is keeping the compute servers full, which amortizes the signalling overhead best, and keeping the signaling across the bus high-level enough that the bandwidth doesn't bottleneck. There are lots of demanding apps now which could use that architecture. Audio compression is my favorite - I'm waiting to stuff a $1000 P4 with 6 $400 dual GPUs, and beat the performance of any <$10K server, scalable down to $1500. That's the kind of host that could really transform telephony.
    • Sun's original Motorola 68K-based workstations had optional FPUs, as did the first "desktop" SPARC workstation, the 4/110. Sun workstations or servers equipped with a VME bus also had access to an optional Weitek FPU.

      Even more exotic was the TAAC-1, a wide-instruction-word processor that could be used for FFTs, imaging, etc.

      One correction: the T2 (Niagara II) will be the first heavily multi-threaded SPARC CPU with one FPU per core. It is due out next year, with Rock due out in 2008.
      • There was another system that had an optional FPU. I think it was called the IBM PC. You could get an FPU called the 8087. It was expensive, and your software had to be compiled to support it, which very few programs were.
        Was the Weitek an FPU or a vector processor?
  • Wait for the T2 (Score:4, Interesting)

    by IvyKing (732111) on Saturday April 22, 2006 @05:32PM (#15182010)
    The T2 is supposed to have an FPU for each core, so it would be a simpler solution than trying to use a graphics card. The T2 is also supposed to have double the number of threads per core and even more memory bandwidth.
    • Are you sure of that? I thought the whole point of the one FPU per chip was to dramatically cut down on power consumption, which is one of Niagara's main selling points.
        • There is an FPU per core, and the power for Niagara 2 is still supposed to be remarkably low.
          • Correct. T2 is expected to be lower power than, or equivalent to, the T1; part of this is because T2 will be built on a 65nm process, as opposed to the 90nm process used to fabricate T1s.

            The changes in T2 are: 2 pipelines per core, up from 1; 8 threads per core, up from 4; an FPU per core, up from 1 per module; a faster memory subsystem; and additional hardware support for encryption and network offload. On-chip cache is expected to remain the same.
  • Feh (Score:3, Insightful)

    by NitsujTPU (19263) on Saturday April 22, 2006 @05:45PM (#15182043)
    At that point, you're bound by the bandwidth between the graphics card and the CPU. Why not just purchase hardware that works for what you want to use it for in the first place?
    • by Mr Z (6791)

      Why not just purchase hardware that works for what you want to use it for in the first place?

      What if you want a better solution than the ones that are normally available?

      • This is a workaround, and usually not a very good one. I've seen people do very specialized things by moving the floating point stuff off to video cards, but for general computation, I think it's a rather poor solution.

        I.e., this is not a better solution than the ones that are normally available.
  • by Fallen Kell (165468) on Saturday April 22, 2006 @06:38PM (#15182192)
    All kinds of problems will arise with a setup like this. Performance could possibly improve for certain things, but applications would need to be coded for it, and code is not written for a unique setup like this. Multi-threaded code is written under the assumption that all CPUs have approximately the same abilities (in other words, it does not split floating point ops into one thread and I/O and int operations into other threads). Any thread in the application will potentially have floating point operations mixed with other operations.

    Now even if you custom code an application to do all floating point work in a specific thread, you would need to completely modify the kernel thread management sub-systems. The threads themselves would need meta flag data to signify what "kind" of thread they are so that the "floating point thread(s)" are queued for running on the GPU and not on the T1 (unless there are idle T1 cores and the GPU is already busy).
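    A toy sketch of the thread-metadata idea above (every name here is hypothetical; no real kernel exposes an interface like this): each task carries a flag, and a dispatcher routes flagged floating point work to a GPU queue and everything else to the CPU queue.

```python
from queue import Queue

# Hypothetical sketch only. Each task carries the "meta flag data"
# described above, and a dispatcher routes floating-point-heavy work to
# the GPU queue and everything else to the T1's integer cores.

class Task:
    def __init__(self, name, fp_heavy=False):
        self.name = name
        self.fp_heavy = fp_heavy  # the metadata flag

gpu_queue = Queue()  # work destined for the GPU's floating point units
cpu_queue = Queue()  # work destined for the T1 cores

def schedule(task):
    """Route a task according to its flag."""
    (gpu_queue if task.fp_heavy else cpu_queue).put(task)

for t in (Task("parse-request"), Task("matrix-multiply", fp_heavy=True),
          Task("serve-page"), Task("fft", fp_heavy=True)):
    schedule(t)

print(gpu_queue.qsize(), cpu_queue.qsize())  # 2 2
```

    The routing itself is trivial; the point of the comment stands, because real threads mix floating point with everything else, so no such clean split exists in existing code.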

    Now even if you have the above changes, the only thing this will work on is custom-made applications; in other words, you will need to completely rewrite anything and everything to take advantage of this setup. This really isn't viable when you may be dealing with non-open-source products like Matlab or Oracle. Even with open-source products, it will take MAJOR rework to implement a change like this.

    The T1 is designed for what it is: a multi-core processor that would make a very good NFS data server, FTP server, or web host with highly efficient power usage. It is NOT a database, application, or HPC server core. Too many of the latter workloads require too many floating point operations to run efficiently on the T1. In a pinch you can use it for them, but it will not shine in that role.

    • Why would a database server need floating point?
      I have never written one, but I have written btrees and hash algorithms, and they never used floating point.
      For a database server, I would guess you would tend to be I/O bound.
      You do have a point in that the T1 is a good platform for a web server or file server but not ideal for many other tasks. I wonder what its SSL performance is like?
    • Not quite sure how that got modded Informative.

      DBMSes don't require FPU performance, since they don't issue floating point instructions. The app server market is also dominated by integer workloads; think Java and J2EE app servers as an example.

      The T1 looks like an exceptionally effective Java/J2EE platform from the slew of great benchmark results Sun has published for the platform. It is also no slouch as a DBMS platform, as its SAP results show. It does lack single-threaded performance, so it's going to be
  • by mosel-saar-ruwer (732341) on Saturday April 22, 2006 @10:18PM (#15182875)

    nVidia & IBM/Sony/Cell/Playstation can perform only 32-bit single-precision floating point calculations in hardware. [IBM/Sony can, at least in theory, perform 64-bit double-precision floating point calculations, but the implementation involves some weird software emulation thingamabob which invokes a massive performance penalty.]

    ATi is even worse - last I checked, they could perform only 24-bit "three-quarters"-precision floating point calculations in hardware.

    And just in case you aren't aware, 32-bit single-precision floats are essentially worthless for anyone doing even the simplest mathematical calculations; for instance, with 32-bit single-precision floats, integer granularity is lost at 2 ^ 24 = 16M, i.e.

    16777216 + 0 = 16777216
    16777216 + 1 = 16777216
    16777216 + 2 = 16777218
    16777216 + 3 = 16777220
    16777216 + 4 = 16777220
    16777216 + 5 = 16777220
    16777216 + 6 = 16777222
    16777216 + 7 = 16777224
    16777216 + 8 = 16777224
    16777216 + 9 = 16777224
    16777216 + 10 = 16777226
    16777216 + 11 = 16777228
    16777216 + 12 = 16777228
    16777216 + 13 = 16777228
    16777216 + 14 = 16777230
    16777216 + 15 = 16777232
    16777216 + 16 = 16777232
    etc
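    The table above can be reproduced in ordinary Python by round-tripping values through IEEE-754 single precision with the stdlib struct module (an illustrative sketch; nothing SPARC- or GPU-specific here):

```python
import struct

def to_f32(x):
    """Round a Python float (a 64-bit double) to IEEE-754 single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

for n in range(5):
    print(f"16777216 + {n} = {to_f32(16777216.0 + n):.0f}")
# 16777216 + 1 = 16777216  (the +1 is lost: float spacing is 2 at 2^24)
# 16777216 + 3 = 16777220  (16777219 is a tie; it rounds to the even mantissa)
```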
    Now while 64-bit double-precision floats [or "doubles"] are probably accurate enough for most financial calculations, where, generally speaking, accuracy is only needed to the nearest 1/100th [i.e. to the nearest cent], 64-bit doubles are still more or less worthless to the mathematician, physicist, and engineer.

    For instance, consider the work of Professor Kahan at UC-Berkeley:

    William Kahan [berkeley.edu]
    In particular, read a few of these papers from the late nineties:
    At the time, Kahan was arguing in favor of using the full power of the Intel/AMD 80-bit extended precision doubles [i.e. embedding 64-bit doubles in an 80-bit space, performing calculations with the greater accuracy afforded therein, and then rounding the result back down to 64-bits and returning that as your answer], but, truth be told, the Sine Qua Non of hardware-based calculations is true 128-bit "quad-precision" floating point calculations as performed in hardware.
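    Python has no 80-bit extended type, but Kahan's idea (carry intermediate results in extra precision, round once at the end) can be illustrated with the stdlib math.fsum, which keeps exact partial sums internally and performs a single final rounding. The function choice is my analogy, not something from the thread:

```python
import math

xs = [0.1] * 10

naive = sum(xs)        # rounds after every addition
better = math.fsum(xs) # exact internal sums, one final rounding

print(naive == 1.0)   # False: accumulated per-step rounding error
print(better == 1.0)  # True
```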

    Sun has a "quad-precision" floating point number for Solaris/SPARC, but, sadly, it's a software hack, and, like IBM/Sony/Cell/Playstation, far too slow to be used in practice.

    I believe that IBM makes a chip for the z-Series mainframe which can perform 128-bit floating point in hardware, but I imagine that it's prohibitively expensive [if you could even convince IBM to sell it to you in the first place].

    The best configuration here would probably look like a fancy-schmancy Digital Signal Processor [DSP] chipset, from someone like Texas Instruments, capable of 128-bit hardware calculations, mounted on a card that would plug into something very fast, like a 16x PCIe bus, which in turn would be connected to a HyperTransport bus [but boy, wouldn't it be really cool if the DSP sat directly on the HyperTransport bus itself?].

    By the way, if anyone knows of a company that's making such a card, with stable drivers [or, God forbid, a motherboard with a socket for a 128-bit DSP on the HyperTransport bus], then please tell me about it, 'cause I'd be very interested in purchasing such a thing.

    • truth be told, the Sine Qua Non of hardware-based calculations is true 128-bit "quad-precision" floating point calculations as performed in hardware.

      For in-hardware calculation, yes. For a quick approximation or when the result has no serious consequences, yes. For anyone serious about getting the correct answer, no no no

      We (by which I mean CS, math, and hard-science folks) have known since the earliest days of floating point that it has inherent, unavoidable flaws that no arbitrary fixed number of
      • by Anonymous Coward
        Unfortunately, and as one of your links mentions, I seriously wonder if many of the current generation of programmers even know about this issue, never mind care. (Huh, I sound like a cranky old man now.)

        Not cranky and old enough.

        If you care about your answer, no matter how many bits the FPU supports, you do it in software. Period. You use GMP, and don't round until the final result... and while that might not always prove possible due to having finite memory, I highly doubt we'll ever see even a 1024-bit F
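        The "do it in software and don't round until the final result" advice can be sketched with the stdlib fractions module, used here as an exact-rational stand-in for GMP:

```python
from fractions import Fraction

# Exact rational arithmetic (a stdlib stand-in for GMP's rationals):
# nothing is rounded at any step, so comparisons are exact.
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True

# The same sum in hardware doubles rounds each operand and the result:
print(0.1 + 0.2 == 0.3)  # False: 0.1 + 0.2 == 0.30000000000000004
```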
    • by Anonymous Coward
      IBM/Sony can, at least in theory, perform 64-bit double-precision floating point calculations, but the implementation involves some weird software emulation thingamabob which invokes a massive performance penalty.

      Just, for the record. Cell uses no "software emulation" for their double calculations. It's 7 cycle latency to do two DP multiply-add, which is certainly not slow. The "slow" part is that the throughput is also 7 cycles, meaning that multiple DP MADDs don't pipeline. So, while this cuts the t

    • At the time, Kahan was arguing in favor of using the full power of the Intel/AMD 80-bit extended precision doubles [i.e. embedding 64-bit doubles in an 80-bit space, performing calculations with the greater accuracy afforded therein, and then rounding the result back down to 64-bits and returning that as your answer], but, truth be told, the Sine Qua Non of hardware-based calculations is true 128-bit "quad-precision" floating point calculations as performed in hardware.

      The CDC 6600's single precision arit

    • And just in case you aren't aware, 32-bit single-precision floats are essentially worthless for anyone doing even the simplest mathematical calculations; for instance, with 32-bit single-precision floats, integer granularity is lost at 2 ^ 24 = 16M, i.e.

      The error in floating point calculations is supposed to be roughly 2^-N, where N is the number of bits. Although some ALGORITHMS can be unstable, because they use series of operations that greatly increase error, many useful algorithms can be accurately
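      A two-line example of the kind of instability this comment points at (catastrophic cancellation), in plain Python doubles:

```python
x = 1e-16
computed = (1.0 + x) - 1.0  # mathematically equal to x

print(computed)       # 0.0: x was absorbed into 1.0, then cancelled away
print(computed == x)  # False - total loss of the answer in one subtraction
```

      No wider fixed-size format removes this failure mode; it only moves the threshold at which it appears.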

    • wouldn't it be really cool if the DSP lay directly on the HyperTransport bus

      You may not be aware, but AMD just released the new version of the HyperTransport spec - and along with the usual speed and signalling improvements, it includes externally connected devices.
    • Why not the Virtex FPGA setup: http://www.theregister.co.uk/2006/04/21/drc_fpga_module/ [theregister.co.uk]
      I'm sure quad (or possibly even octuple) precision floats could be implemented in that bad boy.
      As I said in an earlier thread, this has my Intel fanboi status at risk...
      -nB
  • In theory, if you ran the mobo outside its normal case, you could throw a supported-on-SPARC Sun framebuffer in it and have things work... not that I've got one handy, nor would I be willing to try to splice it into an ATX chassis or whatnot...
