IBM

IBM unveils 64-way NUMA server; Promises Linux support

I just found this article at InfoWorld which talks about IBM releasing a 64-way NUMA-Q server. The interesting part is that IBM promises to release a version of Linux optimized for NUMA servers. What do you think about it?
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Not an intentional flame - just trying to spark discussion.

    Linux runs well on 2 CPUs. 4 isn't bad, and the kernel guys are working to improve it.

    But 64? You'd be wasting 60 of those CPUs. What's the point? At this time, if you want that many CPUs in the same box, you should stick with IRIX, etc. Linux is working on it, but it isn't even close yet......

    Not all problems are nails. You shouldn't always try to hit them with your Linux hammer.
  • by oldmanmtn ( 33675 ) on Wednesday May 24, 2000 @12:14PM (#1050022)
    Good to see that I.B.M. isn't just about legacy systems anymore.

    Well, this isn't really an "IBM" system. It's a Sequent system which was far along in the design process when IBM bought them a year or so ago. There seems to have been a tremendous brain drain since the purchase (*), so this machine may be born as a legacy system.

    (*) According to one of those drained brains, IBM didn't seem to have a clue what to do with them. Lacking any top-down direction, they tried to launch some bottom-up initiatives, which IBM management squashed.

  • That will give you a much more robust solution, at a lower price point, and give you the flexibility to optimize the system for YOUR apps, not how the scheduler wants to distribute your data across a non uniform access time memory pool. And don't get me started about SANS, compared to a true cluster IO system...

    What if I were to get you started about run queues? :P You can totally limit any application, service, resource, or whatever in DYNIX to however many processors (out of the NUMA cluster) or however much RAM you want it to have.

    Also, no one who buys them pays full price. People who buy a lot get a massive, massive discount.

    And they're purple. (well, they used to be in a purple cabinet)

    Honestly, though, I have no idea why anyone buys them. I love the one we have, but, well.. I woulda bought a bunch of rack-mounted alphas. I guess people buy them for the same reason all large corporate purchases go through -

    Someone made a well-timed compliment about someone else's golf game.

    --
    blue
  • I'm confused... the article says these boxes are $73K. This makes a little more sense if you figure maybe around a grand for each Xeon. What're these going to be running? Certainly not NT if 8-way is the current limitation.

    I thought NUMA was a clustering method. Makes more sense to say you would be able to cluster 16 4-way boxes than to say you can buy a 64-way server.

    The article says you can strap together four of these 64-way boxes to get a total of 256 processors and 256 GB of RAM. I'm wondering now if that is right.
  • by Anonymous Coward
    "Short for Non-Uniform Memory Access, a type of parallel processing architecture in which each processor has its own local memory but can also access memory owned by other processors. It's called non-uniform because the memory access times are faster when a processor accesses its own memory than when it borrows memory from another processor."

    I may be misunderstanding but, from what I understand NUMA is not strictly for multiple boxen. I was more under the impression it was the middle step between SMP and MPP clusters...
  • I can't wait to play Quake on one. Anybody want to give it to me for my birthday?

    There's no video card. There's also no keyboard or mouse port.

    Should be a fun game of quake. :P

    --
    blue
  • NT or DYNIX. No OS/2.
  • While this is all in one physical address space, access time will vary depending on whether you're accessing a local or non-local bank.

    Hey, that sounds just like the Amiga! Anyone remember all that jazz in the Amiga boxes about "Fast" RAM and "Chip" RAM? The ol' Amiga boxes had two types of memory. The "Fast" section of memory was on a dedicated bus and could only be accessed by the CPU. If you needed to push something to any I/O chip in the system, it had to go into the "Chip" RAM region, where the CPU had contention with all the I/O chips in the system.

    The actual format for Amiga's binary executable files specified where each section of the program had to go. Amiga programs were loaded in segments, and the operating system loader read from the file whether an individual section must be loaded into fast ram, chip ram, or any available ram space. Typically the code segment would go into fast ram, and most data segments went into chip ram.

    Man, that brings some good memories. The old Amiga was WAY ahead of its time... What a shame...

  • Yes, the current version of Linux is not optimized for 64 CPUs. However, the cool thing about open source is that IBM is free to rewrite the kernel using best-case practices to make a new version of the kernel that DOES run well on 64 CPUs. IBM must then give those improvements back to the main Linux developers, who then have to decide whether they want to incorporate the changes, package them as an add-on, etc.

    So the cool thing about this announcement is it means Linux will be getting good, efficient NUMA support even sooner than expected! Which should help it compete favorably with NT and perhaps even Solaris on high-end servers.

    Going for the geek-family market, Rowenta unwrapped two new toasters -- a 4-toast QUAD-BURNER and an entry-level 2-way energy-saving version. "We hope to give all the toaster lovers out there a more satisfying experience, from bachelors to large grandma-and-grandpa-and-lots-of-grandchildren type families," said John Williams, who manages R&D at Rowenta. Company officials also announced their intent to deliver a version of Linux optimized for the QUAD-BURNER. "It sure makes sense to use an open source OS to make open-faced sandwiches," John remarked.
    Rowenta is based in Offenbach, Germany.
    J.
    (mention Linux and get posted on /.!)
  • In what way is Beowulf's interconnect fabric better than NUMA? Is it that the software interface is more pleasant to work with? I imagine it's fairly difficult to beat the performance of being able to access each other's memory space. Also, with that ability, it should be quite trivial to implement virtually any form of interprocess communication in a fast and efficient manner. I don't know what sort of media they communicate over; however, for that price, I would expect they're coupled with some pricey crossover switching hardware of some sort.
  • The question is, which existing distro will they steal.

    Steal, schmeal.
    I believe the question you are asking, is which existing distro they will extend. And this is a real extension, not an embrace/extend/extinguish, as the changes will be open for others. The GPL will see to that.

  • User numbers?

    Bah. I remember before we had user accounts. I didn't like /. putting cookies on my machine at the time, and it took Rob quite a while to give user accounts any advantage other than having your name automatically attached to the post. I held out til the mid-3000s. My neighbor got one just over 1000, and both of us had started reading at the same time.

    What are user numbers up to these days, anyway?
  • An actual NUMA-Q server is a (up to) 4-way Xeon box in a 4X rack cabinet.

    From what I remember from the last time I visited Sequent (3-4 years ago), they had some NUMA-Q systems with multiple proc boards, each holding 4 PPros. I think the one they had running had 16 PPros in it. It was a refrigerator-sized box, too.

    What really intrigues me though is the SMP boxes with mixed x86 architectures. And for that, I shall be ever impressed with Dynix/ptx

    Cursed are those who use sequent (IBM,whatever) products in buildings with no elevators.
  • It's not that difficult.. "it" being porting linux to a NUMA-Q.

    Really? So you've done this?

    I wonder if IBM is actually gonna write a NUMA layer for linux?

    SGI has the best in the business, and they're busy putting it into linux as we speak. However, IRIX is _the_ OS as far as NUMA goes, coupled with the sgi Origin. With sgi's massive linux movement, I'm sure linux will be up to par with IRIX in no time at all. Check em out at www.sgi.com/origin [sgi.com]. Ever heard of ASCI Blue Mountain? Checkout the top 500 list [top500.org].

  • What you are paying for is peace of mind.

    Consider the mean time between failures of a single component, then work out the figure for n co-dependent components. A rough cut: if you assume an x% chance of failure for any one of n components in a given period, then the chance of the whole system failing is 1-(1-x)^n. Plugging in hypothetical numbers of x=0.1% and n=64, you get an overall probability of failure of about 6.2%. You can do more complex calculations based on Poisson processes and stochastic simulations, but then you'd need a supercomputer to solve the optimisation problems.

    You also have to keep in mind the greater engineering problems with heat dissipation, signal timing, cache coherence, and testing to make sure the same program runs on 1-n processors. On the other hand, disposable PCs are generally based on tried and true techniques (lower risk = lower profits = lower cost). Also, the high-end systems tend to incorporate the latest and greatest, which means they price according to market demand (the military tends to be a little inelastic about certain performance parameters).

    Knocking up a rackmount of boards may give you the same perceived aggregate *peak* performance, but I will guarantee you that the actual performance will be quite dissimilar (unless the problem is so small it doesn't matter). Sometimes an 18-wheeler is more useful than a pack of motorcycle couriers. You get paid for knowing the difference.

    It comes down to what you value more, whether you think the extra cost is worth not getting up at 4am to nursemaid the hardware or not.

    LL
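    The back-of-envelope failure model described above is easy to check. A minimal sketch in Python (the function name is ours, and the numbers are the hypothetical ones from the comment):

```python
def system_failure_probability(x: float, n: int) -> float:
    """P(at least one of n independent components fails),
    where each component fails with probability x."""
    return 1.0 - (1.0 - x) ** n

# The hypothetical numbers from the comment: x = 0.1%, n = 64.
p = system_failure_probability(0.001, 64)
print(f"P(system failure) = {p:.1%}")  # about 6.2%
```

    The per-component risk roughly multiplies by the component count for small x, which is why dense multiprocessors need so much more engineering margin than a disposable PC.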
  • it's not that difficult.. "it" being porting linux to a NUMA-Q.

    Really? So you've done this?


    uhhhm... you're talking about two different things. i said 'porting linux to a numa-q.' numa-q is a brand name of a type of (x86) server. so, there's really no porting of linux to be done.

    you said 'coding an OS for numa,' which is completely different from what i said. you prolly meant, in the context of linux, 'coding numa for an OS,' but, whatever.

    and yes, i've heard of the ASCI boxen.

    have you ever used DYNIX side by side with IRIX in a NUMA environment? I guess it's really apples to oranges, because of the processor difference, but I'd be interested to see how an equitable comparison would look.

    --
    blue
  • If I recall correctly, your description matches what Sequent published on their web site. They seemed to be using ccNUMA with a fairly large MESI cache on their custom chips.

    Their hardware interconnect was an SCI (Scalable Coherent Interface) ring with a bandwidth of around 1 Gb/sec/link. This is not, IMHO, the best link to use today. In fairness to Sequent, SCI may have been the best thing available at the time. They started shipping in '96 or '97, so the R&D must have happened a few years earlier.

    I also thought the multi-path I/O was pretty cool. It is a FibreChannel SAN (Storage Area Network) with multiple controllers, F-C switches, and EMC RAID boxes, with fail-overs that work a little like the Internet, rerouting around dead hardware.
    --

  • I'm pretty sure it's not a 64-way SMP box. It's 16, 4 way SMP boxes in giant purple cabinets.

    No, the cabinets are gray and black. They used to be a rather poor mixture of maroon and off-gray, but maybe Sequent's marketing department got tired of weird colors and switched to something suitable for a funeral. ;-)

    The article is misleading. NUMA is not the same as SMP. Hope that helps.

    True, NUMA is not the same as SMP. It is a bunch of SMP boxes doing a Vulcan Mind Meld via expensive high-speed HW interconnects and caches so that it acts like a whopping big SMP box, without all the usual bus bottlenecks. (Instead, it has unusual bus bottlenecks.)

    Btw, these boxes also all run NT. (but who cares?) :P

    Amusingly enough, NT doesn't run at all well on NUMA boxes. And, it won't until changes are made in the memory management, scheduling, buffer DMA, and interrupt routing algorithms. But, we don't have the source code for NT.... 8-)
    --

  • But then scientific apps is what pays the bills for people who need these machines

    Not necessarily - we're using a chunky Sequent/Dynix box to serve out an Oracle DB and a load of EJBs for an insurance company's back-office processing - nothing scientific or glamourous, we just needed something with more scalability / redundancy than a 'normal' unix box.
  • Hey, I wonder if IBM is actually gonna write a NUMA layer for linux? I mean, if they don't, then all you end up with is a buncha 4-way rack-mounted linux boxes.. for $365,000 apiece.

    The (preliminary) NUMA support for page allocations is already there, written by Kanoj Sarcar of SGI. Look at linux/mm/numa.c [linux.cz]

    -Yenya
    --

  • by JacobO ( 41895 ) on Wednesday May 24, 2000 @11:08PM (#1050042)
    NUMA is not really a clustering method. It's a way of addressing some of the drawbacks of large-scale SMP. You have quads (4 CPUs), and each quad has memory and cache. Memory access is non-uniform, because special techniques are used when memory is accessed between quads. Think of it as a distributed shared memory system, all in one computer.
    Each quad has connections to other quads and IO buses.
    Because the quads are actually separate, you can subdivide your machine and run different operating systems on different quads, yet they can share (at bus speeds) data between them, such as a fibre-channel array.
    And Dynix is a good UNIX, too. It has its problems (like being low on the port list of just about everything) but it runs all the GNU software I've tried on it and is very reliable.
  • > It's based off of SysVr2

    No.

    > /var has to be on the root partition

    No.

    > What is normally in var (logs and such) is in /usr/adm

    vi /etc/syslogd.conf

    > Most sysadmin tasks (such as adding a user) should not be done on the command line

    Oh really? Seems to work here.

    > You should use their menu system because it twiddels with bits on some internal-not-text database

    Nonsense.

    Dynix is a great OS. Sure, you need to spend a few hours installing bash, a decent sendmail, etc etc but that's no different from Solaris or AIX.

  • 2.4 will have SMP support which is so massively much better than previous versions that it will blow you away. The old SMP support was to some degree a hack; every time a CPU called a kernel routine the entire kernel was locked out for other CPUs.

    The new code has completely rewritten the locking so that there is no longer a single global lock, but separate, discrete locks for each thing that needs them.

    The impact of this, plus a few other IO related changes, should be that Linux will scale better than Solaris to large numbers of CPUs. Well, that's the theory anyway...
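    The locking change described above can be sketched with a toy model in Python (this is an illustration of coarse vs. fine-grained locking, not kernel code; all the names are made up):

```python
import threading

global_lock = threading.Lock()          # old scheme: one big kernel lock

class Subsystem:
    """A resource with its own discrete lock (the new scheme)."""
    def __init__(self):
        self.lock = threading.Lock()
        self.counter = 0

    def touch_coarse(self):
        with global_lock:               # serializes ALL subsystems
            self.counter += 1

    def touch_fine(self):
        with self.lock:                 # serializes only this subsystem
            self.counter += 1

def hammer(fn, times):
    for _ in range(times):
        fn()

# Two unrelated "subsystems" hammered by two threads: with per-resource
# locks they never contend with each other.
scheduler, vfs = Subsystem(), Subsystem()
threads = [threading.Thread(target=hammer, args=(s.touch_fine, 10000))
           for s in (scheduler, vfs)]
for t in threads: t.start()
for t in threads: t.join()
print(scheduler.counter, vfs.counter)   # 10000 10000
```

    With `touch_coarse`, every call anywhere in the system would queue on the single `global_lock` - which is exactly why the big-lock approach stops scaling as CPU counts grow.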

  • Actually, the article says the cost of the boxes is $73,000, much less than $365K - am I reading this wrong? The two-way boxen mentioned in the article (for $1830) sound pretty cool by themselves (I'll take one, please!).

    The interesting thing is, they plan to devote serious attention to *app* porting and applying proven techniques from their RS/6000 architectures to this development. This sort of "validation" is always nice to see.
  • they're writing a version of linux optimized for the servers, and therefore will be able to handle 64 processors. which you would have known, had you bothered to read the entire article.

    Power Corrupts
  • by Christopher Thomas ( 11717 ) on Wednesday May 24, 2000 @11:54AM (#1050047)
    "NUMA" stands for "Non-Uniform Memory Architecture". It's one approach to dealing with system memory on a machine with a large number of processors.

    The idea is that each processor module has its own dedicated RAM, which can be accessed both locally and remotely by other machines across the network. System memory as a whole is the aggregate of the local memory banks on all of the processor modules. While this is all in one physical address space, access time will vary depending on whether you're accessing a local or non-local bank. Hence, "Non-Uniform".

    There are undoubtedly extensions to NUMA that do more complicated things with system memory; this is just the version that I was told about at university.
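    The layout described above - one flat physical address space built from per-node local banks, with non-uniform access cost - can be modeled in a few lines of Python (all parameters here are invented for illustration, not taken from any real machine):

```python
NODES = 4
BANK_SIZE = 1 << 20        # pretend each node contributes 1 MB of local RAM
LOCAL_COST = 1             # made-up relative access latencies
REMOTE_COST = 10

def owning_node(addr: int) -> int:
    """Which node's local bank holds this physical address."""
    return addr // BANK_SIZE

def access_cost(cpu_node: int, addr: int) -> int:
    """Non-uniform: cheap if the address falls in our own bank."""
    return LOCAL_COST if owning_node(addr) == cpu_node else REMOTE_COST

print(access_cost(0, 0x1000))          # local access  -> 1
print(access_cost(0, 3 * BANK_SIZE))   # remote access -> 10
```

    Same address space, same load instruction - only the latency differs depending on which bank the address lands in. Hence, "Non-Uniform".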

  • Company officials also announced their intent to deliver a version of Linux optimized for NUMA servers.

    This is a tad scary. If I read this correctly, IBM is going to come out with their own distro? The question is, which existing distro will they steal.

    My guess is Debian, but Red Hat knows how to market. Hmmmm
  • But 64?? How about wasting 60 of those CPU's. Whats the point. At this time if you want that many cpu's in the same box you should stick with IRIX, etc. Linux is working on it but it isn't even close yet......

    I'm pretty sure it's not a 64-way SMP box. It's 16, 4 way SMP boxes in giant purple cabinets.

    The article is misleading. NUMA is not the same as SMP. Hope that helps.

    Btw, these boxes also all run NT. (but who cares?) :P

    --
    blue
  • My eyes are bugging out of my head.

    I can't wait to play Quake on one. Anybody want to give it to me for my birthday?
  • Overkill? Not if you need it. Like, animators!
    mmmmm, I can smell the FPU goodness from here ;-]
    digitalun
  • Actually, there's nothing special about the non-uniformity (NU) of access. You have that feature whenever you throw in caches, for example: the time to access a memory location depends on whether it's in cache or not. A Beowulf cluster is also a non-uniform access architecture: accessing your local memory is cheaper than accessing the memory on a remote machine (via some sort of message passing layer, such as MPI or PVM or RPC, for example). Network file systems also exhibit non-uniform access cost behavior: the speed at which you can access /net/machine/filename depends on whether "machine" is local or remote, and how far away it is.

    So what makes a "NUMA machine" so special? It's the hardware cache coherence. That is, a cc-NUMA turns a (cheap) Beowulf cluster into an (relatively expensive) IBM machine.

    Whenever access costs are non-uniform, algorithms must be tuned to be latency sensitive. If algorithms are not aware of this nonuniformity, then they'll run really slow!!

    Why is hardware cache coherence silly? eheh (my opinion). If we need to make algorithms latency sensitive to take advantage of non-uniform access architectures, then we need explicit control of data movement throughout the system. But a cc-NUMA takes this control away from you.

    Notice that your L2 cache effectively has hardware cache coherence. You can't really control what's in L2 cache. And neither do your algorithms need to be conscious of the existence of an L2 cache to perform correctly. To perform well, however, they do.
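    The point about latency-sensitive algorithms can be made concrete with a toy cost model (all numbers invented): the same access pattern costs an order of magnitude more when data placement ignores which processor touches it.

```python
LOCAL, REMOTE = 1, 10      # invented relative access costs

def total_cost(accesses, placement):
    """Sum access costs for (cpu_node, page) pairs under a
    {page: node} placement."""
    return sum(LOCAL if placement[page] == cpu else REMOTE
               for cpu, page in accesses)

# Each of 4 nodes hammers its own page 1000 times.
accesses = [(node, node) for node in range(4) for _ in range(1000)]

aware = {page: page for page in range(4)}   # data placed near its user
naive = {page: 0 for page in range(4)}      # everything dumped on node 0

print(total_cost(accesses, aware))  # 4000  (every access local)
print(total_cost(accesses, naive))  # 31000 (3/4 of accesses remote)
```

    This is the parent's argument in miniature: if you want the speedup, something - either your algorithm or explicit data-movement control - has to be aware of the non-uniformity, and cc-NUMA hides exactly that control from you.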

  • So that's what they call it. I wrote the numerical analysis code for my thesis research on a Kendall Square Research KSR-1; 32 nodes at 20MHz each (yeah!), 32 MB each, a UNIX-derivative OS with Posix threads (mutexes, barriers, thread-private/public data). You treated it like an SMP with the additional knowledge that memory access had an affinity for the processor.


    KSR went under, as I recall, but I always wondered what happened with the technique. Now I know that it is NUMA!

  • Between IBM's announcement and SGI's growing support for the Linux platform, it seems that Linux is rapidly carving a place for itself as the *NIX of choice for high-end computing environments.

    Well, before we get too excited, keep in mind that both companies support (or flip-flop between if you want to be less charitable) a wide variety of operating systems. This is SGI's third "bet the company" OS strategy in 5 years. IBM's wide support for Linux is far more interesting, but to a certain extent (the 390 comes to mind) it has the feel of a stunt. This is all great news for Linux, but it's still early in the game.

    Using a standard platform like Linux that has developed an independent following will give both companies a definitive advantage over Sun and their Solaris platform.

    Until there actually is anything resembling a "standard platform" for Linux, I don't think this is a serious point. There are already plenty of differences between (for example) Debian and Red Hat on the x86 platform alone, so it seems like a huge stretch to suggest that SGI and IBM/Sequent machines will provide a "standard platform" simply because they both have Linux-based kernels available. Again, this is all good for Linux, but don't set your expectations too high.

  • by Anonymous Coward
    These days, big IBM/Sequent machines pretending to be SMPs are actually SCI clusters collapsed into a single box. So are pretty much everybody else's big machines. You've got to be kidding in claiming that Myrinet interconnect beats shared memory. MPP is a solved problem, but there are people who don't want you to know this. You can build your own SCI cluster, but it is not going to be much faster or much cheaper than Myrinet, because of PCI bus suckage. SCI should live on the CPU in which case it takes up 20K gates and costs pennies.
  • OS/2 is not supported (yet).
    But since DYNIX is the only one that will scale to 64-CPU, if you were IBM, wouldn't you just certify and run OS/2 on it? It's the only other Intel OS that will handle 64-way SMP RIGHT NOW.
    NT and 2000 officially support, what is it, 4 or 8-way SMP. But realistically, the gain drops right off after 2-CPU.

    If you ever needed to support decent office apps, like word processors, off a 64-way Intel SMP machine... (why???) OS/2 would be the only way to go.
  • IBM already has.. S/390 ring any bells?? RS/6000?? Take a look around /usr/src/linux-2.3.99/arch/ sometime..

    That's definitely about as high-end as you could get..
  • ccNUMA (r) is SGI's cache coherent NUMA. There are a lot of sophisticated tricks played with memory, but it all boils down to a system with lower memory latency over the 'link than the old Power Series machines had locally. The coherent cache turns out to be handy, too. I never did delve deeply into those machines, but they definitely are kewl. And they scale like no one's business. More info can be found here [sgi.com].

    NUMA-Q is the Sequent technology. It is also cache coherent according to this [sequent.com] paper, but the details are lacking. It does not appear to scale as well as SGI's NUMA, though.

  • "Being able to fake SMPs is the greatest thing since sliced bread. You can't build true SMPs that big and message passing code is a bitch to write. "

    So, effectively you end up with 'dumb' software (cause it's 'easier' to write) which runs at about a tenth of the speed the hardware should be able to manage. Then of course there are a bunch of NUMA-implementation-specific optimizations you can make to get your code up to speed.... Except, by the time you're done with that, you might have written a distributed application with a proper message-passing protocol that would have scaled better.

  • Data General have been doing this for years. IBM are way slow on this.

    http://www.dg.com/aviion/html/av_25000_enterprise_server.html

    AND they can cluster them together - real clusters not failover.

    No Linux though :(.

  • These boxes are _really_ expensive, it might just be cheaper to re-write the application.

    The 64-CPU NUMA box will be in the millions-of-dollars range. How much do 64 single-CPU boxes, a couple of admins, and some developers cost?

  • You don't think the 64CPU version costs $74K do you?

    You'll get 2, maybe 4 CPUs for that, i.e. a bog standard intel SMP motherboard. There's no point doing this stuff unless you're going big.
  • I thought we'd already seen articles here that pretty much said IBM would port Linux to every system that they sell.

    carlos

  • Maxed out, with the enterprise cabinet, 4GB of RAM and 100GB of storage

    Pah! 4GB RAM? Call that maxed out? From one of the DG AViiONs I'm using at work:

    $> ./hinv
    Model ID: 0x0001abd2 -- AViiON AV20000 (SMCS (Multi-node))
    Prom revision: E00.08
    Number of CPUs: 16
    Memory: 8 GB

    Fully laden, it'll take 32 CPUs and significantly more than 100GB of storage. The newer AV25000 takes up to 64 CPUs and 64GB RAM. I'm hoping that if IBM add NUMA support to Linux for the Sequent box, it'll help with getting it running on the DG NUMA boxen too...

  • I wonder how fast this baby will run!

    The benchmark results (TPC) are here [numaq.com] (or as a PDF file [numaq.com]).

  • by Mr Z ( 6791 ) on Wednesday May 24, 2000 @11:35AM (#1050066) Homepage Journal

    We already have 7, count'em SEVEN FIRST POSTS! I wonder if IBM's including a 64-way First Post server with their NUMA boxes...

    --Joe
    --
  • It's always a good thing when a company undertakes a major port of Linux to a new architecture. Remember, more eyes find more bugs, and these are VERY talented eyes that are going to be adapting and scrutinizing the kernel for the sorts of multiprocessing bugs that only show up in configurations with large numbers of CPUs.

    Everyone wins.
  • I wonder if IBM will release patches for Linux + other patches/drivers needed to operate, and let Redhat/SuSE/Caldera/Turbo-Linux port their distributions - the same way they did with the S/390 and the RS/6000 - or will they create a whole distribution?
  • How will this support 64 processors?

    Does this mean IBM is going to look into ways to make it support 64 processors?

    If true, this seems like it could be excellent for the Linux community, even if it isn't exactly '64-processor SMP'. Please post your thoughts on this.. I'm interested in knowing a definite answer.

  • >>I'm pretty sure it's not a 64-way SMP box. It's
    >>16, 4 way SMP boxes in giant purple cabinets.
    >>The article is misleading

    Ah. That's different. :)
  • Firstly, how can you steal something that's free? Secondly, IBM usually seems to prefer working from scratch, and I can't see why this would be any different...
  • I wouldn't be too worried about the rush of new Linux distros based around specialized hardware. The specialized distro, in this case, will be short lived, but the optimized x86 code will go back into the kernel tree. It's a win/win situation for Linux users. I for one will be interested to see what patches old IBM pushes through the gate.

    But, waitaminnit, I thought that IBM was in bed with RedHat. Will this distro be another RPM-based hack on the stock RedHat distro?

    Only time will tell, but I'm sure that IBM will do The Right Thing when playing with a bunch of fiery Linux zealots. Screw Microsoft, the guys over at IBM are the real innovators.
  • by Blue Lang ( 13117 ) on Wednesday May 24, 2000 @12:01PM (#1050073) Homepage
    I may be misunderstanding but, from what I understand NUMA is not strictly for multiple boxen. I was more under the impression it was the middle step between SMP and MPP clusters...

    An actual NUMA-Q server is a (up to) 4-way Xeon box in a 4X rack cabinet. NUMA is software that lets a bunch of those share RAM and processor time. Sequent (IBM) recently overcame the old 64 processor limit on their NUMA implementation.

    Maxed out, with the enterprise cabinet, 4GB of RAM and 100GB of storage, you're looking at hundreds of thousands of dollars.

    Sequent's web site seems to be down right now.. (cough). Heh.

    And, as to why you might want one, we average over 400 processes at any given time with a load avg of around, uhm, zero, on our production box. These things can take a LOT of abuse.

    --
    blue
  • by oldmanmtn ( 33675 ) on Wednesday May 24, 2000 @12:52PM (#1050074)
    Why would someone bother to use a NUMA-Q server for any application? Beowulf clustering (which IBM is also pushing) provides a significantly greater price/performance ratio for most applications, and gives you a better interconnect fabric for interprocess communications.

    • Manageability.

      Assuming that they've got this even close to right, managing a 64-CPU NUMA-Q system should be no more complicated than managing a 1 CPU NUMA-Q system. Until the sysadmin tools for Beowulfs get a hell of a lot more sophisticated, managing a 64-CPU Beowulf is going to be much more complicated than a 1-CPU Beowulf.

    • Programability.

      Again, assuming that they've gotten things right, programmers should be able to continue working with a model they know and have experience with. Applications have to be re-written (and frequently re-designed) from scratch to run on a Beowulf, and programmers need to use a totally different mindset. In theory, any application that works on an SMP should "just work" on a NUMA machine - possibly with a recompile. To really get peak performance, applications may well need some tuning, but that's certainly easier than rewriting.

    • Applications.

      I'll bet that when this machine ships, Oracle (just to pick a big example) will already be running on it. When will Oracle ship a Beowulf-aware version?

  • At a former employer we were feeding several (fairly) high-speed streams of packetized data into and out of a monolithic database.

    There are still a few applications in the world that need big iron.
  • by Anonymous Coward
    There is not much nice about dynix. It is based off of SysVr2 (I'm pretty sure); /var has to be on the root partition because /etc/vfstab is just a symlink to /var/adm/vfstab. You can't change that. What is normally in var (logs and such) is in /usr/adm. Most simple sysadmin tasks (such as adding a user) should not be done on the command line. You should use their menu system because it twiddles bits on some internal-not-text database. I could go on, but I don't feel like ranting any more. Compared to Dynix, I *LOVE* working with AIX, HP-UX, or just about any unix that I've touched.
  • Other posters have done a pretty good job of discussing NUMA (ccNUMA is what SGI uses in the Origin machines).

    Each Origin consists of node boards that contain 2 CPUs and some RAM. The system can scale up to 512 node boards (1024 CPUs), but you obviously can't fit all of those CPUs in one Origin case (the little purple mini-fridge, in the case of the Origin 2000). So the CrayLink is used to expand the CPU network topology beyond one box - it is basically an extremely high-bandwidth short-range cable that connects Origin machines together to form one big cluster that is the equivalent of one box with 1024 CPUs.
  • middle step between SMP and MPP cluster

    Almost. SMP and DSM (Distributed Shared Memory, for those not up on acronyms) would be more correct. Some memory is local, some is not. Non-local means it could be in another box, or maybe just off another bus.

    The important thing is that all memory is accessed as if it were local (just ask for it by address). The hard part is, of course, getting all processes that want to stomp over the same memory to share it nicely. This involves both page migration (of DSM fame) to processes, and process migration. And of course, at all times, maintaining coherency between all copies of a page.

    Ultimately, you want all processes that use the same memory to be on the same CPU that memory, or at least near by.

    This is hard stuff, as all by scientific applications have hard-to-predict paging behavior. But then scientific apps is what pays the bills for people who need these machines.
  • yeah, well what's the question hotshot? ;)
  • The zoned memory management in 2.3 is a starting point for NUMA. It allows you to break the address space into segments which are treated differently (ie. Non-Uniformly).

    SGI and IBM will have to cooperate, although maybe not officially. They'll each have teams adding features to the kernel and they'll talk to each other the same way all the other developers do. There's just no other way to do it.

    It's more than likely that you are thinking of a setup like the integrated Netfinity cards in an AS/400. The S/390 doesn't work that way. While most of the time the 'user' OS session is in a VM under 390, the guest OS must still pretend to access the underlying HAL in an identical manner. Any OS capable of running as a guest is capable of running on the iron, since the OS doesn't know the difference!!

    With an exception for the newest rev, you can even completely replace the host OS with Linux, including the userland VMs..
  • If you were to rewrite the Linux kernel (as you suggest would be a good thing), how would what you ended up with still be the Linux kernel as opposed to YA UNIX-flavored OS?
  • Me opening my big mouth w/o reading much, dammit, I haven't done that on /. for a long time.

    uhhhm... you're talking about two different things. i said 'porting linux to a numa-q.' numa-q is a brand name of a type of (x86) server. so, there's really no porting of linux to be done.

    But... just because it is x86 doesn't mean there is no coding to be done!

    you said 'coding an OS for numa,

    You're correct, I meant "coding numa for an OS". No, I've never used DYNIX. But you seem to be a pretty big fan, could you enlighten?

  • KSR's stuff was COMA, Cache-Only-Memory-Architecture. Similar to NUMA, but each local cpu's (or node's) memory was a big cache, remote accesses became encached in the local cpu's memory so you only took the big latency hit on the first access from that cpu. Now, multiple cpus accessing the same cache line might cause pathological problems.

    Sun owns all of KSR's COMA patents. Look for some derivative of it to show up in their 128-way 'Serengeti', which is the expected successor to the UE10000 (which they got from Cray when SGI bought them).

    All major vendors are doing some sort of NUMA-ish architecture. DEC just announced their GS line of boxes that scale up to 32 CPUs in a cabinet, in blocks of 4. Theoretically the architecture will scale to 512 CPUs, but they are only selling 32-CPU boxes for now. Their memory bandwidth and fault-tolerance features are nice; their latency, especially off-QBB, sucks. SGI's 3-year-old Origin 2000 has better latency.

  • Ask a mouse to calculate the question.

  • Hey /. guys:
    Is there a filter I can set up on my account to filter out any posts containing the word "Beowulf" in them? I mean, sure, we'll miss some relevant posts (like those concerning medieval warriors who go around killing monsters), but I think the price would be well worth it.
    Or maybe we can just have a new classification for moderators: "-1, Beowulf"?
    --JRZ
  • Actually, NUMA, in general, is for a single box, not multiple boxes. It is also most assuredly *not* software. There are certain things you can/have to do to the software (especially the OS) to get it to run well, however, the base support is in hardware. You don't want the local processor to be involved at all (except for cache line invalidation) when a remote processor wants some memory on this node. The "4-way Xeon box" is probably a special 4-way motherboard with a big ASIC on it to handle the memory exchanges over a special (and probably very fast) bus interconnecting the nodes.

    I'm not sure how IBM/Sequent does it, but we (SGI) do it that way - the Origin 2000 and the follow-on to it have a board with processors, memory, I/O channels, and a "network link" to the intramachine memory network. These are linked together with a big ASIC, which we call the Hub. The Hub is linked to routers and the routers are connected in something approximating a hypercube. The latency to memory gets longer the further you are away and the more router hops you have to go. Our architecture scales to 512 (and perhaps 1024 in the future) processors. A future version will be based on IA64 and will run Linux also. I'm not sure if IBM needs routers as they only go to 64p.

    So what do you have to do to make the OS run? The big thing is getting the OS to recognize holes in memory which may exist between nodes. Once you have done that, you can probably boot. To actually run well, you have to modify the memory allocation and scheduler (among other things) to try to keep jobs physically close to their memory. This gets *really* hard, especially when the job takes more than one node's worth of memory.

    One last thing I should point out is that there are two flavors of NUMA. Regular NUMA and ccNUMA. In regular NUMA machines (the Cray T3D/T3E is the only one that comes to mind) you can get access to memory that is not on your node, but you will only get a snapshot of what is there at a given moment. In ccNUMA, the caches of all the processors are coherent so not only do you get a snapshot, but if you modify your copy, everyone will perceive that the modification happened at the same time. The T3E runs a separate copy of the OS on every processor partly because of the lack of the "cc" part. The Origin stuff can run a single kernel on the entire system and my group is doing software which breaks the machine up and runs separate kernel images in different pieces for reliability reasons. The images can then talk to each other over the memory interconnect at very high speeds.

    Finally, 4GB of memory is nothing :) I regularly play with a machine that has 196 gigs of RAM (and 512 processors :) and it's nowhere near maxed out :)

  • The idea is that each processor module has its own dedicated RAM, which can be accessed both locally and remotely by other machines across the network.

    Ideally, yes, although to meet the strict definition of NUMA, it only needs dedicated RAM. That RAM doesn't have to be shared with other (remote) CPU blocks. All it needs is for the memory to not be equally available to all CPUs. See the following example from a DG AViiON:

    CPU 15: Model: Intel PentiumPro, 200MHz, online
    Version: 0
    Family: PentiumPro class
    Stepping: 01/09
    8 KB L1 read-only instruction cache
    8 KB L1 writeback data cache
    1 MB L2 writeback data & instruction cache
    128 MB L3 writeback data & instruction cache (shared between CPUs 12 to 15)
    896 MB L3 shared UMA memory (shared between CPUs 12 to 15)
    1 GB L3 shared UMA memory (shared between CPUs 12 to 15)

    As you can see, the memory is shared between each block of 4 CPUs, but it's not accessible by remote blocks. NUMA AViiONs have 3 basic memory types -- shared UMA, local NUMA and remote NUMA.

  • The Hub is linked to routers and the routers are connected in something approximating a hypercube.
    Given that a hypercube is a four-dimensional object, how has it been approximated?
  • SGI has a linux kernel patch for NUMA since v2.3.30. look here: http://oss.sgi.com/projects/numa/ [sgi.com]
  • Does Linux currently have NUMA-aware memory management? The article states that IBM is developing their own patches for this, but it would be interesting to know what else is out there.

    Optimizing for local vs. nonlocal memory accesses can have a big performance impact, as (if memory serves) Sun found a while back.
  • Ok, so IBM has some hot new hardware out. They'll make huge money on the sale of it - but not everyone wants their OS. Lots of techies know Linux, and since Linux is free, it makes good business sense to port it to the Hot New Hardware(tm).

    Looks like IBM is getting something of great value, for their new toy. But what about Linux-at-Large? What's the reciprocation?

    Don't get me wrong, the Linux community and IBM have a good working relationship. But it seems that supporting Linux on some new esoteric piece of metal doesn't do a whole lot of good for the 'little people' who made Linux available to IBM in the first place.

    I would love to hear that the 'information exchange' process works equitably in both directions. What are IBM's current projects which will benefit the community in tangible terms? Any more software being made available? Any sources being opened? How about a free RS/6000 for our favorite Geek Compound?
  • by Blue Lang ( 13117 ) on Wednesday May 24, 2000 @11:37AM (#1050093) Homepage
    "it" being porting linux to a NUMA-Q.

    (NUMA is a method of sharing CPU and RAM access across multiple boxen)

    NUMA-Q's are x86 boxen. They have some really, really, really cool features. I do wonder if IBM plans to write drivers for the FibreChannel SCSI adaptors and etc that come standard with most NUMAs.

    OTOH, there is noooo reason not to use dynix on a NUMA. It's included with the (MASSIVE) cost of the box, it's based on BSD, it's a nice OS with tons of kick-ass features, and it's symbiotically enmeshed into these servers.

    Hey, I wonder if IBM is actually gonna write a NUMA layer for linux? I mean, if they don't, then all you end up with is a buncha 4-way rack-mounted linux boxes.. for $365,000 apiece.

    One other thing, Sequent (now IBM) has the absolute best support I have ever seen. I have sent email to their web site about completely esoteric crap and had them call me back and get a dialogue open with the developer of whatever I was having trouble with. If you've got the cash and don't wanna deal with Solaris, DYNIX is the way to go.

    --
    blue, who is wearing his Sequent hat today.
  • It's kinda hard to describe. At 32p, the machine is a 3D cube with extra links going in an X across the cube. At 64p the machine is a 4D hypercube. At 128p, it's sort of two 4D hypercubes with 4 of their corners touching. Above 128p, I can't visualize it well enough to describe it anymore :) Fortunately, my software doesn't need to know exactly how the machine is connected :)
  • 4 clustered 64-processor servers? for a total of 256 processors and 256 GIGS of ram? can you say overkill?

    on the up side, looks like linux is getting some real credibility now, with IBM hiring folks to create a special IBM-optimized version to be released on the servers. perhaps other major computer manufacturers will take notice.

    methinks things are looking up.

    Power Corrupts
  • Yeah, multi-path I/O is a *very* cool thing. FibreChannel cables are also so much nicer to work with than SCSI :) These things work well in clusters too, with a cluster-wide filesystem providing a single huge filesystem that can be shared across the front-end nodes and compute nodes.

    The interconnect on Origin 2000 (formerly "CrayLink", now "NUMAlink" with the sale of Cray...) is 800 MBytes/sec bidirectional from the node to the nearest router. The follow on will have twice the bandwidth.

  • No keyboard? Rats.

    Guess I'll have to work on that telepathic ethernet interface.
  • by Anonymous Coward
    Why would someone bother to use a NUMA-Q server for any application? Beowulf clustering (which IBM is also pushing) provides a significantly greater price/performance ratio for most applications, and gives you a better interconnect fabric for interprocess communications. With prices starting at around $73,000 for a two-processor NUMA-Q setup, you can buy a LOT of Celeron or Athlon systems, or even Alphas, using Myrinet for an interconnect. That will give you a much more robust solution, at a lower price point, and give you the flexibility to optimize the system for YOUR apps, not how the scheduler wants to distribute your data across a non-uniform-access-time memory pool. And don't get me started about SANs, compared to a true cluster I/O system...
  • You mean the pinball wizard?
  • $1,830 US for the Netfinity isn't a bad price. IBM is sure pulling out all the stops when it comes to Open Source marketshare. Now the HW is packaged for small business (although the E-410 is $83K) and could conceivably run any of your mainstream OSes, and the buzzword "Linux" was plugged in several times, but as of now it isn't directly related to NUMA. The reference was only that they "intended" to optimize for Linux. Still, I wouldn't mind having a few of these in my shop, but I sure hope they drop the price some; 83K is still a pretty big capital hit.
  • Okay, I remember from one of my classes the difference between SMP and NUMA. I remember a very brief discussion of SGI's CC-NUMA, and how it was basically a switched network letting processors talk to different segmented areas of memory, so that not all processors are trying to access the same address space and blocking each other's access to the memory bus.

    Now, does anyone know what the differences are between SGI's CC-NUMA and IBM's NUMA-Q?
  • Linux does do multiprocessing, but not as well as Solaris and other commercial Unixes. Still, as the article says, IBM will be releasing a beta version that will work on it. I wonder how much of that code will make it into 2.4? (Or rather 2.5, probably...) 2.4 is supposed to have better SMP support than 2.2; if the fellas over at IBM make a version of Linux that supports 64-way processing, I would guess that there are direct modifications to the kernel, and many people have been posting here about how Linux needs better support for multiple (more than 4) processors.

    (I mention the below because I KNOW it will come up in this thread...)

    Kernel fragmentation? Possible, yes, but unlikely I think, because the code that benefits most Linux platforms from this version will eventually make it into the kernel anyway. Besides, even if it is a totally different version of Linux, how many people can afford the $73,000 price tag on one of these things?
  • I was surprised at the cost. $73k for a system like that is very reasonable.
  • Multiply $73K X 4 is what he was getting at.
  • OS/2 Warp Server theoretically already supports 64 CPUs, and since it scales better than NT/2000, it should work great on these machines. Unfortunately, I can't get to www.sequent.com, so I can't tell if OS/2 is supported. Does anyone know?
  • check out their dist options for S/390: http://www.s390.ibm.com/linux/dist.html [ibm.com]
    also, see this Debian developer's short summary of his dealings with IBM: http://www.debian.org/News/weekly/current/issue/mail#2 [debian.org]

    -l

  • Too bad the S/390 pictured is the baby A20... they shoulda put the whoppers on the page! Our company will be trying out Linux on one of our S/390 systems soon (not all by itself, it will be sharing space with the OS/390 LPARS)... I will faithfully report anything of interest that crops up to Slashdot.
  • Typically, you'd keep coherence at the cache-line level, not the page level. That way, you can have a multithreaded job that has a lot of stuff in the same page share it nicely, as long as each thread doesn't bang on the same cache line as the others. Also, any processor with multi-processor capabilities will have the ability to synchronize cache lines between CPUs.
  • Our (SGI's) ccNUMA machines go to 512p in a single system image. The next generation has a hardware limit of 1024 CPU's, though we are not, at this point, committed to building one larger than 512p for a variety of technical reasons I'm not going to go into here. However, large sums of money tossed in our direction could push the CPU count up :)

    We are definitely working on an IA64/Linux version of this hardware that will scale to the same number of processors as the MIPS version, though it is unclear how far it will go as a single system image. However, we can split the thing up and run multiple kernel images that talk over shared memory in the same box.

  • You don't have to imagine - check out ASCI Blue - it's not a Beowulf, but it is a cluster of 48 boxes where each box is an SGI Origin 2000 with 128 processors. It is pretty high up the Top 500 list of the world's fastest supercomputers :)
  • Who would spend that kind of money on a system and then run Linux on it? If you've got the money to afford a system like that, you might as well shell out the extra few dollars and get an OS that can handle that many processors. I'm sure the people who buy these things aren't thinking, "Duh, it'd be real keen-o if I ran Linux on this thing. Golly gee, I wonder if I could play Quake3 on it."
  • Now that I understand these systems better, it appears that OS/2 cannot work on them because of the NUMA architecture. OS/2's kernel would have to be updated to support NUMA, and that's not going to happen.
  • Not a good sign, I guess. This rather means that they have not found enough support for this project among the top distributions, or they did not look much for it; instead of having Red Hat, Caldera, SuSE and all releasing versions for NUMA or patches for NUMA, they did it themselves... An open-source product from a closed-source company...
  • Eeeewl! Not only will my infinite loops run faster, I'll be able to run several of them in parallel!

    Will IBM open a 12-year-old-script-kiddie special offer to allow all Americans to democratically be able to DDoS whoever upsets them, or will this be kept for geeks and suits?
  • uhhhm... you're talking about two different things. i said 'porting linux to a numa-q.' numa-q is a brand name of a type of (x86) server. so, there's really no porting of linux to be done.

    But... just because it is x86 doesn't mean there is no coding to be done!

    Very, very true. We (Sequent/IBM) have had Linux up on single quads in our labs for over a year. That part required 0 work because a single quad can be configured to look, basically, like a 4-way PC. However, as soon as you add another quad to the system, extra code needs to be added to boot strap the additional quads. It goes even beyond this because each quad can have different CPUs. One of our main production machines has 16 PPro 180s, 4 PII 450s, 8 PIII 495s, and 4 PIII 700s. Clearly the Linux assumption that all CPUs will be the same fails miserably here.

    To be quite frank about it, everything about NUMA is hard. Getting the hardware to work is hard. Getting the OS to work is hard. If it weren't so damn hard, don't you think that more than 3 companies (Sequent/IBM, SGI, and Data General) would have NUMA systems?

    you said 'coding an OS for numa,

    You're correct, I meant "coding numa for an OS".

    Well, NUMA (or cc-NUMA more specifially) is a function of the *hardware*, and the software has to support it. You wouldn't say "coding SMP for an OS" would you? Coding an OS for NUMA is correct.
  • If I recall correctly, your description matches what Sequent published on their web site. They seemed to be using ccNUMA with a fairly large MESI cache on their custom chips.
    I'm not exactly sure which cc protocol is used, but this is basically correct. It seems like there might be an extra state, but I can't remember. The cache is 4MB.
    Their hardware interconnect was a SCI (Scalable Coherent Interconnect) ring with a bandwidth of around 1 Gb/sec/link. This is not, IMHO, the best link to use today. In fairness to Sequent, SCI may have been the best thing available at the time. They started shipping in '96 or '97, so the R&D must have happened a few years earlier.
    That hits the nail right on the head. Also, at the time the company was not seeing the best of times, so getting a working product out fast was pretty much the top priority. There are quite a few people that IBM primarily bought us for our next generation interconnect, but that starts to wander into NDA territory. :)
    I also thought the multi-path I/O was pretty cool. It is a FibreChannel SAN (Storage Area Network) with multiple controllers, F-C switches, and EMC RAID boxes, with fail-overs that work a little like the Internet, rerouting around dead hardware.
    Hehe...throw Clusters, CFS (Clustered File System) and SVM (System Volume Manager) into this mix and you can basically get more I/O bandwidth than any other system available. Not to sound too much like a marketing lackey, being able to connect *4* 64 proc boxes to just about as much disk space as you can get (there is a limit, but I don't know what it is) and have them all access in a nice, friendly, coherent manner just kicks ass.
  • Data General have been doing this for years. IBM are way slow on this.

    http://www.dg.com/aviion/html/av_25000_enterprise_server.html

    AND they can cluster them together - real clusters not failover.

    Actually, Sequent (recently bought by IBM) has been shipping systems for some years now, too. We also support real clustering, too (that is if you consider shared disks with multipath fibre channel I/O real clustering *grin* ). Not to sound like an ass, DG had never even given a thought to cc-NUMA until a Sequent VP left and went to work there. :/
  • SGI currently ships NUMA machines up ~1024 processors, and AFAIK are working on an Intel version of that architecture built around Merced, er, excuse me, Itanium. They also seem committed to bringing those NUMA features to Linux, although I strongly believe that in order to do so they will have to essentially rewrite the kernel (which would be a good thing, IMHO).

    Somehow I don't see a lot of cooperation between Big Blue and Small Purple on this one.

  • Forty-two

  • There are some big ISV's that will port to linux far faster than other OS's. This is for a variety of reasons including:
    1. There are a lot of eyeballs on linux at the moment.
    2. If you port to linux, you can work on something that scales between a huge range of machines.
    3. Linux will be the first commercial OS available on the ia64 platform.
    I know that most other OSes are better in different ways - NT's GUI, VMS's clustering, OS390's reliability, Dynix's NUMA - but look at take-up per OS per year and you'll see a clear leader.

"May your future be limited only by your dreams." -- Christa McAuliffe

Working...