Call immediately to get your ObamaGun! Thanks, Obama!
Why build it at NCSA instead of just upgrading Kraken? Because:
1) Kraken is an XT5, not an XE system - the associated changes of an upgrade from XT to XE would be very large.
2) NCSA already has a big machine room (that they just built) to support that scale of a system. Does ORNL have enough additional power and cooling capacity to support Keeneland, Jaguar, and growing Kraken by an order of magnitude in size?
3) ORNL is already installing Keeneland, an NSF track 2 system, this coming year.
4) The larger political implications to NSF of failing the $200M track 1 grant that was awarded to NCSA would probably be catastrophic.
As opposed to inexpensive IBM maintenance contracts? All of the big HPC machines are expensive to run and maintain, and NCSA/NSF would be incredibly foolish if they haven't already budgeted for this.
Pretty surprising development, given the length of time that IBM and NCSA had been working on this. Dropping a contract like this essentially puts into question IBM's costing on future contract bids, so it's not something that they'd do lightly. It'll be interesting to see the scuttlebutt that comes out afterward to see how much of this was technical shortcomings and how much was pure financial considerations from IBM. Maybe since IBM already got their big publicity for Power7 from Watson, they're being more profit-conscious on future Power systems so they don't tie themselves to margins that are too low.
From the NCSA side, there will certainly be a fallback of some sort - NSF and NCSA are already working out those details according to recent reports. I'd guess that they go with a large Cray XE6 system, given that a pretty sizeable version of that system is already being stood up and ironed out (the Sandia/Los Alamos Cielo system), and Cray has a lot of history successfully standing up big systems (e.g. ORNL Jaguar, Sandia Red Storm, etc.). SGI Altix is the other alternative, I guess, and there's a pretty big one up at NASA now, though that'd probably be a riskier proposition than Cray IMO, and I expect that NCSA and NSF are going to be pretty risk averse on following up on this.
Palacios can run on real x86 hardware or on QEMU. In fact, most of our development is done on QEMU, which is open source. The VMware image was something we did for the original 1.0 release just to help people get started running it, and we haven't done one since, but VMware has *never* been required for development.
Doh, my mistake, Roadrunner beat Jaguar by a little less than 5% in the SC08 Top500 list, not 0.5%. Still, I do wonder.
Palacios lives inside the lightweight kernel host. Applications that want to run natively on the lightweight kernel without virtualization can at *no* penalty. Applications that are willing to pay the performance penalty of Linux can run Linux as a guest at a nominal additional virtualization cost. That way, applications that demand peak hardware performance get it, applications that need more complex OS services get it, and the downtimes associated with a complete system reboot are avoided.
In addition, the costs of something like Linux to a scientific application can be much higher than many might expect. Cray's target was to get application performance on their Compute Node Linux within 10% of Catamount performance; they did so for most (but not all) of their apps as I understand it, but had to spend a significant effort to even get within 10%.
We're happy to leverage their hard work, however, so that users who want CNL can boot it on top of our VMM, while users who don't can get done faster or save some of their allocated cycles. I sometimes wonder if ORNL wished they had been running a VMM/LWK on Jaguar when Roadrunner beat them on the SC 2008 Top 500 list by 0.5%. Being able to use the lightweight kernel for Top500 Linpack runs and CNL for running apps that needed it might have come in handy for them then.
Finally, our experience has been that a small, simple, open-source LWK/VMM combination is a very powerful platform for OS and hardware HPC research - it provides a simple, understandable, and powerful base for addressing HPC systems problems (e.g. fault tolerance) without the complexity of trying to do that in, for example, Linux.
Virtualization offers a number of potential advantages. A paper we have had accepted to IPDPS 2010 enumerates more of them, but a few advantages quickly:
1. The combination of a lightweight kernel and a virtualization layer allows applications to choose which OS they run on and how much they pay in terms of performance for the OS services they need. Because Palacios is hosted inside an existing lightweight kernel that presents minimal overhead to applications that run directly on it, applications that don't need the services (and overheads) of a full-featured OS like Linux can run directly on the LWK/VMM with minimal overhead. On the other hand, apps or app frameworks that need higher-level OS services (e.g. shared libraries) can run the OS they need as a virtualized guest on top of the LWK/VMM. Because an actual kernel reboot on a machine like Red Storm is very time-consuming compared to a guest OS boot, this is a substantial advantage.
2. Mean-time-to-interrupt on some of the most recent large-scale systems is much less than a single day, and virtualization is a potentially useful technique for addressing fault tolerance and resilience issues in HPC systems, assuming that its overhead at scale can be kept small.
3. A small open-source LWK/VMM combination enables a wide range of OS and hardware research on HPC systems both by being a small, understandable, low-overhead platform, and by providing a way to support existing HPC OSes and applications while enabling OS and hardware innovation.
4. A number of others I won't mention right now as they're being actively researched here at UNM, and by my colleagues at Northwestern and Sandia.
We're not trying to hide anything, and so I will admit to being surprised by this (anonymous) accusation. To address the anonymous coward's concerns, however:
1. Actual users of supercomputers care most about application run time because applications are what scientists run, not micro-benchmarks. As a result, our paper and research more generally focuses on the runtime penalty to real applications (e.g. Sandia's CTH code) as opposed to focusing on optimizing micro-benchmarks that aren't what real users of these systems care about.
2. Micro-benchmarks do provide useful information about the exact costs of various low-level operations, however, to the extent that they can show you what is causing the application slowdowns you do see. They can also potentially help us understand how proposed changes might impact applications other than the ones we were able to run in our limited access to the production Red Storm system. Because of this, the paper the anonymous coward above refers to explicitly measures and presents micro-benchmark latency and bandwidth overheads. Specifically, it cites the latency cost on both Red Storm's SeaStar NIC (5 or 11 microseconds, depending on how you virtualize paging) and QDR InfiniBand (0.01 microseconds). It also presents a bandwidth curve to fully characterize virtualization's cost over the full range of potential message sizes on SeaStar. (IB is less expensive to virtualize than SeaStar, because IB doesn't have interrupts that Palacios must virtualize on the messaging fast path, whereas SeaStar does, at least when running Cray's production firmware.)
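To illustrate why a fixed latency cost matters mostly for small messages, here is a rough sketch of the usual "latency plus size over bandwidth" toy model. This is not the model or data from the paper: the native latency and peak bandwidth below are made-up placeholders, the function names are hypothetical, and I am assuming the 5 and 11 microsecond figures act as added per-message latency.

```python
# Toy model: time per message = fixed latency + size / peak bandwidth.
# NATIVE_LATENCY_US and PEAK_BYTES_PER_US are illustrative assumptions,
# NOT measured Red Storm / SeaStar numbers.

NATIVE_LATENCY_US = 5.0      # assumed native per-message latency
PEAK_BYTES_PER_US = 2000.0   # assumed peak bandwidth (about 2 GB/s)


def effective_bandwidth(size_bytes, latency_us, peak_bytes_per_us):
    """Achieved bandwidth (bytes per microsecond) for a single message."""
    transfer_time_us = latency_us + size_bytes / peak_bytes_per_us
    return size_bytes / transfer_time_us


def bandwidth_loss(size_bytes, added_latency_us,
                   native_latency_us=NATIVE_LATENCY_US,
                   peak_bytes_per_us=PEAK_BYTES_PER_US):
    """Fraction of native bandwidth lost when a fixed latency is added."""
    native = effective_bandwidth(size_bytes, native_latency_us,
                                 peak_bytes_per_us)
    virtualized = effective_bandwidth(size_bytes,
                                      native_latency_us + added_latency_us,
                                      peak_bytes_per_us)
    return 1.0 - virtualized / native


if __name__ == "__main__":
    # 5 and 11 us are the latency costs quoted above, treated here as
    # added per-message overhead (an assumption of this sketch).
    for added_us in (5.0, 11.0):
        for size in (8, 1024, 64 * 1024, 4 * 1024 * 1024):
            loss = bandwidth_loss(size, added_us)
            print(f"+{added_us:4.1f} us, {size:>8} bytes: "
                  f"{100.0 * loss:5.1f}% bandwidth lost")
```

Under these assumptions the loss is large for tiny messages and shrinks toward zero as messages grow, which is why a full bandwidth curve says more than a single latency number.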
We're very up front about the costs of virtualization because we are well aware that there is no such thing as a free lunch. Virtualization provides a number of potential advantages in supercomputing systems, for example in terms of dealing with node failures, providing a small open-source platform for OS research and innovation on supercomputing systems, handling applications with different OS feature and performance requirements, and a variety of other things. However, it does come with a cost to applications and application scientists that has to be weighed against its potential benefits.
ASCI Red Storm normally runs a dedicated lightweight kernel called Catamount, not Linux. Similarly, the IBM BlueGene systems run the IBM compute node kernel, not Linux. Linux is used on some supercomputers, even some of the biggest ones (e.g. ORNL's Jaguar system), but the performance penalty of using Linux as opposed to a lighter-weight kernel can be substantial for some applications (e.g. > 10%).
"'Tis true, 'tis pity, and pity 'tis 'tis true." -- Poloniouius, in Willie the Shake's _Hamlet, Prince of Darkness_