Comment: It's really about the applications (Score 1) 185

by gentryx (#43738425) Attached to: Has Supercomputing Hit a Brick Wall?

I guess your major misunderstanding is that the applications running on supercomputers could somehow be done in the (loosely coupled) way that Google does its data mining. Since you're a professional, too, please refer to this Wiki article on stencil codes, one of the major classes of codes that run on supercomputers. If you find a way (or at least a pseudo-code formulation) to transform these applications into loosely coupled codes, then I would not be the only one to be curious to hear about it. You'd transform the whole industry. In fact this is not possible, though.
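To make the coupling concrete, here is a minimal sketch of a 1D three-point stencil (a Jacobi-style averaging step). The function names, the tiny grid, and the two-way split are illustrative assumptions, not taken from any real supercomputing code. The thing to notice is that every time step needs the neighbors' values from the previous step, so two "nodes" that each own half of the grid have to swap boundary cells before every single update.

```python
def step(grid):
    """One update: each interior cell becomes the mean of itself and its two neighbors.
    The two outermost cells are treated as fixed boundary values."""
    new = grid[:]
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def step_split(left, right):
    """The same update, with the grid split across two hypothetical nodes.
    Each half needs one halo cell from the other half before it can update."""
    left_ext = left + [right[0]]      # halo cell received from the right node
    right_ext = [left[-1]] + right    # halo cell received from the left node
    new_left = step(left_ext)[:-1]    # drop the halo cell again
    new_right = step(right_ext)[1:]
    return new_left, new_right


if __name__ == "__main__":
    grid = [0.0, 0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0]
    left, right = grid[:4], grid[4:]
    for _ in range(10):
        grid = step(grid)
        left, right = step_split(left, right)  # one exchange per time step
    print(grid)
    print(left + right)
```

Skipping the exchange for even one step gives a wrong result at the seam, which is why a missing or slow node stalls everyone else in this kind of code.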

But I agree that software will need to help with reliability and will have to actively manage node eviction/addition.

BTW: comparing Google and Cray is really like comparing apples and oranges: they're in different markets. The market for supercomputers is extremely small, the market for (online) advertising is gigantic.

Comment: Google is not a Supercomputer (Score 1) 185

by gentryx (#43728877) Attached to: Has Supercomputing Hit a Brick Wall?

Whenever someone on /. likens Google's network to a supercomputer, God kills a Pokemon. But honestly: the reason why Google can cope with these massive outages is that they're doing totally different computations from supercomputers. Google's compute jobs are loosely coupled. They do data mining. That is fundamentally different from supercomputing, where all compute jobs are tightly coupled. To give you a car analogy:

  • In the Google case millions of mechanics fix millions of cars in parallel. This is more or less trivial. If one of the mechanics is ill, another one can take over his task, or they simply wait until a replacement arrives.
  • In supercomputing you try to assign millions of mechanics to fix a single car in just a millionth of the usual time. This gets really tricky because they need to coordinate their actions tightly and if one of the mechanics is ill, others might trip over him and the whole job becomes a mess.

Not a good analogy, but I hope to correct the picture of Google being lightyears ahead of the supercomputing industry: they're simply working on very different problems. I wonder what makes you think that Google/Amazon/Facebook were 10 years ahead of Cray and academia? If they were, they'd simply take over Cray's market. And since Cray competes with IBM and Fujitsu, they'd probably try and claim parts of their market shares, too. This is not happening.

Comment: Re: Latency not as important as expected (Score 1) 185

by gentryx (#43728833) Attached to: Has Supercomputing Hit a Brick Wall?
Interesting! Actually, barriers are today considered non-scalable and thus people just try to avoid them. It's feasible if your code needs only next-neighbor communication. Not all codes satisfy this condition, but then again we build these machines today for a very specific set of applications.

Comment: Software is the problem/solution (Score 1) 185

by gentryx (#43723203) Attached to: Has Supercomputing Hit a Brick Wall?

Yes, in a way. We'll probably never be able to improve the hardware far enough that we can simply rely on it to fail gracefully (i.e. announce its impending death a few seconds in advance). The reason is that ATM our systems contain approx. 20k nodes. Exascale systems will likely push this to 200k. Even if you assume a node will live 10 years on average, you can estimate that one node of the system will fail roughly every 26 minutes (10 years is about 5.26 million minutes, divided by 200k nodes).
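For anyone who wants to redo the estimate, here is the back-of-the-envelope calculation as a small script. It assumes node failures are independent and happen at a constant rate, which is a simplification, and the function name is just made up for this sketch. With a 10-year node lifetime it gives roughly 263 minutes between failures at 20k nodes, 53 minutes at 100k nodes and 26 minutes at 200k nodes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def minutes_between_failures(nodes, node_lifetime_years):
    """Expected time between node failures anywhere in the system,
    assuming independent failures at a constant rate (a simplification)."""
    return node_lifetime_years * MINUTES_PER_YEAR / nodes


if __name__ == "__main__":
    for nodes in (20_000, 100_000, 200_000):
        interval = minutes_between_failures(nodes, 10)
        print(f"{nodes:>7} nodes: one failure every {interval:.0f} minutes")
```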

My money is on the software: we'll need some kind of redundancy (e.g. a simulation code would need to store its mesh so that each part is held by multiple nodes, a bit like the redundancy we see in Bittorrent and other P2P networks). But that will require applications to be reengineered, and that will be really really expensive. Considering how the industry is struggling with the (comparatively easy) adoption of GPUs, I don't see this happening anytime soon. Interesting times ahead!
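A toy sketch of what "each part is held by multiple nodes" could look like. The placement rule (each partition goes to a node and its successors, round-robin) and both function names are hypothetical, chosen only to keep the example short. Real codes would also have to keep the copies in sync every time step, which is where the expensive reengineering comes in.

```python
def place_replicas(num_parts, num_nodes, copies=2):
    """Map each mesh partition to `copies` nodes: node p and its successors, round-robin.
    The nodes are distinct as long as copies <= num_nodes."""
    return {p: [(p + k) % num_nodes for k in range(copies)] for p in range(num_parts)}


def lost_partitions(placement, failed_nodes):
    """Partitions for which every replica sits on a failed node."""
    failed = set(failed_nodes)
    return [p for p, nodes in placement.items() if all(n in failed for n in nodes)]


if __name__ == "__main__":
    placement = place_replicas(num_parts=8, num_nodes=8)
    print(lost_partitions(placement, [3]))      # a single failure loses no data
    print(lost_partitions(placement, [3, 4]))   # two adjacent failures lose partition 3
```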

Comment: Latency not as important as expected (Score 3) 185

by gentryx (#43722021) Attached to: Has Supercomputing Hit a Brick Wall?

Latency isn't so much of an issue, though: the #1 systems of the last ~3 years all had torus networks (all Blue Genes, all Crays, K computer, too). These networks only perform well for next-neighbor communication -- which is fine since most codes running on these machines are simulation codes and they only need this type of communication. If you scale up the system, you'll typically also scale the size of the simulation instance (this is known as "weak scaling").

This means that your program can still spend the same time waiting for the network as it could on a smaller machine. The cables do not need to become shorter.
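A rough way to see this, assuming a 3D simulation where each node owns a cubic block and only talks to its six face neighbors. The block sizes and function names are made up for illustration, and the node counts are assumed to be perfect cubes. Under weak scaling the local block stays the same size, so the amount of communication per unit of computation does not change as the machine grows. Under strong scaling (fixed global problem) the blocks shrink and the ratio gets worse.

```python
def halo_to_compute_ratio(local_side):
    """Cells exchanged with the six face neighbors per cell updated, for a cubic local block."""
    return 6 * local_side**2 / local_side**3


def local_side_weak(nodes, side_per_node=128):
    """Weak scaling: the global problem grows with the machine, the local block stays fixed."""
    return side_per_node


def local_side_strong(nodes, global_side=1024):
    """Strong scaling: a fixed global cube is split into `nodes` cubic blocks
    (`nodes` is assumed to be a perfect cube)."""
    return global_side / round(nodes ** (1 / 3))


if __name__ == "__main__":
    for nodes in (8, 512, 4096):
        weak = halo_to_compute_ratio(local_side_weak(nodes))
        strong = halo_to_compute_ratio(local_side_strong(nodes))
        print(f"{nodes:>5} nodes: weak {weak:.4f}, strong {strong:.4f}")
```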

Comment: Re:No? (Score 4, Informative) 185

by gentryx (#43721887) Attached to: Has Supercomputing Hit a Brick Wall?

Power consumption and MTBF: power consumption (high operating costs) could perhaps be solved by a larger budget, but the mean time between failures (MTBF) means that the machine will fail before it can compute anything meaningful. Right now the machines we build and, even more importantly, the software we build rely on all parts of the machine to function. If even a single node fails, then the data it holds becomes inaccessible and the rest of the compute job collapses like a house of cards.

This can be remedied by taking frequent snapshots and then restarting from the last snapshot, but the time for checkpoint/restart has been continuously growing for the last systems. No one really expects exascale systems to do full system checkpoint/restart in a reasonable time frame. They'd spend more time taking snapshots than actually computing.
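To put rough numbers on that, here is a sketch using a common first-order approximation (usually attributed to Young) for the best checkpoint interval: with checkpoint time C and MTBF M, the interval is about sqrt(2*C*M) and the fraction of time lost to checkpointing plus recomputation is about sqrt(2*C/M). This model is my addition, not something measured on a real system, it is only meaningful while C is much smaller than M, and the 10-minute checkpoint time and the MTBF values below are made-up example inputs.

```python
import math


def optimal_interval(checkpoint_min, mtbf_min):
    """First-order estimate of the best time between checkpoints, in minutes."""
    return math.sqrt(2 * checkpoint_min * mtbf_min)


def wasted_fraction(checkpoint_min, mtbf_min):
    """First-order estimate of the time lost to writing checkpoints plus
    recomputing lost work after a failure. Values near 1 mean the model
    (and the machine) has stopped being useful."""
    tau = optimal_interval(checkpoint_min, mtbf_min)
    return checkpoint_min / tau + tau / (2 * mtbf_min)


if __name__ == "__main__":
    checkpoint_min = 10  # hypothetical time for a full-system snapshot
    for mtbf_min in (263, 53, 26):  # hypothetical system-wide MTBF values
        lost = wasted_fraction(checkpoint_min, mtbf_min)
        print(f"MTBF {mtbf_min:>3} min: about {lost:.0%} of the time is lost")
```

Since snapshot time tends to grow with the machine while the MTBF shrinks, both effects push the lost fraction up at once.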

Source: I'm doing my PhD in supercomputing.

Submission: Fukushima water leak discovered

Submitted by OldJuke
OldJuke writes "Tokyo Electric Power Co. said about 120 tons of the water are believed to have breached the tank's inner linings, some of it possibly leaking into the soil. TEPCO is moving the water to a nearby tank at the Fukushima Dai-ichi plant, a process that could take several days ... More than 270,000 tons of highly radioactive water is already stored in hundreds of gigantic tanks and another underground tank. They are visible even at the plant's entrance and built around the compound, taking up more than 80 percent of its storage capacity.
TEPCO expects the amount to double over three years and plans to build hundreds more tanks by mid-2015 to meet the demand."

Submission: LibGeoDecomp 0.2.0: an auto-parallelizing computer simulation library for C++

Submitted by gentryx
gentryx writes "The LibGeoDecomp (Library for Geometric Decomposition codes) project has recently released its version 0.2.0. The library can be used together with a huge variety of models, ranging from LBM (CFD, stencil codes) to molecular dynamics simulations (n-body codes). It leverages its specialized API to relieve the user from parallelizing his code and scales on virtually every parallel architecture, be it multi-cores, GPUs, MPI clusters or even GPU-equipped supercomputers."

Comment: Do the math! (Score 1) 84

by gentryx (#43332825) Attached to: First Petaflop Supercomputer To Shut Down
Roadrunner consists of 6480 QS22 Blades. Using Cellminer each will yield approx. 56 MHash/s, or 363 GHash/s in total. Using the Bitcoin profitability calculator we can then estimate that one will gain ~27 BTC/day (ATM $2667) while paying $3360/day for power (assuming cheap $0.07/kWh). So yes: mining on Roadrunner would not be cost-effective.
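The arithmetic, spelled out as a short script. All inputs are the figures quoted above; the implied power draw is derived from the quoted daily cost and electricity price rather than stated in the comment, and the variable names are mine.

```python
BLADES = 6480
MHASH_PER_BLADE = 56             # approx. Cellminer rate per QS22 blade, as quoted
REVENUE_USD_PER_DAY = 2667       # ~27 BTC/day at the exchange rate of the time, as quoted
POWER_COST_USD_PER_DAY = 3360    # as quoted
USD_PER_KWH = 0.07               # the assumed cheap electricity price

total_ghash = BLADES * MHASH_PER_BLADE / 1000
implied_power_mw = POWER_COST_USD_PER_DAY / USD_PER_KWH / 24 / 1000
net_usd_per_day = REVENUE_USD_PER_DAY - POWER_COST_USD_PER_DAY

if __name__ == "__main__":
    print(f"total hash rate: {total_ghash:.0f} GHash/s")
    print(f"implied power draw: {implied_power_mw:.1f} MW")
    print(f"net result: {net_usd_per_day} USD/day")
```

The net result comes out at a loss of about $693 per day, before counting anything other than electricity.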

Comment: Stop complaining about the word "barrier" (Score 1) 84

by gentryx (#43332705) Attached to: First Petaflop Supercomputer To Shut Down

Whenever there is a story on supercomputers on /., there will be a comment stating that there was no barrier whatsoever. But that's not quite true.

The truth is that the performance of supercomputers grows so fast because engineers continuously solve problems that were previously deemed intractable (e.g. power consumption, reliability, network performance). The research may not be groundbreaking in the sense of earth-shattering, but definitely in the sense of "wow, I didn't think one could do that!"
