To Grid Or Not To Grid?
dbgimp writes "In my job at a (large) investment bank I am constantly being pushed to use grid technology.
I have many problems with this (not least that our data center is at best 100 Mb/s and our software is actually more data-heavy than computation-heavy). A typical batch job consists of around 10,000 trades and takes 10-30 minutes. I would far rather spend the time and money on multi-core machines and optimizing the software than on the latest fad technology.
I am interested to hear from other people in a similar position and, in particular, why or why not they chose grid software over improving the existing code to leverage better processor technology, and which grid software they chose to use and why. Or, conversely, why they chose not to use grid software."
Clustered Benefit (Score:5, Interesting)
That's right, it seems to me that upper management likes the idea of having a clustered system because if a customer ever asked if our software would work for 1,000 people, my manager would say, "Sure, just buy more machines for the cluster." And everyone likes that idea. The idea is that, well, the system might not be able to handle everyone right away, but wait a year or two and CPU cycles will be so cheap that we can just buy 30 low-end machines and cluster them to get the job done. Thanks to the common scheme of access that all databases use, this is an actual option.
I offer only the suggestion that maybe your bosses like the idea of just being able to throw more machines at it. Look at it from a financial perspective: if you tailored the code for multi-core CPUs--something I'm not even sure how to do--you would have to rebuild and maybe recode everything for future generations of machines. I can see why grid computing might sound so enticing to your employer. Look at Google's distributed scheme, hundreds of thousands of cheap machines running a stripped-down form of Red Hat--I don't know if that's 'grid' computing but I imagine it's along the same lines.
It isn't clear to me whether your bank offers a trading service or processes trades in batches. It seems that the latter is true. Now, you mentioned you work at an investment bank, so money probably isn't that big of an issue. Just go to your superior and say, "Look, I need the following," and if he balks, just ask him how important these 10,000 transactions a day are to him.
So, to me, it would seem more intelligent to use the following idea. Buy new network hardware that handles gigabit ethernet. The cards, the router, whatever you have, just up it so that your internal network can really throw data around. Maybe look at relaying fibre if you don't have it. Then take what money is left over and buy a few more machines. Get a low-end server to act as a proxy that dishes out the requests for a trade to a cluster of machines. Write the software independent of the hardware so that you can always just buy more machines and install your client application on the machines. At some point, your choke point is going to be your database, but if you make it that far, you've kind of hit a wall, in my opinion, and the only solution for that is to juice up the box (with database-specific hardware) that's serving your database.
Correction On Database (Score:2)
Allow me to correct myself. If you fear this occurring further down the line, you do have another option. Buy multiple database machines and, in your client app, select the connection information for an account based on a lookup table. Then split your datab
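A rough sketch of the lookup-table idea described above, in Python. Everything specific here is an assumption made for illustration: the host names, the number of database machines, and the choice to split accounts by ID range are hypothetical, not something the poster specified.

```python
import bisect

# Hypothetical lookup table: account-ID ranges mapped to database hosts.
# The boundaries and host names are invented for this example.
SHARD_UPPER_BOUNDS = [250_000, 500_000, 750_000]  # exclusive upper bounds
SHARD_HOSTS = [
    "db1.example.internal",
    "db2.example.internal",
    "db3.example.internal",
    "db4.example.internal",  # catches every ID at or above the last bound
]

def host_for_account(account_id: int) -> str:
    """Return the database host that holds the given account."""
    # bisect_right counts how many bounds are <= account_id,
    # which is exactly the index of the owning shard.
    return SHARD_HOSTS[bisect.bisect_right(SHARD_UPPER_BOUNDS, account_id)]
```

The client application would call something like `host_for_account(account_id)` before opening its connection, so adding a database machine means editing the table rather than the code.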
Re: (Score:1)
Remember to ask yourself, "How much is MY job worth if those trades DON'T get processed because of something I recommended?"
Plan your response accordingly.
Your boss is already looking for a potential scapegoat; make sure you're not it.
Re: (Score:3, Informative)
If we're talking about an application that can truly benefit from clustering, and is built so that node failure can be detected and worked around relatively gracefully, this isn't much of a consideration. If you have 10 machines and 1 goes down, you lose 10% of your throughput. If you look at it in terms of cores, 20 single-core machines are equivalent to 10 dual-core machines, so your downtime per core essentially
Typical "ask slashdot" whining. (Score:2)
Yet, he says to us: my boss asks me, but I don't feel like it. Well, tough shit, brother.
Re: (Score:2)
Seems more likely to me that he is looking for a case with which to justify selling a non-grid solution to his boss. He may not be making the final choice, but it seems quite likely that he would have at least some influence over it. Having the right justifications and facts to back yourself up would obviously be of great help.
Re: (Score:2)
Well compare costs (Score:3, Insightful)
Or you just need to index your tables.
The rule of thumb is to go the safe route unless you are told by higher-ups to do otherwise. Have the higher-ups sign off on the riskier method (to save your butt), then get the method working, focus on getting the job done right, and stop complaining about how bad a decision it was.
Re: (Score:3, Insightful)
Just providing cost comparisons boils down to "Your way costs X, my way costs Y." But that may not matter to someone who wants to be buzzword compliant. When an executive gets it in his head that "this" is better than "that", the best way to handle it is to show that "this" will give a crappy ROI while "that" will give a great ROI.
Unfortunately, sometimes even that does not work and you end up doing it the boss' wa
Re: (Score:2)
Different technology (Score:2, Insightful)
Re: (Score:2)
Re: (Score:2)
So the boss's boss told the boss "we're falling behind, use Grid technology"! And this poor bastard is stuck sticking square pegs in round holes.
Answered yourself ... (Score:3, Interesting)
If the process is more data-intensive than computation-intensive, then throwing more machines at the problem is the most cost-efficient way of going forward. You have already countered your argument for multi-core machines. Especially if this is finance, it is highly unlikely that optimizing the software will produce anything remotely practical in a short time period or at low cost. Software optimization can also introduce bugs and lock you into an implementation that cannot be easily updated.
Take search engine technology as an example: Google has hundreds of thousands of machines running advanced software on non-ultra-optimized platforms: Java and Python. The alternative is having a couple of hundred big iron machines running hand-tweaked C / assembly. As a business you should be seeking to reduce the overhead of operations. By increasing the number of machines, lowering the cost of each machine, and reducing the time spent optimizing the software by allowing higher-level languages that are easier to use and maintain, you can actually get better performance, reliability, and flexibility.
Re: (Score:1)
I agree with this personally ... but let's play devil's advocate.
Dealing with large quantities of data has always been the sales pitch for mainframes. The question could therefore maybe be broadened to "can grids/clusters/multi-core/... really replace the mainframe?"
Re: (Score:2)
Clusters innit (Score:1)
What sort... (Score:1)
30 minutes for 10,000 trades seems an awfully long time - I work in the same industry (specifically developing position management systems) and the only thing we do as a batch job is our daily rollover/mark-to-market, which finishes in less time than yours with a hell of a lot more trades than that.
A Cynic's reply... (Score:5, Interesting)
Period.
In fact, GRID software is constantly in flux, because there is no grant money to run a GRID, only to develop one, so they keep throwing stuff out and developing new parts -- to get grant money.
And yes, I am posting this anonymously because I work for such a place, and mostly like my job.
Grid vs cluster (Score:5, Informative)
Make sure you know the difference between grid technology [wikipedia.org] and clustering. Basically, grid is much more complicated but more flexible; the name means you can connect something to a grid to get computing power, just like you can connect to the power grid to get electricity. It looks like you're thinking of clustering instead, which is easier to deploy and in many ways closer to a multiproc machine
Re: (Score:2)
"Clusters" as used by most web communities are normally HA clusters, which are simply a logical group of servers running the same software, with a requ
Grid != Parallel (Score:3, Informative)
Grid within a company typically just means decent remote access to a shared cluster. A web service that submits jobs to Sun Grid Engine (which has nothing to do with 'grid', btw) would probably meet all the buzzword bingo requirements of a grid project without being anything of the sort. For sadists, look into OMII and GT4, but don't feel compelled...
You keep using this word 'Parallel'... (Score:2)
Re: (Score:2)
Re: (Score:2)
Cluster where it makes sense (Score:5, Interesting)
Our processes tend to be more computation (than data) heavy compared to what you describe, but we are using lots of clustered computers. Take your 10,000 trades and split them into chunks of 100 trades and have separate machines value each chunk and reassemble the results. Depending on the nature of what your software does this may or may not make sense. If you can split your workload into small chunks that can be analyzed independently you can achieve much better throughput.
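A minimal Python sketch of the split / value / reassemble pattern described above. It makes several assumptions: a local thread pool stands in for the separate machines, a trade is a hypothetical (quantity, price) pair, and `value_chunk` is a placeholder for whatever the real valuation does.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def value_chunk(trades):
    """Placeholder valuation (hypothetical): quantity * price per trade."""
    return [qty * price for qty, price in trades]

def value_all(trades, chunk_size=100, workers=8):
    """Value every trade chunk by chunk and reassemble in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so reassembly
        # is just flattening the per-chunk result lists.
        results = pool.map(value_chunk, chunked(trades, chunk_size))
        return [value for chunk in results for value in chunk]
```

With 10,000 trades and the default chunk size this produces 100 independent work units. The pattern only pays off if each chunk really can be valued without looking at the others.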
The newer cluster/grid software can be really shiny, but you don't always need it. Plain old PVM can still work wonders. Also, a lot of the commercial cluster software out there isn't well suited to this kind of high performance computation clustering.
I like the middleware layers (Score:2)
We want to offload processing cycles from z/OS onto a cheaper platform. As we process highly secure data we do not want any of this to land on insecure Windows boxes, so our Grid engines will typically be tightly controlled Solaris or Linux boxes.
What I like best is the re-division of the application. The application submits a request for processing to a broker/manager. The broker/manager dispatches reque
i have a similar problem with virtualization (Score:5, Informative)
So, rather than move everything over to LPARs I took a simple step - purchased a large virtualization-oriented server highly touted as perfect for this, and moved over a single app, with the goal of putting two apps on this server. Along the way I learned:
- I/O virtualization sucks for I/O-heavy applications
- the tools to determine how much of the CPU your app is getting at a given moment stink
- memory virtualization in which you resize application memory is primitive and almost useless
- there were no guidelines for optimization of the server - just recommendations to try it hundreds of different ways and leave it on the best settings
- basic setup of the machine required wading through tons of jargon that even the OS engineers didn't seem to know well
- out of the box - a single app on the new virtualization server performed more slowly than it did on a free seven-year-old server
- some of the most heavily-advertised virtualization features of the product just don't work
- virtualization of multiple busy apps onto the same server is mostly a waste of money
- virtualization of multiple mostly idle apps (failover servers, test servers, demo servers, etc) should work very well
- we spent at least $25k on labor just to create something that was supposed to be a slam dunk
- I'm glad that we started with a small prototype - and didn't waste a ton of cash moving everything over immediately the way some management hoped
- I think in the end we'll get multiple apps working on this box just fine. BUT - we will have spent more money on this scenario than by simply purchasing separate systems. We may recoup a savings if we move enough idle systems onto virtual boxes.
As a result of this experience my team now knows more about virtualization than any other people in the division, we now have a production server supporting it, my management is now cool on this technology, and there is no risk of being forced to migrate critical servers over quickly to the virtual world. I'd call that a success.
I think that you're right - that grid is in a hype cycle right now. So - there are quite a few disappointments to be had along the way to its implementation. For example - if your workload is heavily transactional - you're really not going to get much benefit. Oracle, for instance, supports grids - but that is really more about failover than performance. If you roll your own or use a more sophisticated product you can be safe in assuming that you'll hit unexpected issues, a gap between vendor marketecture & what you really need, and possibly the pain of having a vendor talking directly to your management.
You might want to consider having management fund a small prototype to prove out the benefits. Then let them see that they can achieve perhaps better availability but worse performance at a very high cost through this approach.
good luck
What virtualization platform did you use? (Score:1)
Re: (Score:1)
Wish I could tell you, but I can't.
It's a trade-off (Score:4, Informative)
Sounds like a trade-auditing project I was once on.
If the 10,000 trades are easily broken into small groups, such as by the initial letter of the ticker symbol, and if all the data for the analysis is fetched in the first step, you can in fact spread the processing over 26-odd machines for a run time of roughly (fixed part + (per-ticker-symbol part/26)).
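The (fixed part + per-ticker-symbol part/26) estimate has the shape of Amdahl's law. A small Python sketch makes the arithmetic concrete; the 4-minute and 26-minute figures are invented for illustration, and the model assumes the per-symbol work divides evenly across machines, which real ticker distributions rarely do.

```python
def estimated_runtime(fixed, parallel, machines):
    """Run time when only the parallel part is divided across machines."""
    return fixed + parallel / machines

def speedup(fixed, parallel, machines):
    """Single-machine run time divided by the distributed run time."""
    return (fixed + parallel) / estimated_runtime(fixed, parallel, machines)

# Hypothetical 30-minute job: 4 minutes of fixed work (fetching data,
# reassembling results) and 26 minutes of per-ticker-symbol work.
runtime_26 = estimated_runtime(fixed=4.0, parallel=26.0, machines=26)
speedup_26 = speedup(fixed=4.0, parallel=26.0, machines=26)
```

Under these made-up numbers, 26 machines bring the job from 30 minutes to 5, a speedup of 6 rather than 26, because the fixed part does not shrink.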
I have an article on doing the load-balancing part of this kind of processing, albeit on a large multiprocessor, at http://www.sun.com/blueprints/0605/819-2888.pdf [sun.com][In PDF].
As you've already guessed, sometimes the problem doesn't decompose nicely into parts that can be distributed to machines far from the database.
The rule of thumb is that grid does distributed computation, where you ship small amounts of data to many CPUs. If you have large amounts of data, you need to have previously distributed data stores, and then you ship the processing to reside with it, instead of the other way around. Alas, some folks call the latter grid, when it should be called something like "data grid" (;-))
--dave
Re: (Score:2)
Yup: if you could do it at run-time, you'd use a bin-packing algorithm to create N equally-sized buckets of trades (;-))
--dave
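One way to sketch the bucket-balancing idea in Python is the greedy "largest first" heuristic below. This is a substitute chosen for brevity, not necessarily the algorithm meant above: it gives a reasonable balance, not a guaranteed optimal one. The per-trade cost estimates are hypothetical inputs that would have to come from somewhere, such as past run times.

```python
import heapq

def balance_buckets(costs, n_buckets):
    """Assign trades to n_buckets, always adding to the lightest bucket.

    `costs` maps a trade ID to its estimated processing cost (hypothetical).
    Trades are placed in descending cost order, a greedy heuristic.
    """
    heap = [(0.0, i) for i in range(n_buckets)]  # (total cost, bucket index)
    buckets = [[] for _ in range(n_buckets)]
    by_cost = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    for trade_id, cost in by_cost:
        total, index = heapq.heappop(heap)  # the lightest bucket so far
        buckets[index].append(trade_id)
        heapq.heappush(heap, (total + cost, index))
    return buckets
```

Each bucket would then go to one machine. Because the heuristic is greedy, the heaviest and lightest buckets never differ by more than the cost of the single most expensive trade.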
Mod parent up. (Score:2)
This is the only post that sounds like it's coming from someone with a clue.
Re: (Score:2)
Quantian (Score:4, Informative)
Re: (Score:1)
some warnings from experience (Score:2)
Some learnings:
1. Software licensing is your biggest enemy. Oracle in particular is evil in this regard, but every vendor fears grid computing since it doesn't conform to their pricing models and gives you more bang for the buck. Investigate the consequences of grid at the earliest opportunity.
2. By linking numerous apps to a pool of servers, you've just complicated your software currency lifec
The key to running a grid successfully is attitude (Score:2)
hmm. (Score:3, Insightful)
Identify bottlenecks (Score:2)
A few comments from my experience (Score:1)
Not every application is an ideal candidate for the grid: the problems that scale best are composed of many discrete, independent calculations (think Monte Carlo simulations). Strictly linear p
fix the easy problems first (Score:2)
the next thing to think about is how to educate the powers that be on their options in terms of parallel processing. this me
Separate cluster from datacenter.. (Score:2)
First: I assume that you are talking about clusters, not grids (grid=>cluster as road=>car).
Second: The computation nodes *do not* sit on your regular datacenter network. A computation node only ever talks to its master and its peers, so they sit on their own, dedicated, high-speed network (usually no less than 1 Gbps).
Third: Some tasks are better for SMP, others for clusters. Find out whic
Grid and parallelize (Score:1)
2. Grid/Parallelize the application layer -- i.e., ensure you can run parallel jobs with discrete data.
3. If that doesn't help, then grid the database layer.
If your application isn't built to scale today -- see the second point -- all the grid in the world won't help you.
I agree with you that it sounds like the code needs some optimization -- 10-30 minutes to process i
Ok first off, the two are not mutually exclusive (Score:2)
I forgot (Score:3, Insightful)
8. Off the top of my head, freebies include Torque, GridEngine, Condor.
9. Yes it would be a Beowulf of those. Mwhahaha!
Re: (Score:1)
Improve code! (Score:2)
To a point, I have to wonder this as well. I'm really annoyed at not having the ability to use 32-bit drivers in a 64-bit OS when it's still going to end up addressing the same fucking registers. (I'm talking about a webcam, here, BTW) Considering that current 64-bit processors are based off of and share the same registers/opcodes (in the AMD/Intel market, that is, I can't speak
Re: (Score:2)
GRID is an administration system, not HPC (Score:1)
GRID is complex; its main advantage is the way it can handle users, data and computational resource administration in a very heterogeneous environment. GRID is all about administering groups of users all across the globe, using all kinds of different hardware, to process data that's too big to be stored in one location, and therefore also very distributed. It has all kinds of tools to distribute the management of access to resources, users, etc.
Re: (Score:1)
Check out IceGrid from zeroc (Score:2)
Warning signs (Score:2)
I've been at two investment banks, one midsized and one gargantuan. The gargantuan one has a grid, along with piles of Linux servers, piles of Sun servers, large, medium, and small databases of both transactional and data warehousing
Data heavy parallelization (Score:1)
Just some ideas... (Score:1)
I am interested to hear from other people in a similar position and, in particular, why or why not they chose grid software over improving the existing code to leverage better processor technology,
Not sure how "comparable" my situation is to yours (aerospace industry) but in a similar "many machines versus optimizing to exploit smaller, faster, better machines" situation we came down soundly on the side of the former. The reasoning went roughly like this:
We know today's budget. We can use it to upgrade
More of a summary (Score:1)
it all depends (Score:2)
How long are people willing to wait on the 30 minute jobs?
A lot of tasks like this tend to batch easily, and if you can batch it, then you can throw it on a batch queueing system (like LSF, the one I have my experience with).
At the end of the day, it's a lot easier to run multiple jobs on multiple machines than it is to optimize a single job. It all depends on where you want to spend your time and what return you want and expe
it's important too.. (Score:1)