Comment Re:Yes, this is legit and no, we're not idiots (Score 2) 387
Speaking as one of the people you will (temporarily) be supplanting, sounds like you have a tough spot to get through.
I also admin life sciences clusters for a major university on the east coast. I'm going to assume that our workloads are going to be fairly similar (R, MATLAB, BLAST, HMMER, IDEA, maybe some mutual information codes, sequence alignment, etc.). If that's not the case, some of this advice may be off.
So, a couple of things:
- I think CentOS is a good idea for a cluster platform. I do not think Rocks will scale the way you want it to at that size, and it's really not terribly flexible either. Let's put it this way: I often find that I could have just built from scratch by the time I get Rocks to do all the customization I need. We run Rocks on small clusters, but the big ones we build ourselves (e.g. CentOS, or sometimes Fedora, + Kickstart + some utility scripts and a scheduler... we use SGE, now OGS). Finally, stay away from more fringe distributions. You'll find that commercial software vendors are pretty quick to let you know they just don't support running their software on XX distribution. There are other reasons too. I posted a bit of a rant on this a while ago at: http://slashdot.org/comments.pl?sid=2188634&cid=36255670
- InfiniBand vs. 10 Gbps Ethernet. Well, InfiniBand is cool, and I've spent a lot of time working with it. I once had a project that involved writing some early stage block level storage protocols for InfiniBand... really, I like InfiniBand. That said, unless you plan to run a lot of MPI-enabled MD simulations like Desmond, skip the IB and get 10 Gbps. There are a couple of exceptions to that rule, but most life sciences applications do not use MPI, and most of your traffic is going to be storage I/O. Whatever storage solution you pick, it's probably not InfiniBand-enabled (on the front end anyway, and you really don't want to be running IP over IB if you can help it). To say more I'd have to know a bit more about what you're going to be running.
- GPUs. One thing sticks out to me a lot here. If you don't know which GPUs to get, that probably means no one has ported anything to GPU yet. If someone has done some porting, you should ask them what they ported to. If they ported to CUDA, you should probably be looking at 2050s or 2070s. If they haven't ported anything, and they don't have (good!) GPU-ported applications... don't waste money on too many GPUs. We've run a couple of pilots where we tried to get people using GPUs, and here are a couple of observations: 1. most researchers can't/won't do the porting; 2. most pre-built applications, such as MATLAB and R, _still_ require you to port the MATLAB, R, etc. code, which researchers will probably also not do; 3. some life sciences algorithms just don't work well on GPUs (e.g. they are branch-heavy or memory I/O heavy algorithms); 4. many of the pre-built GPU applications for life science are terrible (I know a particular sequence alignment tool, for instance, that is proud of its 4x speedup over a single CPU... do the math... which costs more, a quad core CPU or a Tesla?). GPUs can be great, but buy them sparingly at the beginning and integrate them as they are actually being used. If you're buying now you should be buying CUDA (i.e. NVIDIA). It's the only truly mature development kit (though I don't like that it doesn't let you control the scheduling on the card... but I digress).
- Chargeback: So the bottom line is nothing is going to give you chargeback without some effort. You're going to have to manage that on your own. The best way to do it is to set up some basic accounting scripts that will dig through your cluster logs (or database, depending on your configuration) and generate accounting reports. Note that it's the resource manager/policy manager (e.g. OGS, Torque/Maui, etc.) logs that you're going to do this with. You _could_ do it with Rocks as well as anything else (but again, I don't suggest Rocks for this project).
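To give a rough idea of what I mean by a basic accounting script, here's a minimal sketch in Python. It assumes a colon-delimited, one-line-per-job accounting log like the one SGE/OGS writes. The column positions below (owner, wallclock seconds, slots) are my assumptions, so check them against the accounting(5) man page for your scheduler. The function names and the per-core-hour rate are made up for illustration.

```python
from collections import defaultdict

# ASSUMED 0-based column positions in a colon-delimited accounting log.
# Verify against your scheduler's accounting(5) man page before trusting them.
OWNER_COL = 3       # user who submitted the job
WALLCLOCK_COL = 13  # wallclock time in seconds
SLOTS_COL = 34      # number of slots (cores) the job was granted


def core_hours_by_owner(lines, owner_col=OWNER_COL,
                        wallclock_col=WALLCLOCK_COL, slots_col=SLOTS_COL):
    """Sum wallclock * slots per owner and return core-hours as a dict."""
    totals = defaultdict(float)
    needed = max(owner_col, wallclock_col, slots_col)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split(":")
        if len(fields) <= needed:
            continue  # short or truncated record
        try:
            wallclock = float(fields[wallclock_col])
            slots = int(fields[slots_col])
        except ValueError:
            continue  # non-numeric field, skip rather than guess
        totals[fields[owner_col]] += wallclock * slots / 3600.0
    return dict(totals)


def charges(totals, rate_per_core_hour):
    """Turn core-hours into a charge per owner (the rate is hypothetical)."""
    return {owner: round(hours * rate_per_core_hour, 2)
            for owner, hours in totals.items()}


if __name__ == "__main__":
    # Tiny synthetic log with a simplified layout:
    # owner in column 3, wallclock in column 6, slots in column 7.
    demo = [
        "all.q:node01:lab:alice:blast:101:3600:4",
        "all.q:node02:lab:bob:hmmer:102:7200:1",
    ]
    usage = core_hours_by_owner(demo, owner_col=3, wallclock_col=6, slots_col=7)
    for owner, cost in sorted(charges(usage, 0.05).items()):
        print(f"{owner}\t{usage[owner]:.2f} core-hours\t${cost:.2f}")
```

In practice you'd point this at the real accounting file from cron, group by project or department instead of (or as well as) owner, and decide how you want to treat failed jobs and array tasks. Torque/Maui logs have a different layout, so the parsing part would need to change there.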
Sounds like you have a fun project ahead of you... good luck!