My project involves research where I run my own code to repeatedly analyze a ~30TB dataset. The data is in flat files on an xfs filesystem, and each job reads a subset of these files, always sequentially. The task is inherently sequential, so it isn't parallelizable in ways suited to Map-Reduce, for instance. I get parallelism by running one research job per CPU core and conducting many experiments in parallel.
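To make the job layout concrete, here's roughly the shape of my launcher (a sketch; the actual analysis is replaced by a placeholder function, and `subset_id` is a hypothetical handle for a job's file subset):

```python
import os
from concurrent.futures import ProcessPoolExecutor

def run_experiment(subset_id):
    # Placeholder for the real analysis job, which reads its
    # file subset sequentially and returns some result.
    return subset_id

# One worker process per CPU core; extra jobs queue until a core frees up.
with ProcessPoolExecutor(max_workers=os.cpu_count()) as ex:
    results = list(ex.map(run_experiment, range(8)))
```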
The architecture I have so far is to use a direct-attached external JBOD chassis and hardware RAID6 over 16 x 4TB 3.5" SATA drives. I then NFS export this read-only dataset to other compute nodes nearby, which also run compute jobs accessing the same data. I am currently CPU bound, and so far I think I could grow to 2-4 more compute nodes all reading from this NFS filesystem. Once I exceed the bandwidth of the drives, I'll buy another chassis & drives (50TB) and mirror my dataset there.
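For reference, the read-only export is a one-liner in /etc/exports (the path and subnet here are placeholders, not my actual values):

```
/data  10.0.0.0/24(ro,async,no_subtree_check)
```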
I'm looking to expand compute capacity, and looking for advice between:
— scale out with many cheap nodes (Dell R620)
— fewer beefy nodes (Dell R820)
— Dell blade solutions (M1000e + many M620 blades)
— Dell VRTX blade with internal storage + compute
I have heard that blades can be finicky (setup, compatibility), and it surprises me that these enterprise technologies would be the most price-efficient option. Yet the pricing seems reasonably good, and the power consumption should be better. Are blades a popular choice for HPC?
I've also heard that it's possible to directly attach storage to multiple compute nodes. Presumably, if I do this, I'd need to switch to a filesystem that supports shared access; is this advisable? It seems like it could perform better than NFS over 1GigE.
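For what it's worth, some back-of-envelope arithmetic on why the 1GigE link, not the array, is the likelier bottleneck (the per-drive figure is an assumption for a modern 3.5" SATA disk, not a measurement):

```python
# Rough, assumed figures -- not measurements.
drives = 16
per_drive_mb_s = 150                  # assumed sequential read per 3.5" SATA drive
raid6_data_drives = drives - 2        # RAID6 spends two drives on parity
array_mb_s = raid6_data_drives * per_drive_mb_s   # best-case streaming reads

gige_mb_s = 1000 / 8 * 0.9            # ~112 MB/s usable on 1GigE after overhead
clients_at_saturation = array_mb_s / gige_mb_s

print(array_mb_s, gige_mb_s, clients_at_saturation)
```

Under those assumptions the array can stream roughly 2 GB/s, so it would take well over a dozen 1GigE clients reading flat out to saturate the spindles; each NFS client is capped at ~112 MB/s long before that.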
Happy to hear what people think.