rockmuelle - Slashdot User

Comment Data scientists == web masters (Score 2) 99

by rockmuelle on Friday January 30, 2015 @09:48PM (#48944121) Attached to: Cutting Through Data Science Hype

Comment Re:Yep it is a scam (Score 2) 667

by rockmuelle on Wednesday January 21, 2015 @07:12PM (#48870451) Attached to: US Senate Set To Vote On Whether Climate Change Is a Hoax

Comment No Camera? (Score 1) 324

by rockmuelle on Wednesday January 21, 2015 @05:55PM (#48869685) Attached to: What Will Google Glass 2.0 Need To Actually Succeed?

Comment Re:Yep it is a scam (Score 2, Informative) 667

by rockmuelle on Wednesday January 21, 2015 @05:45PM (#48869573) Attached to: US Senate Set To Vote On Whether Climate Change Is a Hoax

Comment Re:Proprietary (Score 3, Interesting) 648

by rockmuelle on Tuesday January 20, 2015 @01:14PM (#48857177) Attached to: Justified: Visual Basic Over Python For an Intro To Programming

Comment HTML5 Client (Score 1) 264

by rockmuelle on Tuesday January 13, 2015 @03:01PM (#48805145) Attached to: Ask Slashdot: Linux Database GUI Application Development?

Comment Re:Hadoop needs a fairly specialized problem (Score 4, Interesting) 34

by rockmuelle on Tuesday January 13, 2015 @12:46PM (#48803743) Attached to: Meet Flink, the Apache Software Foundation's Newest Top-Level Project

Comment Re:Ok, I give up (Score 4, Interesting) 34

by rockmuelle on Tuesday January 13, 2015 @11:20AM (#48802733) Attached to: Meet Flink, the Apache Software Foundation's Newest Top-Level Project

More importantly, why did we need Hadoop when we already had [your_favorite_language] + [your_favorite_job_scheduler] + [your_favorite_parallel_file_system]?

Seriously, standard HPC batch processing methods are always faster and easier to develop for than latest_trendy_distributed_framework.

The challenges of data at scale* are almost all related to IO performance and the overhead of accessing individual records.

IO performance is solved by understanding your memory hierarchy and designing your hardware and tuning your file system around your common access patterns. A good multi-processor box with a fast hardware raid and decent disks and sufficient RAM will outperform a cheap cluster any day of the week and likely cost less (it's 2015, things have improved since the days of Beowulf). If you need to scale, a small cluster with Infiniband (or 10 GigE) interconnects and Lustre (or GPFS if you have deep pockets) will scale to support a few petabytes of data at 3-4 GB/s throughput (yes, bytes, not bits). You'd be surprised what the right 4 node cluster can accomplish.

On the data access side, once the hardware is in place, record access times are improved by minimizing the abstraction penalty for accessing individual records. As an example, accessing a single record in Hadoop generates a call stack of over 20 methods from the framework alone. That's a constant multiplier of 20x on _every_ data access**. A simple Python/Perl/JS/Ruby script reading records from the disk has a much smaller call stack and no framework overhead. I've done experiments on many MapReduce "algorithms" and always find that removing the overhead of Hadoop (using the same hardware/file system) improves performance by 15-20x (yes, that's 'x', not '%'). Not surprisingly, the non-Hadoop code is also easier to understand and maintain.

tl;tr: Pick the right hardware and understand your data access patterns and you don't need complex frameworks.

Next week: why databases, when used correctly, are also much better solutions for big data than latest_trendy_framework. ;)

-Chris

*also: very few people really have data that's big enough to warrant a distributed solution, but let's pretend everyone's data is huge and requires a cluster.

** it also assumes the data was on the local disk and not delivered over the network, at which point, all performance bets are off.

Comment Re:cis and mi regulation is not "bad" code (Score 2) 14

by rockmuelle on Friday December 19, 2014 @06:18PM (#48638267) Attached to: Machine Learning Reveals Genetic Controls

Comment Re:Not in the hospital I work at. (Score 1) 73

by rockmuelle on Monday December 01, 2014 @12:48PM (#48498343) Attached to: Intel Processor Could Be In Next-Gen Google Glass

Comment Super Mario Galaxy! (Score 1) 132

by rockmuelle on Thursday November 13, 2014 @03:54PM (#48380739) Attached to: Comet Probe Philae Unanchored But Stable — And Sending Back Images

Comment Re:Oh no (Score 3, Informative) 297

by rockmuelle on Monday November 10, 2014 @11:36AM (#48350223) Attached to: Study: Body Weight Heavily Influenced By Heritable Gut Microbes

Comment Programming complexity (Score 2) 181

by rockmuelle on Sunday November 09, 2014 @11:54AM (#48345235) Attached to: There's No Such Thing As a General-Purpose Processor

Comment One semester is a start (Score 1) 173

by rockmuelle on Saturday November 08, 2014 @08:01PM (#48342589) Attached to: Codecademy's ReSkillUSA: Gestation Period For New Developers Is 3 Months

Comment Re:Bruce Perens (Score 1) 240

by rockmuelle on Wednesday October 01, 2014 @02:01PM (#48039329) Attached to: Back To Faxes: Doctors Can't Exchange Digital Medical Records

Slashdot Top Deals