Comment Re:The Fuck? (Score 1) 175

> Well, for queries on structured data, no, not often at all. Practically never if properly configured.

Why do the people who make those engines disagree with you and advocate hybrid strategies?

> Maybe MySQL does, I don't know. PostgreSQL does not.

EnterpriseDB, which develops Postgres, disagrees with you. That's one reason they have a practice built around supporting IBM's big data platform and themselves author the Postgres Plus Connector for Hadoop.

Comment Re:The Fuck? (Score 1) 175

> but it's not always a win, especially if you don't know what you're doing or why.

I agree. I always start big data conversations with "what query do you want to perform that doesn't work on an RDBMS?" If they can't name even one, then there is no reason to go big data. A dataset under 100 GB, that isn't being read and destroyed quickly, and that can be structured consistently never needs big data. At least one of those three things has to be false.

The problem on /. right now is that lots of the developers who are SQL advocates didn't come up in the pre-SQL years, when the ratio of computational power to storage was much lower than it was in 1990-2005. Techniques that made sense when you could only process, say, 3 GB/hr make sense again today, when storage is often measured in petabytes and it's hard to get 5 GB/s from even the best arrays. Reading 100 TB at 5 GB/s takes about 5.5 hours. If you need answers in 55 seconds, then even a factor of 10 above 5 GB/s (which I've never seen done) won't get you close to your goal. That's my point to most of the SQL advocates: there are use cases, because 100 TB of data in a table is no longer out of reach even for a midsized business.
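
To make the throughput arithmetic concrete, here is a minimal sketch (the figures are the ones above, not benchmarks; Python just to show the math):

    # Back-of-the-envelope full-scan times at the throughputs
    # discussed above. Decimal units: 1 TB = 1000 GB.

    def scan_seconds(table_tb, gb_per_sec):
        """Seconds to read table_tb terabytes at a sustained gb_per_sec."""
        return table_tb * 1000 / gb_per_sec

    print(scan_seconds(100, 5) / 3600)   # 100 TB at 5 GB/s  -> ~5.6 hours
    print(scan_seconds(100, 50))         # even at 50 GB/s   -> 2000 s, not 55 s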

Comment Re:Terrible arguments for Big Data (Score 1) 175

> Why bother lighting the fuse for a full cartesian product blow-up at all?

The main reason is when the blow-up is sparse in practice.

Say, for example, A has 1 million rows and 20 columns, and B has 1 million rows and 20 columns. Column A1 is a link to rows in B. On average a value in A1 will have 3 corresponding rows in B, and pulls from A*B tend to be 5 rows or fewer. It is cheap to just pull the appropriate blocks from A and then the appropriate blocks from B. That's going to be much faster than denormalizing.
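
A minimal sketch of why the sparse case stays cheap (hypothetical tables and column names, just to show the access pattern):

    # Index B on its join key once, then probe it per row of A.
    # Each probe touches only the ~3 matching rows of B, so the total
    # work is |A| + |matches| -- nowhere near a cartesian product.

    def sparse_join(a_rows, b_rows):
        b_index = {}
        for b in b_rows:
            b_index.setdefault(b["key"], []).append(b)
        for a in a_rows:
            for b in b_index.get(a["a1"], ()):  # ~3 rows on average
                yield (a, b)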

Comment Re:The Fuck? (Score 1) 175

And then what? How do you get them to work together? You could add a coordination system between them, so you have a bunch of slave nodes getting dispatches from a master node. Well, to make that work you need your data to be partitioned and your workload to be combinable at the coordination level, with little contact between the coordinator and the slave nodes. That's a big data engine.
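
A toy sketch of that coordination pattern (in-process stand-ins for the master and the slave nodes, not a real engine):

    # The "master" partitions the data, each "slave" computes a partial
    # result with no shared state, and the master combines the partials.
    # That shape -- partitioned data, combinable results, little
    # coordinator/worker chatter -- is the core of a big data engine.

    from concurrent.futures import ProcessPoolExecutor

    def slave(partition):
        return sum(len(record) for record in partition)  # local work only

    def master(data, n_workers=4):
        partitions = [data[i::n_workers] for i in range(n_workers)]
        with ProcessPoolExecutor(n_workers) as pool:
            partials = pool.map(slave, partitions)
        return sum(partials)  # the single combine step

    if __name__ == "__main__":
        print(master(["alpha", "beta", "gamma", "delta"] * 1000))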

Comment Re:The Fuck? (Score 0) 175

Your comment about set theory is nonsense. This is computer science, not math. In math (and especially set theory) a problem has a finite solution if, given any finite amount of computational power C, there exists a finite amount of time T such that the algorithm will arrive at an answer. Computer science is all about making C and T small.
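
Written out (my formalization, not the parent's wording):

    % Math's bar: some finite resources suffice at all.
    \text{finitely solvable: } \forall\, C < \infty \;\; \exists\, T < \infty \; : \; \text{the algorithm halts with the answer within } (C, T)
    % Computer science's bar: how C and T grow with input size n,
    % e.g. whether T(n) stays in O(n^k).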

If you want to have a long conversation get an account.

Comment Re:The Fuck? (Score 1) 175

Absolutely true. SQL technology is much more mature. But just as SQL made sense for client-server workloads in a world where non-SQL, COBOL-based systems were more mature, big data makes sense for non-client-server workloads of the kind that are, from a hardware standpoint, more similar to the old non-SQL workloads.

Comment Re:The Fuck? (Score 4, Informative) 175

> SQL engines are often slower than what?

Than engines designed for massive parallelism, when dealing with workloads which can be effectively processed in parallel.

> Operating on what hypothetical database schema with how many records spread across how many tables?

Generally NoSQL engines use schema-on-read techniques, not schema-on-write; the table structure is imposed during the read. To get some sort of fair comparison, take something like a typical star schema with a much-too-large fact table (think billions or trillions of rows) and a half dozen dimension tables.
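
For concreteness, here is a hypothetical shape for that comparison (invented table and column names):

    # A typical star-schema aggregate: one enormous fact table joined
    # to small dimension tables. The dimensions are trivial; the pain
    # is the scan over billions or trillions of fact rows.
    star_query = """
        SELECT d_date.year, d_store.region, SUM(f.amount)
        FROM fact_sales AS f                      -- billions+ of rows
        JOIN d_date  ON f.date_id  = d_date.id    -- half a dozen small
        JOIN d_store ON f.store_id = d_store.id   -- dimension tables
        GROUP BY d_date.year, d_store.region;
    """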

Or, if you really want to make it worse: the same query while the table is getting 1M writes per second and you want an accurate stream.

> SQL engines have problems with massive parallelism? Why? Which ones?

Because SQL by its nature operates on the table, not on the individual rows. Older database technologies that were row oriented, like what you see on a mainframe or in SAS, work better when the ratio of table size to computation speed is low. Today, because disk storage per dollar has grown so fast, with disk we face many of the same problems that systems in the 1980s faced with tape.

And the answer to the next question is pretty much all of them. The big data SQL engines have the fewest problems, though, and via their execution plans turning into map-reduces they might present a viable long-term solution.

> How well do you *really* know SQL in general and the capabilities of different database engines in particular?

Assume I don't know anything. Oracle, which has the best engine and the best SQL people on the planet, has a guide for hybridization to handle things their engine can't handle well. IBM, which probably comes in second and invented the relational database, produces its own Hadoop / R to handle queries that DB2 (which is, BTW, far better than Oracle at streaming) can't handle. Teradata's engine, which was originally written specifically for larger amounts of data, has for a decade had specific features of another subsystem to do enhanced big data, and they also have guides for hybridization for things even their enhanced engine can't handle. And Microsoft, which writes the 3rd most popular engine, has spent many millions on hybridization strategies. EnterpriseDB (Postgres) fully supports the IBM strategy.

I don't know anyone in the space who agrees with the /. "SQL can do everything" attitude.

> but that portion of the article was ridiculous, and thus far all of the comments in support of it have demonstrated a similar lack of familiarity with actual databases, their operation, or performance tuning.

The article was ridiculous; I said as much in another response. However, the comment I was responding to went much too far in the other direction. As for performance tuning: performance tuning is designed to avoid full table scans and expensive joins. The goal of many hybridization strategies is to take a raw data flow and, using a big data engine, convert it through a relational ETL into a form that can take advantage of indexing and a better execution plan. Tuning doesn't do much good when the initial goal is to do a full table scan.
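
A rough sketch of that division of labor (hypothetical names; the point is only where the full scan lives):

    # The big data stage eats the one unavoidable full pass over the raw
    # flow; only the reduced, structured result lands in the RDBMS,
    # where indexes and the execution plan can actually help.

    def big_data_stage(raw_records):
        totals = {}
        for rec in raw_records:  # the full scan happens here, once
            key = rec["customer_id"]
            totals[key] = totals.get(key, 0) + rec["amount"]
        return totals

    def etl_to_rdbms(cursor, totals):
        cursor.executemany(
            "INSERT INTO customer_totals (customer_id, total) VALUES (?, ?)",
            list(totals.items()),
        )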

Comment Re:The Fuck? (Score 4, Informative) 175

I know SQL pretty well. I agree with you it handles most stuff. That doesn't mean it handles everything.

SQL engines are often slower.
SQL engines have problems with massive parallelism (e.g. often at around 12 CPUs you stop gaining much at all by adding additional CPUs; see the sketch after this list).
SQL engines have problems with complex in-document (i.e. in-blob) searches.
etc...
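
A minimal illustration of that parallelism plateau via Amdahl's law (the 10% serial fraction is an assumption for illustration, not a measurement of any engine):

    # Amdahl's law: with serial fraction s, the speedup on n CPUs is
    # 1 / (s + (1 - s) / n). Planning, locking, and final aggregation
    # all feed the serial fraction in a SQL engine.

    def speedup(n_cpus, s=0.10):  # s = assumed serial fraction
        return 1 / (s + (1 - s) / n_cpus)

    for n in (1, 4, 12, 48, 96):
        print(n, round(speedup(n), 2))
    # 1 -> 1.0, 4 -> 3.08, 12 -> 5.71, 48 -> 8.42, 96 -> 9.14:
    # past ~12 CPUs the curve flattens toward the 1/s = 10x ceiling.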

Comment Terrible arguments for Big Data (Score 1) 175

I'm a big data advocate. I like the idea of engines designed for unstructured data. But the two examples in the article barely even register as difficulties for relational databases: "What if two people share the same address but not the same account? What if you want to have three lines to the address instead of two? Who hasn't tried to fix a relational database by shoehorning too much data into a single column? Or else you end up adding yet another column, and the table grows unbounded."

As for his comments on denormalizing, I'm wondering if he has ever heard of a data warehouse and a star / snowflake schema, both of which handle the "I want cheaper joins" problem without having to denormalize the dimension tables.
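
A minimal sketch of what those schemas buy (invented tables; the fact table carries compact keys and the dimensions stay normalized):

    # Star schema in miniature: joins stay cheap because the fact table
    # holds small integer foreign keys, and each dimension is a small,
    # separate table that never gets denormalized into the fact rows.
    star_schema_ddl = """
        CREATE TABLE d_customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE d_product  (id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE fact_sales (
            customer_id INTEGER REFERENCES d_customer(id),
            product_id  INTEGER REFERENCES d_product(id),
            amount      NUMERIC
        );
    """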

Comment Re:Causes of hording. (Score 1) 107

The Department of Defense runs servers out of house. Lockheed Martin runs a cloud provider. Many of the country's banks handle it that way. There is no question you can buy better security than any company has internally.

As for running an internal cloud, that's pretty easy, and they could ask a vendor to run the financial IT while keeping all the servers physically on their premises.
