Comment Re:Sounds inefficient (Score 2, Informative) 65
BigTable scales pretty well (go read its white paper) - though perhaps not as efficiently as map-reduce for something as simple as text-to-keyword statistics (otherwise why wouldn't they have used it all along).
I'll caveat this whole post: it's all based on my reading of the BigTable white-paper a year ago, plus having played with Cassandra, Hadoop, etc. occasionally since then. Feel free to call me out on any obvious errors. I've also looked at a lot of DB internals (Sybase, MySQL MyISAM/InnoDB, and PostgreSQL).
What I think you're thinking is that in a traditional RDBMS (which they hint at), you have a single logical machine that holds your data. That's not entirely true, because even with MySQL you can shard the F*K out of it. Consider putting a MySQL server on every possible combination of the first two letters of a Google search. Then take high-density combinations (like those beginning with "s") and split them out 3, 4, or 5 ways.
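For concreteness, here's a minimal sketch of that kind of prefix-based routing, assuming a hand-maintained shard map (all server names and split counts are made up for illustration):

```python
# Hypothetical prefix-based shard routing. The shard map, server names,
# and split counts are assumptions for illustration only.
SHARD_MAP = {
    "sa": ["mysql-sa-0", "mysql-sa-1", "mysql-sa-2"],  # high-density prefix, split 3 ways
    "qu": ["mysql-qu-0"],                              # low-density prefix, one server
}

def shard_for(query: str) -> str:
    """Route a search term to a MySQL server by its first two letters."""
    prefix = query[:2].lower()
    servers = SHARD_MAP.get(prefix, ["mysql-default"])
    # Within a split prefix, spread load by hashing the full term.
    return servers[hash(query) % len(servers)]

print(shard_for("sandwich"))  # one of the three "sa" shards
print(shard_for("slashdot"))  # "sl" isn't mapped, falls back to the default shard
```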
There are drastic differences in how the data is stored, but that's not strictly important, because there are column-oriented table stores in MySQL and other RDBMS systems. The key problem of sharding is what MySQL NDB Cluster (which is a primitive key-value store) and the other distributed-DB technologies that beat traditional DBs at scalability focus on.
BUT, the fundamental problem that page searches are dealing with is that I want a keyword to map to a page-view list (along with metadata such as first paragraph / icon / etc.) that is POPULATED from statistical analysis of ALL page-centric data. Meaning you have two [shardable] primary keys: one is a keyword and one is a web-page URL. But the web-page table has, essentially, foreign keys into potentially THOUSANDS of keyword records and vice versa. Thus a single web-page update would require thousands of locks.
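A toy illustration of why that fan-out hurts, with the two keyspaces modeled as plain dicts (the shapes and names here are my assumptions, not Google's schema):

```python
# Two shardable keyspaces: pages keyed by URL, index keyed by keyword.
# One page references many keywords and vice versa, so a single page
# update touches every referencing keyword row.
pages = {
    "http://example.com/a": {"rank": 0.7, "keywords": ["linux", "kernel"]},
}
# keyword -> list of (url, rank) pairs; thousands of entries in reality
keywords = {
    "linux": [("http://example.com/a", 0.7)],
    "kernel": [("http://example.com/a", 0.7)],
}

def update_page_rank(url, new_rank):
    """Each keyword row touched below would be a separate row lock in an RDBMS."""
    pages[url]["rank"] = new_rank
    for kw in pages[url]["keywords"]:
        keywords[kw] = [(u, new_rank if u == url else r) for u, r in keywords[kw]]

update_page_rank("http://example.com/a", 0.9)
print(keywords["linux"])  # [('http://example.com/a', 0.9)]
```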
In map-reduce, we avoid the problem. We start off with page text, mapped to keywords with some initial metadata about the parent page. In the reduce phase, we consolidate (via a merge sort) down to just the keywords, grouping the web pages into ever more complete lists of pages (ranked by their original metadata, which includes co-keywords). In the end, you have a maximally compact index file, which you can replicate to the world using traditional BigTable (or even big iron if you really wanted).
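Here's a minimal sketch of those two phases in plain Python, standing in for a real Hadoop/MapReduce job (the page-metadata shape is an assumption):

```python
# Sketch of the map and reduce phases for an inverted index.
from collections import defaultdict

def map_phase(url, text, rank):
    """Emit (keyword, page-metadata) pairs for one page."""
    for word in set(text.lower().split()):
        yield word, (url, rank)

def reduce_phase(mapped):
    """Consolidate pairs into keyword -> ranked page list (the index file)."""
    index = defaultdict(list)
    for word, entry in mapped:
        index[word].append(entry)
    for word in index:
        index[word].sort(key=lambda e: -e[1])  # best-ranked pages first
    return dict(index)

crawl = [("http://a", "linux kernel news", 0.9), ("http://b", "kernel panic", 0.4)]
pairs = (p for url, text, rank in crawl for p in map_phase(url, text, rank))
print(reduce_phase(pairs)["kernel"])  # [('http://a', 0.9), ('http://b', 0.4)]
```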
The problem, of course, is that you can't complete the reduce phase until all web pages are fully downloaded and scanned. ALL web pages. Of course, you do an hourly job which takes only high-value web pages and merges them with the previous master list. So you have essentially static pre-processed data which is overwritten by a subset of fresh data. But you still have slowest-web-page syndrome. OK, so solve this problem by ignoring web-load requests that don't complete in time - they'll be used in the next update round. Well, you still have the issue of massive web pages that take a long time to process. OK, so we'll have a cut-off for them too. Mapping nodes which take too long don't get included this round (you're merging against your last valid value, so if there isn't a newer version, the old one naturally carries forward). But the merge sort itself is still MASSIVELY slow. You can't get 2-second turnaround on high-importance web sites. You're still building a COMPLETE index every time.
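That carry-forward merge might look something like this hedged sketch (the timestamps and entry shapes are assumptions):

```python
# Merge a fresh batch of high-value pages over the previous master index,
# keeping the newest version of each page. Pages that missed this round's
# cutoff simply carry their old entry forward.
def merge_round(master, fresh):
    """master/fresh: dict url -> (timestamp, entry). Newest wins."""
    merged = dict(master)
    for url, (ts, entry) in fresh.items():
        if url not in merged or ts > merged[url][0]:
            merged[url] = (ts, entry)
    return merged

master = {"http://a": (100, "old"), "http://b": (100, "old")}
fresh = {"http://a": (200, "new")}   # http://b missed the cutoff this round
print(merge_round(master, fresh))    # http://b carries forward unchanged
```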
So now, with a 'specialized' GFS2 and a specialized BigTable, either or both with newfangled 'triggers', we have the tools (presumably) to do real-time updates. A page load updates its DB table metadata. It sees it went up in ranking, so it triggers a call to modify the associated keywords' tables (a thousand of them). Those keywords have some sort of batch delay (of, say, 2 seconds) to minimize the number of pushes to production read servers. So now we have an event-queue processor on the keyword table. This is a batch processor, BUT we don't necessarily have to drain the queue before pushing to production. We only accept as many requests as we can fit into a 2-second time slice. Presumably the algorithm is scalable to multiple machines, so some monitor can detect which keys can be grouped together on common hardware and which require more than one server (e.g. a single primary key being served by, say, 100 machines - say the keyword "Lindsay Lohan").
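A rough sketch of that time-sliced batch processor, with a made-up per-slice capacity standing in for whatever the real limit would be:

```python
# Trigger events accumulate per keyword; every time slice we push however
# many we can handle, without necessarily draining the queue. All names
# and limits here are assumptions.
import time
from collections import deque

QUEUE = deque()            # (keyword, update) events arriving from triggers
SLICE_SECONDS = 2.0
MAX_PER_SLICE = 1000       # made-up capacity per push

def push_to_production(batch):
    print(f"pushing {len(batch)} keyword updates to read servers")

def run_one_slice():
    deadline = time.monotonic() + SLICE_SECONDS
    batch = []
    while QUEUE and len(batch) < MAX_PER_SLICE and time.monotonic() < deadline:
        batch.append(QUEUE.popleft())
    if batch:
        push_to_production(batch)   # leftovers wait for the next slice

QUEUE.extend([("lindsay lohan", i) for i in range(1500)])
run_one_slice()   # pushes 1000, leaves 500 queued for the next slice
```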
In terms of enhancements since the last BigTable white-paper: obviously triggers make sense. On update, when a special filter condition is met, trigger a remote call to another table to incorporate a subset of the updated row. So for URL X, whenever a new URL+keyword primary key is inserted, immediately push the URL metadata to that keyword's row in the keyword table. Do something similar if some interesting aspect of the keyword has changed, OR if the base metadata for the overall page has changed (say some premium service or search data ranks the page higher overall, so all the URL keywords need to be reconsidered).
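Something like the following hypothetical trigger mechanism - none of this is a real BigTable API, just the shape of the idea:

```python
# On update, if a registered filter condition matches, a callback forwards
# a slice of the updated row to another table.
triggers = []   # (filter_fn, action_fn) pairs

def on_update(filter_fn, action_fn):
    triggers.append((filter_fn, action_fn))

def update_row(table, key, row):
    table[key] = row
    for filter_fn, action_fn in triggers:
        if filter_fn(key, row):
            action_fn(key, row)

url_table, keyword_table = {}, {}
on_update(lambda key, row: "keyword" in row,
          lambda key, row: keyword_table.setdefault(row["keyword"], []).append(
              {"url": key, "rank": row["rank"]}))   # push URL metadata across

update_row(url_table, "http://a", {"keyword": "linux", "rank": 0.9})
print(keyword_table["linux"])   # [{'url': 'http://a', 'rank': 0.9}]
```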
The other aspect could be related to making the incremental index pushes more efficient, from the writable keyword table to the read-only (ideally compacted) keyword tables that serve all the Google searches. With map-reduce, those would always have been slack-free and redundancy-free. But with BigTable, you'll likely have tremendous slack space and overwritten nodes. Plus you won't have an efficient search depth (it won't be log(n), but k + log(n), which can potentially double or triple the number of GFS[2] loads).
So ideally, you'd like to run a compaction on the data after your 2-second queued update. If you synchronized the operation - on the 0th millisecond of every even second - you'd initiate a compaction and publication. This is different from traditional BigTable, which runs compaction as a background process that is largely transparent to operation. I can think of several ways to minimize the cost of the compaction, and thus the number of SSTables that would have to be inserted into the public view's mapping. But I'm sure this is a complex endeavor that required a lot of debugging.
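For instance, a synchronized compact-and-publish might be scheduled like this (the compaction itself is stubbed to a dict merge; the even-second scheduling is the point):

```python
# Compact and publish on even-second boundaries, in contrast to BigTable's
# background compaction. Everything here is a speculative sketch.
import time

def compact_and_publish(sstables):
    merged = {}
    for table in sstables:        # newest last, so later writes win
        merged.update(table)
    return [merged]               # one compacted SSTable for the public view

def wait_for_even_second():
    now = time.time()
    next_boundary = (int(now // 2) + 1) * 2   # 0th millisecond of next even second
    time.sleep(next_boundary - now)

sstables = [{"linux": "v1"}, {"linux": "v2", "kernel": "v1"}]
wait_for_even_second()
print(compact_and_publish(sstables))   # [{'linux': 'v2', 'kernel': 'v1'}]
```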