Comment Re:Sounds inefficient (Score 2, Informative) 65
BigTable scales pretty well (go read its white paper) - though perhaps not as efficiently as map-reduce for something as simple as text-to-keyword statistics (otherwise why wouldn't they have used it all along).
I'll caveat this whole post: it's all based on my reading of the BigTable white-paper a year ago, plus having played with Cassandra, Hadoop, etc. occasionally since then. Feel free to call me out on any obvious errors. I've also looked at a lot of DB internals (Sybase, MySQL MyISAM/InnoDB, and PostgreSQL).
What I think you're thinking is that in a traditional RDBMS (which they hint at), you have a single logical machine that holds your data. That's not entirely true, because even with MySQL you can shard the F*K out of it. Consider putting a MySQL server on every possible combination of the first two letters of a Google search. Then take high-density combinations (like those beginning with "s") and split them out 3, 4, or 5 ways.
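For concreteness, here's a minimal sketch of that kind of prefix-based routing, assuming a hand-maintained shard map (all server names and split counts are made up for illustration):

```python
# Hypothetical prefix-based shard routing. The shard map, server names,
# and split counts are assumptions for illustration only.
SHARD_MAP = {
    "sa": ["mysql-sa-0", "mysql-sa-1", "mysql-sa-2"],  # high-density prefix, split 3 ways
    "qu": ["mysql-qu-0"],                              # low-density prefix, one server
}

def shard_for(query: str) -> str:
    """Route a search term to a MySQL server by its first two letters."""
    prefix = query[:2].lower()
    servers = SHARD_MAP.get(prefix, ["mysql-default"])
    # Within a split prefix, spread load by hashing the full term.
    return servers[hash(query) % len(servers)]

print(shard_for("sandwich"))  # one of the three "sa" shards
print(shard_for("slashdot"))  # "sl" isn't mapped, falls back to the default shard
```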
There are drastic differences in how the data is stored, but that's not strictly important, because there are column-oriented table stores in MySQL and other RDBMS systems. The key problem of sharding is what MySQL NDB Cluster (which is a primitive key-value store) and the other distributed-DB technologies that beat traditional DBs at scalability focus on.
BUT, the fundamental problem that page searches are dealing with is that I want a keyword to map to a page-view list (along with metadata such as first paragraph / icon / etc.) that is POPULATED from statistical analysis of ALL page-centric data. Meaning you have two [shardable] primary keys: one is a keyword and one is a web-page URL. But the web-page table has, essentially, foreign keys into potentially THOUSANDS of keyword records and vice versa. Thus a single web-page update would require thousands of locks.
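A toy illustration of why that fan-out hurts, with the two keyspaces modeled as plain dicts (the shapes and names here are my assumptions, not Google's schema):

```python
# Two shardable keyspaces: pages keyed by URL, index keyed by keyword.
# One page references many keywords and vice versa, so a single page
# update touches every referencing keyword row.
pages = {
    "http://example.com/a": {"rank": 0.7, "keywords": ["linux", "kernel"]},
}
# keyword -> list of (url, rank) pairs; thousands of entries in reality
keywords = {
    "linux": [("http://example.com/a", 0.7)],
    "kernel": [("http://example.com/a", 0.7)],
}

def update_page_rank(url, new_rank):
    """Each keyword row touched below would be a separate row lock in an RDBMS."""
    pages[url]["rank"] = new_rank
    for kw in pages[url]["keywords"]:
        keywords[kw] = [(u, new_rank if u == url else r) for u, r in keywords[kw]]

update_page_rank("http://example.com/a", 0.9)
print(keywords["linux"])  # [('http://example.com/a', 0.9)]
```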
In map-reduce, we avoid the problem. We start off with page text, mapped to keywords with some initial metadata about the parent page. In the reduce phase, we consolidate (via a merge sort) down to just the keywords, grouping the web pages into ever more complete lists of pages (ranked by their original metadata, which includes co-keywords). In the end, you have a maximally compact index file, which you can replicate to the world using traditional BigTable (or even big iron if you really wanted).
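Here's a minimal sketch of those two phases in plain Python, standing in for a real Hadoop/MapReduce job (the page-metadata shape is an assumption):

```python
# Sketch of the map and reduce phases for an inverted index.
from collections import defaultdict

def map_phase(url, text, rank):
    """Emit (keyword, page-metadata) pairs for one page."""
    for word in set(text.lower().split()):
        yield word, (url, rank)

def reduce_phase(mapped):
    """Consolidate pairs into keyword -> ranked page list (the index file)."""
    index = defaultdict(list)
    for word, entry in mapped:
        index[word].append(entry)
    for word in index:
        index[word].sort(key=lambda e: -e[1])  # best-ranked pages first
    return dict(index)

crawl = [("http://a", "linux kernel news", 0.9), ("http://b", "kernel panic", 0.4)]
pairs = (p for url, text, rank in crawl for p in map_phase(url, text, rank))
print(reduce_phase(pairs)["kernel"])  # [('http://a', 0.9), ('http://b', 0.4)]
```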
The problem, of course, is that you can't complete the reduce phase until all web pages are fully downloaded and scanned. ALL web pages. Of course, you do an hourly job which takes only high-value web pages and merges them with the previous master list. So you have essentially static pre-processed data which is overwritten by a subset of fresh data. But you still have slowest-web-page syndrome. OK, so solve this problem by ignoring web-load requests that don't complete in time - they'll be used in the next update round. Well, you still have the issue of massive web pages that take a long time to process. OK, so we'll have a cut-off for them too. Mapping nodes which take too long don't get included this round (you're merging against your last valid value, so if there isn't a newer version, the old one naturally carries forward). But the merge sort itself is still MASSIVELY slow. You can't get 2-second turnaround on high-importance web sites. You're still building a COMPLETE index every time.
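That carry-forward merge might look something like this hedged sketch (the timestamps and entry shapes are assumptions):

```python
# Merge a fresh batch of high-value pages over the previous master index,
# keeping the newest version of each page. Pages that missed this round's
# cutoff simply carry their old entry forward.
def merge_round(master, fresh):
    """master/fresh: dict url -> (timestamp, entry). Newest wins."""
    merged = dict(master)
    for url, (ts, entry) in fresh.items():
        if url not in merged or ts > merged[url][0]:
            merged[url] = (ts, entry)
    return merged

master = {"http://a": (100, "old"), "http://b": (100, "old")}
fresh = {"http://a": (200, "new")}   # http://b missed the cutoff this round
print(merge_round(master, fresh))    # http://b carries forward unchanged
```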
So now, with a 'specialized' GFS2 and a specialized BigTable, either or both with newfangled 'triggers', we have the tools (presumably) to do real-time updates. A page load updates its DB table metadata. It sees it went up in ranking, so it triggers a call to modify the associated keywords' tables (a thousand of them). Those keywords have some sort of batch delay (of, say, 2 seconds) to minimize the number of pushes to production read servers. So now we have an event-queue processor on the keyword table. This is a batch processor, BUT we don't necessarily have to drain the queue before pushing to production. We only accept as many requests as we can fit into a 2-second time slice. Presumably the algorithm is scalable to multiple machines, so some monitor can detect which keys can be grouped together on common hardware and which require more than one server (e.g. a single primary key being served by, say, 100 machines - say the keyword "Lindsay Lohan").
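A rough sketch of that time-sliced batch processor, with a made-up per-slice capacity standing in for whatever the real limit would be:

```python
# Trigger events accumulate per keyword; every time slice we push however
# many we can handle, without necessarily draining the queue. All names
# and limits here are assumptions.
import time
from collections import deque

QUEUE = deque()            # (keyword, update) events arriving from triggers
SLICE_SECONDS = 2.0
MAX_PER_SLICE = 1000       # made-up capacity per push

def push_to_production(batch):
    print(f"pushing {len(batch)} keyword updates to read servers")

def run_one_slice():
    deadline = time.monotonic() + SLICE_SECONDS
    batch = []
    while QUEUE and len(batch) < MAX_PER_SLICE and time.monotonic() < deadline:
        batch.append(QUEUE.popleft())
    if batch:
        push_to_production(batch)   # leftovers wait for the next slice

QUEUE.extend([("lindsay lohan", i) for i in range(1500)])
run_one_slice()   # pushes 1000, leaves 500 queued for the next slice
```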
In terms of enhancements since the last BigTable white-paper: obviously triggers make sense. On update, when a special filter condition is met, trigger a remote call to another table to incorporate a subset of the updated row. So for URL X, whenever a new URL+keyword primary key is inserted, immediately push the URL metadata to that keyword's row in the keyword table. Do something similar if some interesting aspect of the keyword has changed, OR if the base metadata for the overall page has changed (say some premium service or search data ranks the page higher overall, so all the URL keywords need to be reconsidered).
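Something like the following hypothetical trigger mechanism - none of this is a real BigTable API, just the shape of the idea:

```python
# On update, if a registered filter condition matches, a callback forwards
# a slice of the updated row to another table.
triggers = []   # (filter_fn, action_fn) pairs

def on_update(filter_fn, action_fn):
    triggers.append((filter_fn, action_fn))

def update_row(table, key, row):
    table[key] = row
    for filter_fn, action_fn in triggers:
        if filter_fn(key, row):
            action_fn(key, row)

url_table, keyword_table = {}, {}
on_update(lambda key, row: "keyword" in row,
          lambda key, row: keyword_table.setdefault(row["keyword"], []).append(
              {"url": key, "rank": row["rank"]}))   # push URL metadata across

update_row(url_table, "http://a", {"keyword": "linux", "rank": 0.9})
print(keyword_table["linux"])   # [{'url': 'http://a', 'rank': 0.9}]
```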
The other aspect could be related to making the incremental index pushes more efficient, from the writable keyword table to the read-only (ideally compacted) keyword tables that serve all the Google searches. With map-reduce, those would always have been slack-free and redundancy-free. But with BigTable, you'll likely have tremendous slack space and overwritten nodes. Plus you won't have an efficient search depth (it won't be log(n), but k + log(n), which can potentially double or triple the number of GFS[2] loads).
So ideally, you'd like to run a compaction on the data after your 2-second queued update. If you synchronized the operation - on the 0th millisecond of every even second - you'd initiate a compaction and publication. This is different from traditional BigTable, which runs compaction as a background process that is largely transparent to operation. I can think of several ways to minimize the cost of the compaction, and thus the number of SSTables that would have to be inserted into the public view's mapping. But I'm sure this is a complex endeavor that required a lot of debugging.
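For instance, a synchronized compact-and-publish might be scheduled like this (the compaction itself is stubbed to a dict merge; the even-second scheduling is the point):

```python
# Compact and publish on even-second boundaries, in contrast to BigTable's
# background compaction. Everything here is a speculative sketch.
import time

def compact_and_publish(sstables):
    merged = {}
    for table in sstables:        # newest last, so later writes win
        merged.update(table)
    return [merged]               # one compacted SSTable for the public view

def wait_for_even_second():
    now = time.time()
    next_boundary = (int(now // 2) + 1) * 2   # 0th millisecond of next even second
    time.sleep(next_boundary - now)

sstables = [{"linux": "v1"}, {"linux": "v2", "kernel": "v1"}]
wait_for_even_second()
print(compact_and_publish(sstables))   # [{'linux': 'v2', 'kernel': 'v1'}]
```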