Google Experiments
gafferted writes "The boffins at Google have been experimenting with new toys, such as Keyboard Shortcuts and glossary, but most fun is Google Sets. Try "green, purple, red" to get a set of 40 different colours. Try a set that contains both Richard Stallman and Bill Gates, see what Google associates with Slashdot or ask for a set of rude words."
Google Blog (Score:5, Interesting)
And no, it's not my site. I just think it's cool.
Fark-like Not Safe For Work (Score:2, Interesting)
Yeah, the thing doesn't link to boobies, but grepping for incoming text vs. grepping for inbound boobies is a tad easier for log generation.
Besides, I thought rude words just involved being insensitive, not foul.
Re:Google *do* cach itself, and the result is funn (Score:2, Interesting)
If you're talking about the language thing, it is your mother tongue.
Re:Very Impressive (Score:3, Interesting)
Quick theory on how Google Sets works (Score:2, Interesting)
Each query phrase produces a set of documents, i.e. web pages. The intersection of those sets gives a small set of docs, which is pretty much what a normal Google query (or any search engine) returns when all the query terms are ANDed. The new feature is then to intersect the terms of all the docs in that doc-intersection set. That is, return all the terms that are common to all the docs.
e.g. in pseudo-code:
Assume
- G is the normal Google search engine.
- G.query("search phrase") returns a set of references (URLs) to docs, e.g. {u1, u2, u3, ...}.
- u.terms() returns a set of all the words contained in the doc referenced by u, e.g. if u=="http://slashdot.org", then u.terms() == {"news", "for", "nerds", "slashdot", etc.}.
- * is a set intersection operator.
s1 = G.query(q1); s2 = G.query(q2); s3 = G.query(q3); ...
docSets = s1 * s2 * s3 * ...; // so docSets contains the URLs of the docs that have all the query terms
// ws will contain the running intersection of the set of words in all the docs
ws = docSets[0].terms();
forall url in docSets { ws = ws * url.terms(); }
return ws;
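For anyone who wants to poke at it, here is a minimal runnable sketch of the pseudo-code above in Python. The toy index, the URLs in it, and the names query and guess_set are all made up for illustration; this only mirrors the guess in this post and says nothing about how Google actually does it.

```python
# Hypothetical toy "web": URL -> text of the page. Stands in for G's index.
PAGES = {
    "u1": "red green blue purple paint colours",
    "u2": "green purple red crayons colours for kids",
    "u3": "red wine and green salad",
    "u4": "purple red green yellow colours chart",
}


def terms(url):
    """u.terms(): the set of words in the doc referenced by url."""
    return set(PAGES[url].split())


def query(phrase):
    """G.query(): the set of URLs whose doc contains every word of the phrase."""
    words = set(phrase.split())
    return {url for url in PAGES if words <= terms(url)}


def guess_set(*phrases):
    """Terms common to every doc that matches all the query phrases."""
    if not phrases:
        return set()
    # docSets = s1 * s2 * s3 * ...
    docs = set.intersection(*(query(p) for p in phrases))
    if not docs:
        return set()  # no doc matches all phrases, so nothing to intersect
    # Running intersection of the word sets of all matching docs.
    return set.intersection(*(terms(u) for u in docs))


print(sorted(guess_set("green", "purple", "red")))
# ['colours', 'green', 'purple', 'red']
```

Note that the query words themselves always survive the intersection, so a real version would presumably strip them (and stop words) before showing the result.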
So my guess is that ws is the final set of terms returned by Google Sets. Of course, the words should be sorted by some meaningful metric, e.g. frequency. This is all very easy to implement and fast to run, because both the document-set intersection and the word-set intersections can be computed quickly when documents and words are represented as sparse vectors.
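The sorting step can be sketched in Python too. I'm assuming "frequency" means the number of matching docs that contain a term (the post doesn't pin this down), and the docs dict, its URLs, and the name rank_terms are hypothetical.

```python
from collections import Counter


def rank_terms(doc_terms, query_terms=()):
    """Rank terms by how many docs contain them, most common first.

    doc_terms: dict mapping URL -> set of words (hypothetical toy data).
    query_terms: words to leave out, since the user already typed them.
    """
    skip = set(query_terms)
    counts = Counter()
    for words in doc_terms.values():
        counts.update(words - skip)
    # Sort by descending count, then alphabetically so ties come out stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    "u1": {"red", "green", "blue", "purple", "colours"},
    "u2": {"red", "green", "purple", "colours", "crayons"},
    "u4": {"red", "green", "purple", "yellow", "colours", "blue"},
}
ranked = rank_terms(docs, query_terms={"red", "green", "purple"})
print(ranked)
# [('colours', 3), ('blue', 2), ('crayons', 1), ('yellow', 1)]
```

Counting instead of strictly intersecting keeps terms like "blue" that show up in most but not all of the docs, which seems closer to what you'd want from a set of colours.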