Interview With Google's Director of Research

Posted by Hemos on Thu Jun 21, 2001 10:55 AM
from the excellent-interview dept.
Cialti writes "Salon has a very interesting article with Monika Henzinger, Google's Director of Research, about their search technology and where they're going with it."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Re:[ot]Google's data structure? by Anonymous Coward (Score:1) Thursday June 21 2001, @07:13AM
  • Re:Voice activated search engine by Anonymous Coward (Score:1) Thursday June 21 2001, @10:21AM
  • you want choices? by Anonymous Coward (Score:1) Thursday June 21 2001, @03:47PM
  • Re:Voice activated search engine by Anonymous Coward (Score:2) Thursday June 21 2001, @08:37AM
  • Re:Actual Questions for Ask Jeeves by Tony Shepps (Score:1) Thursday June 21 2001, @08:09PM
  • Actual Questions for Ask Jeeves by Tony Shepps (Score:2) Thursday June 21 2001, @09:18AM
  • Yikes, Zephyr Interactive? by drsoran (Score:1) Thursday June 21 2001, @08:37AM
  • Re:Prepositions need love too by Malc (Score:2) Thursday June 21 2001, @07:26AM
  • Deja by Tet (Score:2) Thursday June 21 2001, @07:44AM
  • Google also does Mac searches! by GPS Pilot (Score:1) Thursday June 21 2001, @11:14AM
  • Re:[ot]Google's data structure? by K-Man (Score:2) Thursday June 21 2001, @08:37AM
  • by K-Man (4117) on Thursday June 21 2001, @10:11AM (#135282)
    That's true if the data is changing. However most search engines do web crawls in large chunks, and index the data once in one large block. Under such conditions dynamic management of hit lists and other data structures is not necessary. Basically, the bytes are packed as tight as they can get them so that it all fits into memory.

    As far as I can tell from their paper [nec.com], Google manages its web crawls the same way. It partitions the data into "barrels" and indexes each separately. Once the indices are built, they aren't updated. They also extend the hit lists to include word position and some other attributes for each hit.
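The build-once approach described above can be sketched roughly as follows. This is a minimal illustration, not Google's actual layout: the function name `build_barrel`, the whitespace tokenizer, and the `(doc_id, position)` hit format are all assumptions made for the example.

```python
from collections import defaultdict

def build_barrel(docs):
    """Build a read-only positional index for one batch ("barrel") of pages.

    docs maps a document id to its text. The result maps each word to a
    tuple of (doc_id, position) hits, ordered by doc_id and then position.
    The tokenizer (lowercase + whitespace split) is a simplifying assumption.
    """
    hits = defaultdict(list)
    for doc_id in sorted(docs):
        for position, word in enumerate(docs[doc_id].lower().split()):
            hits[word].append((doc_id, position))
    # Freeze the hit lists: once the barrel is built it is never updated,
    # so no dynamic list management is needed.
    return {word: tuple(postings) for word, postings in hits.items()}

barrel = build_barrel({1: "hail to the chief", 2: "the chief engineer"})
print(barrel["chief"])  # ((1, 3), (2, 1))
```

Because each barrel is immutable, a new crawl would simply produce a new barrel rather than patching an old one.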
  • Re:Prepositions need love too by rsidd (Score:1) Thursday June 21 2001, @09:30AM
  • Re:Prepositions need love too by Zagadka (Score:2) Thursday June 21 2001, @06:07PM
  • Re:Voice activated search engine by FFFish (Score:2) Thursday June 21 2001, @07:14PM
  • Re:Disturbing Search Requests by ergo98 (Score:1) Thursday June 21 2001, @08:16AM
  • by ergo98 (9391) <dennis.forbes@gmail.com> on Thursday June 21 2001, @07:30AM (#135287) Homepage Journal

    Google absolutely blows away the competition; however, it is humorous seeing entries in my log file related to people looking for masturbation tips (from the beginner level "How To" style queries, to full blown searches for advanced techniques). The page [yafla.com] in question is entitled "Hey Jerk : Get Off My Computer!" (and relates to pop-up ad windows) and I'm, uh, proud to see that it ranks #2 for searches for "jerk off technique" (I've had dozens of related hits appearing). While it is humorous seeing searches go a little off-track, I am very curious how many consumers know that each link you follow passes on where you came from. For instance, I see log entries like

    200x-xx-xx xx:xx:xx xxx.xxx.xxx.xxx GET /rants/jerk/index.htm 200 5986 334 270 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;+Dig Ext) http://google.yahoo.com/bin/query?p=jerk+off&b=21&hc=0&hs=5
    -or-
    200x-xx-xx xx:xx:xx xxx.xxx.xxx.xxx GET /rants/jerk/index.htm 200 5986 437 1292 Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;+Dig Ext;+sureseeker.com) http://www.google.com/search?q=guys+who+jerk+off
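
To make the point about referrers concrete: the search phrase sits in the query string of the referring URL, so any site owner can pull it back out. A small sketch using Python's standard `urllib.parse`; the helper name `search_terms` and the example URLs are made up for illustration, and the parameter names `q` and `p` are simply the ones visible in the two log lines above.

```python
from urllib.parse import urlsplit, parse_qs

def search_terms(referrer):
    """Return the search phrase embedded in a referrer URL, or None."""
    query = parse_qs(urlsplit(referrer).query)
    for key in ("q", "p"):  # parameter names seen in the log lines above
        if key in query:
            return query[key][0]
    return None

# Hypothetical referrers, not taken from any real log.
print(search_terms("http://www.google.com/search?q=hail+to+the+chief"))
print(search_terms("http://google.yahoo.com/bin/query?p=pop+up+ads&b=21"))
```

`parse_qs` decodes the `+` signs back into spaces, so the visitor's query is recovered exactly as typed.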

  • Re:Masturbation Techniques by daviddennis (Score:2) Thursday June 21 2001, @11:28AM
  • Re:why I like google by daviddennis (Score:2) Thursday June 21 2001, @11:37AM
  • Voice activated search engine by funkman (Score:2) Thursday June 21 2001, @07:05AM
  • by funkman (13736) on Thursday June 21 2001, @08:38AM (#135291) Homepage
    I love when people don't read the article and post. From page 2 of the article:
    What other kinds of search are you developing?

    We have a voice-search project with BMW -- BMW wants to put voice search into their 7 Series cars. They want to put microphones in the cars -- you can just speak whatever your search is and then it gives you answers back on a display. Then you just say the result number and the search jumps to that result.

  • Re:Perks by ethereal (Score:2) Thursday June 21 2001, @07:38AM
  • Re:Regex: won't happen by griffjon (Score:2) Thursday June 21 2001, @11:01AM
  • I'm just waiting for them to implement a RegEx interface. now THAT would be some love for the geeks out here.
  • Dumb question (?) by DonK (Score:2) Thursday June 21 2001, @08:07AM
  • Re:Actual Questions for Ask Jeeves by billybob (Score:1) Thursday June 21 2001, @12:11PM
  • by King Babar (19862) on Thursday June 21 2001, @08:59AM (#135297) Homepage
    For example, searching for: "Hail to the chief" would ignore to and the. In order to actually search for the phrase (which I indicated that I wanted to do by surrounding it in quotation marks), I would have to type "Hail +to +the chief". Hardly user-friendly.

    And, actually, that's not quite right, either. It's apparently always going to blow off your "the" (I just tried it). This is, alas, a seriously hard problem. What you were doing was looking for what actually amounts to a single chunk of information: the title of a fanfare played for the president. Unfortunately, the English version of the title is four words long although the title itself might in some cases act just like a single word (or noun phrase). So:

    That was one of the worst "Hail to the chief"s that I have ever heard.
    Yes, you might even pluralize it just like a noun. So that's one problem right there: search terms that really are tantamount to a single lexical item might be four or more words long, and might even be inflected.

    Ideally, you'd like to index separately these multi-word chunks, especially if you can prove they occur way more often than expected. So in your example, "hail" and "chief" co-occur on about 28,000 pages, while "hail" alone is on 510,000 and "chief" alone is on over 1,500,000. If Google indexes 1.5 billion pages (or so), and the terms were independent, then you'd expect something like 500 co-occurrences (510,000 × 1,500,000 / 1.5 billion), and 28,000 is so outrageously out of line you would know that something is up.

    Now, I'm guessing that *local* co-occurrence information is eventually going to prove even handier in this regard. So, for example, "hail to" comes up 157,000 times, which is about 1/3 of all "hail" pages. That's very unlikely unless there's something systematic (and very possibly exploitable) going on.
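
For anyone who wants to check the arithmetic, here is a quick sketch that plugs the page counts quoted above into the independence assumption. The function name is invented for the example, and the 1.5 billion index size is the rough figure from the text, not an official number.

```python
def expected_cooccurrence(count_a, count_b, total_pages):
    """Pages expected to contain both terms if they occurred independently."""
    return count_a * count_b / total_pages

total = 1_500_000_000            # rough index size assumed in the text
hail, chief = 510_000, 1_500_000
observed_both = 28_000
hail_to = 157_000

expected = expected_cooccurrence(hail, chief, total)
print(f"expected under independence: {expected:.0f} pages")
print(f"observed / expected: {observed_both / expected:.0f}x")
print(f"share of 'hail' pages with 'hail to': {hail_to / hail:.0%}")
```

With these counts the independence estimate comes out at roughly 510 pages, so the observed 28,000 is about 55 times higher than chance would suggest.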

    The big problem is that you can't really do much with function words alone, since they're just too staggeringly frequent. In running English text, the frequency of "the" is just about 70,000 per million. In other words, 7% of all English text consists of the definite article, and most web pages contain many distinct copies. You've got to kill that. Unfortunately, by omitting "the", you lose a lot of potentially useful information about definiteness of the noun phrase. In the "hail to the chief" example, the song title itself is just one example of a (somewhat) productive expression "hail to [definite-NP]", which has a specific kind of meaning implied (interestingly, usually sarcastic or abusive). Picking up on this could be very useful.

    So suppose I typed into deja "bush mass-mooning Gothenburg". I'll get 9 hits. That's nice, but google might want to do more, and provide additional examples of president (or candidate) Bush being derided in public. Or maybe give me pages that refer to the same incident being described as the Swedish version of "hail to the chief".

    So there is no doubt that function words need love, but I'd argue for a love that seeks to understand them and their weird little contributions to meaning rather than just a way to make sure you can nail a song title exactly.

  • the technology behind google by dizco (Score:2) Thursday June 21 2001, @08:37AM
  • by daytrip (25725) on Thursday June 21 2001, @09:10AM (#135299) Homepage
    You'll probably get a reasonable idea at this page:

    http://www-db.stanford.edu/~backrub/google.html [stanford.edu].

    Also, try a lookup for a bloom filter [google.com], which google uses, I think. Most search engines work by inverting the index, and then merging the lists. Taking the intersection of all the keywords gives you the membership, then you apply ranking to the membership. Pretty simple concept. I don't know of any search engines that use a trie, or use any form of stemming.

    -js
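
The "invert, merge, then rank" idea from the comment above could look something like the sketch below. It assumes each keyword maps to a sorted list of document ids and that some scoring function already exists; the names `intersect`, `search`, and the toy scores are illustrative only, and no real engine's ranking is implied.

```python
def intersect(a, b):
    """Merge two sorted lists of doc ids, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, terms, score):
    """Intersect the posting lists for all terms, then rank the members."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    members = lists[0]
    for postings in lists[1:]:
        members = intersect(members, postings)
    return sorted(members, key=score, reverse=True)

index = {"hail": [1, 4, 7, 9], "chief": [2, 4, 9, 12]}   # toy inverted index
scores = {4: 0.2, 9: 0.8}                                # hypothetical ranks
print(search(index, ["hail", "chief"], scores.get))      # [9, 4]
```

The merge walks each list once, so the cost is linear in the combined length of the posting lists.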
  • Re:Is she hot or not? by ashitaka (Score:1) Thursday June 21 2001, @08:03AM
  • Re:Smarter Searches by Xofer D (Score:2) Thursday June 21 2001, @08:26AM
  • German queries at fireball.de by harmonica (Score:2) Thursday June 21 2001, @09:41AM
  • MP3 of that talk (Score:3)

    by harmonica (29841) on Thursday June 21 2001, @10:03AM (#135303)
    You probably mean The Technology Behind Google [ddj.com]. It's a 73 min MP3, very interesting!
  • Re:Masturbation Techniques by Krilomir (Score:2) Thursday June 21 2001, @11:46AM
  • Re:MP3 of that talk by htmlboy (Score:1) Thursday June 21 2001, @01:50PM
  • Re:Yeah Suckah! (Score:5)

    by htmlboy (31265) on Thursday June 21 2001, @08:37AM (#135306)
    Google gave a talk for ACM here last semester (got a t-shirt, woohoo!). The speaker described how they're used. They have thousands of linux boxes, and they're used to store websites (to be searched and cached copies) and to do searching on the pages they have (I think that's how it went). I got the impression that linux is used because it's free (important with thousands of licenses), it's reliable, and they found it a good platform for the searching backend software.

    an interesting side note: they found that when one of the linux boxes stops working, it's more cost effective to replace it than to fix the problem (hardware, at least). google throws out a lot of good hardware because of that. the lecture hall was begging for a student donation program of some sort when the google guy mentioned that :)

    chris
  • by dead_penguin (31325) on Thursday June 21 2001, @07:54AM (#135307)
    With the giant display of scrolling queries (filtered, though) they have in their lobby, I think it's time to start sending little messages to the Google staff using searches.

    "Help, I'm stuck in here!!" is an obvious classic to try. If enough of us do it, it might even get noticed...

    "Intelligence is the ability to avoid doing work, yet getting the work done".
  • Re:Smarter Searches by gorilla (Score:2) Thursday June 21 2001, @08:57AM
  • Re:[ot]Google's data structure? by costas (Score:2) Thursday June 21 2001, @08:35AM
  • Re:Send messages to the staff! by Tofuhead (Score:2) Thursday June 21 2001, @02:19PM
  • Gnut by QuantumG (Score:1) Thursday June 21 2001, @07:29PM
  • Sing it brother by QuantumG (Score:1) Thursday June 21 2001, @07:36PM
  • Re:Dumb question (?) by QuantumG (Score:1) Thursday June 21 2001, @07:50PM
  • Re:Google is still sloppy and second-rate. by QuantumG (Score:1) Thursday June 21 2001, @07:56PM
  • I remember when... by T3kno (Score:1) Thursday June 21 2001, @07:53AM
  • Re:Smarter Searches by Louis Savain (Score:2) Thursday June 21 2001, @09:01AM
  • Re:Smarter Searches by Louis Savain (Score:2) Thursday June 21 2001, @10:26AM
  • Re:Smarter Searches by Louis Savain (Score:2) Thursday June 21 2001, @12:23PM
  • Smarter Searches (Score:4)

    by Louis Savain (65843) on Thursday June 21 2001, @07:32AM (#135319) Homepage
    Monika Henzinger: You can try to return documents that are specifically on this topic. We're developing more sophisticated techniques to return documents that might not mention the query words, but are [still relevant to] the topic. We're getting away from just pure word matches and getting more into topics.

    This is interesting. I wonder if there might be a way for the engine to have a two way back-and-forth "conversation" with the user. IOW, if the engine interprets the query to have several possible meanings, a few multiple choice questions might clarify the meaning and narrow the search parameters. I think this could be more helpful than doing a blind guess of the user's intention.
  • Re:Send messages to the staff! by binner (Score:1) Thursday June 21 2001, @12:24PM
  • Re:I remember when... by TheShadow (Score:1) Thursday June 21 2001, @08:42AM
  • phone book function by spasm (Score:2) Thursday June 21 2001, @07:44AM
  • Re:[ot]Google's data structure? by markprus (Score:1) Thursday June 21 2001, @08:15AM
  • by LocalYokel (85558) on Thursday June 21 2001, @07:22AM (#135324) Homepage Journal
    Search terms have all kinds of problems.

    I had the same problem yesterday when I was searching for "quotes about Shakespeare". "to be or not to be" (with quotes) pulls up the proper category, but the first result it comes up with is the GNU homepage, because GNU's not Unix!. The second link is to Am I Hot or Not, BTW...

    Strangely enough, it warns about "or", and if I want to use it in a search, it must be in CAPS, but then how do I search for something in ORegon? For some reason, it says nothing about "not", so I don't know what's up with their search terms anymore.

  • Search Query by BierGuzzl (Score:1) Thursday June 21 2001, @08:09AM
  • isn't Google always getting itself in the news? by Delrin (Score:1) Thursday June 21 2001, @07:01AM
  • Re:Yahoo took a much bigger leap - it licensed Goo by Delrin (Score:1) Thursday June 21 2001, @08:41AM
  • Re:Prepositions need love too by TimMann (Score:1) Thursday June 21 2001, @06:16PM
  • by zpengo (99887) on Thursday June 21 2001, @07:01AM (#135329) Homepage
    A recent development in Google technology left me very dismayed -- They started ignoring "common words."

    This makes sense on a general level, but when you try searching for a phrase embedded in quotation marks, it's frustrating to have Google decide which parts of a literal string to search for and which to ignore. If I had wanted it to ignore parts of it, I wouldn't have indicated that it was a literal phrase, dangnabbit!

    It is possible to force Google to include the common words you typed in the search phrase, but you have to add an AltaVista-style '+' before each one.

    For example, searching for: "Hail to the chief" would ignore to and the. In order to actually search for the phrase (which I indicated that I wanted to do by surrounding it in quotation marks), I would have to type "Hail +to +the chief". Hardly user-friendly.

    Oh, well.
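
The behaviour complained about above can be modelled in a few lines. This is only a sketch of the described rule (common words dropped unless prefixed with '+'); the stop-word list and the `parse_query` helper are hypothetical, not Google's actual list or parser.

```python
STOP_WORDS = {"to", "the", "a", "of", "or"}  # hypothetical stop-word list

def parse_query(query):
    """Split a query into (kept, ignored) terms.

    Common words are ignored unless the user forces them with a '+' prefix.
    """
    kept, ignored = [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            kept.append(token[1:])      # forced term: always searched
        elif token in STOP_WORDS:
            ignored.append(token)       # silently dropped common word
        else:
            kept.append(token)
    return kept, ignored

print(parse_query("Hail to the chief"))    # (['hail', 'chief'], ['to', 'the'])
print(parse_query("Hail +to +the chief"))  # (['hail', 'to', 'the', 'chief'], [])
```

A friendlier parser would presumably skip the stop-word filter entirely for anything inside quotation marks, which is the behaviour being asked for here.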

  • Re:[ot]Google's data structure? by jon_c (Score:1) Thursday June 21 2001, @08:01AM
  • Re:why *I* like google by jon_c (Score:1) Thursday June 21 2001, @09:53AM
  • Re:[ot]Google's data structure? by jon_c (Score:1) Thursday June 21 2001, @09:58AM
  • Northernlight by JPMH (Score:2) Thursday June 21 2001, @09:32AM
  • Re:Much like McDonalds by Carnivore (Score:1) Thursday June 21 2001, @12:11PM
  • Re:Smarter Searches by Fencepost (Score:2) Thursday June 21 2001, @07:56AM
  • by jwater (112092) on Thursday June 21 2001, @08:48AM (#135336)
    Here at Slashdot it seems like people can only complain about a service. Most of the posts are rants without any understanding of the dynamics behind them.

    I think we all could use more understanding of the topic. A link to the paper that started it all here [nec.com].

    1. When was the last time that "to" or any other preposition helped the average query? Your Grandmother does not know that this word is meaningless 99.9% of the time, so google tries to improve their relevancy.

    2. Google has not sold out. Their ads are the most simple in the industry. They give access to users like you and me at a reasonable rate. Who wants to wait for 345x123 pixel banner ads anyways?

    3. Have you noticed the spelling feature? Google will correct your spelling. This is a function of the tons of bigrams that they have stored.

    4. Here is a link to more papers [Warning: Technical] here [nec.com].
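
On point 3, the list above does not say how the bigrams are actually used. One possible reading, shown purely as a guess, is to compare character bigrams of the typed word against known words and suggest the closest match. The Dice coefficient, the padding characters, and the tiny vocabulary below are all assumptions for the sake of the sketch.

```python
def bigrams(word):
    """Set of character bigrams, with ^ and $ marking the word boundaries."""
    padded = f"^{word.lower()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def suggest(word, vocabulary):
    """Return the vocabulary word whose bigrams overlap most with `word`."""
    target = bigrams(word)

    def dice(candidate):
        other = bigrams(candidate)
        return 2 * len(target & other) / (len(target) + len(other))

    return max(vocabulary, key=dice)

print(suggest("serach", ["search", "serial", "reach"]))  # search
```

A production spelling corrector would also weigh how often each candidate appears in queries, which this sketch ignores.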

  • Re:More on language translation... by FTL (Score:2) Thursday June 21 2001, @08:40AM
  • its been said by zerocool^ (Score:1) Thursday June 21 2001, @07:16AM
  • Re:Smarter Searches by PeterBecker (Score:1) Friday June 22 2001, @12:03AM
  • Re:[ot]Google's data structure? by Angelo Torres (Score:1) Thursday June 21 2001, @10:35AM
  • read the article by mr_gerbik (Score:2) Thursday June 21 2001, @07:25AM
  • Re:read the article by mr_gerbik (Score:2) Thursday June 21 2001, @08:07AM
  • Re:why *I* like google by mr_gerbik (Score:2) Thursday June 21 2001, @10:12AM
  • why I like google (Score:3)

    by mr_gerbik (122036) on Thursday June 21 2001, @08:13AM (#135344)
    who else has linux only searches?.. and not only that, a cool linux google logo!

    http://www.google.com/linux [google.com]

    -gerbik
  • Re:why I like google by ryanf (Score:1) Thursday June 21 2001, @10:30AM
  • How to improve the timeliness searches? by Jim Madison (Score:1) Thursday June 21 2001, @09:16PM
  • SatireWire: interview with Jeeves by mrBlond (Score:2) Thursday June 21 2001, @07:27AM
  • Re:read the article by Eloquence (Score:1) Thursday June 21 2001, @08:01AM
  • Re:Disturbing Search Requests by don_carnage (Score:2) Thursday June 21 2001, @08:07AM
  • Re:Disturbing Search Requests by don_carnage (Score:2) Thursday June 21 2001, @08:31AM
  • Re:new search engine by xp (Score:1) Thursday June 21 2001, @07:33AM
  • Re:Prepositions need love too by BitchAss (Score:2) Thursday June 21 2001, @08:02AM
  • Re:Why Google is my favorite search engine by PingXao (Score:1) Thursday June 21 2001, @09:57AM
  • It's good to know... by Rackemup (Score:1) Thursday June 21 2001, @07:03AM
  • Re:[ot]Google's data structure? by kerrbear (Score:1) Thursday June 21 2001, @09:24AM
  • Yahoo took a much bigger leap - it licensed Google by arete (Score:2) Thursday June 21 2001, @08:03AM
  • What do you expect, a monolith? by arete (Score:2) Friday June 22 2001, @09:17AM
  • Re: weird google pages (was "why *I* like google") by wishus (Score:2) Thursday June 21 2001, @11:32AM
  • method for increasing hits by jvj24601 (Score:2) Thursday June 21 2001, @08:35AM
  • Re:I remember when... by HoaryCripple (Score:1) Thursday June 21 2001, @10:44AM
  • new search engine by Aalschover (Score:2) Thursday June 21 2001, @07:10AM
  • Re:Voice activated search engine by SpookyFish (Score:1) Thursday June 21 2001, @07:12AM
  • Re:Yeah Suckah! by i0lanthe (Score:1) Thursday June 21 2001, @09:13AM
  • Re:Deja by i0lanthe (Score:1) Thursday June 21 2001, @09:18AM
  • Re:method for increasing hits by b0bby (Score:1) Thursday June 21 2001, @10:40AM
  • Re:isn't Google always getting itself in the news? by markov_chain (Score:2) Thursday June 21 2001, @09:12AM
  • Re:isn't Google always getting itself in the news? by DejaMorgana (Score:1) Thursday June 21 2001, @11:27AM
  • by sdo1 (213835) on Thursday June 21 2001, @08:07AM (#135368) Journal
    These translation services (such as BabelFish on AltaVista) still have quite a way to go before they're completely reliable. Especially when you translate from one language to another, you might end up with something similar to this (translated from English to Korean and then back to English again):

    Will be complete and on the front of the L it will be reliable to translation service (as the BabelFish is same) a yet positively is thin method to Altavista. It was special and when you from one language also translate in different one thing, you in child one silence comfort ended to this, (and the that time English back mac tayn Great Britain from again under translate again in a Korean):

    -S
  • Regex: won't happen by brlewis (Score:2) Thursday June 21 2001, @10:31AM
  • Re:Smarter Searches by timboy3 (Score:1) Thursday June 21 2001, @08:47AM
  • Re:Smarter Searches by timboy3 (Score:1) Thursday June 21 2001, @11:13AM
  • Re:isn't Google always getting itself in the news? by Weh (Score:1) Thursday June 21 2001, @08:48AM
  • Re:Send messages to the staff! by jaredcat (Score:1) Thursday June 21 2001, @08:38AM
  • probably a suffix tree. by gagganator (Score:1) Thursday June 21 2001, @08:37AM
  • Re:Prepositions need love too by 3-State Bit (Score:1) Thursday June 21 2001, @01:15PM
  • Re:Prepositions need love too by 3-State Bit (Score:1) Sunday June 24 2001, @07:14PM
  • mujen.com is better by oplspopo112 (Score:1) Thursday June 21 2001, @10:14AM
  • Re:mujen.com is better by oplspopo112 (Score:1) Thursday June 21 2001, @10:39AM
  • Re:Dumb question (?) by SpaceLifeForm (Score:1) Thursday June 21 2001, @12:52PM
  • Re:[ot]Google's data structure? by wrinkledshirt (Score:1) Thursday June 21 2001, @07:21AM
  • Re:[ot]Google's data structure? by wrinkledshirt (Score:1) Thursday June 21 2001, @07:26AM
  • by wrinkledshirt (228541) on Thursday June 21 2001, @07:08AM (#135382) Homepage

    Okay, this is so off-topic it's not even funny.

    Anybody have an inkling of a clue of the data structure that Google uses (or probably uses) to store all its words? I was just thinking that maybe it was some sort of balanced binary tree with each node containing a word, two pointers to the next two words further down the tree, and the root of a linked list of all the pages that word is contained in? I know binary search trees are supposed to be fast, but I was wondering if that'd be good enough for something with probably hundreds of thousands of words?

    I'm assuming they're not using some sort of sql LIKE "%searchword%", I can't imagine any kind of cluster that could speed that process up, although I don't really know all that much about the process or what the main benefits of clustering are.

    Anyway, hugely sorry for the offtopic post, it's just something that's been on the brain lately...

  • Re:Is she hot or not? by JAVAC THE GREAT (Score:1) Thursday June 21 2001, @09:57AM
  • by blamanj (253811) on Thursday June 21 2001, @08:10AM (#135384)
    Probably they use a trie [harvard.edu] or the related Patricia tree. These are very space efficient and relatively fast.
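
For readers who have not met a trie before, here is a bare-bones sketch of the structure suggested above, with each complete word pointing at the set of pages that contain it. Whether Google uses anything like this is speculation in this thread; the class and method names are illustrative, and a Patricia tree would additionally collapse single-child chains to save space.

```python
class Trie:
    """Character trie mapping each stored word to the pages containing it."""

    def __init__(self):
        self.children = {}   # next character -> child node
        self.pages = set()   # pages for the word ending at this node

    def insert(self, word, page):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.pages.add(page)

    def lookup(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return set()
        return node.pages

index = Trie()
index.insert("google", 1)
index.insert("goo", 2)
print(index.lookup("google"))  # {1}
print(index.lookup("go"))      # set()
```

Lookup time depends on the length of the word rather than on how many words are stored, which is the property that makes it attractive compared with a balanced binary tree.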
  • Re:Prepositions need love too by Kinchie (Score:1) Thursday June 21 2001, @07:44AM
  • Search 1,346,966,000 web pages by thgood (Score:2) Thursday June 21 2001, @08:23AM
  • Re:Here is the real google info... by Popocatepetl (Score:1) Thursday June 21 2001, @01:28PM
  • Re:read the article by Rogerborg (Score:2) Friday June 22 2001, @03:30AM
  • Re:phone book function by freeweed (Score:2) Thursday June 21 2001, @08:35AM
  • Much like McDonalds by freeweed (Score:2) Thursday June 21 2001, @08:37AM
  • Re:Disturbing Search Requests by tb3 (Score:1) Thursday June 21 2001, @08:20AM
  • Re:method for increasing hits by neves (Score:1) Friday June 22 2001, @09:30AM
  • Google Merchandise! by skunkeh (Score:1) Friday June 22 2001, @03:01AM
  • Re:Is she hot or not? by yukonbob (Score:1) Thursday June 21 2001, @10:19AM
  • also... by verbatim_verbose (Score:1) Thursday June 21 2001, @10:47AM
  • G3r/\/\4/\/ Pr1d3 b4by, w00t!!!!!1 by Supa Mentat (Score:1) Thursday June 21 2001, @07:29AM
  • Perks by Violet Null (Score:1) Thursday June 21 2001, @07:03AM
  • Re:[ot]Google's data structure? by Violet Null (Score:1) Thursday June 21 2001, @07:14AM
  • Re:[ot]Google's data structure? by Violet Null (Score:1) Thursday June 21 2001, @07:23AM
  • Yeah Suckah! by Louis_Cyphier (Score:2) Thursday June 21 2001, @07:03AM
  • Why Google is my favorite search engine by sketerpot (Score:1) Thursday June 21 2001, @07:12AM
  • Re:method for increasing hits by MattCutts (Score:1) Thursday June 21 2001, @03:54PM
  • Re:Here is the real google info... by 4thAce (Score:1) Thursday June 21 2001, @11:38AM
  • Re:Prepositions need love too by Modus Nonsens (Score:1) Thursday June 21 2001, @07:07AM
  • Google Parody by eyesyte (Score:1) Thursday June 21 2001, @05:04PM