
anaesthetica's Journal: The Math behind PageRank

Journal by anaesthetica
The American Mathematical Society is featuring an article with an in-depth explanation of the type of mathematical operations that power PageRank. Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods. And because the links constituting the web are constantly changing and updating, the relevance of pages needs to be recalculated on a continuous basis.

  • by ambivalentduck (1004092) on Wednesday December 06, 2006 @07:48PM (#17139072)
    But 9,000 of those words are slang for parts of the human anatomy.  Go figure.
    • by Anonymous Coward
      There are only two that really reflect the power of PageRank: Click here. [google.com]
      About 1.2 billion pages, and surprise surprise, Acrobat Reader tops the list, followed by a who's who of internet applications and plugins. But around result #30 it gets a bit more interesting, and when you're a few dozen pages in, "new patterns begin to emerge."

      And to explain why not to use "click here", I found this [w3.org] buried on page 45. Thanks for the proof pudding guys, it's delicious.
    • Not directly related to this reply, but putting it here for visibility. Not self-promotion. Just would like to provide some useful reference:

      The Anatomy of a Large-Scale Hypertextual Web Search Engine
      http://infolab.stanford.edu/~backrub/google.html [stanford.edu]
      - This paper tells you what PageRank really is, by the original author.

      Efficient Computation of PageRank
      http://dbpubs.stanford.edu:8090/pub/1999-31 [stanford.edu]
      - This paper tells you how they efficiently compute it

      And as far as I know about information retrie

  • by dada21 (163177) * <adam.dada@gmail.com> on Wednesday December 06, 2006 @07:50PM (#17139112) Homepage Journal
    I have sites with a PR of 6, and I can tell you that they got that way because of inbound links from other sites. In fact, when other sites dropped those links, my PR dropped (to 5, and even to 4). Getting more inbound links brought the PR back.

    Think about those links, too. How often do you use common words in an HREF? I don't think there's a lot of weeding out of common words since the link to a site is usually either its name, or a description containing some important keywords.

    I love seeing these technoscientists think they understand PageRank, but just like TimeCube, they're way, way off.
    • Re: (Score:3, Informative)

      by markov_chain (202465)
      There has been a PageRank paper out there since 2000 or so, so it's not exactly a secret how it works. Basically an initial set of relevant pages is pulled from the database and ranked by doing some computation on a connectivity matrix. The trick is to come up with a good initial set; and unless they managed to implement an all-knowing oracle they probably do it by doing a keyword search. Here's where the article summary makes sense; if most pages have the same keywords, a keyword search is going to come
    • by zootm (850416)

      If you're referring to the article, it focuses on the "links" aspect when describing the PageRank algorithm. The summary on here is pretty misleading in that way.

    • Think about those links, too. How often do you use common words in an HREF?

      interestingly, it appears that Adobe Acrobat leads the list of results [google.com] when you search for "here" on Google (you can download it here [adobe.com]).

      and who would have expected this [google.com]

  • Bad summary (Score:5, Interesting)

    by Knights who say 'INT (708612) on Wednesday December 06, 2006 @08:06PM (#17139318) Journal
    The article specifically says the PageRank eigenvector is only recalculated once a month, approximately. Even though Google uses some clever numerics to calculate the eigenvectors to a 25 billion by 25 billion matrix by iteration, it still takes several hours to finish.
    • Re: (Score:2, Funny)

      by The Zon (969911)
      Even though Google uses some clever numerics to calculate the eigenvectors to a 25 billion by 25 billion matrix by iteration, it still takes several hours to finish.

      Please. I can do that on paper in, like, five minutes.
    • by Firehed (942385)
      Several hours for 25b x 25b? Jeez, it took Slashdot the better part of a day to update the comment id field type in their database... 16.7m by 1. OSTG, we demand that the servers running Slashdot be upgraded to something that could actually withstand a Slashdotting!
      • Re:Bad summary (Score:5, Insightful)

        by martin-boundary (547041) on Wednesday December 06, 2006 @10:19PM (#17140646)
        It's nowhere near like that. A web matrix is very sparse, so if you did a true 25Bx25B matrix power iteration, you'd be multiplying zero by zero a gazillion times. Optimization is about not doing things you don't need to do, and optimizing PageRank is about figuring out clever ways to not do the full multiplication. Moreover, PageRank is calculated in parallel over a computer farm. Overall, you can expect a single iteration to take on the order of an hour, and you can expect around 50-80 iterations before Google gives up and says it's converged. You can also try and reuse the previous "converged" PageRank vector to cut down on the 50-80 iterations after you've crawled new pages.

        If Google used a single computer to do all the work, and truly did 80*25B^2 operations, they'd be morons.
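
        A minimal sketch of the point above, assuming the web is stored as a plain dict of outgoing links rather than a full matrix. The function name pagerank, the 0.85 damping value, the tolerance, and the toy four-page web are all illustrative assumptions, not Google's code. Each pass only walks the links that exist, and an earlier result can be passed back in as the starting vector:

```python
def pagerank(links, damping=0.85, tol=1e-10, start=None, max_iter=200):
    """Power iteration that only touches links that actually exist.

    links: dict mapping every page to the list of pages it links to
           (every link target must also appear as a key).
    start: optional earlier rank dict covering every page (warm start).
    Returns (ranks, iterations_used).
    """
    pages = sorted(links)
    n = len(pages)
    ranks = dict(start) if start else {p: 1.0 / n for p in pages}
    for iteration in range(1, max_iter + 1):
        # Rank held by pages without outgoing links is spread over everyone.
        dangling = sum(ranks[p] for p in pages if not links[p])
        base = (1.0 - damping) / n + damping * dangling / n
        new = {p: base for p in pages}
        # One pass over the nonzero entries instead of an n-by-n multiply.
        for src in pages:
            if links[src]:
                share = damping * ranks[src] / len(links[src])
                for dst in links[src]:
                    new[dst] += share
        change = sum(abs(new[p] - ranks[p]) for p in pages)
        ranks = new
        if change < tol:
            return ranks, iteration
    return ranks, max_iter


# Hypothetical toy web: page -> pages it links to.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
old_ranks, cold_iters = pagerank(web)

# Pretend a re-crawl found one new link, then compare cold and warm starts.
web["d"] = ["a", "c"]
fresh_ranks, fresh_iters = pagerank(web)
warm_ranks, warm_iters = pagerank(web, start=old_ranks)
print("iterations from uniform start:", fresh_iters)
print("iterations from previous vector:", warm_iters)
```

        On a graph that has barely changed, starting from the old vector typically needs fewer passes than starting from a uniform guess, which is the reuse trick described above.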

        • Interestingly, Google does a lot of reindexing using existing searches and then builds upon a search listing and a page indexing review. For example in US Patent 6,526,440 [patentmonkey.com], "The search engine obtains an initial set of relevant documents by matching a user's search terms to an index of a corpus. A re-ranking component in the search engine then refines the initially returned document rankings so that documents that are frequently cited in the initial set of relevant documents are preferred over documents tha
  • Nouns maybe? (Score:4, Insightful)

    by Bryansix (761547) on Wednesday December 06, 2006 @08:07PM (#17139344) Homepage
    It seems like it would be the nouns, pronouns, etc. that Google should be paying attention to. Who cares about all the verbs, adjectives, etc. that just muddy the indexing waters?
    • by kramulous (977841)
      I believe that a race is on at the moment for semantic searching. Not only nouns, verbs etc, but whether the phrases are subjective or objective. I know a blog search company that is working on this. They wanted to borrow some of my code.
    • Re: (Score:2, Insightful)

      by abshnasko (981657)
      Searching for pill and the pill should yield very different results. Yes nouns are more important, but articles and other words cannot be disregarded.
      • by gfody (514448)
        "The" is a stop word [wikipedia.org] and will most likely be excluded from your search term.
        • by WaXHeLL (452463)
          It's not entirely excluded.

          Searches for "the pill" and "pill" are two different queries because matching the whole phrase will get you more relevant results. This is built into the code that interprets queries (this is completely different from PageRank, which deals with cross linking between sites to get the highest probability of relevance -- AFTER the query is interpreted and a set of pages is generated). Almost all search engines work that way.
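
          A rough sketch of that query-interpretation step, under the assumption that quoted phrases are kept whole while stop words are dropped from loose terms. The function name parse_query and the tiny STOP_WORDS set are hypothetical, and no real search engine's parser is being described:

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "and"}


def parse_query(query):
    """Split a query into exact phrases (kept whole) and loose terms."""
    # Anything inside double quotes is treated as one exact phrase.
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    # Remove the quoted parts, then drop stop words from what is left.
    rest = re.sub(r'"[^"]*"', " ", query)
    terms = [w for w in rest.lower().split() if w not in STOP_WORDS]
    return phrases, terms


print(parse_query('the pill'))
print(parse_query('"the pill"'))
```

          Under these assumptions the unquoted query loses "the", while the quoted one keeps it as part of the phrase, so the two really are different queries before any ranking happens.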
    • by WaXHeLL (452463)
      RTFA please. It deals with determining relevance, not the optimal method of indexing pages.

      In regards to your comment:
      Verbs play an extremely important role when dealing with relevancy based on phrases.

      The small snippet that was posted was just cut and pasted from the opening hook of the article. It just leads into a mathematical discussion of how to sort through the thousands of results that are returned.
      • by Bryansix (761547)
        I actually thought about that after I posted. I know all the words are important for indexing. I'm just saying that looking at keywords and placing more importance on those is a part of the mix too. Those keywords are almost always nouns.
    • by svindler (78075)
      So if I want to look for dwarf throwing I'll have to wade through all dwarf related pages because throwing is not relevant for the pagerank?
  • I read about this some time ago ... I think the paper was entitled "The 10 billion dollar Eigenvector: The math behind Google" or something to that effect. Sorry, but I've got a new laptop and cannot find the exact title. It was an excellent introduction for beginner computational scientists for an application of the eigenvector. I forget the American University responsible.
    • by mochan_s (536939)
      Here's the bibtex reference.

      @article{bryan:569,
      author = {Kurt Bryan and Tanya Leise},
      collaboration = {},
      title = {The \$25,000,000,000 Eigenvector: The Linear Algebra behind Google},
      publisher = {SIAM},
      year = {2006},
      journal = {SIAM Review},
      volume = {48},
      number = {3},
      pages = {569-581},
      keywords = {linear algebra; PageRank; eigenvector; stochastic matrix},
      url = {http://link.aip.org/link/?SIR/48/569/1},
      doi = {10.1137/050623280}
      }
  • by CrazyJim1 (809850) on Wednesday December 06, 2006 @08:10PM (#17139396) Journal
    I skimmed the article and didn't find what I wanted to find. If you make a webpage that you want ranked high, what do you do? Do you make 100 geocities accounts and provide links to your main website, or what? I'm just wondering this out of curiosity, not out of need.
    • At a very basic level a site's PageRank is a reflection of how relevant other sites think it is, and is based on how important the sites are that link to it. Get a link from the BBC, CNN, or somewhere like that and it's worth thousands or millions of links from Geocities sites.
    • by mojodamm (1021501)
      That's kinda what I thought at first as well, but looking over the lower two-thirds of the article, I started to get a different impression. They talked about a 'strong web' idea, where if your webpage is disconnected from the 'main' web and set up in a sort of 'secondary web' with just your Geocities accounts, for instance, linking to it, then the actual websites that interconnected within your site matrix would rank a 0 overall.

      Not sure if this is correct or not, just the impression that I got from what
      • If you read the entire article carefully, they deal with that by changing the way they search through the web. Instead of following every link, they assign a probability of .85 to following it. This makes their eigenvectors have nonzero entries because the search can jump out of the strong web and get back on track (if the random number falls into the .15 category it goes to a random indexed page from the entire internet). So yea, making a web of geocities accounts wouldn't do much more than you'd think it
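
        The 0.85/0.15 split can be written out as a short sketch. The notation is assumed rather than taken from this comment: S is the link-following matrix with columns summing to 1, n is the number of indexed pages, I is the rank vector, and \alpha is the 0.85 probability of following a link:

```latex
\begin{aligned}
G &= \alpha S + (1 - \alpha)\,\frac{1}{n}\,\mathbf{1}\mathbf{1}^{T}, \qquad \alpha = 0.85 \\
G I = I,\ \sum_k I_k = 1 \;&\Longrightarrow\; I_k = \alpha \sum_j S_{kj} I_j + \frac{1 - \alpha}{n} \;\ge\; \frac{0.15}{n} > 0
\end{aligned}
```

        Every entry of I is at least 0.15/n, so no page ends up at exactly zero, but pages with no inbound links stay close to that floor.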
    • That wouldn't work, because they'd all be coming from the same domain.
    • by Anonymous Brave Guy (457657) on Wednesday December 06, 2006 @08:52PM (#17139894)

      The underlying idea behind page rank is pretty well-exposed at this point, and is described in TFA. Essentially, it's a big set of simultaneous equations: each incoming link to your page gets a score that is roughly the rank of the source page divided by the number of outgoing links on that page, and then the rank of your page is roughly the sum of the scores of all incoming links.

      Various fudge factors are introduced along the way. For example, if you break Google's rules about displaying the same content to bots as to humans, you can get slapped right down. More subtly, newly registered domains take a modest hit for a while. More nobody-knows-ly, Google's handling of redirects is unclear: information about exactly what adjustments are made is pretty scarce, and there's a lot of conjecture around. One thing that's pretty certain is that they penalise for duplicate content, which is why some webmasters do apparently unnecessary things like redirecting http://www.theircompany.com/ [theircompany.com] to http://theircompany.com/ [theircompany.com] or vice versa.

      So, if you want to get a page with a high rank yourself, then ideally you would get many established, highly-ranked pages to link to your page and no others. In your example, all those Geocities sites wouldn't help a lot, because (a) they'd have negligible rank themselves, and (b) they'd be penalised for being new and lose some of that negligible rank before they even started. Many times negligible is still negligible, and so would be your target page's rank. OTOH, get a few links from university sites, big news organisations and the like, and your rank will suddenly be way up there. Alternatively, get a grass-roots movement going where a gazillion individuals with small personal sites link to you, and the cumulative effect will kick in.
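
      The "simultaneous equations" view can be tried on a toy web. This is only a sketch under stated assumptions: the names solve_linear and rank_by_equations, the 0.85 damping term, and the 16-page graph are made up for illustration, every page is assumed to have at least one outgoing link, and none of Google's fudge factors are modelled. It solves the equations directly and compares one link from a well-linked hub with ten links from pages nobody links to:

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def rank_by_equations(links, damping=0.85):
    """One equation per page:
    rank[p] = (1 - d)/n + d * sum(rank[q] / outdegree(q) for q linking to p).
    Assumes every page has at least one outgoing link.
    """
    pages = sorted(links)
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for src, targets in links.items():
        for dst in targets:
            a[index[dst]][index[src]] -= damping / len(targets)
    b = [(1.0 - damping) / n] * n
    return dict(zip(pages, solve_linear(a, b)))


# Hypothetical web: a hub that several pages link to, plus a ten-page link farm.
web = {"hub": ["a", "b", "c", "site_x"],
       "a": ["hub"], "b": ["hub"], "c": ["hub"],
       "site_x": ["hub"], "site_y": ["hub"]}
for i in range(10):
    web[f"farm{i}"] = ["site_y"]

ranks = rank_by_equations(web)
print("farm page:", ranks["farm0"])
print("site_x (one link from the hub):", ranks["site_x"])
print("site_y (ten links from the farm):", ranks["site_y"])
```

      In this toy web each farm page sits at the minimum possible rank, and the single hub link outweighs all ten farm links. The floor is large here only because the web has 16 pages; with billions of pages it shrinks accordingly.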

      • Re: (Score:3, Interesting)

        by TheLink (130905)
        "if you break Google's rules about displaying the same content to bots as to humans"

        I notice many sites that do that and don't get slapped down - esp subscription sites. And seems Google doesn't cache those, so it's probably collusion.

        You see the keywords and paragraphs in the search, but click on it you get a login page.

        They should have to pay a special rate or be marked differently from the other search results. It's a waste of time otherwise.
        • by oni (41625) on Wednesday December 06, 2006 @09:39PM (#17140342) Homepage
          I notice many sites that do that and don't get slapped down - esp subscription sites.

          I wonder, if I changed my useragent to be whatever the googlebot reports itself to be - would I get by the registration screen on websites like the NYTimes??
          • by kimvette (919543)
            No, because they check the IP you're coming from as well now - they grew wise to user agent spoofing years ago.

            Google for the "bugmenot" Firefox extension.
            • by jZnat (793348) *
              Googlebot doesn't use the same IP address all the time (several servers running Googlebot I'd imagine), so filtering based on IP addresses would be infeasible (at least according to Google).
          • Re: (Score:3, Informative)

            by XorNand (517466) *
            As pointed out, the Times site isn't fooled, but there are a good many out there that are. Sometimes when you do a Google search, one of the results will contain a keyword or two. However, when you click on the link, you'll find yourself redirected to a subscription page. Useragent spoofing can frequently show you the same page that Google indexed.

            If you're a FF user, grab the Useragent Switcher extension [mozilla.org] and add in a UA of "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)".
        • Re: (Score:3, Interesting)

          by suggsjc (726146)
          Here is an email with associated response I received from Google on roughly this topic.

          This is a very general question. I'm creating a website. It is going to be a blogging platform. Obviously, the content of the site(s) is the most important thing. I've already started making the content of my site dynamic in the sense that I tailor it to the requesting agent (via the user-agent header). My intention for doing this is to make sure that the content renders correctly for *any* browser that accesses the sit

      • I now have a nice basic understanding of Google's page ranking system. That's all I was asking for.
      • Re: (Score:2, Insightful)

        by l0cust (992700)
        Thanks for the informative post. I have one question though. How does it help find the relevant information unless that information just happens to be on a popular page too? What I mean to say is that the idea behind grading/filtering systems like PageRank is to provide the most relevant information about the thing you are trying to search on the net. Now suppose Mr. A is looking for some obscure Indian text written in Sanskrit and Mr. B has (recently or not) put up a website with that text as one of the co
      • One of the references for the article is The Anatomy of a Large-Scale Hypertextual Web Search Engine (http://infolab.stanford.edu/pub/papers/google.pdf [stanford.edu]), published in Computer Networks and ISDN Systems. At the end of the paper, they have a very interesting appendix: "Advertising and Mixed Motives"

        Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, i

    • by linhux (104645)
      If those 100 geocities pages each have a PageRank of 0 (which they would if they aren't linked to from other high-ranking pages), their total contribution to your main page PageRank will be 0.
    • by cvos (716982)

      a webpage that you want ranked high, what do you do? Do you make 100 geocities accounts and provide links to your main website

      No this would definitely not work. The reason is that 100 new geocities websites would have a value of 0 so using the PageRank algorithm you would effectively have 100 links X 0 PR. Incoming links only have a positive impact if they have weight independent of other websites. This is why it is so crucial to have your own website in the oldest dataset possible. It takes a long time for websites created in 1995 to disappear.

  • Does PageRank count? (Score:2, Interesting)

    by matr0x_x (919985)
    As a self-proclaimed SEO expert - I honestly don't believe PageRank counts nearly as much as it did a few years ago! You'll find lots of PR5 sites ahead in the SERPs of PR9 sites!
    • by Trieuvan (789695) on Wednesday December 06, 2006 @08:35PM (#17139726) Homepage
      The pagerank that's reported from toolbar is really old. Google never wants to let you know the real number, or it would be easy to spam ...
      • The pagerank that's reported from toolbar is really old.

        I think that at least part of this is indicative of the "Google Sandbox [wikipedia.org]" (if you believe it exists). I've noticed, with the Google Toolbar in IE and Firefox, that some sites seem to have stagnant PRs (even with noticeable increases/decreases of traffic), but others move along in a relatively consistent manner.

        Just my 2 cents.

    • by dbmasters (796248)
      PageRank is worthless in terms of SEO. What it can do is tell you if there is a problem, if you have a PR of 0 or 1 or something, but thinking it somehow affects your SERPs is a delusion far too many people fall into. Concentrate on SERPs, not PR, ASAP for SEO on the WWW.
      • Re: (Score:2, Funny)

        by Anonymous Coward
        Concentrate on SERPs, not PR, ASAP for SEO on the WWW

        I searched on Google but I cannot find what "on", "not", "for" and "the" mean...
  • I asked some math website to put a link to http://www.mathpotd.org/ [mathpotd.org] Math Problem of the Day -- they don't bother to do so. They know the math and use it.
  • SELECT advertiser, description, link, adcost
    FROM tblAdvertisers
    WHERE adword LIKE '%searchstring%'
    ORDER BY adcost
  • OK, but... (Score:1, Informative)

    by indigest (974861)
    The algorithms behind PageRank are no secret. Why not just read about them from the source [stanford.edu]?
  • by colourmyeyes (1028804) on Wednesday December 06, 2006 @11:01PM (#17141000)
    I think we can get four or five tomorrow.
  • Great article.

    The character of online content is changing now rapidly. We used to be in an Internet where mostly only the site provider determined the content on the pages they served (/. being a notable, early exception). Now, with the rise of "2.0" systems, user-generated content, and empowerment of the individual - the content being served on many sites is coming into sites from wide groups, and being moderated and curated by those groups.

    So... a thought: as user-submitted and group-moderated content
    • I could not disagree more. Most of the sort of information people search for is not user generated: when did you last do a Google search for which a Slashdot comment was the appropriate answer?

      The only exception that I can think of (from my searches) are forums that have answers to software problems. Google seems to have no problem finding these for me.
      • Sometimes you want to search through your old posts. Not all sites let you do that (slashdot does if you pay up, I think), and often forums are even norobots space.
      • by drDugan (219551) *
        The meme that Google helps us find all the information is a huge marketing Spin.

        Compared to "exactly the information you want, when and how you want it" - Google sucks. It is better than anything else now, but it still is not anywhere close to really solving the information access problem generally.
  • For a different, somewhat more technical, but more succinct discussion, Cleve Moler [of Matlab fame] wrote another view [mathworks.com] of this topic, about 5 years ago.

    The math is the same, of course, but two points of view may provide a greater sense of perspective. So to speak. And Cleve is always worth listening to.

    • by jfengel (409917)
      Actually, I'm not so sure it's the largest matrix computation. Weather and nuclear bomb simulations are done with matrix algebra, and it wouldn't surprise me to discover that they do some months-long calculations with even larger matrices.
  • I've seen links on google searches that don't exist anymore but were ranked highly when they DID exist and still exist in the top 10 of the query. What happens to those? Do they stay at their ranking till they get overtaken by other more popular pages on the same search? Get their ranking slowly reduced because they don't exist?
