Journal anaesthetica's Journal: The Math behind PageRank 131
The American Mathematical Society is featuring an article with an in-depth explanation of the type of mathematical operations that power PageRank. Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods. And because the links constituting the web are constantly changing, the relevance of pages needs to be recalculated on a continuous basis.
10,000 words (Score:5, Funny)
Re:The two that matter (Score:1, Interesting)
About 1.2 billion pages, and surprise surprise, Acrobat Reader tops the list, followed by a who's who of internet applications and plugins. But around result #30 it gets a bit more interesting, and when you're a few dozen pages in, "new patterns begin to emerge."
And to explain why not to use "click here", I found this [w3.org] buried on page 45. Thanks for the proof pudding guys, it's delicious.
Re: (Score:1)
The Anatomy of a Large-Scale Hypertextual Web Search Engine
http://infolab.stanford.edu/~backrub/google.html [stanford.edu]
- This paper tells you what PageRank really is, by the original author.
Efficient Computation of PageRank
http://dbpubs.stanford.edu:8090/pub/1999-31 [stanford.edu]
- This paper tells you how they efficiently compute it
And as far as I know about information retrie
Re: (Score:2)
PageRank doesn't seem to be based on keywords (Score:4, Informative)
Think about those links, too. How often do you use common words in an HREF? I don't think there's a lot of weeding out of common words since the link to a site is usually either its name, or a description containing some important keywords.
I love seeing these technoscientists think they understand PageRank, but just like TimeCube, they're way, way off.
Re: (Score:1)
I'm behind on my Slashdot reading, but I wanted to offer you a supportive comment even if it isn't timely. You're right, the original poster only read the summary and got modded up for a stupid comment based on not RTFA.
That said, your comment contained more insult than explanation (yeah, he didn't RTFA, but point out the discrepancy in his argument). The more inflammatory your message, the less likely it will be considered. I know, it's tempting to flame, and I do it myself now and then, but not near
Re: (Score:3, Informative)
Re: (Score:3, Interesting)
Re:PageRank doesn't seem to be based on keywords (Score:5, Funny)
Re: (Score:2)
If you're referring to the article, it focuses on the "links" aspect when describing the PageRank algorithm. The summary on here is pretty misleading in that way.
Re: (Score:2)
interestingly, it appears that Adobe Acrobat leads the list of results [google.com] when you search for "here" on Google (you can download it here [adobe.com]).
and who would have expected this [google.com]
Re:Pagerank is cool (Score:5, Interesting)
Of course, yahoo has its own opinion. [yahoo.com]
Although, altavista seems to almost agree. [altavista.com] Check the second non-advertised result.
I do find this [google.com] amusing though. Third place, how humble.
I didn't expect such interesting results. The site with the search term in its url was tops for av and yahoo, but not google. Yahoo ranked the wiki entry above google, but av reversed that decision, google of course thought itself was more important than the wiki. Google's own reference site was number one in its own search and near the top in the other two, but pagerank.net wasn't even in the top 10 for google's search. I'm not sure what conclusions can be drawn from all that, but it is definitely food for thought.
Re: (Score:2)
What I found interesting about that link was the description listed for google's entry:
Where did they get that text from? It's not anywhere to be found in the source [tinyurl.com]. Did they cheat? Or are they just tricky?
Re: (Score:2)
They got it from the Google category [dmoz.org] at the Open Directory Project at dmoz.org [dmoz.org], mirrored at directory.google.com [google.com]. Google is a user of dmoz.org data but has completely de-emphasized that as of late.
It's actually against the dmoz license agreement to use their data without a link back to the source, but nobody seems to care.
Re: (Score:2)
Re: (Score:1)
Re: (Score:1)
Bad summary (Score:5, Interesting)
Re: (Score:2, Funny)
Please. I can do that on paper in, like, five minutes.
Re: (Score:1)
Re: (Score:1)
Re: (Score:2)
Re:Bad summary (Score:5, Insightful)
If google used a single computer to do all the work, and truly did 80*25B^2 operations, they'd be morons.
Re: (Score:1)
Nouns maybe? (Score:4, Insightful)
Re: (Score:1)
Re: (Score:2, Insightful)
Re: (Score:2)
Re: (Score:1)
Searches for "the pill" and "pill" are two different queries because matching the whole phrase will get you more relevant results. This is built into the code that interprets queries (this is completely different from PageRank, which deals with cross-linking between sites to get the highest probability of relevance, AFTER the query is interpreted and a set of pages is generated). Almost all search engines work that way.
Re: (Score:1)
In regards to your comment:
Verbs play an extremely important role when dealing with relevancy based on phrases.
The small snippet that was posted was just cut and pasted from the opening hook of the article. It just leads into a mathematical discussion of how to sort through the thousands of results that are returned.
Re: (Score:2)
"The Who" vs the who (Score:1)
Re: (Score:1)
A bit late? (Score:1)
Re: (Score:2)
@article{bryan:569,
author = {Kurt Bryan and Tanya Leise},
collaboration = {},
title = {The $25,000,000,000 Eigenvector: The Linear Algebra behind Google},
publisher = {SIAM},
year = {2006},
journal = {SIAM Review},
volume = {48},
number = {3},
pages = {569-581},
keywords = {linear algebra; PageRank; eigenvector; stochastic matrix},
url = {http://link.aip.org/link/?SIR/48/569/1},
doi = {10.1137/050623280}
}
Re: (Score:1)
Kind regards
I joke a lot on Slashdot, but serious question (Score:3, Interesting)
Re: (Score:1)
Re: (Score:1)
Not sure if this is correct or not, just the impression that I got from what
Re: (Score:1)
Re: (Score:1)
Re:I joke a lot on Slashdot, but serious question (Score:5, Informative)
The underlying idea behind page rank is pretty well-exposed at this point, and is described in TFA. Essentially, it's a big set of simultaneous equations: each incoming link to your page gets a score that is roughly the rank of the source page divided by the number of outgoing links on that page, and then the rank of your page is roughly the sum of the scores of all incoming links.
Various fudge factors are introduced along the way. For example, if you break Google's rules about displaying the same content to bots as to humans, you can get slapped right down. More subtly, newly registered domains take a modest hit for a while. More nobody-knows-ly, Google's handling of redirects is unclear: information about exactly what adjustments are made is pretty scarce, and there's a lot of conjecture around. One thing that's pretty certain is that they penalise for duplicate content, which is why some webmasters do apparently unnecessary things like redirecting http://www.theircompany.com/ [theircompany.com] to http://theircompany.com/ [theircompany.com] or vice versa.
So, if you want to get a page with a high rank yourself, then ideally you would need to get many established, highly-ranked pages to link to your page and no others. In your example, all those Geocities sites wouldn't help a lot, because (a) they'd have negligible rank themselves, and (b) they'd be penalised for being new and lose some of that negligible rank before they even started. Many times negligible is still negligible, and so would be your target page's rank. OTOH, get a few links from university sites, big news organisations and the like, and your rank will suddenly be way up there. Alternatively, get a grass-roots movement going where a gazillion individuals with small personal sites link to you, and the cumulative effect will kick in.
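For anyone who wants to play with the "simultaneous equations" idea, here is a rough sketch in Python. It is not Google's code. It solves the equations by simple repeated substitution (power iteration), and it assumes a damping factor of 0.85 as in the original Brin and Page paper, which the description above leaves out. The function name `pagerank`, the toy graph `example_links`, and the handling of pages with no outgoing links are all made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page name to the list of pages it links to.
    damping and iterations are assumed values, not anything Google publishes.
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    # Start with rank spread evenly over all pages.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for source in pages:
            targets = links.get(source, [])
            if targets:
                # Each outgoing link carries rank(source) / number of outgoing links.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages
                # (one common convention, assumed here).
                share = damping * rank[source] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "hub" is well linked, "tiny1" and "tiny2" have no inbound links.
example_links = {
    "hub": ["target", "other"],
    "target": ["hub"],
    "other": ["hub"],
    "tiny1": ["target"],
    "tiny2": ["target"],
}
ranks = pagerank(example_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.4f}")
```

In this toy graph the two "tiny" pages push "target" a little ahead of "other", but most of its rank still comes from the single link off the well-connected "hub", which is the point about negligible sites made above.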
Re: (Score:3, Interesting)
I notice many sites that do that and don't get slapped down, esp. subscription sites. And it seems Google doesn't cache those, so it's probably collusion.
You see the keywords and paragraphs in the search, but click on it and you get a login page.
They should have to pay a special rate or be marked differently from the other search results. It's a waste of time otherwise.
Re:I joke a lot on Slashdot, but serious question (Score:5, Interesting)
I wonder, if I changed my useragent to be whatever the googlebot reports itself to be - would I get by the registration screen on websites like the NYTimes??
Re: (Score:2)
Google for the "bugmenot" Firefox extension.
Re: (Score:2)
Re: (Score:1)
Re: (Score:3, Informative)
If you're a FF user, grab the Useragent Switcher extension [mozilla.org] and add in a UA of "Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)"
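If you'd rather try this from a script than from a browser extension, a minimal sketch with Python's standard `urllib.request` might look like the following. It only builds the request and does not send it. The helper name `build_request` and the example URL are made up for illustration, and a site can check for real crawlers by other means, so there is no guarantee this gets past any registration screen.

```python
import urllib.request

# User-agent string quoted in the comment above.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Return a Request object that will present the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical target URL; nothing is fetched until urlopen() is called on it.
req = build_request("http://example.com/")
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would actually perform the fetch with that header.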
Re: (Score:3, Interesting)
Thanks for all the replies (Score:2)
Re: (Score:2, Insightful)
Interesting Appendix: Page and Brin on Advertising (Score:1)
Re: (Score:2)
Re: (Score:1)
a webpage that you want ranked high, what do you do? Do you make 100 geocities accounts and provide links to your main website
No, this would definitely not work. The reason is that 100 new geocities websites would have a value of 0, so using the PageRank algorithm you would effectively have 100 links X 0 PR. Incoming links only have a positive impact if they have weight independent of other websites. This is why it is so crucial to have your own website in the oldest dataset possible. It takes a long time for websites created in 1995 to disappear.
Does PageRank count? (Score:2, Interesting)
Re:Does PageRank count? (Score:4, Insightful)
Re: (Score:1)
I think that at least part of this is indicative of the "Google Sandbox [wikipedia.org]" (if you believe it exists). I've noticed, with the Google Toolbar in IE and Firefox, that some sites seem to have stagnant PRs (even with noticeable increases/decreases of traffic), but others move along in a relatively consistent manner.
Just my 2 cents.
Re: (Score:1)
Re: (Score:2, Funny)
I searched on Google but I cannot find what "on", "not", "for" and "the" mean...
Pagerank (Score:5, Funny)
They use a set of nested if-else statements
*ducks*
Re: (Score:1)
No, that would be waaay too many if-elses to write by hand...
they use IoC and code generation tools.
Old guys bully new comers. (Score:1)
Re: (Score:1)
Here it is... Google's PageRank formula (Score:1, Funny)
FROM tblAdvertisers
WHERE adword LIKE %searchstring%
ORDER BY adcost
you forgot.. (Score:5, Funny)
Re: (Score:1, Troll)
Re: (Score:1)
Shouldn't it be "ORDER BY adcost DESC"?
Re: (Score:1)
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '%searchstring% LIMIT 0, 30' at line 1
OK, but... (Score:1, Informative)
Only three articles about Google on one page? (Score:3, Funny)
evolution (Score:2)
The character of online content is now changing rapidly. We used to be in an Internet where mostly only the site provider determined the content on the pages they served (/. being a notable, early exception). Now, with the rise of "2.0" systems, user-generated content, and empowerment of the individual, the content being served on many sites is coming in from wide groups, and being moderated and curated by those groups.
So... a thought: as user-submitted and group-moderated content
Re: (Score:2)
The only exception that I can think of (from my searches) is forums that have answers to software problems. Google seems to have no problem finding these for me.
Re: (Score:2)
Re: (Score:2)
Compared to "exactly the information you want, when and how you want it", Google sucks. It is better than anything else now, but it still is not anywhere close to really solving the information access problem generally.
It's the World's Largest Matrix Computation (Score:2, Informative)
For a different, somewhat more technical, but more succinct discussion, Cleve Moler [of Matlab fame] wrote another view [mathworks.com] of this topic, about 5 years ago.
The math is the same, of course, but two points of view may provide a greater sense of perspective. So to speak. And Cleve is always worth listening to.
Re: (Score:2)
Other google technologies (Score:1, Redundant)
Why doesn't Brin get some credit?? (Score:1)
Seems unfair that something Brin and Page developed together would bear only one of their names.
"Page-rank"
??
Pages that don't exist anymore (Score:2, Interesting)
Re: (Score:1, Offtopic)
Re: (Score:1)
Please, try to impress me about Stanford some other way once you've progressed further
Re: (Score:1, Insightful)
Re: (Score:1)
- cal berkeley leads stanford in william lowell putnam competition fellows
- as for killer math events
stanford had streleski (v.i.z. wikipedia)
but berkeley topped him with kaczynski (!)
seriously, best
Re: (Score:2)
Re: (Score:1)