Search Engines Can't Keep Up
joshwa writes "The Boston Globe today reported on a study published in Nature saying that search engines barely index one-sixth of the pages on the net. To a certain extent it's a plug for the Northern Light search engine, which claims to be the most comprehensive (at a staggering 16 percent of the web), but it's an interesting read nonetheless."
Bogo-coverage (Score:2)
Anyone who does any searching quickly realizes this, so the study isn't breaking ground here, although maybe it quantifies the problem.
Beyond this, I don't see how the study's result could be meaningful.
1) How did they come up with their estimate of 800 million web pages? If that number is bogus, so is the percentage. They can measure the pages they found, but how do they measure the pages they couldn't find? Different estimation techniques could produce wildly different totals for the number of web pages.
2) Counting pages (and computing coverage) is especially problematic given the increasing amount of content generated dynamically.
Slashdot moderation: idea for new search engine? (Score:1)
Re:Maybe, but we're trying! (Score:1)
--
Linux Home Automation - Neil Cherry - ncherry@home.net [mailto]
http://members.home.net/ncherry [home.net] (Text only)
http://meltingpot.fortunecity.com/lightsey/52 [fortunecity.com] (Graphics)
Measuring the Web (Score:1)
http://www.research.digital.com/SRC/personal/Kris
(Don't know why a space got inserted in the link, just remove the space after you get the 404 error. Sorry!)
Also from Compaq (DEC) SRC:
Web Archeology [digital.com]
Mercator Web Crawler [digital.com]
Re:Search engine coverage (Score:1)
Yay google! (Score:1)
Had to put my 2 cents in. Google rules.
-----------------------------------------
Reveal your Source, Unleash the Power. (tm)
Re:Most "good" links are still being indexed (Score:1)
I don't know about that. Yahoo seems to be leaning more towards shopping and less towards information, besides the fact that Yahoo does a lousy job of cleaning up links. A list of results generated from a search is a little like playing Minesweeper with dead links, which renders it pretty useless when the listing of results is limited.
Older versions of study (Score:1)
98 Results [nec.com]
99 IEEE Paper [nec.com]
Re:Search engine coverage (Score:1)
Oh cool! (Score:1)
Yes, humans do do it better! I use NetMechanic's link checker to keep my links pages up to date - the only problem is, it seems to cache the pages somehow: links that have physically been removed from the page still show up in the report.
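A basic dead-link checker isn't much code, for what it's worth; here's a minimal sketch in Python (the URLs are just placeholders - point it at your own links page instead):

```python
# Minimal dead-link checker: issue a HEAD request to each URL and
# report anything that errors out or comes back with a 4xx/5xx status.
import urllib.request
import urllib.error

def check_links(urls):
    dead = []
    for url in urls:
        req = urllib.request.Request(url, method="HEAD")
        try:
            urllib.request.urlopen(req, timeout=10)
        except urllib.error.HTTPError as err:            # reachable, but 404/500/etc.
            dead.append((url, err.code))
        except (urllib.error.URLError, OSError) as err:  # not reachable at all
            dead.append((url, str(err)))
    return dead

if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    for url, reason in check_links(["http://example.com/", "http://example.com/gone"]):
        print("dead:", url, reason)
```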
I should also mention the Mining Co, now About.com. I had given up hope on looking for decent 3D graphics sites, and to my amazement, I found a whole section devoted to it and VRML there!
Here's an example: (Score:1)
So where are my hits coming from? Well, go to MetaCrawler and search for scuba, pictures, women [go2net.com]. That gets you pictures of me with various celebrities (none underwater) along with a mix of dive sites, scuba porn sites and the charming pages of www.whitesonly.org.
The other 84% (Score:3)
Re:Uhhh.... (Score:1)
To become a moderator, you need to be a user. I really don't understand why regular readers aren't users - IMHO of course
Hopefully, this piece was "informative".
Mong.
* Paul Madley
Re:The question: Do we want all the web pages? (Score:1)
I also want them to ignore meta tags, or any text in a tag - or at least have that be one of the search options, which'll cut down on a 31337 P0rn site popping up on EVERY search.
I agree with the sidebar problem. The other problem is half the stuff they seem to have indexed has moved on by the time I search and a lot of "That member's page can't be found" or just 404 errors pop up.
Yahoo! and Monkeys on Typewriters (Score:1)
How is Yahoo! (an index) even able to compete anymore? Goes to show you what a little name recognition can get you... I would bet they are at less than 1% coverage now!
I think that a 100% human-entered index is still handy. If only they could somehow quadruple the number of monkeys on typewriters we might really have something: http://dmoz.org/
-AP
Re:Search engine coverage (Score:1)
Deals with ISP's, common Index (Score:1)
Anyway, Altavista, Yahoo, Infoseek, etc... could make deals with the big ISP's/web host services such as Mindspring, Netcom, Earthlink, Geocities, Tripod, etc... Those sites would then index their own sites, which would save your spider/crawler a lot of time.
The indexes could then be merged back at Altavista, Infoseek, etc... Or the search of those sites could hit all the distributed indexes.
There is still the issue of other sites not located on these big ISP's, like .edu's and ibm.com's.
Hmm, an outrageous pipedream: (Score:1)
2.) Grab ahold of a gigantic dataline (OC-192 anyone?)
3.) Set up an engine to visit every IP conceivable. Then check each site for every directory conceivable... and every filename conceivable...
4.) Brace for lawsuits
5.) Throw more hardware at it
6.) Run SETI@home (in its spare time)
7.) a month later, all indexed, sued up the wazoo, time to start *all* over again
Re:Distributed indexing??? (Score:1)
For an excellent example that is almost there, see the Open Directory [dmoz.org].
Sherlock plugin? (Score:1)
Re:Uhhh.... (Score:1)
An individual moderator can only nudge the score up or down by 1 point. The adjective displayed is the one selected by the last moderator to grade a message.
Uhhh.... (Score:3)
Reminds me of a joke, but I can't remember the specifics. Something like "Fully 33 percent of our foos are bar, but only one third of their foos are bar."
Diameter of WWW (Score:1)
Distributed indexing??? (Score:2)
Re:Search engine coverage (Score:1)
Well, perhaps the computing power for something like this isn't available yet. But it'd be nice...
Search engine coverage (Score:4)
Machines aren't very good at being intelligent in this manner, so suppose a new search engine was created. You type in a search term and it comes back with a list of matching pages. You again wade through the list, but now you can also award a number of relevance points to the ones that matched closest. This would work well for a while, but it would break down in the long run: as the web continues to expand, new pages will start out unranked, so they would not appear in the ranked lists of potential hits (at least for popular search terms) and so would never get ranked.
What might work better would be a search by reduction. Type in some overgeneralized search term and the text on the page is distilled down to a brief outline. There are already packages which can create fairly decent summaries of documents. You click on a button that indicates "I like this, find me more like it" which means that there's something you like about the summary so it generates a number of new more specific search terms from the summary and comes up with a new list.
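I don't know of an engine that actually works this way, but the "find me more like it" step could be as simple as pulling the most frequent content words out of the summary and feeding them back in as extra query terms. A toy sketch in Python (the stop-word list and the sample summary are made up for illustration):

```python
# Toy "more like this": distill a liked summary down to its most frequent
# content words and append them to the original query as new search terms.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
              "it", "for", "on", "that", "this", "with", "as", "are"}

def expand_query(original_query, liked_summary, extra_terms=3):
    words = re.findall(r"[a-z']+", liked_summary.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    new_terms = [w for w, _ in counts.most_common(extra_terms)]
    return original_query + " " + " ".join(new_terms)

summary = ("Tutorial covering aquarium lighting, plant substrate choices "
           "and CO2 injection for heavily planted freshwater aquariums.")
print(expand_query("aquarium plants", summary))
# Prints something like "aquarium plants tutorial covering aquarium";
# the added terms vary with whatever text the summarizer produced.
```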
Re:Search engine coverage (Score:1)
That's pretty much the idea that Aeiwi [aeiwi.com] is based on.
Aeiwi has a unique interface that allows users to add more search
terms until they have a manageable number of results.
Knud
Quick usability test (Score:1)
The layout is pretty nice, or rather clean, but the search was slightly slow. Who wants to catalog more of the Web when it means that much more noise to wade through? I also think the 1/6 to 16% thing is hilarious.
Re:Uhhh.... (Score:1)
Should anyone bother with meta tags anymore?
Re:Uhhh.... (Score:1)
Joe
Re:Uhhh.... (Score:1)
The question: Do we want all the web pages? (Score:2)
One problem I have with engines is sites with changing sidebars... when the sidebars mention one of my keywords because it was a recent article when the crawler went by, but the article has nothing to do with what I want...
Never mind %s, update those links! (Score:1)
I was looking for a Director page...and came up with a page apologizing to people who came from Yahoo. It read something like "This link has been dead since Dec. 16, 1997, if you're wondering how long Yahoo keeps old URLs".
Yeah, when I need something in-depth, Google, Hotbot and Ask Jeeves do the job pretty good!
Re:Bogo-coverage (Score:1)
1. Search a random set of IP addresses for web servers. Make this as big as you can. Just look at port 80 for now to make the task easier.
2. Find out how many pages are on each of the servers you find (OK, _this_ is hard).
3. Choose a subset of the machines you scanned and check them for web servers on unusual ports.
Work out the ratio of servers on odd ports to servers running on port 80, and add the extrapolated count to the servers found in (1).
Average (mean) the number of pages per server you attempted to establish in (2).
Now we know the percentage of IPs that run a server, and we know how many possible IP addresses there are, so we can get a fair guess at the number of servers out there. Multiply by the average pages per server and Robert's your father's brother... (a rough sketch of this is below the list of problems).
Problems:
Big sites with many pages versus home servers on cable: hit a few too many big sites, or too few, and your estimate is way out.
Just finding the number of pages on a server is hard; how many does GeoCities have? Do they even publish this stuff?
Active pages: does Slashdot have infinite pages if you keep adding users and generating unique views?
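Step (1) isn't much code, for what it's worth. Here's a rough sketch in Python - pick random 32-bit addresses, knock on port 80, extrapolate from the hit rate - with the pages-per-server number completely made up, for exactly the reasons above:

```python
# Rough sketch of the server-count half of the estimate: probe random
# IPv4 addresses on port 80 and extrapolate from the hit rate.
import random
import socket

def probe(ip, port=80, timeout=2):
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def estimate_servers(sample_size=1000):
    hits = 0
    for _ in range(sample_size):
        ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
        if probe(ip):
            hits += 1
    address_space = 2 ** 32          # ignores reserved/private ranges
    return hits / sample_size * address_space

if __name__ == "__main__":
    servers = estimate_servers(sample_size=200)   # tiny sample, huge error bars
    pages_per_server = 300                        # made-up average from step (2)
    print(f"~{servers * pages_per_server:,.0f} pages (give or take a lot)")
```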
Same search results, different day (Score:1)
I did a search on Northern Light and it didn't find my pages, but it did find pages with links to my page. I also used the power search, but that failed too.
I probably shouldn't be too upset, as I get about 1000 hits per month. Since it is a specialized page I don't think I'll get any more hits. But it does tick me off that if I want people to find my page I have to pay for it. I thought a search engine's reputation was supposed to garner it more attention and therefore more advertising dollars. Now, to increase their reps, I have to pay them.
--
Linux Home Automation - Neil Cherry - ncherry@home.net [mailto]
http://members.home.net/ncherry [home.net] (Text only)
http://meltingpot.fortunecity.com/lightsey/52 [fortunecity.com] (Graphics)
Re:Distributed indexing??? (Score:1)
I sort of like the idea of a self-indexing web... almost like a neural net (insofar as it is mutable through the requests being submitted)... central databases don't seem to work well; they are too "removed" from the actual destination...
Don't Credit the Globe (Score:1)
(Note to Rob: I submitted this same story to /. yesterday afternoon, with links and proper attribution to NECRI and Nature, but I guess accuracy doesn't count as much as timing.)
Re:Search engine coverage (Score:2)
I left the project because I don't think it works. The system would have involved THE WORLD'S LARGEST NEURAL NET, with inputs containing information describing all of the "important" words on the page, the distances between various words, and the font sizes used to display the words.
IMHO, there were a few insurmountable problems with the project. One, the neural net was way too large. There are too many words to search, and the word list would need to grow over time (in 1996, would the words "Linux" and "PalmOS" and "WinCE" have been frequent enough to merit their own input nodes? Probably not. Today, on the other hand....). How do you design a neural net which changes the number of input nodes over time, but doesn't lose its current weights? I don't know if there's any research on this, but it would be interesting.
There are various problems with synonyms and related words as well. I also wasn't sure that the Hn tags were good indicators of importance. Web pages aren't structured like outlines anymore.
The biggest problem is the lack of NEGATIVE feedback. You only tell the neural net search engine what you like, not what you don't like. Neural nets are initialized with random weights for various technical reasons (Prof. Shavlik has experimented with starting neural nets off with rule-based knowledge in his KBANN project). That means that some things which you DO like will most likely get negative weights at first and you'll never see them. While you might specify a list of words you do NOT want to see (which would help the inputs), you would probably not spend time examining pages to see if they do NOT interest you (which means you would never do back propagation with a negative answer). No one would want a product which says: "I think you will really hate this page. Am I right?" The problem is that this is a very necessary part of training a neural net.
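To see why that matters, here's a stripped-down stand-in (a single-layer scorer, not their actual net, with made-up feature vectors): train on nothing but "I liked this" examples and the weights only ever get pushed up, so the thing ends up calling every page relevant.

```python
# Stripped-down illustration of the positive-only feedback problem:
# a single-layer scorer trained only on "liked" examples drifts toward
# predicting "relevant" for everything, rated or not.
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(examples, labels, dim, epochs=200, lr=0.1):
    weights = [random.uniform(-0.5, 0.5) for _ in range(dim)]
    for _ in range(epochs):
        for features, label in zip(examples, labels):
            pred = sigmoid(sum(w * f for w, f in zip(weights, features)))
            error = label - pred                      # always positive here
            weights = [w + lr * error * f for w, f in zip(weights, features)]
    return weights

random.seed(0)
dim = 5
liked_pages = [[random.random() for _ in range(dim)] for _ in range(50)]
weights = train(liked_pages, labels=[1.0] * len(liked_pages), dim=dim)

junk_page = [random.random() for _ in range(dim)]   # never rated by the user
score = sigmoid(sum(w * f for w, f in zip(weights, junk_page)))
print(f"score for a page the user never rated: {score:.2f}")  # comes out near 1.0
```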
This isn't to say that the research project hasn't shown some results, but it isn't as ideal a solution as you'd think.
-jon
Search engines as a commodity? (Score:2)
To make what I'm thinking of possible, you'd need to have a standard indexing format. I'm sure Microsoft has one we can use, as long as half the links point back to them :)
Isn't that part of what the META tag is for? Or the LINK tag?
Looking over my copy of the HTML 4.0 specification, there's not a specified list of META attributes, but maybe the following should be considered standard for search engines:
The following LINK attributes should be set also:
That way, a search result could take the format of:
The best thing about the LINK attributes is that at least one browser, iCab [www.icab.de], provides a set of buttons for several LINK attributes -- start, end, next, prev, home, search, help, made, etc. Too bad it's MacOS only; maybe someone could create a similar set of buttons for Mozilla?
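For what it's worth, the harvesting side of this is trivial if pages volunteer their metadata. Here's a sketch using Python's html.parser that just collects whatever META names and LINK rels a page declares; the sample tags are only common conventions, nothing the spec mandates:

```python
# Pull out whatever META/LINK metadata a page volunteers -- the kind of
# record a "standard indexing format" could be built from.
from html.parser import HTMLParser

class MetadataParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.meta = {}    # e.g. {"description": "...", "keywords": "..."}
        self.links = {}   # e.g. {"next": "/page2.html"}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs and "content" in attrs:
            self.meta[attrs["name"].lower()] = attrs["content"]
        elif tag == "link" and "rel" in attrs and "href" in attrs:
            self.links[attrs["rel"].lower()] = attrs["href"]

page = """<html><head>
<meta name="description" content="A page about aquarium plants">
<meta name="keywords" content="aquarium, plants, CO2">
<link rel="next" href="/plants/page2.html">
</head><body>...</body></html>"""

parser = MetadataParser()
parser.feed(page)
print(parser.meta, parser.links)
```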
Anyway, Altavista, Yahoo, Infoseek, etc... could make deals with the big ISP's/web host services such as Mindspring, Netcom, Earthlink, Geocities, Tripod, etc... Those sites would then index their own sites, which would save your spider/crawler a lot of time.
Now there's a thought! Then meta-search engines like Metacrawler could have more meaningful returns.
Am I the only one that thinks a search engine should be a commodity? I don't care which search engine I use, so long as I get the best results. (Keeping paid advertisements out of the search results would be a benefit, too...)
There is still the issue of other sites not located on these big ISP's, like .edu's and ibm.com's.
Maybe someone should consider an EduSearch search engine, indexing only sites under the .edu domain? (Especially if its index can be used by a larger metasearch engine...)
As for ibm.com and the like, large corporate web sites should have some form of search facility; an Alertbox column from UseIT.com [useit.com] discussing corporate intranets says that having some form of search facility should be considered essential -- I don't see why the same shouldn't be true for their Web shingle as well.
Jay (=
Yahoo! (Score:1)
But I agree that Yahoo! can't compete anymore; if you want your site to be indexed with it, you have two options.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Belgium HyperBanner
http://belgium.hyperbanner.net
Re:Uhhh.... (Score:1)
Moderators don't directly set the score, they just somehow nudge it in a particular direction with a particular adjective.
I don't know if it takes more than one moderator to assign a particular adjective to a post.
Search engines should be like DNS servers (Score:1)
Could we turn web search engines into a distributed hierarchy like DNS? I don't expect my ISP's DNS server to have every IP address on the planet, but I expect it to be able to find the ones I need.
Have each of the major ISPs (especially those that give their members web space!), free web page providers, companies that do virtual domain hosting, and large corporate/education/organization sites maintain their own index of web pages.
There could be generic, "top level" engines like Yahoo and Altavista (which could choose to exclude indexes of porn sites) but also more focused engines -- educational sites, business sites, scientific and technical sites; hell, why not a porn engine?
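A rough sketch of the fan-out half of that idea, assuming (purely hypothetically) that every provider exposed its own query interface - the providers and their result lists here are invented:

```python
# Sketch of a DNS-style federated search: fan the query out to each
# provider's own index and merge whatever comes back. Providers and
# results are invented for illustration.
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for per-provider indexes (each ISP/host would run its own).
PROVIDER_INDEXES = {
    "geocities.example":  {"scuba": ["http://geocities.example/~bob/divelog"]},
    "tripod.example":     {"scuba": ["http://tripod.example/reefs/"]},
    "edu-search.example": {"scuba": ["http://oceanography.example.edu/gear.html"]},
}

def query_provider(provider, term):
    # In real life this would be an HTTP request to the provider's index.
    return PROVIDER_INDEXES[provider].get(term, [])

def federated_search(term):
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda p: query_provider(p, term),
                                     PROVIDER_INDEXES))
    merged = []
    for results in result_lists:
        merged.extend(results)
    return merged

print(federated_search("scuba"))
```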
Would this work?
Jay (=
Re:Search engine coverage (Score:2)
The system in Scientific American works by analysing not merely the contents, but the relationships among the links. It then classifies sites and documents according to the pattern of links into and out of them. This helps in prioritizing "authoritative" sites, for example.
You should check out the article and its bibliography.
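If memory serves, the approach is along the lines of Kleinberg's hubs-and-authorities idea: a page is a good authority if good hubs point at it, and a good hub if it points at good authorities. A compressed sketch on a made-up toy link graph (not the actual system from the article):

```python
# Toy hubs-and-authorities iteration on a made-up link graph.
import math

links = {                       # page -> pages it links to
    "portal":    ["bikes-faq", "clubs", "shop"],
    "blog":      ["bikes-faq", "clubs"],
    "shop":      ["bikes-faq"],
    "clubs":     ["bikes-faq"],
    "bikes-faq": [],
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

for _ in range(20):
    # Authority score: sum of hub scores of pages linking in.
    auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}
    # Hub score: sum of authority scores of pages linked to.
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    # Normalise so the scores don't blow up.
    for scores in (auth, hub):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        for p in scores:
            scores[p] /= norm

print(sorted(auth, key=auth.get, reverse=True))  # most "authoritative" first
```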
Re:Distributed indexing??? (Score:1)
Searching the Web (Score:2)
Are we afraid that someone in New Guinea has the answer to our life's problems?
I don't see why searching the whole web is any more relevant an activity than reading every book that has been written. Some will see a flaw in this: they'll say, "Reading the web and searching the web aren't the same thing - I want to know my choices." Fine, I say - you don't know all your choices when it comes to books, either.
Then there's the "quality" argument: "I don't want all of the references to 'X' - I want only the 'good' references to 'X'." On the Internet not only does no one know if you're a dog, they don't know if you're a dog with bad taste! I think this argument needs to be changed; I like the Social Sciences Index idea, personally: the number of references to an article makes it "important". That is, the greater the number of times that an article is referred to by another article, even if the reference is only to refute the original, the higher the ranking of the article. We already see this in action - they're called portals. They are the hot spots of the web...
--
Plug, eh? (Score:2)
another cached engine (Score:1)
These cached engines need to update on a daily basis if they intend to remain functional.
As websites often update, they change the pages and the names of pages to fit a new look or feel. The searches I used found pages that don't exist anymore, nor have they for a few months now.
Oh well maybe people will "back up" in the URL when they visit...
Northern Light IS better (Score:1)
1. Breaks down the search results by their type and location (mini-directory)
2. Doesn't annoy you with stupid plugs (ahem, "recommendations").
Apparently they also have a fair amount of non-web (presumably OCR scanned) material, but I've never tried purchasing it.
In the Trenches... (Score:2)
Having started up a couple of years back, I can say I've seen some of what this article is talking about. More and more, I see sites listed and mentioned by word of mouth that I had not found via any of the major search engines. Even with date restrictions, a search of the majors (Altavista and HotBot in my case) can eat up days, literally.
The reviews I write tend to note this fact -- although I have a few "big" Middle Eastern Dance sites, my focus and goal is noting all the little sites that are being left behind. Most of them still come from the search engines, but it's just too much. Even with 100 workers, I still couldn't get them all.
I can't say I know of a realistic way of overcoming this. What would be good is a strong effort to have all the major ISP's offer an easy way to register with all the search engines any pages their users create. It's easy to create a web site, but so many people get left behind in actually promoting it, and when they do, they do so very poorly. (For the moment, let's ignore those who just don't do HTML well.) Without the promotion, a site is just for a few family and friends, unless the content is really interesting, and it is promptly drowned out by the chaos of the web.
Also, I think projects like Google and the push towards XML are imperative to the health of the web. We need to move away from the free-form nature of _everything_ on the WWW, and towards more structure, more focus. People simply need to be able to find stuff, and they cannot right now. I'm going to do my part -- my site is being converted to XML for the far future, and, for the near future, the perl scripts that build it have already been rewritten so it can be moved to a server with CGI, so that people can search my site specifically.
Just my two cents.
Meta-search! (Score:1)
One search engine not complete enough for you? Search a bunch of them with a meta-search engine. I like SavvySearch [savvysearch.com].
distributed "SETI"-like initiatives called for? (Score:2)
To avoid the Netscape "What's Related?" fiasco, the authors should allow the end user editorial control, and provide for some discretion over, and anonymizing of, the submission of results.
Re:Search engine coverage (Score:1)
Searching "Linux home page perl" on hotbot:
(After clicking Reload, due to "Connection Reset by Peer")
8 advertising graphics (including a mini-form letting me look up "Linux home page perl" at kidflix.com)
4 "search partner" links (including "How to Buy a House Online")
The search results start halfway down the page.
I do not enjoy scrolling to look at the reason I'm on the site in the first place: search results. Hotbot is good as far as the search engine itself goes, but I find myself at Google and Northern Light these days simply because of presentation!
That was it! (Score:1)
Dilbert used that very item, and that's where I heard it from.
ack! let's not do that (Score:1)
If I have a thirty-year-old piece of equipment, I still want to be able to find a thirty-year-old document describing it in detail. Even better -- thirty years of accumulated information describing it in detail.
Video Game Worker Dismissed... (Score:1)
If you scroll down the page, you'll find the story about Kevin Roseler, an employee at Origin Systems who was dismissed after he abused his privileges at Ultima Online to generate castles/gold/etc. and sell it for $7000 on eBay. What he did wasn't technically illegal, but it was an abuse of power and whatnot.
Pigdog article on the same topic... (Score:1)
The link to the Boston Globe is dead. Check out an article from Pigdog Journal [pigdog.org] about the exact same topic. It also has a link to a BBC article about it.
Web Search Engines Are Falling Down on the Job! [pigdog.org]
1999-07-07 18:18:38
Re:Pigdog article on the same topic... (Score:1)
Whoops, little trouble typing there...
Anyway, just go to the Pigdog front page (www.pigdog.org). The article is right there. This slashdot message board doesn't like the long URL for the article.
Google (Score:1)
In the end, the only solution is to structure the data better than HTML allows. XML, here we come.
Better indexing (Score:1)
It seems to me that the only way we're ever going to get away from ever-deteriorating keyword searches and ever more corruptible and less competent cataloging sites is to switch over to better (i.e. more logical, more meaningful) forms of markup than HTML provides. XML, anyone?
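For example, here's the kind of query that becomes trivial once the markup says what a field _means_ instead of how it looks; the record format below is invented, but any agreed-upon structure would do:

```python
# With meaningful markup, "find documents whose *subject* mentions X"
# is a fielded query instead of a keyword guess. The record format is
# invented for illustration.
import xml.etree.ElementTree as ET

catalog = ET.fromstring("""
<catalog>
  <document>
    <title>Planted Aquarium FAQ</title>
    <subject>aquarium plants</subject>
    <url>http://example.org/plants-faq</url>
  </document>
  <document>
    <title>My Holiday Photos</title>
    <subject>travel</subject>
    <url>http://example.org/holiday</url>
  </document>
</catalog>
""")

def search_by_subject(root, term):
    return [doc.findtext("url")
            for doc in root.findall("document")
            if term in (doc.findtext("subject") or "")]

print(search_by_subject(catalog, "aquarium"))  # ['http://example.org/plants-faq']
```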
Re:Searching the Web (Score:1)
Several journal indices don't measure "quality" per se - they measure impact, essentially.
That is, the more often an article is cited, the more impact the article has - either as a source of truth, radical theory, or wrong thinking.
Re:Search engine coverage (Score:1)
The article should be located here:
http://www.sciam.com/1999/0699issue/0699raghava
Re:Never mind %s, update those links! (Score:1)
Sounds like you might be interested in the Mozilla Directory, i.e. DMOZ.org [dmoz.org] -- it's an Open Source web directory, more or less. There's a feature there for the editors which points out dead links and even sends an e-mail to the editors warning them about the dead link so it can be corrected.
Plus, as the tag line goes, "humans do it better."
I use Google [google.com] for a search engine and DMOZ for a web directory. Either way, I tend to find what I need much more often than not.
-Augie, who is an editor on DMOZ, by way of full disclosure
Re:Searching the Web (Score:1)
Alternatively - say you want to know about biker groups in your area, so you do a search on "biker" - the results are predominantly porn sites, since these sites sometimes have a tangential reference to bikes but reference each other a huge amount, giving them 'quality', in your terms, relative to bikes.
I don't think that quality can be measured using number of references as the primary criterion.
Re:Search engine coverage (Score:1)
Once you add more words to your search, the feature starts to stand out, picking out the most popular of the sites by the number of other pages which link to it.