Searching the 'Deep Web' 193
abysmilliard writes "Salon is running a story on next-generation web crawling technologies, specifically Yahoo's new paid "Content Acquisition Program." The article alleges that current search services like Google manage to access less than 1% of the web, and that the new services will be able to trawl the "deep web," or the 90-odd percent of web databases, forms and content that we don't see. Will access to this new level of specific information change how we deal with companies, governments and private institutions?"
With the 10% that is crawled (Score:5, Funny)
Re:With the 10% that is crawled (Score:5, Informative)
Let me give you an example. I run a forum. The main index page doesn't contain much information, just an overview of the latest posts and a brief introduction.
The rest of the content is what people submit. Here is the problem: the pages are generated dynamically. They end up having URLs like http://domain/index.php?act=showpost&postid=1244
Google sees index.php as one page, and does not attempt to submit any data via GET/POST. This means that effectively the most valuable content is missed.
Of course making it crawl
Re:With the 10% that is crawled (Score:3, Interesting)
Google sees index.php as one page, and does not attempt to submit any data via GET/POST.
Hmm... I see plenty of pages in Google that have URLs with GET parameters, so there must be some way of getting it to crawl them. Or am I misunderstanding what you're saying? Maybe the key here is to provide an alternate route to those pages without doing anything fancy (drop-down menus, radio buttons, etc.). Just generate another page that contains a regular link
Re:With the 10% that is crawled (Score:3, Interesting)
Re:With the 10% that is crawled (Score:2)
print("Some page... [slashdot.org]");
As I understand it, looping is in fact a big problem for robots. There are a number of ways of getting around it. A brute-force method would be to just limit the search tree depth to, say, 20 levels or so (I pulled that number out of my butt, of course, so it would need some tuning based on how many levels you're likely to see on a real site).
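If you want to see it in code, a depth cap is only a few lines. Here's a minimal PHP sketch (the seed URL and the cap of 20 are made-up values, and it assumes allow_url_fopen is on):
<?php
// Breadth-first crawl with a hard depth cap so link loops can't trap us.
define('MAX_DEPTH', 20);
$queue = array(array('http://example.com/', 0));
$seen = array();
while ($queue) {
    list($url, $depth) = array_shift($queue);
    if ($depth > MAX_DEPTH || isset($seen[$url])) continue;
    $seen[$url] = true;
    $html = @file_get_contents($url);
    if ($html === false) continue;
    // Naive link extraction; a real robot would also resolve relative
    // URLs, honor robots.txt, and rate-limit itself.
    preg_match_all('/href="(http[^"]+)"/i', $html, $m);
    foreach ($m[1] as $link) {
        $queue[] = array($link, $depth + 1);
    }
}
?>
The $seen table alone kills simple loops; the depth cap is the backstop for sites that generate endless distinct URLs.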
It wouldn't surprise me to learn that more sophisticated robots (e.g., Google) actually do fairly sophisticated cont
Re:With the 10% that is crawled (Score:2)
If Google is smart, then they'll have robots close to as many servers as possible, preferably at least a 1U box colocated at every non-trivial hosting provider, so that crawling
Re:With the 10% that is crawled (Score:2, Informative)
One word: backlinks. Pages, even with request parameters, that get linked to from lots of popular (high-pagerank) sites get indexed.
Re:With the 10% that is crawled (Score:2)
However, you can modify Apache and/or PHP to use URL/URI names for dynamic pages. You could remap your example query to http://domain/showpost/1244/ and the engines will probably index it. I'm not sure why more message board software doesn't do this. (Okay, probably because it requires httpd & server-side processing coordination.)
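For the record, the remap itself is a one-liner. An untested .htaccess sketch, assuming Apache with mod_rewrite enabled and a forum that reads act/postid as in the grandparent's URL:
RewriteEngine On
# Serve /showpost/1244/ from the real script without touching the forum code
RewriteRule ^showpost/([0-9]+)/?$ /index.php?act=showpost&postid=$1 [L,QSA]
The QSA flag keeps any extra query parameters, so existing links keep working.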
Re:With the 10% that is crawled (Score:2)
Re:With the 10% that is crawled (Score:2, Informative)
However, when search engines do start doing deep crawls, especially if they do POSTs and GETs, then the bandwidt
Re:With the 10% that is crawled (Score:2)
Better yet though, provide a nested view (like Slashdot) and make that the default forum view, with everything else (threaded, individual message, etc.) as additional options. Google will follow a link to a single message, then seeing a ton of links to the same
Re:With the 10% that is crawled (Score:3, Insightful)
If you're too cheap to pay for anythi
Re:With the 10% that is crawled (Score:2)
Deep Web? (Score:5, Insightful)
Re:Deep Web? (Score:3, Insightful)
Re:Deep Web? (Score:2, Insightful)
User-agent: *
Disallow: /
And if they don't listen, feed them a huge maze of generated links that eventually lead to goatse or something. Or just block their crawler at the router and they can search their intranet.
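The maze part is about five lines of PHP. A sketch (maze.php is a made-up name):
<?php
// maze.php -- an endless tarpit for crawlers that ignore robots.txt.
$id = isset($_GET['p']) ? (int)$_GET['p'] : 0;
echo "<html><body><p>Page $id</p>";
// Every page links to five "new" pages, so the crawl never terminates.
for ($i = 1; $i <= 5; $i++) {
    $next = $id * 5 + $i;
    echo "<a href=\"maze.php?p=$next\">more</a> ";
}
echo "</body></html>";
?>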
Re:Deep Web? (Score:3, Informative)
Re:Deep Web? (Score:2, Insightful)
And analogously ... (Score:3, Insightful)
Re:Deep Web? (Score:2, Interesting)
Disallow:
trawler: "Hey cool, thx for the tip I never would have thought to try
Re:Deep Web? (Score:3, Funny)
Progress (Score:2)
ignore robots.txt (Score:1, Informative)
Damn ... (Score:2, Funny)
Re:Damn ... (Score:2)
Oh yeah, a whole new pair of dimes (Score:4, Funny)
Yeah. It means I'll be able to use someone else's credit card for more of my transactions, since finding credit cards, SSNs and other...uh...'deep web' stuff will be so much more accessible.
-Adam
Re:Oh yeah, a whole new pair of dimes (Score:5, Insightful)
Re:Oh yeah, a whole new pair of dimes (Score:2)
If by "almost never" you mean "usually", I'd be inclined to agree with you.
We're talking about application designers who are foolish enough to store credit card numbers in a publicly accessible location to begin with. Do you really think any of them have given thought to deliberately obfuscating the data model enough to store expiration dates somewhere other than right next to the CC numbers and
Re:Oh yeah, a whole new pair of dimes (Score:3, Insightful)
Deep Web? (Score:2, Funny)
Re:Deep Web? (Score:2)
robots.txt should be ignored anyway (Score:1, Troll)
Deep web? (Score:5, Funny)
Re:Deep web? (Score:2)
Hell, it even sounds like the name of a Lovecraftian Horror..
deep web? (Score:5, Funny)
No... (Score:1, Interesting)
only 1%??? (Score:1)
so maybe that's why google never tells me anything about servicing this teletype machine...
it's amazing to think how much more information we'd have access to if google (or another search engine) could search 90% of what's out there. i mean, just at 1% we already say, "google knows all"
Maybe I'm just missing the point... (Score:5, Interesting)
At the very least, it might require an overhaul or extension of the robots exclusion specification to keep spiders out of your data.
you are missing the point... (Score:2)
But if you bypass the front pages... (Score:4, Insightful)
I couldn't care less about Ticketmaster whining about deep linking, but there's probably some stuff out there that, if it isn't taken in the context of its intended point of entry, may cause other problems.
I'm afraid that this is going to give people more reason to go back to using frames, 'detecting' if their content has been hijacked, and writing more bad code that causes multiple windows to pop up all over the place and/or crashes browsers.
Re:But if you bypass the front pages... (Score:2)
Also, there is nothing stopping sites from checking the referrer to display the disclaimer on the first EXTERNAL entry. Also, search engines at present are hardly intelligent enough to automatically avoi
Warnings are there to limit liability. (Score:4, Insightful)
If you have no warnings, then someone can claim that you forced your content on them, and they didn't know what they were getting into, and it was offensive.
Putting up warnings that inform the user they shouldn't enter your site if it's illegal for them to do so shifts part of the burden of responsibility to them, and away from you.
So, if you're sued for having distributed offensive material, you can claim that you provided warnings, and that the person chose to disregard them. [Sort of like putting up 'wet floor' signs -- if someone gets hurt, they made an active decision to ignore the sign]
Re:But if you bypass the front pages... (Score:2)
> bypass the disclaimer pages on porn sites because
> of deep linking?
How many children want to read a disclaimer page anyway? Or agree that they are not old enough to do something?
Re:But if you bypass the front pages... (Score:5, Insightful)
Hello, 1996 is calling; they want their paranoia back!
Goodness, you aren't serious, are you? Have you used a search engine in the last couple years? Have you not ever looked for porn yourself? Just hop over to images.google.com and enter the name of a porn star - bam, shitloads of smut. Not only that, but search google.com for a porn star's name (many of which you could easily find by searching for 'famous porn stars', I'm sure) and you'll find gallery after gallery of porn, open and free.
There is no such thing as protecting your kids from porn on the internet anymore. If you don't want to have them looking at porn, don't let them online or police their actions.
Re:But if you bypass the front pages... (Score:2)
PHP? (Score:5, Interesting)
While I find it highly unlikely that this system will do well with large databases (or even databases at all, for that matter), it is a step in the right direction. Google will probably have their version up on labs inside a month.
Re:PHP? (Score:2)
Perhaps you are doing something wrong? All the dynamic PHP sites I know of are fully indexed by Google.
Re:PHP? (Score:2, Insightful)
There are reams of stuff in there that a search engine can't see. XML could be used to deep-search these entire databases, rather than just the stuff that's pulled into the UI by the PHP code.
Re:PHP? (Score:5, Informative)
http://site.com/blah/prog.php/stat/1
instead of
http://site.com/blah/prog.php?stat=1
I use it all the time and it works really well.
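On the PHP side, that just means reading $_SERVER['PATH_INFO'] instead of $_GET. A sketch, assuming Apache passes the trailing path through (the alternating key/value convention is made up):
<?php
// prog.php -- turn /stat/1 style paths into parameters
$params = array();
if (!empty($_SERVER['PATH_INFO'])) {
    $parts = explode('/', trim($_SERVER['PATH_INFO'], '/'));
    // Read the path as alternating key/value pairs: /stat/1 -> stat = 1
    for ($i = 0; $i + 1 < count($parts); $i += 2) {
        $params[$parts[$i]] = $parts[$i + 1];
    }
}
$stat = isset($params['stat']) ? (int)$params['stat'] : 0;
echo "stat is $stat";
?>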
Re:PHP? (Score:2)
Re:PHP? (Score:5, Interesting)
Freshbot is meant to update the Google cache for pages that change frequently. Freshbot may pull pages as often as every couple of hours for really popular pages that change frequently.
Deepbot goes out once every month or two and follows links. The higher your pagerank, the deeper into your site it will go. If you want more of your site to get crawled here are some tips:
It is likely that deepbot just hasn't run since you updated your site, so freshbot is just pulling your front page occasionally.
BTW: I noticed you have a link to my cheat sheet on your links page. Thanks! :-)
Re:PHP? (Score:2)
Another way of looking static is to use a, say, "index.cgi" within a subdirectory, and then only link to the subdirectory name. For example, a typical month's archive at my site kisrael.com has a URL like http://kisrael.com/arch/2004/03/ even though it's all dynamically generated. (I wasn't smart enough and/or didn't have enough access to my rented webserver to pull off that trick where that URL ends up going to, say, arch/index.cgi and /2004/03/ get interpret
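(For what it's worth, the directory half of that trick is usually just Apache's DirectoryIndex directive; a sketch, assuming an index.cgi sits in each archive directory:
DirectoryIndex index.cgi index.html
With that in place, linking to /arch/2004/03/ quietly runs the script.)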
*Look* static? Be static, dammit! (Score:2)
I have not run across a lot of pages that actually need to be dynamically generated. Shopping carts and account settings need it, but if you make everything dynamic, like most misguided web developers do these days, you simply succeed at slowing your site down to a crawl and evoking a long stream of curses from people like me, who still think that broadband access is not worth $60 a month.
Re:*Look* static? Be static, dammit! (Score:2)
Because of latency. Low throughput slows down everything, but dynamic content suffers more due to the amount of cross-talk (extra round trips) it generates.
> And it will only slow down the server if its done poorly anyway.
Judging from what I see on ALL the web sites I visit, there is simply no one left who can do it well.
Re:PHP? (Score:2)
It's all in how you build your pages.
... everything is actually pulled from source XML files. But the URLs are created in such a way that it appears to be separate pages to a search engine. I've seen the googlebot
For PriorArtDatabase.com [priorartdatabase.com] there is only a handful of actual 'pages'
Re:PHP? (Score:2)
Have you considered using mod_rewrite or a similar solution to convert your complex URLs with query string parameters aplenty into something that looks like a vanilla filepath?
For example, using mod_rewrite the URL of the page I'm typing this on
http://slashdot.org/comments.pl?sid=99804&op=Reply&threshold=3&commentsort=0&mode=flat&pid=8509086
could be rewritten to look like something along the lines of http://slashdot.org/comments/99804/8509086/
From the article (Score:5, Insightful)
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.
There is a reason for this: a Google search should turn up pointers to the items in the so-called "deep web" (*gag*). To use one of the examples above: if I am looking for information on patents, the search terms I use should point me to the US Patent and Trademark Office [uspto.gov]. It shouldn't have to point me to all 12 bajillion patent filings.
Besides, what makes anyone think this is going to fly after all the hubbub over "deep-linking"?
Re:From the article (Score:2, Interesting)
But if you are interested in a specific subject..
Let's say you have a technical problem.
Chances are somewhere on the planet someone submitted the same problem on a web-based forum.
Now you want google to give you THAT specific message.
You don't want google to tell you "hmmm... I guess the solution must be in one of those zillions of forums here, here, and here".
Spiders? (Score:4, Interesting)
Has anyone tried this yet? Change your user agent string to one matching the googlebot and crawl the web. I'm pretty sure many "registration only" websites would magically open themselves, but I wonder about other differences too
Re:Spiders? (Score:3, Interesting)
Re:Spiders? (Score:3, Informative)
I can't speak for everyone, but here we check not only a spider's User Agent string, but also whether the request is coming from Google's IP range or elsewhere. So your results may not be so great.
Then again, I've defeated many registration (er, pr0n) gateways by just setting a Referer header identical to the URL I'm requesting, so some defenses are better than others...
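In PHP, both tricks together amount to a custom stream context. An illustrative sketch (the URL and UA string are placeholders):
<?php
// Fetch a page with a spoofed User-Agent and a Referer header set to
// the requested URL itself, as described above.
$url = 'http://example.com/members/gallery1.html';
$ctx = stream_context_create(array('http' => array(
    'header' => "User-Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)\r\n"
              . "Referer: $url\r\n"
)));
echo file_get_contents($url, false, $ctx);
?>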
Re:Spiders? (Score:2)
Indeed. I do exactly this to access the Insiders Only content on IGN [ign.com]. (You'll also need to disable JavaScript.) I'd feel bad about it, but these pricks clearly intend to deceive. I find links to interesting content through Google, but the link leads somewhere else. I don't mind paid content (I pay for two online magazines), but attempting to mislead both G
Pay to search (Score:2)
Privacy and Crap (Score:3, Interesting)
But in reality, the other 90% would most likely be best left unfound. Who really wants to find out that their parents were not married in the manner that they claimed?
Just as in archaeology, you will find a nice vase or two... but the rest is rubble.
You understand that digging in a garbage dump is the best place to find things in archaeology, because people cleaned their houses then too. That is what the other 90% is... a dump of information.
Google (Score:3, Insightful)
With google storing more than 4 billion web pages, I'd hate to see what kind of crap the other 99% is.
Perhaps they count each iteration of a dynamic page as a separate page? Even so, Google's news page does a great job searching in real time for pages that change dynamically.
Re:Google (Score:2)
Top 4 (Score:5, Informative)
Are you Corn Fed? [ebay.com]
Re: (Score:3, Informative)
1 percent,? (Score:5, Insightful)
1 percent, and I still don't have a problem feeling lucky almost every time I do a search on google.
zRelevancy (Score:4, Insightful)
I run a search engine for an educational institution, and I will admit, Google misses a significant number of our documents. On the other hand, some of those documents are scripts that, when queried, will create a (virtually) infinite amount of data (calendar scripts, etc.). How deep do we really need to go, though? Do we really need to include calendar entries for the year 2452?
I'm also confused: is this search service 'pay by the searcher' or 'pay by the content provider'? It seems to be content provider to me.
Limitations of Google (Score:3, Insightful)
Personally, I find this infuriating. A site I once worked on was available in numerous languages, which could be chosen from a drop-down list box. The upshot of this is that Google has only cached the site in English, meaning users who would use the other languages do not get my site returned when they search in Google.
We need an open-source alternative that can address these problems, as well as get rid of the security concerns and mysterious methods Google uses to rank sites.
Re:Limitations of Google (Score:5, Insightful)
Solution: Web designers, stop trying to be so clever.
If you want your site to be spiderable, don't hide it behind javascript and flash!
Re:Limitations of Google (Score:2)
My solution: load the links normally inside a <div id=...>, but after the page loads and the JS menus are drawn, it replaces the contents of the DIV using the innerHTML property. Consequently, web spiders are able to crawl down to my sub-pages despite not having JS (not that any engines *have* crawled them, mine's just a small personal site hosted on my university account, please don't
Article (Score:3, Informative)
Those of us who place our faith in the Googlebot may be surprised to learn that the big search engines crawl less than 1 percent of the known Web. Beneath the surface layer of company sites, blogs and porn lies another, hidden Web. The "deep Web" is the great lode of databases, flight schedules, library catalogs, classified ads, patent filings, genetic research data and another 90-odd terabytes of data that never find their way onto a typical search results page.
Today, the deep Web remains invisible except when we engage in a focused transaction: searching a catalog, booking a flight, looking for a job. That's about to change. In addition to Yahoo, outfits like Google and IBM, along with a raft of startups, are developing new approaches for trawling the deep Web. And while their solutions differ, they are all pursuing the same goal: to expand the reach of search engines into our cultural, economic and civic lives.
As new search spiders penetrate the thickets of corporate databases, government documents and scholarly research databanks, they will not only help users retrieve better search results but also siphon transactions away from the organizations that traditionally mediate access to that data. As organizations commingle more of their data with the deep Web search engines, they are entering into a complex bargain, one they may not fully understand.
Case in point: In 1999, the CIA issued a revised edition of "The Chemical and Biological Warfare Threat," a report by Steven Hatfill (the bio-weapons specialist who became briefly embroiled in the 2001 anthrax scare). It's a public document, but you won't find it on Google. To find a copy, you need to know your way around the U.S. Government Printing Office catalog database.
The world's largest publisher, the U.S. federal government generates millions of documents every year: laws, economic forecasts, crop reports, press releases and milk pricing regulations. The government does maintain an ostensible government-wide search portal at FirstGov -- but it performs no better than Google at locating the Hatfill report. Other government branches maintain thousands of other publicly accessible search engines, from the Library of Congress catalog to the U.S. Federal Fish Finder.
"The U.S. Government Printing Office has the mandate of making the documents of the democracy available to everyone for free," says Tim Bray, CTO of Antarctica Systems. "But the poor guys have no control over the upstream data flow that lands in their laps." The result: a sprawling pastiche of databases, unevenly tagged, independently owned and operated, with none of it searchable in a single authoritative place.
If deep Web search engines can penetrate the sprawling mass of government output, they will give the electorate a powerful lens into the public record. And in a world where we can Google our Match.com dates, why shouldn't we expect that kind of visibility into our government?
When former Treasury Secretary Paul O'Neill gave reporter Ron Suskind 19,000 unclassified government files as background for the recently published "Price of Loyalty," Suskind decided to conduct "an experiment in transparency," scanning in some of the documents and posting them to his Web site. If it weren't for the work of Suskind (or at least his intern), Yahoo Search would never find Alan Greenspan's scathing 2002 comments about corporate-governance reform.
The CIA and Dick Cheney notwithstanding, there is no secret government conspiracy to hide public documents from view; it's largely a matter of bureaucratic inertia. Federal information technology organizations may not solve that proble
Bad kitty! (Score:4, Interesting)
Re:Bad kitty! (Score:3, Informative)
If I want to find cheap airline tickets, I put "airline tickets" into google, and it'll give me
Useless statistic of the week (Score:3, Funny)
There's a useless statistic if you ask me.
I just wrote a CGI script that, upon a request for the URL "http://bogus.com/nnnnn", returns a page with the text "nnnnn", where nnnnn is any number up to 1000 digits long. So there, I just added 10^1000 pages to the "deep web", of which Google indexes none! (gasp)
So there, Google now indexes less than 0.001% of the deep web.
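For the curious, the whole "site" fits in a few lines of PHP (a sketch; it assumes Apache hands the trailing path over in PATH_INFO):
<?php
// bogus -- serves one "page" per number up to 1000 digits long.
$n = isset($_SERVER['PATH_INFO']) ? trim($_SERVER['PATH_INFO'], '/') : '';
if (preg_match('/^[0-9]{1,1000}$/', $n)) {
    echo $n;   // the entire content of http://bogus.com/nnnnn
} else {
    header('HTTP/1.0 404 Not Found');
}
?>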
True nature of the deep database problem (Score:5, Informative)
Now say I was looking for info from a few weeks ago - Google is not necessarily the best way of finding this info. It's all still sitting there in the database, but it's not on the site's front page. archive.org may have a copy of it, but it would be much better to have google.com talk XML in a standard way to the news site's content management system, and have ALL the data there for a search.
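Something like a site-wide XML manifest is all it would take. A made-up sketch of what the CMS could expose to the crawler (every name here is illustrative, not a real standard):
<?xml version="1.0"?>
<archive>
  <story id="1244" date="2004-02-12"
         url="http://news.example.com/story/1244">
    <title>Example headline</title>
    <body>Full article text, straight from the database...</body>
  </story>
  <!-- one entry per story in the database, not just the front page -->
</archive>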
Re:True nature of the deep database problem (Score:2)
Then what would be the user's motivation to come to the news site, and spend any time there? They could just go to Google and leech all the same content for free.
Funny (Score:5, Interesting)
only missing 90 TB? (Score:2, Funny)
That is what Salon says, and I think that is bull, given that my favorite porn site alone offers 20 gigs of raunchy action.
More search results (Score:2, Funny)
So instead of 5,234,169 search results returned, we will see 45,961,384 results?
Yippee!!!!!
Typo? (Score:2)
Surely that should be 10%, given the 90% statistic mentioned later on?
Insight on the "deep web" (Score:4, Funny)
Re:Insight on the "deep web" (Score:2)
I remember one that actually did sentence fragments but I can't find it in Google. (Probably because the search terms I'm using are flooded with other relevant hits.)
How?? (Score:3, Interesting)
People submit their site, Google goes to their site and visits every link it can find on the main page, then every link it finds on those other pages, etc., so that pretty much the whole site is included.
This obviously means pages which are not linked do not get included in Google's search, so I'm not surprised at the fact that less than 1% is ever crawled.
So how does this new method of crawling work? How can it possibly know what files are on the server if they are not linked in any way? The only way I can think of is a brute-force type method, which seems extremely stupid to me, since that would consume so much of the search engine's resources.
This also brings me onto the next point: like a few people have mentioned, there are certain pages on the web which append strings onto the end or before the beginning of the URL, for example yourname.ismyfriend.com or www.somegamesite.com/attack.php?player=bob&attack
Also, since most of the internet is porn, and this newfound technology will reveal another 90 percent or so of the internet, are we suddenly going to be showered with explicit sites?
Re:How?? (Score:5, Interesting)
Google doesn't just search pages submitted. I've got an Apache webserver running at home, doling out pages for family photos and stats for a local UT2K3 server. I hadn't set up a robots.txt to stop search engines from crawling it (didn't think I needed to), and one day entered my URL into Google, only to find it there.
I've never submitted the URL to google.
Should we assume that Google's already crawled a majority of the sites out there?
BTW, Yahoo has no record of my site in their database.
Re:How?? (Score:2)
Google has a submission page, but it doesn't really do much. The way it works is that a page gets indexed if and only if an inbound link is found in Google's current index.
That means
Re:How?? (Score:2)
But "a href" is not the only way to get from page to page on the Web. There are also form submits, DHTML, and a hundred varieties of Javascript tricks and techniques.
Deep-linking would presumably try to simulate human interaction well enough to take advantage of these more complex methods. For closed-ended systems (e.g., select one option from this pull-down menu), deep-linking will probably work well, but for more open-e
another form of DOS (Score:2, Interesting)
The frontend webservers that serve the static pages are fine (they're already being spidered now), but the dynamic content, largely dependent on databases and such, very likely wasn't built to handle this sort of load. Once the new engines get their hooks into these pieces, they're going to be in trouble.
bad idea (Score:2)
I just don't think that is going to fly.
if it... (Score:2)
Max
On a related note... (Score:5, Interesting)
The so-called invisible web is indirectly related to the "deep web", with the exception that most of it isn't connected at all to the main web. Slashdot has had some articles regarding these hidden segments of the web - but has any progress been made on finding these "lost networks"?
Current theory on networks explains how and why these networks form and separate from the main web of connections, mainly due to the loss of one of the tenuous threads from a supernode to the outlier nodes. When this loss occurs (an intermediary site goes offline, or popularity wanes, or a large meganode dies or stagnates), the network fragments - and getting back to the pages/sites within is nearly impossible, unless you already have a link to the inside, or a friend provides it to you.
Now, it is a good thing that this phenomenon exists - it seems to exist in all robust, evolving networks - whether those networks be electronically connected, socially connected (e.g., Friendster, Orkut, or plain-ole social groupings), or bio/chemo connected (e.g., the brain, the body, etc.).
Even so, I wonder at all the information out there which I *can't* access, because it isn't indexed in some way. Sometimes you come across fragments and echoes in other archives (news, mail, IRC) that lead to these far-off and displaced "locations" - but it is rare, and tedious to do unless you are looking for very needful information.
So I ask again, has anything been done to further the "searching" within/for the "invisible web"?
Re:On a related note... (Score:2)
It will only take one link to reconnect a separate section. Whilst this may not be much for many networks, with search engines that walk the entire network, it's then going to re-enter the indices. At this point, it's connected by more than one link, and thus a bit more robust.
So, these invisible sections will only contain things that no one links to - which is a pretty good definiti
Wrong Conclusion... (Score:2)
The fact is there is TONS of great independently published stuff that will never be found through Google because the author doesn't take the time to play the SEO game and advertise their page all over the web. Google's algorithm is far from the final word in relevancy algorithms. The evolution will continue until we have sea
what google should do (Score:2)
I've found this to be true.... (Score:2)
For example: try finding a biography on 'Louis Hebert' on the net. You'll find a few pages, some of them go
Sailing the seas of cheese (Score:2)
The guy giving the speech claimed that he was a retired FBI agent and seemed to have a great deal of insight into the inner workings of national intelligence. As
Re:AKA goodbye robots.txt (Score:2)
Nah, I'm sure the contents of the robots.txt file will be read, and the file itself will be listed in the index too
Good-bye riaa.org.
Re:AKA goodbye robots.txt (Score:2)
Something like this: robots.txt [google.com]