Is the Internet Becoming Unsearchable? 313
wergild asks: "With more and more sites going to a database driven design, and most search engines not indexing anything that contains a query string in it, we're missing a lot of content. I've also heard that some search engines won't index certain extensions like php3 or phtml. Is anything being done about this? How can you use dynamic, database driven content and still get it indexed into the major search engines?" Is keyword searching obsolete? Do you think it's time to index sites by the type of content they carry rather than the content itself? Will larger indexing databases (or a series of smaller, decentralized ones) help?
Directories (Score:1)
Catchup (Score:2)
Searching searches? (Score:1)
Extend the Robots.txt protocol... (Score:3)
Distributed Databases? (Score:2)
The effort put in by sites makes up for it. (Score:1)
There are ways... (Score:1)
dynamic content - without the use of different extensions or URLs that contain query strings. Apache is awesome (in case you haven't heard)! Almost all of our HTML files actually contain embedded TCL code, so the servers are configured to parse every *.html file - allowing us to use the *.html extension for files that have dynamic content. We also use things like mod_rewrite to send requests to a single file that tells the file what data to use and how to behave. We could have an entire range of sites served out by a single file... even making it look like they have their own directories, when in reality they don't exist.
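For the curious, the relevant bits of httpd.conf might look something like this (only a sketch - the handler name and paths are made up, and the real directives depend on which module actually parses the embedded code):
    # Hand every *.html file to the module that parses the embedded code,
    # so dynamic pages can keep the plain .html extension.  The handler
    # name "parse-tcl" is hypothetical; use whatever your module registers.
    AddHandler parse-tcl .html

    # Let one script serve a whole tree of "directories" that don't really
    # exist, by rewriting them into calls to a single handler script:
    RewriteEngine On
    RewriteRule   ^/sites/(.+)$  /handlers/site.tcl?path=$1  [L]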
Not at all searchable (Score:2)
Perhaps doing away with keywords entirely, getting search engines to look at the content instead of the "false content" of meta tags... now that would be nice.
Search engines are useless. (Score:1)
Customers :) (Score:2)
Is it even possible to index dynamic pages? They don't really exist until the page is generated. Perhaps the best thing to do for sites that want to be indexed is to make sure they have a plain, vanilla index.html page that contains relevant keywords?
Dana
diligent searching. (Score:1)
Short answer, yes. Long answer->I'll find it if given an afternoon or two.
mcrandello@my-deja.com
rschaar{at}pegasus.cc.ucf.edu if it's important.
Rethink the way we index? (Score:1)
Not a problem (Score:1)
#2: technologies like XML may give a standard interface to databases, so that search engines can index databases directly.
IMO, a much bigger threat to the "searchability" of the internet is the rapidly growing amount of information -- and with it, the amount of misinformation.
Parallel static pages for search engines (Score:1)
One obvious possibility is to generate - using the database - a set of static pages as "targets" for the search engines. This could be done weekly or monthly, for example. Each target page would contain a prominent link to the dynamic database-driven front-end of the website, so that searchers could find the site and then quickly get directly to the main front end. Not particularly elegant, but it seems like a reasonable work-around for the time being. The real solution, in the long run, will involve more sophisticated searching and indexing paradigms.
What do people think about this approach?
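As a rough illustration only (the database name, table, columns and paths are invented), the weekly job could be a Perl script that walks the database and writes one flat, crawlable page per record:
    #!/usr/bin/perl -w
    # Weekly job: write one static "target" page per database row.
    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:catalog', 'user', 'pass') or die DBI->errstr;
    my $sth = $dbh->prepare('SELECT id, title, body FROM articles');
    $sth->execute;

    while (my ($id, $title, $body) = $sth->fetchrow_array) {
        open my $fh, '>', "targets/article-$id.html" or die "article-$id: $!";
        print $fh "<html><head><title>$title</title></head><body>\n";
        print $fh "<h1>$title</h1>\n<p>$body</p>\n";
        # the prominent link back to the live, database-driven front end
        print $fh qq{<p><a href="/cgi-bin/article.cgi?id=$id">Read the live version</a></p>\n};
        print $fh "</body></html>\n";
        close $fh;
    }
    $dbh->disconnect;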
Parsing .html with PHP3 not impossible (Score:2)
Database driven web pages are 'spam' (Score:3)
Honestly though. With something that is inherently dynamic like the internet, it is already near impossible to catalogue and make it searchable. Just to illustrate this take any given news site. Today they might have articles about Clinton, tomorrow it might be news about a big fire. Search engines can't just direct you to those sites based on queries because who knows what data they have.
Even if a search engine was able to validate the content on every site before it gave you the url it could still change by the time you actually got to see it.
So quite literally there isn't even a clue of a way to catalogue a database generated web site. Now granted, I know there are plenty of sites like Slashdot where eventually the 'content' settles down and becomes static. Still, how are you going to get some stupid program to verify and validate that for *every* dynamically generated web page? I don't think you can.
The web was created to be open and dynamic and it will stay that way. I've heard people say that maybe there should be *more* interoperability between things like search engines and spiders. This in my mind would do more damage.
Besides is it so bad that spiders don't get these pages? It probably isn't even reasonable because it would add that much more complexity to the search engine to catalogue what it finds. How do you rank content?
Anyway... just my 2 cents or so...
Internal Site Searches More Difficult as well (Score:2)
Now most content is stored in a SQL database. While it is fairly easy to search an SQL database, returning the information in usable form is not. This is especially true once you have many types of tables containing many different types of information.
Currently, the search engine on the site I work on has its own built-in forms for information from each type of table, but this method takes a lot of maintenance.
Another possible way is to point to the page (php3, asp,
It is about time someone developed some technology to do "smart searches" of sql data and return useful information without having to write a template for each and every type of data that might be queried.
I might be off my rocker a little bit on this, but I cannot believe I am the only one experiencing these problems.
-Pete
Use dynamically generated static pages (Score:1)
You can also make those static pages keyword and meta tag heavy without affecting the user experience.
New method? (Score:1)
Perhaps it's time for search engines to search by topic and direct to a site related to the enquiry. The individual sites could then have their own search utilities to trawl through their databases? Not sure if this is feasible or not though.
In terms of good search engines though - Google [google.com] and AllTheWeb.com [alltheweb.com] seem to find good content whenever I use them. The problem I guess is that you don't know what you're missing until you find it by some other means, and neither do the search engines.
Uses Keywords Luke! (Score:1)
I also find that self-registered index sites (like WebRings) can be useful. Maybe a search engine for WebRings (e.g. looking up 'Elbereth' on a Tolkien WebRing) would be useful (I have to check whether one already exists).
Personally, I use specialized index sites (like NewHoo, Linux Life or Freshmeat) when I'm looking for something. Those sites will just have more value in the future, IMHO.
Searching ineffectiveness (Score:1)
The sheer volume of websites out there makes effective searches difficult. I imagine a search engine could be tuned for better results, but will people be willing to wait while it crunches through data longer than a shoddy counterpart?
Re:diligent searching. (Score:1)
Search Engine to Search Engine protocol (Score:1)
I seem to be able to find what I need... (Score:2)
Indexing dynamic-content sites. (Score:1)
We're not there.... yet (Score:3)
Still, I see a potential threat in information becoming unmanageable and, most of all, in ways of finding information being abused (like using unrelated keywords just to get some visitors). Stanislaw Lem, the Polish SF writer, described this situation in many of his books - starting in the 60s, when no one was even starting to think about such problems. Sooner or later we'll have a large branch of computer science dealing only with searching for information on the Internet; searching services are already available, but they are either incomplete or not evaluated. The latter is the key: and Google is the first service I'm aware of which tries to automate evaluation (by counting links pointing to a specific page).
There was a lot of talk about "Internet agents" a couple of years ago (I remember an article in Scientific American...) - could some good soul explain to me how the situation is now?
Regards,
January
It's been unsearchable (Score:2)
These days, there is so much junk and bad indexing, that I may as well put the shingle back out. Almost any search will find mostly commercial sites, unrelated to the search, or completely useless garbage.
You almost have to be in a bizarre frame of mind to create a good search term these days.
Mark Edwards [mailto]
Proof of Sanity Forged Upon Request
Re:Extend the Robots.txt protocol... (Score:2)
'Karma, Karma, Karma Chamelion' -- Boy George
Black holes (Score:3)
html htm asp php shtml php3
I guess I'll add phtml
Other extensions and urls with query strings are ignored. This is mainly for self defense. There are many, many infinite loops and blackholes on the web and they're hard to avoid. For instance, my spider once got stuck on a server that would return the contents of
GET
found foo/broken.html
GET
webserver couldn't find path, so returns
GET
etc.
What was the programmer thinking?
This is just one example of the blackholes that lurk on the web. It was completely unexpected and pretty difficult to detect. What if someone wanted to write a search engine trap? I don't believe there is a simple solution to this problem.
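One crude, partial defence - just a sketch, not necessarily what any real spider does - is to refuse URLs that look pathological before fetching them, e.g. absurdly deep paths or the same segment repeated over and over:
    use strict;
    use URI;

    sub looks_like_trap {
        my ($url) = @_;
        my @segments = grep { length } URI->new($url)->path_segments;
        return 1 if @segments > 15;                 # suspiciously deep path
        my %seen;
        $seen{$_}++ for @segments;
        return 1 if grep { $_ > 3 } values %seen;   # same segment repeated
        return 0;
    }

    # in the crawl loop:  next if looks_like_trap($url);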
Ryan
Re:Catchup (Score:1)
Domain Names are the Kludge to the Problem (Score:1)
Until there's a standard, the search engines will continue to miss more and more of the sites out there. XML may be the answer to indexing and exchanging data. However, on the bright side, the difficulty of finding data makes censorship much more difficult for the censors - and that's a good thing.
While they are nice we should already have more (Score:2)
Categories are nice but some (most) sites are personal sites, and these sites change quite often in subject matter.
While the categories are nice, we should have a community-planned and maintained categorical system along with a plain text search. Have identifier tags that go along with every web site, and then have a standalone and a web-based version of this program which will allow anyone to create a hierarchical listing of anything according to certain tastes and parameters.
Two-level structure (Score:2)
It is getting more and more so that to find an answer to a somewhat obscure question, I need first to find major sites on the topic, and then do a search through their databases or mailing list archives. I believe this reflects a real-life structuring of the Web and will have to be taken into account by next-generation search engines.
Kaa
Centralized Searching is the Wrong Approach (Score:1)
The list of problems that exist for centralized search engines goes on and on: dynamic pages (of course), missing/broken/changed links, getting to new pages, and so on.
What I think could be done is to define a search protocol (perhaps through some kind of search://domain/search+terms method) that is standardized. The global search engines then search by determining the most likely sites to have information for you and querying those sites directly for information. This would fix the problem of broken/missing/changed links being reported, new pages would automatically be available (assuming sites updated their search engines quickly), and if the local search engines are integrated with dynamic page generators (which should be possible) then those pages could be searched too.
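Purely as a sketch of the idea (the search:// protocol doesn't exist, so this pretends every site exposes an agreed-upon /search.cgi endpoint over plain HTTP; the hostnames are invented):
    #!/usr/bin/perl -w
    # The "global" engine picks likely sites for a query, forwards the query
    # to each site's own local search, and merges whatever comes back.
    use strict;
    use LWP::Simple qw(get);
    use URI::Escape qw(uri_escape);

    sub query_site {
        my ($host, $terms) = @_;
        my $url  = "http://$host/search.cgi?q=" . uri_escape($terms);
        my $hits = get($url);            # each site runs its own local search
        return defined $hits ? $hits : '';
    }

    my @likely_sites = ('www.example.com', 'docs.example.org');   # chosen by topic
    print query_site($_, 'dynamic page indexing'), "\n" for @likely_sites;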
I realize that a lot of work would be needed to be put in to this in order for it to work. A protocol would need to be developed, as well as servers for the protocol. Search engines would have to learn to efficiently decide which sites to query to complete their searches, etc.
Perhaps a combination of both approaches could yield something even better. All I know is that what is out there right now, well, fails miserably.
Static gateways? (Score:3)
Search engines not picking up on php3 is a bit worrying though, all my sites are written purely in php3, although I never seem to have any problems with getting listed.
Gateway pages are a good way of making sure you get listed with the keywords you want, although they aren't very dynamic and unless you get really clever don't tend to reflect the contents of a regularly updated site... however it seems to me that you can only really hope for *a* listing these days, not an index of all of your site.
Even Google has a 3-month disclaimer on its submit page; that's a mighty long time if you are looking for support on a brand new motherboard.
LASE seems to be the way to go... subject-specific full text indexes which spider regularly and can index specialised data, keeping it up to date.
However you would still need a search engine to find a LASE that will get you what you want, but at least it's a bit more structured!
There are many ways round the search engine problems, and keeping on top of them is a full-time job. Submit-It doesn't come close; it hasn't changed in the past 3 years. Search engines, however, have!
IMO a combination of all of the above will get you where you want. Keywords and Meta Tags still count, and you have to be persistent.
fault bad browsers and no index of quality (Score:1)
The most common way though people find out about worthy dynamic content sites I think is word of mouth. We could use more forums and link referrals to share websites we have found useful. This has the very distinct advantage over search engines of providing a better filter of QUALITY of information. After reading someone's recommendation of slashdot or an article elsewhere, I won't have to hurdle 19 irrelevant hits to get there.
Re:Catchup (Score:1)
And alas, as far as the crap that accounts for 75% of the web goes, the cost of accommodating the vast quantities of crap is less than the cost of removing it or improving ways of avoiding it; and until that is no longer the case we're going to have to put up with ever increasing quantities of sewage.
The Open Directory Project (Score:2)
XML? (Score:2)
Whatever happened to that? I don't mind all that much being taken to the front page of a site if I know that site has the information somewhere in there, I just hate having to hit seven sites to find that one.
Hotnutz.com [hotnutz.com]
Challenges for searching the web..... (Score:2)
Some of the challenges which will be faced in searching the web in the future will be:
1. Displaying matching URLs as well as links which match the type of content. This is important. If I search for "throat infection" on a search engine..apart from the pages which mention "throat infection"
Search engines will have to maintain huge databases linking words to categories. And with the proliferation of the internet, the number of sites carrying content and disallowing search engines is going to increase. Search engines need an intelligent way to get around this.
2. Search engines will need to "help" users with their searches. For example, if I just search for "throat" the search engine should have a helper section where it can ask me more...whether I am searching for "throat infection" or "study of the throat" and so on.
3. Search assisted by humans. This is also one of the concepts picking up these days. Basically you submit a question and there will be some person searching the web, and you will get your answer in a few hours/days. Check out www.xpertsite.com.
4. Tools for better maintenance of bookmarks. I for one usually bookmark all relevant stuff and then I spend a full weekend arranging them so that I can find the relevant stuff from the bookmarks quickly. The current bookmarking scheme is very primitive, causing a lot of users to "reinvent the wheel" (searching for URLs which are already bookmarked).
Phew!
I'll jot down more thoughts later. Gotta work now.
CP
XML (Score:2)
I want to know if Linux is on top of this. Microsoft has an XML notepad available and I hear that it's going to be all over Win2000 (in the registry even). XML will be the foundation of the new internet and we don't want Microsoft to have a technology edge there, do we? Perl has XML modules, as I am sure other languages do too (Python). Let's get some apps written!
What about Gnome and KDE? This could help make their projects easier, especially KDE, with all of the object similarities between CORBA and XML and object RDBs. All config files could theoretically be stored in XML. We need to push this one, people!
-pos
The truth is more important than the facts.
Re:Not a problem (Score:1)
You mean like Sherlock? (Score:2)
One solution that attempts to address this is Apple's Sherlock [apple.com]. It uses XML to pass queries to web sites and return results. There are certainly some limitations: you have to choose which web sites you want to search (although this isn't always a bad thing), these web sites have to support Sherlock queries, and it only works on the MacOS. Currently lots of big name and Apple-specific sites support it.
The dev info at Apple is pretty clear though. It wouldn't be difficult for others to create clones for Sherlock that either work over a web interface or on other OSes too. (dunno if Apple could...or would... make any claim against this)
Scott
Is searching DBs really necessary? (Score:1)
Eventually, the *end user* has to do the information filtering, so you might as well take what you can get FAST so you can move on if you don't see what you need. Indexing every database or dynamic page on the web would slow down engines to a crawl. Do you honestly want Altavista bringing up books from Amazon, companies from the Thomas Register, and patents from the USPTO? There's no need for this. If you want specialized information, go to a specialized source.
Spider traps ... (Score:4)
Some time later, it occurred to me to try and monitor the efficiency of web indexing tools using a spider trap.
The methodology is like this:
Anyone done this? I'm particularly interested in knowing how spiders handle large websites -- have been ever since I was doing a contract job on Hampshire County Council's Hantsweb site a few years ago and caught AltaVista's spider scanning through a 250,000 document web that at the time had only a 64K connection to the outside world. (Do the math! :)
It's not obsolete... (Score:1)
Where do you find the most dynamic content? News sites. Slashdot, Freshmeat, Linuxtoday, Yahoo! News, etc. These are the sites that need dynamic content.
Ironically, these are the exact sites that search engines are pretty much not interested in indexing, anyway. Even assuming that a database can update all its sites once per day, that means that the information is a day old-- centuries, in Slashdot time! People don't go to AltaVista to search for the story over at ABCNews.com. They go to AltaVista to find information about international child custody laws (to name a random hot issue of late).
Most of your general information stuff is pretty much static. This is what the search engines look for anyway-- this is the stuff that doesn't change often, so it's good stuff to record. Why would anybody bother to make a page about Cup 'O Noodles that's generated through a Perl script? It's too tough, and can be a huge pain in the ass to change it.
Why index the pages that are constantly changing, when the stuff you're looking for (by definition) doesn't change much? Sure, there's overlap (small sites that generate the exact same content every time). But it's such a small segment that hardly anybody would miss it (yes, it may be important, but not important enough to totally revamp the indexing procedure).
Indexing dynamic content (was Re:Customers :) (Score:1)
Yes, for a very large category of dynamic pages, it is. For example, in an online shop, the actual number of a particular product in stock at the moment may vary from minute to minute, and the price of that product in the user's preferred currency may change from week to week, but the product itself doesn't change much over months or years. It makes perfect sense to index the product page, because although some of the contained data may be transient, a great deal more is not.
Or take another example: the weather forecast for a particular area. The forecast itself may change regularly, but the page always contains a current forecast and that fact is worth indexing. The best technology available for this sort of thing is probably RDF [w3.org] and the Dublin Core [purl.org] metadata specification. Of course, the search engines still have to be persuaded to take heed of this...
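For example, a forecast page might carry Dublin Core metadata along these lines (a sketch only - the place name and values are invented):
    <!-- The forecast text changes daily, but what the page *is* stays
         stable and indexable, which is what the metadata describes. -->
    <head>
      <title>Five-day weather forecast for Cardiff</title>
      <meta name="DC.title"       content="Five-day weather forecast for Cardiff">
      <meta name="DC.description" content="Regularly updated local weather forecast">
      <meta name="DC.subject"     content="weather; forecast; Cardiff">
      <meta name="DC.date"        content="1999-12-14">
    </head>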
Two ways things might go (Score:2)
Therefore, it would be entirely feasible to have a system in which regular users saw regular pages and web crawlers saw a "static" index page, all at the same URL.
This would allow web crawlers to index according to genuinely useful keywords, rather than by how the crawler's writer decided to determine them.
An alternative approach would be to distribute the keyword database. Since all the web servers have the pages in databases of one sort or another, it should be possible to do a "live" distributed query across all of them, to see what URLs are turned up.
This would be a lot more computer-intensive, and would seriously bog down a lot of networks & web servers, but you'd never run into the "dead link" syndrome, either, where a search engine turns up references to pages which have long since ceased to be.
Re:Directories (Score:1)
mcrandello@my-deja.com
rschaar{at}pegasus.cc.ucf.edu if it's important.
The searchers and the searchees are ever-changing (Score:3)
I think the real problem with searching really isn't that the Internet is growing too large. The central problem with it being too hard to find information is due to the unfortunately ever-changing nature of HTML. (Yes, I know there are much better solutions out there -- I work with some of them on a daily basis. However, we seem to presently be stuck with HTML and its variants.)
It's a self-feeding monster, whose typical cycle goes as follows: SearchEngineInc (a division of ConHugeCo) creates a new technology that really impresses people with its ability to find what they want more quickly. (Right now SearchEngineInc is probably Google [google.com], at least in my view.)
Once the new technology takes root, content authors (well, maybe not the authors so much as their PHBs) note that SearchEngineInc doesn't bring their business (which sells soybean derivatives) to the top of the search list (when people type ``food'' into the search engine). Said PHBs make the techies work around this ``problem'', and all of a sudden SearchEngineInc's technology isn't so great anymore because the HTML landscape it maps has changed.
A similar situation occurs when PHBs think their site doesn't ``look'' quite as good as others. (Insert my usual rant about content vs. presentation here.) Whether via a hideous HTML-abusing web authoring program, or via all sorts of hacks that God never intended to appear in anything resembling SGML, the HTML landscape is changed there as well, and SearchEngineInc's product becomes less effective.
What's the solution to this? I'm not quite sure. Obviously there are better technologies out there that are at least immune to PHBs' sense of ``aesthetics'' but I would wager few of them are immune from hackery. I'd say that search engine authors are doomed for all time to stay just one step ahead of the web wranglers. At least it assures them that their market segment won't go away any time soon. :-)
Unsearchable? Possibly... (Score:1)
So, I believe the internet is outgrowing the current search engine technology.
-- Shadowcat
Computer indexing too primitive (for now...) (Score:1)
I'm not saying that this problem won't be figured out at some point. It's going to take a little more technology than we have right now, but no doubt it's on its way even as we speak. (Any AI experts out there?)
Until then, indexing by hand seems to be the only 100% solution. Humans are fallible, but much less than the machines are at this present stage. Plus, directories geared towards specific topics would help narrow down your search before you even start searching.
Hide the query string (Score:1)
There is no excuse for having a purely database-driven website that does not appear to be straight HTML pages. If you have ?s everywhere then you're just lazy.
Firstly, even though you might pull everything out of a database, a large per cent of all such content is not really all that dynamic, which means you're probably better off precompiling the page down into static HTML, and recompiling it only when its content changes.
Secondly, if you have a script with a messy query string you can turn it into something that doesn't look like a script at all, e.g., /cgi-bin/script.cgi?foo=bar&this=that could be presented as /snap/foo/bar/this/that.
With Apache, you would just define a handler and pass the request off to it; the handler would pick up the parameters in the PATH_INFO environment variable. If people tried URL surgery, you could just return a 404 if the args made no sense.
Search engines are your best (and probably only) hope of getting people in to visit your site. It's up to you to make sure your URIs are search-engine friendly. If they can't be bothered to index what looks like a CGI script, well that is your problem. There are more than enough pages elsewhere for them to crawl over and index without bothering with yours.
Re:Not at all searchable (Score:1)
So then you put invisible content in the page instead. Same result.
There will always be a way to "fool" indexing robots if you're creative enough.
False positive hits. (Score:3)
If you do a search for Cortknee or Lotta Top you'll get a bazillion hits and 90%+ of them are "Click here to see young virgins having sex for the first time on their 18th birthday!"
As we all know, but nobody likes to admit, pron is the fuel that makes the net go 'round.
Many other sites have taken hints from the pron people. I'm sure that it was a deal of some sort, but every time I do a search on MetaCrawler, no matter what I search for, I get a link to search a certain bookstore for books on the same topic.
Commercialism and shady practices are what are making the net so hard to search.
LK
Re:Searching searches? (Score:1)
I think the search engine community needs a paradigm shift in their way of approaching searches now, with the curve dynamic information has thrown at them. I don't know how well standards would work in this situation. It's up to the search engines to come up with a new way of sorting the huge amounts of data they collect in an orderly fashion so they can serve us searchers with exactly what we ask for. Ok, so "exactly" is probably stretching it a bit, but I'll settle for pretty damn close.
Re:Black holes (Score:1)
Just curious.
Searching... (Score:3)
Our library has a wonderful online database where you can type in keywords and search for them, but the keywords only look as far as the Title, Author, or abstract of the book. If you wanted to look up some narrow topic, you can't expect that there's books written exactly on that topic, but there's always bound to be a few books out there that have a few pages dedicated to that subject (but isn't listed in the abstract). So, what do you do? You have to get your hands dirty.
My topic: Holy Wisdom (I won't bore you with details, but just stick with the subject). Looking in the online database, I find that there are zero books on the subject. Darn. Let's do some lookin...
After I read in a few Religion Dictionaries, I find that Holy Wisdom is also called "Sophia." I go back to the catalog, type in "Sophia," and I get one book. I skim this one book, and find that Sophia has sometimes been associated with the Holy Trinity. So, I go back to the catalog, enter "Holy Trinity," and BOOM, I get back 400 results (anyone seeing a similarity here...). Let's limit them...we'll search within the results for "History of," and I get back about 11 results. I read the abstracts, find a few books of interest, and start skimmin...
...Well, whadda know, there's a page in one book that talks about Sophia, and half a chapter in another book that talks about Sophia as well. There's a few more sources for the paper!
Now, for those of you who just don't understand what I'm trying to say here, just read from here on, cause here's my point: Computers aren't smart enough yet to "guess" at what we want, and personally, I don't think they ever will. Internet keyword searches are just like asking someone to help you who has no idea what your topic is...they can only search for what you ask them to search for.
Internet keyword searches are a hassle, and many times the first few returns won't be anything CLOSE to what you want (search for "Computer Science," you get back porn, search for "Linux," you get back porn, search for "White House,"...). But if you learn how to dig, like the people who lived fifty years ago WITHOUT Boolean Searches, you'll find what you're looking for. Sometimes, it's just like searching for a topic...you might not find anything directly, but you can't sum up an entire book in just a paragraph either!
Try some links, look around, and it'll be there!
PHP / Dynamic Pages Are Indexed (Score:2)
If you can get beyond the backend concept of a dynamic page, most pages really appear to be quite static, from an indexing perspective. An http-based indexing system (as opposed to filesystem-level) can't tell that pages are dynamic, and doesn't care.
I've never had a problem with search engines failing to index pages just because they had convoluted URLs. If some engines do that, it's a bloody shame.
Re:XML (Score:2)
yep. (Score:1)
The point is there has to be a link there in the first place. They will not be able to index a dynamic page if it is only accessible through a "form" post.
The way you can get around this is to have a hidden (to users) page on your site with hardcoded (or database generated) links into the dynamic content that you'd like visible from search engines.
For example, if you have a whole heap of news articles on your site, with one per page, you can make a dynamic page called "newslinks" which, when fetched by a crawler, queries the database and writes links to every news article in the site.
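A rough sketch of such a "newslinks" generator in Perl (the database name and schema are made up):
    #!/usr/bin/perl -w
    # One page that links to every article, so a crawler that only follows
    # plain links can reach all of the dynamic content.
    use strict;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:news', 'user', 'pass') or die DBI->errstr;
    my $sth = $dbh->prepare('SELECT id, headline FROM articles');
    $sth->execute;

    print "Content-type: text/html\r\n\r\n<html><body><ul>\n";
    while (my ($id, $headline) = $sth->fetchrow_array) {
        print qq{<li><a href="/news.cgi?id=$id">$headline</a></li>\n};
    }
    print "</ul></body></html>\n";
    $dbh->disconnect;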
cheers, j.
Meta-Engines (Score:1)
I think we'll see more topic-specific search engines (I use trade rag sites exclusively for really good info on tech news, for example) linked together through the big search engines. The main engine (Google, or whatever) will check the search term to see if that term has been pre-linked by the engine managers to generate a search on a more topic-specific engine (for example a search on "market size" may cause the engine to do a lookup on the northpoint search engine) or engines, and then combine the results of its own search with that of the topic-specific engine for relevant results.
It's the whole idea of vertical portals taken to the next level. The vertical portals provide topic-specific searching capabilities over the 'Net to the behemoth engines and portals for a fee, or something.
Remember, the user will not get smarter, but will rather look for the faster and easier solution.
IMHO.
Semantics Antics (Score:3)
The problem is centalization (Score:1)
This looks like a job for.....XML! (Score:1)
Wasn't this sort of thing what XML and RDF were originally designed for?
less page, more site (Score:1)
as a site can be described by keywords even if its subsequent pages are database driven. I like searching by site usually anyways - provided that the site has a nice search engine
Re:XML (Score:3)
Not so, fortunately. A certain very large telco (which I'm not yet allowed to name) is now running its Intranet directory on an XML/XSL application which I've written. The application was developed on Linux and is currently running on Linux, although the customer intends to move it to Solaris.
My XML intro course is online [weft.co.uk]; it's a little out of date at the moment but will be updated over the next few months.
XML and particularly RDF [w3.org] do have a lot to offer for search engines - see my other note further up this thread.
Specialized Engines - Not More Engines! (Score:2)
The answer to all this isn't going to come from making existing engines better, nor is it going to come from bigger, badder, faster database engines powered by your friendly clustering technologies!
The answer is simple: More specialized search engines. You're looking for technical stuff? Then you should be able to search a technical database. Like, if I'm looking for source code to model fluid flows - that's pretty specific already. There's no reason that I should have to wade through all the references to "bodily fluids" that I'll get on altavista for instance!
Search engine people, take note of this. Classify your URLs into categories - like Yahoo - but come up with some way to do it automatically. Or even better yet, let the users do it, a la NewHoo [newhoo.com].
End of internet predicted. Film at 11. We've heard it before, and we'll hear it again. Just need someone with a little VC money to throw it towards an idea that supports more specialization in search engine tech.
Kudos..
From a web page owner (Score:2)
XML (Score:2)
Why not use ScriptAlias (Score:3)
Jon
Re:Not a problem (Score:3)
Any reasonable search term is likely to present results like "Search returned 417,373 hits. Hits 1-10 displayed." You have to then winnow by adding include and exclude words until you get it down to a manageable 7,422 hits, then you browse them.
The truth is, I turn to wide searches quite rarely. I tend to find and "bookmark" authoritative sites I find on a given topic and return to those over and over again. It is only when a site grows noticeably stale or I have to research a new topic that I turn, reluctantly, to search engines. As for indexing database sites, I like the idea of extending the robot hack. Slightly less appealing would be to have a new HTML tag to include "bot content" in any page, including dynamic pages. An XML solution is a good idea, but I wonder how long before every extant site gets XML-aware? That plus XML is almost too flexible, making it likely that a hundred competing methods for indexing dynamic pages will appear and no one will know which one to cling to.
Dynamic Pages not indexed (Score:2)
What if, when a webbot sees a dynamic page, it changed the query to ?Webbot and expected to get back a specially formatted page: starting with <H1>Webbot index</H1> and followed by a set of comma-separated keywords, a break, a URL, a paragraph, then the next set? The webbots would be happy, as they don't have to waste bandwidth and CPU time spidering over the site; the server should be happy, as it doesn't have to support the webbot's spidering; and the site owners should be happy, as they can specify what keywords each result will be indexed under. Obviously, just reformatting the index to the product database could generate this page for an e-commerce site, and more static sites could just use a static statement of what their site carries.....
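An example of what such a ?Webbot response might look like, following the format proposed above (entirely hypothetical - this is not an existing standard, and the keywords and URLs are invented):
    <H1>Webbot index</H1>
    linux, kernel, upgrade, howto<br>
    http://www.example.com/article.pl?id=101
    <p>
    modem, dip switches, 14.4k<br>
    http://www.example.com/article.pl?id=102
    <p>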
--
Searching (Score:2)
It would be nice if a generic search engine worked in the following way:
1. User searches for say "Cisco VPN Routing"
2. The search engine identifies sites www.cisco.com and other sites which are related to the search query string.
3. Instead of trying to index these sites itself, it calls on the search engine at the site matching the context and queries it instead.
4. Returns the results of the search at cisco.com to the user.
It's kind of like a distributedSearch, where the actual search is done by the holder of the data, all that the search engine actually does is try to find a context for the Search Query and find sites with their own search engines that match that context.
So in answer to your question: My answer is No, the Internet isn't unsearchable, we just haven't implemented a reasonable standard for searching, which can be as important as routing when it comes to a network of the size of the Internet.
Re:Database driven web pages are 'spam' (Score:2)
And yes, even slashdot uses this scheme
does fast == good? (Score:2)
For example, a few days ago I was looking for the dip switch settings on an old 14.4k modem. Now I *knew* the info was out there on the web somewhere. I also thought it was highly unlikely to be in any of the major search engines in-ram indexes. I would have been quite happy to submit a boolean or reg-ex query to a search engine and then check back an hour later to get the results.
In my mind, instant gratification search engines are useful and have their place, but I see a whole segment which just doesn't seem to be addressed. Is anybody even thinking about working on this?
-matt
New Standard? Look at Porn Sites (Score:2)
Give anyone the ability to talk directly to search engines and you'll see what has been happening with those damn porn sites on a large scale - do a query for anything, and it'll come up with a totally unrelated porn site for you.
People figured out how to abuse keywords real quick, and this would just make it worse. Which is why I wonder about the continued existence of search engines. I use \. as my search engine - I use it to index my way into the web every day. I think that's the way of the future.
PS I hate the G3 keyboards. They're tiny! It's like carpal tunnel syndrome x 5!!!
Re:It's been unsearchable (Score:2)
Re:One use for the whole e-Speak shebang? (Score:2)
I'll be flipping through the E-speak tutorial for the rest of the afternoon!
Artificial Librarian (Score:2)
Browse a few relevant papers and find some keywords to search for more of the part of the field in which you are interested:
Re:Semantics Antics (Score:2)
Unfortunately, even Google is fast becoming useless - practically every search I do on it results in thousands of mirrors of some page describing some RPM that I care nothing about.
Search engines act like Lem's Demon of the Second Order right now - returning lots of information, but very little of it of any use or relevance. I've thought a bit about ways to improve it - say, a perl script that queries half a dozen search engines, and uses the pages that appear in a majority of the results, applying simple rules to them, based on markup (Like giving keywords that appear in a heading element higher priority than those that show up in a paragraph) and the number of other pages in the search area that link to them... Not AI, just a bunch of heuristics. Adding hierarchy-based rules (Ie, page B is only found via a link on page A, and A and B have similar URLs, so B might be a sub-page of A, and shouldn't be considered if A passes everything, because you'll get to it from A anyways) is an interesting possibility if I ever get around to writing something like this. I think the same rules could apply to static and dynamic pages, though. No need to treat them differently, aside from result caching.
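A toy sketch of that majority-vote idea (fetch_results() is just a stand-in for the per-engine scrapers or APIs that would actually have to be written):
    #!/usr/bin/perl -w
    # Count how many engines return each URL, keep only the URLs that a
    # majority of engines agreed on.
    use strict;

    my @engines = ('google', 'altavista', 'hotbot', 'lycos');
    my %votes;

    for my $engine (@engines) {
        my @urls = fetch_results($engine, 'wheatstone bridge');
        $votes{$_}++ for @urls;
    }

    my @agreed = grep { $votes{$_} > @engines / 2 } keys %votes;
    print "$_\n" for @agreed;

    sub fetch_results { my ($engine, $query) = @_; return (); }   # stub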
Making Dynamic pages indexable (Score:3)
It depends on what technology you're using to generate the pages.
Zope [zope.org] sites for instance, are totally dynamically generated, even those pages that would normally be static. But the entire content of the site that's stored in the ODB is traversable via 'normal' URLs. This means that search engines can easily index your entire site.
Note, however, that this only works if you've taken care to expose your content via links. If you've deliberately hidden your content behind a search interface (and you can still do this with Zope), then your site will be no more indexable than any other dynamic site.
--
Shouldn't we use the right extension for the file? (Score:2)
It seem to me that having URLs with extensions of:
is incorrect. What is being served is not an ASP script, nor is it a PHP script, nor is it a Perl program. It is, however, an HTML file (or a GIF, or a PDF, etc.), and should be labelled as such.
If your server isn't smart enough to figure out how to generate the requested resource, and needs the generating program explicitly mentioned in the URL, then you need a smarter server. And if you aren't smart enough to figure out how to do this correctly, well...=)
Remember, kids, a URL != a file. All the /. end user cares about is getting an article with the comments formatted appropriately. They don't care[1] if it's stored as a text file, or generated by Perl, or..
[1] Well, they might care in a geek sense, but not in the way needed to read comments.
The answer is yes. (Score:2)
Good news! The solution is coming. Maybe the solution is here. google.com has their unique approach to web-indexing. Another method that's probably going to be tried sometime soon is to look at all the natural-language-processing technology that has been researched in the past twenty years, take the most efficient heuristics, and index pages by apparent topic instead of by keyword.
Then there are places like anipike.com - if it's a web page about Anime, it's on anipike, or it may as well not exist. I would -never- search the web for anything anime-related; I go through anipike.
I'm really, really hoping that linux.com will become that useful to the linux community, but I don't think they're quite there yet. They may never be. Anipike is generally very fast to load, especially compared to linux.com
(Apologies to any Joe out there who is proud of his links page.)
Anyway, currently I still use search engines for Linux-stuff, but as I keep getting more and more hits on rpm files cluttering up the informational content, that may change soon. (Especially since I'm a Debian user! I'm looking for information when I search the web; I know where my package is.)
--Parity
Re:Distributed Databases? (Score:2)
Re:XML? (Score:2)
However (there's always a however) there's the metadata catch. If you divorce metadata from content, then it becomes easy for site admins to lie in their metadata in order to attract visitors. Remember the keyword spamming that used to occur? Now, imagine if that's extended to being able to lie completely about the content of an entire site. Unless you're in an environment where you can trust the providers of your metadata, by and large you're in trouble.
Cheers,
Simon.
How to get your dynamic pages indexed. (Score:2)
The solution is easy. Don't use them in your URLs.
Do not use GET args in dynamically built links, but hide your args in a longer plain ole URL. For example, a script at http://www/x/y can actually interpret http://www/x/y/z/ just fine and you can then parse off z as an argument.
First, alias a directory that runs your CGIs, PHPs, etc. Like you would cgi-bin but don't call it that!
Then, plant your cgi program(s) in there. The "arguments" further down would be in the PATH_INFO variable (which you'd have to parse out manually).
So, in the case of http://www/aa/xx/yy/zz/ the script is in the aliased /aa directory. The script is named xx and the PATH_INFO passed to it, in the above example, would be /yy/zz/
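A minimal sketch of the script itself, using the example names above (yy and zz are just placeholders for whatever your application makes of them):
    #!/usr/bin/perl -w
    # The script installed as xx inside the aliased /aa directory.  For a
    # request to http://www/aa/xx/yy/zz/ the server puts "/yy/zz/" into
    # PATH_INFO; split it up and treat the pieces as arguments.
    use strict;

    my @args = grep { length } split m{/}, ($ENV{PATH_INFO} || '');
    my ($yy, $zz) = @args;

    print "Content-type: text/html\r\n\r\n";
    if (defined $zz) {
        print "<p>Looking up '$zz' in section '$yy'.</p>\n";
    } else {
        print "<p>Missing arguments.</p>\n";   # or send a 404, as suggested above
    }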
This works with Apache. Don't have Apache? Upgrade today at www.apache.org [apache.org] :-)
Re:Shouldn't we use the right extension for the fi (Score:2)
The client shouldn't infer the type of the object based on an "extension" in the URL at all ... that is what the Content-Type header is for!
Google, with a twist would do it (Score:2)
Google works on the idea that pages that have a lot of incoming links are authorities on what they discuss, so they should be ranked highly.
A modification of this is to not only rank a site's authoritativeness (eh?) this way, but also what kind of content it has. So if 10K geeks all have homepages that include the words "geek" and "computer" and also point to
Of course, some of those homepages will also have the words "tennis" and "knitting", that will be spuriously attributed to
This basically is keyword indexing, but the keywords are dynamically determined, rather than using the broken meta tags.
The big problem with this approach is implementation; the association tables are likely to be huge.
Also, you assume a large sample size, so that the outliers will cancel.
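A toy sketch of how the association might be built (linking_pages() and page_text() stand in for the crawler's own data store, and www.example.org for the target site):
    #!/usr/bin/perl -w
    # Every page that links to the target contributes the words it contains;
    # across many linking pages the common words win and the outliers
    # ("tennis", "knitting") wash out.
    use strict;

    my %score;
    for my $page (linking_pages('www.example.org')) {
        my $text = page_text($page);
        my %seen;
        $seen{lc $_} = 1 for $text =~ /(\w+)/g;   # count each word once per page
        $score{$_}++ for keys %seen;
    }

    my @keywords = sort { $score{$b} <=> $score{$a} } keys %score;
    @keywords = @keywords[0 .. 9] if @keywords > 10;
    print "@keywords\n";

    sub linking_pages { return () }   # stubs for the example
    sub page_text     { return ''  }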
Johan
Are Meta-tags dead? (Score:2)
Re:Shouldn't we use the right extension for the fi (Score:2)
Yes, but many file systems, which may be the destination of the results of the HTTP request, *do* make use of extensions to determine file type. Though, perhaps, storing MIME-type meta-information would be better, we're stuck with what we've got.
Also, I mean the URL to also be used as a user-interface. For example:
http://slashdot.org/99/12/14/1154243/comments
would generate your browsers's preferred format, whereas requests to:
http://slashdot.org/99/12/14/1154243/comments.pdf
and
http://slashdot.org/99/12/14/1154243/comments.scm
would return the PDF and the Slashdot Comment Markup Language (an XML app) respectively. This could be done with content-type markers, but the interface is much poorer than simply using file extensions.
Re:The problem isn't the search engines... (Score:2)
You missed the best stuff, though! You forgot to mention:
Take a look at my site, theFYI. [thefyi.com] Still a work in progress as the backend isn't done (yet). Dig through the source and see how it's built. I would have loved to use CSS for element layout but, hey, the browser support just is not there yet. Stuck with tables for a few more years. BUT take a look at the structure around each article. The header is denoted with an {h1} tag, its appearance changed with CSS-1. The paragraphs are marked with paragraph tags and, well hell, the linked URLs are surrounded with {cite} tags. That's how you code indexable HTML.
Used a lot of the same tricks on another site, http://www.ptrm.org/ [ptrm.org] and the site does well in the search engines. Specifically, check out the page on the PTRM's paleontology field tours [ptrm.org]. It does well in the engines simply because it's got 'dinosaur' in the page title and in a header tag.
(Yes, I know that curly brackets don't go around HTML tags. I just didn't want to escape the angle brackets every time I used an example of HTML.)
Re:The answer is yes. (Score:2)
I look for most of my information with HotBot [hotbot.com] just because its advanced search option lets me really weed out the bad hits.
Separation of commercial and non-commercial (Score:2)
Example: Suppose you're looking for information on a Zip drive. You already have the drive but are having trouble with it (problems with zip drives? really?)
Of course I don't even bother, I just go straight for the LDP, but Windows users don't have that option.
It would be interesting to be able to search only engineering sites for engineering information. I once did a search for "wheatstone bridge" and got tons of $cientology links. If the engine was able to determine if a site was, in fact, an engineering information site, that wouldn't have happened.
How about a "no pr0n" checkbox. That would be sweet.
Of course that would require a herculean effort in changing the standards and getting site owners to be honest.
But maybe not. Here are some ideas I was thinking of:
It's not perfect, but it's gotta be better than the garbage we put up with now.
Patentable Re:Why not use ScriptAlias (Score:2)
I think that would qualify for a patent. Go for it. It's a great idea.
I just made the entire site unusable by my entire company by viewing the robots.txt. How proxy server friendly. I hope nobody tries to look at the robots.txt file through an AOL connection.
Re:But now.. (Score:2)
I think what you meant to say was "If ebay has their way, accessing a copyrighted database and publishing information from it after being explicitly told not to is equivalent to cracking into another's system illegally."
I guess that means that we should do away with all search engines entirely...
I'm afraid you're right. We're pretty close to a time when most web pages will be served up programmatically from what amount to copyrighted databases. Indexing such sites without explicit permission from the content owners would be legally risky.
Don't Like It? Hey, build it yourself. (Score:2)
Standard disclaimer: IANALG (I am not a Linux geek.) Rather, I'm a web design geek. So please, be nice.
From what I understand, what a lot of the OpenSource movement is about is doing it yourself if you don't like how it's being done now. Don't like commercial Unix, Linus? Make your own fscking Linux and let everyone contribute. Oh yeah: Give it away free to really piss people off.
In this discussion, there are a ton of excellent ideas for how search engines should operate. Yet no one, to my knowledge, has put forward the next logical step: Build our own search engine. Google is a good start but hey, I know you guys could build it better. Worried about hardware and bandwidth costs? Venture capital.
As I said, do it yourself. :-)
XML-RPC (Score:3)
Then there's RSS [webreview.com], which is a way of serving up a news channel or other changing data. These applications are here and in use. Together, these XML-based technologies will someday provide the data layer for the software agents of the future. Read lately about that new "price-checker" technology? Imagine being the one business that doesn't serve up your product list and pricing to that agent.
An interface from XML to these "hidden" databases is only a matter of time. We're just caught right now at a moment between technologies: the authoring tools don't really exist.
----
Apache Directive Work Around (Score:2)
Minor cheapshot, admittedly (Score:2)