Comment Re:Artificial Librarian (Score 1) 313
this is key -- for years and years good AI indexing has been around the corner. it hasn't materialised, and i don't think it's likely to in my lifetime.
there simply is no substitute for a human domain expert doing a detailed job of indexing. look at how hopeless most automagically generated indexes are; key words are frequently omitted from text which discusses the ideas they represent, and often pop up in marginally relevant contexts, so without complex ontological analysis (which is way too hard to do properly right now, although easy for humans) it's no surprise that most indexes are full of pointless references.
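a rough sketch of that failure mode, assuming a naive inverted index over a two-document corpus invented for this example (the names `DOCS` and `build_keyword_index` are hypothetical, not from any real indexer):

```python
import re
from collections import defaultdict

# hypothetical corpus, made up purely for illustration:
# "a" is about classification but never uses the word "taxonomy";
# "b" uses the word only in passing.
DOCS = {
    "a": "the catalogue groups books by subject so readers can find related titles",
    "b": "our office move means the taxonomy meeting is postponed until friday",
}

def build_keyword_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

index = build_keyword_index(DOCS)
# look up the key word without adding a new entry to the index
hits = sorted(index.get("taxonomy", set()))
print(hits)
```

the lookup returns only the marginal document and misses the relevant one, which is the gap a human indexer would close.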
key word searching is pretty well dead, mainly because of the way HTML turned out -- no one can make any semantic sense of an HTML document based on markup anymore.
i think there are two ways to go if you want reasonable searching, and both require quite extensive human up-front effort. first, standalone metadata. by this i mean a shared data model and a shared means of representing the model (such as, say, the dublin core and rdf). second, adoption of new markup for content, using not only a shared grammar (i.e. the actual markup) but also a shared vocabulary (that is, a shared semantic for the grammar, and a shared understanding of context -- the context being dependent on grammar, but not limited by grammar).
the first approach is easier (but not easy) to implement and applicable to all 'documents', including non-textual documents. the second is much harder to do, but will make search algorithms simpler, faster, and more predictably useful.
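a minimal sketch of the first approach, assuming plain python dicts stand in for a real rdf store. the keys are dublin core element names and the `type` values come from the dcmi type vocabulary; the records themselves and the helper `find_by_subject` are hypothetical, made up for illustration:

```python
# hypothetical records: a human has assigned the subject terms up front.
# keys follow dublin core element names; values are invented.
RECORDS = [
    {"identifier": "doc-1", "title": "shelving guide",
     "subject": ["classification", "libraries"], "type": "Text"},
    {"identifier": "img-7", "title": "reading room, 1950",
     "subject": ["libraries", "architecture"], "type": "StillImage"},
    {"identifier": "doc-2", "title": "office move notice",
     "subject": ["facilities"], "type": "Text"},
]

def find_by_subject(records, subject):
    """Return identifiers of records whose subject list contains the term."""
    return [r["identifier"] for r in records if subject in r["subject"]]

matches = find_by_subject(RECORDS, "libraries")
print(matches)
```

the search never looks at the content itself, so the image is found as easily as the text document. the cost is that someone had to write those subject terms by hand.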
there's no pain-free way out of the mess that we have currently though.
cheers
j