IBM vs. Content Chaos
ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies." It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.
Re:Get this setup (Score:5, Informative)
URL of the project page (Score:2, Informative)
Like NorthernLight? (Score:5, Informative)
NorthernLight was (it still exists, but apparently is not available to the nonpaying public at all) a search engine that displayed its results automatically sorted into as many as fifteen or twenty categories, automatically generated on the basis of the search. (For some reason, they called these categories "custom search folders.")
Since it's no longer available to the public I can't give a concrete example. I can't test it to see whether a search on "Pink" creates a couple of folders labelled "Singer" and "Color," for example. But that's exactly the sort of thing it does/did.
I actually would have used NorthernLight as one of my routine search engines--it worked quite well--had it not been for another major annoyance: in the publicly available version, it always searched both publicly available Web pages and a number of fee-based private databases, so whatever you searched for, the majority of the results were in the fee-based databases and I would have had to pay money to see what they were. In other words, it was heavy-handed promotion of their paid services and had only limited utility to those who did not wish to buy them.
Re:What about Existing Data? (Score:2, Informative)
Re:Like NorthernLight? (Score:2, Informative)
Try it out, works quite often for me - beats Google for many queries, not in actual number of pages found, but in the time it takes me to find out whatever I'm looking for.
Re:How long before people start gaming the system? (Score:3, Informative)
As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.
This could be tricky -- WebFountain uses a kitchen-sink approach, with a varying palette of content discriminators and disambiguators. The developers are also savvy enough to downweight link-farm-type approaches. Of course, one could, say, conduct a campaign among bloggers to mention a term and make it appear well-known to WebFountain, but the inevitable consequence is that it would then actually be well-known!
Re:One Net to Rule Them All (Score:3, Informative)
(topic) -checkout -buy
Other things that work well sometimes:
(topic) site:.org
(topic) -amazon
(topic) -site:amazon.com -site:amazon.co.uk
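For anyone who types these filters a lot, here is a minimal sketch of a helper that assembles them. The function name `build_query` and its parameters are made up for illustration; the only thing taken from the post is the operator syntax itself (`-term`, `site:`, `-site:`), and whether a given search engine honors each operator is an assumption.

```python
def build_query(topic, exclude_terms=(), exclude_sites=(), site=None):
    """Assemble a search string from a topic plus optional filters.

    Hypothetical helper: it only concatenates the operators shown in the
    post above; it does not contact any search engine.
    """
    parts = [topic]
    if site:
        # Restrict results to one site or top-level domain, e.g. ".org".
        parts.append(f"site:{site}")
    # A leading minus excludes pages containing the term.
    parts.extend(f"-{term}" for term in exclude_terms)
    # "-site:" excludes a whole domain.
    parts.extend(f"-site:{domain}" for domain in exclude_sites)
    return " ".join(parts)


print(build_query("plumbing supplies", exclude_terms=["checkout", "buy"]))
print(build_query("plumbing supplies", exclude_sites=["amazon.com", "amazon.co.uk"]))
```

The first call prints `plumbing supplies -checkout -buy`, the second `plumbing supplies -site:amazon.com -site:amazon.co.uk`.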
and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage.
Does it? I always thought that's exactly what Google is filtering out behind the "12345 more results were omitted because they were similar" thingy.
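To make the "omitted because they were similar" idea concrete, here is a rough sketch of one textbook way to collapse near-identical pages: compare sets of word shingles with Jaccard similarity. This is not a description of what Google actually does; the function names, the shingle size `k=3`, and the `0.8` threshold are all assumptions picked for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def collapse_near_duplicates(pages, threshold=0.8):
    """Keep the first page of each group of near-duplicates.

    pages is a list of (url, text) pairs. Returns (kept_urls, omitted_count).
    The threshold is an arbitrary illustrative value, not a known setting.
    """
    kept = []      # (url, shingle set) for every page we decided to show
    omitted = 0    # pages hidden because they resemble a kept page
    for url, text in pages:
        page_shingles = shingles(text)
        if any(jaccard(page_shingles, seen) >= threshold for _, seen in kept):
            omitted += 1
        else:
            kept.append((url, page_shingles))
    return [url for url, _ in kept], omitted
```

Fed a hundred domains carrying the same "Buy my plumbing supplies!" page, a filter like this would show the first one and count the other ninety-nine as omitted. The pairwise comparison is quadratic, so this is only a toy for small result lists.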
Re:What about Existing Data? (Score:2, Informative)
You would need an enormous workforce to do that.
C'mon, give these guys some credit.
semantic web (Score:2, Informative)
Encourage Human Markup Discourage Machine MU (Score:3, Informative)
Google lets you do a keyword search (bottom-up) or browse via the directories - DMOZ (top-down). Vivisimo and Grokker were recently discussed on Slashdot; both create dynamic categorizations, i.e. bottom-up. I think it would be better to let people analyze the markup (directory/top-down approach) or analyze the material (keyword/bottom-up) rather than mixing up the two and presenting the "results" to the person.
This is the second place where energies should be focused. Where the document is created may mean a lot. It could be that a new file inherits the path (hence context) of the directory in which I create it, or it could be as simple as this: on the top right of the screen I create personal files, on the bottom right I create files about sports, on the left-bottom-middle I create files about Java
Re:How long before people start gaming the system? (Score:3, Informative)
You have only four basic defenses against this:
I'm not saying that WebFountain is hosed; Google has trouble with this, but it is manageable. But it is worth talking about; certain basic algorithms will have certain effects as people try to game them, and it may be the case that some clever, useful algorithm is so easily gamed and so difficult to create countermeasures for that it will never be possible in the real world in the general case.
(I doubt this is the case, but there's only one way to find out, and that's try it and see what happens.)