
IBM vs. Content Chaos (216 comments)

ps writes "IBM's Almaden Research Center has been featured for its continued work on 'Web Fountain,' a huge system for turning all the unstructured information on the web into structured data. (Is 'pink' the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies." It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.
This discussion has been archived. No new comments can be posted.

  • Re:Get this setup (Score:5, Informative)

    by orac2 ( 88688 ) on Monday January 12, 2004 @01:01PM (#7953409)
    Although the article didn't have room to go into this point (and I should know, I'm the author), IBM can completely compartmentalize competitors' data, even if hosted in-house (IBM already does this in other parts of its business). If companies are still wary, they can host the data themselves and let WebFountain troll it on a need-to-know basis.
  • by DerOle ( 520081 ) on Monday January 12, 2004 @01:10PM (#7953521) Homepage
    WebFountain [ibm.com]
  • Like NorthernLight? (Score:5, Informative)

    by dpbsmith ( 263124 ) on Monday January 12, 2004 @01:11PM (#7953527) Homepage
    This sounds very similar to NorthernLight.

    NorthernLight was (it still exists, but is apparently no longer available to the nonpaying public at all) a search engine that displayed its results automatically sorted into as many as fifteen or twenty categories, generated on the basis of the search. (For some reason, they called these categories "custom search folders.")

    Since it's no longer available to the public I can't give a concrete example. I can't test it to see whether a search on "Pink" creates a couple of folders labelled "Singer" and "Color," for example. But that's exactly the sort of thing it does/did.

    I actually would have used NorthernLight as one of my routine search engines--it worked quite well--had it not been for another major annoyance: in the publicly available version, it always searched both publicly available Web pages and a number of fee-based private databases, so whatever you searched for, the majority of the results were in the fee-based databases, and I would have had to pay money to see what they were. In other words, it was heavy-handed promotion of their paid services, with only limited utility for those who did not wish to buy them.
  • by AndroidCat ( 229562 ) on Monday January 12, 2004 @01:12PM (#7953541) Homepage
    According to the article, Web Fountain is supposed to sift through information which isn't XML tagged.
  • by Wiktor Kochanowski ( 5740 ) on Monday January 12, 2004 @01:29PM (#7953690)
    Vivisimo [vivisimo.com] is already doing this kind of sorted searching.

    Try it out; it works quite well for me. It beats Google for many queries, not in the actual number of pages found, but in the time it takes me to find whatever I'm looking for.

  • by orac2 ( 88688 ) on Monday January 12, 2004 @01:33PM (#7953742)
    Disclaimer: I'm the author of the article

    As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

    This could be tricky -- WebFountain uses a kitchen-sink approach, with a varying palette of content discriminators and disambiguators. The developers are also savvy enough to downweight link-farm-type approaches. Of course, one could, say, conduct a campaign among bloggers to mention a term and make it appear well known to WebFountain, but the inevitable consequence is that it would then actually be well known!
  • by Tom ( 822 ) on Monday January 12, 2004 @01:45PM (#7953867) Homepage Journal
    Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic than to find actual content regarding that topic.

    (topic) -checkout -buy

    Other things that work well sometimes:
    (topic) site:.org
    (topic) -amazon
    (topic) -site:amazon.com -site:amazon.co.uk
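    A minimal sketch of assembling such exclusion queries programmatically; the helper name and example terms below are made up for illustration, and only the -term, -site:, and site: operators shown above are assumed:

        import urllib.parse

        def build_query(topic, exclude_terms=(), exclude_sites=(), restrict_site=None):
            # Assemble a search query string using the exclusion tricks above.
            parts = [topic]
            parts += ["-" + t for t in exclude_terms]        # e.g. -checkout -buy
            parts += ["-site:" + s for s in exclude_sites]   # e.g. -site:amazon.com
            if restrict_site:
                parts.append("site:" + restrict_site)        # e.g. site:.org
            return " ".join(parts)

        query = build_query("python decorators",
                            exclude_terms=("checkout", "buy"),
                            exclude_sites=("amazon.com", "amazon.co.uk"))
        print(query)
        # python decorators -checkout -buy -site:amazon.com -site:amazon.co.uk
        print("https://www.google.com/search?q=" + urllib.parse.quote_plus(query))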

    and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage.

    Does it? I always thought that's exactly what Google is filtering out behind the "12345 more results were omitted because they were similar" thingy.
  • by GT_Alias ( 551463 ) on Monday January 12, 2004 @01:49PM (#7953921)
    Erm... did you read the article? WebFountain has created multiple "annotators" to sift through the data fed to it and apply XML tags (a toy sketch of such an annotator follows this comment).

    You would need an enormous workforce to do that.

    C'mon, give these guys some credit.
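    A toy sketch of the kind of annotator pass described above, assuming a simple gazetteer lookup; the tag names, attributes, and entities are invented for illustration and are not WebFountain's actual schema:

        import re

        # Invented gazetteer: plain-text mentions mapped to the XML wrapper to apply.
        GAZETTEER = {
            "Mount Fuji": ('<place name="Mount Fuji">', "</place>"),
            "Pink": ('<person occupation="singer">', "</person>"),
        }

        def annotate(text):
            # Wrap each known mention in its XML tags, leaving the rest untouched.
            for mention, (open_tag, close_tag) in GAZETTEER.items():
                text = re.sub(re.escape(mention), open_tag + mention + close_tag, text)
            return text

        print(annotate("We visited Mount Fuji and took some photos."))
        # We visited <place name="Mount Fuji">Mount Fuji</place> and took some photos.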

  • semantic web (Score:2, Informative)

    by jonasmit ( 560153 ) on Monday January 12, 2004 @01:50PM (#7953923)
    XML simply isn't enough. Structure != Meaning. Meaning must be inserted somewhere by someone. Trying to interpret HTML/natural language to form structured documents is a daunting task. If you want real meaning, then the data needs to be described or translated into a meaningful form like RDF [w3.org] (yes, represented as XML) when it is created, so that intelligent agents such as this can *understand* the data. RDF uses triples (think graphs) to describe relationships, making use of URIs: Subject--Predicate--Object ...etc. Now think about how to merge all this information -- with well-formed rules, RDF documents merge cleanly; with traditional structured XML, the merged docs would not be well-formed. With RDF they can be, and XML can still be generated for standard rendering. Take a look at the Semantic Web [w3.org]
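    A minimal sketch of the triple model described above, using plain Python sets rather than an RDF library to keep it self-contained; the URIs are made up for illustration:

        # Each statement is a (subject, predicate, object) triple built from URIs,
        # so merging two documents is just a set union: duplicates collapse and
        # there are no well-formedness problems of the kind merged XML trees have.
        doc_a = {
            ("http://example.org/entity/pink-the-singer", "http://example.org/vocab/type", "Person"),
            ("http://example.org/entity/pink-the-singer", "http://example.org/vocab/name", "Pink"),
        }
        doc_b = {
            ("http://example.org/entity/pink-the-color", "http://example.org/vocab/type", "Color"),
            ("http://example.org/entity/pink-the-singer", "http://example.org/vocab/name", "Pink"),
        }

        merged = doc_a | doc_b   # the shared triple appears only once
        for s, p, o in sorted(merged):
            print(s, p, o)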
  • Analytic tools can ferret out patterns in, say, a sales receipt database, so that a retail store might see that people tend to buy certain products together and that offering a package deal would help sales. ...
    This urban-legend example of people buying beer and diapers together (hence the sections for beer and diapers should be close by, at least on Saturdays) has been beaten to death and beyond.
    A sentence that originally read "We visited Mount Fuji and took some photos" would become something like "We visited <place latitude="..." longitude="...">Mount Fuji</place> and took some photos."
    I am not sure what the tags around "Mount Fuji" have added in this example. The only thing I can think of is that these are similar to the "smart tags" of MS Office that pre-populate straightforward relational data like a contact's email or address. Personally, when I need this info I would do a Google search for "mount fuji latitude", and the first result I get is the one that gives me the latitude and longitude of Mount Fuji. What is the point of pre-feeding this info during the "markup"? And it bears repeating here that rather than complaining about the results you get with one or two keywords, think about adding keywords to narrow and specialize the search. "Paris Hilton video" is better than just "Paris Hilton", which might unnecessarily show you stuff about hotels.
    By the time the annotators have finished annotating a document, it can be up to 10 times longer than the original.
    So a person was probably talking about a molehill, and the machine markup has turned it into a mountain. How many of the extra tags (even accounting for the verbosity of XML) have really added "meaning" to the document? How much of the "meaning" was intended, and how much has been force-fed by the machine?
    These heavily annotated pages are not intended for human eyes; rather, they provide material that the analytic tools can get their teeth into.
    This is where I think they are using XML but going away from the XML concept: it was supposed to be human-readable. If the IBM research group started focusing on how to help people make sense of the 1x material and 10x markup, they would be introducing the person at the right point in the analysis process -- introducing a person only at the last stage, especially in deriving "meaning", may not be the best strategy. The markups are just "filters" through which, when the material is viewed, a lot of context becomes apparent. What we need to do is let people start with the filters and then look for the material (top-down), or start with the material and look for filters (bottom-up) -- a more iterative procedure involving both approaches.

    Google lets you do a keyword search (bottom-up) or browse via directories -- DMOZ (top-down). Vivisimo and Grokker were recently discussed on Slashdot; they create dynamic categorizations, i.e. bottom-up. I think it would be better to let people analyze the markup (the directory/top-down approach) or analyze the material (the keyword/bottom-up approach) rather than mixing up the two and presenting the "results" to the person.

    E-mails or instant messages can't be labeled in this way without destroying the ease of use that is the hallmark of these ad hoc communications; who would bother to add XML labels to a quick e-mail to a colleague?
    This is the second place where energies should be focused. Where a document is created may mean a lot. It could be that a new file inherits the path (and hence context) of the directory it is created in, or it could be as simple as: on the top right of the screen I create personal files, on the bottom right I create files about sports, on the bottom-left-middle I create files about Java, etc. I think this beats, any day, the bot annotators that come after me and add ten times more markup than the whole of the quick email I sent to a colleague.
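    A toy sketch of the "context at creation time" idea above: a note inherits tags from the directory it is created in, so no after-the-fact annotator is needed. The directory layout is invented for illustration:

        from pathlib import Path

        def context_tags(path):
            # Treat each directory on the way to the file as an inherited context tag.
            return list(Path(path).parent.parts)

        print(context_tags("home/personal/finance/2004-01-12-budget.txt"))
        # ['home', 'personal', 'finance']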
  • by Jerf ( 17166 ) on Monday January 12, 2004 @02:52PM (#7954569) Journal
    It's important not to underestimate people's ability to game systems, regardless of the thought put into them. The simple algorithm
    • Reconstruct algorithm.
    • Simulate algorithm and play with the inputs until the outputs match what you want.
    • Bring those inputs about.
    is extremely powerful, and note that as a "meta-algorithm" there's absolutely no way to completely shut it down.
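    A minimal sketch of that reconstruct-simulate-steer loop; the ranking function stands in for whatever scoring the attacker has reverse-engineered, and everything here is invented for illustration:

        import random

        def ranking(page):
            # Toy "reconstructed" algorithm: keyword hits plus weighted inbound links.
            return page["keyword_hits"] + 3 * page["inbound_links"]

        def game_the_ranking(page, target_score, steps=1000):
            # Simulate the algorithm and perturb the inputs until the output matches
            # what the attacker wants; actually bringing those inputs about (writing
            # spam pages, buying links) is then an offline problem.
            best = dict(page)
            for _ in range(steps):
                candidate = dict(best)
                candidate[random.choice(["keyword_hits", "inbound_links"])] += 1
                if abs(target_score - ranking(candidate)) < abs(target_score - ranking(best)):
                    best = candidate
                if ranking(best) >= target_score:
                    break
            return best

        print(game_the_ranking({"keyword_hits": 2, "inbound_links": 0}, target_score=30))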

    You have only four basic defenses against this:
    1. Keep changing the algorithm (expensive, and large changes may not be possible if stability is desirable, which for search results it generally is),
    2. make the input-gaming process more expensive than the value of the output to the attacker (as you become more valuable you're a more enticing target),
    3. make the outputs desired by the attacker impossible (generally not possible in the general case, but in certain limited ways it is; it is probably not possible to be the #1 Google result for all possible search terms, for instance, despite the desirability of such a result),
    4. or have a human monitor attacks and shut them down manually (only possible if you can out-staff the attackers).
    There are some other possibilities, but a lot of them don't apply in the real world, like "make it impossible to reverse the inputs necessary for some output" (like MD5); this is not applicable to a real-world application like a search engine because there has to be some obvious human-sensible logic to the placement, or the search engine is just returning random results, which is not even a "search engine", let alone a useful one. Not even all four can be brought to bear in a given situation; #2 probably doesn't apply in this case, since the benefits could in theory be in the millions of dollars.

    I'm not saying that WebFountain is hosed; Google has the same trouble, and it is manageable. But it is worth talking about: certain basic algorithms will have certain effects as people try to game them, and it may be the case that some clever, useful algorithm is so easily gamed, and so difficult to create countermeasures for, that it will never be viable in the real world in the general case.

    (I doubt this is the case, but there's only one way to find out, and that's try it and see what happens.)
