
IBM vs. Content Chaos

Posted by Hemos on Monday January 12, 2004 @12:42PM from the help-me-find-directions-to-p4r1s-h1l70n dept.

ps writes "IBM's Almaden Research Center has been featured for their continued work on "Web Fountain", a huge system to turn all the unstructured info on the web into structured data. (Is "pink" the singer or the color?) IEEE reports that the first commercial use will be to track public opinion for companies. " It looks like its feeding ground is primarily the public Internet, but it can be fed private information as well.

This discussion has been archived. No new comments can be posted.


I think a better question... (Score:5, Funny)

by bc90021 ( 43730 ) * writes: <bc90021@NOsPAm.bc90021.net> on Monday January 12, 2004 @12:45PM (#7953216) Homepage

...doesn't concern whether "Pink" is a colour or a singer, but whether "Paris Hilton" is a hotel in France or an oft downloaded video... ;)

- Re:I think a better question... (Score:2, Funny)
  
  by Dave2 Wickham ( 600202 ) * writes:
  
  "from the help-me-find-directions-to-p4r1s-h1l70n dept."
  - Re:I think a better question... (Score:2)
    
    by bc90021 ( 43730 ) * writes:
    
    The really funny part is that I didn't even see that until you pointed it out...
- - Re:I think a better question... (Score:3, Funny)
    
    by ePhil_One ( 634771 ) writes:
    
    Oh, by the way, which one's Pink?
  - Colour, singer OR band... (Score:3, Funny)
    
    by WebCowboy ( 196209 ) writes:
    
    Wonder if this "web fountain" will be smart enough to determine the context to THAT level.
    
    A painter thinks "colour" when he sees the word.
    
    A slashdot reader (and many other grown-ups) thinks of the band "Pink Floyd".
    
    If you are (or are the parent of) a teen-aged girl you think of neither...you think of the anti-Britney pop-star princess of angst Pink [pinkspage.com]
- - What is PINK? (Score:3, Funny)
    
    by BigBlockMopar ( 191202 ) writes:
    
    (Is "pink" the singer or the color?)
    I didn't get the joke.
    
    These are, after all, engineers. Pink is neither a color nor a singer (talented or otherwise).
    
    To an engineer, PINK can only be an acronym.
    - Riding the Gravy Train... (Score:2)
      
      by Pac ( 9516 ) writes:
      
      It's a reference to Pink Floyd's Have a Cigar [pinkfloydonline.com] lyrics, "And by the way, which one is Pink?"
      - Re:Riding the Gravy Train... (Score:2)
        
        by Thing 1 ( 178996 ) writes:
        
        Contract that verb, you insensitive clod!
    - Re:What is PINK? (Score:2)
      
      by Phreakiture ( 547094 ) writes:
      
      It could also be a song (by Aerosmith).
pr0nfountain (Score:1, Funny)

by 3lb4rt0 ( 736495 ) writes:

The spinoff that will be used by joe sixpack net user.
All we need... (Score:3, Interesting)

by TJ_Phazerhacki ( 520002 ) writes: on Monday January 12, 2004 @12:46PM (#7953230) Journal

There is already altogether too much "Stuff out there" for anyone to put any major effort into categorizing it. We should soon reach the point of info overload, and then what? What is the point of cataloging overflow data? Do we really need something like this? Or should we just ship a bunch of programmers wasting their time over to something else, like better spam filters and OS's without gaping security holes?

- Re:All we need... (Score:1, Flamebait)
  
  by Frymaster ( 171343 ) writes:
  
  the first commercial use will be to track public opinion for companies.
  here's one to start with:
  microsoft (msft) of redmond washington: you suck!
  now, go log that.
- Re:All we need... (Score:1, Insightful)
  
  by geoffspear ( 692508 ) writes:
  
  Oh yes, because there's such an enormous shortage of programmers right now. IBM should lay off all of these programmers so Microsoft will have a pool of available programmers who know nothing about OS security to work on security.
  And once all the game producers, who make a product we definitely don't "need" get rid of all of their programmers, there will be plenty of free people to work on anti-spam technology. Whee!
- Re:All we need... (Score:5, Insightful)
  
  by millahtime ( 710421 ) writes: on Monday January 12, 2004 @01:04PM (#7953444) Homepage Journal
  
  There are many organizations that need better ways to analyze their info. There are databases that are terabytes in size and have to do detailed searches. With SQL databases that can take a long time and any faster way can save a lot of time and money. There is a big need for this technology across many industries.
  
- Re:All we need... (Score:5, Insightful)
  
  by xyzzy ( 10685 ) writes: on Monday January 12, 2004 @01:20PM (#7953619) Homepage
  
  That's really funny that you mention "spam filters", since that is exactly the content categorization task that you are talking about.
  
  Automatic categorization of overflowing data is exactly what you need to do when you have too much to think about -- it allows you to triage your attention span, which is the most limited resource you have.
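
To make the comparison between spam filtering and content categorization concrete, here is a minimal sketch of a word-count (naive Bayes) categorizer with add-one smoothing. The training sentences and labels are made up for illustration, and nothing here is claimed to be how WebFountain or any real spam filter works.

```python
import math
from collections import Counter

def train(examples):
    """Count words per label. `examples` is a list of (label, text) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for label, text in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest naive Bayes log score."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, invented for this example.
TRAINING = [
    ("spam", "buy cheap pills now"),
    ("spam", "cheap loans buy now"),
    ("ham", "meeting notes for the xml project"),
    ("ham", "project schedule and meeting agenda"),
]
model = train(TRAINING)
print(classify("cheap pills", *model))      # spam
print(classify("project meeting", *model))  # ham
```

Swapping the "spam" and "ham" labels for topic names turns the same code into a crude topic categorizer, which is the sense in which the two tasks are the same.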
  
- Re:All we need... (Score:3, Interesting)
  
  by redragon ( 161901 ) writes:
  
  I think the inverse is the case.
  
  The more chaotic (overloaded in your terms) that data tends to be, then the greater the information contained in that data (think compression). So what they're going after is not "categorizing" the internet, they're going after making some sense out of all of that data. Information overload begins to necessitate an intermediary to help filter out the data that you're interested in.
  
  The interesting thing becomes what sort of biases are built into a system like this? That is
Send link to Google (Score:5, Insightful)

by Urkki ( 668283 ) writes: on Monday January 12, 2004 @12:47PM (#7953236)

They could certainly use these kinds of techniques to improve their results...

Then again, in a way they already use something like this, except they're only really concerned about links, not actual contents of pages...

structure... (Score:5, Funny)

by Rhubarb Crumble ( 581156 ) writes: <r_crumble@hotmail.com> on Monday January 12, 2004 @12:47PM (#7953239) Homepage

a huge system to turn all the unstructured info on the web into structured data
In order to do this, they will use a scheme by which each document is referred to by a string including the transfer protocol, the host name, and a file path.
oh, wait...
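
For anyone who missed the joke, the "scheme" described is just a URL. A small sketch using Python's standard `urllib.parse` shows the three parts named above; the example URL and the `describe` helper are made up for illustration.

```python
from urllib.parse import urlsplit

def describe(url):
    """Split a URL into the three parts the comment lists."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # transfer protocol, e.g. "http"
        "host": parts.hostname,     # host name
        "path": parts.path,         # file path on that host
    }

# Hypothetical URL, used only as an example.
info = describe("http://www.example.com/docs/pink.html")
print(info)
```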

- Too easy, think complicated (Score:1)
  
  by korpiq ( 8532 ) writes:
  
  Some information at different paths might require cross-referencing. Thus, the scheme you propose should be extended so that there would be a way for text documents to contain links to each other.
  
  However, if you just take a big enough storage system and download all the documents from teh intterweb, you can have a flat directory containing all the documents. Woohoo, progress!
First customer (Score:3, Funny)

by Anonymous Coward writes: on Monday January 12, 2004 @12:47PM (#7953245)

IEEE reports that the first commercial use will be to track public opinion for companies.

Word has it the first test case will be SCO. Web Fountain: "Outlook not so good"

- Obligatory SCO poke. (Score:2)
  
  by i_r_sensitive ( 697893 ) writes:
  
  Damnit, too busy reading stupid poll posts, damnit dmanit dmanit.
  You've won this round, Lonestar...
- Actually... (Score:2)
  
  by Kjella ( 173770 ) writes:
  
  ...they were used in calibration tests... you know, find the highs and lows of the system.
  
  Kjella
SITE ALREADY SLASHDOTTED, HERES A MIRROR! (Score:2, Funny)

by ThisIsAnExampleAccou ( 718430 ) writes:

Link to a Mirror [google.com]
Get this setup (Score:3, Interesting)

by millahtime ( 710421 ) writes: on Monday January 12, 2004 @12:49PM (#7953274) Homepage Journal

I wonder how long until IBM sells this setup. If it works well, Logistics Organizations would love to get their hands on it.

- Re:Get this setup (Score:1)
  
  by millahtime ( 710421 ) writes:
  
  I mean by this that most Logistics Organizations will have proprietary info that they won't let IBM house.
  - Re:Get this setup (Score:5, Informative)
    
    by orac2 ( 88688 ) writes: on Monday January 12, 2004 @01:01PM (#7953409)
    
    Although the article didn't have room to go into this point (and I should know, I'm the author), IBM can completely compartmentalize competitors' data, even if hosted in house (IBM already does this in other parts of its business). If companies are still wary, they can host the data themselves and let WebFountain troll it on a need to know basis.
    
    - Re:Get this setup (Score:2, Interesting)
      
      by The Limp Devil ( 513137 ) writes:
      
      let WebFountain troll it
      
      I sincerely hope you meant trawl it. The last thing we need is for IBM to build and sell an automated system for trolling the entire internet!
Expensive (Score:4, Interesting)

by starvingcodeartist ( 739199 ) writes: on Monday January 12, 2004 @12:51PM (#7953289)

In the article it says they plan on charging between $150,000 and $300,000 a year to use this super-search engine. They think corporate execs will pay for it. Seems really steep to me. BUT, for corporate execs, it's probably not too expensive. They'll just outsource another 10-15 programming jobs to India to pay for it.

- Re:Expensive (Score:5, Interesting)
  
  by orac2 ( 88688 ) writes: on Monday January 12, 2004 @12:56PM (#7953349)
  
  The point is that it's not intended for use as a search engine, but a platform for doing computation intensive data mining and analysis. A search engine can tell you how many mentions of IBM appear on the web, but not how people feel about IBM.
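
The difference between counting mentions and gauging opinion can be sketched in a few lines. This is only a toy: the word lists, the sample documents, and the `analyze` function are all invented for the example, and real opinion mining is far harder than adding up positive and negative words.

```python
import re

# Hypothetical opinion lexicons, made up for this sketch.
POSITIVE = {"love", "great", "outstanding", "reliable"}
NEGATIVE = {"hate", "awful", "spineless", "overpriced"}

def analyze(documents, company):
    """Return (number of documents mentioning company, net opinion score)."""
    mentions = 0
    opinion = 0
    for doc in documents:
        words = re.findall(r"[a-z']+", doc.lower())
        if company.lower() not in words:
            continue  # document does not mention the company at all
        mentions += 1
        # Net score: positive words minus negative words in this document.
        opinion += sum(w in POSITIVE for w in words)
        opinion -= sum(w in NEGATIVE for w in words)
    return mentions, opinion

# Hypothetical documents, used only as an example.
DOCS = [
    "IBM is outstanding and their servers are reliable",
    "I hate the support contracts from IBM",
    "IBM is shipping two new processors",
    "My cat is great",
]
print(analyze(DOCS, "IBM"))  # (3, 1)
```

The first number is what a plain search engine gives you; the second is a very rough stand-in for the kind of answer a data mining program would be after.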
  
  - Re:Expensive (Score:2)
    
    by Speare ( 84249 ) writes:
    
    "A search engine can tell you how many mentions of IBM appear on the web, but not how people feel about IBM."
    
    I give you googlism.com: http://www.googlism.com/index.htm?ism=ibm&type =2
    
    Googlism for: ibm
    
    ibm is even "officially" spineless
    ibm is still the 'king'
    ibm is shipping 2 new powerpc processors
    ibm is bullish on asps and hosted services in
    ibm is offering internship that supports grid
    ibm is my choice
    ibm is outstanding
    ibm is giving peace
    ibm is planning to ship new
    ibm is willing to help
    ibm is announci
  - - Re:Expensive (Score:2)
      
      by orac2 ( 88688 ) writes:
      
      The point is that the "you" in "you can get exactly the data you're looking for" is not a person, but a data mining program.
      
      Disclaimer: I'm the author of the article!
corporate meddling (Score:3, Insightful)

by commo1 ( 709770 ) writes: on Monday January 12, 2004 @12:54PM (#7953322)

One of my main concerns with search databases is the inherent ability for corporations to increase their visibility on the web by manipulating data to their benefit to bring their corporate page up first on the list. I wonder if there is a way for the database to have a scoring system based on the validity of the data: is the information there, or are there just highly developed metatags doing the work? If you do a search for a specific part number for an HP product, what are the chances of getting a) the HP home page where a further search would be necessary to find any relevant info or b) the big chains like Staples, Circuit City who just want to sell you cartridges and have the time and resources to steer you in the right direction? How would the system be regulated (kinda like Slashdot mods :P)? Who watches the watchers, and can information validity be electronically implemented? What kind of AI would be necessary?

- Re:corporate meddling (Score:2)
  
  by orac2 ( 88688 ) writes:
  
  WebFountain isn't intended as a general purpose search engine, but to provide a platform for data mining and analysis.
Information... (Score:2, Funny)

by enrico_suave ( 179651 ) writes:

Information wants to be... Fuchsia!

*shrug*

e.
- Re:Information... (Score:2)
  
  by __past__ ( 542467 ) writes:
  
  Really? But I heard mauve has the most RAM!
What about Existing Data? (Score:4, Interesting)

by ParadoxicalPostulate ( 729766 ) writes: <saapad.gmail@com> on Monday January 12, 2004 @12:54PM (#7953330) Journal

Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?

You would need an enormous workforce to do that.

And if they don't plan on doing that, what about all the existing information? Is it going to be excluded from the database? Seems like much of a waste to me!

Damn but I would love to have access to one of these, even if the amount of information available will be minuscule (relatively speaking) for the next few years.

- Re:What about Existing Data? (Score:5, Funny)
  
  by Ronald Dumsfeld ( 723277 ) writes: on Monday January 12, 2004 @01:11PM (#7953528)
  
  Are you telling me that there are programmers willing to go through [Insert Ludicrously Large Number Here] files and "annotate" them using XML to fit the new system?
  
  No, they're writing software to put in the XML tags.
  
  What will be more interesting to see is if it's possible to pollute the database by putting in your own XML. Instead of Google-Bombing we'll have people pissing in the WebFountain.
  
  - Re:What about Existing Data? (Score:2)
    
    by cookie_cutter ( 533841 ) writes:
    
    Instead of Google-Bombing we'll have people pissing in the WebFountain.
    And so a new piece of slang is born.
    - Re:What about Existing Data? (Score:2)
      
      by K-Man ( 4117 ) writes:
      
      Better yet, a mental image [google.com]. Can't wait for the IBM brochure.
- Re:What about Existing Data? (Score:2, Informative)
  
  by AndroidCat ( 229562 ) writes:
  
  According to the article, Web Fountain is supposed to sift through information which isn't XML tagged.
- Re:What about Existing Data? (Score:2, Informative)
  
  by GT_Alias ( 551463 ) writes:
  
  Erm...did you read the article? WebFountain has created multiple "annotators" to sift through the data fed to it and apply XML tags.
  You would need an enormous workforce to do that.
  C'mon, give these guys some credit.
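
As a rough idea of what an "annotator" that applies XML tags might look like, here is a toy sketch that tries to decide whether "Pink" is the singer or the color from nearby words. The cue lists, the `annotate_pink` function, and the `<color>` tag are assumptions made up for this example (`<popularmusic>` is borrowed from elsewhere in this discussion); the article does not describe WebFountain's annotators at this level.

```python
import re

# Hypothetical context cues; a real annotator would need far richer evidence.
MUSIC_CUES = {"album", "song", "singer", "concert", "tour"}
COLOR_CUES = {"paint", "shade", "dress", "color", "colour"}

def annotate_pink(sentence):
    """Wrap 'Pink' in a tag chosen from context, or leave it if unclear."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    music = len(words & MUSIC_CUES)
    color = len(words & COLOR_CUES)
    if music == color:
        return sentence  # ambiguous: better untagged than wrongly tagged
    tag = "popularmusic" if music > color else "color"
    return re.sub(
        r"\bPink\b",
        lambda m: f"<{tag}>{m.group(0)}</{tag}>",
        sentence,
        flags=re.IGNORECASE,
    )

print(annotate_pink("Pink released a new album"))
print(annotate_pink("She wore a pink dress"))
print(annotate_pink("Pink is nice"))
```

Nobody has to hand-tag anything in this setup: the program adds the tags, and the hard part is making the guess behind each tag a good one.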
- - Re:What about Existing Data? (Score:2)
    
    by corbettw ( 214229 ) writes:
    
    If they are prepared to pay me enough, I'll do it!
    
    Well, they're probably prepared to pay $1.50 an hour. So unless you live in India or the Philippines, I wouldn't be dusting off the ol' resume if I were you.
Entirely unsuited (Score:4, Insightful)

by happyfrogcow ( 708359 ) writes: on Monday January 12, 2004 @12:54PM (#7953337)

From the article, "But many online information sources are entirely unsuited to the XML model--for example, personal Web pages, e-mails, postings to newsgroups, and conversations in chat rooms."

entirely unsuited? chrissake. email, unsuited. newsgroups, unsuited. chat rooms, unsuited. If personal home pages are unsuited, then so are corporate home pages, as there is nothing inherently different about the two. All this from an IEEE article... I would have thought them to be more accurate and less misleading. I could put <popularmusic>Pink</popularmusic> in my HTML as easily as Amazon could in theirs.

HTML is based on the XML model. HTML is used to create personal web pages. How on earth then, could personal web pages be "entirely unsuited to the XML model"?

- HTML is based on the XML model. (Score:2)
  
  by wiredog ( 43288 ) writes:
  
  Ummm. No. HTML predates XML.
  - Re:HTML is based on the XML model. (Score:2)
    
    by happyfrogcow ( 708359 ) writes:
    
    details details...
    
    HTML (1992?) does predate XML (1996?). My point is that they are both SGML based, and a strict HTML 4.01 document is a valid XML document, unless I have something wrong in my understanding of all of this.
    
    Further, my point was not a debate on what HTML is or isn't considered to be derived from or a subset of, but that personal web pages are not inherently different from other web pages. To say a company can do something with their data that an individual cannot do, is misleading.
    - Yeah (Score:2)
      
      by wiredog ( 43288 ) writes:
      
      Good point.
- Re:Entirely unsuited (Score:5, Insightful)
  
  by orac2 ( 88688 ) writes: on Monday January 12, 2004 @01:10PM (#7953525)
  
  Disclaimer: I'm the author of the article.
  
  Most people don't and won't tag as they go. (Except for those of us used to writing HTML-enabled comments on /. of course). Also, in order to be able to write <popularmusic>Pink</popularmusic>, and have it make sense, you'd have to be following a DTD.
  
  As anyone who's been involved in DTD formulation can attest, even for internal documentation, it can be a royal pain in the butt. I don't think the vast majority of on-line rapid content generators (all those bloggers, emailers, chatters) will ever use XML to routinely tag their content manually. The article isn't talking about machine generated or commercial content, like Amazon's, but the day to day stuff that gets put up in the time it takes to write it and click submit, and which is of most interest to market researchers.
  
  - Re:Entirely unsuited (Score:2)
    
    by xyzzy ( 10685 ) writes:
    
    More to the point, HTML tags for RENDERING, not semantics. To a first order, ALL HTML pages look alike.
    - - Re:Entirely unsuited (Score:2)
        
        by xyzzy ( 10685 ) writes:
        
        Right, but on the great unwashed web, practice is all ya got. And the semantics of the tags (and their origins in document generation) is pretty darned impoverished. really doesn't tell you *anything* unless you are looking for tables -- what people REALLY need to find information is and and things like that.
        
        Re:Entirely unsuited (Score:2)
        
        by xyzzy ( 10685 ) writes:
        
        What people really need is to preview before submitting :-)! The last comment should have said "...to find information is <Year> and <GasPrices> and things like that".
  - Re:Entirely unsuited (Score:2)
    
    by happyfrogcow ( 708359 ) writes:
    
    Is it unreasonable to imagine a web community that advocates the use of some relevant DTD? On the nerdly end of things, if slashdot had their own DTD or used some other DTD, I might use it. It could add value to the site from a usability perspective as well as economic value for the owners.
    
    I think that if it was sufficiently easy for a person to know what tag to put around "Pink", and know that it would add something to the usability and understandability (am i making up words?) they might do it.
    - Re:Entirely unsuited (Score:2)
      
      by orac2 ( 88688 ) writes:
      
      On the nerdly end of things, if slashdot had their own DTD or used some other DTD
      
      Even back when the web was just composed and read by nerds, people still didn't follow the "rules" -- look at how HTML drifted from its original use of marking up content to being a poor man's page layout language.
      
      they might do it.
      
      Sorry, I just can't believe it. Most contributors to the web (i.e. non computer nerds) are hard pressed to remember even a handful of HTML tags, let alone maintain a familiarity with a DTD, ho
- - Re:Entirely unsuited (Score:2)
    
    by happyfrogcow ( 708359 ) writes:
    
    Um... No, XML is based on the HTML model.
    
    no, XML is based on the SGML model. HTML too, with exceptions to some SGML features. more info: http://www.w3.org/TR/html401/intro/sgmltut.html [w3.org]
Impact on Google IPO (Score:3, Interesting)

by G4from128k ( 686170 ) writes: on Monday January 12, 2004 @01:01PM (#7953412)

This is the type of technology that could either ensure or derail Google's future (I'm not saying that it will, only that it could). Semantic analysis and clustering of web pages could improve search. I hope Google gets to use/create this type of tech.

Echelon? (Score:2, Interesting)

by SexyKellyOsbourne ( 606860 ) writes:

This project sounds quite interesting -- it could really help out projects like Echelon [aclu.org] to help win the war on terrorism, if it's capable of understanding other languages of course, and could possibly build a whole database of information that's intercepted from other places. All that chatter, with the codewords they use, could possibly be understood by a football field full of Linux rackmounts, and might foil something.

Of course, such power could also be horribly misused if it came into the wrong hands.
- Re:Echelon? (Score:4, Insightful)
  
  by orac2 ( 88688 ) writes: on Monday January 12, 2004 @01:26PM (#7953672)
  
  Disclaimer: I'm the author of the article.
  
  I know, from talking to the WebFountain team that they're very sensitive to privacy concerns. WebFountain obeys robots.txt and doesn't archive material which has vanished from the publicly visible web (if only for reasons of storage capacity!).
  
  The point is that all the information that feeds into IBM is already publicly available. If someone wanted to go after Green Party members and if the Green Party posted its membership roll on a webserver, I think they'd be able to get it, WebFountain or no.
  
  Of course, I suppose WebFountain could be used to construct a membership list by scanning people's home pages to find out if they say that they're a member, but again this is publicly declared information.
  
  Bottom line, as always: if you don't want it generally accessible to all, don't put it on a public web server.
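
For readers who have not run into robots.txt: it is a plain text file in which a site tells crawlers which paths to stay out of. The sketch below checks two URLs against a made-up robots.txt using Python's standard `urllib.robotparser`; the rules, the crawler name, and the URLs are all hypothetical and say nothing about how WebFountain's own crawler is written.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

print(allowed(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/index.html"))
print(allowed(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/private/members.html"))
```

Note that this is a courtesy convention, not access control: a crawler that ignores the file can still fetch anything the server will hand out.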
  
  - Re:Echelon? (Score:2)
    
    by Nevyn ( 5505 ) * writes:
    
    The point is that all the information that feeds into IBM is already publicly available. ... Of course, I suppose WebFountain could be used to construct a membership list by scanning people's home pages to find out if they say that they're a member, but again this is publicly declared information.
    But that's it, you can't just say "all I did was collect public data" so it can't have privacy concerns. It's obviously still got them (unless your collector is useless).
    For instance, I might say on /. that
    - - Re:Echelon? (Score:2)
        
        by Nevyn ( 5505 ) * writes:
        
        Aggregation of the info is irrelevant. The fact that some system makes it easy to collate and/or find the information doesn't change the fact that the information is *already* out there. How can it have any privacy concerns beyond its public existence in the first place?
        Making it easier is a big thing though. For instance it's possible for someone to find out my social security or credit card numbers by just stealing information from the right place(s). This is not particularly well kept information, I'd
One Net to Rule Them All (Score:5, Insightful)

by null etc. ( 524767 ) writes: on Monday January 12, 2004 @01:03PM (#7953441)

It would be nice if, in parallel to the Internet, another network was developed to hold only semantically organized knowledge. That network would be free of marketing and commercial business, and would ostensibly be the largest repository of organized knowledge on the planet. Think Internet2, based entirely in XML.
Similar to HTML's current weakness in separating presentation from content, the web today has a weakness in separating content sites from sales sites. Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic, than finding actual content regarding that topic. This lack of ability to separate queries for knowledge, versus queries for product sales literature, is especially frustrating for scientists and programmers. I think Google is taking a step towards this with Froogle, meaning that if Froogle becomes popular enough, it's possible that Google will strip marketing pages from their search results.
Worse even, is when someone registers a thousand domains (plumbing-supplies-store.com, plumb-superstore-supplies.com, all-plumbing-supplies.com, etc) and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the same identical garbage. You would think that Google could detect this "marketing domain spam" and reduce the relevancy of such search results.
Anyways, I can't complain, because I can find nearly anything on the web I need, compared to 10 years ago.

- Re:One Net to Rule Them All (Score:3, Informative)
  
  by Tom ( 822 ) writes:
  
  Do a search in Google, especially for programming or technical topics, and you're more likely to retrieve 100 links to online stores selling a book on that topic, than finding actual content regarding that topic.
  
  (topic) -checkout -buy
  
  Other things that work well sometimes:
  (topic) site:.org
  (topic) -amazon
  (topic) -site:amazon.com -site:amazon.co.uk
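
If you find yourself typing those exclusions a lot, the patterns above are easy to generate. The helper below is a made-up convenience function that only builds the query string; it does not talk to any search engine.

```python
def build_query(topic, exclude_words=(), exclude_sites=(), only_site=None):
    """Build a search string using the -word, -site: and site: patterns."""
    parts = [topic]
    parts += [f"-{word}" for word in exclude_words]       # e.g. -buy
    parts += [f"-site:{site}" for site in exclude_sites]  # e.g. -site:amazon.com
    if only_site:
        parts.append(f"site:{only_site}")                 # e.g. site:.org
    return " ".join(parts)

print(build_query("javascript menu", exclude_words=["checkout", "buy"]))
print(build_query("javascript menu", exclude_sites=["amazon.com", "amazon.co.uk"]))
print(build_query("javascript menu", only_site=".org"))
```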
  
  and posts the same marketing page content ("Buy my plumbing supplies!") on each domain. A search on Google will then retrieve 100 separate links containing the
  - - Re:One Net to Rule Them All (Score:2)
      
      by K-Man ( 4117 ) writes:
      
      According to one talk I went to, Google uses approximate hashing to find duplicates (easy), and near-duplicates (hard). They may not be using the best methods, but even if they were, I suspect it would be difficult to find all the duplicate pages.
      
      Maybe if they looked for duplicate contexts on each search it would cover a lot of the problem.
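
To show why near-duplicates are harder than exact duplicates, here is a small sketch of one textbook approach: break each page into overlapping word triples ("shingles") and compare the sets with Jaccard similarity. This computes the similarity exactly, which is not the approximate hashing mentioned in the talk (schemes such as MinHash estimate the same quantity from small signatures), and it is not a description of what Google actually runs. The sample pages are invented.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical pages: A and B differ by one word, C is unrelated.
PAGE_A = "buy my plumbing supplies today at the best prices in town"
PAGE_B = "buy my plumbing supplies today at the lowest prices in town"
PAGE_C = "an informative article on how to write a javascript menu"

print(jaccard(PAGE_A, PAGE_B))  # 0.5
print(jaccard(PAGE_A, PAGE_C))  # 0.0
```

An exact-duplicate check (hash the whole page) would treat A and B as unrelated, while the shingle comparison still sees them as half identical even though a single changed word already touches three shingles.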
- on the desktop (Score:2)
  
  by goon ( 2774 ) writes:
  
  utilising your own system is a start. on the desktop there's nat [nat.org] *Ximian* friedmans Dashboard [nat.org]
- - - Re:One Net to Rule Them All (Score:2)
      
      by gilroy ( 155262 ) writes:
      
      Blockquoth the poster:
      If you searched google for Javascript menu, you'd get a billion results for companies that sell DHTML/Javascript menus...but you wouldn't (easily) be able to find an informative article on how to make one yourself.
      
      A lot of people's complaints about searching the Net come from a very narrow idea of search terms. Although sometimes I get swamped with commercial sites, I am generally able to find 6-8 useful pages on the first page of Google's results. For example, try
      javascript me
In other news... (Score:2)

by jetkust ( 596906 ) writes:

Researchers in Alabama are working on a system which converts all music on the internet into a single Menudo mp3 file. EIEIO reports the first public use will be to create a single mp3 file that results in trillions of dollars in royalties to the RIAA when traded illegally.
i.e. nameprotect (Score:4, Interesting)

by joeldg ( 518249 ) writes: on Monday January 12, 2004 @01:09PM (#7953507) Homepage

nameprotect does something similar, except they are looking for people violating copyrights.
in addition I think they might be one of the most banned bots online.

anyway, their users are all corporate entities who pay a lot of money to be able to auto-cease and desist copyright infringers..

These same companies will pay IBM to tell them that since their cease and desist spree everyone hates them.

URL of the project page (Score:2, Informative)

by DerOle ( 520081 ) writes:

WebFountain [ibm.com]
Like NorthernLight? (Score:5, Informative)

by dpbsmith ( 263124 ) writes: on Monday January 12, 2004 @01:11PM (#7953527) Homepage

This sounds very similar to NorthernLight.

NorthernLight was (it still exists, but apparently is not available to the nonpaying public at all) a search engine that displayed its results automatically sorted into as many as fifteen or twenty categories, automatically generated on the basis of the search. (For some reason, they called these categories "custom search folders.")

Since it's no longer available to the public I can't give a concrete example. I can't test it to see whether a search on "Pink" creates a couple of folders labelled "Singer" and "Color," for example. But that's exactly the sort of thing it does/did.

I actually would have used NorthernLight as one of my routine search engines--it worked quite well--had it not been for another major annoyance: in the publicly available version, it always searched both publicly available Web pages and a number of fee-based private databases, so whatever you searched for, the majority of the results were in the fee-based databases and I would have had to pay money to see what they were. In other words, it was heavy-handed promotion of their paid services and had only limited utility to those who did not wish to buy them.

- Re:Like NorthernLight? (Score:2, Informative)
  
  by Wiktor Kochanowski ( 5740 ) writes:
  
  Vivisimo [vivisimo.com] is doing sorting searches.
  Try it out, works quite often for me - beats Google for many queries, not in actual number of pages found, but in the time it takes me to find out whatever I'm looking for.
  - - Re:Like NorthernLight? (Score:2)
      
      by orac2 ( 88688 ) writes:
      
      Exactly what IBM wants to achieve, it seems.
      
      Except IBM isn't trying to build a general purpose search engine for humans, but a platform for data mining programs.
      
      Also WebFountain is trying to analyse not 150 hits, but the millions of hits returned over the web, not just the handful of top-ranked hits that vivisimo returns from other search engines (look at the details sections of the vivisimo result page where it lists the engines searched). It's apples and oranges really.
Gaming Webfountain (Score:4, Interesting)

by G4from128k ( 686170 ) writes: on Monday January 12, 2004 @01:11PM (#7953534)

I wonder how long it will take sleazy e-commerce sites and p0rn sites to game WebFountain and turn it into SpamFountain?

I suspect that this tool (and any like it) must make a core assumption -- that each webpage is about one semantic thing and that the creators are trying to communicate that one thought. In contrast, people who try to boost their page rank have no compunction about misleading people (or algorithms). Clever tagging and misleading verbiage should be able to fool IBM's analyzer into clustering a site where it does not belong (but where the site owner wants it). The result is a page that looks like it is about one thing (some popular search term) while being about something else (selling their junk or porn).

Next will come high-priced consultants that tell you how to make your site place highly on WebFountain (like the ones that currently game Google).

IBM's Pink (Score:2, Funny)

by th77 ( 515478 ) writes:

IBM should know that Pink was the predecessor to Taligent [wikipedia.org] which was the predecessor to absolutely nothing.
ObSCO ref (Score:2)

by gosand ( 234100 ) writes:

IEEE reports that the first commercial use will be to track public opinion for companies.

Can't wait to see what the entry for SCO looks like...
so-called tags (Score:2)

by jamesl ( 106902 ) writes:

"Things such as price or product identification numbers are identified by bracketing them with so-called tags, as in Deluxe Toaster , $19.95 ."

They're "tags", not "so-called tags".

Tags! Like those little things they hang on stuff at the store to tell you how much it costs. Tags.

Of course, he may have been referring to their use in a "software program".
How long before people start gaming the system? (Score:5, Interesting)

by dpbsmith ( 263124 ) writes: on Monday January 12, 2004 @01:21PM (#7953631) Homepage

As Google has discovered, it's only possible for simple heuristics and algorithms to "understand" the human content on the Web for as long as it doesn't matter.

As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.

And the stakes are much higher for gaming WebFountain than for gaming Google.

For example, I'd imagine there would be big money for anyone who could convince companies that they know how to make it appear that a particular movie/song/toy/computer was "hot," so that the WebFountain-using Walmarts and Best Buys of the world would stock more of it.

WebFountain will work well only until it is actually introduced.

- Re:How long before people start gaming the system? (Score:3, Informative)
  
  by orac2 ( 88688 ) writes:
  
  Disclaimer: I'm the author of the article
  
  As soon as people become aware that Google or WebFountain or whatever is trying to evaluate web content, immediately they will begin trying to reverse-engineer and subvert the algorithms and heuristics that are used.
  
  This could be tricky -- WebFountain uses a kitchen sink approach, with a varying palette of content discriminators and disambiguators. The developers are also savvy to downweight link farm type approaches. Of course, one could say, conduct a campaign
  - Re:How long before people start gaming the system? (Score:3, Informative)
    
    by Jerf ( 17166 ) writes:
    It's important not to underestimate people's ability to game systems, regardless of the thought put into them. The simple algorithm
    
    1. Reconstruct algorithm.
    2. Simulate algorithm and play with the inputs until the outputs match what you want.
    3. Bring those inputs about.
    
    is extremely powerful, and note that as a "meta-algorithm" there's absolutely no way to completely shut it down.
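
The three steps above can be sketched as a toy in Python. Everything here is hypothetical: `true_score` stands in for some opaque ranking system, `simulated_score` is an attacker's rough guess at it, and the feature names and weights are invented for illustration. Nothing below reflects how WebFountain or Google actually score pages.

```python
# Toy illustration of "reconstruct, simulate, bring about".
# All functions, features, and weights are hypothetical.

def true_score(page):
    """The opaque system being gamed (the attacker cannot read this)."""
    return 3 * page["keyword_hits"] + 2 * page["inbound_links"] - 4 * page["ad_blocks"]

def simulated_score(page):
    """Step 1: the attacker's approximate reconstruction (weights are off)."""
    return 2 * page["keyword_hits"] + 1 * page["inbound_links"] - 5 * page["ad_blocks"]

def game(page, target, max_steps=100):
    """Step 2: tweak inputs against the simulation until it reaches the target."""
    page = dict(page)
    for _ in range(max_steps):
        if simulated_score(page) >= target:
            break
        # Try each single-step change and keep whichever helps the simulation most.
        candidates = []
        for key, delta in (("keyword_hits", 1), ("inbound_links", 1), ("ad_blocks", -1)):
            trial = dict(page)
            trial[key] = max(0, trial[key] + delta)
            candidates.append((simulated_score(trial), key, trial))
        page = max(candidates, key=lambda c: c[0])[2]
    return page

start = {"keyword_hits": 1, "inbound_links": 0, "ad_blocks": 3}
gamed = game(start, target=20)
# Step 3 would be publishing a page that actually has these properties.
```

Even though the simulated weights are wrong, the page tuned against the simulation also scores higher on the "real" function in this toy, which is the uncomfortable part: an approximate model can be enough.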
    
    You have only four basic defenses against this:
    
    Keep changing the algorithm (expensive and large changes may not be possible if stabi
    - Re:How long before people start gaming the system? (Score:2)
      
      by orac2 ( 88688 ) writes:
      
      The thing is that it's hard to do the second step of your general algorithm: Simulate algorithm and play with the inputs until the outputs match what you want.
      
      Determining the outputs and closing the feedback loop is hard -- getting WebFountain output is pretty pricey, compared to search engine results, where you can have a very low-cost feedback loop. This makes reconstructing the algorithms hard, if not impossible. Also remember that the exact set of algorithms varies depending on the problem: because t
      - Re:How long before people start gaming the system? (Score:2)
        
        by Jerf ( 17166 ) writes:
        
        First, the "hardness" needs to be measured against the value of the benefit obtained from gaming. If it's large, more effort will be thrown at it.
        
        Second, you seem to have missed the implications of my carefully-chosen word simulate. You don't need to replicate the algorithm, just create something that mostly works in most of the situations that you care about. (Both "mosts" are important.) This is a significantly lower bar than "complete replication", and is one of the reasons it's so hard to combat this;
    - Re:How long before people start gaming the system? (Score:2)
      
      by kindofblue ( 308225 ) writes:
      
      I second this sentiment, that gaming of any system is likely, not merely possible.
      This is because humans can be "gamed" in the real world. That is, one can fabricate a "buzz" about things, not simply by overt measures like commercials, but plants in social situations. Sony or some other consumer electronics companies planted people in Times Square and other highly visible situations to pretend to use some cool new gadget. Then people see it and tell their friends and then eventually, they hope, there
- Re:How long before people start gaming the system? (Score:2)
  
  by bogie ( 31020 ) writes:
  
  You mean kinda like how Google is getting ruined by scumbags who set up thousands of fake sites that just refer everything you've ever searched for directly to Amazon? Google has become almost worthless for product research. Sure, it's still "better" than anything going, but the spammers and marketers have filled it with way too much garbage.
"Is this web site selling something"? (Score:4, Insightful)

by Animats ( 122034 ) writes: on Monday January 12, 2004 @01:22PM (#7953634) Homepage
Search engine spiders need to understand more about sites. Things like this:
- The site is selling something.
- The page is composed of multiple unrelated articles or ads, each one of which should be viewed as a separate entity for search purposes.
- The page is part of a blog.
- Content on this site duplicates that found on other sites.
- The site is owned by an organization with a known Dun and Bradstreet number. (If a site is selling something, and its Whois info doesn't match the DNB corporation database, it should be downgraded in search position. This would encourage honest Whois info.)
- Dun and Bradstreet number (Score:2)
  
  by rark ( 15224 ) writes:
  
  > The site is owned by an organization with a
  > known Dun and Bradstreet number. (If a site is
  > selling something, and its Whois info doesn't
  > match the DNB corporation database, it should
  > be downgraded in search position. This would
  > encourage honest Whois info.)
  
  This may be a question born of serious ignorance. If so, I'd really appreciate some enlightenment.
  
  This is also not so theoretical for me, as I am currently privately developing a product that I will eventually be selling online.
- - Re:"Is this web site selling something"? (Score:2)
    
    by Animats ( 122034 ) writes:
    
    A good way to find out if a site is selling something is by looking for links that lead to forms that take credit cards.
    That's a good spam-filtering algorithm, too. As I keep telling people who fight spam, "follow the money". Quit worrying about where the spam is coming from. Follow where the money goes.
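
A minimal sketch of the "does this page take credit cards" check, using only Python's standard `html.parser`. The field-name hints in `CARD_HINTS` are guesses made up for illustration, and the sketch only inspects HTML that has already been fetched; following links to find the checkout page, as suggested above, is left out.

```python
from html.parser import HTMLParser

# Hypothetical substrings that suggest a card-number or CVV field.
CARD_HINTS = ("card", "cc_num", "ccnum", "cvv")

class PaymentFormFinder(HTMLParser):
    """Flags a page if any <form> contains an input that looks like a card field."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.takes_cards = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            attrs = dict(attrs)
            # Attribute values can be None, so fall back to an empty string.
            label = ((attrs.get("name") or "") + " " + (attrs.get("id") or "")).lower()
            if any(hint in label for hint in CARD_HINTS):
                self.takes_cards = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def looks_like_checkout(html):
    """Return True if the HTML contains a form with a card-like input field."""
    finder = PaymentFormFinder()
    finder.feed(html)
    return finder.takes_cards
```

This is a crude heuristic: it would miss payment forms rendered by scripts or hosted by a third-party processor, and a field named something like "discard" would trip it.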
SCO (Score:5, Funny)

by Zork the Almighty ( 599344 ) writes: on Monday January 12, 2004 @01:22PM (#7953638) Journal

IEEE reports that the first commercial use will be to track public opinion for companies.

Searching "SCO" Found "Slashdot" ERROR arithmetic underflow.

CrapFountain (Score:5, Funny)

by s4m7 ( 519684 ) writes: on Monday January 12, 2004 @01:25PM (#7953661) Homepage

Here's how it works:

Executive Bob, who's paid IBM $150,000 for his enterprise license of WebFountain, enters into his WebFountain search box: "Pink the musician, not the color"

IBM's powerful software parses this command into "pink music -color" and passes it to google, retrieves the results, removes Google's paid ads and replaces them with IBM's paid ads. The content is then served to Executive Bob, who shouts: "EUREKA" since within the top ten search results he finds "NUDE PICTURES OF RAPPER PINK!"

IBM then lands a lucrative support contract with Executive Bob to remove all the viruses and spyware from his desktop PC. Rinse and repeat.

Half a football field? (Score:4, Interesting)

by AndroidCat ( 229562 ) writes: on Monday January 12, 2004 @01:27PM (#7953677) Homepage

(Imperial or metric football fields?)
IBM's breakthrough is called WebFountain--half a football field's worth of rack-mounted processors, routers, and disk drives running a huge menagerie of programs.
Later:

It uses a cluster of thirty 2.4-GHz Intel Xeon dual-processor computers running Linux to crawl as much of the general Web as it can find at least once a week.
To ensure that WebFountain's finger is constantly on the pulse of the Internet, an additional suite of similar computers is dedicated to crawling important but volatile Web sites, such as those hosting blogs, at least once a day. Other machines maintain access to popular non-Web-based sources, such as Usenet (a newsgroup service that predates the Web) and the Internet Relay Chat system, known as IRC. The data is then passed into WebFountain's main cluster of computers, currently composed of 32 server racks connected via gigabit Ethernet. Each rack holds eight Xeon dual-processor computers and is equipped with about 4-5 terabytes of disk storage.
That's a lot of stuff, but half a football field? Possibly they're including cubicles for the staff or did they just inherit some old Big Iron space that was that large?
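
A back-of-the-envelope check of the main cluster, as a sketch. The rack, machine, and disk figures come from the quoted article; the field size (an American football playing field without end zones, 300 ft by 160 ft) and the 25 square feet allowed per rack, including aisle space, are assumptions and not from the article.

```python
# Figures quoted from the article
RACKS = 32
MACHINES_PER_RACK = 8
CPUS_PER_MACHINE = 2          # dual-processor Xeon machines
TB_PER_RACK = (4, 5)          # "about 4-5 terabytes" per rack

# Assumptions, not from the article
FIELD_SQFT = 300 * 160        # American football playing field, end zones excluded
SQFT_PER_RACK = 25            # guessed footprint including aisle and service clearance

machines = RACKS * MACHINES_PER_RACK
cpus = machines * CPUS_PER_MACHINE
storage_tb = tuple(RACKS * tb for tb in TB_PER_RACK)
floor_sqft = RACKS * SQFT_PER_RACK
share_of_half_field = floor_sqft / (FIELD_SQFT / 2)

print(machines, cpus, storage_tb, floor_sqft, round(share_of_half_field, 3))
# prints: 256 512 (128, 160) 800 0.033
```

Under those assumptions the 32 racks cover only a few percent of half a field, so the rest would have to be crawlers, networking, cooling, or plain empty floor.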

Prior art :o) (Score:4, Funny)

by Mr_Silver ( 213637 ) writes: on Monday January 12, 2004 @01:34PM (#7953752)

IEEE reports that the first commercial use will be to track public opinion for companies
You can do that already with Google:
A search for "Microsoft is evil" gets you 600,000 pages.
A search for "Microsoft is good" gets you 3,590,000 pages.
Therefore Microsoft is more good than evil.
Err ... that wasn't quite the answer I was expecting.
(cue sounds of joke falling apart...)

- Re:Prior art :o) (Score:2)
  
  by CaptnMArk ( 9003 ) writes:
  
  The funny thing is: if you search for the above with usenet google will suggest an interesting list of newsgroups.
- Re:Prior art :o) (Score:2)
  
  by imr ( 106517 ) writes:
  
  Nope,
  -searches for
  microsoft is evil
  and
  microsoft is good
  produce such results.
  BUT
  -searches for
  "microsoft is evil"
  and
  "microsoft is good"
  produce a different result:
  2070 and 1020 respectively, showing that:
  1/ microsoft IS evil.
  2/ good prevails over evil on the internet.
- Google fight (Score:2)
  
  by Kidbro ( 80868 ) writes:
  
  That sounds a whole lot like Google fight [googlefight.com] :)
  
  This [googlefight.com] wasn't the answer I was hoping for either ;)
Potential money saver: Differential buzz (Score:2, Insightful)

by benja ( 623818 ) writes:

The head of a research and development department could feed WebFountain all the e-mails, reports, PowerPoint presentations, and so on that her employees produced in the last six months. From this, WebFountain could give her a list of technologies that the department was paying attention to. She could then compare this list to the technologies in her sector that were creating a buzz online. Discrepancies between the two lists would be worth asking her managers about, allowing her to know whether or not the
It already exists (Score:3, Interesting)

by claudebbg ( 547985 ) writes: on Monday January 12, 2004 @01:45PM (#7953871) Homepage

I've already seen/heard of such systems, basically in the Business Intelligence field.
In England, a system like Autonomy [autonomy.com] (used by the police at the beginning) can crawl a mass of information with dedicated spiders (not only for the web, but also commercial databases, files...). Then it structures all the content into themes with links and proximity.
I personally tested it some years ago, feeding it with news websites and asking for articles "close to" another one. The efficiency was amazing because it was able to tell apart similar terms that have really different meanings depending on the context. Usually, search engines get this wrong because they can't use the context.
I also set up some "agents" for recurrent searches (an agent is basically a search plus some training, letting Autonomy know which found documents are close and which are not) and it was able to propose a really good press review every day with nearly no wrong documents.
As a complement to Autonomy, I know a BI team that uses some other tools like Pericles [datops.com] to feed the searches with "relevant" content, basically themes that are "appearing" in the group of documents and are close to some interests.
Such BI tools can already provide the kind of information cited, like an opinion movement against a company detected in the newsgroups or on some websites. And IBM is certainly on track to improve such tools with the techniques of their labs.
I hope these tools won't be limited to PR articles on the web and/or private use by big corporations, because then it could only be another Echelon with all its bad consequences:
- bad use of public information
- paranoia fed by false scares
- public/corp. power against the citizens
If tools like Echelon could be used by everybody, they would have to leave citizens much more privacy, and public leaders would have to explain the investments.

Sounds like CYC (Score:3, Interesting)

by Sanity ( 1431 ) * writes: on Monday January 12, 2004 @01:50PM (#7953922) Homepage Journal

CYC [cyc.com] have been trying to collect all human knowledge for the last few decades and feed it into a knowledge base. They have even open sourced part [opencyc.org] of their database.
Despite the apparent promise of the project, it is difficult to find actual examples of it doing really cool stuff.

semantic web (Score:2, Informative)

by jonasmit ( 560153 ) writes:

XML simply isn't enough. Structure != Meaning. Meaning must be inserted somewhere by someone. Trying to interpret HTML/natural language to form structured documents is a daunting task. If you want real meaning then the data needs to be described or translated into a meaningful form like RDF [w3.org] (yes, represented by XML) when it is created, so that intelligent agents such as this can *understand* the data. RDF uses triples (think graphs) to describe relationships, making use of URIs: Subject--Predicate--Object.
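
To make the Subject--Predicate--Object idea concrete, here is a minimal sketch in plain Python. It does not use a real RDF library, and the `http://example.org/` URIs and the `type`/`name` predicates are invented for illustration.

```python
# Each statement is a (subject, predicate, object) triple. URIs are made up.
EX = "http://example.org/"

triples = {
    (EX + "Pink", EX + "type", EX + "Singer"),
    (EX + "Pink", EX + "name", "Pink"),
    (EX + "pink", EX + "type", EX + "Colour"),
    (EX + "pink", EX + "name", "pink"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# "Which things are singers?" becomes a pattern match on predicate and object.
singers = [s for s, _, _ in match(p=EX + "type", o=EX + "Singer")]
```

The two "pink" resources have different URIs, so the singer and the colour stay distinct even though their names differ only in case.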
social trends analysis (Score:2)

by WebTurtle ( 109015 ) writes:

This technology should be made available to social scientists, anthropologists, cultural critics, etc. so that current social trends can be analyzed. Perhaps IBM would be kind enough to provide free access to this system to Universities?

It is a pity that the WebFountain system is geared toward corporate users. Of course, there must be some ROI... but, still it makes me sad that every new technology seems to be driven by corporate desire for good PR and world domination.
Interestingly, this article comes
Encourage Human Markup Discourage Machine MU (Score:3, Informative)

by leoaugust ( 665240 ) writes: <leoaugust&gmail,com> on Monday January 12, 2004 @02:00PM (#7954028) Journal

Analytic tools can ferret out patterns in, say, a sales receipt database, so that a retail store might see that people tend to buy certain products together and that offering a package deal would help sales. ...

This urban-legend example of people buying beers and diapers at the same time (hence the sections for beer and diapers should be close by, at least on Saturdays) has been beaten to death and beyond.
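
For what it's worth, the "bought together" analysis from the quote boils down to counting item pairs per receipt. A minimal sketch with made-up receipts (the data and the `support` helper are illustrative, not anything from the article):

```python
from collections import Counter
from itertools import combinations

# Made-up receipts; each one is the set of items in a single purchase.
receipts = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]

# Count how often each unordered pair of items shows up on the same receipt.
pair_counts = Counter()
for basket in receipts:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def support(pair):
    """Fraction of receipts that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(receipts)
```

Here `support(("diapers", "beer"))` is 0.5, since two of the four receipts contain both items.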

A sentence that originally read "We visited Mount Fuji and took some photos" would become something like "We visited Mount Fuji and took some photos."

I am not sure what the tags around "Mount Fuji" have added in this example. The only thing I can think of is that these are similar to the "smart tags" of MS Office that pre-populate straightforward relational data like a contact's email or address. Personally, when I need this info I would search Google for "mount fuji latitude", and the first result gives me the latitude and longitude of Mount Fuji. What is the point of pre-feeding this info during the "markup"? And it bears repeating here that rather than complaining about results that you get with one or two keywords, think about adding keywords to narrow and specialize the search. "Paris Hilton video" is better than just "Paris Hilton", which might unnecessarily show you stuff about hotels.

By the time the annotators have finished annotating a document, it can be up to 10 times longer than the original.

So, a person was probably talking about a molehill, and the machine markup has changed that into a mountain. How many of the extra tags (even accounting for the verbosity of XML) have really added "meaning" to the document? How much of the "meaning" was intended and how much has been force-fed by the machine?

These heavily annotated pages are not intended for human eyes; rather, they provide material that the analytic tools can get their teeth into.

This is where I think that they are using XML but going away from the XML concept. It was supposed to be human readable. If the IBM research group started focusing on how to help people make sense of the 1x material and 10x markup, they would be introducing the person at the right time in the analysis process - introducing a person at the last stage, especially in deriving "meaning", may not be the best strategy. The markups are just "filters" through which a lot of context becomes apparent when the material is viewed. What we need to do is to let people start with the filters and then look for the material (top-down) or start with the material and look for filters (bottom-up) - sort of a more iterative procedure involving both these approaches.
Google lets you do a keyword search (bottom-up) or via the directories - DMOZ (top-down). Vivisimo and Grokker were recently discussed on slashdot where they were creating dynamic categorizations, i.e. bottom-up. I think it would be better to let people analyze the markup (directory/top-down approach) or analyze the material (keyword/bottom-up) rather than mixing up the two and presenting the "results" to the person.

E-mails or instant messages can't be labeled in this way without destroying the ease of use that is the hallmark of these ad hoc communications; who would bother to add XML labels to a quick e-mail to a colleague?

This is the second place where energies should be focused. Where the document is created may mean a lot. It could be that a new file inherits the path (hence context) of the directory I create it in, or it could be as simple as this: on the top right of the screen I create personal files, on the bottom right I create files about sports, on the left-bottom-middle I create files about Java, etc. I think this beats, any day, the bot-annotators that come after me and add 10 times more markup than the whole of the quick email that I sent to a colleague.
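
A minimal sketch of the "file inherits its path as context" idea, using Python's standard `pathlib`. The `context_tags` helper and the example path are hypothetical, and a relative path is assumed.

```python
from pathlib import PurePosixPath

def context_tags(path):
    """Use the folder names above a file as its context labels."""
    return list(PurePosixPath(path).parent.parts)

tags = context_tags("work/projects/webfountain/notes.txt")
# tags is now ["work", "projects", "webfountain"]
```

The labels cost the author nothing beyond choosing where to save the file, which is the whole appeal compared with after-the-fact annotation.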
- Re:Encourage Human Markup Discourage Machine MU (Score:2)
  
  by orac2 ( 88688 ) writes:
  
  I think you are missing the point. The tags are not for people, but for data analysis software. Comparing a search engine to a general analysis platform (which is what WebFountain is) is like comparing apples to oranges. The entire apparatus (WebFountain plus data mining software) is designed to produce high level reports that talk about data in the aggregate.
  - Re:Encourage Human Markup Discourage Machine MU (Score:2)
    
    by leoaugust ( 665240 ) writes:
    
    The entire apparatus (WebFountain plus data mining software) is designed to produce high level reports that talk about data in the aggregate.
    
    The tags are not for people, but for data analysis software
    
    My perspective is from the point of view of a business man trying to use the "data." This data must have some correlation to reality of the business, and most preferably illustrate some correlation or cause-effect that I could use to predict the future a little more accurately. This is where the theor
A rack of servers can't beat good old META data (Score:2)

by prototype ( 242023 ) writes:

Trying to intelligently search for information in the universe is an age-old problem. How can my system be smart enough to tell the difference between Pink the singer and pink the color (or colour if you prefer)? Basically, it can't.
Nothing is smart enough to tell the difference because the content is contextual (hence the name). In a corporation like the one I'm at now (a class A railway) we have hundreds of terabytes of information flowing through our systems on a regular basis. Trying to track it, categori
- Re:A good idea for search engines follow? (Score:2)
  
  by rcastro0 ( 241450 ) writes:
  
  Shame on me for being curious.
  
  lemon party [urbandictionary.com]
  a group of 3 or more old men in a circle sucking each other off
  Bill and Carl joined their grand fathers at the lemon party
  
  I don't want to think about why this term ever arose and was able to drive traffic through Google.
