
Mining Neologisms from Wikipedia 93

holy_calamity writes "Natual Language Programming researchers have developed a tool called Zeitgeist that can discover the meaning of new words for itself using Wikipedia. It looks for entries for words not in the WordNet database and works out their meaning by looking for known words linked to them. Development of the tool is focusing on using it to understand what bloggers (using slang and neologisms) are saying about companies' products."
This discussion has been archived. No new comments can be posted.
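
The approach in the summary (find entries whose titles are missing from WordNet, then infer a meaning from the known words linked to them) can be sketched roughly as below. This is a toy illustration only, not the researchers' implementation: `KNOWN_WORDS` is a made-up stand-in for WordNet, `ARTICLES` is a made-up stand-in for Wikipedia link data, and the majority-vote rule in `guess_category` is an assumption chosen for simplicity.

```python
from collections import Counter

# Hypothetical stand-in for WordNet: known word -> coarse category.
KNOWN_WORDS = {
    "website": "artifact",
    "server": "artifact",
    "traffic": "phenomenon",
    "crowd": "group",
    "bandwidth": "attribute",
}

# Hypothetical stand-in for Wikipedia: entry title -> terms linked from that entry.
ARTICLES = {
    "slashdotting": ["website", "traffic", "server", "flash crowd"],
    "flash crowd": ["crowd", "traffic"],
    "server": ["website"],
}


def find_neologisms(articles, known):
    """Return entry titles that the known-word lexicon does not contain."""
    return sorted(title for title in articles if title not in known)


def guess_category(term, articles, known):
    """Guess a category for `term` by majority vote over its linked known words.

    Returns None when the entry is missing or links to no known words.
    """
    votes = Counter(known[w] for w in articles.get(term, []) if w in known)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    for term in find_neologisms(ARTICLES, KNOWN_WORDS):
        print(term, "->", guess_category(term, ARTICLES, KNOWN_WORDS))
```

Note that a vote over linked words ignores how a word is actually used in context, so a sketch like this says nothing about tone or connotation.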

  • ...one entity gathers what another entity spills...
    • > It looks for entries for words not in the WordNet database and
      > works out their meaning by looking for known words linked to them

      I suspect some bugs need to be worked out. For example, it came up with this definition:

      slashdot: v To surf for pictures of pretty girls (e.g. Natalie Portman) for the purpose of satisfying unrelieved sexual frustration owing to social retardation, using powerful network-enabled computers (e.g. Beowulf clusters).

  • by mdhoover ( 856288 ) on Tuesday September 05, 2006 @10:18AM (#16044332) Homepage Journal
    if they pointed it at slashdot...
    "ass-hat" and "tard" could take on a whole new meaning
  • by ettlz ( 639203 ) on Tuesday September 05, 2006 @10:19AM (#16044335) Journal
    The Slashdot, Digg, or Fark effect is the term given to the phenomenon of a popular website linking to a smaller site, causing the smaller site to slow down or even temporarily close due to the increased traffic. The name comes from the huge influx of web traffic that often results from sites being mentioned on Slashdot, Digg, or Fark.com, popular user submitted news and information sites. Typically, less robust sites are unable to cope with the huge increase in traffic and become unavailable - either their bandwidth is consumed or their servers fail to cope with the high number of requests.
    • by daniil ( 775990 )
      The main weakness of the Zeitgeist seems to be that it takes the word out of context, meaning that in many cases, it gets the meaning of the word plain wrong (see the example of 'feminazi' from the article). They would probably run into the same problem with 'slashdotting'. If someone uses the term 'slashdotted' in relation to a company's product, then does it have a negative or positive meaning (or connotation)? Will they really understand what the bloggers are saying?
      • by eepok ( 545733 )
        Remember that languages are written and enforced, but DIALECTS are spoken and understood. Thus, there may be multiple meanings for a single word or phrase to different cultures or sub-cultures.

        We'll use your example of "Feminazi". To some people, any feminist is a "Feminazi". To others (take some feminists), it's a feminist who is irrational in his/her ways and seeks power over equal or fair treatment and expectations.

        Another example would be the word "Jew". It can be said or used in such a way to give insu
    • Slashdotting is slashdotting. It is irritating that they are trying to rename it "the Digg or Fark effect." If Digg or Fark cause a site to get hammered, it should still be called slashdotting. Why? Because we (this community) are the originals, and still, in my opinion, the best.
      By the way, I have an odd problem with the word neology. Why? Because in my 7th grade Latin Class, one of our assignments was to be a neologist, using latin roots to make up a new word. So the word neology makes me think of 7th g
    • Credit where credit is due: flash crowd [wikipedia.org].
      • Excellent insight by Larry Niven.

        To extend, the lack of huge crowds of time-travelling tourists at events such as the WTC collapse is the best evidence that time travel from the future into our present cannot happen. Either that, or time travellers are required to cloak themselves. Otherwise, the streets and skies of NYC would have been packed with tricked-out Deloreans on 9-11-01.

        Stephen Hawking thinks this may be because the furthest you can travel back in time is to the invention of the time machine (and
        • Robert L. Forward's "Timemaster" deals with confined time travel like that.
          You make two ends of a wormhole and carry them to wherever you want. It
          obviously takes a very long time at ~cee to do this. Your wormhole is now
          a fixed-length (space-)time machine on the order of how much time you spent
          transporting the ends.
  • by packetmon ( 977047 ) on Tuesday September 05, 2006 @10:20AM (#16044338) Homepage
    Imagine the chaos and reboots as the program analyzes a George W. Bush speech
  • by perkr ( 626584 ) on Tuesday September 05, 2006 @10:21AM (#16044344)
    Figuring out what people on the net say about your products is the "new" thing, apparently. IBM has their own engine for the task too [ibm.com]. Kind of makes you wonder how much power the net community will in fact have in day-to-day decision making in the corp headquarters' marketing strategy depts.
    • So far, companies rely mainly on surveys for feedback regarding their products. One of the problems with surveys is that normal people tend to reject them, so we end up trusting the feelings of a small survey-geek market: people who are paid little to answer and who give irrelevant feedback.

      Blog/online feedback research is different in that it focuses on what people consider worth saying/writing about a certain product. Bias is less likely, because of transparency.
    • by Who235 ( 959706 )

      Figuring out what people on the net say about your products is the "new" thing, apparently. IBM has their own engine for the task too [ibm.com]. Kind of makes you wonder how much power the net community will in fact have in day-to-day decision making in the corp headquarters' marketing strategy depts.

      I don't know, but I know those snakflabbing IBM products really zorf me right in the snurls. . .
    • Thanks for that link perkr .... another interesting thing to play with.
    • by rm999 ( 775449 )
      I am sure that the net has been an invaluable resource for some time. Any marketing or customer research department worth anything has probably employed someone to scour message boards and webpages to gather buzz and opinions surrounding their own products. Automating this task is simply the next step, but I question how much information a computer really can gather about something so subjective.
  • by brunascle ( 994197 ) on Tuesday September 05, 2006 @10:22AM (#16044353)
    George W. Bush
    1. 43rd president of the United States.
    2. miserable failure.
  • by sbaker ( 47485 ) * on Tuesday September 05, 2006 @10:25AM (#16044375) Homepage
    The trouble is that Wikipedia has a policy of not writing about (or using) Neologisms:

        http://en.wikipedia.org/wiki/WP:Neologism [wikipedia.org]

    Many articles about neologisms *do* get created in violation of this policy - but they are generally put up for deletion via the Wikipedia process for deleting inappropriate material - so they only exist briefly.

    So, for example, the article entitled "Windows Rot" is being debated today. Although it looks like this one will be merged into an existing article, it won't survive as the name of an article - so Zeitgeist presumably won't be able to find it.

    It may be that enough of these kinds of articles slip through the system to be useful to Zeitgeist but that is not by design - so coverage will be patchy at best.

    A further consequence of this is that the articles that Zeitgeist does find will most likely be so new that only one person will have worked on them - which will make for poor quality.

    Also, it is very common for people such as bloggers who come up with what they consider to be clever new words to try to wedge them into common usage by writing about the word in Wikipedia. This 'vanity word' problem is one of the main reasons that Wikipedia seeks to avoid articles on neologisms.

    • Of course, you're presupposing that having an official policy can preclude the development of language. This isn't quite the case: see for example the strictly conservative linguistic prescriptivism in France, carried out by institutions such as Académie française, who are charged with the final authority on French usage, and their total inability to prevent the pervasive influence of the English language on the French lexicon. Likewise, despite Strunk and White's steadfast insistence, "hopefully"
      • You're correct, but Wikipedia is not a dictionary -- Wiktionary is.
        • by sbaker ( 47485 ) *
          I don't know whether Wiktionary has similar policies to Wikipedia as regards neologisms. I agree that these guys should theoretically be using Wiktionary instead of Wikipedia - but that's not what they are doing, so it's somewhat irrelevant.
      • by sbaker ( 47485 ) *
        I'm not in any way suggesting that Wikipedia's policy is an attempt to control the language. The Academie Francaise is ineffectual because they are trying to regulate the way people speak - and there is no way to make that stick. They have made rulings about words like "Parking" which are creeping into the French language because there is no good single-word alternative. The Academie says "Stationary at the side of the road" is "correct" French - and makes laws that say that roadside signs must use that p
    • by sootman ( 158191 )
      This is a totally bogus policy. Most neologisms are perfectly cromulent [wikipedia.org] words.

      (BTW, am I the only one who has added 'cromulent' to his spellchecker's list of good words?)
    • Wikipedia greatly endorses the Neologism (or perhaps Protologism according to their page) "initialism".

      For some reason, someone decided to redefine acronym and make up a new word to cover what acronym covered before. And Wikipedia uses it constantly, despite the pointlessness of it and the fact that the word hasn't caught on widely, thus making it a protologism. Although protologism isn't a word that has caught on widely either, thus making it a protologism itself at best, more likely a vanity word.
      • And thus protologism would then be true of itself, making it homological. Reminds me of this quote:

        If a homological adjective is one that is true of itself, e.g., "polysyllabic", and a heterological adjective is one which is not true of itself, e.g., "bisyllabic", then what about "heterological?" Is it heterological or not?
            - Grelling's Paradox
      • by sbaker ( 47485 ) *
        Initialism is defined in Webster's - it's not new. What might be new is the annoying tendency for people to argue that an acronym like 'IBM' is in fact an initialism.

        Ironically 'protologism' does seem to be a neologism - there is a definition for it in The Urban Dictionary from 2003 - so it's at least 3 years old.
    • I wish there were a better feedback system for sites that could be useful slang dictionaries, like urbandictionary.com (I think that is the url). Some entries reflect actual usage, some are obvious inventions on the spot, but get ranked highly anyway, because someone thinks they are funny or useful enough.
  • For example, in French slang, the same person could use the word "batard" as either an insult or a display of respect, and neither of these meanings is related to the target's father.

    I wish them good luck...
    • Re: (Score:2, Funny)

      by RubberBaron ( 990477 )
      Yeah, you gotta admit, it's a wicked idea...
    • In what French expression exactly is it a show of respect to use "bâtard"??? I cannot think of one.
      • Indeed, in my experiences in both Belgium ('95) and France ('05) bastard is very much a derogatory term. Perhaps he means in English. In English bastard can be used in a familiar sense, though even in the familiar sense the "target" is not the father.
      • by 4D6963 ( 933028 )

        Well, for example, when you refer to a friend you envy for a precise in-context reason, calling him a bastard would somehow be what the GP is talking about. But that would also work for other insults, such as enculé, and undoubtedly even in other languages.

        Example :
        "-Dude, I just had sex with the Olsen twins!
        -You bastard!"

  • 31g 3r0+her iz wa+ch1ng U!
  • This is what the Urban Dictionary [urbandictionary.com] is for.
    • Indeed. A quick look at number 5 on that list [urbandictionary.com] shows just how reliable a dictionary written, reviewed and read by a community of prurient 13-year-olds can be (as of writing the 5th most popular definition of neologism is The One's manjuice). People question Wikipedia's accuracy - just take a look at the Urban Dictionary and see how bad it could have been.
  • by clickclickdrone ( 964164 ) on Tuesday September 05, 2006 @10:41AM (#16044470)
    and started creating its own gazornaplatting words that no-one but the program itself could middlybundy? It could eat up bibblys of disk space as all the new words chimmdudlied in a grawn.
    • Re: (Score:3, Funny)

      by sbaker ( 47485 ) *
      started creating its own gazornaplatting words

      Gazomplat. Wow! I remember that word from the mid 1970's. Bear with me a moment...

      When I was learning to program in FORTRAN in my high school math class, our teacher (who didn't know how to program either) was trying to teach us by the age-old process of reading the book one chapter ahead of the class she was teaching. As a consequence, she was no better at it than the rest of us and we ended up debugging her code about as often as she helped with debugging
      • started creating its own gazornaplatting words

        Gazomplat. Wow! I remember that word from the mid 1970's. Bear with me a moment...

        If you needed any more proof that the slashdot font sucks, here you go.

        It's a sad day when


        is mistaken for


        Next thing you know, pom enthusiasts stray into the wrong conversation, and you can never go back from that.

        • by XorNand ( 517466 ) *
          Slashdot doesn't define the page fonts. Change your browser's default sans-serif to whatever you choose and you're golden. I personally use Swis721 BT as my sans-serif font and Bitstream Vera Sans Mono as the monospaced font. They work nicely.
    • That's ALMOST as bad as when people use "of" instead of "have". Ie: "I would of been rich by now if i hadn't..."
    • by CptNerd ( 455084 )
      I really whimmle them glogsnarp with that shnazpackle. Shivlepate the wonkpregark when it azgranks wooversmeeps!
    • Dude, something might be wrong with your fron.
    • by neminem ( 561346 )
      Might it then, possibly, go whiffling through a tulgey wood, burbling as it went? Might a hero need to be dispatched to cut off its head (vorpally, of course)? P.S. "bibblys" of disk space doesn't sound any sillier than "gibibytes" does...
  • This sounds like a great way to locate (and sue) walmartsuck.com type sites.

    Corporate censorship. Now Automated with "Zeitgeist".

    Think I'm a nut.
    Call me back in 5 years...
  • chance (Score:3, Funny)

    by Jon Luckey ( 7563 ) on Tuesday September 05, 2006 @11:00AM (#16044647)
    Sounds like an excellent chance to inject some new perfectly cromulent words into wide use.
  • by Hoplite3 ( 671379 ) on Tuesday September 05, 2006 @11:01AM (#16044659)
    Time for step two: deliver a mild electric shock to neologism users. Then I won't have to hear "blogosphere" ever again.
  • What is with people using the term ZeitGeist? Google uses it [google.com] for its end of the year search roundup. It is even used more heavily by others not associated on the internet.

    OK, why not the term DefMiner? Then get an old guy to be the site mascot? On second thought, never mind. Just don't be surprised when people get you confused with another product.
    • I was thinking the same thing. The word is PLAYED. It was "hip" in the beginning and gave the person who used it a certain linguistic gravitas but now people make up bullshit just to squeeze the word into their articles, blogs and evening meals. It's like the word plethora or pedantic.
    • by Omestes ( 471991 )
      I guess the estate of Hegel should sue, since he used it first. Actually I think it was a common German word in the 18th and 19th centuries, so I guess common language has prior art. It means spirit of the times, which is pretty applicable to both Google's use, and this use.
  • Santorum! (Score:5, Funny)

    by mr_stinky_britches ( 926212 ) on Tuesday September 05, 2006 @11:21AM (#16044812) Homepage Journal
    One of my personal favorites is the word Santorum [wikipedia.org].
  • If I see a new word in text, I hypothesize its meaning from its context rather than look up its meaning. However, dictionary lookup has recently become easier when reading online, with Google Define: available.

    This usually only works in languages I know fairly well. If there are two or three unknown terms in a paragraph I'll have less success in understanding them.
  • Hello? (Score:5, Interesting)

    by MarkusQ ( 450076 ) on Tuesday September 05, 2006 @11:29AM (#16044888) Journal
    Development of the tool is focusing on using it to understand what bloggers (using slang and neologisms) are saying about companies' products."

    You do not need a fancy program to do this. I can do it for you, without even reading the blogs in question.


    They are saying your products suck, and that your customer support is worthless.

    See how easy that was? Now, you might be wondering how I know this. Simple. They don't use made up words to say good things about you. I'm not sure why (maybe they aren't worried about being sued for saying good things?), but the pattern is very consistent. If somebody goes to the trouble of writing about you in their blog using made up words, they don't like you or the horse you rode in on.

    Likewise, if you are a journalist, they call you funny names (Steno Sue, Laura Dildo, Kneepads Miller, "Dollar a Word" Armstrong, etc.) because they've noticed that you consistently write to favour a certain party, position, politician, company, or lifestyle, even when this requires ignoring a pile of facts the size of Paraguay, any one of which would shred your position.

    And if you're a politician, it means that someone noticed that what you say in speeches is so unconnected to what you do with the office you hold that the only link between them is the way in which they combine to mollify your nominal constituents while maximizing the benefit to your corporate sponsors.

    If you are an industry association, they are saying they hate you, period, and that you are evil incarnate.

    See how easy this is? If you still don't get it, I am willing to come out of retirement as a consultant to explain it to you, provided the price is right.


  • I suspect database mining algorithms for Wikipedia Neologisms could also help refactor Wordnet's definitions to be more succinct and hence provide a better basis for modeling other natural language corpora.

    paq8hp3 [binet.com.ua] is the current Hutter Prize lead contender [hutter1.net] and has compressed the first 100M of Wikipedia to just over 17M. Wordnet's .exe file is just over 17M. One wonders what would happen if the "cream" of Wordnet's vocabulary were compressed using paq8hp3 and then incorporated into paq8hp3 to be a better

  • So, I tried WordNet and it didn't work! Natual! Indeed!
  • Shouldn't they also crawl through something like the urban dictionary which will have ten times more slang definitions?

  • Zeitgeist today decided that it was a perfectly cromulant product.
  • May I offer my heartiest contrafibularities!

    I am leaving now, but I shall return interfrastically.

    (5 points to whoever places the origin of this bastardized quote)
