More on Statistical Language Translation

DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than compile an extensive list of words and their literal translations via bilingual human programmers, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams', e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
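The deduction in the summary can be sketched in a few lines. This is only a toy illustration, assuming whitespace-separated words and phrase pairs that share exactly one word on each side; the function name `deduce_word_pairs` is made up for this example, and real systems estimate alignments statistically over large corpora rather than by exact set arithmetic.

```python
from itertools import combinations


def deduce_word_pairs(phrase_pairs):
    """Infer word translations from phrase pairs sharing one word per side.

    Hypothetical helper for illustration only.
    """
    lexicon = {}
    for (src_a, tgt_a), (src_b, tgt_b) in combinations(phrase_pairs, 2):
        sa, sb = set(src_a.split()), set(src_b.split())
        ta, tb = set(tgt_a.split()), set(tgt_b.split())
        shared_src, shared_tgt = sa & sb, ta & tb
        if len(shared_src) == 1 and len(shared_tgt) == 1:
            # The single word common to both phrases pairs up on each side.
            lexicon[next(iter(shared_src))] = next(iter(shared_tgt))
            # The leftovers pair up too, as long as they are unambiguous.
            for s_rest, t_rest in ((sa - sb, ta - tb), (sb - sa, tb - ta)):
                if len(s_rest) == 1 and len(t_rest) == 1:
                    lexicon[next(iter(s_rest))] = next(iter(t_rest))
    return lexicon


pairs = [("hombre alto", "tall man"), ("hombre grande", "big man")]
lexicon = deduce_word_pairs(pairs)
print(lexicon)
```

Note that word order plays no role here: "alto" follows the noun while "tall" precedes it, and the set comparison does not care.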
  • by marcopo ( 646180 ) on Thursday July 31, 2003 @08:24AM (#6578788)
    The key improvement is not just to search for phrases that appear in the sample texts. If you have an idea for what a word means and what its grammatical role is then you can plug it into other sentences and greatly extend the set of phrases you can translate. Thus an important idea is to search for phrases that match grammatically with phrases you can translate.
    However, this requires a stage where the sample texts are used to extract grammatical information on the second language. Of course, it helps a lot if you are familiar with one of the two languages.
  • by shish ( 588640 ) on Thursday July 31, 2003 @08:27AM (#6578802) Homepage
    What happens when it hits a word with several meanings? For example the reply to a previous story "I got pissed and installed OSX"

    drunk?
    angry?
    urinated?
  • by Surak ( 18578 ) * <surakNO@SPAMmailblocks.com> on Thursday July 31, 2003 @08:31AM (#6578825) Homepage Journal
    I remember reading about IBM doing this research about 10 years ago. The biggest problems then were inadequate processing power and storage space. Those things have greatly improved in the last 10 years (thank the spirits of Moore). I think that's why you're starting to see all this cool research with speech recognition and AI that was being done in the 80s and 90s become more and more commonplace. This trend will likely continue, and all the cool research-only stuff you remember reading about in the 80s and 90s will just be common fixtures on today's PCs.

    Speaking of which -- speech recognition, AI, translation learning algorithms -- sounds like we have the seeds for the Universal Translator. :)

  • by kmak ( 692406 ) on Thursday July 31, 2003 @08:40AM (#6578879)
    I have one question though, while obviously, you can get a mapping of definitions, can you actually translate a full sentence into another full sentence?

    With exceptions in tons of languages, is this even feasible in the near future? Sure, we can understand a poorly translated sentence, but can it translate it so that we don't have to?
  • by panurge ( 573432 ) on Thursday July 31, 2003 @08:42AM (#6578885)
    Modern languages tend to have less inflected grammars than older languages. That is a benefit for statistical methods because individual words do not change significantly. But how would this work for Latin, Greek and other highly inflected languages? Anyone who knows "The Turn of the Screw" (Britten version) will remember:

    malo: I had rather be
    malo: in an apple tree
    malo: than a naughty boy
    malo: in adversity

    based on four very distinct meanings of malo, in which the word endings put the stem of the word in context, but unfortunately the same word endings are used for different things.

    Not that I'm trying to rubbish the work, because I actually think that statistical methods are close to the fuzzy way that we actually try and make out foreign languages. I just wonder what the limits are.

  • by beacher ( 82033 ) on Thursday July 31, 2003 @08:43AM (#6578894) Homepage
    The article's text has "Compare two simple phrases in Arabic: "rajl kabir'' and "rajl tawil.'' If a computer knows that the first phrase means "big man," and the second means "tall man," the machine can compare the two and deduce that rajl means "man," while kabir and tawil mean "big" and "tall," respectively". Are we going pro-homeland security and not tipping off the powers that be? Or did michael want to show his uber leet 1st quarter espanol skillz?

    Spanish is easy and led me to believe that the article had relatively little weight (it is lightweight and a topical PHB read anyway). I do a lot of data mining in text streams and have found it to be fairly easy work. Getting cursors to play in ideograms/unicode and reversing the data is something I haven't tried yet and the article barely covers it. When I saw that they were covering language sets that were extremely dissimilar to English, my interest in multi-language applications was piqued again. All of my databases are unicode and I want to learn more about having truly international systems that are automated and then hand tweaked to avoid the engrish.com [engrish.com] type mistakes. Any help here?
    -B
  • Missed the idea (Score:2, Interesting)

    by marcopo ( 646180 ) on Thursday July 31, 2003 @08:49AM (#6578923)
    Translation (computerized or not) is about picking the correct meaning from the context. If the word appears in the given text and in a similar context in the sample texts you could pick the correct meaning.

    As for inflected (read most) languages, learning to separate a word into its stem and inflections is the first step, even if you have a number of such possible break-ups.

  • by NathanE ( 3144 ) on Thursday July 31, 2003 @08:52AM (#6578939)
    You are pretty much right about that, although I do not see the need to actually maintain your table in RAM. Trigrams require a HUGE corpus of training material to get good results, and even then you come up with the need to fudge your data a bit when you come across an unknown trigram. I think it's called "add one rounding" or something like that (trying to remember from class).
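The fudge for unknown trigrams half-remembered above is most likely add-one (Laplace) smoothing: pretend every possible trigram was seen once more than it actually was, so nothing gets probability zero. A minimal sketch, assuming a tiny whitespace-tokenized corpus; the function names are invented for this example.

```python
from collections import Counter


def train_trigrams(tokens):
    """Count trigrams and their two-word contexts in a token list."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts = Counter()
    for (w1, w2, _w3), n in trigrams.items():
        contexts[(w1, w2)] += n
    return trigrams, contexts, set(tokens)


def add_one_prob(w1, w2, w3, trigrams, contexts, vocab):
    """P(w3 | w1, w2) with add-one smoothing over the vocabulary."""
    return (trigrams[(w1, w2, w3)] + 1) / (contexts[(w1, w2)] + len(vocab))


tokens = "the tall man saw the tall tree".split()
trigrams, contexts, vocab = train_trigrams(tokens)
seen = add_one_prob("the", "tall", "man", trigrams, contexts, vocab)
unseen = add_one_prob("the", "tall", "saw", trigrams, contexts, vocab)
print(seen, unseen)
```

Add-one is the simplest option and tends to give unseen events too much weight on real corpora, which is one reason the huge training set still matters.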

    Fascinating stuff for sure, but hardly new unless they have come up with some new development. I haven't read the article.
  • by Anonymous Coward on Thursday July 31, 2003 @08:58AM (#6578973)
    There are plenty of highly inflected modern languages, e.g. Russian and a few dozen other Slavic languages, and Japanese.

    Get this idea out of your head. There is no continuum of inflectedness upon which modern languages align to the uninflected.
  • by Anonymous Coward on Thursday July 31, 2003 @09:03AM (#6579003)
    If this is just statistics, and you can do anything in C, why not statistically relate C to machine code and look at Windows machine code to get a C source that is clean room? Or perhaps look at MSword input vs word document format?
  • by Alkonaut ( 604183 ) on Thursday July 31, 2003 @09:06AM (#6579013)
    Since the meaning of "pissed" is determined by the context (nationality for example), you would need more information than the sentence itself to make an educated guess. A little context is given by the "installed OSX", but probably not enough to decide between angry and drunk...

    Does anyone know if for example babel is context/locale sensitive in this sense:

    If I write "theatre" or some other word with British spelling, does it then understand that any other words with different meanings in en-US and en-GB English should use the meaning from en-GB? The test sentence "At the theatre getting pissed" won't work since no slang seems to work with babel.

  • by davids-world.com ( 551216 ) on Thursday July 31, 2003 @09:14AM (#6579060) Homepage
    Statistics work quite well not just for phrases or so-called collocations such as "high and low" (vs. *"high and small"). They can help figure out the meaning of a word (bank=credit institute vs. bank=place to rest in a park). You can even learn (automatically learn) this stuff from parallel corpora where you can get a sentence-by-sentence translation, and you figure out statistically which words or phrases belong together.
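One simple way to "figure out statistically" which words belong together in a sentence-aligned corpus is to count co-occurrences and score each word pair. The sketch below uses the Dice coefficient, which is an assumption made for this illustration and not something the comment specifies; the three-sentence corpus and the function names are likewise invented.

```python
from collections import Counter
from itertools import product


def dice_scores(parallel):
    """Score every (source, target) word pair by the Dice coefficient."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in parallel:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        # Every source word co-occurs with every target word in the pair.
        joint.update(product(s_words, t_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_translation(word, scores):
    """Return the target word with the highest score for a source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


corpus = [
    ("el hombre alto", "the tall man"),
    ("el hombre grande", "the big man"),
    ("la casa grande", "the big house"),
]
scores = dice_scores(corpus)
print(best_translation("hombre", scores), best_translation("grande", scores))
```

Frequent function words such as "the" co-occur with everything, which is why a normalized score is used here instead of raw counts.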

    But that's an old story. Even the translation of complete sentences is fairly feasible in terms of syntactic structure.

    Harder to translate are things like discourse markers ("then", "because") because they are highly ambiguous and you would have to understand the text in a way. I have tried to guess these discourse markers with a machine learning model in my thesis [reitter-it-media.de] about rhetorical analysis with support vector machines (shameful self-promotion), and I got around 62 percent accuracy. While that's probably better than or similar to competing approaches, it's still not good enough for a reliable translation.

    And that's just one example of the hurdles in the field. The need for understanding of the text has kept the field from succeeding commercially. Machine Translation these days is a good tool for translators, for example in Localization [csis.ul.ie].

  • by domovoi ( 657518 ) on Thursday July 31, 2003 @09:16AM (#6579076)

    There are a number of problems with the model here that point very clearly to the fact that it has the same shortcomings as other machine translation models.

    For example, so long as we're working with cognates or 1:1 equivalencies (tall, man, etc.) it's fine. If we go to words for which there is no 1:1 lexical item, what's it do then? Consider especially words that signify complex concepts that are culture-bound. There would be, by definition, no reason for language #2 to have such a concept, if the culture isn't similar. The other problem arises from statistical sampling. Lexical items that are used exceedingly rarely and have no 1:1 or cognate would be unlikely to make the reference database.

    Another similar problem arises with novel coinages and idioms. The example of "The spirit is willing..." is rightly cited. Consider the Russian saying, "Ни пуха, ни пера," which translates as "Neither down nor feathers" but doesn't mean anything of the sort.

    Real machine translation has been the golden fleece of computational linguistics for a long time. I'll believe it when I see it.

  • by gidds ( 56397 ) <slashdot.gidds@me@uk> on Thursday July 31, 2003 @09:26AM (#6579156) Homepage
    ...there's no ambiguity. Becoming angry is getting pissed off. I urinated is I pissed (no 'got'). So, here, your sentence could only refer to inebriation. (Though why that should be a prerequisite for installing such a cool system, I've no idea.)

    I always said you Yanks couldn't even use your own language properly... [fx: ducks]

  • by dhodell ( 689263 ) on Thursday July 31, 2003 @09:28AM (#6579182) Homepage
    I'm sure that everybody's familiar with the output and quality of the various translators available online. I myself have been very interested in creating such a utility, and then one based on statistical language analysis. In my time in Holland, I've enjoyed learning the Dutch language, and have found online utilities to be of little help when translating documents (though I do not require this much anymore, it would have been helpful in the beginning).

    Although these methods work better than literal word-for-word translation, they're still not going to be perfect without some sort of human intervention. Dutch, for instance, has a completely different sentence structure than does English. For instance, the sentence "The cow is going to jump over the moon." becomes "De koe gaat over de maan springen" or, literally, "The cow goes over the moon to jump".

    Don't laugh at this structure or perhaps any unobvious usefulness. I've had discussions with people regarding the grammatical structure of a language and the society around it. Indeed, a specific example I have comes from a TV show "Kop Spijkers", which is a show focused mainly on poking fun at political activity and news events. At times, they have people dressed as popular media and political figures and have comical debates.

    In one show, a person acting as Peter R. de Vries (roughly the Dutch equivalent of William Shatner on America's Most Wanted) stated the following joke (JS stands for Jack Spijkerman, the host of the program):
    PRdV: ...Maar ja, ik ben de niet roker van het jaar. JS: Hoezo? PRdV: Nou, ik rook 2 pakjes per dag... niet.

    Translated into English, we would not find the humor in this transaction:
    PRdV: ...Anyway, I'm the non smoker of the year. JS: How do you figure that? PRdV: Well, I ... don't ... smoke 2 packs per day.

    Sure you can crack a smile about it, but it's much funnier when the punchline comes at a climax. And in English, it is not possible to state "Well, I smoke 2 packs per day... NOT" (without sounding like a retard who's watched too much Wayne's World).

    Getting back on topic, I believe there will be major issues with any translation algorithm to come. This is, of course, to be expected; I hope, however, that more advances will soon follow.
  • by Frantactical Fruke ( 226841 ) <renekita@@@dlc...fi> on Thursday July 31, 2003 @09:36AM (#6579251) Homepage
    On the other hand, having just finished translating a letter from Finnish to German, I fear that in light of the fact that, unlike most other cultures, Germans consider unspeakably long, intertwined sentences with multiple asides quoting their dead grandmothers who used to go on and on like this all day and the mandatory Goethe or Immanuel Kant quote concerning the importance of staying on topic, of which this run-on piece of drivel gives you but a faint impression, rather stylish and intelligent, we might have to wait a while yet.

    Would a program know how to break up a monster like that?

    Or, seriously, I ended up rewriting most of the letter to convey its contents in a tone that hopefully won't insult the recipient because of differing cultural expectations.

    Finns often consider politeness a waste of time. Now explain that to a statistical translator program: "Leave out/add in some polite blablablah"?
  • by Orne ( 144925 ) on Thursday July 31, 2003 @09:44AM (#6579299) Homepage
    Or bank = shoreline, as in river bank
    or bank = hardware bus, as in a bank of memory
    or banking = betting, as in I'm banking on that... :)

    These statistical language solutions are interesting, in that they can analyze sentence structures and deduce the grammar of a language; however, I would think that they fail on generating the actual definitions of words. You almost need to generate a list of "concepts", then link each concept to a word, by language. Not my field, thank goodness; I wouldn't have the patience for it.
  • by godot42a ( 574354 ) <s,pado&ed,ac,uk> on Thursday July 31, 2003 @09:44AM (#6579301)
    Short and simplified version: Look out for different typically co-occurring words and cluster them. For "pissed", you'll find Cluster 1: {pissed, toilet} Cluster 2: {pissed, booze, get} and probably some more These clusters correspond to different meanings of the word. Then determine which of these clusters fits the current usage.
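The last step, deciding which cluster fits the current usage, can be sketched as a simple overlap count. The clusters below are hand-written stand-ins for what the clustering stage would produce (that stage is not shown), and `pick_sense` is a name invented for this illustration.

```python
def pick_sense(context_words, sense_clusters):
    """Pick the sense whose co-occurrence cluster overlaps the context most."""
    context = {w.lower() for w in context_words}
    overlaps = {
        sense: len(context & words) for sense, words in sense_clusters.items()
    }
    best = max(overlaps, key=overlaps.get)
    # With no overlapping words at all there is no evidence for any sense.
    return best if overlaps[best] > 0 else None


# Hypothetical clusters; a real system would learn these from a corpus.
clusters = {
    "urinated": {"toilet", "bathroom"},
    "drunk": {"booze", "get", "got", "pub", "beer"},
    "angry": {"off", "at", "furious"},
}
sense = pick_sense("I got pissed at the pub after too much beer".split(), clusters)
print(sense)
```

A raw count is the crudest possible fit measure; weighting words by how strongly they are tied to each cluster would be the obvious refinement.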
  • by Anonymous Coward on Thursday July 31, 2003 @10:09AM (#6579538)
    This idea is like the behaviorist idea that a baby is a blank slate and he just learns the language by association like Pavlov's dog. Something similar has been tried with neural networks etc.

    However, this method does not work, as the silly examples elsewhere in the discussion show. You can only understand or translate if you "know" what is meant.

    There is no way of figuring it out. There isn't enough information supplied in the texts themselves. You have to be born with the inherent ability to understand stuff.

    You'll find a good discussion of this in Steven Pinker's "The Language Instinct", which I recommend.
  • by YU Nicks NE Way ( 129084 ) on Thursday July 31, 2003 @10:14AM (#6579590)
    When I read this, I'm reminded of the SPHINX project at CMU in the mid 80's. Kai-Fu Lee was a doctoral student at CMU in computer science. His advisor set him to evaluating the performance of the (clearly inferior) statistical SR systems that IBM was touting. It was a throw-away project; his advisor just wanted some numbers to compare his rule-based system against. The linguists had clearly shown that the irregularities of human speech required deep knowledge of the phonology, syntax, and semantics of the language being spoken, but the project leader needed a benchmark to measure against.

    Lee's toy project, SPHINX, won the DARPA competition that year. The highest scoring rule-based system came in fifth. What the linguists "knew" was wrong.

    The example you gave is another example of the linguists not knowing as much about statistics as they think. The corpora used for statistical translation include examples of idiomatic usages. Idiomatic usage is highly stereotypical, so the Viterbi path through an N-gram analysis captures such highly linked phrases with high accuracy.
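To make the Viterbi remark concrete, here is a rough sketch assuming a bigram model for brevity: each position has candidate target words with made-up translation probabilities, and a made-up bigram table rewards word sequences seen together in training. All names and numbers are illustrative, not taken from any real system.

```python
import math


def viterbi(candidates, bigram_prob, floor=1e-6):
    """Return the most probable word sequence.

    candidates: one dict per position, mapping word -> translation probability.
    bigram_prob: dict mapping (previous_word, word) -> probability.
    floor: probability assumed for bigrams missing from the table.
    """
    # best[w] = (log score of the best path ending in w, that path)
    best = {w: (math.log(p), [w]) for w, p in candidates[0].items()}
    for options in candidates[1:]:
        new_best = {}
        for w, p in options.items():
            score, path = max(
                (s + math.log(bigram_prob.get((prev, w), floor)), prev_path)
                for prev, (s, prev_path) in best.items()
            )
            new_best[w] = (score + math.log(p), path + [w])
        best = new_best
    return max(best.values())[1]


# Hypothetical numbers: word by word "pail" is the likelier translation,
# but "the bucket" is the far more common sequence.
candidates = [
    {"kicked": 0.5, "struck": 0.5},
    {"the": 1.0},
    {"bucket": 0.4, "pail": 0.6},
]
bigrams = {
    ("kicked", "the"): 0.4,
    ("struck", "the"): 0.2,
    ("the", "bucket"): 0.05,
    ("the", "pail"): 0.01,
}
path = viterbi(candidates, bigrams)
print(path)
```

The stereotyped sequence wins over the locally preferred word, which is the effect described above; a longer N-gram would also tie "bucket" to "kicked" directly.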
  • Limited value? (Score:3, Interesting)

    by sjasja ( 694035 ) on Thursday July 31, 2003 @10:16AM (#6579612)
    Automatic dictionary generation for MT seems of limited value to me. You can purchase dictionaries easily enough, or get trained monkeys^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H linguistics students cheaply enough to do the work.

    Raw dictionary work is pretty much the least interesting, most mechanical part of an MT system.

    Grammar (source parsing, transformation and target generation) takes a lot more work and careful thinking.

    The more accurate you want your MT system to be, the more extra information you want to attach to your dictionary entries (the more the system knows about all the words, the more disambiguation using real-world knowledge it can do.) "I have a ball" vs "I have an idea" translate to some languages quite differently; you need to know that you don't (usually) physically hold "an idea" in your hand. The most common words ("is", "have") are often the worst in this respect.

    (I have worked coding an MT system.)

  • by tibbetts ( 7769 ) <jason@@@tibbetts...net> on Thursday July 31, 2003 @10:20AM (#6579651) Homepage Journal

    (Offtopic, but indulge me.)

    For anyone who doesn't know Latin, or for anyone who isn't familiar with inflected languages in general, here's a detailed morphological breakdown of this poem.

    malo: I had rather be

    First-person, present indicative active form of the irregular verb malle, "to prefer, wish". It takes an infinitive (most likely esse, "to be"), which is often, as here, dropped.

    malo: in an apple tree

    The locative form of malus, -i (feminine noun), "apple tree".

    malo: than a naughty boy

    Ablative of comparison (as dictated by malle) of the adjective malus, -a, -um, "bad, evil". This is the masculine (or neuter) form, hence the translation "boy".

    malo: in adversity

    Ablative of the neuter noun (really a substantive adjective) malum, -i "evil".

    In short, we have a verb, a noun, an adjective, and a homonymic noun.

    (Thanks to the original poster for the poem--I've never heard this one.)

  • unfortunately doomed (Score:5, Interesting)

    by aziraphale ( 96251 ) on Thursday July 31, 2003 @10:36AM (#6579843)
    Like most computerised translation efforts, this ignores the fact that translation always requires context. The sentence 'fruit flies like an orange' is a classic example in the English language of a sentence which can be interpreted in two different ways - sentences can easily be constructed which have completely different meanings in different contexts.

    'As a punishment, he was given a longer sentence'. Obviously, we're talking prison, right? Well, what if the preceding sentence was:
    'The teacher had grown weary of his poor attempts at translation'?

    A statistical system, even working with the entire phrase, won't be able to figure out which meaning of the word 'sentence' is intended there.

    how about:
    'The box was heavy. We had to put it down'
    'The dog was ill. We had to put it down'

    You need semantic understanding to be able to perform translation.

  • by Mawbid ( 3993 ) on Thursday July 31, 2003 @11:38AM (#6580419)
    The one you mentioned is often accompanied by two more, so I'll continue the tradition. These smell like urban legend, but who cares? :-)

    An engineer was confused when a translated spec included water goats. "Water goats"?! Hydraulic rams, actually.

    And perhaps most famous of all, "out of sight, out of mind" supposedly came back as "blind idiot".

    Language is a curious thing. I can't help thinking there's some deeper meaning to the fact that misapplication of it can so easily be funny to us.

  • by PhilHibbs ( 4537 ) <snarks@gmail.com> on Thursday July 31, 2003 @12:00PM (#6580640) Journal
    a) The lemonade got drunk.
    b) My friend got drunk.
    Grammatically speaking, what's the difference?
    Grammatically, there is none. However, a statistical translation system could cope with this. If it had two matched texts:

    "The liquid was pissed some time later" translated into Language X as "The liquid was urinated some time later"

    "John was pissed some time later" translated to Language X as "John was inebriated some time later"

    It would assimilate this into its linguistic map as something like:

    pissed = inebriated
    liquid pissed = liquid urinated
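A map like the one above can be read as a lookup that prefers a context-specific entry and falls back to the default. A minimal sketch, assuming lowercase whitespace-separated words; the `(context, word)` key layout and the function name are inventions for this example.

```python
def translate_word(word, sentence_words, lexicon):
    """Translate one word, preferring an entry tied to a context word.

    lexicon keys are (context_word, word); context_word None is the default.
    """
    for context in sentence_words:
        if (context, word) in lexicon:
            return lexicon[(context, word)]
    # No context-specific entry matched: use the default, else pass through.
    return lexicon.get((None, word), word)


lexicon = {
    (None, "pissed"): "inebriated",
    ("liquid", "pissed"): "urinated",
}

first = translate_word(
    "pissed", "the liquid was pissed some time later".split(), lexicon
)
second = translate_word(
    "pissed", "john was pissed some time later".split(), lexicon
)
print(first, second)
```

A statistical system would hold probabilities instead of hard entries, so several context words could each nudge the choice.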
  • by Flwyd ( 607088 ) on Thursday July 31, 2003 @04:12PM (#6582749) Homepage
    "If we can learn how to translate even Klingon into English, then most human languages are easy by comparison," [Dr. Knight] said.

    That's not really the case. Klingon was created through conscious effort and hasn't evolved many (any?) warts over time. Its structure is akin to well-understood human languages.

    Now take Turkish, which has concatenative grammar. Adjectives are applied by tacking suffixes on to the word, sometimes changing spelling of previous chunks. Thus, a 20-word English phrase may correspond to a single Turkish word and extremely long words may be reasonably assumed to be unique. Statistical techniques can work with Turkish, but it requires some work up front to extract tokens. \b\w+\b doesn't help much. German (and, I think, Greek) are like this to a lesser extent.
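To illustrate the token-extraction problem: a word-boundary regex such as `\b\w+\b` returns a whole Turkish word as one token, while even a crude suffix stripper exposes the pieces a statistical model would want to count. The suffix list below is a tiny hand-picked sample and the greedy stripping ignores vowel harmony and spelling changes, so treat it as a toy only.

```python
import re


def regex_tokens(text):
    """Plain word tokenization with a word-boundary regex."""
    return re.findall(r"\b\w+\b", text)


def split_suffixes(word, suffixes, min_stem=2):
    """Greedily strip known suffixes from the end of a word."""
    parts = []
    stripped = True
    while stripped:
        stripped = False
        # Try longer suffixes first so "iniz" is not cut short.
        for suf in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                parts.insert(0, suf)
                word = word[: -len(suf)]
                stripped = True
                break
    return [word] + parts


# Hypothetical, incomplete suffix sample: plural, "your" (plural), ablative.
SUFFIXES = ["ler", "lar", "iniz", "den", "dan"]

whole = regex_tokens("evlerinizden")
pieces = split_suffixes("evlerinizden", SUFFIXES)
print(whole, pieces)
```

Here "evlerinizden" ("from your houses") comes apart as ev + ler + iniz + den, four units that each recur across many other words.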

    Statistical approaches are often quite effective in language processing, much to the surprise and disheartening of linguists. They're far from perfect, but often the best thing so far.
  • by _randy_64 ( 457225 ) on Thursday July 31, 2003 @11:03PM (#6585135)
    The article says n-grams are "Phrases like these, called "N-grams" (with N representing the number of terms in a given phrase)". I've always used n-grams as character counts, using a sliding window over the text. For example, the 5-grams of the phrase "for example" would be

    [for e][or ex][r exa][ exam][examp] and so on.

    Using n-grams this way helps with things like mis-spellings. Mr. Metlin (parent of this) used the character definition in his paper. N-grams are widely used in Information Retrieval Research [umbc.edu].
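The sliding-window definition is short enough to write out. The sketch below also shows one way character n-grams help with misspellings, using Jaccard overlap of trigram sets as the similarity measure; that measure is an assumption chosen for this example.

```python
def char_ngrams(text, n):
    """All character n-grams of text, taken with a sliding window."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def overlap(a, b, n=3):
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return len(grams_a & grams_b) / len(grams_a | grams_b)


five_grams = char_ngrams("for example", 5)
typo_score = overlap("translation", "tranlsation")
other_score = overlap("translation", "statistics")
print(five_grams[:5], typo_score > other_score)
```

A transposed pair of letters only disturbs the few n-grams that span it, so the misspelled word still shares most of its n-grams with the correct one.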
