Re:fascinating (Score 1)
Probability models can handle non-exact or "fuzzy" matches just fine. If your corpus includes a phrase like "The shirt costs $20," then the probability that its translation is acceptable for "The blouse costs $30" is higher than that of "Freedom is on the march."
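To make that concrete, here's a rough sketch of the idea (the scoring function is a stand-in of my own; real systems use smoothed phrase and word translation probabilities rather than simple word coverage):

    # Toy sketch, not any real MT system's code: score how much of a new
    # sentence is covered by words already seen in a tiny "corpus".  Partial
    # overlap still earns a score, which is the whole point of "fuzzy".
    def word_coverage(sentence, corpus_sentences):
        seen = {w for s in corpus_sentences for w in s.lower().split()}
        words = sentence.lower().split()
        return sum(w in seen for w in words) / len(words)

    corpus = ["The shirt costs $20"]
    print(word_coverage("The blouse costs $30", corpus))     # 0.5 -- fuzzy match
    print(word_coverage("Freedom is on the march", corpus))  # 0.2 -- poor match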
Stochastic models can also deal with idioms for the same reason -- if your corpus includes equivalences between "it's raining cats and dogs" and "il tombe des cordes" then that idiom will be learned.
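As a toy illustration of how that falls out of the counts (the aligned pairs below are made up for the example):

    # Relative-frequency estimate of P(french | english) over whole phrase
    # pairs.  If the idiom keeps showing up aligned to "il tombe des cordes",
    # that pair gets the probability mass -- idiom or not.
    from collections import Counter

    pair_counts = Counter()
    aligned_examples = [
        ("it's raining cats and dogs", "il tombe des cordes"),
        ("it's raining cats and dogs", "il tombe des cordes"),
        ("it's raining", "il pleut"),
    ]
    for en, fr in aligned_examples:
        pair_counts[(en, fr)] += 1

    def p(fr, en):
        total = sum(c for (e, _), c in pair_counts.items() if e == en)
        return pair_counts[(en, fr)] / total

    print(p("il tombe des cordes", "it's raining cats and dogs"))  # 1.0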
Word boundaries are usually dealt with by using n-grams. Given some knowledge of a language's vocabulary, you pick a value of "n" that roughly matches a typical word length. You then identify "words" by sliding a window of size n across the corpus, picking up every sequence of n consecutive characters, including whitespace. This avoids the problem of having to devise a great word tokenizer. It's also essential for dealing with ideographic languages, or with languages like Thai in which word boundaries can only be identified with a dictionary.
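Here's what the sliding window amounts to in a few lines (n=3 is arbitrary here; in practice it's tuned to the language's typical word length):

    # Every run of n consecutive characters, whitespace included, becomes a
    # candidate "word" -- no hand-built tokenizer required.
    def char_ngrams(text, n=3):
        return [text[i:i+n] for i in range(len(text) - n + 1)]

    print(char_ngrams("the cat", 3))
    # ['the', 'he ', 'e c', ' ca', 'cat']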
Alternate character systems are tough. Japanese has three writing systems and it's possible to say "the same" thing in more than one of them. You want a really big corpus -- one that provides coverage of all systems.
Punctuation rules can be dealt with stochastically. In fact they have to be -- translation "units" are typically sentences, so you need a model that knows how to find sentence boundaries in a given language. At first blush you might think it's just periods, question marks, and exclamation points. But sentences sometimes end with ellipses, periods show up in abbreviations (St., Ave.) and in names (Dr. John Q. Blankenship), and quotations can contain question marks and exclamation points that don't end the sentence.
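A rough sketch of why simple rules aren't enough (the abbreviation list is invented for the example; real systems instead train a classifier to make the call at each candidate boundary):

    import re

    ABBREVIATIONS = {"st.", "ave.", "dr.", "mr.", "q."}  # made-up, incomplete list

    def naive_split(text):
        # Split after ., ?, or ! followed by whitespace.
        return re.split(r"(?<=[.?!])\s+", text)

    def rule_split(text):
        # Same idea, but don't split after a known abbreviation.
        sentences, current = [], []
        for token in text.split():
            current.append(token)
            if token[-1] in ".?!" and token.lower() not in ABBREVIATIONS:
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    text = 'Dr. John Q. Blankenship lives on Main St. He said "Really?" and left.'
    print(naive_split(text))  # wrongly splits after "Dr." and "Q."
    print(rule_split(text))   # fixes those, but misses the real boundary after "St."

Neither version gets the example right, which is exactly why the boundary decision ends up as a learned classification problem.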
Sentence segmentation for translation is made even thornier when you consider that it's a many-to-many mapping. That is, one sentence in English might equal three in French. But in Chinese, you might map one sentence to that same English sentence *plus* the one before it.
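A much-simplified, length-based sketch of how such alignments can be found (in the spirit of Gale and Church's method, with only 1-1, 1-2, and 2-1 groupings allowed; the example sentences are mine):

    # Dynamic program that groups sentences so character lengths on the two
    # sides stay roughly proportional.  Only 1-1, 1-2 and 2-1 groupings are
    # supported here, so wildly mismatched sentence counts won't align.
    def align(src, tgt):
        INF = float("inf")
        n, m = len(src), len(tgt)
        cost = [[INF] * (m + 1) for _ in range(n + 1)]
        back = [[None] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0

        def mismatch(a, b):
            # Penalize groups whose total character lengths diverge.
            return abs(len(" ".join(a)) - len(" ".join(b)))

        for i in range(n + 1):
            for j in range(m + 1):
                if cost[i][j] == INF:
                    continue
                for di, dj in ((1, 1), (1, 2), (2, 1)):
                    if i + di <= n and j + dj <= m:
                        c = cost[i][j] + mismatch(src[i:i+di], tgt[j:j+dj])
                        if c < cost[i+di][j+dj]:
                            cost[i+di][j+dj] = c
                            back[i+di][j+dj] = (i, j)

        # Walk back from the end, collecting (source group, target group) pairs.
        pairs, i, j = [], n, m
        while (i, j) != (0, 0):
            pi, pj = back[i][j]
            pairs.append((src[pi:i], tgt[pj:j]))
            i, j = pi, pj
        return list(reversed(pairs))

    english = ["I went to the market yesterday and bought bread, cheese and wine.",
               "It was a lovely morning."]
    french = ["Je suis allé au marché hier.",
              "J'ai acheté du pain, du fromage et du vin.",
              "C'était une belle matinée."]
    for en_group, fr_group in align(english, french):
        print(en_group, "<->", fr_group)  # pairs the first English sentence with two French ones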
The problem with using a corpus like UN translator transcripts is that while it's fine for dealing with (reasonably) civil discourse among educated elites, it's less good for learning the argot of Saudi-born terrorists who've been hiding in an Afghan cave for four-plus years. Just as there aren't likely to be many Texas-isms or Yorkshire-isms (or whatever dialect you like) in UN transcripts, there sure isn't going to be much coverage of regional Arabic variations. And even if there were, humans will simply change the rules of their spoken language, leaving the MT trainers with no good corpus to feed their models. NSA's onto you because you used the word "bomb"? Start calling it something else.
There is an open source implementation of the "maximum entropy" approach to statistical natural language processing that is used for systems of this kind. If you're curious, it (and the scientific work on which it is based) would be a good place to start: http://opennlp.sourceforge.net/.
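For the flavor of what "maximum entropy" means here (this is just the general form of the model, not OpenNLP's actual API, and the features and weights are invented for illustration):

    # P(outcome | context) is a normalized exponential of weighted binary
    # features; in a real system the weights are learned from training data.
    import math

    def maxent_prob(features, weights, outcomes):
        scores = {y: math.exp(sum(weights.get((f, y), 0.0) for f in features))
                  for y in outcomes}
        z = sum(scores.values())
        return {y: s / z for y, s in scores.items()}

    # E.g. deciding whether a period ends a sentence, given contextual features.
    weights = {
        ("prev=Dr", "no-boundary"): 2.0,
        ("next=capitalized", "boundary"): 1.5,
    }
    print(maxent_prob({"prev=Dr", "next=capitalized"}, weights,
                      ["boundary", "no-boundary"]))
    # {'boundary': 0.38..., 'no-boundary': 0.62...}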