More on Statistical Language Translation 193
DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than compiling an extensive list of words and their literal translations via bilingual human programmers, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams' - e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
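The 'hombre alto' / 'hombre grande' example in the summary can actually be run: the classic way to learn such word correspondences from sentence pairs is expectation-maximization in the style of IBM Model 1. Here is a minimal sketch on exactly that two-sentence corpus; the corpus, variable names, and iteration count are invented for illustration, not taken from the systems in the article.

```python
# Toy IBM-Model-1-style EM word alignment on the two sentence pairs
# from the summary. After a few iterations the shared word "hombre"
# aligns to the shared word "man", and the leftovers pair up.
from collections import defaultdict

corpus = [
    ("hombre alto".split(), "tall man".split()),
    ("hombre grande".split(), "big man".split()),
]

english_vocab = {e for _, es in corpus for e in es}
# Initialize translation probabilities t(e|f) uniformly.
t = defaultdict(lambda: 1.0 / len(english_vocab))

for _ in range(20):  # EM iterations
    count = defaultdict(float)   # expected count of (e, f) alignments
    total = defaultdict(float)   # expected count of f
    for fs, es in corpus:
        for e in es:
            norm = sum(t[(e, f)] for f in fs)
            for f in fs:
                frac = t[(e, f)] / norm   # E-step: fractional alignment
                count[(e, f)] += frac
                total[f] += frac
    for (e, f), c in count.items():
        t[(e, f)] = c / total[f]          # M-step: renormalize

print(max(english_vocab, key=lambda e: t[(e, "hombre")]))  # -> man
print(max(english_vocab, key=lambda e: t[(e, "alto")]))    # -> tall
print(max(english_vocab, key=lambda e: t[(e, "grande")]))  # -> big
```

Two sentences are enough here precisely because "hombre" co-occurs with "man" twice while every other pairing occurs only once - the same counting argument the summary makes.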
Translator (Score:3, Informative)
Can anyone try this on the new (or some other recent) algorithm?
BTW here's Doc Och's most recent website:
Franz Josef Och [isi.edu]
--
Esteem isn't a zero sum game
Re:IBM research 10 years ago (Score:5, Informative)
http://www-2.cs.cmu.edu/~aberger/mt.html
Re:Older languages not supported? (Score:2, Informative)
wow (Score:2, Informative)
Arabic Grammar Nazi (Score:5, Informative)
Not to be overly anal (but hopefully to raise an important point): "rajl kabir" actually means "old man," not "big man." The Arabs will definitely laugh at you if you mix these up. You'd use the word "tawil" for a tall or generally large man. The word "sameen" refers to a fat or husky guy. In a different context (referring to an inanimate object), "kabir" does in fact mean big.
I wonder how good these statistical systems really are at learning the various grammatical nuances of a language like Arabic. For example, in Arabic, non-human plurals behave like feminine singulars, whereas human plurals behave like plurals.
It's really incredibly cool that these machines can learn language mechanics and definitions on their own. But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.
For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.
Re:Older languages not supported? (Score:2, Informative)
Japanese doesn't use vocal inflection to carry meaning at all. You can speak Japanese without any inflection; you would just sound like a robot.
Sometimes inflection makes it easier to tell apart two words that sound similar, but even without any inflection they are still written, or even spoken, differently.
Re:Older languages not supported? (Score:2, Informative)
Re:Limited value? (Score:5, Informative)
Strictly separating raw dictionary work from grammar seems rather old-fashioned to me. Of course, it can work to some degree, but there are so many different types of collocational preferences that assigning each lexeme a 'grammatical category' from a relatively small list, and basing the grammar on those categories, is hardly enough.
It is true that automatic systems' lack of world knowledge is a big problem, but the examples you provide aren't really a good demonstration of this fact. As you write, 'have' is translated differently into some languages depending on whether the object is abstract. So, given a translation system that recognizes the verb and its object and a bilingual parallel corpus, a statistical system can find out about that.
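The point about 'have' can be made concrete. Given verb-object pairs extracted from a hypothetical aligned corpus, simple counting recovers which translation goes with which kind of object. The triples below are invented; "tener" and "pasar" are just stand-ins for two different target-language renderings of "have".

```python
# Learn how "have" translates, conditioned on its object, from
# invented (verb, object, foreign verb) triples that a parser plus
# a parallel corpus could in principle supply.
from collections import Counter, defaultdict

aligned = [
    ("have", "car", "tener"), ("have", "house", "tener"),
    ("have", "idea", "tener"), ("have", "fun", "pasar"),
    ("have", "time", "pasar"), ("have", "dog", "tener"),
]

by_object = defaultdict(Counter)
for verb, obj, foreign in aligned:
    by_object[obj][foreign] += 1

def translate_have(obj):
    """Pick the translation most often seen with this object."""
    return by_object[obj].most_common(1)[0][0]

print(translate_have("car"))  # -> tener
print(translate_have("fun"))  # -> pasar
```

No abstract/concrete feature was ever declared; the distinction falls out of the co-occurrence counts, which is exactly the parent's argument.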
I have heard of people who write dictionaries meant for automatic processing; for every lexeme they need between half an hour and an hour (consulting dictionaries and corpora, and checking whether applying the rules yields correct sentences). This can only work if the MT system either targets a very limited domain (e.g. weather forecasts, for which there are working rule-based translation systems) or accepts very low quality. It could never be affordable to have trained people provide all the relevant characteristics for the millions of words that a good wide-coverage MT system would need.
Differentiating between concrete and abstract entities is something that seems quite natural to us, but there are many other relevant characteristics of lexical items that don't come to linguists' minds so easily; statistical analyses can be better at discovering them.
A paper on this (Score:4, Informative)
You can find the paper here (PDF) [metlin.org] and the presentation here [metlin.org].
EGYPT translation toolkit is GPL'ed. (Score:3, Informative)
I can imagine some distributions of this translation system that take this code - with improvements - and precook large corpora to create translators. Anyone want to write the Mozilla and OpenOffice plug-ins for the new menu item "Edit/Translate Language"?
Re:the real problems lie in understanding... (Score:3, Informative)
You start with a few words that occur with each sense; with these you can now disambiguate a few example occurrences in the text. Each of these occurrences has words around it - add those to your list of sense indicators. Then repeat the whole process again and again.
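That bootstrapping loop (in the spirit of Yarowsky-style sense disambiguation) fits in a few lines. The ambiguous word, its contexts, and the seed indicators below are all invented for illustration.

```python
# Bootstrap sense indicators for an ambiguous word ("bank") from one
# seed word per sense, as described above: label what you can, absorb
# the surrounding words as new indicators, and repeat until stable.

# Context words around each occurrence of "bank".
occurrences = [
    {"river", "water", "fishing"},
    {"money", "loan", "interest"},
    {"water", "shore", "mud"},
    {"loan", "deposit", "teller"},
    {"shore", "fishing", "boat"},
    {"deposit", "interest", "account"},
]

# One seed indicator per sense.
indicators = {"river_sense": {"river"}, "money_sense": {"money"}}

changed = True
while changed:
    changed = False
    for ctx in occurrences:
        # Which senses does this occurrence share a word with?
        matches = [s for s, words in indicators.items() if ctx & words]
        if len(matches) == 1:                 # only use unambiguous matches
            new = ctx - indicators[matches[0]]
            if new:                           # absorb new indicator words
                indicators[matches[0]] |= new
                changed = True

print(sorted(indicators["river_sense"]))
print(sorted(indicators["money_sense"]))
```

On this toy data the loop converges in two passes and the two indicator sets stay disjoint; on real text you would need frequency thresholds to keep noisy context words from leaking across senses.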
Re:Arabic Grammar Nazi (Score:3, Informative)
For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.
I think that one of the major points of the statistical technique is to deal with precisely this sort of thing.
It doesn't have to know the "meaning" of words like "box" or "man," it just has to have seen them in a particular context before. If it has, then it knows that "big" is usually translated one way when it appears with "man," and another way when it appears with "box." So it just follows those observed patterns, without knowing anything about "meaning."
In principle (the article hints at this), it might even be possible in the future to make good guesses about combinations that it hasn't seen before, by inferring rules about how things are put together. For example, it encounters the phrase "strong box," which it hasn't seen before. But it has seen many words that are frequently associated with inanimate-strong and also frequently associated with inanimate-big, and many words that are frequently associated with animate-strong and also frequently associated with animate-big, but few words that are frequently associated with inanimate-big and animate-strong. So it infers that inanimate-strong is somehow "parallel" to inanimate-big. Since it also finds that "box" is more typically associated with inanimate-big than with animate-big, it infers that the version of "strong" that goes with "box" is the one that is parallel to inanimate-big. Therefore, it selects inanimate-strong. Ta da!
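That inference can be sketched directly: record which nouns have been seen with each translation variant, then pick the variant of the new adjective whose noun set overlaps most with the variant already known for the noun. The nouns and the *_INAN / *_ANIM variant labels are invented for illustration.

```python
# Guess the right variant for the unseen pair "strong box" by the
# overlap ("parallelism") argument above. Each key is one target-
# language rendering of an adjective; its value is the set of nouns
# it has been observed with.
seen = {
    "BIG_INAN":    {"box", "table", "house"},
    "BIG_ANIM":    {"man", "dog"},
    "STRONG_INAN": {"table", "house", "rope"},
    "STRONG_ANIM": {"man", "dog", "horse"},
}

def choose(adjective, noun):
    # Variants of other adjectives this noun is already known with.
    known = [v for v, nouns in seen.items() if noun in nouns]
    # Among this adjective's variants, pick the one whose noun set
    # overlaps most with a known variant's noun set.
    candidates = [v for v in seen if v.startswith(adjective.upper())]
    return max(candidates,
               key=lambda v: max(len(seen[v] & seen[k]) for k in known))

print(choose("strong", "box"))  # -> STRONG_INAN
```

"box" is known with BIG_INAN; STRONG_INAN shares {"table", "house"} with BIG_INAN while STRONG_ANIM shares nothing, so the inanimate variant wins - the "ta da" step, in six lines.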
Re:the real problems lie in understanding... (Score:3, Informative)
KBMT can be done. We demonstrated that pretty definitively. It's labor-intensive. Yes, we DID create concept maps (ontologies) for the domains of human endeavor relating to the texts to be translated, and yes, we DID link words (lexical units) to those concepts, in multiple languages. And it turned out that we didn't have to make the ontologies very deep--we had to make them broad, and start by assuming the need for a one-to-one mapping of concepts to nouns, verbs, and modifiers in the domain. This we arrived at after several attempts at making them deep. Turned out the lexicon is the real key. You only have to use enough ontological structure to support the way you deal with analyzing function words (relative pronouns, conjunctions, prepositions, certain adverbs, etc.) and to capture deep generalizations for certain classes of verbs and verb-derived nouns (deverbal nouns). The system uses a fast (real-time) English analysis parser based on the Tomita algorithm, and a rule-based target-language generator based on the KANT system.
We created a custom-built MT system for Caterpillar, to perform automated translation of their operations and maintenance manuals from the English of Peoria, Illinois into French, Spanish and German. It took us six years (not counting all the projects that preceded it, from which we learned a great deal). The system employs a controlled subset of English that forces Caterpillar's technical writers to favor certain constructions in their writing, and to disambiguate certain other constructions using a writer's workbench interface. (Caterpillar has a patent on this application of MT technology for technical documents.) It contains all the vocabulary that Caterpillar needs--hundreds of thousands of terms. Caterpillar updates the lexicons as needed.
This system has been in production use at Caterpillar since 1996. It translates controlled English text at accuracies in the high 90-percent range. The tech writers adapted, the translators got turned into post-processors (and I believe there was some turnover of personnel--the work had to have gotten a lot more boring), and the English reads a little stilted but is perfectly clear. Response from Caterpillar's customers was positive; the manuals get translated faster, and are accurate. The controlled English can actually force a little higher accuracy. Caterpillar's investment in this technology ended up saving them a bundle.
Due to the proprietary context of our work for Caterpillar, there were very few academic publications that came out of the project.
If you want to engage in further reading, search on KBMT-89 (an MT project funded by IBM-Japan that laid much of the foundation for workable real-time KBMT). We published a book on it.
You can read about the KANT technology at
http://www.lti.cs.cmu.edu/Research/Kant/
There are also a number of pointers to other knowledge-based projects on the lti.cs.cmu.edu site.
For looking at the progress of KBMT in the U.S. generally (over the past couple of decades), search for publications by Jaime Carbonell, Sergei Nirenburg, Eric Nyberg, Masaru Tomita, Teruko Mitamura, Robert Frederking, Lori Levin, Kathy Baker, Ralf Brown, and a cast of dozens. Warning--this will bring you vast amounts of material.