More on Statistical Language Translation

DrLudicrous writes "The NYTimes is running an article about how statistical language translation schemes have come of age. Rather than compile an extensive list of words and their literal translations via bilingual human programmers, statistical translation works by comparing texts in both English and another language and 'learning' the other language via statistical methods applied to units called 'N-grams' - e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big." See our previous story for more info.
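
For the curious, the "learning" step can be sketched in a few lines of Python over a toy parallel corpus. This is purely illustrative - made-up data and a deliberately naive co-occurrence count, not the actual system described in the article:

    # Toy illustration: align words by counting which English words co-occur
    # with which Spanish words across a (made-up) parallel corpus.
    from collections import defaultdict

    corpus = [
        ("hombre alto", "tall man"),
        ("hombre grande", "big man"),
        ("perro grande", "big dog"),
    ]

    counts = defaultdict(int)   # (spanish word, english word) -> co-occurrence count
    totals = defaultdict(int)   # spanish word -> number of sentence pairs it appears in

    for s_sent, e_sent in corpus:
        for s in s_sent.split():
            totals[s] += 1
            for e in e_sent.split():
                counts[(s, e)] += 1

    # For each Spanish word, guess the English word it co-occurs with most often.
    # (With only three pairs some guesses are ties; real systems use huge corpora
    # and EM-style re-estimation rather than raw counts.)
    for s in totals:
        best = max((e for (ss, e) in counts if ss == s), key=lambda e: counts[(s, e)])
        print(s, "->", best, counts[(s, best)], "of", totals[s])
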
  • Translator (Score:3, Informative)

    by Anonymous Coward on Thursday July 31, 2003 @08:29AM (#6578813)
    That's an example from a few years back of an attempt to translate "the spirit is willing but the flesh is weak" from English to Russian and back to English using a different translator.

    Can anyone try this on the new (or some other recent) algorithm?

    BTW here's Doc Och's most recent website:

    Franz Josef Och [isi.edu]

    --
    Esteem isn't a zero sum game
  • by Jugalator ( 259273 ) on Thursday July 31, 2003 @08:49AM (#6578925) Journal
    Yes, I see IBM's project was called the "Candide Project". Here's a link with some information about it, including a link to the paper describing the prototype system they built:

    http://www-2.cs.cmu.edu/~aberger/mt.html
  • by godot42a ( 574354 ) <(s.pado) (at) (ed.ac.uk)> on Thursday July 31, 2003 @09:53AM (#6579376)
    > Modern languages tend to have less inflected grammars than older languages.

    In general, that's not true. The development goes in both directions, depending on the language family. Proto-Indo-European started out with many cases, which is why there is a tendency towards fewer inflections and more particles. In languages with many particles, the development can be reversed. Cliticization is such a process. For example, in some dialects of German, personal pronouns become new verb endings: Laufen Sie! (run!) -> Laufen'S!
  • wow (Score:2, Informative)

    by Anonymous Coward on Thursday July 31, 2003 @09:57AM (#6579415)
    'N-grams'- e.g. if 'hombre alto' means tall man, and 'hombre grande' means big man, then hombre=man, alto=tall, and grande=big."
    Wow. You could not provide a more wrong description of what's going on here. I don't know where to start. The statistical methods are explicitly free of meaning. There's no symbol-grounding going on here. Thus the statistical method does not say that hombre = man and alto = tall. All it says is that often when "hombre" showed up in text A, "man" showed up in text B, regardless of whatever the symbols "hombre" or "man" mean. Further, an Ngram is a fixed-length string of symbols that's two or more in length: a bigram is two symbols, and a trigram is three symbols, etc. If a symbol were taken to mean a word, then "hombre" is not an Ngram. "hombre grande" is an Ngram. Anyway, if the statistics are based on Ngrams, then they're computing relationships between Ngrams, NOT between the pieces inside of them ("hombre", "grande", etc.).
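
    For reference, "extracting N-grams" is nothing more than sliding a fixed-size window over the tokens. A toy Python sketch (made-up sentence, not the article's system):

        # An n-gram is just a window of n surface tokens; no meaning is attached.
        def ngrams(tokens, n):
            return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

        sent = "el hombre grande come mucho".split()
        print(ngrams(sent, 2))  # bigrams:  ('el', 'hombre'), ('hombre', 'grande'), ...
        print(ngrams(sent, 3))  # trigrams: ('el', 'hombre', 'grande'), ...
        # A statistical MT system counts how often strings like these co-occur with
        # strings on the other side of a parallel corpus; it never asks what they mean.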
  • Arabic Grammar Nazi (Score:5, Informative)

    by nat5an ( 558057 ) on Thursday July 31, 2003 @10:43AM (#6579916) Homepage
    From the Article: Compare two simple phrases in Arabic: "rajl kabir" and "rajl tawil." If a computer knows that the first phrase means "big man," and the second means "tall man," the machine can compare the two and deduce that rajl means "man," while kabir and tawil mean "big" and "tall," respectively.

    Not to be overly anal (but hopefully to raise an important point): "rajl kabir" actually means "old man," not "big man." The Arabs will definitely laugh at you if you mix these up. You'd use the word "tawil" for a tall or generally large man. The word "sameen" refers to a fat or husky guy. In a different context (referring to an inanimate object), "kabir" does in fact mean big.

    I wonder how good these statistical systems really are at learning the various grammatical nuances of a language like Arabic. For example, in Arabic, non-human plurals behave like feminine singulars, whereas human plurals behave like plurals.

    It's really incredibly cool that these machines can learn language mechanics and definitions on their own. But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.

    For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.
  • by Xerithane ( 13482 ) <xerithane.nerdfarm@org> on Thursday July 31, 2003 @11:37AM (#6580410) Homepage Journal
    Japanese are highly inflected.

    Japanese doesn't use inflection for any meaning at all. You can speak Japanese without using any inflection; you would just sound like a robot.

    Sometimes inflection makes it easier to tell apart two words that sound similar, but the way they are written, or even spoken, differs even without any inflection.
  • by sesquipedalian_one ( 639698 ) on Thursday July 31, 2003 @11:49AM (#6580531)
    Clearly you've never looked at Turkish. Or any of the Bantu languages, which make the inflectional system of Latin or Greek look like child's play. But the differences between inflectional systems in two languages are really part of a broader issue, namely that translation doesn't occur on the basis of a token-for-token replacement. One word in the source language may correspond to several in the target language, and vice-versa. This is a problem in alignment, and any MT system must deal with it, but that's a fairly well understood problem. A system of this sort certainly would not just look at words as atomic units, but would have to look at parts of words (i.e., their morphology).
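
    To make the "well understood" part concrete, here is a rough Python sketch in the spirit of the classic IBM Model 1 EM algorithm for word alignment (heavily simplified: toy data, no NULL word, no morphology; real systems such as GIZA++ do far more):

        # EM re-estimation of word translation probabilities, Model 1 style.
        from collections import defaultdict

        corpus = [
            ("hombre alto".split(), "tall man".split()),
            ("hombre grande".split(), "big man".split()),
            ("casa grande".split(), "big house".split()),
        ]

        # Candidate pairs: any (english, foreign) pair seen in the same sentence pair.
        pairs = {(e, f) for f_sent, e_sent in corpus for e in e_sent for f in f_sent}
        t = {p: 1.0 for p in pairs}   # start uniform; EM sharpens these

        for _ in range(20):
            count = defaultdict(float)   # expected counts for (e, f)
            total = defaultdict(float)   # expected counts for f
            for f_sent, e_sent in corpus:
                for e in e_sent:
                    z = sum(t[(e, f)] for f in f_sent)   # normaliser for this e
                    for f in f_sent:
                        frac = t[(e, f)] / z             # fractional alignment count
                        count[(e, f)] += frac
                        total[f] += frac
            for (e, f) in pairs:                         # M-step: re-estimate t
                t[(e, f)] = count[(e, f)] / total[f]

        for f in ["hombre", "alto", "grande", "casa"]:
            best = max((e for (e, ff) in pairs if ff == f), key=lambda e: t[(e, f)])
            print(f, "->", best, round(t[(best, f)], 2))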
  • Re:Limited value? (Score:5, Informative)

    by Jadrano ( 641713 ) on Thursday July 31, 2003 @11:57AM (#6580609)
    Of course, you can buy dictionaries or get trained people to write them, but the amount of data needed for every lexical item would be so large that wide coverage would be very hard to achieve. For example, you have to note all collocations. Often, such preferences aren't clear-cut. For instance, 'essential' appears much more frequently in a predicative position (e.g. 'X is essential') than in an attributive one (e.g. 'the essential X'), while 'basic', which can have a very similar meaning in many contexts, appears much more often in an attributive position. Such information is necessary for good translation, but dictionaries usually don't provide it. Statistical analyses of lexical items reveal many things dictionaries don't tell you. Nowadays, a significant part of the work of trained people writing dictionaries is looking at corpora, and making this process automatic is a logical step.

    Strictly separating raw dictionary work and grammar seems rather old-fashioned to me. Of course, it can work to some degree, but there are so many different types of collocational preferences that just providing each lexeme with a 'grammatical category' from a relatively small list and basing the grammar on these grammatical categories is hardly enough.

    It is true that automatic systems' lack of world knowledge is a big problem, but the examples you provide aren't really a good demonstration of this fact. As you write, 'have' is translated differently into some languages depending on whether the object is abstract. So, given a translation system that recognizes the verb and its object and a bilingual parallel corpus, a statistical system can find out about that.

    I have heard of people who write dictionaries that can be used for automatic processing; for every lexeme they need between half an hour and an hour (consulting dictionaries and corpora, checking whether the application of rules gives correct sentences). This can only work if the aim of the MT system is either only a very limited domain (e.g. weather forecasts, for which there are working rule-based translation systems) or very low quality. It could never be affordable to have trained people provide all relevant characteristics for the millions of words that would be needed for a good MT system with wide coverage.

    Differentiating between concrete and abstract entities is something that seems quite natural to us, but there are many other relevant characteristics of lexical items that don't come to linguists' minds so easily; statistical analyses can be better at discovering them.
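
    As a toy illustration of the kind of collocational statistic described above, here is a naive Python sketch that counts attributive vs. predicative uses of an adjective with a crude pattern heuristic. The sentences are made up, and a real analysis would of course use a tagged or parsed corpus:

        import re

        corpus = [
            "water is essential",
            "regular exercise is essential",
            "the essential point is clarity",
            "the basic idea is simple",
            "he explained the basic facts",
            "we covered the basic vocabulary",
        ]

        def position_counts(adj, sentences):
            attributive = predicative = 0
            for s in sentences:
                # attributive: adjective directly followed by another word (crudely, a noun)
                attributive += len(re.findall(rf"\b{adj}\s+\w+", s))
                # predicative: a copula ("is"/"are") directly before the adjective
                predicative += len(re.findall(rf"\b(?:is|are)\s+{adj}\b", s))
            return attributive, predicative

        for adj in ["essential", "basic"]:
            a, p = position_counts(adj, corpus)
            print(adj, "- attributive:", a, "predicative:", p)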
  • A paper on this (Score:4, Informative)

    by metlin ( 258108 ) on Thursday July 31, 2003 @12:03PM (#6580671) Journal
    I wrote a paper on the application of the N-gram technique with statistical methods for use in CBR a long time ago.

    You can find the paper here (PDF) [metlin.org] and the presentation here [metlin.org]. ;-)
  • by dwheeler ( 321049 ) on Thursday July 31, 2003 @12:18PM (#6580812) Homepage Journal
    I was curious about this statistical translation toolkit, so I downloaded it from here: http://www.clsp.jhu.edu/ws99/projects/mt/toolkit/ [jhu.edu]. I then peeked into the LICENSE file, and found that it's released under the GPL. No funny weird one-off licenses, or requiring only non-commercial use, or such. So, if you're interested in statistical translation, download this system and try it out.

    I can imagine some distributions of this translation system that take this code - with improvements - and precook large corpuses to create translators. Anyone want to write the Mozilla and OpenOffice plug-ins for the new menu item "Edit/Translate Language"?

  • by ornil ( 33732 ) on Thursday July 31, 2003 @12:24PM (#6580858)
    This is one of the oldest basically solved problems in natural language processing: word-sense disambiguation. Simply look at the words around the ambiguous word: if you see "river", or "park", or "memory", or "money" - you know which sense to pick. That works amazingly well, and you can learn which words correspond to each sense by starting with only a few examples belonging to each sense and then bootstrapping.

    You start with a few words that occur with each sense; with those you can disambiguate a few example occurrences in the text. Each of these occurrences has words around it - add them to your list of sense indicators. Then do the whole thing again and again.
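
    A toy Python sketch of this bootstrapping loop, in the spirit of Yarowsky-style sense disambiguation (the contexts, senses and seed words are all made up):

        contexts = [
            "deposit money in the bank to earn interest",
            "the bank raised its interest rates again",
            "fishing from the bank of the river",
            "trees grow along the bank of the river",
            "the bank cut rates on every loan",
            "a muddy bank beside the slow river",
        ]

        indicators = {"finance": {"money"}, "geography": {"river"}}  # seed indicator words

        for _ in range(3):  # a few bootstrapping rounds
            for ctx in contexts:
                words = set(ctx.split())
                # score each sense by how many of its indicators appear in this context
                scores = {s: len(words & ind) for s, ind in indicators.items()}
                best = max(scores, key=scores.get)
                if scores[best] > 0 and list(scores.values()).count(scores[best]) == 1:
                    # confidently labelled: absorb its content words as new indicators
                    indicators[best] |= {w for w in words if len(w) > 3 and w != "bank"}

        for ctx in contexts:
            words = set(ctx.split())
            scores = {s: len(words & ind) for s, ind in indicators.items()}
            print(max(scores, key=scores.get), "<-", ctx)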
  • by capologist ( 310783 ) on Thursday July 31, 2003 @02:33PM (#6582037)
    But as previous posters have already noted, the machine still has to know the meanings of words in order to do a good translation.

    For example, to translate "big box" and "big man" into Arabic, you'd actually use different words for big, since the box is inanimate, but the man is animate.

    I think that one of the major points of the statistical technique is to deal with precisely this sort of thing.

    It doesn't have to know the "meaning" of words like "box" or "man," it just has to have seen them in a particular context before. If it has, then it knows that "big" is usually translated one way when it appears with "man," and another way when it appears with "box." So it just follows those observed patterns, without knowing anything about "meaning."

    In principle (the article hints at this), it might even be possible in the future to make good guesses about combinations that it hasn't seen before, by inferring rules about how things are put together. For example, it encounters the phrase "strong box," which it hasn't seen before. But it has seen many words that are frequently associated with inanimate-strong and also frequently associated with inanimate-big, and many words that are frequently associated with animate-strong and also frequently associated with animate-big, but few words that are frequently associated with inanimate-big and animate-strong. So it infers that inanimate-strong is somehow "parallel" to inanimate-big. Since it also finds that "box" is more typically associated with inanimate-big than with animate-big, it infers that the version of "strong" that goes with "box" is the one that is parallel to inanimate-big. Therefore, it selects inanimate-strong. Ta da!
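
    A toy Python sketch of that guessing-by-parallels idea. The target forms ("big_anim", "big_inan", etc.) are invented placeholders rather than real Arabic, and the inference is just a simple "copy what similar nouns do" rule:

        # Observed (adjective, noun) pairs and the target-language form that was chosen.
        observed = {
            ("big", "man"): "big_anim",   ("big", "dog"): "big_anim",
            ("big", "box"): "big_inan",   ("big", "table"): "big_inan",
            ("big", "rock"): "big_inan",
            ("strong", "man"): "strong_anim",  ("strong", "horse"): "strong_anim",
            ("strong", "rock"): "strong_inan", ("strong", "coffee"): "strong_inan",
        }

        def translate_adj(adj, noun):
            if (adj, noun) in observed:          # seen this exact combination: reuse it
                return observed[(adj, noun)]
            # Unseen combination: prefer the form used with nouns that "behave like"
            # this noun, i.e. nouns that picked the same forms for other adjectives.
            profile = {a: form for (a, n), form in observed.items() if n == noun}
            scores = {}
            for (a, n), form in observed.items():
                if a != adj or n == noun:
                    continue
                other = {aa: ff for (aa, nn), ff in observed.items() if nn == n}
                sim = sum(1 for aa, ff in profile.items() if other.get(aa) == ff)
                scores[form] = scores.get(form, 0) + sim
            return max(scores, key=scores.get) if scores else None

        print(translate_adj("strong", "box"))    # unseen pair; comes out as "strong_inan"
        print(translate_adj("big", "horse"))     # unseen pair; comes out as "big_anim"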
  • by Knowledge Hacker ( 694126 ) on Thursday July 31, 2003 @04:02PM (#6582650)
    I spent a decade working in the field of knowledge-based machine translation (KBMT), in the Center for Machine Translation (now part of the Language Technologies Institute) at Carnegie Mellon. Prior to that, I worked on several natural language processing projects that were focused on knowledge-based automatic analysis of English text.

    KBMT can be done. We demonstrated that pretty definitively. It's labor-intensive. Yes, we DID create concept maps (ontologies) for the domains of human endeavor relating to the texts to be translated, and yes, we DID link words (lexical units) to those concepts, in multiple languages. And it turned out that we didn't have to make the ontologies very deep--we had to make them broad, and start by assuming the need for a one-to-one mapping of concepts to nouns, verbs, and modifiers in the domain. This we arrived at after several attempts at making them deep. Turned out the lexicon is the real key. You only have to use enough ontological structure to support the way you deal with analyzing function words (relative pronouns, conjunctions, prepositions, certain adverbs, etc.) and to capture deep generalizations for certain classes of verbs and verb-derived nouns (deverbal nouns). The system uses a fast (real-time) English analysis parser based on the Tomita algorithm, and a rule-based target-language generator based on the KANT system.

    We created a custom-built MT system for Caterpillar, to perform automated translation of their operations and maintenance manuals from the English of Peoria, Illinois into French, Spanish and German. It took us six years (not counting all the projects that preceded it, from which we learned a great deal). The system employs a controlled subset of English that forces Caterpillar's technical writers to favor certain constructions in their writing, and to disambiguate certain other constructions using a writer's workbench interface. (Caterpillar has a patent on this application of MT technology for technical documents.) It contains all the vocabulary that Caterpillar needs--hundreds of thousands of terms. Caterpillar updates the lexicons as needed.

    This system has been in production use at Caterpillar since 1996. It translates controlled English text at accuracies in the high 90-percents. The tech writers adapted, the translators got turned into post-processors (and I believe there was some turnover of personnel--the work had to have gotten a lot more boring), and the English reads a little bit stilted but it's perfectly clear. Response from Caterpillar's customers was positive, the manuals get translated faster, and they are accurate. The controlled English can actually force a little higher accuracy. Caterpillar's investment in this technology ended up saving them a bundle.

    Due to the proprietary context of our work for Caterpillar, there were very few academic publications that came out of the project.

    If you want to engage in further reading, search on KBMT-89 (an MT project funded by IBM-Japan that laid much of the foundation for workable real-time KBMT). We published a book on it.

    You can read about the KANT technology at

    http://www.lti.cs.cmu.edu/Research/Kant/

    There are also a number of pointers to other knowledge-based projects on the lti.cs.cmu.edu site.

    For looking at the progress of KBMT in the U.S. generally (over the past couple of decades), search for publications by Jaime Carbonell, Sergei Nirenburg, Eric Nyberg, Masaru Tomita, Teruko Mitamura, Robert Frederking, Lori Levin, Kathy Baker, Ralf Brown, and a cast of dozens. Warning--this will bring you vast amounts of material.

"May your future be limited only by your dreams." -- Christa McAuliffe

Working...