Knowledge Hacker - Slashdot User

Comment Re:the real problems lie in understanding... (Score 3, Informative) 193

by Knowledge Hacker on Thursday July 31, 2003 @04:02PM (#6582650) Attached to: More on Statistical Language Translation

I spent a decade working in the field of knowledge-based machine translation (KBMT), in the Center for Machine Translation (now part of the Language Technologies Institute) at Carnegie Mellon. Prior to that, I worked on several natural language processing projects focused on knowledge-based automatic analysis of English text.

KBMT can be done. We demonstrated that pretty definitively. It's labor-intensive. Yes, we DID create concept maps (ontologies) for the domains of human endeavor relating to the texts to be translated, and yes, we DID link words (lexical units) to those concepts, in multiple languages. And it turned out that we didn't have to make the ontologies very deep--we had to make them broad, and start by assuming the need for a one-to-one mapping of concepts to nouns, verbs, and modifiers in the domain. We arrived at this after several attempts at making them deep. It turned out the lexicon is the real key. You only have to use enough ontological structure to support the way you analyze function words (relative pronouns, conjunctions, prepositions, certain adverbs, etc.) and to capture deep generalizations for certain classes of verbs and verb-derived nouns (deverbal nouns). The system uses a fast (real-time) English analysis parser based on the Tomita algorithm, and a rule-based target-language generator based on the KANT system.
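To make the "broad, not deep" idea concrete, here is a minimal Python sketch of a flat ontology with one-to-one links between concepts and lexical units in several languages. Everything in it is a toy assumption for illustration: the concept names, the word lists, and the helper functions (`word_to_concept`, `concept_to_word`, `translate_term`) are hypothetical and are not taken from KBMT-89 or KANT.

```python
# Toy illustration only: concept names, words, and structure are hypothetical,
# not taken from KBMT-89 or KANT.

# A broad, shallow ontology: each concept carries only one coarse category.
ONTOLOGY = {
    "ENGINE": "OBJECT",
    "PUMP": "OBJECT",
    "REPLACE": "EVENT",
}

# One lexicon per language; each lexical unit maps to exactly one concept.
LEXICONS = {
    "en": {"engine": "ENGINE", "pump": "PUMP", "replace": "REPLACE"},
    "fr": {"moteur": "ENGINE", "pompe": "PUMP", "remplacer": "REPLACE"},
    "es": {"motor": "ENGINE", "bomba": "PUMP", "reemplazar": "REPLACE"},
    "de": {"Motor": "ENGINE", "Pumpe": "PUMP", "ersetzen": "REPLACE"},
}


def word_to_concept(word, lang):
    """Look up the single concept a word is linked to in one language."""
    return LEXICONS[lang][word]


def concept_to_word(concept, lang):
    """Find the word linked to a concept in the target language."""
    for word, linked_concept in LEXICONS[lang].items():
        if linked_concept == concept:
            return word
    raise KeyError(f"no {lang} word for concept {concept}")


def translate_term(word, source, target):
    """Translate one term by going through the shared concept."""
    return concept_to_word(word_to_concept(word, source), target)


print(translate_term("pump", "en", "de"))
```

The point of the sketch is only that the concept layer is the pivot: adding a language means adding a lexicon, not deepening the ontology. A real system also needs the parser and generator to handle function words and sentence structure, which this leaves out entirely.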

We created a custom-built MT system for Caterpillar, to perform automated translation of their operations and maintenance manuals from the English of Peoria, Illinois into French, Spanish and German. It took us six years (not counting all the projects that preceded it, from which we learned a great deal). The system employs a controlled subset of English that forces Caterpillar's technical writers to favor certain constructions in their writing, and to disambiguate certain other constructions using a writer's workbench interface. (Caterpillar has a patent on this application of MT technology for technical documents.) It contains all the vocabulary that Caterpillar needs--hundreds of thousands of terms. Caterpillar updates the lexicons as needed.
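For a rough feel of what a controlled-language check in a writer's workbench might do, here is a small Python sketch that flags words outside an approved vocabulary and sentences over a length limit. This is an assumption-laden illustration, not Caterpillar's or KANT's actual rules: the word list, the `MAX_WORDS` limit, and the `check_sentence` function are all hypothetical.

```python
import re

# Hypothetical approved vocabulary; a real one would hold the full domain lexicon.
APPROVED = {"replace", "the", "oil", "filter", "engine", "stop", "before", "you", "and"}

# Hypothetical length limit, chosen only for this example.
MAX_WORDS = 20


def check_sentence(sentence):
    """Return a list of problems the writer should fix (empty if none)."""
    words = re.findall(r"[a-z]+", sentence.lower())
    problems = []

    # Flag any word that is not in the approved vocabulary.
    unknown = sorted(set(words) - APPROVED)
    if unknown:
        problems.append("unapproved words: " + ", ".join(unknown))

    # Flag sentences longer than the limit.
    if len(words) > MAX_WORDS:
        problems.append(f"too long: {len(words)} words (limit {MAX_WORDS})")

    return problems


print(check_sentence("Replace the oil filter."))
print(check_sentence("Swap the oil filter."))
```

A checker like this only covers vocabulary and length. Favoring or disambiguating particular grammatical constructions, as described above, would need the parser in the loop.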

This system has been in production use at Caterpillar since 1996. It translates controlled English text at accuracies in the high 90s (percent). The tech writers adapted, the translators got turned into post-processors (and I believe there was some turnover of personnel--the work had to have gotten a lot more boring), and the English reads a little bit stilted but it's perfectly clear. Response from Caterpillar's customers was positive: the manuals get translated faster, and they are accurate. The controlled English can actually force a little higher accuracy. Caterpillar's investment in this technology ended up saving them a bundle.

Due to the proprietary context of our work for Caterpillar, there were very few academic publications that came out of the project.

If you want to do further reading, search for KBMT-89 (an MT project funded by IBM-Japan that laid much of the foundation for workable real-time KBMT). We published a book on it.

You can read about the KANT technology at

http://www.lti.cs.cmu.edu/Research/Kant/

There are also a number of pointers to other knowledge-based projects on the lti.cs.cmu.edu site.

For looking at the progress of KBMT in the U.S. generally (over the past couple of decades), search for publications by Jaime Carbonell, Sergei Nirenburg, Eric Nyberg, Masaru Tomita, Teruko Mitamura, Robert Frederking, Lori Levin, Kathy Baker, Ralf Brown, and a cast of dozens. Warning--this will bring you vast amounts of material.
