Google Faces Plagiarism Questions Over Chinese Software
Posted by
Zonk
on Sun Apr 08, 2007 02:03 PM
from the i'll-just-take-a-look-and-yoink dept.
yaohua2000 writes "Google's laboratory in China has launched its first product, a Pinyin Input Method Editor. The software lets users type romanized Chinese on a QWERTY keyboard and converts it to Chinese characters. Users soon discovered that the data Google used for the product was unusually similar to the data used by a Chinese rival, Sogou. Google has evaded the question about software similarities, reports PC World. 'The similarities, which included an error involving the name of a celebrity, were noted on a Google Labs discussion board about its Pinyin IME. Users noted that entering the Pinyin pinggong into the Google IME incorrectly produced the name of Feng Gong, an actor and comedian.'"
Related Stories
Your Rights Online: Google Admits to Using Sohu Database 209 comments
prostoalex writes "A few days ago a Chinese company, Sohu.com, alleged that Google improperly tapped its database for its Pinyin IME product, stirring controversy over whether the two databases were similar just because of the normal research process. Today Google admitted that its new product for the Chinese market 'was built leveraging some non-Google database resources.' 'The dictionaries used with both software from Google and Sohu shared several common mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.'"
187 comments
Google Should Defend Themselves the OpenBSD Way (Score:5, Funny)
I'm a stupid American, so... (Score:5, Funny)
Input method (Score:5, Interesting)
(http://www.sympato.ch/)
Chinese is a complex language to write. It doesn't use an alphabet (like most western languages). It doesn't even use syllables (like, for example, two of the Japanese writing systems). It uses logographs: in an over-simplified way, we can say it uses 1 symbol for every different word/idea/etc.
This makes thousands of different symbols (according to Wikipedia, a little less than 50k variants in the Kangxi dictionary).
This ISN'T something you can put on a regular occidental 107-key keyboard.
Therefore you have several solutions [wikipedia.org]:
- Custom keyboards :
Use special keyboards where the couple of thousand most frequently used symbols are present.
Not very practical (symbols are harder to find compared to looking for a letter on a 107-key keyboard). Wikipedia has a picture.
- By shape of characters :
Either by handwriting recognition, or by decomposing characters (into their different strokes) and putting those on a regular keyboard layout.
- By sound of words :
Either by using something like Zhuyin [wikipedia.org], a system that was invented to help teach Chinese. It has 37 symbols, roughly one for each consonant or vowel sound in Chinese. As such, it can be used for other purposes, like putting it on a keyboard: the person types the sound and the software guesses the corresponding word/logogram.
Or, as an alternative method, by using Pinyin [wikipedia.org]: it uses Latin letters to write the sound. (And thus is interesting for computers, on which Latin keyboards are widespread.)
The mapping of sound to logographs isn't completely straightforward. For example, Chinese is a tonal language, but some systems don't require the writer to specify tones using marks. Some software work is required. And this software isn't infallible.
Google released such software. Users can phonetically type Chinese on any occidental keyboard using (tone-less) pinyin, and the software tries to convert it to actual Chinese characters.
This software produces the same correct results as another popular one. (Hopefully. If the Google software didn't give the correct results, there would be problems. It wouldn't be a functional pinyin input system.)
Sometimes the software hesitates and gives a choice of possibilities. Most of the time, they are the same as the competitor's (possibly explained by the fact that both programs have to process the same user input, using the same pronunciation system, which is ambiguous).
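To make the ambiguity described above concrete, here is a minimal sketch of a tone-less pinyin lookup that returns a ranked candidate list. It is not how Google's or Sogou's IME works; the `LEXICON` entries and their usage counts are made up for illustration, and a real IME would use a far larger dictionary plus context-aware prediction.

```python
from collections import defaultdict

# Hypothetical toy lexicon: (character, tone-less pinyin, made-up usage count).
LEXICON = [
    ("是", "shi", 900),
    ("事", "shi", 400),
    ("十", "shi", 300),
    ("市", "shi", 200),
    ("吗", "ma", 500),
    ("妈", "ma", 250),
    ("马", "ma", 150),
]

def build_index(lexicon):
    """Group characters by tone-less pinyin, most frequent first."""
    grouped = defaultdict(list)
    for char, pinyin, count in lexicon:
        grouped[pinyin].append((count, char))
    return {
        pinyin: [char for _, char in sorted(entries, reverse=True)]
        for pinyin, entries in grouped.items()
    }

def candidates(index, typed):
    """Return the ranked candidate list for what the user typed."""
    return index.get(typed.lower(), [])

INDEX = build_index(LEXICON)
print(candidates(INDEX, "shi"))
print(candidates(INDEX, "ma"))
```

Because the tones are dropped, one typed syllable maps to several characters, so two independent programs fed the same input will naturally offer overlapping lists. Overlap in correct candidates is therefore weak evidence of copying.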
But sometimes the Google software is plain wrong, and produces the same errors as the competitor. And THIS is suspicious, because maybe some part of the software uses pieces from the competitor (part of the algorithm? statistical data?).
The company is suing Google on the grounds that if both programs behave the same down to the bugs, maybe some part could have been illegally copied.
Meanwhile, adepts of Google Seppuku [wikipedia.org] rejoiced worldwide at a cheap and easy-to-find piece of software that could also be used to produce random Chinese characters to be subsequently imported into Google as Kanji.
Identical typos... (Score:5, Insightful)
Coming up with the same algorithm isn't terribly unlikely. Structuring it in the same way is not uncommon either. Making exactly the same mistakes, however, is hard to believe.
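The argument above can be sketched as a set comparison: entries that both dictionaries contain but that a trusted reference does not are the suspicious ones. This is only an illustration under stated assumptions. The three data sets are hypothetical, the only entry taken from the story is the reported `pinggong` mapping to Feng Gong, and names are kept as romanized strings for readability.

```python
def errors(entries, reference):
    """Entries in a dictionary that the reference does not consider valid."""
    return set(entries) - set(reference)

def shared_errors(dict_a, dict_b, reference):
    """Mistakes that appear identically in both dictionaries."""
    return errors(dict_a, reference) & errors(dict_b, reference)

# Hypothetical reference list of valid (pinyin, name) pairs.
REFERENCE = {
    ("fenggong", "Feng Gong"),
    ("beijing", "Beijing"),
    ("shanghai", "Shanghai"),
}

# Hypothetical product dictionaries. Both carry the wrong "pinggong" entry;
# only DICT_B has an unrelated typo of its own.
DICT_A = REFERENCE | {("pinggong", "Feng Gong")}
DICT_B = {
    ("fenggong", "Feng Gong"),
    ("beijing", "Beijing"),
    ("pinggong", "Feng Gong"),
    ("shanghia", "Shanghai"),
}

print(shared_errors(DICT_A, DICT_B, REFERENCE))
```

Correct entries drop out of the comparison entirely, which is the point: agreement on right answers is expected, while agreement on the same wrong answer needs an explanation, such as a common source.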
Re:Identical typos... (Score:5, Insightful)
(http://thebestpageintheuniverse.net/)
As such, this story is useless. The internet needs no more speculation as it is; it's hard enough arguing what is wrong or right when concrete evidence is available [slashdot.org]. Our flamewars should be founded on solid ground.
Re:Identical typos... (Score:5, Insightful)
(http://stuckinthecube.blogspot.com/)
I work in I18N and deal with IMEs all the time, from the basic, non-learning MS Windows versions to the ones which come with the NJ Star and give preference to lesser-used terms previously selected to various other proprietary variants. There are only so many ways to write an IME, and there are only so many ways to do good prediction. If I type "go" in Japanese, my first choice will usually be "5" followed by the symbol for "language" and the game "Go", then various other possibilities. Only when I next type a "z" or a "g" do the symbols for a.m. and p.m. move to the front. Now if I'd written an IME and wanted to protect it I might have it always bring up "Mifune Go" as the fifth selection or, more subtly, bring up "Go" as the fifth possibility if you typed a "G" or "Go" after "Mifune". This isn't the case here.
With Google's work and implementation of prediction methods, I find it hard to accuse the company of plagiarism for having the same bug (which comes as a result of predictive methods) as some other company. This is a bug, not some zyzzyx or easter egg which a programmer included to catch thieves. It was unintentional on Sogou's part and likely equally unintentional on Google's.
Then again, there's a lot of pressure to excel at Google and maybe someone gave in to temptation despite working for a company that knows more about data than anyone else out there. Unlikely, but possible... and if Google issue a statement that someone did indeed plagiarise Sohu's work, fine. It could happen anywhere. Doesn't make Google bad, only one programmer. It makes the company culpable, but it hardly looks malicious.
Ironic, isn't it? (Score:2, Insightful)
Re:Ironic, isn't it? (Score:5, Insightful)
The unfortunate fact of the matter is that China's government and industry are completely unconcerned about the source of the technology that they mass-produce and sell to everyone. They just don't care, period, and I suppose when you get right down to it there's no reason they should. On the other hand, that just means there's no reason why we should respect their "intellectual property" either, and when their scientists and engineers come up with something good they damn well shouldn't expect us to concern ourselves over their rights either. If Google did indeed rip off their Chinese counterparts my feeling is
So, it's not a statement of prejudice (e.g. "I dislike Chinese people because they are Chinese, or have yellow skin, or slanted eyes, or talk funny") but a legitimate observation on the state of affairs in that country.
Just watch it when you start playing the race card without a good reason
Re:Ironic, isn't it? (Score:4, Insightful)
You're definitely new here. We complain about Microsoft pinching other people's work continuously here on Slashdot, mainly because Microsoft does, continuously. We also regularly bitch about how the current patent and copyright systems here in the United States are seriously flawed. And the OP is correct in pointing out that China has always been, shall we say, less than respectful of others' rights in this regard ("blatantly ripping them off" is as good a description as any.)
What was your complaint again?
Or, basically... (Score:5, Insightful)
Hmm... (Score:5, Funny)
not saying it's the case (Score:5, Insightful)
(http://kozo.apparitiondesigns.com/)
Re:not saying it's the case (Score:4, Informative)
This is big news in China (Score:5, Informative)
There was actually much more evidence than the PC World article mentioned, the most convincing being that Google IME included many names of the developers of Sogou IME.
Although according to other users (I don't use Google Pinyin myself now, or Windows for that matter), the error has been fixed, and those developer names have been removed, in the most recent version of Google IME (1.0.17.0).
Ming
Re:This is big news in China (Score:5, Interesting)
Eventually I found an extremely effective compression method (the IME portion of our system fit into 128K including dictionary) using a hash table approach. Collisions in the hash table generated spurious terms. The spurious terms that conflicted with legitimate terms were suppressed by a "phantom dictionary". The rest of the phantoms were allowed to remain. These only came up for pinyin bigrams (almost always bigrams) that were non-productive in the stock dictionary. The user supplied dictionary took priority over the system dictionary (and the phantoms it contained) so conflicts didn't arise.
Because of the way the hash table was constructed, our dictionary generated an exponentially increasing number of phantoms with increasing phrase length. By the time you got to four character phrases, the phantoms vastly outnumbered the legitimate vocabulary. Note that our system distinguished 8000 hanzi characters for the input system, so the space of possible four character phrases was up in the trillions, and the phantoms were extremely sparse by that metric, and never seen in the wild.
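A rough sketch of how a hashed dictionary produces phantoms, and how a suppression list can hide the ones that matter. This is not the scheme described above, whose details aren't given; it is a generic fingerprint set, and `fingerprint`, `HashedDictionary`, the 8-bit table size and the sample entries are all assumptions made for illustration.

```python
import hashlib

def fingerprint(pinyin, phrase, bits):
    """Map a (pinyin, phrase) pair to a small, stable hash value."""
    digest = hashlib.sha256(f"{pinyin}\t{phrase}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)

class HashedDictionary:
    """Stores only fingerprints, so unrelated pairs can collide with real ones."""

    def __init__(self, entries, bits=8):
        self.bits = bits
        self.slots = {fingerprint(p, w, bits) for p, w in entries}
        self.suppressed = set()  # plays the role of a "phantom dictionary"

    def accepts(self, pinyin, phrase):
        if (pinyin, phrase) in self.suppressed:
            return False
        return fingerprint(pinyin, phrase, self.bits) in self.slots

    def find_phantoms(self, probes, legitimate):
        """Probes that are accepted although they were never stored."""
        legit = set(legitimate)
        return [p for p in probes if p not in legit and self.accepts(*p)]

# Hypothetical legitimate entries and a batch of nonsense probes.
ENTRIES = [("beijing", "Beijing"), ("shanghai", "Shanghai"),
           ("guangzhou", "Guangzhou"), ("shenzhen", "Shenzhen")]
PROBES = [("xx", f"nonsense{i}") for i in range(5000)]

table = HashedDictionary(ENTRIES, bits=8)
phantoms = table.find_phantoms(PROBES, ENTRIES)
table.suppressed.update(phantoms)
print(len(phantoms), "phantoms found and suppressed")
```

The table is deliberately tiny here so that phantoms show up quickly. With a much larger hash space relative to the probe space, phantoms become sparse enough that users never meet them, while anyone enumerating the dictionary wholesale would still collect them along with the real entries.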
Any competitor who had decided to enumerate our dictionary (I could have suggested several practical ways to achieve this) would have ended up with barrels of nonsense, unless they also devoted the resources, as we had, to "research" rather than plagiarise.
Nor was it possible to copy our dictionary directly in its compressed format, as the hash function was tied to a hardware dongle. I never heard that the algorithm embedded in the dongle was ever cracked directly, but I do know that the vendor's recommended algorithm for feeding the dongle was awful, and failed most of my statistical tests. We beefed up the routine until many (but far from all) of the statistical tests for randomness were satisfied, and then ran the device ten times overspec to get the performance we required. Fun times.
A funny story is that our software was listed as "cracked" on some hacker site because some l33t dude had removed the code to test for the presence of a functioning dongle, and the message we displayed "where's your dongle?" (OK, it wasn't quite like that) without noticing that with the dongle absent, the pinyin input method used white noise as the dictionary hash function, and produced nothing but chicken soup for the hanzi output text. To successfully change the hash function and maintain the dictionary compression ratio, you had to solve a bipartite graph matching problem and then recompute the phantom table, and none of that code shipped with the product.
In this era, with the amount of data you can scrape off the internet on the barest whim, I'm a bit shocked that anyone still stoops to our tried and true "research" methodologies from the mid eighties. My involvement ended around 1991 as it became apparent that Windows 3.x was going to take over the world. My joy in life at that time was writing bug-free code, and I didn't see any way to achieve that the way the world was turning. If someone tapped me on the shoulder and woke me up after my fifteen year snooze, I could probably suggest many fascinating IME features I had planned back then that still haven't been implemented, though I haven't checked on this in a long while. We already had simplified/classical, Mandarin/Cantonese working from a single dictionary. It wasn't proper dialectic Cantonese though; that was something I wished to do, but never completed. We did all this pre-Unicode, so we had to invent our own Unicode, too. Anyone need a first edition Unicode standard? I think I've got three.
a dark secret revealed... (Score:1)
They were pirates all along! I knew their original idea of 'searching the web' seemed oddly similar to their rival yahoo...
I'll be watching you, google!
Of course they're pirates! (Score:4, Funny)
(http://www.oddquad.org/)
Pot calling Kettle (Score:1, Flamebait)
(Last Journal: Wednesday July 11, @08:27PM)
hits++ (Score:3, Funny)
Maybe more to the story (Score:1, Flamebait)
Did Google say they did not hire another company to program the software, or is the media insinuating it?
Did Google say who their Chinese rivals vs. allies are, or did the media tell us?
One possible situation: What if Google hired another company to create the software? What if that 3rd party company stole IP? What if Google is looking into the issue right now and therefore won't comment to the public media? (Pure speculation, but it shows that other explanations might exist.)
You see, Google has not answered many questions yet, and this is not an admission of guilt.
On the Bright Side (Score:2, Funny)
This wouldn't be the first time... (Score:3, Interesting)
Another example is the spell checkers that Google's Gmail has for the dozen or so languages it supports. Nowhere to be found is an explanation of where these spell-checkers come from, so it would be safe to assume that Google wrote them themselves, or at least bought them from some company that allowed Google not to give it credit? Well, the reality is sadder. It turns out that Google actually uses the free-software project, aspell, to do its spell-checking, and the dozens of person-years that went into writing the actual dictionaries for aspell were simply co-opted by Google. When you spell-check in some language X, you do not see any credit for the person who wrote the dictionary, or to aspell. Even if you look very hard in the documentation, this credit is nowhere to be found. It's all very legal under the GPL, but ugly behavior, especially for scientists (like most of the Google who's-who) who are used to giving credit where credit is due.
And how do I know that Google's Gmail uses free-software spell-checkers? Well, I used a method very similar to that described in the article. I'm the author of one of the dictionaries that Google "adopted", and I deliberately inserted some "misspelled" (aka "easter-egg") words into the dictionary, so I can immediately recognize a spell-checker based on my dictionary - and it turns out that Google's Gmail spell-checker is indeed based on my dictionary.
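The check described above can be sketched as black-box probing: feed the suspect spell-checker a few planted words plus some control nonsense it should reject. Everything here is hypothetical, including the planted words and the `accepts` callables standing in for a real spell-checker; it only illustrates the idea, not the actual test that was run.

```python
def canary_report(accepts, canaries, controls):
    """Count how many planted words and control words a checker accepts."""
    return {
        "canaries_accepted": sum(1 for w in canaries if accepts(w)),
        "canaries_total": len(canaries),
        "controls_accepted": sum(1 for w in controls if accepts(w)),
    }

def looks_derived(report):
    """All canaries accepted, and the checker is not simply accepting everything."""
    return (report["canaries_accepted"] == report["canaries_total"]
            and report["controls_accepted"] == 0)

# Hypothetical planted words and control nonsense.
CANARIES = ["qwolvish", "trebbadine"]
CONTROLS = ["zzkrumpf", "vlaxxomir"]

# Stand-ins for two spell-checkers: one built on the seeded dictionary, one not.
seeded_words = {"hello", "world", "spelling", "qwolvish", "trebbadine"}
independent_words = {"hello", "world", "spelling"}

suspect = lambda word: word in seeded_words
independent = lambda word: word in independent_words

print(looks_derived(canary_report(suspect, CANARIES, CONTROLS)))
print(looks_derived(canary_report(independent, CANARIES, CONTROLS)))
```

The control words matter: a checker that accepts anything would also accept the canaries, so accepting the planted words only counts as evidence when unrelated nonsense is rejected.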
So it's great that Google reuses other software - free-software and commercial software - but they should learn to give credit where credit is due. It doesn't have to be the google.com homepage (of course) - even in some deep-down help page would do.
Re:This wouldn't be the first time... (Score:5, Insightful)
Thousands of people donate their time, money, and code to GPL-licensed projects. As one of those contributors, I can tell you that I don't believe that Google is doing anything wrong at all with aspell. The terms of the license are clear. Users are in no way required to give attribution. In fact, there is not even a suggestion, hint, or implication that attribution would be nice. You suggesting that it should be that way is fine, but to state that aspell was "co-opted" is factually incorrect and falsely implies that Google is doing something against the GPL license.
If you, as a contributor to aspell, don't like aspell's license terms, you are free to start another project with similar goals under different license terms.
Re:This wouldn't be the first time... (Score:5, Interesting)
Pinyan Input Method (Score:1, Troll)
(Last Journal: Monday September 25 2006, @01:19PM)
http://en.wikipedia.org/wiki/Kenneth_Pinyan [wikipedia.org]
evidence not very clear (Score:2)
(http://billposer.org/)
Without more evidence it isn't clear to me whether Google has done anything wrong. The fact is, there are various files around pairing Chinese characters and words written in Chinese characters with their Roman equivalents, many of them "free" for various notions of "free". Fairly comprehensive lists of this type have been around long enough that it is unlikely that anybody would start from scratch. You'd get some existing lists, combine them, review them for errors, and look for things that need to be added.
Even if there is evidence that Google consulted its competitor's data, it is far from obvious who if anyone has rights to the compilation. Suppose that the competitor itself took a free list and made a few modifications? Ethically, I don't see that Google would be doing wrong in that case. Legally, maybe - it would depend on the original license and on whether the competitor made sufficient modifications for a court to consider that they had a right to a compilation copyright.
I don't put it beyond Google to screw up, and maybe even intentionally "do evil", but I don't think that we can judge this case without knowing a lot more of the history of the data in question.
that's not plagiarism (Score:1)
That is not to say that a company can never commit plagiarism. In fact, several well-known computer companies are claiming in their marketing materials that they invented important technologies that they simply did not invent; arguably, they are committing some form of plagiarism, since they are, in fact, misrepresenting the source of creative ideas.
Combing (Score:5, Insightful)
The people who own this IP need not have stolen any other IP.
It is as dumb as saying that all Americans are Christian, gun-toting, fat fuckasses.
Plagiarism is all too common in China (Score:1)
Sohu has patents?? (Score:1)
(Last Journal: Friday November 09, @01:36AM)
Welcome to China in 2008 (Score:1)
Clarifications. (Score:3, Informative)
(http://designelement.us/)
The keyboards used in China, Taiwan, Singapore and even Japan are almost always QWERTY, but that's irrelevant. Virtually nobody except Westerners uses that to type. Printed on Chinese keyboards are 4 sets of characters. The first set is our alphabet, and the next 3 sets include characters for different text entry methods.
I don't know about China, but in Taiwan one of the sets is Zhuyin fuhao. That system, as I've seen mentioned here, is a set of simple characters, each corresponding to a distinct sound, 21 consonants and 16 syllables. It's the closest thing to a Chinese alphabet in existence. It's only really used for educational purposes, but I don't see why it isn't widely adopted in the same way the Japanese use hiragana or katakana.
Anyway, that system is comparable to Pin Yin, which is more or less a romanized version of the same thing and it's what is used for signage in China, and now in Taiwan as well. This is the method a westerner is more likely to use to type Chinese.
The funny thing about Chinese is that the same word could have many different meanings, each of which has a distinct character. So you type the word, including the appropriate tones, and up comes a list with all the corresponding characters. Then one character is chosen from the list. It's kind of like predictive text. In some cases, when a set of characters produces a meaning, upon entering the first character the user is given a list of additional characters. It's all done, obviously, to speed up the typing process.
So, this input method can be sufficiently quick, comparable to typing English. However, there are other entry methods, based on different factors, which can be more precise and significantly quicker. I have no idea how to use any of those, but it's my impression that typing with those methods can be quite a bit faster than most people typing in English.
Of course, this begs the question, why did Google bother coming up with their own system? Things are always a bit of a mess with all the options out there.
As for the possibility of code being plagiarized, I'm really not surprised at all. This is one of the consequences of outsourcing. The company might have a policy against this sort of thing, but the programmer clearly didn't care. He probably thought he could save himself a bit of trouble and ultimately saw nothing wrong with it. I've experienced similar things first hand. Unless you have a team you trust, there needs to be a lot of oversight and careful management.
Google evolves (Score:5, Funny)
Congrats to them.
scary? insane idea? (Score:2)
Good (Score:1)
Go and take a look at Baidu (www.baidu.com), China's number 1 search engine (it's a copy of Google), which owns Sogou (the company accusing Google of copying their input engine).
Copyright trap (Score:1)
Plagiarism? (Score:2)
Every day (Score:2)
(http://sc.tri-bit.com/ | Last Journal: Sunday July 08, @02:36AM)
Re:Do no evil my ass (Score:1)
This sort of thing always reminds me of Lewis Carroll's excellent "Through the Looking-Glass":
`I like the Walrus best,' said Alice: `because he was a little sorry for the poor oysters.'
`He ate more than the Carpenter, though,' said Tweedledee. `You see he held his handkerchief in front, so that the Carpenter couldn't count how many he took: contrariwise.'
`That was mean!' Alice said indignantly. `Then I like the Carpenter best -- if he didn't eat so many as the Walrus.'
`But he ate as many as he could get,' said Tweedledum.
This was a puzzler. After a pause, Alice began, `Well! They were both very unpleasant characters----'
Re:"Google's" ? (Score:1)
Re:"Google's" ? (Score:2)
In this case it's not only wrong, but the s doesn't belong there at all. If you're going to grammar-nazi, at least do it well.
Re:My girlfriends pussy.... (Score:2)
(Last Journal: Tuesday November 30 2004, @06:34PM)
Back on topic, if IP doesn't exist in China, then why does this matter?
Contrariwise, if IP is recognized in China, why doesn't somebody tell the Chinese?
Re:Google Suggest was released in 2004 (Score:1)
true perspective (Score:4, Interesting)
(Last Journal: Friday December 01 2006, @10:51AM)
It is also partially why you do not want to use China to do any IP-type work. They will steal from others and leave your company at risk, as well as allow other Chinese companies to steal from yours.
Understand that this is simply a big part of who they are now. They have been taught for the last 60 years that all the property belongs to the state and the community. It will be difficult for them to consider private ownership of anything for a number of generations. I am guessing that it will end about the time that China considers itself a superpower (which will happen). Sadly, that may be when a war occurs between China and (America|Russia|Europe|India). Offhand, I am guessing Russia. They will need a number of their resources (land, water, oil, etc).
Translation of Google China's Official Blog (Score:2)