Why Unicode Won't Work on the Internet

Posted by Hemos on Tue Jun 05, 2001 10:11 AM
from the -Linguistic,-Political,-and-Technical-Limitations dept.
We received this interesting submission from N. Carroll: "Unicode, the commercial equivalent of UCS-2 (ISO 10646-1), has been widely assumed to be a comprehensive solution for electronically mapping all the characters of the world's languages, being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters. This paper summarizes the political turmoil and technical incompatibilities that are beginning to manifest themselves on the Internet as a consequence of that oversight. (For the more technical: the recently announced Unicode 3.1 won't work either.)" Read the full article.
  • Is this a problem? by Anonymous Coward (Score:1) Tuesday June 05 2001, @06:19AM
  • Re:another drawback of unicode by Anonymous Coward (Score:1) Tuesday June 05 2001, @06:51AM
  • Re:Is this a problem? - FYI by Anonymous Coward (Score:1) Tuesday June 05 2001, @06:58AM
  • Define two unicode escape chars = 196000 chars. by Anonymous Coward (Score:1) Tuesday June 05 2001, @07:00AM
  • Re:Is this a problem? by Anonymous Coward (Score:1) Tuesday June 05 2001, @07:46AM
  • Actually, India speaks English by Anonymous Coward (Score:1) Tuesday June 05 2001, @07:49AM
  • Re:Well DUH! It's not meant to have every characte by Anonymous Coward (Score:1) Tuesday June 05 2001, @08:51AM
  • Re:Is this a problem? by Anonymous Coward (Score:1) Tuesday June 05 2001, @10:05AM
  • Languages by Alex Belits (Score:2) Tuesday June 05 2001, @02:45PM
  • Re:Unicode's reply by Alex Belits (Score:2) Tuesday June 05 2001, @03:28PM
  • Re:Languages by Alex Belits (Score:2) Wednesday June 06 2001, @06:38PM
  • Re:Unicode's reply by Alex Belits (Score:2) Thursday June 07 2001, @08:28PM
  • Re:Unicode's reply by Alex Belits (Score:2) Thursday June 07 2001, @08:32PM
  • Re:Conspiracy Theories and Unicode by Alex Belits (Score:2) Thursday June 07 2001, @08:41PM
  • Re:Statelessness of text by Alex Belits (Score:2) Sunday June 10 2001, @06:11AM
  • Re:Unicode's reply by Alex Belits (Score:2) Sunday June 10 2001, @06:21AM
  • Re:Conspiracy Theories and Unicode by Alex Belits (Score:2) Monday June 11 2001, @12:49AM
  • Re:Unicode's reply by Alex Belits (Score:2) Monday June 11 2001, @09:14PM
  • Re:Unicode's reply by Alex Belits (Score:2) Monday June 11 2001, @09:43PM
  • Re:Unicode's reply by Alex Belits (Score:2) Wednesday June 13 2001, @12:08PM
  • Re:Unicode's reply by Alex Belits (Score:2) Wednesday June 13 2001, @07:07PM
  • Re:Unicode Character Set vs Character Encoding by Jordy (Score:2) Tuesday June 05 2001, @08:31AM
  • The current permutation of Unicode gives a theoretical maximum of approximately 65,000 characters (actually limited to 49,194 by the standard).
    The biggest problem with Unicode is that no one understands what it is. Unicode defines two things, a character set that maps a character into a character code and a number of encoding methods that map a character code into a byte sequence.

    ISO 10646, the Universal Character Set defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000. Unicode 3.1 which was recently released goes a bit further.

    UCS-2, as mentioned by this article, is the same as UTF-16 and is severely limited by its 16 bit implementation. UTF-16 is unfortunately used by Windows and Java, but is rarely used on the web. The article claims UTF-16 can only map 65,000 characters, but using surrogate pairs it can actually map over 1 million characters.
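
The "over 1 million" figure can be checked with a little arithmetic. The following is a minimal sketch of the UTF-16 surrogate-pair calculation; the function name `to_surrogate_pair` is made up for illustration and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded with surrogates")
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


# 1024 possible high surrogates times 1024 possible low surrogates
SUPPLEMENTARY_CODE_POINTS = 0x400 * 0x400

high, low = to_surrogate_pair(0x10400)
print(hex(high), hex(low))
print(SUPPLEMENTARY_CODE_POINTS)
```

Two 16-bit units carry 10 bits each, which is where the roughly one million extra code points on top of the 16-bit range come from.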

    Thankfully, there are several other encoding methods for Unicode. UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF (all 2^31 character codes). It also has a nice feature of the lower 7 bits being ASCII, so there is no conversion necessary from ASCII to UTF-8.
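
The ASCII-compatibility point is easy to see in practice. This is a small sketch using Python's built-in codecs; the sample strings are arbitrary. Note that Python's UTF-8 codec only covers code points up to U+10FFFF, so the full 31-bit range described above cannot be shown this way.

```python
ascii_text = "News for nerds"

# Pure ASCII text produces exactly the same bytes under both encodings,
# so existing ASCII files are already valid UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Characters outside ASCII take two or more bytes each.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in "aé漢"}

print(same_bytes)
print(utf8_lengths)
```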

    UTF-32 or UCS-4 is a 32 bit character encoding used by a number of Unix systems. It's not exactly the most space efficient form (UTF-8 requires roughly 1.1 bytes per character for most Latin languages), but it can handle the entire Unicode character set.

    A good document on this is available at UTF-8 And Unicode FAQ [cam.ac.uk]
  • Re:Duh. by jandrese (Score:2) Tuesday June 05 2001, @06:44AM
  • Uh, I Don't Get It by Aaron M. Renn (Score:1) Tuesday June 05 2001, @06:57AM
  • Re:Is this a problem? by Isaac-Lew (Score:1) Tuesday June 05 2001, @07:57AM
  • Re:Overstating and misunderstanding the problem by imroy (Score:1) Tuesday June 05 2001, @09:34AM
  • I.e. all the character sets *in common use* in Asia today map into a subset of Unicode. They even map into the 16 bit subset, but overlap in a way that makes slightly different characters from different character sets share the same code point. That is why an extended version of Unicode is used, so Chinese/Japanese/Korean characters have different codepoints.

    Unicode does not contain all characters ever used, for example it does not contain the Nordic runes. These are not used today except by scholars, who will need special software (most likely using the "reserved to the user" part of Unicode). The same is true for many ancient Asian characters.

  • Babel by jafac (Score:2) Tuesday June 05 2001, @07:42AM
  • UCS-4 by Iffy Bonzoolie (Score:1) Tuesday June 05 2001, @06:27AM
  • Re:Duh. by Malc (Score:1) Tuesday June 05 2001, @07:05AM
  • Re:Solution - Everybody use Euro-English! by Zarquon (Score:1) Tuesday June 05 2001, @07:44AM
  • by iabervon (1971) on Tuesday June 05 2001, @07:01AM (#175380) Homepage Journal
    UTF-8 encodes 7-bit ASCII characters as themselves and all of the rest of UCS-4 (the unicode extension to 32-bits) as sequences of non-ascii characters. This means that apps which can't handle anything but ascii can simply ignore non-ascii and get all of the ascii characters (and, with minimal work, report the correct number of unknown characters).
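
As a rough illustration of "ignore non-ascii and still count the unknown characters", here is a sketch that works on raw bytes and never decodes them. It assumes the input is well-formed UTF-8, and `count_characters` is a hypothetical helper name.

```python
def count_characters(data):
    """Return (ascii_count, non_ascii_count) for well-formed UTF-8 bytes."""
    ascii_count = 0
    non_ascii_count = 0
    for byte in data:
        if byte < 0x80:
            ascii_count += 1        # 7-bit ASCII, encoded as itself
        elif byte & 0xC0 != 0x80:
            non_ascii_count += 1    # lead byte: starts one multibyte character
        # bytes of the form 10xxxxxx are continuation bytes and are skipped
    return ascii_count, non_ascii_count


sample = "coup de grâce 日本".encode("utf-8")
print(count_characters(sample))
```

An ASCII-only program needs nothing more than the high-bit test; counting lead bytes is the "minimal work" needed to report how many characters it could not display.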

    The only issue is that there's not a good way to set a mask for the characters such that 0-127 (which take up a single byte) are the common characters for the language, and so on, so English is more compact than other languages, even languages which don't require more characters.
  • Re:2 + 1 bytes? by Jeremy Erwin (Score:2) Tuesday June 05 2001, @06:38AM
  • Chinese language(s) by danny (Score:2) Tuesday June 05 2001, @05:37PM
  • Re:Overstating and misunderstanding the problem by tjansen (Score:2) Tuesday June 05 2001, @06:48AM
  • Case by Pseudonymus Bosch (Score:2) Wednesday June 06 2001, @06:17AM
  • Re:UCS-4 by spitzak (Score:2) Tuesday June 05 2001, @08:52AM
  • Re:ASCII stupidity all over again... by spitzak (Score:2) Tuesday June 05 2001, @09:22AM
  • Re:ASCII stupidity all over again... by spitzak (Score:2) Tuesday June 05 2001, @08:18PM
  • Re:Arabic space by spitzak (Score:2) Tuesday June 05 2001, @08:27PM
  • Re:Arabic space by spitzak (Score:2) Sunday June 10 2001, @01:51PM
  • by spitzak (4019) on Tuesday June 05 2001, @09:39AM (#175390) Homepage
    Thanks for some more intelligent discussion about UTF-8.

    I might add a few things:

    In UTF-8 it is not just NULL or Escape that never appear inside multibyte characters; in fact *all* 7-bit characters are absent from them (the multibytes have the high bit set in all bytes). This means that *any* program that treats all bytes with the high bit set as a "letter" will work and can parse, hash, match, search, etc identifiers/words with foreign letters in them!

    In addition the UTF-8 encoding is just heavy enough that random line noise is very unlikely to match a UTF-8 encoding. If programs treat "illegal" UTF-8 encodings as individual bytes in the ISO-8859-1 character set, it will display virtually all existing ASCII/ISO-8859-1 documents unchanged!
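
One way to sketch the "treat illegal UTF-8 as ISO-8859-1 bytes" idea is with a custom decoding error handler. The handler name `latin1_fallback` and the function `decode_lenient` are made up for this example; `codecs.register_error` is the standard-library hook being used.

```python
import codecs


def _latin1_fallback(exc):
    """Decode the offending bytes as ISO-8859-1 and resume after them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad_bytes = exc.object[exc.start:exc.end]
    return bad_bytes.decode("latin-1"), exc.end


codecs.register_error("latin1_fallback", _latin1_fallback)


def decode_lenient(data):
    """Decode as UTF-8, falling back byte-by-byte to ISO-8859-1."""
    return data.decode("utf-8", errors="latin1_fallback")


legacy = "café".encode("latin-1")   # old ISO-8859-1 document
modern = "café".encode("utf-8")     # same text as UTF-8
print(decode_lenient(legacy), decode_lenient(modern))
```

This only works as well as the claim above holds: a legacy document whose high-bit bytes happen to form a valid UTF-8 sequence would be read as UTF-8 instead.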

    The end result is that it should be easy to switch all interfaces (not just over the network, but inside programs and to libraries) to UTF-8. This will vastly simplify the handling of Unicode because there will be no need for ASCII back compatibility interfaces. We could also eliminate all the "locale" crap and make ctype.h the simple thing it once was.

    Even Arabic will encode smaller in UTF-8 than UTF-16. This is due to the fact that very common characters (not just English, but things like space and newline) are only one byte.
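
The size claim can be checked directly. In this sketch the sample phrase is arbitrary: Arabic letters take two bytes in both encodings, so any saving comes from the spaces and the newline.

```python
text = "السلام عليكم ورحمة الله\n"

utf8_size = len(text.encode("utf-8"))
utf16_size = len(text.encode("utf-16-le"))   # "-le" variant adds no byte-order mark

print(utf8_size, utf16_size)
```

The gap grows with the share of ASCII characters (spaces, digits, punctuation, markup) in the text.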

  • Re:All Character sets simultaneously?? by K-Man (Score:2) Tuesday June 05 2001, @10:51AM
  • Correct, also see link by K-Man (Score:2) Tuesday June 05 2001, @03:49PM
  • by K-Man (4117) on Tuesday June 05 2001, @09:46AM (#175393)
    If you read the article, you'll find a decent description of Korean Hangul, which has around the same number of characters as English (IIRC, it has 24).

    Hangul outdoes the latin alphabet in several ways. For one, as you mention, pronunciation in English is difficult, while in Hangul it is almost completely unambiguous. Each phoneme maps to one character, and vice-versa. There is no confusion over whether to write "cat" or "kat", for example. Only one letter has the "k" sound.

    Each Hangul character is a pictogram describing the position of the tongue, palate, and lips to use when pronouncing it. Whereas most phonetic alphabets consist of ideograms recycled as phonetic symbols, Hangul seems to be the only one to consist of symbols constructed purely for phonetic meaning.

    Since the job of a phonetic alphabet is only to represent phonemes, I would say that this alphabet does the job better than latin.
  • Re:Unicode's reply by dvdeug (Score:2) Wednesday June 06 2001, @09:15PM
  • Re:No, _n_ bytes per character! by dvdeug (Score:2) Wednesday June 06 2001, @11:45PM
  • Re:Unicode's reply by dvdeug (Score:2) Thursday June 07 2001, @10:18PM
  • Re:ISO-2022-JP and "alphabetical order" by dvdeug (Score:2) Friday June 08 2001, @08:14PM
  • Re:Unicode's reply by dvdeug (Score:2) Monday June 11 2001, @02:33PM
  • Re:Unicode's reply by dvdeug (Score:2) Tuesday June 12 2001, @11:30PM
  • Re:Unicode's reply by dvdeug (Score:2) Wednesday June 13 2001, @04:18PM
  • Re:umm by Ares (Score:1) Tuesday June 05 2001, @07:13AM
  • Re:Esperanto, Ido, lojban; BCE by unitron (Score:2) Saturday June 09 2001, @06:48PM
  • I wish it was so simple. by spacehunt (Score:1) Tuesday June 05 2001, @09:28AM
  • Quit whining about how hard it is to type by spacehunt (Score:1) Tuesday June 05 2001, @09:40AM
  • Re:Prejudice? Or technical hurdle... by spacehunt (Score:1) Tuesday June 05 2001, @09:50AM
  • Re:Some errors by MaufTarkie (Score:1) Tuesday June 05 2001, @05:09PM
  • Re:After some skimming... by Zagadka (Score:1) Tuesday June 05 2001, @06:40AM
  • Re:umm by Zagadka (Score:1) Tuesday June 05 2001, @06:55AM
  • Re:umm by Zagadka (Score:1) Tuesday June 05 2001, @07:59AM
  • Re:umm by Zagadka (Score:1) Tuesday June 05 2001, @11:26AM
  • Re:It works by h2odragon (Score:2) Tuesday June 05 2001, @06:44AM
  • Re:Alrighty by ajf (Score:1) Tuesday June 05 2001, @11:42AM
  • Re:umm by amorsen (Score:1) Tuesday June 05 2001, @09:56AM
  • No you didn't read the article, or even think by A nonymous Coward (Score:2) Tuesday June 05 2001, @07:57AM
  • Input devices are much more of an issue by sacherjj (Score:1) Tuesday June 05 2001, @06:27AM
  • Re:But for Java by sacherjj (Score:1) Tuesday June 05 2001, @06:32AM
  • Re:C programs by Luke (Score:2) Tuesday June 05 2001, @08:10AM
  • What about the artist formerly known as Prince? by mattkime (Score:2) Tuesday June 05 2001, @06:46AM
  • Re:Quit whining and move to a phonetic alphabet by scrytch (Score:2) Wednesday June 06 2001, @04:44AM
  • Re:Languages by scrytch (Score:2) Wednesday June 06 2001, @05:07AM
  • Re:Quit whining and move to a phonetic alphabet by DavidTC (Score:1) Tuesday June 05 2001, @07:34AM
  • Re:Solution - Everybody use Euro-English! by RSevrinsky (Score:1) Tuesday June 05 2001, @09:51PM
  • Re:Well DUH! It's not meant to have every characte by Mike Buddha (Score:2) Tuesday June 05 2001, @12:06PM
  • Re:After some skimming... by K. (Score:2) Tuesday June 05 2001, @06:34AM
  • Re:After some skimming... by BJH (Score:1) Tuesday June 05 2001, @07:15AM
  • Re:After some skimming... by BJH (Score:1) Tuesday June 05 2001, @07:18AM
  • Re:totally unconvinced by BJH (Score:1) Tuesday June 05 2001, @07:24AM
  • Re:Compaction and Traction by BJH (Score:1) Tuesday June 05 2001, @07:32AM
  • Re:Alrighty by BJH (Score:1) Tuesday June 05 2001, @07:42AM
  • Re:totally unconvinced by BJH (Score:1) Tuesday June 05 2001, @07:54AM
  • Re:After some skimming... by BJH (Score:1) Tuesday June 05 2001, @07:58AM
  • Re:Quit whining and move to a phonetic alphabet by BJH (Score:1) Tuesday June 05 2001, @08:04AM
  • Re:Quit whining and move to a phonetic alphabet by BJH (Score:1) Tuesday June 05 2001, @08:39AM
  • Re:Compaction and Traction by BJH (Score:1) Tuesday June 05 2001, @03:59PM
  • Re:Some errors by BJH (Score:1) Tuesday June 05 2001, @04:02PM
  • Re:Nonsense by BJH (Score:1) Thursday June 07 2001, @04:13AM
  • Some errors (Score:5)

    by BJH (11355) on Tuesday June 05 2001, @07:02AM (#175437)
    Hiragana, which is somewhat cursive, can be used to augment Kanji - in fact, everything in Kanji can be written in Hiragana. Katakana, which is much more fluid in appearance than is Hiragana, is used to write any word which does not have its roots in Kanji, such as the many foreign words and ideas which have drifted into general use over the centuries.

    In actual fact, Katakana is much more angular than Hiragana - definitely not "fluid" in appearance. Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.

    Thus it can be said that Hiragana can form pictures but Katakana can only form sounds...

    That should probably read "Kanji can form pictures but Hiragana/Katakana can only form sounds..."

    Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism.

    Bollocks. Romaji is hardly ever used (except for advertisements, and then only rarely, or textbooks for foreigners). It's definitely not the main conduit for Western ideas.

    After a time these words (even though they will still maintain their "Roman" form for awhile longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "PersaCom" in Japan.

    Again, this is incorrect. Words don't *have* a Roman form in everyday use; sure, you can express them in Romaji but no-one ever does. As for "personal computer", the correct Romanization is 'pasokon', not 'PersaCom'. (Where did he get that from?!)

    The rest of the 1,950 have to been memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.

    Ummm... that's actually not too hard. I (along with everyone else at my language school) memorized more than 1300 Kanji in less than a year... and none of us were Japanese. I know it must seem like an impossible total to people used to ASCII, but there are many common points between Kanji that simplify the learning process greatly.

    That said, I've long been against the current Unicode "standard", as have many technical people in Japan, for a number of reasons. Some of those are:

    - No standard conversion tables from existing character sets (SJIS, EUC-JP, ISO-2022-JP).
    Several conversion tables do exist, but there are minor differences between them that make it impossible to go from, say, SJIS to Unicode and back to SJIS without the possibility of changing the characters used.

    - A draconian unification of CJK characters.
    The Unicode Consortium basically forced the standards bodies in China, Japan and Korea to unify certain similar Kanji onto single code points, which doesn't allow for cases where, say, Japanese actually has two or three distinctive writings that are used in different situations.

    - The ugly "extensions".
    Unicode has been effectively ruined as a method of data exchange by its treatment of characters not in the 60,000-character basic standard.
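
The conversion-table complaint in the first point above can be reproduced with the two Shift_JIS tables that ship with CPython, `shift_jis` and `cp932`. To the best of my knowledge they disagree on the byte pair 0x81 0x60 (one table giving WAVE DASH, the other FULLWIDTH TILDE); treat that specific pair as an assumption of this sketch. `round_trips` is a hypothetical helper.

```python
raw = b"\x81\x60"   # one double-byte character in Shift_JIS

via_standard = raw.decode("shift_jis")   # table based on JIS X 0208
via_microsoft = raw.decode("cp932")      # Microsoft code page 932 table

print(f"U+{ord(via_standard):04X} vs U+{ord(via_microsoft):04X}")


def round_trips(data, decode_with, encode_with):
    """True if decoding with one table and encoding with another restores the bytes."""
    try:
        return data.decode(decode_with).encode(encode_with) == data
    except UnicodeError:
        return False


print(round_trips(raw, "shift_jis", "shift_jis"))
print(round_trips(raw, "cp932", "shift_jis"))
```

Data that passes through two systems using different tables can therefore come back changed, or fail to convert at all.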

    I could go on, but I should get some sleep...
  • Re:Yes, it is by Delphis (Score:1) Tuesday June 05 2001, @07:23AM
  • Re:Prejudice? Or technical hurdle... by Delphis (Score:1) Tuesday June 05 2001, @07:55AM
  • Re:Esperanto, Ido, lojban; BCE by Detritus (Score:1) Tuesday June 05 2001, @09:08AM
  • printing [ Chinese ] vs dictionary [ Chinese } by peter303 (Score:2) Tuesday June 05 2001, @06:58AM
  • Re:Solution - Everybody use Euro-English! by Gulthek (Score:1) Tuesday June 05 2001, @08:17AM
  • totally unconvinced by kaisyain (Score:2) Tuesday June 05 2001, @06:33AM
  • Re:Solution - Everybody use Euro-English! by sharkey (Score:2) Tuesday June 05 2001, @07:50AM
  • There's no hard limit at 65k anyhow by boots@work (Score:1) Tuesday June 05 2001, @09:02PM
  • Re:Solution - Everybody use Euro-English! by augustz (Score:1) Tuesday June 05 2001, @07:08AM
  • by gleam (19528) on Tuesday June 05 2001, @07:17AM (#175447) Homepage
    The writing system with the smallest alphabet that is in current use is Hawaiian, with 12 letters. (aeiou hklmnpw) source [alternative-hawaii.com]

    A good source for your obscure questions is, as always, the Straight Dope, which answers the "Chinese Typewriter" question here [straightdope.com].

    Regards,
    gleam
  • Re:another drawback of unicode by Ole Marggraf (Score:1) Tuesday June 05 2001, @07:04AM
  • Re:Quit whining and move to a phonetic alphabet by dutky (Score:2) Tuesday June 05 2001, @07:41AM
  • Of course, even if you could get China, Taiwan, Japan and Korea to agree on a unified character encoding similar to the ISO-Roman character set (where identical or analogous characters in the different alphabets shared the same character code) you would still need more than 50,000 encodings just for the unified asian character set.

    I can see good reasons why languages using similar alphabets should have overlapping encodings, but this is probably better solved by providing translation tables between related alphabets than by forcing multiple alphabets to share a single encoding. While I may be able to write the French coup de grâce in the English alphabet as coup de grace, something has clearly been lost. Other European languages are even worse, even those that nominally use the roman alphabet! Then there are questions of alphabetization between different languages and the questions of whether or not accented letters correspond to each other or to the unaccented letter.

    Call me a purist, but I think it would actually be much easier if we just had distinct representations for each language and had to perform some kind of mapping to display one language in another language's alphabet.

  • Re:I'll take that challenge by bears (Score:1) Thursday June 07 2001, @04:16AM
  • Re:Unicode Character Set vs Character Encoding by jholder (Score:1) Tuesday June 05 2001, @08:16AM
  • Re:Did you even read the article ? by jholder (Score:1) Tuesday June 05 2001, @08:24AM
  • Re:Unicode Character Set vs Character Encoding by jholder (Score:1) Tuesday June 05 2001, @09:03AM
  • Re:unicode does *not* encode 65,536 characters by jholder (Score:2) Tuesday June 05 2001, @08:40AM
  • Re:UTF8 by lordpixel (Score:1) Tuesday June 05 2001, @06:43AM
  • Re:UTF8 by lordpixel (Score:1) Tuesday June 05 2001, @07:03AM
  • Classical specialists need special tools by redvine (Score:1) Tuesday June 05 2001, @08:08AM
  • Re:Unicode != UCS-2 by divbyzero (Score:1) Wednesday June 06 2001, @05:54AM
  • Re:No you didn't read the article, or even think by WNight (Score:2) Tuesday June 05 2001, @12:40PM
  • Re:No you didn't read the article, or even think by WNight (Score:2) Wednesday June 06 2001, @06:59AM
  • by WNight (23683) on Tuesday June 05 2001, @12:21PM (#175462) Homepage
    Oh gawd, just listen to the feelings of entitlement in that message...

    You want the ability to search through some insanely large character set, so to do so you're willing to force everyone else to make their communications much less efficient just so you can have a free ride.

    You know, it's not a coincidence that the western world (using small variations on the roman character set) pretty well invented modern technology. It's only about a thousand times easier to process a smaller and simpler alphabet.

    There's a reason we don't use prose to command computers: until all cheap desktop models come with the ability to understand natural language, a stripped down and unambiguous command-set will be more efficient.

    I've got a lot of characters I'd find handy if we were to implement a new standard, and I'd want to expand into basic pictograms (standard symbols, etc) as well. Now I realize this isn't interesting to other people, so I'm not going to jump up and down and shout "Racist" just because people aren't anxious to bloat a new standard just to appease me. If I want those features I'll make my own font and make it available with any works that I produce which would require it.

    In short, grow up, the world does *not* owe you anything. If you want it, do it yourself instead of crying when someone else doesn't.
  • Re:Helping the poor by mattsouthworth (Score:1) Tuesday June 05 2001, @11:20AM
  • Why do we have UTF-8, UTF-16 and UTF-32? by mcdurdin (Score:1) Tuesday June 05 2001, @02:23PM
  • Re:Quit whining and move to a phonetic alphabet by HenryFlower (Score:2) Tuesday June 05 2001, @01:02PM
  • This is so wrong by jfedor (Score:2) Tuesday June 05 2001, @06:53AM
  • Duh. Duh. by Ranger Nik (Score:1) Tuesday June 05 2001, @07:15AM
  • Re:umm by Ranger Nik (Score:1) Tuesday June 05 2001, @07:18AM
  • Compaction and Traction by JJ (Score:2) Tuesday June 05 2001, @06:48AM
  • Re:Compaction and Traction by JJ (Score:2) Tuesday June 05 2001, @12:53PM
  • Re:Quit whining and move to a phonetic alphabet by scruffy (Score:2) Tuesday June 05 2001, @08:26AM
  • by scruffy (29773) on Tuesday June 05 2001, @07:18AM (#175472)
    Phonetic writing is one of the greatest inventions of mankind. All a speaker needs to be literate is to learn the mapping between sounds and letters. Could anything be easier?

    But like companies who still maintain their legacy software written in Cobol and who knows what else, countries and cultures hold onto their legacy alphabets, despite all their disadvantages, and despite all the moaning and groaning about education, literacy, and how hard it is to type 10,000 characters on a 100-key keyboard.

    I agree there is a serious problem of understanding texts written in the "old way". There is a simple solution here, too, i.e., we just translate what's most important to the "new way" and let scholars work on the texts that don't get translated. Before anyone gets too hot here, the situation is not that much different than translating literature from one language to another. It is too much work to translate everything that is written in English into French, so one focuses on the texts that are important enough for translation.

    Also, English has a lot of problems here, as it is mostly phonetic, but a large percentage is not, large enough to make learning English a lot more difficult than say learning Spanish.

    I realize this is way too utopian. We Americans can't even move to metric, much less anything more "radical". I just needed to respond to the whining.

  • Unicode for aliens by WyldOne (Score:1) Tuesday June 05 2001, @08:20AM
  • Re:Quit whining and move to a phonetic alphabet by borkbork (Score:1) Tuesday June 05 2001, @10:26AM
  • Nordic Runes? by reverse solidus (Score:2) Tuesday June 05 2001, @06:51AM
  • Re:More Flamebait :) by tommyk (Score:1) Tuesday June 05 2001, @08:21AM
  • Re:Compaction and Traction by topham (Score:1) Tuesday June 05 2001, @08:03AM
  • Re:You bring up a good point by Another MacHack (Score:1) Tuesday June 05 2001, @12:19PM
  • Re:ASCII stupidity all over again... by Another MacHack (Score:1) Tuesday June 05 2001, @01:09PM
  • Re:Quit whining and move to a phonetic alphabet by Another MacHack (Score:1) Tuesday June 05 2001, @01:37PM
  • Re:Classical specialists need special tools by Another MacHack (Score:1) Tuesday June 05 2001, @01:45PM
  • Re:Some Article by Another MacHack (Score:1) Tuesday June 05 2001, @02:37PM
  • Re:"Extended ASCII" misnames ISO-8859-1 by Another MacHack (Score:1) Tuesday June 05 2001, @08:36PM
  • Re:A Plan for the Improvement of English Spelling by bgarcia (Score:1) Wednesday June 06 2001, @01:31AM
  • by bgarcia (33222) on Tuesday June 05 2001, @07:29AM (#175485) Homepage
    Go read the original [netfunny.com] story here, by Mark Twain.
  • Re:After some skimming... by robosmurf (Score:1) Tuesday June 05 2001, @11:58PM
  • Re:Perl in Hierogliphics by CharlieG (Score:2) Tuesday June 05 2001, @08:20AM
  • Re:Is this really such a problem? by revscat (Score:2) Tuesday June 05 2001, @06:54AM
  • Re:ASCII stupidity all over again... by gorilla (Score:2) Tuesday June 05 2001, @09:25AM
  • Re:I'll take that challenge by Tower (Score:1) Tuesday June 05 2001, @11:43AM
  • It works by Kohath (Score:2) Tuesday June 05 2001, @06:22AM
  • Re:Quit whining and move to a phonetic alphabet by si1k (Score:1) Tuesday June 05 2001, @01:37PM
  • technical critique by lucentshoe (Score:1) Tuesday June 05 2001, @08:19AM
  • Re:Wrong, wrong! by csbruce (Score:2) Tuesday June 05 2001, @11:26AM
  • Re:Mrrp, wrong by Morgo (Score:1) Tuesday June 05 2001, @07:40AM
  • Re:Unicode Character Set vs Character Encoding by Morgo (Score:1) Tuesday June 05 2001, @07:57AM
  • Re:Unicode Character Set vs Character Encoding by Morgo (Score:1) Tuesday June 05 2001, @08:24AM
  • Re:Overstating and misunderstanding the problem by Morgo (Score:1) Tuesday June 05 2001, @08:36AM
  • Re:Is this a problem? by cicho (Score:1) Tuesday June 05 2001, @01:42PM
  • Re:You bring up a good point by ncc74656 (Score:2) Tuesday June 05 2001, @03:42PM
  • Well DUH! It's not meant to have every character by stienman (Score:2) Tuesday June 05 2001, @08:07AM
  • What about Klingon? by sometwo (Score:1) Tuesday June 05 2001, @07:28AM
  • Re:After some skimming... by Old Wolf (Score:1) Tuesday June 05 2001, @10:44AM
  • Re:All Character sets simultaneously?? by Old Wolf (Score:1) Tuesday June 05 2001, @10:53AM
  • Re:Uh, I Don't Get It by Old Wolf (Score:1) Tuesday June 05 2001, @11:22AM
  • Re:Overstating and misunderstanding the problem by PylonHead (Score:1) Tuesday June 05 2001, @10:12AM
  • Never mind.. question answered elsewhere (nt) by PylonHead (Score:1) Tuesday June 05 2001, @10:16AM
  • Don't be fooled by olevy (Score:1) Tuesday June 05 2001, @12:07PM
  • Re:too sinocentric, but Unicode has problems by olevy (Score:2) Tuesday June 05 2001, @11:28AM
  • Re:After some skimming... by tytso (Score:2) Wednesday June 06 2001, @02:42AM
  • Unicode & a lot of characters by kune (Score:1) Tuesday June 05 2001, @08:10AM
  • Re:Alrighty by hernick (Score:2) Tuesday June 05 2001, @06:52AM
  • Re:UTF-8 should be fine for almost any application by AdamBa (Score:1) Tuesday June 05 2001, @11:48AM
  • UTF-8 should be fine for almost any application by AdamBa (Score:2) Tuesday June 05 2001, @06:59AM
  • Oh dear - error ridden by orblee (Score:1) Tuesday June 05 2001, @06:44AM
  • Re:Some errors by salyavin (Score:1) Thursday June 14 2001, @01:49AM
  • Re:Overstating and misunderstanding the problem by Kanasta (Score:1) Tuesday June 05 2001, @05:28PM
  • Re:After some skimming... by jmccay (Score:1) Tuesday June 05 2001, @07:38AM
  • Re:Quit whining and move to a phonetic alphabet by Baki (Score:2) Tuesday June 05 2001, @10:21AM
  • Re:Quit whining and move to a phonetic alphabet by Baki (Score:2) Tuesday June 05 2001, @10:26AM
  • Re:Quit whining and move to a phonetic alphabet by rkent (Score:1) Tuesday June 05 2001, @09:00AM
  • Re:Solution - Everybody use Euro-English! by rkent (Score:2) Tuesday June 05 2001, @08:55AM
  • Re: Simpler than English by Christopher Whitt (Score:1) Tuesday June 05 2001, @07:13AM
  • Re:After some skimming... by CloudWarrior (Score:1) Tuesday June 05 2001, @07:01AM
  • Re:It works by TommyW (Score:2) Tuesday June 05 2001, @06:50AM
  • Idographics Have Their Place In English Too.... by EXTomar (Score:2) Tuesday June 05 2001, @10:28AM
  • Re:Unicode Character Set vs Character Encoding by crath (Score:2) Tuesday June 05 2001, @07:58AM
  • Frequency rates for Chinese characters by willis (Score:1) Tuesday June 05 2001, @09:49AM
  • Re:You bring up a good point by mrogers (Score:2) Monday June 11 2001, @04:55AM
  • Does this means, all ancient chars should be maped by Artemis3 (Score:1) Tuesday June 05 2001, @02:40PM
  • Re:Flamebait :) by phunhippy (Score:1) Tuesday June 05 2001, @12:31PM
  • Flamebait :) (Score:3)

    by phunhippy (86447) <colin@nosPam.woot.us> on Tuesday June 05 2001, @06:24AM (#175532) Homepage Journal
    Learn english.. 26 letters 10 numerals.. assorted punctuation.. ;)

  • Re:Quit whining and move to a phonetic alphabet by potifar (Score:1) Tuesday June 05 2001, @08:00AM
  • Re:After some skimming... by kevin@ank.com (Score:2) Tuesday June 05 2001, @07:01AM
  • Re:What about the artist formerly known as Prince? by kevin@ank.com (Score:2) Tuesday June 05 2001, @07:15AM
  • Re:Is this a problem? by pompomtom (Score:1) Tuesday June 05 2001, @09:01PM
  • Why not encode brush strokes? by MountainLogic (Score:1) Tuesday June 05 2001, @07:56AM
  • Technically Illiterate by tbray (Score:2) Tuesday June 05 2001, @09:19AM
  • Some Article by ahde (Score:1) Tuesday June 05 2001, @08:42AM
  • Re:You bring up a good point by ahde (Score:1) Tuesday June 05 2001, @09:48AM
  • Re:After some skimming... by TheReverand (Score:2) Tuesday June 05 2001, @11:58AM
  • Re:Danny Boy? by TheReverand (Score:2) Wednesday June 06 2001, @05:20AM
  • How simple is English? by The Trinidad Kid (Score:1) Tuesday June 05 2001, @07:06AM
  • Re:How simple is English? by The Trinidad Kid (Score:1) Tuesday June 05 2001, @11:38PM
  • 10100010100 by 4of12 (Score:2) Tuesday June 05 2001, @08:04AM
  • Did you even read the article ? by dingbat_hp (Score:1) Tuesday June 05 2001, @07:29AM
  • Try reading the article by dingbat_hp (Score:1) Tuesday June 05 2001, @07:34AM
  • Re:Did you even read the article ? by dingbat_hp (Score:1) Tuesday June 05 2001, @09:07AM
  • Solresol is even smaller by dingbat_hp (Score:1) Tuesday June 05 2001, @09:26AM
  • by rjh3 (99390) on Tuesday June 05 2001, @08:25AM (#175550)
    Ah, the horrors of Unicode. The referenced article is too Sinocentric. Unicode's problems go further. Unicode is both a european solution to european problems and a european solution to asian problems.

    The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.

    The history of encodings is roughly:
    1. There was chaos.
    2. Then there was ASCII (the roman alphabet) pleasing to latin and english speakers.
    3. Then there were all the ISO 8859 and ISO 2022 encodings. These let all the european languages mix together with ASCII.
    4. Then Japan, Korea, and Vietnam define their own ISO 2022 encodings that make sense in the local language, and let these languages mix together with the european languages and ASCII.
    5. But ISO 2022 is a complex patchwork of special cases. So at the same time the Asians were inventing their ISO 2022 solutions, Unicode was being invented.
    Unicode 1.0 provided a viable solution to modern european languages, but could not encode historical documents or asian languages properly. The Unicode 2.0 effort fixed the historical european language problem by adding in the alphabets for these "dead" languages. Unicode 2.0 brought the asian encodings to the point where they were usable.

    Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    Meanwhile China has a unique problem. They do not have an agreed alphabet. The Japanese all around the world agree on what characters define Kanji. There may be different fonts, but there is one agreed alphabet. Similarly, the Koreans and the Vietnamese have one agreed alphabet. These alphabets are huge, with thousands of characters, but they are fixed and agreed worldwide.

    China has not agreed on an alphabet. Different regions use different alphabets. Chinese speak numerous different languages and have invented an amazing alphabet that works as a single writing form for all those languages. But there are disagreements. Furthermore, some regions of China are still inventing new letters for the alphabet. It is not a fixed and stable thing like european alphabets. You can invent new letters. (These really are new letters, not just new fonts.)

    The Chinese have invented many encodings as a result. The two most popular (Big5 and GB2312) are not ISO 2022 compatible. There is a new, less widely used encoding that is a superset encoding of BIG5, GB2312, and other encodings, and that is ISO 2022 compatible.

    Unicode did not accept the approach of leaving all these alphabets as different. They share most of their glyphs. Giving each region and language its own complete section would have blown the 50K limit of Unicode 2.0. They smushed all these different alphabets into one blob by combining anything that had similar glyphs into one character.

    This left Unicode 2.0 telling the Chinese, ignore all those letters we don't like. You don't use them much anyhow. It destroyed any notion of alphabetic order in the encodings for any asian language. And it is usable for modern text communication. Unicode 3.0 promises to do better, and probably will.

    But since all these languages can use the ISO 2022 encodings with a fully compatible mixture of languages, why not just use ISO 2022 and forget Unicode? The problem is the patchwork nature of ISO 2022. The encoding rules are complex. ISO 2022 is a terrible internal format. A chinese character may take from 2 to 9 bytes to encode. And it gets worse as you dig further. UCS-2 and UCS-4 are very nice friendly internal formats for computers. It is trivial to convert from UCS-2 or UCS-4 into UTF-8 for transmission.
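
    To give a feel for how small that UCS-to-UTF-8 conversion is, here is a minimal Python sketch. The function name `ucs4_to_utf8` is made up for illustration, and the sketch assumes the four-byte form of UTF-8 that stops at U+10FFFF; it does not reject surrogate values, which a stricter encoder would.

```python
def ucs4_to_utf8(code_point: int) -> bytes:
    """Encode a single code point as UTF-8 using only shifts and masks."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x80:        # up to 7 bits: one byte, identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: two bytes
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: three bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: four bytes
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against Python's built-in encoder for a few sample code points.
for cp in (0x41, 0xE9, 0x6D77, 0x1D11E):
    assert ucs4_to_utf8(cp) == chr(cp).encode("utf-8")
```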

    It is also pretty simple to translate from UCS-2 or UCS-4 into ISO 2022 encodings. So the ISO 2022 encodings actually can make sense for network transmission.

    These issues will just get worse as you include other languages, like historical chinese, chinese border languages, and south asian languages. As with chinese, some of these have the fundamentally hard problem that they do not agree on a single alphabet.
  • Re:Wrong, wrong, WRONG, WRONG! by Mendax Veritas (Score:1) Tuesday June 05 2001, @07:22AM
  • Re:Wrong, wrong! by Mendax Veritas (Score:1) Tuesday June 05 2001, @11:53AM
  • Re:UTF8 by Mendax Veritas (Score:2) Tuesday June 05 2001, @06:35AM
  • Wrong, wrong! (Score:4)

    by Mendax Veritas (100454) on Tuesday June 05 2001, @06:32AM (#175554) Homepage
    UCS-2 is not the only form of Unicode, and it's well known that 64k characters isn't enough. Besides, why should ordinary ISO-8859 (Latin-1) text be doubled in size by making every character 16 bits? UTF-8 is a much better solution, and it is good enough. Granted, string handling with variable-length characters is a bit of a pain (especially if you're used to assuming that a buffer of N bytes is long enough for a string of N characters, or you want to scan the string backwards), but it's the best solution we've got. It's the recommended encoding for XML documents, and is used today in web browsers (check out that "Always send URLs as UTF-8" option in Internet Explorer).

    It is a shame that there are so many different Unicode encodings. I think we ought to just standardize on UTF-8.
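
    The two pain points mentioned above (byte count versus character count, and scanning backwards) can be sketched in Python. The helper names here are hypothetical, and the code assumes the buffer already holds valid UTF-8. Backward scanning works because every continuation byte has the bit pattern 10xxxxxx, so you step back until you hit a byte that does not.

```python
text = "na\u00efve \u6d77"          # "naive" with a diaeresis, a space, U+6D77
encoded = text.encode("utf-8")


def previous_char_start(buf: bytes, index: int) -> int:
    """Return the offset where the character ending just before `index` starts."""
    index -= 1
    # Skip continuation bytes (10xxxxxx) until a lead byte or ASCII byte is found.
    while index > 0 and (buf[index] & 0xC0) == 0x80:
        index -= 1
    return index


def chars_backwards(buf: bytes):
    """Yield the characters of a valid UTF-8 buffer from last to first."""
    end = len(buf)
    while end > 0:
        start = previous_char_start(buf, end)
        yield buf[start:end].decode("utf-8")
        end = start


print(len(text), len(encoded))       # characters vs. bytes
print("".join(chars_backwards(encoded)) == text[::-1])
```

    For this sample string the character count is 7 and the byte count is 10, which is exactly the "N bytes is not N characters" trap.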

  • Re:Hmm.. I must have been using something else the by ClarkEvans (Score:2) Tuesday June 05 2001, @07:40AM
  • Re:Hmm.. I must have been using something else the by ClarkEvans (Score:2) Tuesday June 05 2001, @07:44AM
  • Re:Unicode Character Set vs Character Encoding by ClarkEvans (Score:2) Tuesday June 05 2001, @07:58AM
  • Re:Unicode Character Set vs Character Encoding by ClarkEvans (Score:2) Tuesday June 05 2001, @08:18AM
  • Re:No you didn't read the article, or even think by Troed (Score:1) Tuesday June 05 2001, @11:22PM
  • Language ID? by kreyg (Score:2) Tuesday June 05 2001, @07:29AM
  • Re:UTF8 by egomaniac (Score:1) Tuesday June 05 2001, @07:00AM
  • Re:Unicode has this covered. by egomaniac (Score:1) Tuesday June 05 2001, @07:03AM
  • Re:unicode does *not* encode 65,536 characters by egomaniac (Score:1) Tuesday June 05 2001, @08:25AM
  • Re:Compaction and Traction by egomaniac (Score:1) Tuesday June 05 2001, @08:31AM
  • by egomaniac (105476) on Tuesday June 05 2001, @06:46AM (#175565) Homepage
    It encodes over one million codepoints, actually (the erroneous statements of other posters notwithstanding). All currently assigned Unicode characters exist within the basic Unicode Plane 0, as it's called, which handles ~50,000 characters. Twenty-some-odd-thousand of those characters are in the CJK block (Chinese, Japanese, and Korean characters).

    Now, a range of Unicode characters is set aside for so-called "surrogates", and a high surrogate and a low surrogate character placed next to one another form a "surrogate pair" which specifies an extended character in UCS Plane 1. None of the UCS Plane 1 codepoints are actually assigned to anything yet, but since there are about 2^20 (~one million) Plane 1 codepoints, they will easily handle all remaining glyphs with a ton left over. Tengwar, Klingon and others have all been considered for Plane 1 encoding (although I just checked and Klingon has been rejected. Sorry folks).

    So, the simple fact is that anyone who says Unicode can't support enough characters has been smoking a bit too much crack lately. Do yourself a favor and go read the spec before getting your panties in a twist.
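
    For anyone who wants to see the surrogate arithmetic rather than take it on faith, here is a rough Python sketch (function names are illustrative, not from any spec). There are 1,024 high surrogates (0xD800-0xDBFF) and 1,024 low surrogates (0xDC00-0xDFFF), so a pair addresses 1024 x 1024 = 2^20 code points starting at U+10000; the standard divides that range into sixteen supplementary planes.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the code point it stands for."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```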
  • Westerner's attitude? by gaemon (Score:1) Wednesday June 06 2001, @02:50AM
  • Unicode by ralmeida (Score:1) Tuesday June 05 2001, @06:23AM
  • 2 + 1 bytes? by whovian (Score:1) Tuesday June 05 2001, @06:29AM
  • next version of unicode should be 24 bit by j0nb0y (Score:1) Tuesday June 05 2001, @08:10AM
  • Language geeking by persist1 (Score:1) Wednesday June 06 2001, @10:43AM
  • Re:I had no trouble reading that at all by quahog (Score:1) Tuesday June 05 2001, @05:59PM
  • Re:You bring up a good point by d-rock (Score:1) Tuesday June 05 2001, @07:43PM
  • by tjwhaynes (114792) on Tuesday June 05 2001, @06:58AM (#175573)

    Had this researcher bothered to read the Unicode technical introduction, the following would have been obvious.

    In all, the Unicode Standard, Version 3.0 provides codes for 49,194 characters from the world's alphabets, ideograph sets, and symbol collections. These all fit into the first 64K characters, an area of the codespace that is called basic multilingual plane, or BMP for short.

    There are about 8,000 unused code points for future expansion in the BMP, plus provision for another 917,476 supplementary code points. Approximately 46,000 characters are slated to be added to the Unicode Standard in upcoming versions.

    The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.
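
    The figures quoted above can be reproduced with a few lines of arithmetic. This is a sketch under assumptions the quoted text does not spell out: that there are 16 supplementary planes of 65,536 code points each, that the last two code points of every plane (U+nFFFE and U+nFFFF) are noncharacters, that planes 15 and 16 are the private use planes, and that the BMP private use area runs from U+E000 to U+F8FF.

```python
PLANE_SIZE = 0x10000                 # 65,536 code points per plane
NONCHARACTERS_PER_PLANE = 2          # U+nFFFE and U+nFFFF
SUPPLEMENTARY_PLANES = 16            # planes 1 through 16
PRIVATE_USE_PLANES = 2               # planes 15 and 16

usable_per_plane = PLANE_SIZE - NONCHARACTERS_PER_PLANE

bmp_private_use = 0xF8FF - 0xE000 + 1
supplementary_private_use = PRIVATE_USE_PLANES * usable_per_plane
supplementary_general = (SUPPLEMENTARY_PLANES - PRIVATE_USE_PLANES) * usable_per_plane

print(bmp_private_use, supplementary_private_use, supplementary_general)
```

    Under those assumptions the three values come out to 6,400, 131,068 and 917,476, matching the numbers in the introduction.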

    Plenty of room.

    Cheers,

    Toby Haynes

  • Re:This article is stupid by MrResistor (Score:1) Wednesday June 06 2001, @10:30AM
  • This article is stupid by MrResistor (Score:2) Tuesday June 05 2001, @07:24AM
  • Re:All Character sets simultaneously?? by spiro_killglance (Score:1) Tuesday June 05 2001, @07:32AM
  • Re:After some skimming... by RFC959 (Score:1) Tuesday June 05 2001, @09:21AM
  • Minor nitpick by HalfFlat (Score:1) Tuesday June 05 2001, @08:34AM
  • by HalfFlat (121672) on Tuesday June 05 2001, @08:26AM (#175579)

    As a preliminary, Unicode and ISO 10646 aren't the same standard, but are kept pretty much in synchronisation. ISO 10646 provides a character set with a 4-byte representation, and a compatible smaller set with a 2-byte representation. These representations have encodings such as UTF-8, UTF-16, and UTF-32. UTF-32 encodes every Unicode character in 32 bits and can represent the full 2^31 codepoints, while UTF-8 and UTF-16 as described in the Unicode 3.1 document [unicode.org] are variable length representations that can represent approximately 2,100,000 and 1,100,000 codepoints respectively.
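
    The approximate capacities above follow from the bit layouts. As a rough check in Python (assuming the four-byte form of UTF-8, which carries 3 + 6 + 6 + 6 payload bits, and UTF-16 with 1,024 high and 1,024 low surrogates):

```python
# UTF-8, four-byte form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
utf8_payload_bits = 3 + 6 + 6 + 6
utf8_capacity = 2 ** utf8_payload_bits

# UTF-16: the 16-bit range plus everything reachable through surrogate pairs
utf16_capacity = 0x10000 + 1024 * 1024

# UCS-4 as defined by ISO 10646: a 31-bit code space
ucs4_capacity = 2 ** 31

print(utf8_capacity, utf16_capacity, ucs4_capacity)
```

    That gives 2,097,152 for four-byte UTF-8 and 1,114,112 for UTF-16 (a little less in practice, since the 2,048 surrogate values themselves do not stand for characters).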

    One of the design principles was to provide a lossless representation of any currently used character set in Unicode, so that a round-trip re-encoding of text from one encoding to Unicode and back again would lose no information. Another was to keep distinct code-points for any characters that had different semantics, or different 'abstract shapes'.

    It turns out that one can satisfy these requirements for the Japanese kanji, Chinese hanzi (traditional and simplified) and Korean hanja without requiring a separate code-point for each; in Unicode version 2.0, approximately 121,000 such characters were able to be represented in 20,902 code points. Note that those characters which have distinct shapes but the same meaning, and those which are similar enough to be classified as calligraphic variants but have distinct meanings, are all represented by distinct code-points. (One caveat: in practice there are some exceptions as regards the preservation of information after a round-trip encoding to Unicode and back. For example, the CCCII encoding of hanzi explicitly catalogues calligraphic variations, and as such doesn't map 1-1 onto Unicode.)

    Of course, the actual glyph that corresponds to one of these unified codes will change depending upon the context in which it is rendered. For example the character 0x6d77 corresponding to the character for sea in both Chinese (Mandarin 'hai3') and Japanese ('umi') is drawn with one fewer stroke in Japanese than in Chinese. These typographical details are important, but can (and debatably, should) be dealt with outside the context of character encoding. Unicode has support for language tags which in the absence of any higher-level information can indicate the language context of the characters following them. Typically though, this information should be stored as part of a richer document structure (as is possible in XML for example.) Correct display of characters will require the presence of the appropriate font and a mechanism (such as LOCALE in a simple one language case) for selecting this font.

    Given this unification then, one really can fit most of the characters for which there are already extant (non-Unicode) encodings into 16 bits. With Unicode 3.1/ISO 10646-2 (which uses more than 65536 codepoints) this representation is AFAIK pretty much complete, including for example all of the hanzi of CNS 11643-1992 and CNS 11643-1986 plane 15 (the most complete hanzi encoding outside of CCCII.)

    With this in mind, one can argue against the points raised in the article:

    1. The unification scheme allows the representation of the 170,000 characters the author calculates in 70,000 or so codepoints, which it now does with Unicode 3.1. The use of external context is still necessary for correct rendering, but if the document has no structure for representing language context, there are Unicode language tags that can fill this role. Similarly, context would be required for the presentation of different calligraphic variants of Roman characters (e.g. fraktur.)
    2. Unification is quite unlike the analogy described 'in Western Terms'. 'M' and 'N' could not be identified, as they semantically distinguish words (e.g., 'rum' and 'run' have very different meanings.) Traditional characters and their simplified analogues are not identified under Unicode, so even if 'Q' were simply a fancier 'C' (which of course it is not), it wouldn't be given the same codepoint.
    3. Unicode is not limited to 16 bits as stated in the introduction to the article. There are over 2000 million available codepoints in UCS-4 and UTF-8, and UTF-16 can represent approximately 1 million of these. There is plenty of room - even in UTF-16 - to encode more characters as the need arises.
    4. With the exception of calligraphic variants in CCCII, Unicode can already faithfully represent characters in the major Chinese, Japanese and Korean character encoding standards.

    A little bit of research by the article author would have made the article unnecessary.

    References:
    Unicode 3.1 document;
    CJKV Information Processing, Ken Lunde.

    PS: In the time it took me to read the article, do some research and write this response, there have been over 300 slashdot comments. Wow.

  • Re:Is this a problem? by autechre (Score:2) Tuesday June 05 2001, @07:50AM
  • Nonsense by GCP (Score:1) Wednesday June 06 2001, @11:28PM
  • No, no by GCP (Score:1) Wednesday June 06 2001, @11:34PM
  • Re:Nonsense by GCP (Score:1) Friday June 08 2001, @08:10PM
  • Re:You bring up a good point by Kaiwen (Score:1) Wednesday June 06 2001, @04:10PM
  • UCS by markbthomas (Score:1) Wednesday June 06 2001, @02:43AM
  • ¹The point of c -> k and c -> s by yerricde (Score:1) Tuesday June 05 2001, @05:45PM
  • Tengwar is SCRIPT not language by yerricde (Score:2) Tuesday June 05 2001, @03:59PM
  • Tengwar: Another alphabet designed on phonetics by yerricde (Score:2) Tuesday June 05 2001, @04:52PM
  • "Ye Olde" typo and Walt Disneþ^WDisney by yerricde (Score:2) Tuesday June 05 2001, @05:01PM
  • Basic English by yerricde (Score:2) Tuesday June 05 2001, @05:12PM
  • Likewise, for Latin-1 based languages... by yerricde (Score:2) Tuesday June 05 2001, @05:30PM
  • And Unicode distinguishes those. by yerricde (Score:2) Tuesday June 05 2001, @05:41PM
  • UTF-8 by yerricde (Score:2) Tuesday June 05 2001, @06:05PM
  • "Extended ASCII" misnames ISO-8859-1 by yerricde (Score:2) Tuesday June 05 2001, @06:10PM
  • Bummer by SpanishInquisition (Score:1) Tuesday June 05 2001, @06:35AM
  • Re:You bring up a good point by timbu2 (Score:1) Tuesday June 05 2001, @10:40AM
  • Re:Wrong, wrong! (Score:3)

    by hackbod (133575) <hackbod@enteract.com> on Tuesday June 05 2001, @07:38AM (#175597) Homepage
    People who think there is a problem with the number of different Unicode encodings -- including the authors of this article -- completely misunderstand how Unicode works. The different encodings are -not- different character sets -- in fact, they are different ways to write the -same- standard Unicode character set. The transformation between UTF-8, UTF-16, and UTF-32 is only a simple bit manipulation -- it is completely independent of the character set.

    An implication of this is that UTF-8, UTF-16, and UTF-32 can all express the EXACT SAME NUMBER OF CHARACTER CODES. So, if you think UTF-32 is good enough for you, then UTF-16 and UTF-8 are just as good. The latter two simply use multi-word or multi-byte sequences to express the upper character values.
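
    A quick way to convince yourself of this is to push the same code points through all three encodings and check that nothing is lost. The sketch below uses Python's built-in codecs; the helper name `encoded_sizes` and the sample code points are arbitrary choices for illustration.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Round-trip `text` through three Unicode encodings and report byte counts."""
    sizes = {}
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        data = text.encode(codec)
        # Same characters come back out, whatever the byte layout was.
        assert data.decode(codec) == text
        sizes[codec] = len(data)
    return sizes


# ASCII letter, Latin-1 letter, a CJK ideograph, and a character beyond 16 bits.
sample = "".join(chr(cp) for cp in (0x41, 0xE9, 0x6D77, 0x1D11E))
print(encoded_sizes(sample))
```

    Only the byte counts differ (10, 10 and 16 bytes for this sample); the character beyond 16 bits costs four bytes in every one of the three.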

    After using BeOS for a number of years, where all character strings are natively handled as UTF-8, I am a very strong believer in Unicode. Having a Western perspective I may be missing something, but none of the "problems" mentioned in this article are actually problems that Unicode has.

    Of course, once you start using Unicode, the main problem you are going to run into is having fonts with the characters you need. And if the Chinese, Japanese, etc. really need 50,000 of their very own characters, then this is going to be that much more of a problem. Unfortunately, there is no easy solution to this -- but it doesn't have anything to do with the encoding you use, so changing to another encoding is not going to help here.
  • Re:You bring up a good point by joto (Score:2) Tuesday June 05 2001, @12:06PM
  • Re:Bummer by joto (Score:2) Tuesday June 05 2001, @12:47PM
  • Re:You bring up a good point by Doomdark (Score:1) Tuesday June 05 2001, @04:06PM
  • Re:You bring up a good point by Doomdark (Score:1) Tuesday June 05 2001, @04:17PM
  • Re:one from finnish? by Doomdark (Score:1) Friday June 08 2001, @09:23AM
  • All Character sets simultaneously?? by -tji (Score:1) Tuesday June 05 2001, @06:29AM
  • 1st Posts of the World by Shocker69 (Score:1) Tuesday June 05 2001, @11:41AM
  • Re:In other news... by Shocker69 (Score:1) Tuesday June 05 2001, @11:43AM
  • Esperanto, Ido, lojban; BCE by mrBlond (Score:1) Tuesday June 05 2001, @07:18AM
  • BCE by mrBlond (Score:1) Tuesday June 05 2001, @07:50AM
  • Mrrp, wrong by Srin Tuar (Score:1) Tuesday June 05 2001, @07:17AM
  • funny by Srin Tuar (Score:1) Tuesday June 05 2001, @07:27AM
  • UTF8 by Srin Tuar (Score:2) Tuesday June 05 2001, @06:27AM
  • You bring up a good point by Srin Tuar (Score:2) Tuesday June 05 2001, @06:47AM
  • (reply to AC) by Srin Tuar (Score:2) Tuesday June 05 2001, @06:53AM
  • Re:Unicode Character Set vs Character Encoding by TekPolitik (Score:2) Tuesday June 05 2001, @02:27PM
  • by torokun (148213) on Tuesday June 05 2001, @10:39AM (#175614) Homepage
    There are some good comments here, clarifying why this article is fundamentally wrong in its assumption that Unicode only encodes 2^16 characters. This is the first reason why this article is wrong.

    The other reasons are more subtle, and I'm not sure that everyone here understands what's going on with CJK characters, so here's a little background.

    The characters we're talking about originated in China, and spread to Korea, Vietnam, and Japan. Vietnam has switched to a western alphabet now, so let's leave them out. ;) At one point, although there have always been alternative forms for some characters, there was a reasonably standard set of Chinese characters used throughout these countries (recorded in the KangXi dictionary)...

    The Japanese invented a number of their own characters, which I'm sure number less than 1000. Up until World War II, this was basically the situation. (So at this time, the required number of characters to encode would have been less than 50,000 -- Chinese characters and Japanese additions.) Then all hell broke loose, so to speak.

    The Japanese simplified a large number of their characters systematically, immediately following WWII ( So they started substituting simpler characters for the disallowed ones in these compounds, and thereby subtly changed the meaning of the words.

    On to China -- they also began a campaign of character simplification, which would span quite a few years, although theirs was much more radical than the Japanese approach. In fact, some of the simplified versions the government came out with were so repulsive, they were eventually retracted because everyone refused to use them. ;) So they ended up with a few thousand ( Finally, Korea, Taiwan, and Hong-Kong basically kept the traditional chinese characters.

    So, that gives us the basic 40,000, plus 3000 Japanese (kokuji and shinjitai), plus maybe 10,000 chinese (jiantizi), plus some other stuff not mentioned here, giving a grand estimate of around 55,000.

    The key to this is that the vast majority of characters used are common among all 5 locales. This was the only reason that anyone even attempted to encode the CJK characters in the first place. The re-unification of all the disparate character sets was called Han-Unification during the Unicode development process.

    This, combined with the surrogate encoding area, ensures that there will be plenty of space for everyone... :)

  • Re:Quit whining and move to a phonetic alphabet by molybdenum (Score:1) Tuesday June 05 2001, @08:22AM
  • A Short History of Character Encoding by KidSock (Score:1) Tuesday June 05 2001, @12:00PM
  • Helping the poor by peccary (Score:2) Tuesday June 05 2001, @07:53AM
  • by peccary (161168) on Tuesday June 05 2001, @07:30AM (#175618)
    I mean, imagine how much poorer you would be if you had been unable to read the epic poems of early Anglo-Saxon culture in their original form! Or the early Judaic and Greek writings on which much of our more recent culture is based.

    You *have* read Beowulf, and the Canterbury Tales, haven't you? Along with Plato's Republic in Greek, and the Dead Sea scrolls?

    Now imagine how hard this would be if your computer didn't support the full character set in which they were written.
  • Ignorant nonsense. by fm6 (Score:2) Wednesday June 06 2001, @06:02PM
  • Re:After some skimming... by kurisuto (Score:1) Tuesday June 05 2001, @06:55AM
  • Re:All Character sets simultaneously?? by kurisuto (Score:1) Tuesday June 05 2001, @06:57AM
  • Re:Is this really such a problem? by kurisuto (Score:1) Tuesday June 05 2001, @07:11AM
  • Re:I'll take that challenge by kurisuto (Score:1) Tuesday June 05 2001, @07:37AM
  • Re:Overstating and misunderstanding the problem by kurisuto (Score:1) Tuesday June 05 2001, @07:42AM
  • Re:another drawback of unicode by kurisuto (Score:2) Tuesday June 05 2001, @07:02AM
  • Re:Unicode includes all common Asian character set by kurisuto (Score:2) Tuesday June 05 2001, @07:15AM
  • Re:Define two unicode escape chars = 196000 chars. by kurisuto (Score:2) Tuesday June 05 2001, @07:24AM
  • by kurisuto (165784) on Tuesday June 05 2001, @06:33AM (#175628) Homepage
    This article mischaracterizes the issue concerning the Chinese characters. To take a western example as an illustration, the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V. However, folks in America and Germany agree that this is "the same character"; we simply have a different way of writing it. Unicode recognizes this sameness by assigning the same character code for "one"; the way to display it locally is a presentation issue, not an encoding one.

    This is exactly the issue with the Chinese characters. For a given character, there might be a difference between the Taiwanese way of writing it, the Japanese way, and the mainland Chinese way; but the character is still recognized as being the same, despite these presentation-level differences.

    For someone to demand that each national presentation form have its own character code is to misunderstand what Unicode is designed for. It encodes abstract characters, not presentation forms. Unicode does not have separate codes for "A" in Garamond and "A" in Helvetica.

  • People didn't get your joke by GodSpiral (Score:1) Tuesday June 05 2001, @12:16PM
  • Re:After some skimming... by 2Bits (Score:1) Tuesday June 05 2001, @08:36AM
  • Re:Quit whining and move to a phonetic alphabet by 2Bits (Score:1) Tuesday June 05 2001, @09:21AM
  • Re:How simple is English? by de Selby (Score:1) Tuesday June 05 2001, @10:37AM
  • Re:You bring up a good point by de Selby (Score:1) Tuesday June 05 2001, @10:51AM
  • Re:You bring up a good point by de Selby (Score:1) Tuesday June 05 2001, @11:02AM
  • Re:All Character sets simultaneously?? by OhPlz (Score:1) Tuesday June 05 2001, @06:37AM
  • Re:Solution - Everybody use Euro-English! by Skuto (Score:2) Tuesday June 05 2001, @06:45AM
  • Re:Some errors by ProfBooty (Score:1) Tuesday June 05 2001, @12:32PM
  • by saider (177166) on Tuesday June 05 2001, @06:34AM (#175638)
    The European Commission has just announced an agreement whereby English will be the official language of the EU rather than German which was the other possibility. As part of the negotiations, Her Majesty's Government conceded that English spelling had some room for improvement and has accepted a 5 year phase-in plan that would be known as "Euro-English".

    In the first year, "s" will replace the soft "c". Sertainly, this will make the sivil servants jump with joy. The hard "c" will be dropped in favour of the "k". This should klear up konfusion and keyboards kan have 1 less letter.

    There will be growing publik enthusiasm in the sekond year, when the troublesome "ph" will be replaced with "f". This will make words like "fotograf" 20% shorter.

    In the 3rd year, publik akseptanse of the new spelling kan be ekspekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of the silent "e"s in the language is disgraseful, and they should go away.

    By the fourth year, peopl wil be reseptiv to steps such as replasing "th" with "z" and "w" with "v". During ze fifz year, ze unesesary "o" kan be dropd from vords kontaining "ou" and similar changes vud of kors be aplid to ozer kombinations of leters.

    After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!


  • Characters, not glyphs! by Wulfstan (Score:1) Tuesday June 05 2001, @07:34AM
  • binary by SkyLeach (Score:1) Tuesday June 05 2001, @09:17AM
  • Re:Is this a problem? by lingsb (Score:1) Tuesday June 05 2001, @12:16PM
  • Re:After some skimming... by gea (Score:2) Tuesday June 05 2001, @08:04AM
  • Re:No it's not by Pru (Score:1) Tuesday June 05 2001, @06:39AM
  • by achurch (201270) on Tuesday June 05 2001, @05:36PM (#175644) Homepage

    >>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

    I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.

    While it is true that most all kanji have multiple pronunciations, the kanji in ISO-2022-JP are most definitely in order. Level 1 characters (0x3021-0x4F7E) are ordered by their primary reading, and Level 2 characters (0x5021-0x7426?) are ordered first by radical and then by number of strokes. In both cases it's easy to locate a character if for some reason you can't type it normally (e.g. it's not in your IME dictionary)--I've had to do this on occasion, in fact.

    Unicode is, for all intents and purposes, completely random. Even without the problems of characters being inappropriately merged, there is no way you could try and find a character in Unicode; if your dictionary doesn't have it, tough luck. To me, that's an even scarier concept: for all practical purposes it could eliminate characters from the language. After all, if nobody can type it who's going to use it?

    Have you ever tried to program in shift-JIS? It is horrific.

    I will agree with this. Leaving aside the original poster's confusion of ISO-2022-JP and shi[f]t-JIS (the former is the official standard, aka JIS, while the latter is a poorly-thought-out Microsoft hack), dealing with strings that contain both half-width (1-byte) and full-width (2-byte) characters is a major PITA. About the only thing that can be said for it is the number of bytes is equal to the number of half-width character positions needed; and even that only applies to EUC and SJIS, since JIS has escape sequences to squeeze everything into 7-bit characters.
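
    The byte-count point is easy to see with Python's built-in Japanese codecs. This is only a sketch: the sample string is arbitrary, and display width is approximated by counting East Asian "wide" and "fullwidth" characters as two columns.

```python
import unicodedata

text = "\u65e5\u672c\u8a9eabc"   # three kanji followed by "abc"


def display_columns(s: str) -> int:
    """Count half-width columns: wide/fullwidth characters take two, others one."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)


sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")
jis = text.encode("iso2022_jp")

print(display_columns(text), len(sjis), len(euc), len(jis))
print(all(b < 0x80 for b in jis))   # ISO-2022-JP stays within 7 bits
```

    For SJIS and EUC the byte count equals the column count (9 here), while the ISO-2022-JP form is longer because it wraps the kanji in escape sequences to stay 7-bit clean.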

    On the other hand, there's the character order consideration, which along with the problem of merged characters seems to be what draws so much dislike for Unicode from Japanese.

    --
    BACKNEXTFINISHCANCEL

  • Re: Simpler than English by jpm242 (Score:1) Tuesday June 05 2001, @07:19AM
  • Re:Is this a problem? by cougio (Score:1) Tuesday June 05 2001, @02:39PM
  • Re: Simpler than English by cougio (Score:1) Tuesday June 05 2001, @03:39PM
  • Ancient Latin... by IngramJames (Score:1) Wednesday June 06 2001, @12:01AM
  • Re:After some skimming... by Decado (Score:1) Tuesday June 05 2001, @06:47AM
  • Re:Esperanto, Ido, lojban; BCE by GunFodder (Score:1) Tuesday June 05 2001, @07:57AM
  • Re:Prejudice? Or technical hurdle... by GunFodder (Score:1) Tuesday June 05 2001, @08:58AM
  • Prejudice? Or technical hurdle... by fleeb_fantastique (Score:1) Tuesday June 05 2001, @06:45AM
  • Re:I'll take that challenge by easyfrag (Score:1) Tuesday June 05 2001, @09:30AM
  • Re:I had no trouble reading that at all by tswinzig (Score:1) Tuesday June 05 2001, @06:10PM
  • Re:A Plan for the Improvement of English Spelling by tswinzig (Score:2) Tuesday June 05 2001, @06:15PM
  • Re:I had no trouble reading that at all by tswinzig (Score:2) Thursday June 07 2001, @06:13PM
  • Is this really such a problem? by Dan Hayes (Score:1) Tuesday June 05 2001, @06:33AM
  • Mac OSX has character mapping problems by wmulvihillDxR (Score:1) Tuesday June 05 2001, @06:28AM
  • Re:After some skimming... by update() (Score:1) Tuesday June 05 2001, @07:03AM
  • Re:More Flamebait :) by update() (Score:1) Tuesday June 05 2001, @07:21AM
  • by update() (217397) on Tuesday June 05 2001, @06:26AM (#175661) Homepage
    I planned to read this through before posting. I really did. But then, in the second paragraph I hit:
    Wieger's seminal book about the characters and construction of China, published in 1915, was to become the defacto source against which all others would (and still should) be compared - with several caveats. Amongst these is a noticeable bias on his part against Taoism which becomes more evident in his analysis of the Tao Tsang (i.e., Taoist Canon of Official Writings [written 'DaoZang' in the PinYin Romanization of Mainland China] )
    and I decided to skim the rest.

    To summarize, for those whose eyes completely glazed over, his point is that Unicode doesn't sufficiently cover the full range of Chinese characters and that not using a larger set is a result of a longstanding Western prejudice that the Chinese don't need so many characters.

    Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.

    Unsettling MOTD at my ISP.

  • In other news... (Score:3)

    by ackthpt (218170) on Tuesday June 05 2001, @06:45AM (#175662) Homepage Journal

    Bush bolts GOP to join Democrats, fires entire Whitehouse staff

    Linus Torvalds to join Microsoft as OfficeXP advocate

    NASA on Moonshots, "Ok, ok, they were all actually faked on a soundstage in Toledo, Ohio and the ISS is really in a warehouse in Newark, New Jersey"

    Oracle CEO Larry Ellison to give fortune to charity, dumps Japanese kimonos for Dockers and GAP T-shirts

    RIAA to drop all charges against Napster, "All a big fsck-up, we'll all get rich together"

    Taiwan throws in towel, joins PRC, turning over massive US military and intelligence assets

    Rob Malda signed by Disney, epic picture planned, based upon this short. Sez Malda, "Anime's not mainstream enough anyway." [cmdrtaco.net]

    --
    All your .sig are belong to us!

  • So... (Score:4)

    by ackthpt (218170) on Tuesday June 05 2001, @06:15AM (#175663) Homepage Journal
    4av3 3v3r0n3 1n t4e w0r1d 13arn t0 typ3 l33t!

    --
    All your .sig are belong to us!

  • ASCII stupidity all over again... by Matthias Wiesmann (Score:2) Tuesday June 05 2001, @06:35AM
  • Re:Duh. by devnullkac (Score:1) Tuesday June 05 2001, @06:50AM
  • Alrighty by rabtech (Score:2) Tuesday June 05 2001, @06:37AM
  • umm by zephc (Score:1) Tuesday June 05 2001, @06:25AM
  • Re:You bring up a good point by zephc (Score:1) Tuesday June 05 2001, @06:51AM
  • Re:Is this a problem? by kenthorvath (Score:1) Tuesday June 05 2001, @07:16AM
  • side topic by deXela (Score:1) Tuesday June 05 2001, @07:41AM
  • by The Monster (227884) on Tuesday June 05 2001, @10:51AM (#175671) Homepage
    Sort of. You define a 32-bit space for now, then use something like UTF-8 to encode it.

    Personally, I think UTF-8 is just a wee bit inefficient. I worked out a scheme long ago that defines a theoretically infinite namespace, and encodes 7-bit ASCII exactly the same as it is now. If anyone cares, it's as simple as this:

    A "character" is defined as a sequence of bytes ("octets" for the RFC-phile) that
    ends with a value which has the most-significant bit clear. (If you treat the byte as signed, this means nonnegative; if unsigned, it's < 128, whichever test you'd prefer to code. I have my preference...)
    This gives 2^(7 * n) possible characters of length n:
    1. 128.
    2. 16,384, cumulative 16,512.
    3. 2,097,152, cumulative 2,113,664.
    4. 268,435,456, cumulative 270,549,120.
    5. 34,359,738,368, cumulative 34,630,287,488.
    6. 4,398,046,511,232, cumulative 4,432,676,798,720.
    7. ...
    As you can see, 3 bytes allow encoding that covers pretty much every estimate I've seen here.

    The system can be arbitrarily extended any time it's necessary, and existing agents that understand the fundamental rule would know how to parse these extended characters; although they would not know how to present the characters, they would be able to present an appropriate token indicating this fact, rather than displaying gibberish composed of the 8-bit "ascii" encoding they do understand.
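
    For anyone who wants to try the rule, here is a minimal Python sketch. The post only says where a character ends; it does not say how numeric values map onto byte sequences. The offset-based mapping below and the names count_by_length, encode and decode are therefore assumptions made for illustration, not the poster's actual scheme.

```python
def count_by_length(n):
    """Number of distinct characters that are exactly n bytes long: 2^(7*n)."""
    return 2 ** (7 * n)


def encode(code):
    """Encode a nonnegative integer; only the final byte has its top bit clear.

    Assumed mapping: values 0..127 take one byte, the next 16,384 take two, etc.
    """
    if code < 0:
        raise ValueError("code must be nonnegative")
    length, base = 1, 0
    while code >= base + count_by_length(length):
        base += count_by_length(length)
        length += 1
    value = code - base
    out = bytearray()
    for i in range(length):
        seven_bits = (value >> (7 * (length - 1 - i))) & 0x7F
        is_last = i == length - 1
        out.append(seven_bits if is_last else seven_bits | 0x80)
    return bytes(out)


def decode(data):
    """Split a byte string into characters and return their integer values."""
    codes, value, length = [], 0, 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        length += 1
        if byte < 0x80:  # top bit clear: this byte ends the character
            base = sum(count_by_length(k) for k in range(1, length))
            codes.append(base + value)
            value, length = 0, 0
    if length:
        raise ValueError("input ends in the middle of a character")
    return codes


if __name__ == "__main__":
    sample = b"Hi" + encode(170000)
    print(sample.hex(), decode(sample))
```

    Note that the decoder can split the stream into characters without knowing what any of them mean, which is the property the post relies on for forward compatibility.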

  • Re:After some skimming... by bmongar (Score:2) Tuesday June 05 2001, @06:38AM
  • Not China, Greece. by titaniumball2000 (Score:1) Tuesday June 05 2001, @07:56AM
  • Re:Wrong, wrong! by GordoSlasher (Score:1) Tuesday June 05 2001, @03:22PM
  • A tough problem... by RareHeintz (Score:2) Tuesday June 05 2001, @06:50AM
  • Not my problem by pkesel (Score:1) Tuesday June 05 2001, @10:46AM
  • Unicode's reply (Score:4)

    by roozbeh (247046) on Tuesday June 05 2001, @02:49PM (#175677) Homepage

    It's probably too late, but following is a response from one of the editors of the Unicode Standard:

    Dear Mr. Carroll,

    I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."

    Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He makes a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.

    Here are some specific comments on items in the article which are either misleading or outright false.

    Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe *any sound* the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul is not even remotely close to describing any sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).

    In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate *every single written language* on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is *not* about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.

    Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the *current* version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.
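
    For readers who want to check the codespace figure, a small Python sketch follows. It assumes the usual layout of 17 planes of 65,536 code points each, which the letter does not spell out, and reuses the character counts quoted above; the variable names are illustrative only.

```python
PLANES = 17                      # plane 0 (the BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

codespace = PLANES * CODE_POINTS_PER_PLANE
encoded_in_3_1 = 94_140          # figure quoted in the letter
claimed_need = 170_000           # figure from the article under discussion

print(f"code points in the codespace:  {codespace:,}")
print(f"share used by Unicode 3.1:     {encoded_in_3_1 / codespace:.1%}")
print(f"share 170,000 would occupy:    {claimed_need / codespace:.1%}")
```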

    *Even if* Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accommodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."

    Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard is. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:

    http://www.unicode.org/unicode/uni2book/appA.pdf [unicode.org]

    The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.

    Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main *Japanese* national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So assertions such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" is not only false but also scurrilously misrepresents the actual cooperation that took place among all the participants in the process.

    Your (Mr. Carroll's) editorial observation that "It is only when you get *all* the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have *not* been neglected.

    And your assertion that many Westerners have a "tendency .. to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are *the* major references for Classical Chinese -- the Siku Quanshu *is* the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).

    Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed"-- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires *without* taking Han unification into account. In fact, many *more* than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.

    Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two *separate* 16 bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
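
    To make the "21 bits wide" point concrete, here is a hedged Python sketch of the standard UTF-16 surrogate calculation, which shows how two 16-bit code units combine into a single code point rather than forming two separate blocks. The function name to_surrogates is made up for this illustration.

```python
def to_surrogates(code_point):
    """Return the UTF-16 surrogate pair (high, low) for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000      # 20 bits of payload
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


if __name__ == "__main__":
    # The largest code point, U+10FFFF, needs 21 bits to write down.
    print((0x10FFFF).bit_length())
    print([hex(unit) for unit in to_surrogates(0x20000)])
```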

    The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my! In fact, nothing of the kind is going on, and these are completely misleading metaphors.

    The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on by the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the *solution* to the problem, enabling worldwide interoperability, rather than obstructing it.

    And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.

    Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.

    And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".
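
    The interoperability of the three encoding forms is easy to try out. The following Python sketch uses Python's built-in codecs and a sample string chosen purely for illustration: a Latin letter, a Han character in the BMP, and U+20000, a Han character in a supplementary plane. It encodes the string three ways and checks that each form decodes back to the same characters.

```python
# Latin letter, a BMP Han character, and a supplementary-plane Han character.
text = "A\u4e2d\U00020000"

for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    encoded = text.encode(codec)
    # Each encoding form round-trips exactly the same sequence of characters.
    assert encoded.decode(codec) == text
    print(f"{codec:9} {len(encoded):2} bytes  {encoded.hex()}")
```

    Only the byte counts differ between the forms; the set of characters that can be expressed is the same.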

    In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.

    Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be a hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.

    --Ken Whistler, B.A. (Chinese), Ph.D. (Linguistics),
    Technical Director, Unicode, Inc.
    Co-Editor, The Unicode Standard, Version 3.0


  • Re:Alrighty by Pembers (Score:1) Tuesday June 05 2001, @07:23AM
  • Re:UTF-8 should be fine for almost any application by Keick (Score:1) Tuesday June 05 2001, @12:12PM
  • moderators dude! by invalid_user (Score:1) Tuesday June 05 2001, @12:41PM
  • Are you Chinese? by invalid_user (Score:1) Tuesday June 05 2001, @01:40PM
  • Re:funny by matrix29 (Score:1) Wednesday June 06 2001, @09:53AM
  • Re:Languages by tristan f. (Score:1) Tuesday June 05 2001, @04:58PM
  • Duh. by Shoten (Score:2) Tuesday June 05 2001, @06:25AM
  • Re:Is this a problem? by Hektor_Troy (Score:1) Tuesday June 05 2001, @06:37AM
  • But for Java by Husaria (Score:1) Tuesday June 05 2001, @06:24AM
  • Unicode != UCS-2 by Snowhare (Score:1) Tuesday June 05 2001, @09:45AM
  • IOW, Unicode can't do everthing. by AnotherBlackHat (Score:1) Tuesday June 05 2001, @07:22AM
  • Re:Did you even read the article ? by rst2003 (Score:1) Tuesday June 05 2001, @02:10PM
  • 4 bytes per character? by oogoody (Score:1) Tuesday June 05 2001, @06:22AM
  • Unicode Surrogates by Mumbly_Joe (Score:1) Tuesday June 05 2001, @07:12AM
  • Re:More Flamebait :) by cavemanf16 (Score:1) Tuesday June 05 2001, @11:45AM
  • Re:I had no trouble reading that at all by cryptochrome (Score:2) Thursday June 07 2001, @11:26AM
  • Pictographs suck (Score:3)

    by cryptochrome (303529) on Tuesday June 05 2001, @07:42AM (#175694) Homepage Journal

    For crying out loud, somebody tries to do something nice for somebody and they come back and accuse them of cultural chauvinism. The powers that be didn't have to develop Unicode or UCS at all. They only developed it because the proliferation of language protocols was making the internet difficult to use for foreign languages and multinational businesses in general.

    And besides which, the point of the article is moot. As this article states:

    ISO 10646 defines formally a 31-bit character set. However, of this huge code space, so far characters have been assigned only to the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that are expected to be encoded outside the 16-bit BMP belong all to rather exotic scripts (e.g., Hieroglyphs) that are only used by specialists for historic and scientific purposes. Current plans suggest that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters.

    The italics and bold are mine. The 16 bit system was not meant to be completely comprehensive - it was meant to be useful for everyday use. Which, since it covers the characters literate people are expected to know in these systems, it does. The rest of the characters are academic (literally). If these characters are so important why don't they expect all of their own countrymen to know them?

    The proprietors of the internet could have happily stuck with the regular 8-bit Roman alphabet system forever (the internet being an American military invention in the first place). The Roman alphabet was just part of the system. Hell, even a 16-bit code would have covered all script-based writing and scientific/miscellaneous notation systems easily, while leaving codes or a dedicated bit for the eastern pictograph systems to signal an extension of the protocol and letting them work out their own standard amongst themselves. It would have been fun to watch them (particularly Taiwan and China) squabble for dominance over it too. No one is forcing these eastern nations (or any non-Roman-alphabet users) to use Unicode or UCS, or the internet or computers for that matter. If they really wanted to, they could come up with their own systems based on their own languages. They just hopped on board and adapted it to their own needs like everyone else because it's a good idea, and it would be way too difficult to build around their own languages.

    But isn't it funny how every one of these eastern countries (except Japan thanks to hiragana and katakana) adapted the phonic Roman alphabet to simplify the teaching of their own languages? With at least 170,000 characters between them, defenders of these languages claim they are a rich cultural heritage and a beautiful illustrated system. You could just as easily say that modern use of these pictograph-based written languages is oppressively difficult and ensures a lot of time and effort wasted just trying to learn to write at best, and a stratifying system which guarantees high rates of illiteracy at worst. Erosion of these rigid and limited pictographic writing systems in favor of flexible and encompassing phonic ones is no accident or western conspiracy. Just as UCS was developed to make computer communication universal, the adaptation of phonic systems is the tendency to make literacy universal.

    cryptochrome

    P.S. Some may think that ISO 10646 (aka UCS-2) is not Unicode, but in fact as that same article points out "They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions."

  • by cryptochrome (303529) on Tuesday June 05 2001, @08:18AM (#175695) Homepage Journal

    The irony of that message being marked as funny (adapted as it is from Mark Twain) is that after a few seconds to adjust, I had no trouble reading that statement at all.

    We tend to forget that there have been a lot of different spelling and notation systems for English. Even today, the British and American methods aren't identical. For all the fun we make of, and the fear we have of, the idea that the English (or any other language's) orthographic system should be simplified and made consistent with pronunciation, it is not a bad idea [whowhere.com]. It would greatly simplify the process of becoming literate and save tons of effort spent trying to learn irregular spellings. Beyond that, applying the same principles to pronunciation, the alphabetic letters (children's difficulty distinguishing b and d is universal), and vocabulary would accomplish the same goals with learning and using language.

    cryptochrome

    P.S. You forgot to mention dropping that pesky capitalization system. of course half the messages on the net don't bother with it. same thing goes for dealing with contractions, a la dont, wont, ill, and so on.

  • Re:Unicode Character Set vs Character Encoding by blair1q (Score:2) Tuesday June 05 2001, @08:12AM
  • Re:Pictographic icons are not letters! by Alanus (Score:1) Tuesday June 05 2001, @07:10AM
  • Re:Alrighty by vidarh (Score:2) Tuesday June 05 2001, @06:45AM
  • Re:ASCII stupidity all over again... by vidarh (Score:2) Tuesday June 05 2001, @06:48AM
  • Re:totally unconvinced by vidarh (Score:2) Tuesday June 05 2001, @06:52AM
  • Re:2 + 1 bytes? by vidarh (Score:2) Tuesday June 05 2001, @06:57AM
  • Re:Uh, I Don't Get It by vidarh (Score:2) Tuesday June 05 2001, @07:06AM
  • Re:unicode does *not* encode 65,536 characters by vidarh (Score:2) Tuesday June 05 2001, @07:12AM
  • Re:Alrighty by vidarh (Score:2) Tuesday June 05 2001, @07:18AM
  • Re:All Character sets simultaneously?? by vidarh (Score:2) Tuesday June 05 2001, @07:21AM
  • Re:What about the artist formerly known as Prince? by vidarh (Score:2) Tuesday June 05 2001, @07:28AM
  • Re:totally unconvinced by vidarh (Score:2) Tuesday June 05 2001, @07:36AM
  • Re:Did you even read the article ? by vidarh (Score:2) Tuesday June 05 2001, @07:45AM
  • Re:Hmm.. I must have been using something else the by vidarh (Score:2) Tuesday June 05 2001, @07:50AM
  • Re:Hmm.. I must have been using something else the by vidarh (Score:2) Tuesday June 05 2001, @07:54AM
  • I've been using Unicode in various incarnations for a long time. And UCS-2 is not the only way to encode Unicode. UTF-8 is perhaps a lot more widespread, as it is the de facto standard encoding for exchange of XML documents over the web.

    UCS-4 is also quite common, and allows for the new extensions.

    UTF-16 is used by those who need to extend their UCS-2 applications, or who mostly need text that works with UCS-2 but want to be prepared for more.

    Yes, a lot of things are difficult with Unicode. But if you look at most recent internationalization efforts, unicode is what people use.
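
    As a rough illustration of the UCS-2 versus UTF-16 distinction, here is a small Python sketch. The helper name fits_in_ucs2 is invented for this example; it simply checks whether every character lies in the 16-bit range that UCS-2 can represent, and prints how many bytes the same text takes as UTF-16.

```python
def fits_in_ucs2(text):
    """True if every character is in the Basic Multilingual Plane (U+0000..U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


if __name__ == "__main__":
    for sample in ("hello", "\u4e2d\u6587", "\U00020000"):
        size = len(sample.encode("utf-16-be"))
        print(ascii(sample), fits_in_ucs2(sample), size, "bytes as UTF-16")
```

    Text that passes the check is byte-for-byte the same in UCS-2 and UTF-16, which is why extending a UCS-2 application to UTF-16 is mostly a matter of handling the characters that fail it.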

  • Re:More Flamebait :) by tb3 (Score:2) Tuesday June 05 2001, @07:00AM
  • Re:All Character sets simultaneously?? by Ubi_NL (Score:1) Tuesday June 05 2001, @06:45AM
  • Re:You bring up a good point by SpeelingChekka (Score:1) Tuesday June 05 2001, @08:24AM
  • Re:You bring up a good point by SpeelingChekka (Score:1) Tuesday June 05 2001, @12:13PM
  • Re:You bring up a good point by SpeelingChekka (Score:1) Thursday June 07 2001, @11:21AM
  • by SpeelingChekka (314128) on Tuesday June 05 2001, @08:19AM (#175717) Homepage

    Does anyone know a real language that has a simpler writing system than english?

    Spoken like a true English-is-my-home-language person. English is NOT a simple language by any means, ask any foreigner who has learned English. Almost every rule in English has several exceptions, and many things in English cannot be deduced from rules, they must simply each be learned, and there are hundreds of these. Pronunciation is ridiculous, which you've mentioned, but apart from pronunciation there are grammar, spelling, plural forms, tenses and possessive forms, and all of these have strange nuances in English. The plural of dish is dishes, but the plural of fish is fish - sorry, no rule you can deduce that from, you must just learn that. The past tense of "hang" depends on what is getting hung/hanged. The rule says "add an apostrophe s" for possessive form, but of course there are exceptions, e.g. "it" "her" etc, or when the subject is a plural already, then you add an apostrophe but no "s". And the rules for when something is a plural "are" not always clear (and thus even educated people often aren't sure whether to use "are" or "is"). "Bananas are nice" is easy, but "A bunch of bananas" or "a group of individuals", are these plural or singular? And the examples get more and more complex. And there are obscure rules such as "their" may be used in place of "his/her". And there are so many exceptions to rules like "i before e except after c", rules which even many educated people sometimes struggle to remember. I can name many university-educated adults with English as their first language who still don't even know the difference between "lend" and "borrow" - that says something about the language.

    I'm glad English is my home language, but I feel sorry for foreigners who have to learn English as a second language.

    Is it just a coincidence that the simplest writing system was the first to be digitized

    Yes, actually, it is. ASCII was probably the first wide-scale character set standard used in computing - what does the "A" stand for?

  • Re:You bring up a good point by linca (Score:1) Tuesday June 05 2001, @07:15AM
  • Re:You bring up a good point by linca (Score:1) Tuesday June 05 2001, @08:01AM
  • 16-bit Should Be Enough. by robbyjo (Score:2) Tuesday June 05 2001, @07:17AM
  • Re:This article is stupid by dot11 (Score:1) Wednesday June 06 2001, @07:43AM
  • Re:technical critique by dot11 (Score:1) Wednesday June 06 2001, @08:22AM
  • The reason why Microsoft don't like unicode: by DavidJA (Score:1) Tuesday June 05 2001, @12:52PM
  • More Flamebait :) (Score:3)

    by bark76 (410275) on Tuesday June 05 2001, @06:40AM (#175724) Homepage
    Maybe if people didn't try to get character sets like Klingon [dkuug.dk], Cirth and Tengwar [unicode.org] added into unicode we wouldn't have this problem!
  • Danny Boy? by Flying Headless Goku (Score:1) Tuesday June 05 2001, @01:17PM
  • Re:umm by Computer! (Score:1) Tuesday June 05 2001, @10:52AM
  • Re:Cultural Heritage is important! by Computer! (Score:1) Tuesday June 05 2001, @11:06AM
  • Re:Perl in Hierogliphics by Magumbo (Score:2) Tuesday June 05 2001, @08:39AM
  • by Magumbo (414471) on Tuesday June 05 2001, @06:32AM (#175729)
    And we must not forget about hieroglyphics. Unicode certainly has forgotten about them. It would be so cool to write perl code with little cats, birds, ankhs, and various other squiggles.

  • Re:binary by thinkit (Score:1) Tuesday June 05 2001, @07:19PM
  • Re:Unicode Character Set vs Character Encoding by ek_adam (Score:1) Tuesday June 05 2001, @12:43PM
  • Re:You bring up a good point by 21mhz (Score:1) Thursday June 07 2001, @12:57AM
  • (correction to reply to AC) by haruharaharu (Score:1) Tuesday June 05 2001, @09:08AM
  • Re:Alrighty by haruharaharu (Score:1) Tuesday June 05 2001, @09:10AM
  • Re:Well DUH! It's not meant to have every characte by haruharaharu (Score:2) Tuesday June 05 2001, @09:22AM
  • I'll take that challenge by MarkusQ (Score:1) Tuesday June 05 2001, @07:08AM
  • It allready works by Greedy (Score:1) Wednesday June 06 2001, @05:54PM
  • Yes, it is by absurd_spork (Score:1) Tuesday June 05 2001, @06:51AM
  • You don't really KNOW about unicode, do you? by absurd_spork (Score:2) Tuesday June 05 2001, @07:02AM
  • Workaround by oliveloaf (Score:1) Tuesday June 05 2001, @06:56AM
  • Why get all upset about it? by m08593 (Score:1) Tuesday June 05 2001, @11:30AM
  • you can't make the Chinese happy anyway by m08593 (Score:1) Tuesday June 05 2001, @12:59PM
  • Re:Why get all upset about it? by m08593 (Score:1) Tuesday June 05 2001, @06:32PM
  • Re: No, _n_ bytes per character! by Kearwood (Score:1) Tuesday June 05 2001, @07:24PM
  • Re:You bring up a good point by Epicure (Score:1) Wednesday June 06 2001, @12:10AM
  • Re: Is this a problem by slashtop (Score:1) Tuesday June 05 2001, @06:14PM
  • You are SO naive... by brendano (Score:1) Tuesday June 05 2001, @05:37PM
  • Re:Is this really such a problem? by trash eighty (Score:1) Tuesday June 05 2001, @06:44AM
  • Re:totally unconvinced by trash eighty (Score:1) Tuesday June 05 2001, @06:47AM
  • Re:Yes, it is by trash eighty (Score:1) Tuesday June 05 2001, @10:08AM
  • Re:Quit whining and move to a phonetic alphabet by trash eighty (Score:1) Tuesday June 05 2001, @10:16AM
  • Re:totally unconvinced by trash eighty (Score:1) Tuesday June 05 2001, @10:19AM
  • Re:Is this a problem? by trash eighty (Score:2) Tuesday June 05 2001, @06:39AM
  • Re:Is this a problem? by dl_j10 (Score:1) Tuesday June 05 2001, @09:01AM
  • Use Chinese for English Data compression! by eknuds (Score:2) Tuesday June 05 2001, @09:14AM
  • Re:Is this a problem? by ruzulo (Score:1) Tuesday June 05 2001, @12:34PM
  • Re:Nonsense by kwhistler (Score:1) Thursday June 07 2001, @01:47PM
  • Re:Unicode's reply by kwhistler (Score:1) Thursday June 07 2001, @04:11PM
  • Conspiracy Theories and Unicode by kwhistler (Score:1) Thursday June 07 2001, @04:49PM
  • Unicode closed to participation? by kwhistler (Score:1) Saturday June 09 2001, @12:22PM
  • Statelessness of text by kwhistler (Score:1) Saturday June 09 2001, @12:37PM
  • Re:Conspiracy Theories and Unicode by kwhistler (Score:1) Saturday June 09 2001, @01:04PM
  • Unicode character allocation (was Unicode's reply) by Mokurai (Score:1) Thursday June 07 2001, @03:19PM