Slashdot Log In
Why Unicode Won't Work on the Internet
Posted by
Hemos
on Tue Jun 05, 2001 10:11 AM
from the -Linguistic,-Political,-and-Technical-Limitations dept.
from the -Linguistic,-Political,-and-Technical-Limitations dept.
We reeived this interesting submission from N. Carroll: "Unicode, the commercial equivalent of UCS-2 (ISO 10646-1) , has been widely
assumed to be a comprehensive solution for electronically mapping all the
characters of the world's languages, being a 16-bit character definition
allowing a theoretical total of over 65,000 characters. However, the
complete character sets of the world add up to approximately 170,000
characters. This paper summarizes the political turmoil and technical
incompatibilities that are beginning to manifest themselves on the Internet
as a consequence of that oversight. (For the more technical: the recently
announced Unicode 3.1 won't work either.)" Read the full article.
This discussion has been archived.
No new comments can be posted.
Why Unicode Won't Work on the Internet WORKING ON
|
Log In/Create an Account
| Top
| 416 comments
(Spill at 50!) | Index Only
| Search Discussion
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Unicode Character Set vs Character Encoding (Score:5)
ISO 10646, the Universal Character Set defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000. Unicode 3.1 which was recently released goes a bit further.
UCS-2, as mentioned by this article, is the same as UTF-16 and is severely limited by it's 16 bit implementation. UTF-16 is unfortunately used by Windows and Java, but is rarely used on the web. The article claims UTF-16 can only map 65,000 characters, but using surrogate pairs can actually map over 1 million characters.
Thankfully, there are several other encoding methods for Unicode. UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF (all 2^31 character codes). It also has a nice feature of the lower 7 bits being ASCII, so there is no conversion necessary from ASCII to UTF-8.
UTF-32 or UCS-4 is a 32 bit character encoding used by a number of Unix systems. It's not exactly the most space efficient form (UTF-8 requires roughly 1.1 bytes per character for most Latin languages), but it can handle the entire Unicode character set.
A good document on this is available at UTF-8 And Unicode FAQ [cam.ac.uk]
Unicode includes all common Asian character sets (Score:3)
Unicode does not contain all characters ever used, for example it does not contain the Nordic runes. These are not used today except by scolars, who will need special software (most likely using the "reserved to the user" part of Unicode). The same is true for many ancient Asian characters.
Use UTF-8, don't worry about sizes (Score:3)
The only issue is that there's not a good way to set a mask for the characters such that 0-127 (which take up a single byte) are the common characters for the language, and so on, so English is more compact than other languages, even languages which don't require more characters.
Re:UTF-8 should be fine for almost any application (Score:5)
I might add a few things:
In UTF-8 not just NULL or Escape are not in the multibyte characters, in face *all* 7-bit characters are not in the multibyte characters (the multibytes have the high bit set in all bytes). This means that *any* program that treats all bytes with the high bit set as a "letter" will work and can parse, hash, match, search, etc identifiers/words with foreign letters in them!
In addition the UTF-8 encoding is just heavy enough that random line noise is very unlikely to match a UTF-8 encoding. If programs treat "illegal" UTF-8 encodings as individual bytes in the ISO-8859-1 character set, it will display virtually all existing ASCII/ISO-8859-1 documents unchanged!
The end result is that it should be easy to switch all interfaces (not just over the network, but inside programs and to libraries) to UTF-8. This will vastly simplify the handling of Unicode because there will be no need for ASCII back compatability interfaces. We could also eliminate all the "locale" crap and make ctype.h the simple thing it once was.
Even Arabic will encode smaller in UTF-8 than UTF-16. This is due to the fact that very common characters (not just English, but things like space and newline) are only one byte.
Re:You bring up a good point (Score:3)
Hangul outdoes the latin alphabet in several ways. For one, as you mention, pronunciation in English is difficult, while in Hangul it is almost completely unambiguous. Each phoneme maps to one character, and vice-versa. There is no confusion over whether to write "cat" or "kat", for example. Only one letter has the "k" sound.
Each Hangul character is a pictogram describing the position of the tongue, palate, and lips to use when pronouncing it. Whereas most phonetic alphabets consist of ideograms recycled as phonetic symbols, Hangul seems to be the only one to consist of symbols constructed purely for phonetic meaning.
Since the job of a phonetic alphabet is only to represent phonemes, I would say that this alphabet does the job better than latin.
Some errors (Score:5)
In actual fact, Katakana is much more angular than Hiragana - definitely not "fluid" in appearance. Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.
Thus is can be said that Hiragana can form pictures but Katakana can only form sounds...
That should probably read "Kanji can form pictures but Hiragana/Katakana can only form sounds..."
Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism.
Bollocks. Romaji is hardly ever used (except for advertisements, and then only rarely, or textbooks for foreigners). It's definitely not the main conduit for Western ideas.
After a time these words (even though they will still maintain their "Roman" form for awhile longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "PersaCom" in Japan.
Again, this is incorrect. Words don't *have* a Roman form in everyday use; sure, you can express them in Romaji but no-one ever does. As for "personal computer", the correct Romanization is 'pasokon', not 'PersaCom". (Where did he get that from?!)
The rest of the 1,950 have to been memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.
Ummm... that's actually not too hard. I (along with everyone else at my language school) memorized more than 1300 Kanji in less than a year... and none of us were Japanese. I know it must seem like an impossible total to people used to ASCII, but there are many common points between Kanji that simplify the learning process greatly.
That said, I've long been against the current Unicode "standard", as have many technical people in Japan, for a number of reasons. Some of those are:
- No standard conversion tables from existing character sets (SJIS, EUC-JP, ISO-2022-JP).
Several conversion tables do exist, but there are minor differences between them that make it impossible to go from, say, SJIS to Unicode and back to SJIS without the possiblity of changing the characters used.
- A draconian unification of CJK characters.
The Unicode Consortium basically forced the standards bodies in China, Japan and Korea to unify certain similar Kanji onto single code points, which doesn't allow for cases where, say, Japanese actually has two or three distinctive writings that are used in different situations.
- The ugly "extensions".
Unicode has been effectively ruined as a method of data exchange by its treatment of characters not in the 60,000-character basic standard.
I could go on, but I should get some sleep...
Re:You bring up a good point (Score:5)
A good source for your obscure questions is, as always, the Straight Dope, which answers the "Chinese Typewriter" question here [straightdope.com].
Regards,
gleam
Re:UTF-8 should be fine for almost any application (Score:3)
Of course, even if you could get China, Taiwan, Japan and Korea to agree on a unified character encoding similar to the ISO-Roman character set (where identical or analogous characters in the different alphabets shared the same character code) you would still need more than 50,000 encodings just for the unified asian character set.
I can see good reasons why language using similar alphabets should have overlapping encodings, but this is probably better solved by providing translation tables between related alphabets than by forcing multiple alphabets to share a single encoding. While I may be able to write the french coup de grâce in the english alphabet as coup de grace something has clearly been lost. Other europen languages are even worse, even those that nominally use the roman alphabet! Then there are questions of alphabetization between differnt languages and the questions of whether or not accented letters correspond to each other or to the unaccented letter.
Call me a purist, but I think it is actaully much easier if we just had distinct representations for each language and had to perform some kind of mapping to display one language in another language's alphabet.
Re:After some skimming... (Score:3)
You want the ability to search through some insanely large character set, so to do so you're willing to force everyone else to make their communications much less efficient just so you can have a free ride.
You know, it's not a coincidence that the western world (using small variations on the roman character set) pretty well invented modern technology. It's only about a thousand times easier to process a smaller and simpler alphabet.
There's a reason we don't use prose to command computers, until all cheap desktop models come with the ability to understand natural language a stripped down and unambiguous command-set will be more efficient.
I've got a lot of characters I'd find handy if we were to implement a new standard, and I'd want to expand into basic pictograms (standard symbols, etc) as well. Now I realize this isn't interesting to other people, so I'm not going to jump up and down and shout "Racist" just because people aren't anxious to bloat a new standard just to appease me. If I want those features I'll make my own font and make it available with any works that I produce which would require it.
In short, grow up, the world does *not* own you anything. If you want it, do it yourself instead of crying when someone else doesn't.
Quit whining and move to a phonetic alphabet (Score:3)
But like companies who still maintain their legacy software written in Cobol and who knows what else, countries and cultures hold onto their legacy alphabets, despite all their disadvantages, and despite all the moaning and groaning about education, literacy, and how hard it is to type 10,000 characters on a 100-key keyboard.
I agree there is a serious problem of understanding texts written in the "old way". There is a simple solution here, too, i.e., we just translate what's most important to the "new way" and let scholars work on the texts that don't get translated. Before anyone gets too hot here, the situation is not that much different than translating literature from one language to another. It is too much work to translate everything that is written in English into French, so one focuses on the texts that are important enough for translation.
Also, English has a lot of problems here, as it is mostly phonetic, but a large percentage is not, large enough to make learning English a lot more difficult than say learning Spanish.
I realize this is way too utopian. We Americans can't even move to metric, much less anything more "radical". I just needed to respond to the whining.
A Plan for the Improvement of English Spelling (Score:4)
Flamebait :) (Score:3)
too sinocentric, but Unicode has problems (Score:4)
The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.
The history of encodings is roughly:
1. There was chaos.
2. Then there was ASCII (the roman alphabet) pleasing to latin and english speakers.
3. Then there were all the ISO 8859 and ISO 2022 encodings. These let all the european languages mix together with ASCII.
4. Then Japan, Korea, and Vietnam define their own ISO 2022 encodings that make sense in the local language, and let these languages mix together with the european languages and ASCII.
5. But ISO 2022 is a complex patchwork of special cases. So at the same time the Asians were inventing their ISO 2022 solutions, Unicode was being invented.
Unicode 1.0 provided a viable solution to modern european languages, but could not encode historical documents or asian languages properly. The Unicode 2.0 effort fixed the historical european language problem by adding in the alphabets for these "dead" languages. Unicode 2.0 brought the asian encodings to the point where they were usable.
Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.
Meanwhile China has a unique problem. They do not have an agreed alphabet. The Japanese all around the world agree on what characters define Kanji. There may be different fonts, but there is one agreed alphabet. Similarly, the Koreans and the Vietnamese have one agreed alphabet. These alphabets are huge, with thousands of characters, but they are fixed and agreed worldwide.
China has not agreed on an alphabet. Different regions use different alphabets. Chinese speak numerous different languages and have invented an amazing alphabet that works as a single writing form for all those languages. But there are disagreements. Furthermore, some regions of China are still inventing new letters for the alphabet. It is not a fixed and stable thing like european alphabets. You can invent new letters. (These really are new letters, not just new fonts.)
The Chinese have invented many encodings as a result. The two most popular (Big5 and GB2312) are not ISO 2022 compatible. There is a new, less widely used encoding that is a superset encoding of BIG5, GB2312, and other encodings, and that is ISO 2022 compatible.
Unicode did not accept the approach of leaving all these alphabets as different. They share most of their glyphs. Giving each region and language its own complete section would have blown the 50K limit of Unicode 2.0. They smushed all these different alphabets into one blob by combining anything that had similar glyphs into one character.
This left Unicode 2.0 telling the Chinese, ignore all those letters we don't like. You don't use them much anyhow. It destroyed any notion of alphabetic order in the encodings for any asian language. And it is usable for modern text communication. Unicode 3.0 promises to do better, and probably will.
But since all these languages can use the ISO 2022 encodings with fully compatable mixture of languages, why not just use ISO 2022 and forget Unicode? The problem is the patchwork nature of ISO 2022. The encoding rules are complex. ISO 2022 is a terrible internal format. A chinese character may take from 2 to 9 bytes to encode. And it gets worse as you dig further. UCS-2 and UCS-4 are very nice friendly internal formats for computers. It is trivial to convert from UCS-2 or UCS-4 into UTF-8 for transmission.
It is also pretty simple to translate from UCS-2 or UCS-4 into ISO 2022 encodings. So the ISO 2022 encodings actually can make sense for network transmission.
These issues will just get worse as you include other languages, like historical chinese, chinese border languages, and south asian languages. As with chinese, some of these have the fundamentally hard problem that they do not agree on a single alphabet.
Wrong, wrong! (Score:4)
It is a shame that there are so many different Unicode encodings. I think we ought to just standardize on UTF-8.
unicode does *not* encode 65,536 characters (Score:4)
Now, a range of Unicode characters is set aside for so-called "surrogates", and a high surrogate and a low surrogate character placed next to one another form a "surrogate pair" which specifies an extended character in UCS Plane 1. None of UCS Plane 1 codepoints are actually assigned to anything yet, but since there are about 2^20 (~one million) Plane 1 codepoints, they will easily handle all remaining glyphs with a ton left over. Tengwar, Klingon and others have all been considered for Plane 1 encoding (although I just checked and Klingon has been rejected. Sorry folks).
So, the simple fact is that anyone who says Unicode can't support enough characters has been smoking a bit too much crack lately. Do yourself a favor and go read the spec before getting your panties in a twist.
Unicode has this covered. (Score:3)
Had this researcher bothered to read the Unicode technical introduction, the following would have been obvious.
In all, the Unicode Standard, Version 3.0 provides codes for 49,194 characters from the world's alphabets, ideograph sets, and symbol collections. These all fit into the first 64K characters, an area of the codespace that is called basic multilingual plane, or BMP for short.
There are about 8,000 unused code points for future expansion in the BMP, plus provision for another 917,476 supplementary code points. Approximately 46,000 characters are slated to be added to the Unicode Standard in upcoming versions.
The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.
Plenty of room.
Cheers,
Toby Haynes
Misconceptions in article (Score:3)
As a preliminary, Unicode and ISO 10646 aren't the same standard, but are kept pretty much in synchronisation. ISO 10646 provides a character set with a 4-byte representation, and a compatible smaller set with a 2-byte representation. These representations have encodings such as UTF-8, UTF-16, and UTF-32. UTF-32 encodes every Unicode character in 32 bits and can represent the full 2^31 codepoints, while UTF-8 and UTF-16 as described in the Unicode 3.1 document [unicode.org] are variable length representations that can represent approximately 2,100,000 and 1,100,000 codepoints respectively.
One of the design principles was to provide a lossless representation of any currently used character set in Unicode, so that a round-trip re-encoding of text from one encoding to Unicode and back again would lose no information. Another was to keep distinct code-points for any characters that had different semantics, or different 'abstract shapes'.
It turns out that one can satisfy these requirements for the Japanese kanji, Chinese hanzi (traditional and simplified) and Korean hanja without requiring a seperate code-point for each; in Unicode version 2.0, approximately 121,000 such characters were able to be represented in 20,902 code points. Note that those characters which have distinct shapes but the same meaning, and those which are similar enough to be classified as calligraphic variants but have distinct meanings, are all represented by distinct code-points. (One caveat: in practice there are some exceptions as regards the preservation of information after a round-trip encoding to Unicode and back. For example, the CCCII encoding of hanzi explicitly catalogues calligraphic variations, and as such doesn't map 1-1 onto Unicode.)
Of course, the actual glyph that corresponds to one of these unified codes will change depending upon the context in which it is rendered. For example the character 0x6d77 corresponding to the character for sea in both Chinese (Mandarin 'hai3') and Japanese ('umi') is drawn with one fewer stroke in Japanese than in Chinese. These typographical details are important, but can (and debatably, should) be dealt with outside the context of character encoding. Unicode has support for language tags which in the absence of any higher-level information can indicate the language context of the characters following them. Typically though, this information should be stored as part of a richer document structure (as is possible in XML for example.) Correct display of characters will require the presence of the appropriate font and a mechanism (such as LOCALE in a simple one language case) for selecting this font.
Given this unification then, one really can fit most of the characters for which there already extant (non-Unicode) encodings into 16 bits. With Unicode 3.1/ISO 10646-2 (which uses more than 65536 codepoints) this representation is AFAIK pretty much complete, including for example all of the hanzi of CNS 11643-1992 and CNS 11643-1986 plane 15 (the most complete hanzi encoding outside of CCCII.)
With this in mind, one can argue against the points raised in the article:
A little bit of research by the article author would have made the article unnecessary.
References:
Unicode 3.1 document;
CJKV Information Processing, Ken Lunde.
PS: In the time it took me to read the article, do some research and write this response, there have been over 300 slashdot comments. Wow.
Re:Wrong, wrong! (Score:3)
An implication of this is that UTF-8, UTF-16, and UTF-32 can all express the EXACT SAME NUMBER OF CHARACTER CODES. So, if you think UTF-32 is good enough for you, then UTF-16 and UTF-8 are just as good. The latter two simply use multi-word or multi-byte sequences to express the upper character values.
After using BeOS for a number of years, where all character strings are natively handled as UTF-8, I am a very strong believer in Unicode. Having a Western perspective I may be missing something, but none of the "problems" mentioned in this article are actually problems that Unicode has.
Of course, once you start using Unicode, the main problem you are going to run in to is having fonts with the characters you need. And if the Chinese, Japenese, etc. really need 50,000 of their very own characters, then this is going to be that much more of a problem. Unforunately, there is no easy solution to this -- but it doesn't have anything to do with the encoding you use, so changing to another encoding is not going to help here.
Unicode and CJK Characters (Score:3)
The other reasons are more subtle, and I'm not sure that everyone here understands what's going on with CJK characters, so here's a little background.
The characters we're talking about originated in china, and spread to Korea, Vietnam, and Japan. Vietnam has switched to a western alphabet now, so let's leave them out. ;) At one point, although there have always been alternative forms for some characters, there was a reasonably standard set of Chinese characters used throughout these countries (recorded in the KangXi dictionary)...
The Japanese invented a number of their own characters, which I'm sure number less than 1000. Up until World War II, this was basically the situation. (So at this time, the required number of characters to encode would have been less than 50,000 -- Chinese characters and Japanese additions.) Then all hell broke loose, so to speak.
The Japanese simplified a large number of their characters systematically, immediately following WWII ( So they started substituting simpler characters for the disallowed ones in these compounds, and thereby subtly changed the meaning of the words.
On to China -- they also began a campaign of character simplification, which would span quite a few years, although theirs was much more radical than the Japanese approach. In fact, some of the simplified versions the government came out with were so repulsive, they were eventually retracted because everyone refused to use them. ;) So they ended up with a few thousand (
Finally, Korea, Taiwan, and Hong-Kong basically kept the traditional chinese characters.
So, that gives us the basic 40,000, plus 3000 Japanese (kokuji and shinjitai), plus maybe 10,000 chinese (jiantizi), plus some other stuff not mentioned here, giving a grand estimate of around 55,000.
The key to this is that the vast majority of characters used are common among all 5 locales. This was the only reason that anyone even attempted to encode the CJK characters in the first place. The re-unification of all the disparate character sets was called Han-Unification during the Unicode development process.
This, combined with the surrogate encoding area, ensures that there will be plenty of space for everyone... :)
Cultural Heritage is important! (Score:3)
You *have* read Beowulf, and the Canterbury Tales, haven't you? Along with Plato's Republic in Greek, and the Dead Sea scrolls?
Now imagine how hard this would be if your computer didn't support the full character set in which they were written.
Overstating and misunderstanding the problem (Score:4)
This is exactly the issue with the Chinese characters. For a given character, there might be a difference between the Taiwanese way of writing it, the Japanese way, and the mainland Chinese way; but the character is still recognized as being the same, despite these presentation-level differences.
For someone to demand that each national presentation form have its own character code is to misunderstand what Unicode is designed for. It encodes abstract characters, not presentation forms. Unicode does not have separate codes for "A" in Garamond and "A" in Helvetica.
Solution - Everybody use Euro-English! (Score:4)
In the first year, "s" will replace the soft "c". Sertainly, this will make the sivil servants jump with joy. The hard "c" will be dropped in favour of the"k". This should klear up konfusion and keyboards kan have 1 less letter.
There will be growing publik enthusiasm in the sekond year, when the troublesome "ph" will be replaced with "f". This will make words like "fotograf" 20% shorter.
In the 3rd year, publik akseptanse of the new spelling kan be ekspekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of the silent "e"s in the language is disgraseful, and they should go away.
By the fourth year, peopl wil be reseptiv to steps such as replasing "th" with "z" and "w" with "v". During ze fifz year, ze unesesary "o" kan be dropd from vords kontaining "ou" and similar changes vud of kors be aplid to ozer kombinations of leters.
After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!
ISO-2022-JP and "alphabetical order" (Score:4)
>>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.
I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.
While it is true that most all kanji have multiple pronunciations, the kanji in ISO-2022-JP are most definitely in order. Level 1 characters (0x3021-0x4F7E) are ordered by their primary reading, and Level 2 characters (0x5021-0x7426?) are ordered first by radical and then by number of strokes. In both cases it's easy to locate a character if for some reason you can't type it normally (e.g. it's not in your IME dictionary)--I've had to do this on occasion, in fact.
Unicode is, for all intents and purposes, completely random. Even without the problems of characters being inappropriately merged, there is no way you could try and find a character in Unicode; if your dictionary doesn't have it, tough luck. To me, that's an even scarier concept: for all practical purposes it could eliminate characters from the language. After all, if nobody can type it who's going to use it?
Have you ever tried to program in shift-JIS? It is horrific.
I will agree with this. Leaving aside the original poster's confusion of ISO-2022-JP and shi[f]t-JIS (the former is the official standard, aka JIS, while the latter is a poorly-thought-out Microsoft hack), dealing with strings that contain both half-width (1-byte) and full-width (2-byte) characters is a major PITA. About the only thing that can be said for it is the number of bytes is equal to the number of half-width character positions needed; and even that only applies to EUC and SJIS, since JIS has escape sequences to squeeze everything into 7-bit characters.
On the other hand, there's the character order consideration, which along with the problem of merged characters seems to be what draws so much dislike for Unicode from Japanese.
--
BACKNEXTFINISHCANCEL
After some skimming... (Score:4)
Wieger's seminal book about the characters and construction of China, published in 1915, was to become the defacto source against which all others would (and still should) be compared - with several caveats. Amongst these is a noticeable bias on his part against Taoism which becomes more evident in his analysis of the Tao Tsang (i.e., Taoist Canon of Official Writings [written 'DaoZang' in the PinYin Romanization of Mainland China] )
and I decided to skim the rest.
To summarize, for those whose eyes completely glazed over, his point is that Unicode doesn't sufficiently cover the full range of Chinese characters and that not using a larger set is a result of a longstanding Western prejudice that the Chinese don't need so many characters.
Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.
Unsettling MOTD at my ISP.
In other news... (Score:3)
Bush bolts GOP to join Democrats, fires entire Whitehouse staff
Linus Torvalds to join Microsoft as OfficeXP advocate
NASA on Moonshots, "Ok, ok, they were all actually faked on a soundstage in Toledo, Ohio and the ISS is really in a warehouse in Newark, New Jersey"
Oracle CEO, Larry Ellison to give fortune to charity, dumps japanese kimonos for Dockers and GAP T-shirts
RIAA to drop all charges against Napster, "All a big fsck-up, we'll all get rich together"
Taiwan throws in towel, joins PRC, turning over massive US military and intelligence assets
Rob Malda signed by Disney, epic picture planned, based upon this short. Sez Malda, "Anime's not mainstream enough anyway." [cmdrtaco.net]
-- .sig are belong to us!
All your
So... (Score:4)
-- .sig are belong to us!
All your
No, _n_ bytes per character! (Score:3)
Personally, I think UTF-8 is just a wee bit inefficient. I worked out a scheme long ago that defines a theoretically infinite namespace, and encodes 7-bit ASCII exactly the same as it is now. If anyone cares, it's as simple as this:
This gives 2^(7 * n)possible characters of length n:- 128.
- 16,384, cumulative 16,512.
- 2,097,152, cumulative 2,113,664.
- 268,435,456, cumulative 270,549,120.
- 34,359,738,368, cumulative 34,630,287,488.
- 4,398,046,511,232, cumulative 4,432,676,798,720.
...
As you can see, 3 bytes allow encoding that covers pretty much every estimate I've seen here.The system can be arbitrarily extended any time it's necessary, and existing agents that understand the fundamental rule would know how to parse these extended characters; although they would not know how to present the characters, they would be able to present an appropriate token indicating this fact, rather than displaying gibberish composed of the 8-bit "ascii" encoding they do understand.
Unicode's reply (Score:4)
It's probably too late, but following is a reponse from on of the editors of the Unicode Standard:
Dear Mr. Carroll,
I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."
Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He make a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.
Here are some specific comments on items in the article which are either misleading or outright false.
Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe *any sound* the human throat and mouth is capable of producing in speech, ..." This is false. The
Hangul system was closely tied to the Old Korean sound
system. It has a rather small number of primitives for
consonants and vowels, and then mechanisms for combining them
into consonantal and vocalic nuclei clusters and then into
syllables. However, the inventory of sounds represented by
the Jamo pieces of the Hangul are not even remotely close to
describing any sound of human speech. Hangul is not and never
was a rival for IPA (the International Phonetic Alphabet).
In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate *every single written language* on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is *not* about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.
Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the *current* version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.
*Even if* Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accomodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."
Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard is. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:
http://www.unicode.org/unicode/uni2book/appA.pdf [unicode.org]
The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.
Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main *Japanese* national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So assertions such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" is not only false but also scurrilously misrepresents the actual cooperation that took place among all the participants in the process.
Your (Mr. Carroll's) editorial observation that "It is only when you get *all* the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have *not* been neglected.
And your assertion that many Westerners have a "tendency ..
to dismiss older Oriental characters as 'classic,'" is
also a fantasy that has nothing to do with the reality
of the encoding in the Unicode Standard. If you would
bother to refer to the documentation for the Unicode
Standard, Version 3.1, you would find that among the
sources exhaustively consulted for inclusion in the
Unicode Standard are the KangXi dictionary (cited by
Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai,
the Chinese Encyclopedia, and the Siku Quanshu. Those are
*the* major references for Classical Chinese --
the Siku Quanshu *is* the Classical canon, a massive
collection of Classical Chinese works which is now
available on CDROM using Unicode. In fact, the company
making it available is led by the same man who represents
the Chinese national standards body for character encoding
and who chairs the Ideographic Rapporteur Group (the
international group that assists the ISO working group
in preparing the Han character encoding for 10646
and the Unicode Standard).
Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed"-- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires *without* taking Han unification into account. In fact, many *more* than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.
Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two *separate* 16 bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my!
In fact, nothing of the kind is going on, and these are
completely misleading metaphors.
The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on my the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the *solution* to the problem, enabling worldwide interoperability, rather than obstructing it.
And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.
Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".
In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.
Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.
--Ken Whistler, B.A. (Chinese), Ph.D. (Linguistics),
Technical Director, Unicode, Inc.
Co-Editor, The Unicode Standard, Version 3.0
--
Pictographs suck (Score:3)
For crying out loud, somebody tries and do something nice for somebody and they come back and accuse them of cultural chauvanism. The powers that be didn't have to develop unicode or UCF at all. They only developed it because of the proliferation of language protocols was making the internet difficult to use for foreign languages and multinational businesses in general.
And besides which, the point of the article is moot. As this article states:
ISO 10646 defines formally a 31-bit character set. However, of this huge code space, so far characters have been assigned only to the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that are expected to be encoded outside the 16-bit BMP belong all to rather exotic scripts (e.g., Hieroglyphs) that are only used by specialists for historic and scientific purposes. Current plans suggest that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters.
The italics and bold are mine. The 16 bit system was not meant to be completely comprehensive - it was meant to be useful for everyday use. Which, since it covers the characters literate people are expected to know in these systems, it does. The rest of the characters are academic (literally). If these characters are so important why don't they expect all of their own countrymen to know them?
The proprietors of the internet could have happily stuck with the regular 8-bit Roman alphabet system forever (the internet being an American military invention in the first place). The roman alphabet was just part of the system. Hell, even a 16-bit code would have covered all script-based writing and scientific/miscellaneous notation systems easily, while leaving codes or a dedicated bit for the eastern pictograph systems to signal an extension of the protocol and letting them work out their own standard amongst themselves. It would have been fun to watch them (particularly Taiwan and China) squabble for dominance over it too. No one is forcing these eastern nations (or any non-roman-alphabet users) to use unicode or UCF, or the internet or computers for that matter. If they really wanted to, they could come up with their own systems based on their own languages. They just hopped on board and adapted it to their own needs like everyone else because it's a good idea, and it would be way to difficult to build around their own languages. But isn't it funny how every one of these eastern countries (except Japan thanks to hiragana and katakana) adapted the phonic roman alphabet to simplify the teaching of their own languages? With at least 170,000 characters between them, defenders of these languages claim they are a rich cultural heritage and a beautiful illustrated system. You could just as easily say that modern use of these pictograph-based written languages are oppressively difficult and ensure a lot of time and effort wasted just trying to learn to write at best, and a stratifying system which guarantees high rates of illiteracy at worst. Erosion of these rigid and limited pictographic writing systems in favor of flexible and encompassing phonic ones is no accident or western conspiracy. Just as UCF was developed to make computer communication universal, the adaptation of phonic systems is the tendency to make literacy universal.
cryptochrome
P.S. Some may think that ISO 10646 (aka UCF-2) is not Unicode, but in fact as that same article points out "They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions. "
I had no trouble reading that at all (Score:3)
The irony of that message being marked as funny(adapted as it is from Mark Twain) is that after a few seconds to adjust, I had no trouble reading that statement at all.
We tend to forget that there have been a lot of different spelling and notation systems for english. Even today, the british and american methods aren't identical. For all the fun we make and fear we have of the idea that the english (or any other language's) orthographic system should be simplified and made consistent with pronunciation, it is not a bad idea [whowhere.com]. It would greatly simplify the process of becoming literate and save tons of effort spent trying to learn irregular spellings. Beyond that, applying the same principles to pronunciation, the alphabetic letters (children's difficulty distinguishing b and d is universal), and vocabulary would accomplish the same goals with learning and using language.
cryptochrome
P.S. You forgot to mention dropping that pesky capitalization system. of course half the messages on the net don't both with it. same thing goes for dealing with contractions, a la dont, wont, ill, and so on.
Hmm.. I must have been using something else then? (Score:4)
UCS-4 is also quite common, and allows for the new extensions.
UTF-16 is used by some that needs to extend their UCS-2 applications to UTF-16, or that mostly need text that work with UCS-2, but wants to be prepared for more.
Yes, a lot of things are difficult with Unicode. But if you look at most recent internationalization efforts, unicode is what people use.
Re:You bring up a good point (Score:4)
Does anyone know a a real language that has a simpler writing system than english?
Spoken like a true English-is-my-home-language person. English is NOT a simple language by any means, ask any foreigner who has learned English. Almost every rule in English has several exceptions, and many things in English cannot be deduced from rules, they must simply each be learned, and there are hundreds of these. Pronunciation is ridiculous, which you've mentioned, but apart from pronunciation is grammar, spelling, plural forms, tenses and possessive forms, all of these have strange nuances in English. The plural of dish is dishes, but the plural of fish is fish - sorry, no rule you can deduce that from, you must just learn that. The past tense of "hang" depends on what is getting hung/hanged. The rule says "add an apostrophe s" for possessive form, but of course there are exceptions, e.g. "it" "her" etc, or when the subject is a plural already, then you add an apostrophe but no "s". And the rules for when something is a plural "are" not always clear (and thus even educated people often aren't sure whether to use "are" or "is"). "Bananas are nice" is easy, but "A bunch of bananas" or "a group of individuals", are these plural or singular? And the examples get more and more complex. And there are obscure rules such as '"their" may be used in place of "his/her". And there are so many exceptions to rules like "i before e except after c", rules which many educated people even sometimes struggle to remember. I can name many University educated adults with English as their first language who still don't even know the difference between "lend" and "borrow" - that says something about the language.
I'm glad English is my home language, but I feel sorry for foreigners who have to learn English as a second language.
Is it just a coincidence that the simplest writing system was the first to be digitized
Yes, actually, it is. ASCII was probably the first wide-scale character set standard used in computing - what does the "A" stand for?
More Flamebait :) (Score:3)
another drawback of unicode (Score:3)
--