Why Unicode Won't Work on the Internet 416

Posted by Hemos on Tuesday June 05, 2001 @11:11AM from the -Linguistic,-Political,-and-Technical-Limitations dept.

We reeived this interesting submission from N. Carroll: "Unicode, the commercial equivalent of UCS-2 (ISO 10646-1) , has been widely assumed to be a comprehensive solution for electronically mapping all the characters of the world's languages, being a 16-bit character definition allowing a theoretical total of over 65,000 characters. However, the complete character sets of the world add up to approximately 170,000 characters. This paper summarizes the political turmoil and technical incompatibilities that are beginning to manifest themselves on the Internet as a consequence of that oversight. (For the more technical: the recently announced Unicode 3.1 won't work either.)" Read the full article.

This discussion has been archived. No new comments can be posted.

Why Unicode Won't Work on the Internet WORKING ON

Load All Comments

Search 416 Comments Log In/Create an Account

Comments Filter:

Languages (Score:2)

by Alex Belits ( 437 ) writes:

"Unicoders" ignore the fact that any multilingual text is inherently stateful, so their idea of stateless stream of giant "characters" that will be easy to process is flawed at the core. In fact it's useful for decorative purposes only -- while it's easy to _display_ a unicode text (given in any of countless Unicode encodings), it's impossible to process or edit it without at least some state (current language) information to determine, what input method, dictionary, grammar rule, etc. to apply to any substring, so the goal of stateless text is just as misguided as the initial stateless filesystem representation in NFS. But if statelessness is kicked out of the window (like it should for anything multilingual), then information about both language and charset can be easily added to any substring, so all national charsets, ones that were specifically designed to be used in some particular language, and to which all processing rules and dictionaries were already written, can be used -- programs that don't care about charsets and languages will just handle them transparently as sequence of bytes, and programs that care should use state information anyway.

How to include state information is a good question -- there are a lot of posibilities, and one of them is modification of HTML and XML specs to add charset attribute to everything that can have LANG. The problem is, for purely political reasons those specs specify only global charset for the whole document, and include LANG but don't include charset as an attribute for everything to make it impossible to use any non-Unicode charset for multilingual documents in them. This does not serve any legitimate purpose, and is an example of blatant sabotage of the specs to serve the interests of small but very influential and vocal group of companies that are interested in making multilingual processing as complicated as possible, so every simple task requires huge bloated application just to comply with the sabotaged specs, instead of simple byte-value transparency that otherwise would be sufficient. Raising the barrier for entry, decommodification and contamination of the standards at its finest.

The reason why things like that are possible is, that in fact the demand for multilingual text processing (multilingual as one document that contains text in more than one language other than English because English is usually supported within non-Unicode national charsets and works just fine with them) is currently very low, and was even less when those "standards" were adopted, so obvious flaws did not cause immediate havoc. This is a commonly used strategy -- when no one needs something, write a standard for it that favors you, create a lot of noise around it, declare that it "dominates the industry" because no one else is doing it, and then wait until the need becomes more or less apparent. Then when it happens, everyone will somehow remember a piece of your noise, and you can loudly proclaim that all that time you was busy including new great standard into the innards of your software and lobbied all standards groups to include some reference into standards (that everyone, of course, ignored all that time because of the lack of the need for application). So, at the time when need is "more or less apparent" and the requirements to applications and standards quality is low, you can expand the "use" of your standard by people who don't need it or care about it, just because it was included into some of your products -- if features support in them is ridiculously poor, no one would notice because there isn't that much use anyway. The development of other, superior, standards will be stifled because you will always be able to claim that everyone is happy with your standard because there aren't many people complaining -- of course, there won't be many complaining because almost no one actually uses it for what it was supposed to be used it in the first place yet. At the time when real need arises so many products and standards will be contaminated with your standard that people will have to use it despite the obvious flaws. If standard stinks, you still can claim that no one made anything better anyway, so everyone should just use your POS, and if it breaks others' software design, they should just adopt yours.

If this sounds too close to some particular company's favorite strategy, it probably is -- Microsoft with its nauseating file/documents formats design, mediocre and bloated, display/printing-only oriented text editing software is one of the most enthusiastic backers of the Unicode, and they do it despite the fact that their software itself often gets into trouble because Unicode is both hard to use and hard to implement. It doesn't matter, important thing is, if we had trouble with it doing it half-assed, everyone that will try to do it better will have much more trouble. Scorched earth strategy.
Re:Unicode's reply (Score:2)

by Alex Belits ( 437 ) writes:

It does not matter, what Unicode in theory can have in -- the allocation of characters is handled by a single, and not in any way open, organization, so the standard is all that is allocated and not that in theory can be if Unicode consortium would be benevolent enough, that we all know that it is not. Even if it would be, there is always some need to represent, in some consistent and unambiguous manner, text in languages that can't be possibly accepted into Unicode, such as fictional languages -- they can be easily handled by any expandable charsets-handling system and it won't be a rocket science to develop one, however Unicode supporters do everything that is possible for humans and sometimes more, to prevent any competing system from being developed. Also it does not matter what stated goals of Unicode are -- in fact it is being hawked to be used as the required internal representation of all text in all applications, and as the origin for encodings used for data manipulation, storage and transmission. These are facts, and so are the real problems that Unicode generates if used in that way. I have no problem with Unicode standard being a big dusty book used as a simplified manual for world''s alphabets, or as an intermediate format for fonts handling and texts conversion between different charsets of the same language. The problem is, Unicode is being used for things it is inadequate for, and its existence is loudly proclaimed as the reason to make no progress in development of any solution for multilingual texts handling that is not entirely based on Unicode-derived representation over the wire and in storage. This is selfish and counterproductive.
Re:Languages (Score:2)

by Alex Belits ( 437 ) writes:

No. It's "Microsoft likes Unicode because it sucks, and because it's sticky enough to cause trouble for others".
Re:Unicode's reply (Score:2)

by Alex Belits ( 437 ) writes:

1. The standard is expandable if, and only if, it does not require a change of itself to adopt an expansion. For example, the addition of a new MIME type does not change the MIME standard, however the addition of a new tag does change HTML standard, therefore HTML is not expandable, what is pretty easy to notice while comparing different HTML renderers. XML is a near-absurd case because it's basically an umbrella that allows to declare all kinds of tags and therefore is supposed to be flexible and generate expandable standards, however the catch is, it does not provide any facility do automatically determine how to handle those tags' semantics in applications, so mere possibility to declare something new does not make it expandable either if applications' algorithms have to be modified. In Unicode however the situation is much more simple -- any addition of the characters IS a modification of the standard, and there is no possibility to automatically provide interoperability between older and newer versions.

The existence of the procedure TO change the standard does not make it expandable.

2. If Unicode will adopt all fictional languages/scripts/... it will become absolutely impossible to make complete fonts for it -- now it's merely a huge task, but then it will be plain impossible. The only real solution is to have standard that allows to name a language/charset combination, and leave the text in them intact until either user will install support for them, or application will automatically download it. Unicode doesn't help with it a single bit -- application encountered a character in unsupported range, and all it has is 16 or now 32 bits that it can only stuff in its virtual ass and report an error because no reasonable resolution can be made without some external assumption.

3. ISO 2022 is a very poor implementation of stateful multi-charset character stream, and Unicoders are very fond of mentioning it as a proof that all possible stateful systems are bad. However repeating something that is false does not make it any less false -- in fact, after Unicode was adopted by IETF (on meetings behind the closed doors) all work on stateful character streams standardization was stopped.

4. Computers can magically process all kinds of charsets. It's called byte-value transparency. Most of applications would work just fine if they just copied strings without making any assumptions about their structure or number of characters in them as long as bytes are bytes, and end of string is always 8-bit 0, what would be quite trivial for any stateful text system to implement. Tiny minority of programs need anything from a text that requires actual parsing other than finding newlines and, rarely, whitespaces. Display routines are different thing, however there aren't many of them, and all systems other than Windows support Unicode by combining and translating multiple fonts for multiple ranges, so supporting multiple fonts subsets for multiple marked charsets would be only easier to implement.

The problem is, there are too many Windows programmers writing internet drafts now, so semantics of text display routines got stirred up from system-specific and application-specific processing where they belong, and contaminated standards responsible for data transfer, where they don't belong, and a lot of people now believe that to transfer some data one has to know how to display it in some pretty letters. Shame on you.
Re:Unicode's reply (Score:2)

by Alex Belits ( 437 ) writes:

1. ISO is a closed standards body -- if it does anything, it makes standard less open.

2. Private use codes aren't standard -- they don't provide any guarantees of interoperability, and merely provide a way to break the standard while fooling a program that is compliant with it into behaving how the user wants. If there was a way to put somewhere even a name of a charset to map "private" codes to a font name, it would solve a piece of the problem, but alas -- Unicode is made under the slogan of total statelessness of text, so while applications' file formats may allow this, arbitrary substring in a text can't.
Re:Conspiracy Theories and Unicode (Score:2)

by Alex Belits ( 437 ) writes:

The major commercial Unix vendors have all made significant commitments to Unicode support, and even the Linux internationalization community is busy adding Unicode support to Linux. Apparently it doesn't matter to you that Sun, HP, Compaq, NCR, and major Linux I18N players participate in Unicode development, too. It isn't an either/or black and white issue. It isn't some gigantic conspiracy to use a bad standard to prevent the good guys from developing a good standard. But I guess you can believe whatever you want.

This is, to say the least, incorrect. While there is a lot of effort to shoehorn Unicode into Unix and Unix software, the actual results are beyond miserable, precisely because Unicode does not work. Unix vendors solved this problems by adding a small support for to/from unicode conversion and by declaring that their filesystems support UTF-8, thus getting blessed by Unicode consortium as compatible. Guess what, UTF-8 can be "supported" in that way even by abacus, if that abacus is long enough and has at least 8 stones in a row, however actual use of it is a completely different thing -- I have never in my life seen a filename in UTF-8 outside of Unicoders' demos, and I am Russian myself and have a lot of friends that speak Japanese. So, again, Unix vendors' support of Unicode is in fact a lip service, not unlike Microsoft's support of POSIX or claims that Internet would support OSI 7-layers model (what ended with "temporary solutions" known as TCP/IP and Berkeley sockets replacing it).
Re:Statelessness of text (Score:2)

by Alex Belits ( 437 ) writes:

Statelessness of text is something that Unicode tried to achieve, and still is using as their main argument toward its acceptance. Latin-1 is quite irrelevant here because its goals weren't as pretentious as Unicode, and impact on existing applications was near zero, and was basically "where are we going to use those values anyway?" Unicode actually is supposed to be used for serious multiple languages support, and requires fundamental changes in both applications and protocols -- with protocols causing a lot of infiltration of Unicode-based requirements into otherwise tansparent protocols. This would be at some extent justified if Unicode actually was a base for serious multilingual processing (what Latin-1 never claimed to) but otherwise it isn't worth the effort and problems that Unicode brings in. So, main advantage of Unicode over basically everything else imaginable (though not implemented because of pressure on IETF from Unicode), is statelessness of text stream.
Re:Unicode's reply (Score:2)

by Alex Belits ( 437 ) writes:

You've ranted on this everytime Unicode has came up on Unicode, but assertion does not a proof make. You've never sketched out a better system, or said what makes ISO 2022 a poor implementation of what it is. Write an RFC, create a rough implementation of the system and if it really is better, then and only then can people evaluate and decide whether or not to use it. Until then, the choices are basically ISO 2022 or Unicode, and people will pick the choice that works best for them, and not worry about what could be the optimal

I can do that if anyone will listen. The problem is, the actual problem that it will solve does not exist yet, its time didn't come. Multilingual documents, for all purposes, don't exist beyind demos. Unicoders are using this to create their own standard that definitely won't hold water if demand already existed, but they can with their propaganda flood everything involved with standars -- certain person, Martin Duerst, subscribes to EVERY mailing list that may in any way touch multilingual text handling and every time someone mentions Unicode, floods it with tons of messages in support, and fiercely fights against every argument against. I have no idea what else that person does beyond that, if any, and how many hours is in his day, but it's extremely hard to support any serious argument when one side is so active, and most of people are disinterested.

I have planned to do this when actually people will need multiple languages in their documents, and if someone can convince me that I have slept too long, and this time is now, I will happily start work, but otherwise it will be not just fighting with windmills, but fighting with windmills when there is no wind.
Re:Conspiracy Theories and Unicode (Score:2)

by Alex Belits ( 437 ) writes:

UTF-8 on an abacus -- yes, I guess that *is* a strawman that we should all take *real* seriously.

I merely tried to explain that UTF-8 is specifically designed to be used with any imaginable system -- what says nothing about its usefulness.

I presume you mean on Unix systems, where for most such systems, choice of UTF-8 for filenames would be problematical because they would run afoul of other parts of the system that don't handle them. Sure, such may be the case.

This is simply false. UTF-8 filenames and data can be used in any Unix if one wants to sacrifice functionality that people expect from a fixed-length characters representation (ex: regexps matching, cutting text at arbitrary offsets). However it's not a problem of Unix that users expect their encodings to be easier to use than a mess that UTF-8 is -- on other systems there isn't any counterpart to this functionality in utilities that are in common use.

On the other hand, UTF-8 databases are now running routinely on Unix systems, and they work just fine, thank you.

Show me. I have seen a shitload of data, marked as UTF-8, yet used exclusively as ASCII, or even with different encodings actually in the data, but never -- actual multilingual database in UTF-8. Again, it demonstrates my point that Unicoders are trying to sneak their "standard" in while there is no demand and therefore no scrutiny for the quality of things being introduced.

> and I am Russian myself and have a lot of
> friends that speak Japanese.

Umm. And the relevance of that comment is what?

It means that I am in my own experience familiar with handling of multiple encodings, with what people use in the real-life texts handling, and their willingness to use Unicode, that happens to be below zero. You can claim that their reasons are irrational, and Unicode is still the best solution for them, however I still don't see, why opinion of almost everyone who actually knows about the subject from practice, and is supposed to benefit from what Unicoders are proposing, can be dismissed so lightly.
Re:Unicode's reply (Score:2)

by Alex Belits ( 437 ) writes:

Because, gee, the need to communicate with someone in another language is new.

When people communicate, they choose one language for it -- usually one that both know best. No one speaks like "Ya odnowremenno trying goworit' po-english i russkomu, and esli ya by znal nihongo ya would simultaneously speak po-yaponski, too".

It's very important to see the distinction between the need to support "multilingual document" that contains multiple languages within one body of text and to support documents in multiple languages within one system or program. Also historically it happened that documents in all languages can painlessly include ASCII text, so non-English language + English is usually treated the same way as a text in non-English language, not requiring any special tools to be handled. One may claim that this is wrong, but this is how things happened to be developed over decades.

I've never seen VCR instructions in multiple languages

Those are multiple documents, not one document with multiple languages in it. There is clear separation between versions in different languages, and this is already being accomplished easily, even in MIME email.

, I've never seen a bilingual dictionary

Dictionaries are special cases, and they usually are distributed in either printed form, or as a database -- they almost never are seen as plain text documents. In both for-print-only formats and in databases there are plenty of ways to represent languages and charsets as metadata, and absolutely all computer dictionaries that I have seen chosen to use native encodings.

, and the EU driver licenses only have one language on them, not every language of the EU.

Again, I assume that the whole text of the license is repeated in multiple languages, not individual words are repeated in each language within one body of text, so the same definition of multiple documents applies.
Re:Unicode's reply (Score:2)

by Alex Belits ( 437 ) writes:

So Reta Vortaro , an Esperanto dictionary with translations to many languages, is a demo. (Click on the j^ in the left frame, and then on the j^audo in the same frame, for the translation of that word into English, German, Polish and Russian, among others.)

First, without any doubt it is a demo -- the set of languages to which trnaslations are available varies from word to word, and in real life one would never want to have translation into multiple languages to always appear, clogging the screen. Second, this is an application (even though a simple one), not a document, and there are plenty of ways for applications to handle multiple charsets even now. My point is, functionality that supports multiple languages within application is completely ortogonal to the support of multiple languages within a single document or string. Unicoders love to mix those two.

Or Freedict, a source of bilingual dictionaries for dict (including German and Greek, and German and Japanese) is just a demo too.

Again -- I don't see why this particular application used UTF-8, however neither its design requires it, nor those files are for any purposes normal text documents -- even uncompressed, they have strict formatting and are even indexed, so they could use just any charsets/encodings possible.

And the Debian main page , where it lists the names of the languages in which the page has been translated to in their own script at the bottom, is just a demo too.

Absolutely. This list of languages is obviously a gimmick that provides nothing that list of languages in English wouldn't provide -- everyone in the world, for whatever reason, knows how his language's name looks in English even if he can't read English. In the case of Debian page, if I was looking for Russian translation, I certainly would search for "Russian" string to find the link (it's interesting that the word "Russian" is the only one, where both "native" and English name of the language are mentioned in the Debian page -- I assume, because a lot of Russians actually use Russian translation but don't have UTF-8 enabled or supported in their browsers). Also, Debian home page automatically chooses the language if it's announced by the browser, so if I really wanted Russian version and set language preferences in the browser, I wouldn't even have to touch anything else. And lo and behold -- when I choose Russian, the page appears in koi8-r, what happens to be Russian local charset, not any form of Unicode.
Re:Unicode's reply (Score:2)

by Alex Belits ( 437 ) writes:

You're also missing the other selling point of Unicode: it's simple. Yes, there are plenty of ways for an application to handle multiple character sets, but they're all more complex then just using Unicode.

The simplicity of Unicode is only in its authors' imagination. Yes, it's easy to present Unicode to people who don't know the details as a simple solution -- the problem is, reality isn't as simple as it looks.

I'm sure when typing up "German and English Sounds" for Project Gutenberg, that I could switch between Latin-1, some IPA character set, a character set with o-macron, a character set with u-breve, and whatever I need for the rest of characters Dr. Grandgent used, but it's much easier for me to use Unicode.

When the goal is just to make a text that can be printed in pretty letters, anything is ok as long as it's implemented. This is why a lot of low-quality products such as MS Office are so popular -- in fact so popular that I often receive email with nothing but plain ASCII text as a MS Word file. However even in this case a complex typesetting system (that would most likely just use multiple fonts in whatever charsets they happen to be avilable because it cares more about fonts) would be more appropriate.

When I start on "Old High German", I could dig up some obscure High German character set and switch to a Greek character set when he uses Greek words as examples . . . or I could just use Unicode. No matter how much you would dismiss it, it is a real problem and some of us use Unicode because it is a real and a simple solution to the problems we face.

How deceptive. The implied assumption is that "obtaining" charset support is some kind of nonzero effort while using Unicode is smooth regardless of the language. Both things are incorrect -- in a system with multi-charset support the charsets support can be loaded automatically depending on the languages and charsets mentioned -- if someone wants to have support for everything Unicode supports at the extent Unicode supports it, he will only need fonts, and the amount of the information and resources used would be exactly the same as if he had their support in Unicode. However in practice usually the goal is different -- only few languages and charsets are in active use by the same user at the time, however he needs them to be supported with input methods (how to enter greek on this particular keyboard?), formatting rules, ordering, at least references to spellcheckers, etc.

Again, Unicode user still ends up having to somehow get something language-specific, except that his language-specific data and procedures also have to be designed to use Unicode, what differs from the procedures that are already in use, and often open source. Software vendors would love that -- they can either keep making localized versions of all software with Unicode support but with different language-specific procedures, or try to make tools that can handle all languages and spend man-millennia rewriting trivial things and then release them as the only way to use Unicode in practice. In either case they get their money because old software, Unicode-supporting or not, will not match the requirements for multilingual documents processing, and their new solution will be complex and therefore hard to reproduce.

My idea is that infrastructure for stateful text processing is as unavoidable as the existence of different languages and writing systems, so it would be foolish to try to decieve people into thinking that displaying pretty letters is the main problem of handling multiple languages or multilingual documents. I don't see how denying undeniable is justified. Most of people are ignorant about the details because at this moment the problem isn't evident, and problem isn't evident because the whole field of its application is not in any way related to their everyday life, however I don't think that every kind of ignorance deserves to be abused with such a long-lasting possible consequences.

Extending the idea that in multilingual text attributes that should be applied to substrings ("state" when text is treated as a stream) are necessary, I can say that since statefulness is unavoidable anyway, charset/encoding is just as good attribute as the language or, say, language-dependent parameter such as direction (for example, in Japanese left-to-right and up-to-down directions are both acceptable, even though modern texts use left-to-right). The implementation of "full unicode" text processing, even in a primitive display-only manner, is not any simplier, and certainly isn't any lighter on resources than a multiple charset support -- in fact multiple charsets support can be easily built on the top of any existing text displaying or printing procedure that supports multiple fonts and multibyte characters. The only "big question" is how to represent attributes in a text stream, but this is merely a question of formally declaring some decision to be standard -- one can design many of them easily, and almost everything that a sane human mind can create at this moment in history would be infinitely superior to iso 2022.
Re:Unicode's reply (Score:2)

by Alex Belits ( 437 ) writes:

Come on. I've read the Unicode standard, I read unicode@unicode.org, I've read most of the publicly accessable proposals and I'm familiar with all the Technical Reports. There is a lot of complexity in Unicode, but it's mostly derived from the inescapable complexity of the writing systems and compatibility with older systems, and most of the complexity can be ignored if you willing to support some subset (European systems, or European/Russian/CJKV systems). That complexity is going to exist whether you use Unicode or some other multilingual system. Supporting Unicode at the Xterm/Yudit-level is simple, and supporting Unicode in an application with Pango & GTK 2.0 should be just as simple.

Then what was the point of your argument? If implemented in the display-only library and used for displaying/printing only, Unicode is just as "simple" as would be any other system, with or without multiple charsets. If program does anything complex, it should handle various language-dependent stuff anyway, however bare Unicode support provides no such infrastructure, and a reasonable infrastructure can be implemented either with or without Unicode. Then what is the advantage of Unicode? Being self-proclaimed status quo in standards' backroom-politics, that no one supports properly anyway, that is hard to segment into subsets, non-expandable, maintained by a closed standards body and requires more resources?

I don't claim that Unicode theoretically can't be used as the base for languages support -- in theory it can, but the problem is, it provides no advantage compared to multi-charset system if used as a part of multilingual text support infrastructure. I have already explained why such infrastructure does not exist now, however I believe that when it will become necessary, someone will have to implement it anyway. So now, when no one needs it, Unicoders are busy to claim this "piece of noosphere", just like some people tried to sell land on Mars -- just because it's there, and before it will become obvious that it's not theirs.

The goal of Project Gutenberg is to transcribe public domain texts in a format readable for the largest audience possible.

By this logic it should use Microsoft Word or at least PDF -- both very widely supported, more wide than even plain text files in UTF-8 (yes, I know, Word can use unicode internally -- this isn't the point).

Unicode HTML and UTF-8 plain text are those formats.

Are they? Most of my boxes don't have them installed -- the one I am writing this message on is an exception, but only because it has Mozilla, what is still a bloatware. My handhelds most likely never will have them installed -- they don't have enough ram, and need rather nontrivial manipulations with characters size and formatting to keep texts in some languages readable, so plain stream of unicode text would be impossible to display without some heavy heuristics.

Some proprietary and/or obscure complex typesetting format is neither portable nor accessible to a wide audience.

I wouldn't dream to propose a non-open standard for this. However the trouble with open standards is that they never appear before they become necessary, and I, following the principle that standards and tools should be developed as the need arises, am not making any detailed proposals at this time. But when there will be a need, the standard that will be created must be open, expandable and easy to port and reimplement -- something that anything Unicode-based is not. If you mean that charset is "proprietary", I am not aware of any charset except, maybe, "klingon in private unicode" that was in any way declared to be someone's property. If multi-charset support infrastructure will be created, it would be reasonable to include some common facility into the libraries that will make it possible for users to allow programs, when they see an unknown language ar charset, to automatically download fonts, tables and even formatting/comparison/input methods/... source code automatically from some servers that keep directories of known charsets and languages, and this would be an open, expandable and flexible infrastructure, available to everyone. If someone wants his language that never had local charset in the first place to be represented by its range in Unicode, he should be able to do that, however in a system like that there should be no reason to prevent established language/charsets combinations from being used just because of someone's narrow view of the problem.

Maybe I am wrong is this traditionalism, and it will be better if I made an infrastructure for stateful text support just to demonstrate this point -- after all, even with all dynamic fonts/code/input methods/... it won't be in any way more complex than any other solution, merely useless because right now still almost no one uses multiple languages in a single document. But maybe the need to demonstrate the solution for a problem that no one experiences yet is now a good reason enough when someone else is trying to sneak in an impractical solution as the standard while no one is looking.

I see current advance of Unicode as something that may serve some simple need now, but can severely limit further progress if accepted as widely as Unicoders are trying to get accepted. That would not be "good enough" as TCP is "good enough", large SMP kernel lock was "good enough" or C pointers are "good enough" -- it's "good enough" as Windows, region codes, crippleware, etc. are "good enough" -- people accept them because those things are pushed, and the inconvenience they create isn't bad enough until it's too late, but when it's too late, people still use them because there is nothing else in sight.
Re:Unicode Character Set vs Character Encoding (Score:2)

by Jordy ( 440 ) writes:

Actually, the Unicode specification for UTF-8 places an artificial limit of 4 8 bit code units for variable length encoding as that is all that Unicode currently requires.

ISO 10646 defines UTF-8 as having up to 6 8 bit code units.

At 4 bytes, UTF-8 can only map to 0x10FFFF. At 6, it can map to 0x7FFFFFFF.

Of course, my math could be wrong.
Unicode Character Set vs Character Encoding (Score:5)

by Jordy ( 440 ) writes: <(jordan) (at) (snocap.com)> on Tuesday June 05, 2001 @08:03AM (#175370) Homepage

The current permutation of Unicode gives a theoretical maximum of approximately 65,000 characters (actually limited to 49,194 by the standard).

The biggest problem with Unicode is that no one understands what it is. Unicode defines two things, a character set that maps a character into a character code and a number of encoding methods that map a character code into a byte sequence.

ISO 10646, the Universal Character Set defines a 31 bit character set (2,147,483,648 character codes), not a 16 bit character set. Unicode 3.0's character set corresponds to ISO 10646-1:2000. Unicode 3.1 which was recently released goes a bit further.

UCS-2, as mentioned by this article, is the same as UTF-16 and is severely limited by it's 16 bit implementation. UTF-16 is unfortunately used by Windows and Java, but is rarely used on the web. The article claims UTF-16 can only map 65,000 characters, but using surrogate pairs can actually map over 1 million characters.

Thankfully, there are several other encoding methods for Unicode. UTF-8, which is a variable length encoding most commonly used on the web allows a mapping of Unicode from U-00000000 to U-7FFFFFFF (all 2^31 character codes). It also has a nice feature of the lower 7 bits being ASCII, so there is no conversion necessary from ASCII to UTF-8.

UTF-32 or UCS-4 is a 32 bit character encoding used by a number of Unix systems. It's not exactly the most space efficient form (UTF-8 requires roughly 1.1 bytes per character for most Latin languages), but it can handle the entire Unicode character set.

A good document on this is available at UTF-8 And Unicode FAQ [cam.ac.uk]

Share
twitter facebook
Re:Duh. (Score:2)

by jandrese ( 485 ) writes:

Just because you can't read other langauges doesn't mean multi-language support is useless. Oh, and inputting Kanji on a keyboard is quite feasable, try using the Windows IME sometime (It's built into 2000).

Down that path lies madness. On the other hand, the road to hell is paved with melting snowballs.
Unicode includes all common Asian character sets (Score:3)

by Per Abrahamsen ( 1397 ) writes: on Tuesday June 05, 2001 @07:35AM (#175375) Homepage

I.e. all the character sets *in common use* in Asia today, maps into a subset of Unicode. They even map into the 16 bit subset, but overlap in a way that make slightly different characters from different character sets share the same code point. That is why an extended version of Unicode is used, so Chinese/Japanese/Korean characters have different codepoints.

Unicode does not contain all characters ever used, for example it does not contain the Nordic runes. These are not used today except by scolars, who will need special software (most likely using the "reserved to the user" part of Unicode). The same is true for many ancient Asian characters.

Share
twitter facebook
Babel (Score:2)

by jafac ( 1449 ) writes:

He sure did do a good job when he slapped that old Tower of Babel bitch down.
Use UTF-8, don't worry about sizes (Score:3)

by iabervon ( 1971 ) writes: on Tuesday June 05, 2001 @08:01AM (#175380) Homepage Journal

UTF-8 encodes 7-bit ASCII characters as themselves and all of the rest of UCS-4 (the unicode extension to 32-bits) as sequences of non-ascii characters. This means that apps which can't handle anything but ascii can simply ignore non-ascii and get all of the ascii characters (and, with minimal work, report the correct number of unknown characters).

The only issue is that there's not a good way to set a mask for the characters such that 0-127 (which take up a single byte) are the common characters for the language, and so on, so English is more compact than other languages, even languages which don't require more characters.

Share
twitter facebook
Re:2 + 1 bytes? (Score:2)

by Jeremy Erwin ( 2054 ) writes:

Maybe use only 20 bits and leave 4 bits for something else (font style, inverse, etc.).
Typically, one shouldn't apply font styles on a character by character b as iS.
Chinese language(s) (Score:2)

by danny ( 2658 ) writes:

I highly recommend S Robert Ramsey's The Languages of China [dannyreviews.com] to anyone interested in language in China.
Danny.
Re:Overstating and misunderstanding the problem (Score:2)

by tjansen ( 2845 ) writes:

>>the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V No, the handwritten one in Germany looks more like the 1 in an Arial font. bye...
Case (Score:2)

by Pseudonymus Bosch ( 3479 ) writes:

26 letters

You mean 26 uppercase and 26 lowercase.
__
Re:UCS-4 (Score:2)

by spitzak ( 4019 ) writes:

UTF-8 can encode 31 bit characters with an obvious extension of the standard (or perhaps this is part of the standard, I'm not sure).
After that point it breaks down (if you continue the first byte is filled and the prefixes will have to go into the second byte). Alternatively you can quit at that point, use the remaining bit as the 32'nd bit, and say that UTF-8 cannot encode more than 32 bits.
Anyway, one big advantage of variable-sized encoding is that there is potentially no limit to the size of the data transferred.
My personal opinion is that UTF-8 should be used *everywhere*, including all internal interfaces to libraries and services like X. The sooner we stamp out these "wide characters" and all the complexity they cause by doubling or quadrupiling the number of interfaces we need, the better.
UTF-8 probably does not address the concerns of the article, which is about the fact that Unicode does not contain all possible scribbles drawn by humans. But the fact that English speakers were able to compensate for decades and even adapted to the rather arbitrary and limited 62-character ascii set would indicate that people will easily compensate for this as well.
Re:ASCII stupidity all over again... (Score:2)

by spitzak ( 4019 ) writes:

When ASCII was invented it was based on existing typewriters, including the ones sold in Europe. At that time output was on paper and it was rather easy to overstrike characters to produce accented characters. How else do you explain the existence of the '~', '^', and backquote characters (in addition the underscore code was originally a macron). They actually designed it so the countries of NATO could type as well as they could on a typewriter. They also deleted several characters Americans wanted (fractions, fl and fi ligatures, open and close quotes, cent sign were all very common on typewriters of that period).
Yes, only a small set of countries was considered, and only minimal support. But this claim on "no support for anything not USA" is false.
Re:ASCII stupidity all over again... (Score:2)

by spitzak ( 4019 ) writes:

I didn't claim they tried to support all European characters. What I meant is that they did not ignore them totally. They (rather stupidly) thought that a few accent marks would do the job.
The cent sign was replaced with the caret. That is why shift+6 prints a caret, if you look at old typewriters that was how you printed a cent sign. This is in fact the main reason I think they considered European support, since from an American point of view the cent sign is more important. The fractions were what were replaced by the square braces. The curly braces, vertical bar, and apparently the tilde were added later (originally they printed as square braces, slash, and caret, and devices that totally ignored the lower-case bit were allowed, and the original tilde was changed to underscore because that character was missing originally).
"Extended ASCII" usually refers to the replacement of several of the punctuation marks with European characters. This was pretty useless because by then most OS's had assigned meaning to those punctuation marks (like the square brackets), also only 5 or 6 new characters were available. This died almost immediately when people started supporting the 8th bit as data rather than parity.
Re:Arabic space (Score:2)

by spitzak ( 4019 ) writes:

Is there a good reason for this or is this due to some stupidity in MSWord? I assumme it has something to do with bidirectional scripts, but if normal space is not used for anything in Arabic then I would accuse MicroSoft of being stupid.
If this is the normal non-breaking space character (0xA0 in Unicode) then it takes 2 bytes in Unicode.
Re:Arabic space (Score:2)

by spitzak ( 4019 ) writes:

Yes, I am guessing that MS picked a character to mean "backwards wrapping space" or something, and it sounds like all Arabic must have the words seperated by this character rather than space. It apparently is not the "non breaking space" or some Arabic equivalent, if I understand the orginal poster correct.
The question was "did they do this for a good reason? Ie: doing this allows formatting control that could not be achieved otherwise. Or were they just stupid/lazy, and if normal spaces were used with a slightly smarter program would it be just as good?
I personally don't know anything about Arabic so I cannot answer these questions. My guess is that this is reasonable if there is a place that "normal spaces" are used in Arabic.
Re:UTF-8 should be fine for almost any application (Score:5)

by spitzak ( 4019 ) writes: on Tuesday June 05, 2001 @10:39AM (#175390) Homepage

Thanks for some more intelligent discussion about UTF-8.
I might add a few things:
In UTF-8 not just NULL or Escape are not in the multibyte characters, in face *all* 7-bit characters are not in the multibyte characters (the multibytes have the high bit set in all bytes). This means that *any* program that treats all bytes with the high bit set as a "letter" will work and can parse, hash, match, search, etc identifiers/words with foreign letters in them!
In addition the UTF-8 encoding is just heavy enough that random line noise is very unlikely to match a UTF-8 encoding. If programs treat "illegal" UTF-8 encodings as individual bytes in the ISO-8859-1 character set, it will display virtually all existing ASCII/ISO-8859-1 documents unchanged!
The end result is that it should be easy to switch all interfaces (not just over the network, but inside programs and to libraries) to UTF-8. This will vastly simplify the handling of Unicode because there will be no need for ASCII back compatability interfaces. We could also eliminate all the "locale" crap and make ctype.h the simple thing it once was.
Even Arabic will encode smaller in UTF-8 than UTF-16. This is due to the fact that very common characters (not just English, but things like space and newline) are only one byte.

Share
twitter facebook
Re:All Character sets simultaneously?? (Score:2)

by K-Man ( 4117 ) writes:

I got all sorts of spurious matches from the Latin words, which wouldn't happen if the Greek and Roman letters weren't sharing a single character space.

However, in Unicode, Chinese, Korean, and Japanese all share the same codepoints and glyphs, so you can't grep for one language or another.

For instance, if you were searching in Korean for "Kim Il Sung", this string in Unicode would be the same as the Chinese characters for "gold" (jin), "one" (yi), and "star" (sheng), so your search would get hits from other sino-based languages in addition to Korean.

It's difficult even to sort Unicode correctly without choosing some language or another, due to this overlap of characters. "Alphabetical order" is different for the different Asian languages, even though they use the same characters.
Correct, also see link (Score:2)

by K-Man ( 4117 ) writes:

Yes, the author was overbroad with that statement. All languages work on a restricted set of phonemes; there are some 200+ identified, but no one language uses near that number. Hangul covers all the Korean phonemes, but not much else.

Here's a good description of Hangul. [usu.edu] If you check this page, you'll notice I was wrong about the vowels; they don't seem to describe their own pronunciations at all, but rather the yin and yang elements of their sounds :-P.
Re:You bring up a good point (Score:3)

by K-Man ( 4117 ) writes: on Tuesday June 05, 2001 @10:46AM (#175393)

If you read the article, you'll find a decent description of Korean Hangul, which has around the same number of characters as English (IIRC, it has 24).

Hangul outdoes the latin alphabet in several ways. For one, as you mention, pronunciation in English is difficult, while in Hangul it is almost completely unambiguous. Each phoneme maps to one character, and vice-versa. There is no confusion over whether to write "cat" or "kat", for example. Only one letter has the "k" sound.

Each Hangul character is a pictogram describing the position of the tongue, palate, and lips to use when pronouncing it. Whereas most phonetic alphabets consist of ideograms recycled as phonetic symbols, Hangul seems to be the only one to consist of symbols constructed purely for phonetic meaning.

Since the job of a phonetic alphabet is only to represent phonemes, I would say that this alphabet does the job better than latin.

Share
twitter facebook
Re:Unicode's reply (Score:2)

by dvdeug ( 5033 ) writes:

> the allocation of characters is handled by a single [...] organization,

Slightly incorrect. It's handled by the Unicode Consortium AND the ISO 10646 standards group.

> not in any way open, organization

It's as open as, say, the ISO C++ standards group. That is, unless you're connected to the right corporation or country, you won't get a seat, but they still accept outside submissions and respect experts outside the group.

> Unicode consortium would be benevolent enough, that we all know that it is not.

Benevolent how? Benevolent enough for what? It took them less than a year to get LATIN CAPITAL N WITH LONG RIGHT LEG encoded, for a minor language with no political power (Lakota). They're constantly encoding new letters and scripts for groups with no political or economic clout (Z with hook below for Old High German, various Phillipine scripts in 3.2).

And no one's stopping you from hacking up your own multi-charset system, and using it whereever you want. But loudly claiming that you're being oppressed doens't prove that you are, and doesn't prove that your system would actually be superior to Unicode.
Re:No, _n_ bytes per character! (Score:2)

by dvdeug ( 5033 ) writes:

Part of the point of UTF-8 is that non-ASCII characters don't get encoded with ASCII characters. In your system, you can get an '/' or a '\0' or '\e' byte that doesn't represent that character, meaning that all Unix software needs to be changed to support your encoding. As it is, Linux accepts bytes for filenames without caring whether it's UTF-8 or some 8-bit code or some other multibyte code that obeys the same rule, knowing only that the byte '/' is uniquely the directory seperator.
Re:Unicode's reply (Score:2)

by dvdeug ( 5033 ) writes:

> ISO 2022 is a very poor implementation of stateful multi-charset character stream,

You've ranted on this everytime Unicode has came up on Unicode, but assertion does not a proof make. You've never sketched out a better system, or said what makes ISO 2022 a poor implementation of what it is. Write an RFC, create a rough implementation of the system and if it really is better, then and only then can people evaluate and decide whether or not to use it. Until then, the choices are basically ISO 2022 or Unicode, and people will pick the choice that works best for them, and not worry about what could be the optimal solution.
Re:ISO-2022-JP and "alphabetical order" (Score:2)

by dvdeug ( 5033 ) writes:

Why is the character order a problem for the Japenese, and not the Germans, the French, the Lithuanians, the Belarusians, and almost every other language in the world? Latin-* does not encode anything besides English in alphabetical order, and neither does Unicode. (It's theoritically impossible; the Lithuanians want the Y to precede the J, and the Danish and the Swedes disagree about where the a with ring above goes.)

If you go to the Unicode standard (found online at http://www.unicode.org/unicode/uni2book/u2.html ) they have an index with all the characters by radical and stroke. They also have an index with all the characters found in JIS sorted by their JIS index.
Re:Unicode's reply (Score:2)

by dvdeug ( 5033 ) writes:

The problem is, the actual problem that it will solve does not exist yet, its time didn't come.
Because, gee, the need to communicate with someone in another language is new. I've never seen VCR instructions in multiple languages, I've never seen a bilingual dictionary, and the EU driver licenses only have one language on them, not every language of the EU.
Multilingual documents, for all purposes, don't exist beyind demos.
So Reta Vortaro [uni-leipzig.de], an Esperanto dictionary with translations to many languages, is a demo. (Click on the j^ in the left frame, and then on the j^audo in the same frame, for the translation of that word into English, German, Polish and Russian, among others.) Or Freedict [freedict.de], a source of bilingual dictionaries for dict (including German and Greek, and German and Japanese) is just a demo too. And the Debian main page [debian.org], where it lists the names of the languages in which the page has been translated to in their own script at the bottom, is just a demo too.
Re:Unicode's reply (Score:2)

by dvdeug ( 5033 ) writes:

The fact that you chose to dismiss this stuff as demos does not change the fact that it's in actual use. Revo's author doesn't feel like changing the format of his dictionary because you don't agree with it. The web is full of gimmicks, but people like their gimmicks; why do you think Java took off? You can't just call it a gimmick and dismiss it; if that's what people to do, then that's what people want to do.

You're also missing the other selling point of Unicode: it's simple. Yes, there are plenty of ways for an application to handle multiple character sets, but they're all more complex then just using Unicode. I'm sure when typing up "German and English Sounds" for Project Gutenberg, that I could switch between Latin-1, some IPA character set, a character set with o-macron, a character set with u-breve, and whatever I need for the rest of characters Dr. Grandgent used, but it's much easier for me to use Unicode. When I start on "Old High German", I could dig up some obscure High German character set and switch to a Greek character set when he uses Greek words as examples . . . or I could just use Unicode. No matter how much you would dismiss it, it is a real problem and some of us use Unicode because it is a real and a simple solution to the problems we face.
Re:Unicode's reply (Score:2)

by dvdeug ( 5033 ) writes:

> The simplicity of Unicode is only in its authors' imagination.

Come on. I've read the Unicode standard, I read unicode@unicode.org, I've read most of the publicly accessable proposals and I'm familiar with all the Technical Reports. There is a lot of complexity in Unicode, but it's mostly derived from the inescapable complexity of the writing systems and compatibility with older systems, and most of the complexity can be ignored if you willing to support some subset (European systems, or European/Russian/CJKV systems). That complexity is going to exist whether you use Unicode or some other multilingual system. Supporting Unicode at the Xterm/Yudit-level is simple, and supporting Unicode in an application with Pango & GTK 2.0 should be just as simple.

> When the goal is just to make a text that can be printed in pretty letters [...] even in this case a complex typesetting system [...] would be more appropriate

The goal of Project Gutenberg is to transcribe public domain texts in a format readable for the largest audience possible. Unicode HTML and UTF-8 plain text are those formats. Some proprietary and/or obscure complex typesetting format is neither portable nor accessible to a wide audience. Project Gutenberg has existed for 30 years. What "complex typesetting system" format can claim the same? How many "complex typesetting system"s that could handle it are available on many different platforms? At least 70% of the people on the net can read Unicode HTML, and many of the rest could with little work and no cash expenditure. What "complex typesetting system" can say the same? How is a "complex typesetting system" simpler than Unicode plain text?

> he needs them to be supported with input methods (how to enter greek on this particular keyboard?), formatting rules, ordering, at least references to spellcheckers, etc.

Nonsense. For "Old High German", I will map ALT-Z to ȥ in XEmacs. Spellcheckers don't exist for this language, I'm not going to sort the data, and it's just a z with a hook, so there's no special formatting rules. The people who read the book don't even need a way to enter the character, any more than reading the original book precipitated a need to enter it into the computer.

Displaying pretty letters isn't the end all and be all of multilingual computing, but it's a damn good start. The only registered character set (ISO 2022 registry or IANA registry) that supports Lakota or the Cherokee syllablary is Unicode. No, on most systems, they can't get decent support; handcrafted keyboard maps must be used, there's no spelling or sorting support. But they can type the characters in and send them across the net and print up papers, which is better than nothing.
Re:Esperanto, Ido, lojban; BCE (Score:2)

by unitron ( 5733 ) writes:

He explained it as "before the Christian era", no doubt for the benefit of those only familiar with B.C. and A.D., but did not define it as that, although anyone who needs it explained no doubt also needs it defined as "Before Common Era" (and should also be told that what comes after is "C.E.", or "Common Era", and that B.C.E. and C.E. correspond to B.C. and A.D., respectively), so he did screw up just a tad.
Re:It works (Score:2)

by h2odragon ( 6908 ) writes:

No, that makes too much sense; it's not all inclusive so let's trash the whole thing, start over from scratch, and revert to 7bit ASCII in the meantime. We need a system that can handle every glyph that has ever had meaning to somebody, somewhere.

...for the sarcasm impaired, the above should be read as "good point".
No you didn't read the article, or even think (Score:2)

by A nonymous Coward ( 7548 ) writes:

You are so euro-centric it's not even laughable. As the article said, those who claim Unicode good enough for the masses are the same foreigners who would scream and howl if someone tried to remove redundacies from the English language such as pork and ham, or argue and dispute, or ...

I have read that an English language vocabulary of 300 words is good enough for most ordinary conversation. You are claiming the equivalent is good enough for ordinary use. You are mistaken.

Unicode is a classic case of (western) imperialism, in which the imperialists are completely blinded as to why it is imperialistic, and continue to mutter "it's good enough, and we know what's good for you smelly foreigners."

--
Re:C programs (Score:2)

by Luke ( 7869 ) writes:

I'd love to see the linux kernel coded in Python.
What about the artist formerly known as Prince? (Score:2)

by mattkime ( 8466 ) writes:

Does the artist formerly known as Prince get his own charcter space as well?

Will I need to download a new character set on windows to view it?
Re:Quit whining and move to a phonetic alphabet (Score:2)

by scrytch ( 9198 ) writes:

> The obsession with phonetic spelling is an unhealthy and rediculous pathology

Despite the fact that we move inexorably toward it anyway.
--
Re:Languages (Score:2)

by scrytch ( 9198 ) writes:

So basically your argument boils down to "Microsoft likes Unicode, therefore it sucks"? You come up with some fuzzy vague idea of encoding "language attributes" like grammar and dictionaries into character sets ... somehow, meanwhile conflating character sets with documents... I'm surprised you haven't asked for binary to be revamped. Try losing the scare quotes too, your sneering disdainful superiority for the subject and everyone associated with it was already fairly apparent.

As Rand would say, A is A. Whatever sort of semantic meaning the letter might have in the context it's used in is not Unicode's problem. I hear we have things like document formats that handle that.
--
Re:Well DUH! It's not meant to have every characte (Score:2)

by Mike Buddha ( 10734 ) writes:

Besides, translation software is coming along well enough that soon we will not have to worry about it too much.

How is this translation software supposed to work if there is no standard for interchange? Magic? How are we supposed to translate these characters that have no symbol for the computers to process?

There are well over 140,000 language characters on this earth, and there are many yet to have been entered into a computer.

What makes you think that we can't encode all these characters? Are we going to run out of numbers? A 32-bit number can hold 4 billion different values, and if that isn't enough, we can use a 64-bit number. We certainly aren't going to run out of numbers.
Re:After some skimming... (Score:2)

by K. ( 10774 ) writes:

I suspect Unicode is a lot more upsetting to
a "reference writer specializing in rare Taoist
religious texts and medical works" than to
ordinary Chinese users who want to run Photoshop
or put their wedding pictures on a web page.

Let me get this straight - you think people
should be prepared to accept having restricted
access to the literature that underpins their
culture in exchange for their very own
geocities.cn?

K.
-
Some errors (Score:5)

by BJH ( 11355 ) writes: on Tuesday June 05, 2001 @08:02AM (#175437)

Hiragana, which is somewhat cursive, can be used to augment Kanji - in fact, everything in Kanji can be written in Hiragana. Katakana, which is much more fluid in appearance than is Hiragana, is used to write any word which does not have its roots in Kanji, such as the many foreign words and ideas which have drifted into general use over the centuries.

In actual fact, Katakana is much more angular than Hiragana - definitely not "fluid" in appearance. Furthermore, anything that can be written in Kanji can be written (phonetically) in either Hiragana or Katakana - the use of Katakana for foreign words is nothing more than custom, not a limitation of the characters.

Thus is can be said that Hiragana can form pictures but Katakana can only form sounds...

That should probably read "Kanji can form pictures but Hiragana/Katakana can only form sounds..."

Romaji is used to try and keep the whole written thing from getting out of control, with most Western concepts and necessary words being introduced into the language through this mechanism.

Bollocks. Romaji is hardly ever used (except for advertisements, and then only rarely, or textbooks for foreigners). It's definitely not the main conduit for Western ideas.

After a time these words (even though they will still maintain their "Roman" form for awhile longer) will become unrecognizable to the people they were originally borrowed from, such as the phrase, "Personal Computer," which is now "PersaCom" in Japan.

Again, this is incorrect. Words don't *have* a Roman form in everyday use; sure, you can express them in Romaji but no-one ever does. As for "personal computer", the correct Romanization is 'pasokon', not 'PersaCom". (Where did he get that from?!)

The rest of the 1,950 have to been memorized fully by the time of graduation from high school in Grade Twelve. Please remember that this total is only the legal minimum required threshold to be considered literate. And this is to be absorbed completely, along with a back-breaking load of other subjects.

Ummm... that's actually not too hard. I (along with everyone else at my language school) memorized more than 1300 Kanji in less than a year... and none of us were Japanese. I know it must seem like an impossible total to people used to ASCII, but there are many common points between Kanji that simplify the learning process greatly.

That said, I've long been against the current Unicode "standard", as have many technical people in Japan, for a number of reasons. Some of those are:

- No standard conversion tables from existing character sets (SJIS, EUC-JP, ISO-2022-JP).
Several conversion tables do exist, but there are minor differences between them that make it impossible to go from, say, SJIS to Unicode and back to SJIS without the possiblity of changing the characters used.

- A draconian unification of CJK characters.
The Unicode Consortium basically forced the standards bodies in China, Japan and Korea to unify certain similar Kanji onto single code points, which doesn't allow for cases where, say, Japanese actually has two or three distinctive writings that are used in different situations.

- The ugly "extensions".
Unicode has been effectively ruined as a method of data exchange by its treatment of characters not in the 60,000-character basic standard.

I could go on, but I should get some sleep...

Share
twitter facebook
printing [ Chinese ] vs dictionary [ Chinese } (Score:2)

by peter303 ( 12292 ) writes:

Its a lot like the Oxford English Dictionary
versus Websters Collegiate- Chinese printers have
gotten by with 7-10K characters versus the 60-80K
in the full language. Synonyms and hononyms are
used for the more obscure words. The standard
modern Chinese dictionaries only have this smaller
number of characters.
totally unconvinced (Score:2)

by kaisyain ( 15013 ) writes:

One of the author's main propositions seems to be that Communist Chinese and Taiwanese/Overseas Chinese want different spaces in Unicode for the same characters.

I don't see every Western nation asking for it's own encoding of "w" or accented characters. The author doesn't give any explanation for why we should pay attention to IMHO silly political whining in this particular case.

The author further implicitly assumes that it is reasonable to include the deprecated K'ang Hsi characters in addition to the official characters, but gives no justification for this view. I don't see unicode trying to include all possible historical graphings of Western characters.
Re:Solution - Everybody use Euro-English! (Score:2)

by sharkey ( 16670 ) writes:

Das rubbernecken sightseenen keepen das cotten picken hands in das pockets, so relaxen und watchen das blinkenlights.

--
Re:You bring up a good point (Score:5)

by gleam ( 19528 ) writes: on Tuesday June 05, 2001 @08:17AM (#175447) Homepage

The writing system with the smallest alphabet that is in current use is Hawaiian, with 12 letters. (aeiou hklmnpw) source [alternative-hawaii.com]

A good source for your obscure questions is, as always, the Straight Dope, which answers the "Chinese Typewriter" question here [straightdope.com].

Regards,
gleam

Share
twitter facebook
Re:Quit whining and move to a phonetic alphabet (Score:2)

by dutky ( 20510 ) writes:

The obsession with phonetic spelling is an unhealthy and rediculous pathology: to understand why, have a look at Justin B. Rye [demon.co.uk]'s Spelling Reform [demon.co.uk] page (subtitled And the Real Reason It's Impossible).
Re:UTF-8 should be fine for almost any application (Score:3)

by dutky ( 20510 ) writes: on Tuesday June 05, 2001 @08:24AM (#175450) Homepage Journal

Of course, even if you could get China, Taiwan, Japan and Korea to agree on a unified character encoding similar to the ISO-Roman character set (where identical or analogous characters in the different alphabets shared the same character code) you would still need more than 50,000 encodings just for the unified asian character set.

I can see good reasons why language using similar alphabets should have overlapping encodings, but this is probably better solved by providing translation tables between related alphabets than by forcing multiple alphabets to share a single encoding. While I may be able to write the french coup de grâce in the english alphabet as coup de grace something has clearly been lost. Other europen languages are even worse, even those that nominally use the roman alphabet! Then there are questions of alphabetization between differnt languages and the questions of whether or not accented letters correspond to each other or to the unaccented letter.

Call me a purist, but I think it is actaully much easier if we just had distinct representations for each language and had to perform some kind of mapping to display one language in another language's alphabet.

Share
twitter facebook
Re:unicode does *not* encode 65,536 characters (Score:2)

by jholder ( 22001 ) writes:

This is incorrect, 44,946 surrogates were approved in March as part of Unicode 3.1.0.

Unicode 3.1 and 10646-2 define three new supplementary planes:

Supplementary Multilingual Plane (SMP) U+10000..U+1FFFF (1594 chars)
Supplementary Ideographic Plane (SIP) U+20000..U+2FFFF (43,253 chars)
Supplementary Special-purpose Plane (SSP) U+E0000..U+EFFFF (97 chars)

Or plane 1, 2, and 14. (from the Unicode 3.1 Technical report, #27)
Re:No you didn't read the article, or even think (Score:2)

by WNight ( 23683 ) writes:

Whoops, here comes the racism...

Anything a white guy wants, or someone who might be a white guy, is wrong, euro-centric, penis-dominated, and wrong.

Now anything a non-white, non-guy wants wants is automatically right.

Now a person whom is completely anonymous on the internet can be assumed to be white and male if they disagree with anything said by a non-white, non-male, or someone who lives outside of 'europe or north america'.

You know, there are a lot of reasons for disliking Unicode, and a lot of reasons for not wanting to waste time implementing a system which has 1) grown monstrously beyond original specs and 2) doesn't help you at all.

IMHO, you should use those ~65K characters and stop your pathetic sniveling. If you want a character set that supports more, make it yourself and get others to use it.

If you ever want anyone outside of your immediate family to use it, you'll have to make it worth their while.

What does the other 80% of the world get out of supporting your dead language? Uglier URLs? More bloated OSes? Slower web usage?

Sure sounds important to me.

And before you scream "Racist!", ask yourself if you have any proof, or if you're just pissed that I don't agree.
Re:No you didn't read the article, or even think (Score:2)

by WNight ( 23683 ) writes:

I'm sure those 65000 characters would better represent any asian language than German or French is represented without their accents, yet speakers of those languages didn't pull this entitlement crap to make people support larger character sets.

No language is going to be properly represented, especially when you consider ancient forms, so we'll have to accept that nothing is perfect. We've run into diminishing returns and now people want to increase the complexity of the system a hundred-fold just to get some characters than only a thousandth of one percent of the population will ever know are missing, let along want to use in conversation.

There's no written language that more than 80% of the world population uses, so I stand behind my original estimate.

I'm not at all racist in what I say, I'm merely sick of cattering to the special interest groups. Especially the special interest groups that claim to be part of a larger group. (In this case, 99.999% of the population of the original subject's country couldn't give a shit about having ancient characters from a dead language in their URLs, it's *his* issue, not theirs, but he's making it seem like a race issue and oppression of the little guy.)
Re:After some skimming... (Score:3)

by WNight ( 23683 ) writes: on Tuesday June 05, 2001 @01:21PM (#175462) Homepage

Oh gawd, just listen to the feelings on entitlement in that messages...

You want the ability to search through some insanely large character set, so to do so you're willing to force everyone else to make their communications much less efficient just so you can have a free ride.

You know, it's not a coincidence that the western world (using small variations on the roman character set) pretty well invented modern technology. It's only about a thousand times easier to process a smaller and simpler alphabet.

There's a reason we don't use prose to command computers, until all cheap desktop models come with the ability to understand natural language a stripped down and unambiguous command-set will be more efficient.

I've got a lot of characters I'd find handy if we were to implement a new standard, and I'd want to expand into basic pictograms (standard symbols, etc) as well. Now I realize this isn't interesting to other people, so I'm not going to jump up and down and shout "Racist" just because people aren't anxious to bloat a new standard just to appease me. If I want those features I'll make my own font and make it available with any works that I produce which would require it.

In short, grow up, the world does *not* own you anything. If you want it, do it yourself instead of crying when someone else doesn't.

Share
twitter facebook
Re:Quit whining and move to a phonetic alphabet (Score:2)

by HenryFlower ( 27286 ) writes:

But you wouldn't say that if you could read ancient Greek (I assume that's what you meant, not modern Greek). If you could, you would be happy that there need be no longer a half-dozen ideosyncratic methods for encoding ancient Greek, with equally ideosyncratic input methods. All you are really saying is: if I don't need it no-one does. And I don't see how a character set allowing faithful encoding of Greek characters and diacritics places any special burdens on you, who don't need to use them....
Or perhaps you are being very subtly sarcastic? Or trolling?
This is so wrong (Score:2)

by jfedor ( 27894 ) writes:

The author of the article and the guy who submitted the story clearly don't have a clue about Unicode. Unicode can encode over one million characters, as stated here [unicode.org].

Unicode may have its problems, but this is not one of them.

-jfedor
Compaction and Traction (Score:2)

by JJ ( 29711 ) writes:

The 64,000 should suffice. Ideographic scripts, like Chinese are were the problem arises. The number of characters in Chinese is not fixed, unlike the number in most alphabets. I have a Chinese novella which was written in just 300 characters. 10,000 would be a good place to start, a few thousand more would cover all but specialized texts. Japanese could fold into Chinese, since there are only 2000 kanji characters and a few hundred kana.
Throw in Arabic, Cryllic, Sanskrit, Dravidian, Hangul (Korean) and Navaho and you still add only a few thousand. The odd European characters (the 'ss' in German, the extra Danish vowels, . . .) add a few hundred tops. Even the special linguist marks and punctuation don't add much.
If you have to double the Chinese, now you run into trouble. Its classical characters vs. simplified. The later is for the PRC. If you also bloat the number of characters required so that specialized religous characters are required, now you start to push the system. 64K would be fine if a special marker character could be used which signify's that the next character is from the special table. Unicode has resisted this effort.
Re:Compaction and Traction (Score:2)

by JJ ( 29711 ) writes:

You know, this is actually the one topic that I am probably best versed on discussing. My info sci masters advisor was on the committee which established ASCII and my linguistics masters was on medieval Chinese dictionaries. Plus, I used to live in Japan.
There are _slightly_ more than 2000 kanji in Japanese, but Japanese printers, like my wife's father, don't use more than 2100 absolute tops.
Chinese characters obey Zipf's law on a near perfect logarithmic scale. As in, the first ten characters make up about 60% of written text. For each unit of ten up from that include about 60% of what is left. At 10,000 characters you have all but about 2.5% of most newspaper text. The few thousand extra that I spoke of covers mostly proper names.
Chinese most certainly can be written satifactorily in this manner.
Re:Quit whining and move to a phonetic alphabet (Score:2)

by scruffy ( 29773 ) writes:

Did you actually read what you linked to? Justin Rye is very sympathetic to "spelling reform", but he realizes it is utopian:

The flaws of the standard orthography are indefensible - but it has an extensive Installed User Base, and can thus afford to ignore criticism in exactly the same manner as Fahrenheit thermometers, QWERTY keyboards, and certain software packages, which can all rely on conformism, short-termism, and sheer laziness for their continued survival.
Quit whining and move to a phonetic alphabet (Score:3)

by scruffy ( 29773 ) writes: on Tuesday June 05, 2001 @08:18AM (#175472)

Phonetic writing is one of the greatest inventions of mankind. All a speaker needs to be literate is to learn the mapping between sounds and letters. Could anything be easier?
But like companies who still maintain their legacy software written in Cobol and who knows what else, countries and cultures hold onto their legacy alphabets, despite all their disadvantages, and despite all the moaning and groaning about education, literacy, and how hard it is to type 10,000 characters on a 100-key keyboard.
I agree there is a serious problem of understanding texts written in the "old way". There is a simple solution here, too, i.e., we just translate what's most important to the "new way" and let scholars work on the texts that don't get translated. Before anyone gets too hot here, the situation is not that much different than translating literature from one language to another. It is too much work to translate everything that is written in English into French, so one focuses on the texts that are important enough for translation.
Also, English has a lot of problems here, as it is mostly phonetic, but a large percentage is not, large enough to make learning English a lot more difficult than say learning Spanish.
I realize this is way too utopian. We Americans can't even move to metric, much less anything more "radical". I just needed to respond to the whining.

Share
twitter facebook
Nordic Runes? (Score:2)

by reverse solidus ( 30707 ) writes:

Nordic Runes. [megabaud.fi]
A Plan for the Improvement of English Spelling (Score:4)

by bgarcia ( 33222 ) writes: on Tuesday June 05, 2001 @08:29AM (#175485) Homepage Journal

Go read the original [netfunny.com] story here, by Mark Twain.

Share
twitter facebook
Re:Perl in Hierogliphics (Score:2)

by CharlieG ( 34950 ) writes:

Gee, I thought it already exists - they call it APL
Re:Is this really such a problem? (Score:2)

by revscat ( 35618 ) writes:

Firstly, many cultures are still too poverty-stricken to have electricity and running water, let alone net access. For these people, the thorny issue of whether Unicode has the capacity to represent their native language is totally irrelevent.

It's totally irrelevant for poor rural populations, true. But as more and more of the world's population moves towards being centered around urban areas this is indeed relevant. It is relevant to those who desire the full functionality of the Internet in their native character set. I believe (and this is a belief, not a fact) that one way to help out those who are poor is by opening them up to the modern economy and make it as accessible as possible. One way to do this is by making sure they can use the latest technology in their native tongue, lowering the slope of the learning curve.

Secondly, the rate at which languages are dying is still accelerating. Every year, we lose several languages as native speakers die of old age without their descendents having ever learned their original language.

This is indeed tragic, but it quite simply cannot be helped. It's so common as to be a cliche: "Life Sucks", or "Shit Happens", or even "C'est l'vie." I hope that there are linguists and philologists who are archiving these languages for future generations and our general cultural awareness. BUT: People must eat, and they have a strong desire to make themselves and their families prosperous. If, when all things are considered, making sure that you live your life only speaking language X turns out to be counterproductive, then that language will become less important. There have been many languages that have come and gone throughout the millenia; humanity continues to advance. Would the world be a richer place if all those languages were still around? Certainly. But it would also be more confusing. And remember: If people can speak to each other, there is less of a chance they'll start killing each other. (LESS of a chance, mind you.)

I'm a Taoist at heart in matters such as this. For every yin, there is a yang, for every good, there is a bad. Life goes on.
- Rev.
Re:ASCII stupidity all over again... (Score:2)

by gorilla ( 36491 ) writes:

ASCII is, and always was, a 7 bit standard, which encoded 95 printable characters and 33 control codes. 'high-ascii' just does not exist, and never did.
It works (Score:2)

by Kohath ( 38547 ) writes:

Imperfect != "does not work"
Re:Wrong, wrong! (Score:2)

by csbruce ( 39509 ) writes:

or you want to scan the string backwards

UTF-8 can indeed be scanned backwards. You could also locate the start of the current character given a random pointer into a byte buffer. RTFM [faqs.org]. UTF-8 can also directly encode 2 billion characters. UTF-8 is the right general solution to data interchange, and this is why it's catching on.
Re:You bring up a good point (Score:2)

by ncc74656 ( 45571 ) writes:

The plural of dish is dishes, but the plural of fish is fish

"Fishes" is also a valid plural form of "fish." "Fishes" refers to a group of different species, while the plural "fish" refers to a group that is all of the same species. The plecostomuses (sp?) and cichlids in my tank at home are fishes; the trout in a pond are fish.
(Your point that English has tons of rules and even more exceptions to those rules still stands, though.)

"A bunch of bananas" or "a group of individuals", are these plural or singular?

A bunch and a group are both singular, though some Brits would disagree (their usage used to treat a group as a plural object ("and the crowd are going wild!"), but that is starting to change in more recent usage).
Well DUH! It's not meant to have every character (Score:2)

by stienman ( 51024 ) writes:

Japanese alone learn some 50,000 symbols before they leave their 5th year of schooling. Unicode was never meant to hold one spot for every character. It was meant to be used as a set of code pages much like ascii was. But it had to be larger than 256 to hold a reasonably representative set of one language at one time (such as Japanese, or Chinese (two dialects), etc).

Most documents consist largely of one language, so you start the document by stating the code page you're using. Very few documents need more than one set of 65,536 characters, but you can intersperse sets if needed.

But the idea of having one universal character set is ludicrous. There are well over 140,000 language characters on this earth, and there are many yet to have been entered into a computer. Sure, we could use 4 bytes per character, but is it really necessary? Absolutely not! Talk about inefficient. The only case where that would be more efficient than code pages is when the majority of documents extensively use more than 64k characters within each document.

Besides, translation software is coming along well enough that soon we will not have to worry about it too much.

-Adam

This sig 80% recycled bits, 20% post user.
Re:too sinocentric, but Unicode has problems (Score:2)

by olevy ( 63189 ) writes:

I worked as a programmer in Japan for 4 year, and I've also done several projects in Unicode.

There are couple of things I would like to point out:

>>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.

>>The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.

Have you ever tried to program in shift-JIS? It is horrific. Basically they mix one byte and two byte characters. The problem is that if you jump into the middle of the string there is no way to know if you are looking at a one byte character or the second byte of a two byte character. You also can't do tell the number of characters in a string simply by looking at the length. It is a *terrible* standard.
Re:After some skimming... (Score:2)

by tytso ( 63275 ) writes:

Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.
Actually his original nalogy was flawed and designed to yank people's chain... A better analogy of what's going on would be to say that the Germans and the French wanted to have their own Unicode code point for the letter "A", since obviously the German A is very different from the French A. Repeat for all the letters in the alphabet. The excuses for saying that the German "A" should have a different value than a French "A" is (a) The Germans and the French hate each other, and (b) French tend to use a sans-serif'ed font. When told by the standards committee that font issues were independent of Unicode assignment, the response was this was obviously anti-European imperialism....

That's basically what's going on here with the folks who are complaining about Han Unification. Many Asian languages are desended originally from Chinese, just as many European languages are descended from Latin and Germanic roots. So it's not surprising that the systems of orthography share a lot in common. The difference is that each Asian country refuses to share any codepoints with any other Asian country, because They Hate Each Other, and there seems to be some widespread belief that doing so would somehow be causing their national language to lose face.

As someone who's Chinese, I think I can safely say to those people who like to bitch and moan about Han Unification..... Grow up!
Re:Alrighty (Score:2)

by hernick ( 63550 ) writes:

Have you ever seen an IME ? The program a Japanese person would use to enter their 10,000 characters ?

You spell out the word phonetically, and press space as you complete each word - the computer will show possible kanji, and you can cycle through them with the space key.

It actually works pretty well. Their keyboards pretty much look just like ours.
Flamebait :) (Score:3)

by phunhippy ( 86447 ) writes: <zavoid AT gmail DOT com> on Tuesday June 05, 2001 @07:24AM (#175532) Journal

Learn english.. 26 letters 10 numerals.. assorted punctuation.. ;)

Share
twitter facebook
too sinocentric, but Unicode has problems (Score:4)

by rjh3 ( 99390 ) writes: on Tuesday June 05, 2001 @09:25AM (#175550)

Ah, the horrors of Unicode. The referenced article is too Sinocentric. Unicode's problems go further. Unicode is both a european solution to european problems and a european solution to asian problems.

The Japanese hate Unicode. If you bother to ask them, which the web did not, you find a loud and impolite dislike for Unicode. The Japanese want their ISO 2022 solution, aka shift-JIS.

The history of encodings is roughly:
1. There was chaos.
2. Then there was ASCII (the roman alphabet) pleasing to latin and english speakers.
3. Then there were all the ISO 8859 and ISO 2022 encodings. These let all the european languages mix together with ASCII.
4. Then Japan, Korea, and Vietnam define their own ISO 2022 encodings that make sense in the local language, and let these languages mix together with the european languages and ASCII.
5. But ISO 2022 is a complex patchwork of special cases. So at the same time the Asians were inventing their ISO 2022 solutions, Unicode was being invented.
Unicode 1.0 provided a viable solution to modern european languages, but could not encode historical documents or asian languages properly. The Unicode 2.0 effort fixed the historical european language problem by adding in the alphabets for these "dead" languages. Unicode 2.0 brought the asian encodings to the point where they were usable.

Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.

Meanwhile China has a unique problem. They do not have an agreed alphabet. The Japanese all around the world agree on what characters define Kanji. There may be different fonts, but there is one agreed alphabet. Similarly, the Koreans and the Vietnamese have one agreed alphabet. These alphabets are huge, with thousands of characters, but they are fixed and agreed worldwide.

China has not agreed on an alphabet. Different regions use different alphabets. Chinese speak numerous different languages and have invented an amazing alphabet that works as a single writing form for all those languages. But there are disagreements. Furthermore, some regions of China are still inventing new letters for the alphabet. It is not a fixed and stable thing like european alphabets. You can invent new letters. (These really are new letters, not just new fonts.)

The Chinese have invented many encodings as a result. The two most popular (Big5 and GB2312) are not ISO 2022 compatible. There is a new, less widely used encoding that is a superset encoding of BIG5, GB2312, and other encodings, and that is ISO 2022 compatible.

Unicode did not accept the approach of leaving all these alphabets as different. They share most of their glyphs. Giving each region and language its own complete section would have blown the 50K limit of Unicode 2.0. They smushed all these different alphabets into one blob by combining anything that had similar glyphs into one character.

This left Unicode 2.0 telling the Chinese, ignore all those letters we don't like. You don't use them much anyhow. It destroyed any notion of alphabetic order in the encodings for any asian language. And it is usable for modern text communication. Unicode 3.0 promises to do better, and probably will.

But since all these languages can use the ISO 2022 encodings with fully compatable mixture of languages, why not just use ISO 2022 and forget Unicode? The problem is the patchwork nature of ISO 2022. The encoding rules are complex. ISO 2022 is a terrible internal format. A chinese character may take from 2 to 9 bytes to encode. And it gets worse as you dig further. UCS-2 and UCS-4 are very nice friendly internal formats for computers. It is trivial to convert from UCS-2 or UCS-4 into UTF-8 for transmission.

It is also pretty simple to translate from UCS-2 or UCS-4 into ISO 2022 encodings. So the ISO 2022 encodings actually can make sense for network transmission.

These issues will just get worse as you include other languages, like historical chinese, chinese border languages, and south asian languages. As with chinese, some of these have the fundamentally hard problem that they do not agree on a single alphabet.

Share
twitter facebook
Wrong, wrong! (Score:4)

by Mendax Veritas ( 100454 ) writes: on Tuesday June 05, 2001 @07:32AM (#175554) Homepage

UCS-2 is not the only form of Unicode, and it's well known that 64k characters isn't enough. Besides, why should ordinary ISO-8859 (Latin-1) text be doubled in size by making every character 16 bits? UTF-8 is a much better solution, and it is good enough. Granted, string handling with variable-length characters is a bit of a pain (especially if you're used to assuming that a buffer of N bytes is long enough for a string of N characters, or you want to scan the string backwards), but it's the best solution we've got. It's the recommended encoding for XML documents, and is used today in web browsers (check out that "Always send URLs as UTF-8" option in Internet Explorer).
It is a shame that there are so many different Unicode encodings. I think we ought to just standardize on UTF-8.

Share
twitter facebook
unicode does *not* encode 65,536 characters (Score:4)

by egomaniac ( 105476 ) writes: on Tuesday June 05, 2001 @07:46AM (#175565) Homepage

It encodes over one million codepoints, actually (the erroneous statements of other posters notwithstanding). All currently assigned Unicode characters exist within the basic Unicode Plane 0, as it's called, which handles ~50,000 characters. Twenty-some-odd-thousand of those characters are in the CJK block (Chinese, Japanese, and Korean characters).

Now, a range of Unicode characters is set aside for so-called "surrogates", and a high surrogate and a low surrogate character placed next to one another form a "surrogate pair" which specifies an extended character in UCS Plane 1. None of UCS Plane 1 codepoints are actually assigned to anything yet, but since there are about 2^20 (~one million) Plane 1 codepoints, they will easily handle all remaining glyphs with a ton left over. Tengwar, Klingon and others have all been considered for Plane 1 encoding (although I just checked and Klingon has been rejected. Sorry folks).

So, the simple fact is that anyone who says Unicode can't support enough characters has been smoking a bit too much crack lately. Do yourself a favor and go read the spec before getting your panties in a twist.

Share
twitter facebook
Unicode has this covered. (Score:3)

by tjwhaynes ( 114792 ) writes: on Tuesday June 05, 2001 @07:58AM (#175573)

Had this researcher bothered to read the Unicode technical introduction, the following would have been obvious.
In all, the Unicode Standard, Version 3.0 provides codes for 49,194 characters from the world's alphabets, ideograph sets, and symbol collections. These all fit into the first 64K characters, an area of the codespace that is called basic multilingual plane, or BMP for short.
There are about 8,000 unused code points for future expansion in the BMP, plus provision for another 917,476 supplementary code points. Approximately 46,000 characters are slated to be added to the Unicode Standard in upcoming versions.
The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.
Plenty of room.
Cheers,
Toby Haynes

Share
twitter facebook
Misconceptions in article (Score:3)

by HalfFlat ( 121672 ) writes: on Tuesday June 05, 2001 @09:26AM (#175579)
As a preliminary, Unicode and ISO 10646 aren't the same standard, but are kept pretty much in synchronisation. ISO 10646 provides a character set with a 4-byte representation, and a compatible smaller set with a 2-byte representation. These representations have encodings such as UTF-8, UTF-16, and UTF-32. UTF-32 encodes every Unicode character in 32 bits and can represent the full 2^31 codepoints, while UTF-8 and UTF-16 as described in the Unicode 3.1 document [unicode.org] are variable length representations that can represent approximately 2,100,000 and 1,100,000 codepoints respectively.

One of the design principles was to provide a lossless representation of any currently used character set in Unicode, so that a round-trip re-encoding of text from one encoding to Unicode and back again would lose no information. Another was to keep distinct code-points for any characters that had different semantics, or different 'abstract shapes'.

It turns out that one can satisfy these requirements for the Japanese kanji, Chinese hanzi (traditional and simplified) and Korean hanja without requiring a seperate code-point for each; in Unicode version 2.0, approximately 121,000 such characters were able to be represented in 20,902 code points. Note that those characters which have distinct shapes but the same meaning, and those which are similar enough to be classified as calligraphic variants but have distinct meanings, are all represented by distinct code-points. (One caveat: in practice there are some exceptions as regards the preservation of information after a round-trip encoding to Unicode and back. For example, the CCCII encoding of hanzi explicitly catalogues calligraphic variations, and as such doesn't map 1-1 onto Unicode.)
Of course, the actual glyph that corresponds to one of these unified codes will change depending upon the context in which it is rendered. For example the character 0x6d77 corresponding to the character for sea in both Chinese (Mandarin 'hai3') and Japanese ('umi') is drawn with one fewer stroke in Japanese than in Chinese. These typographical details are important, but can (and debatably, should) be dealt with outside the context of character encoding. Unicode has support for language tags which in the absence of any higher-level information can indicate the language context of the characters following them. Typically though, this information should be stored as part of a richer document structure (as is possible in XML for example.) Correct display of characters will require the presence of the appropriate font and a mechanism (such as LOCALE in a simple one language case) for selecting this font.

Given this unification then, one really can fit most of the characters for which there already extant (non-Unicode) encodings into 16 bits. With Unicode 3.1/ISO 10646-2 (which uses more than 65536 codepoints) this representation is AFAIK pretty much complete, including for example all of the hanzi of CNS 11643-1992 and CNS 11643-1986 plane 15 (the most complete hanzi encoding outside of CCCII.)

With this in mind, one can argue against the points raised in the article:
1. The unification scheme, allows the representation of the 170,000 characters the author calculates in 70,000 or so codepoints. Which it now does with Unicode 3.1. The use of external context is still necessary for correct rendering, but if the document has no structure for representing language context, there are Unicode language tags that can fill this role. Similarly, context would be required for the presentation of different calligraphic variants of Roman characters (e.g. fraktur.)
2. Unification is quite unlike the analogy described 'in Western Terms'. 'M' and 'N' could not be identified, as they semanticly distinguish words (e.g., 'rum' and 'run' have very different meanings.) Traditional characters and their simplified analogues are not identified under Unicode, so even if 'Q' were simply a fancier 'C' (which of course it is not), it wouldn't be given the same codepoint.
3. Unicode is not limited to 16 bits as stated in the introduction to the article. There are over 2000 million available codepoints in UCS-4 and UTF-8, and UTF-16 can represent approximately 1 million of these. There is plenty of room - even in UTF-16 - to encode more characters as the need arises.
4. With the exception of calligraphic variants in CCCII, Unicode can already faithfully represent characters in the major Chinese, Japanese and Korean character encoding standards.
A little bit of research by the article author would have made the article unnecessary.

References:
Unicode 3.1 document;
CJKV Information Processing, Ken Lunde.

PS: In the time it took me to read the article, do some research and write this response, there have been over 300 slashdot comments. Wow.
Share
twitter facebook
Re:Wrong, wrong! (Score:3)

by hackbod ( 133575 ) writes: <hackbod@enteract.com> on Tuesday June 05, 2001 @08:38AM (#175597) Homepage

People who think there is a problem with the number of different Unicode encodings -- including the authors of this article -- completely misunderstand how unicode works. The different encodings are -not- different character sets -- in fact, they are different ways to write the -same- standard Unicode character set. The transformation between UTF-8, UTF-16, and UTF-32 is only a simple bit minipulation -- it is completely independent of the character set.

An implication of this is that UTF-8, UTF-16, and UTF-32 can all express the EXACT SAME NUMBER OF CHARACTER CODES. So, if you think UTF-32 is good enough for you, then UTF-16 and UTF-8 are just as good. The latter two simply use multi-word or multi-byte sequences to express the upper character values.

After using BeOS for a number of years, where all character strings are natively handled as UTF-8, I am a very strong believer in Unicode. Having a Western perspective I may be missing something, but none of the "problems" mentioned in this article are actually problems that Unicode has.

Of course, once you start using Unicode, the main problem you are going to run in to is having fonts with the characters you need. And if the Chinese, Japenese, etc. really need 50,000 of their very own characters, then this is going to be that much more of a problem. Unforunately, there is no easy solution to this -- but it doesn't have anything to do with the encoding you use, so changing to another encoding is not going to help here.

Share
twitter facebook
Unicode and CJK Characters (Score:3)

by torokun ( 148213 ) writes: on Tuesday June 05, 2001 @11:39AM (#175614) Homepage

There are some good comments here, clarifying why this article is fundamentally wrong in its assumption that Unicode only encodes 2^16 characters. This is the first reason why this article is wrong.
The other reasons are more subtle, and I'm not sure that everyone here understands what's going on with CJK characters, so here's a little background.
The characters we're talking about originated in china, and spread to Korea, Vietnam, and Japan. Vietnam has switched to a western alphabet now, so let's leave them out. ;) At one point, although there have always been alternative forms for some characters, there was a reasonably standard set of Chinese characters used throughout these countries (recorded in the KangXi dictionary)...
The Japanese invented a number of their own characters, which I'm sure number less than 1000. Up until World War II, this was basically the situation. (So at this time, the required number of characters to encode would have been less than 50,000 -- Chinese characters and Japanese additions.) Then all hell broke loose, so to speak.
The Japanese simplified a large number of their characters systematically, immediately following WWII ( So they started substituting simpler characters for the disallowed ones in these compounds, and thereby subtly changed the meaning of the words.
On to China -- they also began a campaign of character simplification, which would span quite a few years, although theirs was much more radical than the Japanese approach. In fact, some of the simplified versions the government came out with were so repulsive, they were eventually retracted because everyone refused to use them. ;) So they ended up with a few thousand ( Finally, Korea, Taiwan, and Hong-Kong basically kept the traditional chinese characters.
So, that gives us the basic 40,000, plus 3000 Japanese (kokuji and shinjitai), plus maybe 10,000 chinese (jiantizi), plus some other stuff not mentioned here, giving a grand estimate of around 55,000.
The key to this is that the vast majority of characters used are common among all 5 locales. This was the only reason that anyone even attempted to encode the CJK characters in the first place. The re-unification of all the disparate character sets was called Han-Unification during the Unicode development process.
This, combined with the surrogate encoding area, ensures that there will be plenty of space for everyone... :)

Share
twitter facebook
Cultural Heritage is important! (Score:3)

by peccary ( 161168 ) writes: on Tuesday June 05, 2001 @08:30AM (#175618)

I mean, imagine how much pooerer you would be if you had been unable to read the epic poems of early Anglo-Saxon culture in their original form! Or the early Judaic and Greek writings on which much of our more recent culture is based.

You *have* read Beowulf, and the Canterbury Tales, haven't you? Along with Plato's Republic in Greek, and the Dead Sea scrolls?

Now imagine how hard this would be if your computer didn't support the full character set in which they were written.

Share
twitter facebook
Overstating and misunderstanding the problem (Score:4)

by kurisuto ( 165784 ) writes: on Tuesday June 05, 2001 @07:33AM (#175628) Homepage

This article mischaracterizes the issue concerning the Chinese characters. To take a western example as an illustration, the number one is handwritten in America as a vertical stroke, but in Germany as an upside-down V. However, folks in America and Germany agree that this is "the same character"; we simply have a different way of writing it. Unicode recognizes this sameness by assigning the same code for character for "one"; the way to display it locally is a presentation issue, not an encoding one.
This is exactly the issue with the Chinese characters. For a given character, there might be a difference between the Taiwanese way of writing it, the Japanese way, and the mainland Chinese way; but the character is still recognized as being the same, despite these presentation-level differences.
For someone to demand that each national presentation form have its own character code is to misunderstand what Unicode is designed for. It encodes abstract characters, not presentation forms. Unicode does not have separate codes for "A" in Garamond and "A" in Helvetica.

Share
twitter facebook
Solution - Everybody use Euro-English! (Score:4)

by saider ( 177166 ) writes: on Tuesday June 05, 2001 @07:34AM (#175638)

The European Commission has just announced an agreement whereby English will be the official language of the EU rather than German which was the other possibility. As part of the negotiations, Her Majesty's Government conceded that English spelling had some room for improvement and has accepted a 5 year phase-in plan that would be known as "Euro-English".

In the first year, "s" will replace the soft "c". Sertainly, this will make the sivil servants jump with joy. The hard "c" will be dropped in favour of the"k". This should klear up konfusion and keyboards kan have 1 less letter.

There will be growing publik enthusiasm in the sekond year, when the troublesome "ph" will be replaced with "f". This will make words like "fotograf" 20% shorter.

In the 3rd year, publik akseptanse of the new spelling kan be ekspekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of the silent "e"s in the language is disgraseful, and they should go away.

By the fourth year, peopl wil be reseptiv to steps such as replasing "th" with "z" and "w" with "v". During ze fifz year, ze unesesary "o" kan be dropd from vords kontaining "ou" and similar changes vud of kors be aplid to ozer kombinations of leters.

After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubl or difikultis and evrivun vil find it ezi to understand ech ozer. Ze drem vil finali kum tru!

Share
twitter facebook
ISO-2022-JP and "alphabetical order" (Score:4)

by achurch ( 201270 ) writes: on Tuesday June 05, 2001 @06:36PM (#175644) Homepage

>>Japan and Korea get no benefit from Unicode. In fact, their ISO 2022 encodings are at least in "alphabetical order" for the relevant alphabets. Unicode is just a jumble.
I can't speak for Korean, but there is no such thing as an alphabetic order for Kanji. In Japanese, Kanji almost always have at least two pronunciations, and often more.
While it is true that most all kanji have multiple pronunciations, the kanji in ISO-2022-JP are most definitely in order. Level 1 characters (0x3021-0x4F7E) are ordered by their primary reading, and Level 2 characters (0x5021-0x7426?) are ordered first by radical and then by number of strokes. In both cases it's easy to locate a character if for some reason you can't type it normally (e.g. it's not in your IME dictionary)--I've had to do this on occasion, in fact.
Unicode is, for all intents and purposes, completely random. Even without the problems of characters being inappropriately merged, there is no way you could try and find a character in Unicode; if your dictionary doesn't have it, tough luck. To me, that's an even scarier concept: for all practical purposes it could eliminate characters from the language. After all, if nobody can type it who's going to use it?
Have you ever tried to program in shift-JIS? It is horrific.
I will agree with this. Leaving aside the original poster's confusion of ISO-2022-JP and shi[f]t-JIS (the former is the official standard, aka JIS, while the latter is a poorly-thought-out Microsoft hack), dealing with strings that contain both half-width (1-byte) and full-width (2-byte) characters is a major PITA. About the only thing that can be said for it is the number of bytes is equal to the number of half-width character positions needed; and even that only applies to EUC and SJIS, since JIS has escape sequences to squeeze everything into 7-bit characters.
On the other hand, there's the character order consideration, which along with the problem of merged characters seems to be what draws so much dislike for Unicode from Japanese.

--
BACKNEXTFINISHCANCEL

Share
twitter facebook
After some skimming... (Score:4)

by update() ( 217397 ) writes: on Tuesday June 05, 2001 @07:26AM (#175661) Homepage

I planned to read this through before posting. I really did. But then, in the second paragraph I hit:
Wieger's seminal book about the characters and construction of China, published in 1915, was to become the defacto source against which all others would (and still should) be compared - with several caveats. Amongst these is a noticeable bias on his part against Taoism which becomes more evident in his analysis of the Tao Tsang (i.e., Taoist Canon of Official Writings [written 'DaoZang' in the PinYin Romanization of Mainland China] )
and I decided to skim the rest.
To summarize, for those whose eyes completely glazed over, his point is that Unicode doesn't sufficiently cover the full range of Chinese characters and that not using a larger set is a result of a longstanding Western prejudice that the Chinese don't need so many characters.
Now, I'm not Chinese so my opinion counts for little here, but my impression is that Unicode isn't nearly as controversial as he makes it out. His analogy "To express it in Western terms, how would English-speakers like it if we were suddenly restricted to an alphabet which is missing five or six of its letters because they could be considered "similar" (such as "M" and "N" sounding and looking so much like each other) and too "complex" ("Q" and "X" - why, they are the nothing more a fancier "C" and an "Z")." ignores the fact that Chinese orthography has a tradition of simplification and variants. I suspect Unicode is a lot more upsetting to a "reference writer specializing in rare Taoist religious texts and medical works" than to ordinary Chinese users who want to run Photoshop or put their wedding pictures on a web page.

Unsettling MOTD at my ISP.

Share
twitter facebook
In other news... (Score:3)

by ackthpt ( 218170 ) writes: on Tuesday June 05, 2001 @07:45AM (#175662) Homepage Journal

Bush bolts GOP to join Democrats, fires entire Whitehouse staff
Linus Torvalds to join Microsoft as OfficeXP advocate
NASA on Moonshots, "Ok, ok, they were all actually faked on a soundstage in Toledo, Ohio and the ISS is really in a warehouse in Newark, New Jersey"
Oracle CEO, Larry Ellison to give fortune to charity, dumps japanese kimonos for Dockers and GAP T-shirts
RIAA to drop all charges against Napster, "All a big fsck-up, we'll all get rich together"
Taiwan throws in towel, joins PRC, turning over massive US military and intelligence assets
Rob Malda signed by Disney, epic picture planned, based upon this short. Sez Malda, "Anime's not mainstream enough anyway." [cmdrtaco.net]

--
All your .sig are belong to us!

Share
twitter facebook
So... (Score:4)

by ackthpt ( 218170 ) writes: on Tuesday June 05, 2001 @07:15AM (#175663) Homepage Journal

4av3 3v3r0n3 1n t4e w0r1d 13arn t0 typ3 l33t!

--
All your .sig are belong to us!

Share
twitter facebook
No, _n_ bytes per character! (Score:3)

by The Monster ( 227884 ) writes: on Tuesday June 05, 2001 @11:51AM (#175671) Homepage
Sort of. You define a 32-bit space for now, then use something like UTF-8 to encode it.
Personally, I think UTF-8 is just a wee bit inefficient. I worked out a scheme long ago that defines a theoretically infinite namespace, and encodes 7-bit ASCII exactly the same as it is now. If anyone cares, it's as simple as this:

A "character" is defined as a sequence of bytes ("octets" for the RFC-phile) that

ends with a value which has the most-significant bit clear. (If you treat byte as unsigned, this means nonnegative; if signed, it's < 128, whichever test you'd prefer to code. I have my preference...)

This gives 2^(7 * n)possible characters of length n:
1. 128.
2. 16,384, cumulative 16,512.
3. 2,097,152, cumulative 2,113,664.
4. 268,435,456, cumulative 270,549,120.
5. 34,359,738,368, cumulative 34,630,287,488.
6. 4,398,046,511,232, cumulative 4,432,676,798,720.
7. ...
As you can see, 3 bytes allow encoding that covers pretty much every estimate I've seen here.
The system can be arbitrarily extended any time it's necessary, and existing agents that understand the fundamental rule would know how to parse these extended characters; although they would not know how to present the characters, they would be able to present an appropriate token indicating this fact, rather than displaying gibberish composed of the 8-bit "ascii" encoding they do understand.
Share
twitter facebook
Unicode's reply (Score:4)

by roozbeh ( 247046 ) writes: on Tuesday June 05, 2001 @03:49PM (#175677) Homepage

It's probably too late, but following is a reponse from on of the editors of the Unicode Standard:
Dear Mr. Carroll,
I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."
Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He make a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.
Here are some specific comments on items in the article which are either misleading or outright false.
Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe *any sound* the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul are not even remotely close to describing any sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).
In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate *every single written language* on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is *not* about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.
Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the *current* version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.
*Even if* Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accomodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."
Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard is. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:
http://www.unicode.org/unicode/uni2book/appA.pdf [unicode.org]
The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.
Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main *Japanese* national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So assertions such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" is not only false but also scurrilously misrepresents the actual cooperation that took place among all the participants in the process.
Your (Mr. Carroll's) editorial observation that "It is only when you get *all* the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have *not* been neglected.
And your assertion that many Westerners have a "tendency .. to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are *the* major references for Classical Chinese -- the Siku Quanshu *is* the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).
Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed"-- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires *without* taking Han unification into account. In fact, many *more* than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.
Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two *separate* 16 bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my! In fact, nothing of the kind is going on, and these are completely misleading metaphors.
The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on my the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the *solution* to the problem, enabling worldwide interoperability, rather than obstructing it.
And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.
Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".
In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.
Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.
--Ken Whistler, B.A. (Chinese), Ph.D. (Linguistics),
Technical Director, Unicode, Inc.
Co-Editor, The Unicode Standard, Version 3.0

--

Share
twitter facebook
Pictographs suck (Score:3)

by cryptochrome ( 303529 ) writes: on Tuesday June 05, 2001 @08:42AM (#175694) Journal

For crying out loud, somebody tries and do something nice for somebody and they come back and accuse them of cultural chauvanism. The powers that be didn't have to develop unicode or UCF at all. They only developed it because of the proliferation of language protocols was making the internet difficult to use for foreign languages and multinational businesses in general.

And besides which, the point of the article is moot. As this article states:

ISO 10646 defines formally a 31-bit character set. However, of this huge code space, so far characters have been assigned only to the first 65534 positions (0x0000 to 0xFFFD). This 16-bit subset of UCS is called the Basic Multilingual Plane (BMP) or Plane 0. The characters that are expected to be encoded outside the 16-bit BMP belong all to rather exotic scripts (e.g., Hieroglyphs) that are only used by specialists for historic and scientific purposes. Current plans suggest that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters.

The italics and bold are mine. The 16 bit system was not meant to be completely comprehensive - it was meant to be useful for everyday use. Which, since it covers the characters literate people are expected to know in these systems, it does. The rest of the characters are academic (literally). If these characters are so important why don't they expect all of their own countrymen to know them?

The proprietors of the internet could have happily stuck with the regular 8-bit Roman alphabet system forever (the internet being an American military invention in the first place). The roman alphabet was just part of the system. Hell, even a 16-bit code would have covered all script-based writing and scientific/miscellaneous notation systems easily, while leaving codes or a dedicated bit for the eastern pictograph systems to signal an extension of the protocol and letting them work out their own standard amongst themselves. It would have been fun to watch them (particularly Taiwan and China) squabble for dominance over it too. No one is forcing these eastern nations (or any non-roman-alphabet users) to use unicode or UCF, or the internet or computers for that matter. If they really wanted to, they could come up with their own systems based on their own languages. They just hopped on board and adapted it to their own needs like everyone else because it's a good idea, and it would be way to difficult to build around their own languages. But isn't it funny how every one of these eastern countries (except Japan thanks to hiragana and katakana) adapted the phonic roman alphabet to simplify the teaching of their own languages? With at least 170,000 characters between them, defenders of these languages claim they are a rich cultural heritage and a beautiful illustrated system. You could just as easily say that modern use of these pictograph-based written languages are oppressively difficult and ensure a lot of time and effort wasted just trying to learn to write at best, and a stratifying system which guarantees high rates of illiteracy at worst. Erosion of these rigid and limited pictographic writing systems in favor of flexible and encompassing phonic ones is no accident or western conspiracy. Just as UCF was developed to make computer communication universal, the adaptation of phonic systems is the tendency to make literacy universal.

cryptochrome

P.S. Some may think that ISO 10646 (aka UCF-2) is not Unicode, but in fact as that same article points out "They joined their efforts and worked together on creating a single code table. Both projects still exist and publish their respective standards independently, however the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the code tables of the Unicode and ISO 10646 standards compatible and they closely coordinate any further extensions. "

Share
twitter facebook
I had no trouble reading that at all (Score:3)

by cryptochrome ( 303529 ) writes: on Tuesday June 05, 2001 @09:18AM (#175695) Journal

The irony of that message being marked as funny(adapted as it is from Mark Twain) is that after a few seconds to adjust, I had no trouble reading that statement at all.

We tend to forget that there have been a lot of different spelling and notation systems for english. Even today, the british and american methods aren't identical. For all the fun we make and fear we have of the idea that the english (or any other language's) orthographic system should be simplified and made consistent with pronunciation, it is not a bad idea [whowhere.com]. It would greatly simplify the process of becoming literate and save tons of effort spent trying to learn irregular spellings. Beyond that, applying the same principles to pronunciation, the alphabetic letters (children's difficulty distinguishing b and d is universal), and vocabulary would accomplish the same goals with learning and using language.

cryptochrome

P.S. You forgot to mention dropping that pesky capitalization system. of course half the messages on the net don't both with it. same thing goes for dealing with contractions, a la dont, wont, ill, and so on.

Share
twitter facebook
Hmm.. I must have been using something else then? (Score:4)

by vidarh ( 309115 ) writes: <vidar@hokstad.com> on Tuesday June 05, 2001 @07:25AM (#175711) Homepage Journal

I've been using Unicode in various incarnations for a long time. And UCS-2 is not the only way to encode Unicode. UTF-8 is perhaps a lot more widespread, as it is the defacto standard encoding for exchange of XML documents over the web.
UCS-4 is also quite common, and allows for the new extensions.
UTF-16 is used by some that needs to extend their UCS-2 applications to UTF-16, or that mostly need text that work with UCS-2, but wants to be prepared for more.
Yes, a lot of things are difficult with Unicode. But if you look at most recent internationalization efforts, unicode is what people use.

Share
twitter facebook
Re:You bring up a good point (Score:4)

by SpeelingChekka ( 314128 ) writes: on Tuesday June 05, 2001 @09:19AM (#175717) Homepage

Does anyone know a a real language that has a simpler writing system than english?
Spoken like a true English-is-my-home-language person. English is NOT a simple language by any means, ask any foreigner who has learned English. Almost every rule in English has several exceptions, and many things in English cannot be deduced from rules, they must simply each be learned, and there are hundreds of these. Pronunciation is ridiculous, which you've mentioned, but apart from pronunciation is grammar, spelling, plural forms, tenses and possessive forms, all of these have strange nuances in English. The plural of dish is dishes, but the plural of fish is fish - sorry, no rule you can deduce that from, you must just learn that. The past tense of "hang" depends on what is getting hung/hanged. The rule says "add an apostrophe s" for possessive form, but of course there are exceptions, e.g. "it" "her" etc, or when the subject is a plural already, then you add an apostrophe but no "s". And the rules for when something is a plural "are" not always clear (and thus even educated people often aren't sure whether to use "are" or "is"). "Bananas are nice" is easy, but "A bunch of bananas" or "a group of individuals", are these plural or singular? And the examples get more and more complex. And there are obscure rules such as '"their" may be used in place of "his/her". And there are so many exceptions to rules like "i before e except after c", rules which many educated people even sometimes struggle to remember. I can name many University educated adults with English as their first language who still don't even know the difference between "lend" and "borrow" - that says something about the language.
I'm glad English is my home language, but I feel sorry for foreigners who have to learn English as a second language.
Is it just a coincidence that the simplest writing system was the first to be digitized
Yes, actually, it is. ASCII was probably the first wide-scale character set standard used in computing - what does the "A" stand for?

Share
twitter facebook
More Flamebait :) (Score:3)

by bark76 ( 410275 ) writes: on Tuesday June 05, 2001 @07:40AM (#175724)

Maybe if people didn't try to get character sets like Klingon [dkuug.dk], Cirth and Tengwar [unicode.org] added into unicode we wouldn't have this problem!

Share
twitter facebook
another drawback of unicode (Score:3)

by Magumbo ( 414471 ) writes: on Tuesday June 05, 2001 @07:32AM (#175729)

And we must not forget about hierogliphics. Unicode certainly has forgotten about them. That would be so cool to write perl code with little cats, birds, ankhs, and various other squiggles.

--

Share
twitter facebook

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Languages (Score:2)

Re:Unicode's reply (Score:2)

Re:Languages (Score:2)

Re:Unicode's reply (Score:2)

Re:Unicode's reply (Score:2)

Re:Conspiracy Theories and Unicode (Score:2)

Re:Statelessness of text (Score:2)

Re:Unicode's reply (Score:2)

Re:Conspiracy Theories and Unicode (Score:2)

Re:Unicode's reply (Score:2)

Re:Unicode's reply (Score:2)

Re:Unicode's reply (Score:2)

Re:Unicode's reply (Score:2)

Re:Unicode Character Set vs Character Encoding (Score:2)

Unicode Character Set vs Character Encoding (Score:5)

Re:Duh. (Score:2)

Unicode includes all common Asian character sets (Score:3)

Babel (Score:2)

Use UTF-8, don't worry about sizes (Score:3)

Re:2 + 1 bytes? (Score:2)

Chinese language(s) (Score:2)

Re:Overstating and misunderstanding the problem (Score:2)

Case (Score:2)

Re:UCS-4 (Score:2)

Re:ASCII stupidity all over again... (Score:2)

Re:ASCII stupidity all over again... (Score:2)

Re:Arabic space (Score:2)

Re:Arabic space (Score:2)

Re:UTF-8 should be fine for almost any application (Score:5)

Re:All Character sets simultaneously?? (Score:2)

Correct, also see link (Score:2)

Re:You bring up a good point (Score:3)

Re:Unicode's reply (Score:2)

Re:No, _n_ bytes per character! (Score:2)

Re:Unicode's reply (Score:2)

Re:ISO-2022-JP and "alphabetical order" (Score:2)

Re:Unicode's reply (Score:2)

Re:Unicode's reply (Score:2)

Re:Unicode's reply (Score:2)

Re:Esperanto, Ido, lojban; BCE (Score:2)

Re:It works (Score:2)

No you didn't read the article, or even think (Score:2)

Re:C programs (Score:2)

What about the artist formerly known as Prince? (Score:2)

Re:Quit whining and move to a phonetic alphabet (Score:2)

Re:Languages (Score:2)

Re:Well DUH! It's not meant to have every characte (Score:2)

Re:After some skimming... (Score:2)

Some errors (Score:5)

printing [ Chinese ] vs dictionary [ Chinese } (Score:2)

totally unconvinced (Score:2)

Re:Solution - Everybody use Euro-English! (Score:2)

Re:You bring up a good point (Score:5)

Re:Quit whining and move to a phonetic alphabet (Score:2)

Re:UTF-8 should be fine for almost any application (Score:3)

Re:unicode does *not* encode 65,536 characters (Score:2)

Re:No you didn't read the article, or even think (Score:2)

Re:No you didn't read the article, or even think (Score:2)

Re:After some skimming... (Score:3)

Re:Quit whining and move to a phonetic alphabet (Score:2)

This is so wrong (Score:2)

Compaction and Traction (Score:2)

Re:Compaction and Traction (Score:2)

Re:Quit whining and move to a phonetic alphabet (Score:2)

Quit whining and move to a phonetic alphabet (Score:3)

Nordic Runes? (Score:2)

A Plan for the Improvement of English Spelling (Score:4)

Re:Perl in Hierogliphics (Score:2)

Re:Is this really such a problem? (Score:2)

Re:ASCII stupidity all over again... (Score:2)

It works (Score:2)

Re:Wrong, wrong! (Score:2)

Re:You bring up a good point (Score:2)

Well DUH! It's not meant to have every character (Score:2)

Re:too sinocentric, but Unicode has problems (Score:2)

Re:After some skimming... (Score:2)

Re:Alrighty (Score:2)

Flamebait :) (Score:3)

too sinocentric, but Unicode has problems (Score:4)

Wrong, wrong! (Score:4)

Re:unicode does not encode 65,536 characters (Score:2)

unicode does not encode 65,536 characters (Score:4)