Why Unicode Will Work On The Internet

Posted by timothy on Sat Jun 09, 2001 12:00 PM
from the contrary-viewpoints dept.
Ken Whistler sent in the lengthy and rather pointed response below to the article "Why Unicode Won't Work On The Internet," which was posted to Slashdot on Tuesday.


I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."

Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background East Asian historical discussion, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He makes a number of egregiously incorrect statements about both, which call into question the quality of research which went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.

Here are some specific comments on items in the article which are either misleading or outright false.

Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe any sound the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul is not even remotely close to describing any sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).

In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate every single written language on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is not about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.

Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the current version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.
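The difference between codespace and repertoire can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not anything taken from the standard itself; it assumes the usual layout of 17 planes of 65,536 code points each (a detail not spelled out above), and the variable names are illustrative only.

```python
import sys

PLANES = 17                  # assumption: planes 0 through 16
POINTS_PER_PLANE = 0x10000   # 65,536 code points per plane

# Codespace: every position where a character could be encoded.
codespace = PLANES * POINTS_PER_PLANE
print(codespace)             # 1114112

# Python's own upper bound on code points is the top of that codespace.
assert codespace == sys.maxunicode + 1

# Repertoire: characters actually encoded, per the figures cited in the text.
encoded_v3_0 = 49_194
encoded_v3_1 = 94_140
print(f"Assigned in 3.1: {encoded_v3_1 / codespace:.1%} of the codespace")
```

The point of the sketch is only that the second pair of numbers counts assigned characters, while the first counts available positions.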

Even if Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accommodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."

Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false is Mr. Goundry's implication that the Asian countries were left out of the consideration of encoding of Han characters in the Unicode Standard. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:

http://www.unicode.org/unicode/uni2book/appA.pdf

The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable ..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.

Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main Japanese national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So assertions such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" are not only false but also scurrilously misrepresent the actual cooperation that took place among all the participants in the process.

Your (Mr. Carroll's) editorial observation that "It is only when you get all the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have not been neglected.

And your assertion that many Westerners have a "tendency .. to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are the major references for Classical Chinese -- the Siku Quanshu is the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).

Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed"-- and is just bogus. First of all the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires without taking Han unification into account. In fact, many more than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.

Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two separate 16-bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.
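The bit-width comparison is easy to verify. A small Python sketch, assuming only the figures quoted in this letter and taking U+10FFFF as the highest code point:

```python
MAX_CODE_POINT = 0x10FFFF

# Bits needed to represent the highest code point.
bits = MAX_CODE_POINT.bit_length()
print(bits)                   # 21

# What an 18-bit space would offer, compared with the actual codespace.
print(2 ** 18)                # 262144
print(MAX_CODE_POINT + 1)     # 1114112

assert 2 ** 18 >= 170_000             # 18 bits would cover the 170,000 figure
assert MAX_CODE_POINT + 1 > 2 ** 18   # and the real codespace is larger still
```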

The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my! In fact, nothing of the kind is going on, and these are completely misleading metaphors.

The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on by the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the solution to the problem, enabling worldwide interoperability, rather than obstructing it.

And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.

Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
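The interoperability of the three encoding forms can be illustrated with a short round trip. This is a hedged sketch using Python's built-in codecs; the sample string is an arbitrary choice (a Latin letter, a Han character from the Basic Multilingual Plane, and U+20000, a Han character beyond it).

```python
# 'A', U+4E2D (BMP Han character), U+20000 (supplementary-plane Han character)
text = "A\u4e2d\U00020000"

sizes = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(encoding)
    # Each encoding form represents the same three characters losslessly.
    assert data.decode(encoding) == text
    sizes[encoding] = len(data)
    print(f"{encoding:10} {len(data):2} bytes  {data.hex()}")
```

Only the byte counts differ between the three forms; the decoded text is identical in every case.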

And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".
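To show what the "architectural complexity" of UTF-16 amounts to, here is a minimal Python sketch of the surrogate-pair calculation next to the flat UTF-32 form. The helper name `to_surrogates` is hypothetical; the built-in codecs are used only to cross-check the result.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

cp = 0x20000
high, low = to_surrogates(cp)
print(hex(high), hex(low))               # 0xd840 0xdc00

# UTF-16 stores the pair; UTF-32 stores the code point itself.
assert chr(cp).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert chr(cp).encode("utf-32-be") == cp.to_bytes(4, "big")
```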

In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.

Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be a hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.


Whistler is Technical Director for Unicode, Inc. and co-editor of The Unicode Standard, Version 3.0. He holds a B.A. in Chinese and a Ph.D. in Linguistics.

This discussion has been archived. No new comments can be posted.
  • Re:Huh? by Anonymous Coward (Score:1) Saturday June 09 2001, @08:52AM
  • Original Article now has this message prefixed .. by Anonymous Coward (Score:1) Saturday June 09 2001, @09:44AM
  • Re:yes, unicode works, but is unnecessary. by Anonymous Coward (Score:1) Saturday June 09 2001, @12:20PM
  • Goundry doesn't understand Chinese by Anonymous Coward (Score:1) Saturday June 09 2001, @03:48PM
  • Re:You're wrong by Anonymous Coward (Score:1) Saturday June 09 2001, @05:42PM
  • Re:yes, unicode works, but is unnecessary. by Anonymous Coward (Score:1) Sunday June 10 2001, @03:35AM
  • Re:Lack of editorial control by Anonymous Coward (Score:2) Saturday June 09 2001, @08:38AM
  • yes, unicode works, but is unnecessary. by Anonymous Coward (Score:2) Saturday June 09 2001, @08:48AM
  • Point-Counterpoint (Score:3)

    by Anonymous Coward on Saturday June 09 2001, @08:49AM (#163987)
    "Why Unicode Won't Work on the Internet" vs. "Why Unicode WILL Work on the Internet". Are we now going to see rebuttals to all Slashdot stories?
    • "Bill Gates is a Tool of the Devil" vs. "Microsoft is a Boon to Our Capitalist Economy"
    • "Real Hackers Use 'vi'" vs. "Emacs is the One True Editor"
    • "Napster Increases Music Sales" vs. "You're All a Bunch of Hypocritical Pirates"
    • "The Government is Spying on Us" vs. "Big Brother is Your Friend fnord"
    • "Jon Katz is a Pandering Conceited Hack" vs. "Jon Katz is a Pandering Conceited Hack"
  • Re:yes, unicode works, but is unnecessary. by Ian Bicking (Score:1) Saturday June 09 2001, @08:31PM
  • I use Unicode, and am very thankful for it.

    I am encoding sets of articles on international subjects (the most recent is a set of essays on libraries around the world). While I'm thankfully avoiding any issue of East Asian character sets, characters from multiple character sets do happen in a single document. Also, it's nice to basically leave the character set issue out entirely -- I use one character set, UTF-8. Keeping track of character set is a PITA.

    Really, Unicode just eliminates a whole class of issues -- many of which are currently solvable, but with Unicode they simply aren't a problem at all. As it becomes better supported -- and Unicode is already quite well supported -- I think most places will be using it.

    Also, if there's redundancy in Unicode, I imagine most of that space could be saved with gzip, which also has good support over the web, though like Unicode is far underused.
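The gzip point can be tried out with the standard library. This is a rough sketch only: the sample text is invented and highly repetitive, so it compresses far better than real articles would, and the exact sizes will vary.

```python
import gzip

# Invented sample text; repeated so there is something to compress.
sample = "Libraries around the world: bibliothèque, Bibliothek, biblioteca. " * 40

results = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    raw = sample.encode(encoding)
    packed = gzip.compress(raw)
    results[encoding] = (len(raw), len(packed))
    print(f"{encoding:10} raw={len(raw):6}  gzip={len(packed):5}")
```

On text like this, the wider encoding forms are much larger raw, but most of that difference disappears after compression.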

  • Re:Unicode's Universality. by spitzak (Score:2) Saturday June 09 2001, @04:44PM
  • Re:Unicode's Universality. by dvdeug (Score:2) Saturday June 09 2001, @11:59PM
  • Once something by jjr (Score:1) Saturday June 09 2001, @08:20AM
  • by hpa (7948) on Saturday June 09 2001, @10:40AM (#163993) Homepage
    Actually, the reason we're not seeing mixed languages used a lot is that the infrastructure hasn't been in place. Mixing languages within a system is frequently highly desirable, especially when you consider what "mixing languages" really means. Yes, one can go the Microsoft route and have a completely different system for each language, including different APIs, but that doesn't really help me as a Swedish-speaking individual who wants to do my Japanese homework online.


    <...> tags isn't the way to go, either; the only way you can make that work sensibly is by having a single encoding internally, which is typically going to be Unicode.


    That being said, there is no question it will take time to catch on, as people open up to the abilities that this provides, and tools start supporting it.

  • Re:yes, unicode works, but is unnecessary. by Sentry21 (Score:1) Saturday June 09 2001, @04:17PM
  • Voting on Slashdot policy? by Delphis (Score:2) Saturday June 09 2001, @09:23AM
  • Re:Unicode on slashdot... by IntlHarvester (Score:1) Saturday June 09 2001, @03:23PM
  • Re:Huh? by dirty (Score:1) Saturday June 09 2001, @09:31AM
  • Re:Lack of editorial control by Graymalkin (Score:1) Saturday June 09 2001, @11:05AM
  • Re:Lack of editorial control by Graymalkin (Score:1) Saturday June 09 2001, @05:12PM
  • Re:Lack of editorial control by Graymalkin (Score:2) Sunday June 10 2001, @12:49PM
  • Re:Another Unicode Character we need.... by sharkey (Score:1) Saturday June 09 2001, @03:27PM
  • one for "Damn, he put the smack down on that bitch!"

    Excellent rebuttal.... good to see this type of discussion on Slashdot again.
  • Chinese problem by JJ (Score:1) Saturday June 09 2001, @08:34AM
  • by jmauro (32523) on Saturday June 09 2001, @10:00AM (#164004) Homepage
    There are no legal issues with lots of different people hitting a site at the same time. None of the traffic is intended to cause harm (everyone just wants to read the story), and the information is publicly available. If the site can't handle a lot of people getting the information at once, then it is the fault of the administrators, not of Slashdot or the public at large. The administrators of the site make determinations about bandwidth, virtual servers, computers, etc, etc. They make this determination based on the average number of users on the site. If they are wrong, thanks to slashdot, then they are wrong. It happens. The Slashdot Effect is annoying, but there would be many, many more legal problems with directly copying the content (i.e. caching) than with just telling everyone else to read the story. No one gets sued for telling everyone they know there is a cool article in Time, and everyone needs to go check it out. They get sued over copying the article and posting it on their web site.
  • Re:yes, unicode works, but is unnecessary. by Another MacHack (Score:1) Saturday June 09 2001, @12:18PM
  • by Another MacHack (32639) on Saturday June 09 2001, @11:59AM (#164006)
    Slashdot serves pages as "Content-Type: text/html; charset=iso-8859-1"

    Under that encoding, your Japanese text looks like: [227: small a, tilde] [129: out of range] [170: feminine ordinal] [227: small a, tilde] [129: out of range] [171: left guillemet] [239: small i, dieresis] [188: fraction one fourth] [159: out of range] instead of [\u306a: HIRAGANA LETTER NA] [\u306b: HIRAGANA LETTER NI] [\uff1f: FULLWIDTH QUESTION MARK]

    Your browser is playing games with you if it's displaying text marked as iso-8859-1 as utf-8 without user intervention. It "works fine" only on browsers which second-guess the charset field.
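The byte values listed above can be reproduced directly. A minimal Python sketch, assuming the three characters named in the comment and using the standard `iso-8859-1` codec to stand in for what a browser honouring the declared charset would do:

```python
# HIRAGANA LETTER NA, HIRAGANA LETTER NI, FULLWIDTH QUESTION MARK
text = "\u306a\u306b\uff1f"

utf8_bytes = text.encode("utf-8")
print(list(utf8_bytes))      # [227, 129, 170, 227, 129, 171, 239, 188, 159]

# Decoding as ISO-8859-1 never fails: each byte becomes one (wrong) character.
misread = utf8_bytes.decode("iso-8859-1")
print(len(text), len(misread))   # 3 9

# The damage is reversible only if you know which wrong charset was applied.
assert misread.encode("iso-8859-1").decode("utf-8") == text
```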
  • Re:Probably because... by mughi (Score:1) Saturday June 09 2001, @12:34PM
  • Re:yes, unicode works, but is unnecessary. by mughi (Score:1) Saturday June 09 2001, @12:53PM
  • Re:yes, unicode works, but is unnecessary. by mughi (Score:1) Saturday June 09 2001, @01:00PM
  • Re:Glyphs versus characters in Castillian by mughi (Score:1) Saturday June 09 2001, @01:25PM
  • Re:Unicode's Universality. by mughi (Score:2) Saturday June 09 2001, @01:03PM
  • Re:Unicode on slashdot... by ncc74656 (Score:2) Saturday June 09 2001, @02:35PM
  • Original article was just ignorant FUD by divec (Score:1) Saturday June 09 2001, @08:10AM
  • Re:Lack of editorial control by Hard_Code (Score:2) Monday June 11 2001, @05:15AM
  • Re:yes, unicode works, but is unnecessary. by Old Wolf (Score:1) Saturday June 09 2001, @01:22PM
  • You're wrong (Score:4)

    by Hydrophobe (63847) on Saturday June 09 2001, @01:44PM (#164016)
    If I am wrong here, I would love to be set straight by someone better informed.

    You're wrong. In 1994 Spanish stopped considering ch and ll as separate letters [www.rae.es] for dictionary ordering purposes.

  • Re:Original article was just ignorant FUD by MikeBabcock (Score:2) Saturday June 09 2001, @08:44AM
  • Multilingual domain names by LokiFox (Score:1) Sunday June 10 2001, @04:05AM
  • Re:Unicode on slashdot... by mcjulio (Score:1) Saturday June 09 2001, @07:14PM
  • Re:yes, unicode works, but is unnecessary. by mcjulio (Score:1) Saturday June 09 2001, @07:22PM
  • Re:Huh? by tommck (Score:1) Saturday June 09 2001, @08:47AM
  • Re:Unicode on slashdot... by locoluis (Score:1) Tuesday June 12 2001, @09:16AM
  • by dsplat (73054) on Saturday June 09 2001, @09:49AM (#164023)
    I do not have the knowledge of Asian languages necessary to evaluate the arguments on each side. However, even if Unicode were entirely inadequate for Asian languages it has already started to solve a very real problem. Unicode font sets representing the European characters encoded by it are now available on a variety of platforms. This makes possible a single character set with which the speakers of a number of languages can exchange information. Tools can support Unicode rather than lots of character sets. If you speak a language that isn't a common target for localization of software, this makes a wider range of tools available to you for processing data in your own language.
  • Re:Unicode on slashdot... by selectspec (Score:1) Saturday June 09 2001, @08:28AM
  • by selectspec (74651) on Saturday June 09 2001, @08:20AM (#164025)
    ASSERT(SlashdotPost == fud);

    Assertion failed line 1.

    Intelligent Commentary posted on /. !!! Somebody knows what they are talking about. Shutdown the servers. MySQL must be acting up again! ECC failure! The CPU is running too hot! Drive failure! Katz must be out sick!

  • Possible solution to FP madness by mrogers (Score:2) Monday June 11 2001, @06:08AM
  • Re:Probably because... by Tiroth (Score:1) Saturday June 09 2001, @08:58AM
  • Re:yes, unicode works, but is unnecessary. by Tiroth (Score:1) Saturday June 09 2001, @10:50AM
  • Re:Unicode on slashdot... by Tiroth (Score:2) Saturday June 09 2001, @08:54AM
  • Re:Unicode on slashdot... by Tiroth (Score:2) Saturday June 09 2001, @05:16PM
  • by Tiroth (95112) on Saturday June 09 2001, @09:02AM (#164031) Homepage
    The reason it is useful is because with Unicode you don't need to guess the encoding. You can't rely on document creators to tag their documents; plenty of times my browser has guessed the wrong encoding for a page, and that requires manually going in and changing it. Can you imagine an automated system printing out reams from a website and then realizing the encoding was fubar?

    Also consider how often one might want Chinese/Korean/Japanese/German/French/English in the same document. (product manual?) Unicode easily handles this...localized standards don't.
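The guessing problem can be sketched in a few lines of Python. The sample text and the three candidate legacy encodings are arbitrary choices for illustration; the point is only that untagged bytes decode "successfully" under several of them, whereas UTF-8 rejects bytes that are not valid UTF-8.

```python
data = "你好".encode("gb2312")   # Chinese text in a legacy national encoding

# With no reliable label, several legacy decoders accept the same bytes.
guesses = {name: data.decode(name) for name in ("gb2312", "iso-8859-1", "koi8-r")}
for name, result in guesses.items():
    print(name, repr(result))

# UTF-8 is stricter: these bytes are not a valid sequence, so the error is detectable.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)   # False
```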
  • Re:Lack of editorial control by Omeganon (Score:1) Saturday June 09 2001, @09:16AM
  • Re:Unicode's Universality. by Lawrence Ho (Score:1) Saturday June 09 2001, @09:53AM
  • Re:Unicode Standard by Lawrence Ho (Score:1) Saturday June 09 2001, @10:14AM
  • Re:Perhaps it would be a good thing? by Lawrence Ho (Score:1) Saturday June 09 2001, @10:26AM
  • So we're okay to put in Klingon? by brassman (Score:2) Sunday June 10 2001, @04:45AM
  • Now if only FOX News had the same philosophy by Von Rex (Score:1) Saturday June 09 2001, @12:28PM
  • Re:yes, unicode works, but is unnecessary. by Velex (Score:1) Saturday June 09 2001, @10:29AM
  • Re:Voting on Slashdot policy? by Anarchos (Score:1) Saturday June 09 2001, @11:59AM
  • by YU Nicks NE Way (129084) on Saturday June 09 2001, @09:06AM (#164040)
    Sorry, but this is simply not true. Sort order is not handled by embedded encodings; the ONLY way to handle it is to go to Unicode. (And that, in fact, is why CJK compatibility is so important. If I embed Japanese text into a Chinese document, and then search the Chinese document, the Japanese text should be ordered according to the Chinese context, not the Japanese content.)
  • Re:Unicode on slashdot... by Megane (Score:2) Saturday June 09 2001, @12:45PM
  • by Megane (129182) on Saturday June 09 2001, @12:48PM (#164042)
    ...and while IE 5.0 properly rendered "Unicode '&#' escapes", Mozilla didn't. I presume it will work properly with the in-spec &amp; instead of a bare ampersand.
  • by pjrc (134994) <paul@pjrc.com> on Saturday June 09 2001, @01:18PM (#164043) Homepage Journal
    Only like 10 people actually went to the site and even fewer went to the articles linked to about the site by the slashdot post. Everyone started bitching about Microsoft and IE. What the fuck. Do you not read slashdot or something?

    In those first 20-30 minutes, when the vast majority of highly visible user posts were made, this very well may have been true. Slashdot's forums reward those who leap before they look, because by the time anybody could read the linked material, there's already a lot of posts.

    Reading these posts, I think it's best to keep in mind that the tiny fraction of slashdot's readership that posts messages automatically assumes whatever the slashdot editor's comments say is true, largely because there isn't time to actually read the linked material (or do other research). There's a big difference between automatically assuming something is true, for the sole purpose of posting within the first 100 messages, and automatically believing it's true because it was posted by a slashdot editor.

    It's important to remember that only a tiny fraction of slashdot's readership actually reads user comments and a very very tiny portion posts. You really can't draw any conclusions about slashdot's impact on its readership based on the comments posted by a tiny tiny minority (who have an incentive to post quickly and thus rashly).

    It's easy to claim there's a lack of editorial control, but it's a fact that many major media sources print bogus information regularly. What many major newspapers don't regularly do is admit they were wrong, in at least as conspicuous a way as the original wrong information. Many never admit to anything, and those that do often place it where it's not easily seen.

    Sure, it'd be nice if everything were so carefully reviewed that nothing was inaccurate or misleading, but given the choice between always correct and always honest, I'd rather read honesty every time!

    (...I'm not claiming slashdot and/or its editors are always honest... set your threshold to -1 on any story posted by Michael to read some -1 Trolls about a rather ugly little dispute between Michael and other members of the former censorware.org)

  • Re:So we're okay to put in Klingon? by wfaulk (Score:1) Monday June 11 2001, @07:27AM
  • Re:Unicode on slashdot... by bellings (Score:2) Saturday June 09 2001, @10:09AM
  • by -tji (139690) on Saturday June 09 2001, @08:29AM (#164046) Journal
    The REAL problem here is that the first article got posted at all. It was obviously a load of garbage.

    Slashdot has evolved into a powerful media outlet for an important group of people. Spreading misinformation to these people can have bad effects. When a group is starting a new project, and has doubts about something, like Unicode, caused by a seemingly authoritative source, they will do the wrong thing.

    It's time for slashdot to mature & behave as a major media outlet. This should include:

    - Independent verification of stories before posting

    - Caching (when achievable) of sites referenced in articles. -- Some sites WANT the huge number of hits, others can't begin to handle that type of load. So, ASK THEM, then cache as appropriate. Google does it, so can /.

  • by torokun (148213) on Saturday June 09 2001, @08:21AM (#164047) Homepage

    My only question is -- why doesn't slashdot allow UTF-8 posts? They are rejected by the filters...

  • Re:yes, unicode works, but is unnecessary. by Robbat2 (Score:1) Saturday June 09 2001, @09:27AM
  • by connorbd (151811) on Saturday June 09 2001, @08:32AM (#164049) Homepage
    Hmp. Does make sense, doesn't it?

    I thought it was a bit strange that he was talking about separate character spaces for each language in the standard. I mean, I realize there's substantial differences between Chinese, Japanese, and Korean ideographs, but I'm reading and thinking, doesn't Unicode have an overlap space? (Which, as someone pointed out upthread, indeed it does.)

    What bothers me about articles like the original is that the guy doesn't seem to quote chapter and verse, but he's slick enough to make the casual reader think he knows what he's talking about anyway. But we see so much of that anyway that it goes right past the bullshit filter nine times out of ten...

    /brian
  • Re:Lack of editorial control by aardvarkjoe (Score:2) Saturday June 09 2001, @12:28PM
  • by Rademir (168324) on Saturday June 09 2001, @12:35PM (#164051) Homepage
    >4294967296 different languages
    >4294967296 different characters

    This scheme does not allow for enough languages--we should be ready for the time when each individual crafts their own personal language(s) for various purposes. There are already more than 6 billion people on the planet. Allowing for some population growth, one might think that 40 bits would be enough, allowing for 1099511627776 languages.

    When the population reaches 100 billion, there's still enough space for 10 languages each.

    But what about aliens? We should make sure our system is ready to interoperate with their languages and computer systems.

    Liberal estimates suggest thousands of sentient species in our galaxy. To restrict ourselves to our local group of galaxies, let's say fifty thousand species, of no more than a trillion individuals each (most of them have probably spread out to dozens of planets), and 10 languages per individual...

    That would require 59 bits, to get about 5.8 x 10^17 different languages. This way, we won't have to upgrade again for a while!

    On the other hand, 4294967296 characters for each language seems a bit high. Maybe we can save 8 bits and use only 24, for "only" 16777216 characters.

    Life,
    Rademir
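For anyone who wants to play with bit-width estimates like these, a small Python sketch follows. The helper `bits_needed` is a hypothetical name, and the inputs are the human-population figures from the comment above.

```python
def bits_needed(count: int) -> int:
    """Smallest number of bits that gives `count` items distinct identifiers."""
    return max(1, (count - 1).bit_length())

print(2 ** 32)                          # 4294967296
print(2 ** 40)                          # 1099511627776
print(2 ** 24)                          # 16777216

# 100 billion people with 10 languages each still fit in 40 bits.
languages = 100 * 10**9 * 10
print(bits_needed(languages))           # 40
assert languages <= 2 ** 40
```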
  • Re:Why not require a delay for Upmodding? by MrBogus (Score:1) Sunday June 10 2001, @09:17AM
  • Glyphs versus characters in Castillian by Phronesis (Score:1) Saturday June 09 2001, @11:59AM
  • Slashdot already has caching by gnugnugnu (Score:1) Saturday June 09 2001, @09:38AM
  • Re:Lack of editorial control by Reality Master 101 (Score:1) Saturday June 09 2001, @11:54AM
  • Slashdot has evolved into a powerful media outlet for an important group of people.

    I think you vastly overestimate Slashdot's influence. I don't deny that there are probably a lot of influential people who check it regularly or occasionally, but let's remember that Slashdot mainly links to others' articles. I don't think anyone seriously believes that anything posted on Slashdot is automatically true.

    And the original editorial content on Slashdot whipsaws between the hopelessly naive and the outright obvious, so I doubt they have much influence there beyond high schoolers who still have pretty narrow horizons.


    --

  • Re:yes, unicode works, but is unnecessary. by Prior Restraint (Score:1) Tuesday June 12 2001, @09:29AM
  • Fun Unicode demos by crism (Score:1) Saturday June 09 2001, @11:04AM
  • Re:Lack of editorial control by mami (Score:1) Sunday June 10 2001, @03:39AM
  • Re:Unicode Standard by DarkEdgeX (Score:1) Saturday June 09 2001, @10:21AM
  • Re:Huh? (Score:3)

    by DarkEdgeX (212110) on Saturday June 09 2001, @08:59AM (#164061) Journal

    This story is a response to the first one that said it wouldn't. This one, IMHO, appears far more credible than the former, and seems to contain a more accurate view of Unicode overall than the previous story.

  • The answers we needed... by mirabilos (Score:1) Sunday June 10 2001, @01:40PM
  • All I really wanted to know... by (H)elix1 (Score:1) Saturday June 09 2001, @05:36PM
  • Re:Is unification a good thing at all? by alvazi (Score:1) Saturday June 09 2001, @02:00PM
  • Re:yes, unicode works, but is unnecessary. by roozbeh (Score:1) Saturday June 09 2001, @11:55AM
  • Re:yes, unicode works, but is unnecessary. by roozbeh (Score:1) Saturday June 09 2001, @11:59AM
  • Re:Original article was just ignorant FUD by roozbeh (Score:1) Saturday June 09 2001, @12:22PM
  • by roozbeh (247046) on Saturday June 09 2001, @12:14PM (#164068) Homepage

    In Unicode terms, "ch" is named a grapheme; it's different from a character. (Or you may want to call it a letter.) It is encoded using the two characters "c" and "h". It is something that is considered a unit in some places, but not in others. I would recommend taking a look at the Unicode Standard book, which you can read online [unicode.org]. These things are in chapters 1 [unicode.org] and 2 [unicode.org].

    About string ordering, Unicode does not claim anything. If you look into ASCII, you will find that even that is not suitable for normal English sorting, since "B" is encoded before "a". But don't go away. Unicode has a Collation Algorithm [unicode.org] that specifies what one should do with advanced natural language ordering of strings, and also tells what one should do with the Castillian "ch".
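The gap between encoding order and language order can be seen in a short Python sketch. It is emphatically not an implementation of the Unicode Collation Algorithm; `traditional_spanish_key` is a hypothetical helper that handles only "ch" (ignoring "ll", accents, and everything else real collation must deal with).

```python
words = ["apple", "Banana", "cherry", "Apple"]

# Sorting by raw code point puts every uppercase letter before every lowercase one.
print(sorted(words))
# ['Apple', 'Banana', 'apple', 'cherry']

# A crude stand-in for language-aware ordering: compare case-folded text first.
print(sorted(words, key=lambda w: (w.casefold(), w)))
# ['Apple', 'apple', 'Banana', 'cherry']

def traditional_spanish_key(word: str) -> list:
    """Hypothetical sort key treating 'ch' as one unit that sorts after 'c'."""
    key, i, w = [], 0, word.casefold()
    while i < len(w):
        if w.startswith("ch", i):
            key.append(("c", 1))    # ranks after a plain 'c'
            i += 2
        else:
            key.append((w[i], 0))
            i += 1
    return key

print(sorted(["chico", "cura", "cosa"], key=traditional_spanish_key))
# ['cosa', 'cura', 'chico']
```

The character encoding is the same in all three sorts; only the comparison rule changes, which is why ordering is left to a separate algorithm.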

    --

  • Re:yes, unicode works, but is unnecessary. by lindner (Score:1) Saturday June 09 2001, @04:12PM
  • Re:Lack of editorial control by NaturePhotog (Score:1) Saturday June 09 2001, @01:40PM
  • Re:Unicode on slashdot... by haruharaharu (Score:1) Saturday June 09 2001, @09:48AM
  • Cause the lack of characters... by l3377r0lld00d (Score:1) Saturday June 09 2001, @08:11AM
  • Why not require a delay for Upmodding? by ColGraff (Score:2) Saturday June 09 2001, @03:30PM
  • Is unification a good thing at all? by m08593 (Score:1) Saturday June 09 2001, @11:18AM
  • Re: Troller by kwhistler (Score:1) Saturday June 09 2001, @01:22PM
  • And who specifies the languages, pray tell? by kwhistler (Score:1) Saturday June 09 2001, @01:39PM
  • Sure it is by kwhistler (Score:1) Saturday June 09 2001, @02:02PM