
Why Unicode Will Work On The Internet

Ken Whistler sent in the lengthy and rather pointed response below to the article "Why Unicode Won't Work On The Internet," which was posted to Slashdot on Tuesday.


I have just finished reading the article you published today on the Hastings Research website, authored by Norman Goundry, entitled "Why Unicode Won't Work on the Internet: Linguistic, Political, and Technical Limitations."

Mr. Goundry's grounding in Chinese is evident, and I will not quibble with his background discussion of East Asian history, but his understanding of the Unicode Standard in particular and of the history of Han character encoding standardization is woefully inadequate. He makes a number of egregiously incorrect statements about both, which call into question the quality of research that went into the Unicode side of this article. And as they are based on a number of false premises, the article's main conclusions are also completely unreliable.

Here are some specific comments on items in the article which are either misleading or outright false.

Before getting into Unicode per se, Mr. Goundry provides some background on East Asian writing systems. The Chinese material seems accurate to me. However, there is an inaccurate statement about Hangul: "Technically, it was designed from the start to be able to describe any sound the human throat and mouth is capable of producing in speech, ..." This is false. The Hangul system was closely tied to the Old Korean sound system. It has a rather small number of primitives for consonants and vowels, and then mechanisms for combining them into consonantal and vocalic nuclei clusters and then into syllables. However, the inventory of sounds represented by the Jamo pieces of the Hangul is not even remotely close to describing every sound of human speech. Hangul is not and never was a rival for IPA (the International Phonetic Alphabet).

In the section on "The Inability of Unicode To Fully Address Oriental Characters", Mr. Goundry states that "Unicode's stated purpose is to allow a formalized font system to be generated from a list of placement numbers which can articulate every single written language on the planet." While the intended scope of the Unicode Standard is indeed to include all significant writing systems, present and past, as well as major collections of symbols, the Unicode Standard is not about creating "formalized font systems", whatever that might mean. Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes", seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.

Immediately thereafter, Mr. Goundry starts making false statements about the architecture of the Unicode Standard, making tyro's mistakes in confusing codespace with the repertoire of encoded characters. In fact the codespace of the Unicode Standard contains 1,114,112 code points -- positions where characters can be encoded. The number he then cites, 49,194, was the number of standardized, encoded characters in the Unicode Standard, Version 3.0; that number has (as he notes below) risen to 94,140 standardized, encoded characters in the current version of the Unicode Standard, i.e., Version 3.1. After taking into account code points set aside for private use characters, there are still 882,373 code points unassigned but available for future encoding of characters as needed for writing systems as yet unencoded or for the extension of sets such as the Han characters.
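
For readers who want to check the arithmetic, here is a minimal Python sketch using only the figures quoted above (it deliberately ignores the finer accounting of surrogates and noncharacters, so it gives an upper bound rather than the exact 882,373 figure):

    # Back-of-the-envelope check of the codespace figures cited above.
    planes = 17                 # planes 0 through 16
    per_plane = 0x10000         # 65,536 code points in each plane
    codespace = planes * per_plane
    print(codespace)            # 1114112 -- total Unicode code points

    encoded_in_3_1 = 94140      # standardized characters in Version 3.1
    claimed_need = 170000       # Mr. Goundry's unreduced CJK estimate

    # Even granting the 170,000 figure, the unassigned space dwarfs it.
    print(codespace - encoded_in_3_1)                  # room left, before private use
    print(codespace - encoded_in_3_1 > claimed_need)   # True, by a wide margin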

Even if Mr. Goundry's calculation of 170,000 characters needed for China, Taiwan, Japan, and Korea were accurate, the Unicode Standard could accommodate that number of characters easily. (Note that it already includes 70,207 unified Han ideographs.) However, Mr. Goundry apparently has no understanding of the implications or history of Han unification as it applies to the Unicode Standard (and ISO/IEC 10646). Furthermore, he makes a completely false assertion when he states that Mainland China, Taiwan, Korea, and Japan "were not invited to the initial party."

Starting with the second problem first, a perusal of the Han Unification History, Appendix A of the Unicode Standard, Version 3.0, will show just how utterly false Mr. Goundry's implication is that the Asian countries were left out of the consideration of Han character encoding in the Unicode Standard. Appendix A is available online, so there really is no valid research excuse for not having considered it before haring off to invent nonexistent history about the project, even if Mr. Goundry didn't have a copy of the standard sitting on his desk. See:

http://www.unicode.org/unicode/uni2book/appA.pdf

The "historical" discussion which follows in Mr. Goundry's account, starting with "The reaction was predictable ..." is nothing less than fantasy history that has nothing to do with the actual involvement of the standardization bodies of China, Japan, Korea, Taiwan, Hong Kong, Singapore, Vietnam, and the United States in Han character encoding in 10646 and the Unicode Standard over the last 11 years.

Furthermore, Mr. Goundry's assertions about the numbers of characters to be encoded show a complete misunderstanding of the basics of Han unification for character encoding. The principles of Han unification were developed on the model of the main Japanese national character encoding, and were fully assented to by the Chinese, Korean, and other national bodies involved. So assertions such as "they [Taiwan] could not use the same number [for their 50,000 characters] as those assigned over to the Communists on the Mainland" are not only false but also scurrilously misrepresent the actual cooperation that took place among all the participants in the process.

Your (Mr. Carroll's) editorial observation that "It is only when you get all the nationalities in the same room that the problem becomes manifest," runs afoul of this fantasy history. All the nationalities have been participating in the Han unification for over a decade now. The effort is led by China, which has the greatest stakeholding in Han characters, of course, but Japan, Korea, Taiwan and the others are full participants, and their character requirements have not been neglected.

And your assertion that many Westerners have a "tendency .. to dismiss older Oriental characters as 'classic,'" is also a fantasy that has nothing to do with the reality of the encoding in the Unicode Standard. If you would bother to refer to the documentation for the Unicode Standard, Version 3.1, you would find that among the sources exhaustively consulted for inclusion in the Unicode Standard are the KangXi dictionary (cited by Mr. Goundry), but also Hanyu Da Zidian, Ci Yuan, Ci Hai, the Chinese Encyclopedia, and the Siku Quanshu. Those are the major references for Classical Chinese -- the Siku Quanshu is the Classical canon, a massive collection of Classical Chinese works which is now available on CDROM using Unicode. In fact, the company making it available is led by the same man who represents the Chinese national standards body for character encoding and who chairs the Ideographic Rapporteur Group (the international group that assists the ISO working group in preparing the Han character encoding for 10646 and the Unicode Standard).

Mr. Goundry's argument for "Why Unicode 3.1 Does Not Solve the Problem" is merely that "[94,140 characters] still falls woefully short of the 170,000+ characters needed" -- and is just bogus. First of all, the number 170,000 is pulled out of the air by considering Chinese, Japanese, and Korean repertoires without taking Han unification into account. In fact, many more than 170,000 candidate characters were considered by the IRG for encoding -- see the lists of sources in the standard itself. The 70,207 unified Han ideographs (and 832 CJK compatibility ideographs) already in the Unicode Standard more than cover the kinds of national sources Mr. Goundry is talking about.

Next Mr. Goundry commits an error in misunderstanding the architecture of the Unicode Standard, claiming that "two separate 16-bit blocks do not solve the problem at all." That is not how the Unicode Standard is built. Mr. Goundry claims that "18 bits wide" would be enough -- but in fact, the Unicode Standard codespace is 21 bits wide (see the numbers cited above). So this argument just falls to pieces.

The next section on "The Political Significance Of This Expressed In Western Terms" is a complete farce based on false premises. I can only conclude that the aim of this rhetoric is to convince some ignorant Westerners who don't actually know anything about East Asian writing systems -- or the Unicode Standard, for that matter -- that what is going on is comparable to leaving out five or six letters of the Latin alphabet or forcing "the French ... to use the German alphabet". Oh my! In fact, nothing of the kind is going on, and these are completely misleading metaphors.

The problem of URL encodings for the Web is a significant problem, but it is not a problem *created* by the Unicode Standard. It is a problem which is being actively worked on by the IETF currently, and it is quite likely that the Unicode Standard will be a significant part of the solution to the problem, enabling worldwide interoperability rather than obstructing it.

And it isn't clear where Mr. Goundry comes up with asides about "Ascii-dependent browsers". I would counter that Mr. Goundry is naive if he hasn't examined recently the internationalized capabilities of major browsers such as Internet Explorer -- which themselves depend on the Unicode Standard.

Mr. Goundry's conclusion then presents a muddled summary of Unicode encoding forms, completely missing the point that UTF-8, UTF-16, and UTF-32 are each completely interoperable encoding forms, each of which can express the entire range of the Unicode Standard. It is incorrect to state that "Unicode 3.1 has increased the complexity of UCS-2." The architecture of the Unicode Standard has included UTF-16 (not UCS-2) since the publication of Unicode 2.0 in 1996; Unicode 3.1 merely started the process of standardizing characters beyond the Basic Multilingual Plane.
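
To make the interoperability point concrete, here is a small Python sketch showing the same supplementary-plane character (U+20000, a CJK Extension B ideograph, chosen purely as an example) expressed in each encoding form:

    # One code point beyond the BMP, expressed in each Unicode encoding form.
    ch = "\U00020000"               # a CJK Extension B ideograph

    utf8  = ch.encode("utf-8")      # b'\xf0\xa0\x80\x80' -- four bytes
    utf16 = ch.encode("utf-16-be")  # b'\xd8\x40\xdc\x00' -- a surrogate pair
    utf32 = ch.encode("utf-32-be")  # b'\x00\x02\x00\x00' -- one 32-bit unit

    # All three forms carry the full 21-bit codespace and round-trip identically.
    assert utf8.decode("utf-8") == utf16.decode("utf-16-be") \
        == utf32.decode("utf-32-be") == ch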

And if Mr. Goundry (or anyone else) dislikes the architectural complexity of UTF-16, UTF-32 is *precisely* the kind of flat encoding that he seems to imply would be preferable because it would not "exacerbate the complexity of font mapping".

In sum, I see no point in Mr. Goundry's FUD-mongering about the Unicode Standard and East Asian writing systems.

Finally, the editorial conclusion, to wit, "Hastings [has] been experimenting with workarounds, which we believe can be language- and device-compatible for all nationalities," leads me to believe that there may be a hidden agenda for Hastings in posting this piece of so-called research about Unicode. Post a seemingly well-researched white paper with a scary headline about how something doesn't work, convince some ignorant souls that they have a "problem" that Unicode doesn't address and which is "politically explosive", and then turn around and sell them consulting and vaporware to "fix" their problem. Uh-huh. Well, I'm not buying it.


Whistler is Technical Director for Unicode, Inc. and co-editor of The Unicode Standard, Version 3.0. He holds a B.A. in Chinese and a Ph.D. in Linguistics.


Comments Filter:
  • by Anonymous Coward
    In case you hadn't noticed, the current story was written as a direct rebuttal to the previous story.
  • "We will respond to Unicode's rather lengthy criticism of this paper (and the hilarious criticism of us personally) over the weekend. Our appreciation to the many people who sent thank you letters!
    June 8, 2001. Correction: due to the immense volume of email we have received, we will be postponing our response for a month."

  • by Anonymous Coward
    Unicode is a character set. UTF-8 is a character encoding. Confusing these sorts of concepts is responsible for 90% of the bullshit and FUD flying around in discussions about Unicode.
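
    A quick Python sketch of the distinction: the code point is the abstract character number (the character-set side), and the encoding is how that number gets serialized to bytes:

        ch = "é"                      # one abstract character
        print(hex(ord(ch)))           # 0xe9 -- its code point in the character set
        print(ch.encode("utf-8"))     # b'\xc3\xa9' -- its bytes in the UTF-8 encoding
        print(ch.encode("utf-16-le")) # b'\xe9\x00' -- same character, different encoding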
  • by Anonymous Coward
    joe@confucius.gnacademy.org

    I'm reminded of Velikovsky, who talked about astronomy and Middle Eastern history. The astronomers thought the astronomy was nonsense and didn't talk about the Middle Eastern history. The people skilled in Middle Eastern history thought that part was nonsense but couldn't comment on the astronomy.

    I don't know much about the computer science of character encoding, but I do know enough to know that Goundry knows nothing about Chinese.

    1) The PRC did not reduce the number of characters in use. What they did was to declare that the official forms of some characters would be written with fewer strokes. Most Chinese when writing Chinese will use "abbreviations" in writing characters, and what the PRC did was to make some of these abbreviations official. Essentially what the PRC did was to deem a new fontset to be the standard form for characters. This means that character simplification has *NOTHING* at all to do with Unicode. To switch between traditional and simplified characters, all you have to do is to switch fontsets. Therefore it would be totally ridiculous not to share encoding between simplified and traditional code, and unless I'm wildly misinformed, Unicode shares these encodings.

    2) I don't know where Goundry got the idea that the PRC sanctioned a subset of Chinese characters. A typical literate Chinese knows between 3000 and 5000 characters in daily life. But this is true in the PRC, in Taiwan, and among overseas communities.

    3) It is true that most literate Chinese cannot read typical classical Chinese, but this is true in all Chinese communities and has nothing to do at all with character simplification.

    4) Unicode *FIXES* the deficiencies in current representations of Chinese characters. Both the GB and Big5 standards are 16 bits and cannot represent all of the characters in use. This is particularly embarrassing with personal names. The "Rong" character in PRC Premier Zhu Rongji's name is not a standard character in the GB standard used in the PRC, and computers have to go through all sorts of silliness to deal with this. Also, having two sets of character coding is a royal pain, and Unicode has the huge advantage of being politically neutral.

    5) I don't know anything about the impact of computer systems in Japan, although I suspect that Goundry may be clueless there. Let me just point out that Chinese and Japanese really have wildly different cultures.

    6) Actually character input is not much of a problem in Chinese. You type in the phonetic transliteration with a keyboard. The screen puts up a menu of characters, you choose.

  • by Anonymous Coward
    As the link in your post makes clear, you are only referring to a decision by the Real Academia Española, which is only the language authority for Spain. Other countries still use the traditional sorting order. This is particularly significant given that other countries, notably Mexico, represent much larger populations and market share. Also, I thought I heard the Real Academia was reconsidering this anyway. This was a backwards decision driven by limitations in software--attempting to make a natural language conform to technology--and I hope they take to heart that the software problems have been solved.
  • by Anonymous Coward
    You probably have seen an example of a UTF-8 homepage on the Internet, but not realized it, since modern browsers deal with UTF-8 transparently.

    For example, try Morgan Stanley's European mutual fund web site [msdw.com]. This site is available in both English and Italian, and it uses UTF-8 as the encoding. If you look closely at some of the pages in Italian, you may see characters such as the Windows smart-double-quotes that are not in ISO-8859-1, but are represented in Unicode and UTF-8.

    UTF-8 is a viable solution for multilingual web sites today.
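
    As a quick check of that last point: the Windows "smart" double quotes (U+201C and U+201D) have no ISO-8859-1 code point at all, but encode fine in UTF-8. A minimal Python sketch:

        quote = "\u201c"                  # LEFT DOUBLE QUOTATION MARK
        print(quote.encode("utf-8"))      # b'\xe2\x80\x9c' -- representable in UTF-8
        try:
            quote.encode("iso-8859-1")    # no Latin-1 code point for this character
        except UnicodeEncodeError as err:
            print("not in ISO-8859-1:", err)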
  • by Anonymous Coward
    very funny.

    Independent verification of stories before posting
    - would require real editors.

    Caching (when achievable) of sites referenced in articles
    - /. can barely keep up with their own load. Caching would essentially double their job. It would be nice if they could just copy the text from the more straight ahead stories into the post, but I think there are legal issues around that.

  • by Anonymous Coward
    Unicode does work on the internet, but so do a lot of other character encodings. Take, for example, sites in Japanese. They are mostly either in Shift-JIS (e.g. Google Japan) or EUC-JP (e.g. Yahoo Japan). Since compatibility with those encodings is necessary anyway, and those encodings work fine, there is no need for the unpopular Unicode support.

    Think, I could have a Kanji followed by Arabic letters in Unicode, but when is that useful? Hardly ever. The smaller character set of Shift-JIS and EUC-JP means more space savings. If there is a need for multiple languages, it is always possible to use "< >" tags, right? Multiple character encodings will always exist, so the character encoding must be specified somehow for each text anyway.

    There is a reason why Unicode isn't in wide use on the internet, and it's because there's no need for it. I haven't seen a homepage in Unicode yet, with the obvious exception of plain ASCII, a subset of UTF-8.
  • by Anonymous Coward on Saturday June 09, 2001 @09:49AM (#163987)
    "Why Unicode Won't Work on the Internet" vs. "Why Unicode WILL Work on the Internet". Are we now going to see rebuttals to all Slashdot stories?
    • "Bill Gates is a Tool of the Devil" vs. "Microsoft is a Boon to Our Capitalist Economy"
    • "Real Hackers Use 'vi'" vs. "Emacs is the One True Editor"
    • "Napster Increases Music Sales" vs. "You're All a Bunch of Hypocritical Pirates"
    • "The Government is Spying on Us" vs. "Big Brother is Your Friend fnord"
    • "Jon Katz is a Pandering Conceited Hack" vs. "Jon Katz is a Pandering Conceited Hack"
  • While Apache doesn't support gzip natively (I think it does have some experimental support, though), for dynamically generated pages it's easy enough to add in. It's just a matter of searching for gzip in HTTP_ACCEPT_ENCODING, putting a certain header in there, and running the page through gzip (or, often, the implementations of the format integrated into the programming language of your choice). This requires no server support to do.
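
    A minimal CGI-style sketch of that negotiation (the handler name and the sample HTML are just placeholders, not any particular framework's API):

        import gzip
        import os
        import sys

        def respond(body: bytes) -> None:
            # Send the page gzipped only if the client advertised gzip support.
            accept = os.environ.get("HTTP_ACCEPT_ENCODING", "")
            out = sys.stdout.buffer
            if "gzip" in accept:
                out.write(b"Content-Type: text/html\r\nContent-Encoding: gzip\r\n\r\n")
                out.write(gzip.compress(body))
            else:
                out.write(b"Content-Type: text/html\r\n\r\n")
                out.write(body)

        respond(b"<html><body>Hello, world</body></html>")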
  • I use Unicode, and am very thankful for it.

    I am encoding sets of articles on international subjects (the most recent is a set of essays on libraries around the world). While I'm thankfully avoiding any issue of East Asian character sets, characters from multiple character sets do occur in a single document. Also, it's nice to basically leave the character set issue out entirely -- I use one character set, UTF-8. Keeping track of character sets is a PITA.

    Really, Unicode just eliminates a whole class of issues -- many of which are currently solvable, but with Unicode they simply aren't a problem at all. As it becomes better supported -- and Unicode is already quite well supported -- I think most places will be using it.

    Also, if there's redundancy in Unicode, I imagine most of that space could be saved with gzip, which also has good support over the web, though like Unicode is far underused.

    For European languages that could be represented in ISO-8859-1, it is true that the foreign letters turn from 1 to 2 bytes in UTF-8. Since this is a fraction of the characters, the increase in size is really only about 5% at most. This seems a reasonable payoff for the fact that we can now support all those languages that could not be supported in ISO-8859-1.

    But wait, there is more:

    I propose (is there any official way to do it?) that the official standard be to interpret "malformed sequences" (i.e. bytes which don't form a correct minimal-length UTF-8 encoding) as being the raw 8-bit bytes. This will allow almost all ISO-8859-1 text to pass through even though it is not encoded in UTF-8, because you would have to put a foreign punctuation mark followed by 2 accented characters, or a C1 control character followed by an accented character, for it to be mistaken as a UTF-8 character, and this is very rare.

    I also think all UTF-8 sequences that encode characters longer than the minimal encoding should be considered illegal and be interpreted as individual bytes. This greatly increases the ability of plain ISO-8859-1 text to go through, and avoids what I expect will be a security bug-fest when people fool things into accepting slashes and newlines and semicolons that the programs think they are filtering out.

    This idea avoids problems in the standard for determining when a malformed sequence ends (i.e. does a prefix byte followed by a 7-bit character count as a single error, or as an error plus a character?).

    It also eliminates the need for there to be an "error" character or any kind of error state in UTF-8 parsing, as all possible sequences of bytes are legal sequences. This vastly simplifies the programming interface.

    And it has the side effect that almost 100% of European text is the same length in UTF-8 as ISO-8859-1.
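
    A rough Python sketch of the decoding rule I'm proposing (note this is NOT standard UTF-8 behaviour -- a conforming decoder rejects or replaces malformed and over-long sequences -- and the helper name is just illustrative):

        def decode_lenient(data: bytes) -> str:
            # Decode UTF-8, but treat any malformed or over-long sequence as
            # raw Latin-1 bytes, per the rule proposed above.
            out, i = [], 0
            while i < len(data):
                b = data[i]
                if b < 0x80:                        # plain ASCII
                    out.append(chr(b)); i += 1; continue
                if   0xC2 <= b <= 0xDF: need = 1    # continuation bytes a valid
                elif 0xE0 <= b <= 0xEF: need = 2    # lead byte would require
                elif 0xF0 <= b <= 0xF4: need = 3
                else: need = 0                      # never a legal lead byte
                seq = data[i:i + 1 + need]
                if need and len(seq) == 1 + need and all(0x80 <= c <= 0xBF for c in seq[1:]):
                    try:
                        out.append(seq.decode("utf-8"))   # strict decode also rejects
                        i += 1 + need; continue           # over-long/surrogate forms
                    except UnicodeDecodeError:
                        pass
                out.append(chr(b)); i += 1          # fall back: this byte as Latin-1
            return "".join(out)

        # ISO-8859-1 text passes through unchanged; real UTF-8 decodes normally.
        assert decode_lenient("café".encode("latin-1")) == "café"
        assert decode_lenient("café".encode("utf-8")) == "café"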

  • > Using Unicode in that case doubles your
    > net traffic for absolutely no reason whatsoever.
    >
    > To sum it up: East Asian, Cyrillic, Greek, Hebrew,
    > and assorted other peoples will never use
    > Unicode under no circumstances whatsoever.

    Then what will they use? Revo is an Esperanto dictionary with German, Turkish, Czech and Russian translations, that's currently in UTF-8. What would the Cyrillic way of solving this be? ISO-2022? Transliterating everything into English characters? People want to do stuff like revo, and Unicode is pretty much the only supported solution.

    What's with the space concerns, anyway? Project Gutenberg is over 3000 texts, and still fits on one CD (if you exclude the Human Genome data). My hard drive is filled with mp3's and jpg's and ASCII program source code (invariant under UTF-8), not text files. My time on the modem is usually spent waiting for graphics to download, not text.

    If you're really concerned, use gzip everywhere, or get SCSU working. SCSU (Simple Compression Scheme for Unicode) compresses Greek and Russian strings to one byte per character (plus a byte overhead), and gzip can still compress the resulting text.
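
    As a rough illustration of the gzip point (SCSU itself is more involved, so it isn't sketched here), using nothing beyond the standard library:

        import gzip

        # A short Russian phrase: each Cyrillic letter is 2 bytes in UTF-8.
        text = "Пример русского текста, повторенный несколько раз. " * 20
        raw = text.encode("utf-8")
        print(len(text), "characters,", len(raw), "bytes as UTF-8")
        # gzip removes most of the 2-bytes-per-letter overhead (and then some
        # here, since the sample is repetitive).
        print(len(gzip.compress(raw)), "bytes after gzip")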
  • Once something becomes used on an everyday basis, it is harder to get people to move. Getting people to move over to Unicode will take a lot of work, but as more and more programs use Unicode, this will not be as much of an issue as it was before. It will take education of programmers and of users in order to make this work.
  • by hpa ( 7948 ) on Saturday June 09, 2001 @11:40AM (#163993) Homepage
    Actually, the reason we're not seeing mixed languages used a lot is that the infrastructure hasn't been in place. Mixing languages within a system is frequently highly desirable, especially when you consider what "mixing languages" really means. Yes, one can go the Microsoft route and have a completely different system for each language, including different APIs, but that doesn't really help me as a Swedish-speaking individual who wants to do my Japanese homework online.


    <...> tags aren't the way to go, either; the only way you can make that work sensibly is by having a single encoding internally, which is typically going to be Unicode.


    That being said, there is no question it will take time to catch on, as people open up to the abilities that this provides, and tools start supporting it.

  • Think, I could have a Kanji followed by Arabic letters in Unicode, but when is that useful? hardly ever.

    Sounds to me like you don't do a lot of dealing with other character sets. What if I was writing a translation, say of a Hebrew song. I can put English on the page, and I can put Hebrew on the page, but how do I say the 'é' in 'cliché', for example? To me, it looks like yud, if I'm using the Hebrew character set, because they occupy the same space in the ISO-8859-X encodings (which are the standards I use).

    I have three choices, really: use Latin (ISO-8859-1) and have the Hebrew look like gibberish, use Hebrew (ISO-8859-8) and have the accents in the English come out as Hebrew characters, or I can use Unicode. It's just easier that way.

  • Both points, story verification and caching of links sound like GREAT ideas and I agree with you wholeheartedly.

    I sincerely hope that Slashdot does mature like you say. It's gone way beyond Rob's toy project. Since I didn't have mod points to use, I wanted to mention I agree too.

    Maybe policy changes like these can be voted on more formally on Slashdot, so that when they come up the readers can give their input as to the direction Slashdot should mature and grow towards. A more formal 'suggestion box' for these ideas should be implemented too, with the reasonable ones getting put to the vote.

    --
    Delphis
  • This page:

    http://www.pemberley.com/janeinfo/latin1.utf8

    shows up fine in Mozilla for me. Maybe it's your font.
    --
  • At least slashdot can post follow-ups that contradict earlier stories. Besides, if you don't like it, go away. Start your own site. No one is forcing you to read slashdot.
  • Uh...what about the article a little bit ago about the UK site requiring IE and Windows? Only like 10 people actually went to the site and even fewer went to the articles linked to about the site by the slashdot post. Everyone started bitching about Microsoft and IE. What the fuck. Do you not read slashdot or something?
  • Wait a tick, how many stupid people can you think of that wield a lot of power? I can name several off the top of my head. I was pointing out that the fact that so many fools do read slashdot (or have their mom's read it to their illiterate asses) means the previous post was poorly thought out. Slashdot ends up causing a lot of email floods, especially when they post something inflammatory due to Hemos and timothy never checking out the shit they post. Which is why I brought up the UK site article. I can't believe that got fucking posted. There was practically no editorial at all. If it had been "Linux users jump the gun, again" and linked to the article it might have been a little more apropos. I think what slashdot needs most is a hooked on phonics slashbox for those that have trouble.
  • Son/daughter's asses.
  • Damn, that was hella funny!

    --
  • one for "Damn, he put the smack down on that bitch!"

    Excellent rebuttal.... good to see this type of discussion on Slashdot again.
  • Unicode would have adequate space for almost all scripts, if not all, if Chinese could be resolved. This could include the Hangul pictographs and the Japanese Kanji. This leaves two alternatives. Either allow both Chinese governments to dictate the full extent of their character sets (and sacrifice the workability of Unicode) or have non-Chinese dictate a solution that fits into the allotted space. I personally prefer the latter.
  • by jmauro ( 32523 ) on Saturday June 09, 2001 @11:00AM (#164004)
    There are no legal issues with lots of different people hitting a site at the same time. None of the traffic is intended to cause harm (everyone just wants to read the story), and the information is publicly available. If the site can't handle a lot of people getting the information at once, then it is the fault of the administrators, not of Slashdot or the public at large. The administrators of the site make determinations about bandwidth, virtual servers, computers, etc, etc. They make this determination based on the average number of users on the site. If they are wrong, thanks to slashdot, then they are wrong. It happens. The Slashdot Effect is annoying, but there would be many, many more legal problems with directly copying the content (i.e. caching) than with just telling everyone else to go read the story. No one gets sued for telling everyone they know there is a cool article in Time, and everyone needs to go check it out. They get sued over copying the article and posting it on their web site.
  • What happens when you use that tag to switch to a character encoding in which "" means something else? What happens when you want to search through a document for text, the literal bytes of which match completely unrelated text in a different encoding within the same document?
  • by Another MacHack ( 32639 ) on Saturday June 09, 2001 @12:59PM (#164006)
    Slashdot serves pages as "Content-Type: text/html; charset=iso-8859-1"

    Under that encoding, your Japanese text looks like: [227: small a, tilde] [129: out of range] [170: feminine ordinal] [227: small a, tilde] [129:out of range] [171: left guillemot] [239: small i, dieresis] [188: fraction one fourth] [159: out of range] instead of [\u306a: HIRAGANA LETTER NA] [\u306b: HIRAGANA LETTER NI][\uff1f: FULLWIDTH QUESTION MARK]

    Your browser is playing games with you if it's displaying text marked as iso-8859-1 as utf-8 without user intervention. It "works fine" only on browsers which second-guess the charset field.
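
    To see exactly what happens, here is a small Python check (the byte values match the list above):

        jp = "\u306a\u306b\uff1f"       # HIRAGANA NA, HIRAGANA NI, FULLWIDTH QUESTION MARK
        raw = jp.encode("utf-8")        # the bytes that actually travel over the wire
        print(list(raw))                # [227, 129, 170, 227, 129, 171, 239, 188, 159]
        print(raw.decode("iso-8859-1")) # mojibake: each byte rendered as a Latin-1 character
        print(raw.decode("utf-8"))      # the intended text, once labelled correctly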
  • 1. Unicode fonts on X suck. 2. Netscape on X blows up trying to read a page with Unicode fonts. 3. Netscape on Windows does the same.


    Then don't use them. I've been reading Japanese and Chinese on the net for years (or rather 'viewing', since I'm not fluent in the languages themselves). You don't need to have Unicode fonts, just install Chinese fonts for Chinese pages, Japanese fonts for Japanese pages, and have something intelligent to build them up.



    Java has used the idea of composite fonts since the get-go. Netscape also does 'Pseudo Fonts'. I forget about MSIE, since I don't use it much at all. In fact, I'm sitting here now in Netscape under X and instead of a 'suck'y Unicode font, Unicode encoded characters are getting displayed by a NSPseudoFont.

    3. Netscape on Windows does the same


    You should look more into details of your problem. Here's a possibly shocking bit of news for ya: Since Windows 3.1, Windows fonts have used and preferred Unicode mapping. So it's extremely unlikely that a font being Unicode in and of itself is causing Netscape any problems on Windows.



  • Think, I could have a Kanji followed by Arabic letters in Unicode, but when is that useful? hardly ever.


    Well, it turns out that your view is not shared by the extensive research Microsoft for one has undertaken. They did a lot of research in this field since more than 40% of their Office revenue came from overseas. One interesting thing they found is that people who did use more than one language tended to actually mix two or three main languages per document.

    All this is part of why Microsoft for some time had been changing Office internally and had made it all Unicode internally by Office 97.


    If there is a need for multiple languages, it is always possible to use "<>" tags, right? Multiple character encodings will always exist, so for each text character encodings must be specified somehow anyway.


    Yes, multiple languages but not necessarily multiple encodings. In fact many programs have problems with multiple encodings in one document. I know as a software engineer that doing so is one nightmare I'd like to avoid at all costs.



    And on Windows one does not have to specify text encodings at all. Since Windows 3.1 their fonts have used Unicode mapping tables internally. And COM (and thus ActiveX) on Windows is pure Unicode for all its strings. So there is no need to convert away from Unicode for COM work or display.


  • Sorry, but this is simply not true. Sort order is not handled by embedded encodings; the ONLY way to handle it is to go to Unicode.


    Sorry, but that is simply not true. Unicode does nothing really to help sort order. US-ASCII English is about the only thing that can use Unicode ordering for sorting. Any other language needs to go through some other mechanism to achieve ordering. In your example that would be either an explicit Chinese ordering routine or an explicit Japanese ordering routine. But those could take EUC, for example, instead of Unicode.



    So, calling Unicode "the ONLY way" to handle sorting is just incorrect.


  • In Castillian Spanish, 'ch' and 'll' are characters that require two glyphs to print. However, for alphabetization purposes, 'ch' and 'll' are distinct characters


    Well, in general this is what should occur. If it does not, then it is because the national standards for those languages had already been treating them as two separate characters, with two encoding points.



    http://www.unicode.org/unicode/standard/where/ [unicode.org]
    Says:
    For compatibility with pre-existing standards, there are characters that are equivalently represented either as sequences of code points or as a single code point called a composite character. For example, the i with 2 dots in naïve could be presented either as i + diaeresis (0069 0308) or as the composite character i + diaeresis (00EF).


    That's where Unicode bends from pure idealism to practical matters. They'd prefer to have one character, but if Spain is already using two characters for 'ch', then the Unicode consortium would follow their practice. (The composed/decomposed naïve example above is checked in the short sketch at the end of this comment.)

    simple-minded comparison routines that compare character codes report erroneous comparison values because they don't realize that 'cg'


    Of course, the same goes for any Unicode sorting. No Unicode sorting can be assumed to work based just on the character values. That's why all software libraries that allow for sorting Unicode have functions for the programmer to use, based on the language/locale in effect.
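
    The naïve example quoted above can be checked directly with Python's unicodedata module, which implements the normalization forms the standard defines for exactly this two-representations-one-character situation (a minimal sketch):

        import unicodedata

        composed   = "na\u00efve"      # 'ï' as the single code point U+00EF
        decomposed = "nai\u0308ve"     # 'i' followed by COMBINING DIAERESIS U+0308

        print(composed == decomposed)                               # False: different code points
        print(unicodedata.normalize("NFC", decomposed) == composed) # True after composition
        print(unicodedata.normalize("NFD", composed) == decomposed) # True after decomposition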


  • UTF-8 doubles the net traffic for all languages other than those used by Latin-1. You are misinformed.


    Well... close. But for Asian languages it might only increase it to 150%, not 200% (i.e. 3 bytes per character instead of 2).

    :-)

  • ãï¼Y

    This post was UTF-8 posting Japanese, and it worked fine.

    It posted OK, but it came through as garbage. There's nothing in /.'s HTML that sets the character encoding used when rendering a page, so the browser presumably uses its default encoding (which I'm guessing is Latin-1).

  • The author seems to have heard from somewhere that UCS-2 can only encode 65536 characters, and blithely extended that statement to Unicode/ISO-10646 as a whole. That Unicode has a big enough codespace for everyone should be manifest from the following fact: the creators of almost every other character standard now describe their standards by saying how they map into Unicode.

    Ah well. Shame things this ignorant get any credence. Thanks for correcting it.

  • "They get sued over copying the artile and posting it on their web site."

    *** Grey-Area Alert ***

    What about photocopied news articles on bulletin boards? I think a lot of this falls under fair use.
  • Anyone who's ever tried to write a page that needs to use multilingual fonts will know that Unicode is by far and away the easiest. I can load a Unicode-compatible text editor, type straight out the letters I want, and save, upload to webserver, and voila! Easy, compatible, and no pissing around with various coding conventions and garbage.

    Here's an example which I did quickly: http://wolf.project-w.com/chess/pieces.html [project-w.com]

  • by Hydrophobe ( 63847 ) on Saturday June 09, 2001 @02:44PM (#164016)
    If I am wrong here, I would love to be set straight by someone better informed.

    You're wrong. In 1994 Spanish stopped considering ch and ll as separate letters [www.rae.es] for dictionary ordering purposes.

  • Let's just hope the original posting is edited to include a link to this article.

    I'd hate for people to do a search on UNICODE and come up with Slashdot's reference to that previous piece.
  • It seems interesting to me that both of these articles bring up the issue of multilingual domain names when, as far as I know, Verisign isn't using Unicode for it. They are trying to set up their own RACE encoding which just puts the characters into ASCII. For example, the hiragana for kitsune.org is turned into bq--gbgwii.org and looked up that way.
  • What Slashdot is running is Perl, which deals only half-ass with UTF-8 and not at all with UTF-16 or any other form of Unicode encoding. Perl is 8-bit clean, but is completely unprepared to deal with 16 bit text. Also, what the heck is "running Unicode"? It's not an application or a daemon, it's a character encoding scheme.
  • This isn't insightful, it's crap. Win2K's multilanguage UI rocks. The same build supports dozens of languages and locales, and the APIs are Unicode-aware. As a Swedish-speaking person who wants to do their Japanese homework online, you just install the Japanese and Swedish locales, and swap them on the fly. Couldn't be cleaner.
  • Yeah... slashdot has really dropped in quality. Lots of FUD lately... Kind of disappointing.
  • Slashdot rejects UTF-8 because it thinks it's binary junk.

    However, it does accept iso8859-1 characters, as in "Iré a dormir bajo un árbol."

    It also does accept this notation: Ï (Ampersand # number ; )

    Or UTF-7, like here:
    Japanese Minami (South): +U1f/HVNBZwhe/1NB-
    Japanese no ('s) : +MG4-
    Japanese Yume (Dream): +WSL/HV7/dTBnCF8TYgj/O18TYgj/HVkV-]
  • by dsplat ( 73054 ) on Saturday June 09, 2001 @10:49AM (#164023)
    I do not have the knowledge of Asian languages necessary to evaluate the arguments on each side. However, even if Unicode were entirely inadequate for Asian languages, it has already started to solve a very real problem. Unicode font sets representing the European characters encoded by it are now available on a variety of platforms. This makes possible a single character set with which the speakers of a number of languages can exchange information. Tools can support Unicode rather than lots of character sets. If you speak a language that isn't a common target for localization of software, this makes a wider range of tools available to you for processing data in your own language.
  • First off, my guess is that /. is running asci on the servers, which is fair considering the site is in English. Second, if they were running unicode on the servers, they would have to update their lameness-filters significantly to deal with the goatsex problem.
  • by selectspec ( 74651 ) on Saturday June 09, 2001 @09:20AM (#164025)
    ASSERT(SlashdotPost == fud);

    Assertion failed line 1.

    Intelligent Commentary posted on /. !!! Somebody knows what they are talking about. Shutdown the servers. MySQL must be acting up again! ECC failure! The CPU is running too hot! Drive failure! Katz must be out sick!

  • Slashdot's forums reward those who leap before they look, because by the time anybody could read the linked material, there's already a lot of posts.

    It should be impossible to moderate responses to a story for ten minutes after the story first appears. This will give readers a chance to read the article without (effectively) losing their chance to comment on it.
    --

  • I think you are barking up the wrong tree here. I'm posting from IE on windoze, and both UTF-8 as well as SHIFT-JIS have worked fine for months on /. For that matter, I also use Netscape on X from time to time, and set up correctly that works like a charm too with international characters.
  • Hehe... at least you picked three that I know. Seriously, though, the web (and html/xml) is only one source of documents. Wouldn't it be really nice to be able to open any document, of any type (pdf, word, etc), anywhere in the world? That's the power of a single standard.
  • ããï¼Y

    This post was UTF-8 posting Japanese, and it worked fine.
  • If it looks like garbage to you your browser is probably setup wrong. I usually use Shift-JIS for Japanese, so it does come out as garbage under that, but it appears fine under UTF-8.

    You are dead right that the browser won't choose the correct encoding...if you look at my other comment [slashdot.org] this is precisely why universal unicode usage is a good idea. There's no way that /. could even push the correct encoding, since I've noticed a number of people who frequently add asides in Japanese or Chinese, so with any encoding you are screwing someone. (even with UTF, since few people use it at the moment)

    (note that reposting the valid UTF-8 hiragana through a Latin-1 browser fubar'd it since the browser interpreted the raw bits wrong...another good argument for unicode)
  • by Tiroth ( 95112 ) on Saturday June 09, 2001 @10:02AM (#164031) Homepage
    The reason it is useful is because with Unicode you don't need to guess the encoding. You can't rely on document creators to tag their documents; plenty of times my browser has guessed the wrong encoding for a page, and that requires manually going in and changing it. Can you imagine an automated system printing out reams from a website and then realizing the encoding was fubar?

    Also consider how often one might want Chinese/Korean/Japanese/German/French/English in the same document. (product manual?) Unicode easily handles this...localized standards don't.
  • And there aren't potential legal issues with what is effectively a DOS attack against small to medium sized sites that get linked from here? What's worse is that site may be virtual-hosted and then it not only affects that site, but all others hosted on that machines or farm. I have to agree with the original poster.
  • > Using Unicode in that case doubles your
    > net traffic for absolutely no reason whatsoever.

    Have you ever heard of UTF-8?

    > To sum it up: East Asian, Cyrillic, Greek, Hebrew, and assorted other peoples will never use
    > Unicode under no circumstances whatsoever.

    Take your bs somewhere else...

    > Whoopie. So basically we bloated Latin-1 with
    > 64,000 useless characters that nobody ever will
    > use. Is this genius or what?

    At least not as useless as your misinformed post!
  • You mean using 32 bits to represent characters in a language?
    What a huge waste! Have you realized that no language has such a large character set?

    UCS-2 is adequate for current use, not to mention UCS-4.
  • Limitation is a good thing?!
    Perhaps you should not read /.

    Well, how about this? ;)
    perhaps limitation is a good thing. It's just one more reason to learn using M$ software. And given enough time and the ever growing importance of .NET it wouldn't be more than a few years before M$......

  • Gee, when I heard that some folks wanted to include such artificial alphabets as Klingon and Elvish in Unicode, I thought that was a bit over the top; but with more than 1 <voice type="Dr. Evil"> MILLION </voice> 'data points' to play with, maybe providing a few for fan fiction isn't such a bad idea.
    --
  • Props to Slashdot for giving the correction as much prominence as the original false story. Too bad the mainstream media isn't as ethical.
  • Well, why couldn't you just do something like <p style="language:German">blahblahblah</p>. Boku mean, wirklich, how many mal versuche people to schreib in tres languages at the selbe time, ne?

  • Are you going to fund it then?
  • by YU Nicks NE Way ( 129084 ) on Saturday June 09, 2001 @10:06AM (#164040)
    Sorry, but this is simply not true. Sort order is not handled by embedded encodings; the ONLY way to handle it is to go to Unicode. (And that, in fact, is why CJK compatibility is so important. If I embed Japanese text into a Chinese document, and then search the Chinese document, the Japanese text should be ordered according to the Chinese context, not the Japanese content.)
  • It looked okay in Mozilla (which was only set to Japanese auto-detect, not UTF-8), but looked like garbage on IE 5.0 until I manually selected UTF-8.

    However, the Japanese characters in my sig are done with Unicode '&#' escapes, and seem to display properly on most Unicode-aware browsers, in spite of charset headers and character set settings.

  • by Megane ( 129182 ) on Saturday June 09, 2001 @01:48PM (#164042)
    ...and while IE 5.0 properly rendered "Unicode '&#' escapes", Mozilla didn't. I presume it will work properly with the in-spec &amp; instead of a bare ampersand.
  • by pjrc ( 134994 ) <paul@pjrc.com> on Saturday June 09, 2001 @02:18PM (#164043) Homepage Journal
    Only like 10 people actually went to the site and even fewer went to the articles linked to about the site by the slashdot post. Everyone started bitching about Microsoft and IE. What the fuck. Do you not read slashdot or something?

    In those first 20-30 minutes, when the vast majority of highly visible user posts were made, this very well may have been true. Slashdot's forums reward those who leap before they look, because by the time anybody could read the linked material, there's already a lot of posts.

    Reading these posts, I think it's best to keep in mind that (the tiny fraction of slashdot's readership that posts messages) automatically assumes whatever the slashdot editors' comments say is true, largely because there isn't time to actually read the linked material (or do other research). There's a big difference between automatically assuming something is true for the sole purpose of posting within the first 100 messages, and automatically believing it's true because it was posted by a slashdot editor.

    It's important to remember that only a tiny fraction of slashdot's readership actually reads user comments and a very very tiny portion posts. You really can't draw any conclusions about slashdot's impact on its readership based on the comments posted by a tiny tiny minority (who have an incentive to post quickly and thus rashly).

    It's easy to claim there's a lack of editorial control, but it's a fact that many major media sources print bogus information regularly. What many major newspapers don't regularly do is admit they were wrong, in at least as conspicuous a way as the original wrong information. Many never admit to anything, and those that do often place it where it's not easily seen.

    Sure, it'd be nice if everything were so carefully reviewed that nothing was inaccurate or misleading, but given the choice between always correct and always honest, I'd rather read honesty every time!

    (...I'm not claiming slashdot and/or its editors are always honest... set your threshold to -1 on any story posted by Michael to read some -1 Trolls about a rather ugly little dispute between Michael and other members of the former censorware.org)

  • There's a pseudo-standard called the ConScript Unicode Registry [www.egt.ie] that's attempting to use the sections of Unicode codespace designated for private use to describe many fictional languages. Already encoded are Klingon [www.egt.ie], Tengwar [www.egt.ie] (amongst some other Tolkien-created scripts), even such silliness as Seussian alphabets [www.egt.ie].

    Note that it is not official Unicode, but might become a de-facto standard for those folks that are silly enough to actually want to use this stuff.

  • my guess is that /. is running asci (sic) on the servers

    Wow... seven bit servers. Where do you suppose they found those?
  • by -tji ( 139690 ) on Saturday June 09, 2001 @09:29AM (#164046) Journal
    The REAL problem here is that the first article got posted at all. It was obviously a load of garbage.

    Slashdot has evolved into a powerful media outlet for an important group of people. Spreading misinformation to these people can have bad effects. When a group is starting a new project and has doubts about something like Unicode, caused by a seemingly authoritative source, they will do the wrong thing.

    It's time for slashdot to mature & behave as a major media outlet. This should include:

    - Independent verification of stories before posting

    - Caching (when achievable) of sites referenced in articles. -- Some sites WANT the huge number of hits, others can't begin to handle that type of load. So, ASK THEM, then cache as appropriate. Google does it, so can /.

  • by torokun ( 148213 ) on Saturday June 09, 2001 @09:21AM (#164047) Homepage

    My only question is -- why doesn't slashdot allow UTF-8 posts? They are rejected by the filters...

  • You might have seen more unicode than you realize.

    The first major example I came across was when I was writing a login page that allowed you to choose what language you wanted from a drop-down box. The only way it will work right for the large plethora of languages I support is with Unicode. I have several of the oriental languages as well as Arabic available to choose from in their own scripts on the dropdown.


    ICQ# : 30269588
    "I used to be an idealist, but I got mugged by reality."
  • by connorbd ( 151811 ) on Saturday June 09, 2001 @09:32AM (#164049) Homepage
    Hmp. Does make sense, doesn't it?

    I thought it was a bit strange that he was talking about separate character spaces for each language in the standard. I mean, I realize there's substantial differences between Chinese, Japanese, and Korean ideographs, but I'm reading and thinking, doesn't Unicode have an overlap space? (Which, as someone pointed out upthread, indeed it does.)

    What bothers me about articles like the original is that the guy doesn't seem to quote chapter and verse, but he's slick enough to make the casual reader think he knows what he's talking about anyway. But we see so much of that anyway that it goes right past the bullshit filter nine times out of ten...

    /brian
  • I don't think anyone seriously believes that anything posted on Slashdot is automatically true.

    It would be a little more accurate to say that nobody important believes a slashdot posting to be automatically true. (As someone else pointed out, there's plenty of stupid people who will believe anything here, especially if it's some kind of silly rant.)

  • by Rademir ( 168324 ) on Saturday June 09, 2001 @01:35PM (#164051) Homepage
    >4294967296 different languages
    >4294967296 different characters

    This scheme does not allow for enough languages--we should be ready for the time when each individual crafts their own personal language(s) for various purposes. There are already more than 6 billion people on the planet. Allowing for some population growth, one might think that 40 bits would be enough, allowing for 1099511627776 languages.

    When the population reaches 100 billion, there's still enough space for 10 languages each.

    But what about aliens? We should make sure our system is ready to interoperate with their languages and computer systems.

    Liberal estimates suggest thousands of sentient species in our galaxy. To restrict ourself to our local group of galaxies, let's say fifty thousands species, of no more than a trillion individuals each (most of them have probably spread out to dozens of planets), and 10 languages per individual...

    That would require 56 bits, to get about 7.2 x 10 ^16 different languages. This way, we won't have to upgrade again for a while!

    On the other hand, 4294967296 characters for each language seems a bit high. Maybe we can save 8 bits and use only 24, for "only" 16777216 characters.

    Life,
    Rademir
  • Limiting posting time would only benefit the karma whores who *still* wouldn't read the article, but instead compose a 1000 word post commenting on it.

    I do agree that moderation should be locked for a few hours after posting. Nothing is more depressing than seeing an inaccurate or stupid or troll post modded up to 4 within the first 5 minutes, drowning out much of the intelligent conversation on the article. Of course the "Troll HOWTO" and "Karma Whore HOWTO" have been around for many years and Taco et al must have read them.
  • Mr. Goundry, while critiquing Anglo-centricity in thinking about the Web and the Internet as an "unfortunate flaw in Western attitudes" seems to have made the mistake of confusing glyph and character -- an unfortunate flaw in Eastern attitudes that often attends those focussing exclusively on Han characters.

    Please pardon me if this is a stupid question, and I would appreciate anyone who can set me straight in my erroneous thinking, but I have always been under the impression that Unicode as it exists had a fundamental confusion between glyphs and characters even with European languages such as Spanish.

    In Castillian Spanish, 'ch' and 'll' are characters that require two glyphs to print. However, for alphabetization purposes, 'ch' and 'll' are distinct characters (A Castillian dictionary has sections 'A' 'B' 'C' 'CH' 'D' ... 'K' 'L' 'LL' 'M' ...). This makes it a pain to sort Castillian words encoded in ASCII or UNICODE -- simple-minded comparison routines that compare character codes report erroneous comparison values because they don't realize that 'cg' < 'cj' < ... < 'cz' < 'ch'. Of course, the proper way to have done this would have been for Unicode to allocate a 'ch' character code between 'c' and 'd' and an 'll' code between 'l' and 'm', but Unicode seems more preoccupied with glyphs than with characters.

    If I am wrong here, I would love to be set straight by someone better informed.

  • but it is better known as "Karma Whoring"

    :)

    - Caching (when achievable) of sites referenced in articles. -- Some sites WANT the huge number of hits, others can't begin to handle that type of load. So, ASK THEM, then cache as appropriate. Google does it, so can /.

    Usually someone gets a copy of a site before it's slashdotted and reposts it as a comment, which conveniently saves the editors the effort of having to ask permission.
    Having to ask permission to mirror something that is freely on the internet is laughable, considering the local internet proxy or your browser can be set to cache everything.

    --
    Three rings for the elven-kings under the sky,
    Seven for the Dwarf-lords in their halls of stone,
    Nine for Mortal Men doomed to die,
    One for the Dark Lord on his dark throne
    In the Land of Mordor where the Shadows lie.
    One Ring to rule them all, One Ring to find them,
    One Ring to bring them all and in the darkness bind them
    In the Land of Mordor where the Shadows lie.
  • Everyone started bitching about Microsoft and IE. What the fuck. Do you not read slashdot or something?

    And this was "influential" how? I don't deny that there a lot of fools that believe everything they see on Slashdot, but the original poster believed that "influential people" bought into Slashdot's often misleaing postings. I don't believe that anyone smart enough to have influence in real ways would be dumb enough to believe everything they read on Slashdot.

    In other words, I don't think the Unicode committee is going see that article and say to themselves, "Good God! Slashdot thinks that Unicode is a failure, so I guess we better close up shop!"


    --

  • Slashdot has evolved into a powerful media outlet for an important group of people.

    I think you vastly overestimate Slashdot's influence. I don't deny that there are probably a lot of influential people who check it regularly or occasionally, but let's remember that Slashdot mainly links to other's articles. I don't think anyone seriously believes that anything posted on Slashdot is automatically true.

    And the original editorial content on Slashdot whipsaws between the hopelessly naive and the outright obvious, so I doubt they have much influence there beyond high schoolers who still have pretty narrow horizons.


    --

  • This isn't insightful, it's crap. Win2K's multilanguage UI rocks.

    Cut hpa a little slack. While NT has supported Unicode for a while, it wasn't until Win2k that the "swap [locales] on the fly" behavior you mention worked reasonably, and it isn't even an option on Win9x (which is what most home users would have). hpa's info on MS APIs is a little out of date, yes, but his/her OS probably is, too.

  • For a couple of cool demos of the kind of multilingual Web pages that Ken Whistler is talking about, see the announcement for the Tenth Unicode Conference [unicode.org] or "I don't know, I only work here." [maden.org] Both of these pages demonstrate Han unification, in which the same code points tagged as different languages get different visual presentation in a compliant browser.

  • ( or have their mom's read it to their illiterate asses)

    Whose illiterate asses are you talking about, mom's asses or their son's asses?

    Is that a phonics or a grammar issue?

  • The problem with that is that each character in a text file would then be represented by a 64-bit value. A text file that's only 1,500 bytes in size under ASCII or UTF-8 suddenly turns into a 12,000-byte file. Strings inside executables would have to be updated to use these new characters, and the size of applications would increase as a result. Really, the Unicode standard does fine, it works, and the idiot who posted the first story should be shot for wasting everyone's time with this crap when it's clear the standard is proving itself fine in the real world.
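
    A quick Python sketch (an editorial illustration, not part of the original comment) of the size arithmetic: a fixed-width 64-bit code is 8 bytes per character, so 1,500 ASCII characters become 12,000 bytes, while UTF-8 leaves them at 1,500.

        # Hypothetical size comparison for 1,500 ASCII characters.
        text = "x" * 1500

        print(len(text.encode("ascii")))      # 1500 bytes
        print(len(text.encode("utf-8")))      # 1500 bytes -- identical to ASCII
        print(len(text.encode("utf-16-le")))  # 3000 bytes
        print(len(text.encode("utf-32-le")))  # 6000 bytes
        print(1500 * 8)                       # 12000 bytes at 64 bits per character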

  • by DarkEdgeX ( 212110 ) on Saturday June 09, 2001 @09:59AM (#164061) Journal

    This story is a response to the first one that said it wouldn't. This one, IMHO, appears far more credible than the former, and seems to contain a more accurate view of Unicode overall than the previous story.

  • This guy has provided us with nearly all the answers we needed on the earlier story. The only thing I personally have to add (and don't call me American, because I am German): ASCII (ISO_646.irv:1991) is THE standard, and nearly all text-based RFCs (such as 2821/2822, SMTP and the mail message format) are based on it, so it can't simply be put aside. The only solution, with regard to ASCII compatibility as well as to the endianness problems, is UTF-8, and I don't believe it is any harder to exchange than e.g. UTF-16, even on Far East equipment. Furthermore, it is easier to handle for e.g. libc strcpy() and company, which are such great functions (as in the code in the K&R books). So please, people, stop arguing. I always write in ISO_646.irv:1991, but I usually declare charset=utf-8 when I can't choose ISO_646.irv:1991 (or 7-bit ASCII).

    Please don't call me old-fashioned, but ASCII has been the most important standard; base64 is based on ASCII and EBCDIC (the only real alternative), and it must not simply be thrown away. UTF-16 is NOT ASCII-compatible, while UTF-8 is (a small sketch follows below). ASCII defines 7-bit encodings, which are wrapped into 8-bit characters as long as we don't move away from the bit (e.g. to a three-valued (0,1,2) state).


    --
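
    A small Python sketch (an editorial addition with made-up sample strings) of the points above: ASCII text is byte-for-byte valid UTF-8, UTF-16 introduces NUL bytes and an endianness choice, and non-ASCII characters in UTF-8 become multi-byte sequences that still contain no embedded NULs, which is why byte-oriented C routines such as strcpy() keep working.

        s = "MAIL FROM:<user@example.org>\r\n"         # ASCII-only, SMTP-style line

        assert s.encode("utf-8") == s.encode("ascii")  # UTF-8 is a superset of ASCII

        print(s.encode("utf-16-be")[:8])  # b'\x00M\x00A\x00I\x00L' -- NULs, big-endian
        print(s.encode("utf-16-le")[:8])  # b'M\x00A\x00I\x00L\x00' -- NULs, little-endian

        print("Gr\u00fc\u00dfe".encode("utf-8"))  # b'Gr\xc3\xbc\xc3\x9fe' -- multi-byte,
                                                  # but no embedded NUL bytes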
  • was how to cast the bstr. Thanks for the extra info however...
  • Of course unification makes sense. There is plenty of software that is used globally (e.g. the browser that you are reading this in - would you want to create a different browser version for each country?). Even just for Western languages Unicode obviates the need to deal with different codepages.
  • unicode works, but is unnecessary

    It is necessary for extended scripts, like Persian [persianacademy.ir], which is essentially an extended Arabic script, and for many of the minor scripts of the world, like Syriac [bethmardutho.org].

    I haven't seen a homepage in Unicode yet.

    Then see my homepage [sharif.ac.ir]!


    --

  • Also, if there's redundancy in Unicode, I imagine most of that space could be saved with gzip, which also has good support on the web, though, like Unicode, it is far underused.

    Well, one may also try the Standard Compression Scheme for Unicode [unicode.org]. A rough gzip comparison is sketched below.

    --
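
    As a rough illustration of the gzip point (an editorial sketch with made-up sample text, not a benchmark), general-purpose compression removes most of the per-character overhead of the wider encodings; SCSU is a Unicode-specific alternative that is not shown here.

        import zlib

        text = "Юникод работает. " * 200     # repetitive Cyrillic sample text

        for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
            raw = text.encode(encoding)
            print(encoding, len(raw), "->", len(zlib.compress(raw)))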

  • This reply was posted there [slashdot.org] before publication here, and is rated "4: Informative". The reader will not miss it if he cares enough.

    --

  • by roozbeh ( 247046 ) on Saturday June 09, 2001 @01:14PM (#164068) Homepage

    In Unicode terms, "ch" is called a grapheme, which is different from a character. (Or you may want to call it a letter.) It is encoded using the two characters "c" and "h". It is something that is considered a unit in some places, but not in others. I would recommend taking a look at the Unicode Standard book, which you can read online [unicode.org]. These things are covered in chapters 1 [unicode.org] and 2 [unicode.org].

    About string ordering, Unicode itself does not claim anything. If you look at ASCII, you will find that even it is not suitable for normal English sorting, since "B" is encoded before "a". But don't go away: Unicode has a Collation Algorithm [unicode.org] that specifies what one should do for advanced natural-language ordering of strings, including what to do with the Castilian "ch" (a sketch follows below).

    --
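
    To tie this back to the "ch" example, here is a minimal sketch using PyICU (assuming the PyICU package and ICU's traditional Spanish tailoring are available; this is an editorial addition, not part of the original comment) of how a tailored collator, rather than the character encoding itself, produces the dictionary order.

        import icu  # PyICU bindings; ICU implements UCA-based collation

        words = ["cuento", "chico", "dado"]

        collator = icu.Collator.createInstance(icu.Locale("es@collation=traditional"))
        print(sorted(words, key=collator.getSortKey))
        # -> ['cuento', 'chico', 'dado'] with the traditional tailoring,
        #    versus ['chico', 'cuento', 'dado'] in raw code-point order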

  • Unicode is the only sensible way to do any multi-language application on the web. Without it your brain will explode.

    If you want to see Unicode in action, visit a site I developed: the Universal Declaration of Human Rights [unhchr.ch] site. It has 320+ language translations of this document, the majority of which are in Unicode. You will also find a nice browser UTF-8 torture test there.

    There are still languages that do not have standardized Unicode glyphs (Amharic, for example), so you'll find some PDFs and scanned images there. But all in all, Unicode made this project doable.

  • Independant verification of stories before posting ...and independent verification of spelling :-)
  • perhaps the filter parses everything as ISO 8859-1, and whether a post gets filtered or not depends on how it looks in that light.

    Do form submissions in IE and Netscape pass a charset header when posting text?

  • ... is preventing people from responding ;-)
  • "Slashdot's forums reward those who leap before they look"

    Very true, and there's an easy fix. Make it so that no message posted within a certain amount of time after a news item goes up can be moderated up, and limit the number of posts that can be moderated up within a longer span of time after the item is posted. With luck, this would encourage people to look before they leapt, since anything posted in, say, the first 15 minutes would never be upmodded.
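
    A tiny Python sketch (purely hypothetical names and numbers, only to make the proposed rule concrete): a comment is eligible for up-moderation only if it arrived after a grace period following the story; the separate rate limit on later up-mods is not shown.

        from datetime import datetime, timedelta

        GRACE = timedelta(minutes=15)   # the suggested "first 15 minutes"

        def can_be_upmodded(story_posted: datetime, comment_posted: datetime) -> bool:
            # Comments posted during the grace period are never eligible.
            return comment_posted - story_posted >= GRACE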
  • I have my doubts that unifying the character sets is a good thing at all. Languages based on the Chinese character set have very different requirements and properties from alphabetic writing systems. Alphabetic writing systems have a number of complexities that don't really exist in Chinese (collation, ligatures, hyphenation, diacritics, etc.), while the CJK languages have enormous character sets that require many more bits to be reserved than alphabetic languages need. The Japanese writing system is so oddball and complex that it really doesn't fit in anywhere.

    Why should programmers for any one market have to deal with the complexities of the other writing systems? It seems to me that the only companies that really win here are those with global reach, who get to churn out localized versions of their software with minimal effort. But is that kind of generic adaptation really that high quality? Wouldn't software developed from scratch locally be better?

  • It was news to me, too. *hehe* Since I live in Berkeley, California, a well-known hotspot for white supremacist agitating (*rolls eyes*), I guess I'll have to check with my black, Hispanic, and gay neighbors to see where I got this reputation.

    It was just a troll, and an anonymous one at that.

    --Ken Whistler
  • Sorry, but this is just a goofy idea.

    Beyond the problem that nobody yet has a foolproof, standardizable listing of the 6000+ languages in current use on the planet, let alone the thousands more historical languages and all the dialects, having a character encoding that requires language identification on a character-by-character basis couldn't work in practice. How do you deal with borrowed vocabulary? How does a user input this stuff -- maybe they don't even know? How do you deal with conversion of text that isn't identified this way? And on and on.

    There are good reasons why character sets are built the way they currently are, and why language identification is treated as an issue for markup of text, rather than for character encoding.
  • Granted that a universal character encoding with the scope of Unicode is very complex, and a *full* implementation that does justice to all parts of it is beyond the capability of all but a few large software companies.

    But there are several mitigating points you may be missing.

    First, conformance to the Unicode Standard does not mean you have to actively support the repertoire of all the characters. It is perfectly compliant to just pay attention, say, to the Ethiopic characters, and do the best-in-the-world Ethiopic word processor or whatever, while simply passing through and essentially ignoring all the rest of the characters (a rough sketch of this pass-through idea appears after this comment). In this sense, the Unicode Standard is not inhibiting local best-of-breed development, but rather enabling it without diversion down the path of having to start off with local character encoding standards (often 8-bit font hacks) that don't, in turn, interoperate with anybody *else's* software.

    Second, most serious software development these days is modular anyway. You depend on other people to provide generic platform services, or to develop general libraries of routines that you turn around and use. Much Unicode development falls into that category. If Windows (or some other platform) does a good job of implementing Unicode, other developers can turn around and make use of the APIs those platforms provide to build applications on top of those platforms. Or you call into libraries that specialize in these issues. Nobody much goes around building their own graphics routines nowadays, for example -- you depend on the platforms or specialized libraries to provide such services and get on with concerns about the rest of your application.

    > Why should programmers for any one market have
    > to deal with the complexities of the other
    > writing systems?

    Well, in principle they should not, unless their concern is explicitly with rendering and writing system support.

    What you may be missing here is that the alternative to Unicode is having to deal with the complexities of character encoding support for hundreds of existing character encodings. That is far more of a generic burden on application development than having a *single* encoding (you usually pick either UTF-8 or UTF-16 and stick with it) for the character handling. There is a reason why Java just defined its strings from the beginning in terms of Unicode, and why that model took off so quickly.
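
    A brief Python sketch (an editorial illustration with a hypothetical shape() stub) of the pass-through idea from earlier in this comment: handle only the code points you care about, here the Ethiopic block U+1200..U+137F, and pass every other character through untouched.

        def is_ethiopic(ch: str) -> bool:
            # The Ethiopic block as encoded in Unicode 3.0.
            return 0x1200 <= ord(ch) <= 0x137F

        def shape(ch: str) -> str:
            # Placeholder for the application's real Ethiopic-specific processing.
            return ch

        def render(text: str) -> str:
            # Process the characters we support; pass everything else through
            # unchanged, which is still conformant behavior.
            return "".join(shape(ch) if is_ethiopic(ch) else ch for ch in text)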

