
How Do You Handle Unicode?

spectecjr asks: "The word on the street is that Unicode is the panacea for programming globalizable applications. The thing is, which operating systems support it? And how much work do you need to do? Allegedly, KDE and GNOME both support Unicode. Windows 9x supports it, but in a rather fragmented manner. Windows 2000 is Unicode to the extreme, even supporting Dvengali, Thai and Arabic script on all versions. But what are the pitfalls? What do you have to be aware of? What makes it different than talking to ASCII? How do you handle whitespace? How do you make your API display the characters in the right fonts? All of these issues are becoming more important as the world becomes more switched on, and the boundaries shrink between places. But what does it really mean for Joe Q. Developer?"

For some related articles on this subject, you might want to check out the following:

This discussion has been archived. No new comments can be posted.

  • The PNG format has a new (January 1999) chunk called "iTXt" that conveys text in the Unicode UTF-8 character set.

    What's different about Unicode? For one thing, at one point we were thinking of developing a chunk that could contain a graphical font (images of symbols), which would have been easy to do for ASCII (256 pointers into the table) but was complex enough with large character sets that we gave up.
  • Get a (+1, Informative) for posting the links to previous /. discussions on this topic?

    Inquiring minds want to know...
  • To say that Windows 9x supports Unicode "in a fragmented manner" is putting it mildly. Windows 9x supports converting Unicode to/from any installed ANSI codepage, but almost nothing else. 99% of the Win32 APIs in Win9x do NOT support Unicode strings, whereas in NT and W2000, almost all do.
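
    For reference, that conversion layer is the MultiByteToWideChar / WideCharToMultiByte pair, which does work on Win9x. A minimal sketch (the wrapper name is mine):

        #include <windows.h>

        /* Convert from the active ANSI codepage to Unicode (UCS-2).
           This much works even on Win9x. */
        int ansi_to_unicode(const char *ansi, WCHAR *wide, int cch)
        {
            return MultiByteToWideChar(CP_ACP, 0, ansi, -1, wide, cch);
        }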

    More generally, how one handles Unicode depends on what form of Unicode one wants to support. There are at least three basic varieties. MS Windows supports only UCS-2, in which all characters are two bytes wide. UTF-8 (variable-length characters) is supported only by the URL library. UCS-4 (all characters four bytes wide) is not supported at all. (There is also a variety called UTF-16 which I don't recall the details of. I think it's like UTF-8, but with 16-bit words instead of 8-bit bytes making up the individual components of each variable-length character.) String processing in variable-length character sets is a pain, though it's easier in UTF-* than it is in MBCS, because Unicode guarantees that the ranges of lead and trail bytes don't overlap.
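
    Concretely, that guarantee means you can always find the start of a character by scanning backwards -- a minimal C sketch (the function name is mine):

        /* Back up from p to the first byte of the UTF-8 character
           containing it.  This works because UTF-8 continuation bytes
           are always 10xxxxxx (0x80-0xBF), a range no lead byte uses. */
        const char *utf8_char_start(const char *start, const char *p)
        {
            while (p > start && ((unsigned char)*p & 0xC0) == 0x80)
                p--;
            return p;
        }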

  • I've been playing with XHTML and as an experiment I have set up this page:

    http://c.croome.net/ [croome.net]

    to be served as UTF-8 and the same page at this address:

    http://chris.croome.net/ [croome.net]

    to be served as ISO-8859-1

    It causes NN4.x to do some slightly weird things in X11 -- you get different fonts!

    I've tested it in IE 4 and 5 and NN 1, 2, 3 and 4 on windoze and NN 3, 4 and 6 in Linux and Lynx in Linux and all seems to be OK.

    I have heard it said that some old browsers might not be able to cope with UTF-8... but I don't have any evidence for this... Perhaps I should try and translate the site into German or something and see what happens then?

    Anyway, my conclusion is that for web sites in English there seems to be no problem with using UTF-8; however, I don't think there is any advantage either! Please correct me if I'm wrong.


  • The thing is, which operating systems support it? And how much work do you need to do?

    These two questions are quite closely related. If your operating system supports Unicode well, your level of extra work ranges from none to a little bit for programs that work with text (like word processors). Text processing programs often need to work with whitespace and case. Luckily, the Unicode Consortium offers small databases of the characters that are whitespace, and those which have counterparts in other cases. These are probably built into an operating system with good Unicode support.
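
    For the curious, those databases are plain text files like UnicodeData.txt, and parsing one is straightforward. A C sketch, assuming you only care about the BMP (names are mine; note that strtok() is the wrong tool here, since it collapses the file's many empty fields):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        static unsigned short simple_upper[0x10000];  /* BMP only */

        /* UnicodeData.txt is semicolon-separated: field 0 is the code
           point in hex, field 12 the simple uppercase mapping (empty
           when the character has no uppercase counterpart). */
        void load_upper_table(const char *path)
        {
            char line[512];
            FILE *f = fopen(path, "r");
            if (!f)
                return;
            while (fgets(line, sizeof line, f)) {
                char *field[15];
                int n = 0;
                char *p = line;
                field[n++] = p;
                while (n < 15 && (p = strchr(p, ';')) != NULL) {
                    *p++ = '\0';
                    field[n++] = p;
                }
                if (n < 14 || field[12][0] == '\0')
                    continue;  /* no uppercase mapping */
                unsigned long cp = strtoul(field[0], NULL, 16);
                unsigned long up = strtoul(field[12], NULL, 16);
                if (cp < 0x10000 && up < 0x10000)
                    simple_upper[cp] = (unsigned short)up;
            }
            fclose(f);
        }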

    At the other extreme, an operating system that doesn't have any Unicode support is going to require a lot of extra work from you, the developer. Sometimes, you can use the ASCII-compatible UTF-8 encoding to slip Unicode through OS functions that work on ASCII, but you'll probably most often end up working with Unicode internally in your program, and then interfacing with the OS in ASCII and/or its native character set. Display of non-native scripts is going to be the hardest part. If your host OS at least supports TrueType fonts, your program then must find out how the glyphs are encoded in the fonts to get the proper glyphs for display. This method works for XFree86 displays. If your host OS only has a font system geared toward its native character set, well, you're in for a world of pain extending it. And of course, for real Unicode support, you'll need to write a text display system that supports at least the Unicode directionality hints...
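
    To illustrate the UTF-8-through-ASCII-functions point: byte-oriented calls like strlen() and fopen() pass UTF-8 through untouched, but counting characters means skipping continuation bytes yourself. A sketch (the function name is mine):

        #include <stddef.h>

        /* Count code points, not bytes: every byte except the
           10xxxxxx continuation bytes starts a new character. */
        size_t utf8_strlen(const char *s)
        {
            size_t n = 0;
            for (; *s; s++)
                if (((unsigned char)*s & 0xC0) != 0x80)
                    n++;
            return n;
        }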

    But what are the pitfalls? What do you have to be aware of? What makes it different than talking to ASCII?

    First off, you have the issue of internal representation of characters. Do you choose UTF-8? If your program deals mostly with English characters, UTF-8 is the smallest. It's also more compute-intensive, as the characters are variable width. How about UCS2? It uses 16-bit-wide characters, and thus more space for English scripts, though each character is a fixed width. You do have to process it looking for surrogate character pairs when displaying text, though. Then there's also UCS4, where each character is 32 bits. That uses a lot more storage, but each character is exactly 32 bits (i.e. not variable width, nor are there any surrogates).
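
    The storage tradeoff is easy to make concrete. Per code point (a sketch; function names are mine):

        /* Bytes needed for one code point cp in each representation. */
        unsigned bytes_utf8(unsigned long cp)
        {
            return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        }

        unsigned bytes_utf16(unsigned long cp)  /* 16-bit units + surrogates */
        {
            return cp < 0x10000 ? 2 : 4;
        }

        /* UCS4 is always 4 bytes -- fixed width, no surrogates. */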

    Once you've chosen your storage format, you have to take into account things like case, directionality, composed characters, compatibility characters, and invalid characters. Unicode has three cases: uppercase, lowercase, and ``titlecase''. Titlecase is there to distinguish phrases with the first letter of each word capitalized. This is actually important for some characters, such as the Latin character ``Dz''. (``DZ'' vs. ``Dz'' vs. ``dz'') Composed characters are glyphs such as ``é'' that have their own Unicode character, but can also be formed by ``e'' plus an acute accent. These are often included for compatibility with legacy character sets. Other compatibility characters are characters that correspond to the same glyph as another Unicode character, but are included as part of a range of Unicode values taken from a legacy character set. Then there are some Unicode characters that are considered invalid, such as a high or low surrogate character appearing alone.
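
    The ``é'' example above, spelled out (assuming wchar_t holds Unicode values, as it does on Windows):

        #include <wchar.h>

        /* The same text two ways: precomposed U+00E9, or ``e'' plus the
           combining acute accent U+0301.  A naive wcscmp() says they
           differ; a Unicode-aware compare must normalize them first. */
        wchar_t composed[]   = { 0x00E9, 0 };
        wchar_t decomposed[] = { 0x0065, 0x0301, 0 };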

    This is all very different than ASCII, where you can play tricks like uppercasing a string by bit twiddling. With Unicode, you need a database to look up which characters have counterparts in other cases and what those equivalents are, you need to keep track of what's whitespace, you need to keep track of what's considered a decimal digit, and you need to know what's an invalid character.
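
    The ASCII trick, and why it dies in Unicode (a sketch; names are mine):

        /* In ASCII, the two cases of a letter differ by a single bit... */
        char ascii_upper(char c)
        {
            return (c >= 'a' && c <= 'z') ? (char)(c & ~0x20) : c;
        }

        /* ...but in Unicode some characters uppercase to TWO code points
           (German sharp-s becomes "SS"), some have no other case at all,
           and titlecase is a third possibility -- hence the database
           lookups described above. */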

    How do you make your API display the characters in the right fonts?

    You use a display system that handles it for you! ;-)

    Seriously, this is difficult. The world's scripts are rather diverse. There are right-to-left scripts, right-to-left scripts with left-to-right decimal numbers embedded, top-to-bottom scripts, and (if you're really a gonzo coder) Ancient Egyptian, which changes direction with each line! In addition to directionality, there are complex rules for rendering combined diacritical marks properly. Some scripts of course combine characters, and render them differently when combined. And then there are ligatures, funky punctuation, and text styles. There's a lot to take into account.

    I recommend picking up ``The Unicode Standard Version 3.0'' for more information. It's an invaluable resource for anybody dealing with Unicode.

    But what does it really mean for Joe Q. Developer?

    IMHO, what it means is that software developers ought to put some effort into standard (free, open source!) toolkits that handle the guts of Unicode processing and display, so that we can all put in a little bit of effort on it once, rather than trying to wrestle with Unicode support in every project.

  • Hmm... you could probably go read things at the site in your posting, http://www.unicode.org/ [unicode.org], and learn these things; it's not hard. It's just text, after all... http://www.unicode.org/unicode/standard/principles.html [unicode.org]

    Windows 2000 is Unicode to the extreme, even supporting Dvengali, Thai and Arabic script on all versions.
    No, that's just full Unicode support. If you truly support Unicode, you must support the entire character set, not just the part you feel like using. It's like saying you are ASCII compliant, but only for characters 12-34. BTW: NT 3.5x, NT 4 and CE (all versions) are Unicode compliant as well. (Probably all NT versions, but I'm not positive.)

    But what are the pitfalls?
    Hmm... you might actually support the 4 billion people in the world who never heard of ASCII until the British invaded, uh, I mean colonized their homeland.

    Seriously, a Unicode "character" is 16 bits, which means that if you are developing an app for English-speaking/reading Americans, your text resources will double in size. Bummer... (Although there is a UTF-8 variant I don't know much about - I think it's basically Unicode for the most popular/common languages.)
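
    The doubling is easy to see in C on Win32, where wchar_t is 16 bits (it's commonly 32 bits on Unix compilers):

        #include <stdio.h>

        int main(void)
        {
            printf("%u vs %u\n",
                   (unsigned)sizeof  "hello",    /*  6 bytes, with NUL    */
                   (unsigned)sizeof L"hello");   /* 12 bytes when wchar_t
                                                    is 16 bits wide       */
            return 0;
        }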

    Another pain in the ass is dealing with network protocols. Everybody expects ASCII, so you do a lot of converting, which is a pain. Although maybe some day everything will be XML based and you could be using Unicode plaintext for that...

    How do you handle whitespace?
    Uh, just like whitespace. In the NT world you can use iswspace instead of isspace. Basically, all your ASCII CRT functions have a Unicode equivalent. This way brain-dead programmers don't go running around saying the sky is falling just because they didn't bother to read up on the subject...
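
    For instance, the same old whitespace loop, just with the wide-character CRT (the function name is mine):

        #include <wchar.h>
        #include <wctype.h>

        /* iswspace() instead of isspace(); otherwise nothing changes. */
        size_t count_spaces(const wchar_t *s)
        {
            size_t n = 0;
            for (; *s; s++)
                if (iswspace(*s))
                    n++;
            return n;
        }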

    How do you make your API display the characters in the right fonts
    Uh... instead of a small lookup table for the few ASCII characters, you need a big lookup table for Unicode (or, more likely, a smarter implementation that maps in sections at a time as you use them). If your OS supports it, you don't worry about it at all; you use something like DrawText( L"blah blah blah" ). What gets difficult is using an ASCII-based text editor to enter Unicode strings. Basically you have to cut and paste from an app that does support it, or type things in manually.

    All of these issues are becoming more important as the world becomes more switched on, and the boundaries shrink between places
    And you're saying it wasn't important when the western Europeans were ranging all over Africa, Asia and the Americas conquering people? It's only important for those with compassion and understanding. Just like it should have been important to the Detroit automakers who were pissed off that Japan wouldn't let them sell cars there, that when they finally did get to sell some cars there, maybe they should have checked what the preferred side of the car was for the steering wheel.
  • Just a quick followup question:

    I seem to remember that MS marketed Win98 as having extensive Unicode support compared to Win95. You talk about Win9x in general however... Isn't there a difference?

  • No, Windows 98 doesn't have significantly greater Unicode support than Windows 95. Nor, as far as I know, does the forthcoming "Windows Millennium Edition" (Windows Me -- god, what a name). You can safely think of Windows 98 as just Windows 95 Second Edition (or Fourth, if you count 95 OSR1 and OSR2 as Second and Third).
  • One minor nit - Unicode is not always 16 bits/character. There are at least three modes, requiring 8-, 16- and 32-bit wide characters. For practical reasons the 32-bit mode is rarely used today (imagine managing a font with hundreds of thousands of glyphs!), but a properly written Unicode application should be able to handle it.

    In the meantime, the 40,000-odd characters in UTF-16 (not 2^16, since a big chunk is reserved for local use, etc.) are enough to handle all of the alphabetic languages *and* the most common words in common ideographic languages. It's not sufficient for the most demanding tasks, but it will be nirvana for anyone coming out of ASCII-land.
  • Do you mean Devanagari?
  • UTF-16 is two-byte objects, like UCS-2. Most UTF-16 characters are a single object. The difference is that in UTF-16 you can use "surrogate pairs" - two objects - to refer to characters outside the basic multilingual plane (BMP). UCS-2 simply doesn't permit references outside the BMP.
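
    The surrogate arithmetic itself is simple (a sketch; names are mine):

        /* Split a code point above the BMP into a high surrogate
           (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF). */
        void encode_pair(unsigned long cp, unsigned short out[2])
        {
            cp -= 0x10000;                            /* 20 bits remain */
            out[0] = (unsigned short)(0xD800 | (cp >> 10));    /* high */
            out[1] = (unsigned short)(0xDC00 | (cp & 0x3FF));  /* low  */
        }

        unsigned long decode_pair(unsigned short hi, unsigned short lo)
        {
            return 0x10000 + (((unsigned long)(hi - 0xD800) << 10)
                              | (unsigned long)(lo - 0xDC00));
        }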

    UCS-4 gives access to the entire 31-bit ISO 10646 character set, but it's fairly inefficient since most planes in that range haven't even been assigned yet.

    See appendix C of The Unicode Standard, Version 2.0 for details, or the Unicode Consortium's Web site [unicode.org].

    (Why doesn't /. allow use of the <cite> element?)

  • You correctly observed

    Hmm... you might actually support the 4 billion people in the world who never heard of ASCII...

    but said

    (Although there is a UTF-8 variant I don't know much about - I think it's basically Unicode for the most popular/common languages.)

    In fact, UTF-8 is hell for the most popular and common languages. For ASCII (Unicode characters 0-127 (U+0000 - U+007F)), UTF-8 means one byte per character. For ISO Latin 1 (U+0080 - U+00FF), UTF-8 means two bytes per character. For all other characters (except surrogates), UTF-8 means three bytes per character. That means that for ideographic languages (the most popular languages), as well as eastern European and non-Latin-alphabetic languages, UTF-8 is 150% of the size of UTF-16 - clearly a non-starter. For western European languages, or text primarily made of such languages, UTF-8 is a clear win, since accented characters aren't the majority.
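
    Those byte counts fall straight out of the encoding. A sketch of the encoder (the name is mine):

        /* Encode one code point as UTF-8, returning the byte count:
           1, 2 or 3 bytes over the ranges given above, 4 beyond the BMP. */
        int utf8_encode(unsigned long cp, unsigned char *out)
        {
            if (cp < 0x80) {
                out[0] = (unsigned char)cp;
                return 1;
            } else if (cp < 0x800) {
                out[0] = 0xC0 | (unsigned char)(cp >> 6);
                out[1] = 0x80 | (unsigned char)(cp & 0x3F);
                return 2;
            } else if (cp < 0x10000) {
                out[0] = 0xE0 | (unsigned char)(cp >> 12);
                out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
                out[2] = 0x80 | (unsigned char)(cp & 0x3F);
                return 3;
            } else {
                out[0] = 0xF0 | (unsigned char)(cp >> 18);
                out[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
                out[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
                out[3] = 0x80 | (unsigned char)(cp & 0x3F);
                return 4;
            }
        }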

    Another reason to pick UTF-8 is that it's network friendly and backwards-compatible with ASCII. Characters without the eighth bit set mean exactly what older systems expect them to mean, and all non-ASCII characters have the eighth bit set. The drawback is that the network has to be 8-bit-safe, which SMTP (for example) isn't. UTF-8 is also safe from "helpful" gateways that do e.g. line-end normalization. UTF-16 can be completely hosed by such gateways. For 7-bit-only channels, you can use UTF-7 (which is a somewhat nasty 7-bit encoding), or just bang the entire thing (in UTF-8 or UTF-16) into quoted-printable form.

  • First off, take a look at Jukka Korpela's excellent tutorial on character code issues [www.hut.fi].

    It's a little problematic to say that Microsoft supports Unicode -- they have a rather characteristic "embrace and extend" attitude towards character sets. The "windows character set", the reason why early JonKatz articles had question marks instead of quotes, is an extension to ISO Latin 1 which features smart quotes, em- and en-dashes, guillemets, etc., in a reserved section of the set (130-160). This creates a whole host of interoperability problems, as most Microsoft tools think it's OK to save 8-bit strings as text.
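
    For anyone who has to clean such text up, here are a few of the windows-1252 extensions sitting in the range that ISO Latin 1 reserves for control codes:

        /* windows-1252 bytes in 0x80-0x9F and their Unicode equivalents
           (the characters behind those mystery question marks). */
        static const struct { unsigned char cp1252; unsigned short ucs; }
        quirks[] = {
            { 0x85, 0x2026 },  /* horizontal ellipsis */
            { 0x91, 0x2018 },  /* left single quote   */
            { 0x92, 0x2019 },  /* right single quote  */
            { 0x93, 0x201C },  /* left double quote   */
            { 0x94, 0x201D },  /* right double quote  */
            { 0x96, 0x2013 },  /* en dash             */
            { 0x97, 0x2014 },  /* em dash             */
        };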

    In Microsoft's favor, however, Unicode support in IE is pretty good -- and I think Unicode is probably the best way to display many international characters on the web -- the standard &-entities (e.g. &oslash; for ø) aren't supported everywhere, but the numeric character reference (&#xxxxx;, where xxxxx is a decimal number) is gaining more support in HTML 4.0-compliant browsers. However, IE supports the non-standard extensions, and most support for non-Latin glyphs is through codepages...

    In any case, the real solution is to use LaTeX for all typesetting. :-)

    ~wog

"What man has done, man can aspire to do." -- Jerry Pournelle, about space flight

Working...