Become a fan of Slashdot on Facebook

 



Forgot your password?
typodupeerror

Comment Re:String f**k up (Score 4, Informative) 184

Spoken like somebody that's never had to deal with encoding issues. Using UTF-8 internally is fine, but exposing it to the programmer is insane and error-prone. And if the programmer then proceeds to manipulate that raw byte buffer as a string, he's an idiot.

The proper solution is to use 8-bit strings, but any functions that care (such as I/O) should treat them as being UTF-8. Most functions do not care and thus the treatment of "Unicode" and "bytes" are the same.

You might not be aware of this, but computers are used for more than just transmitting text. I don't want my binary streams being rewritten to gibberish because some I/O routine was written to be too clever. Furthermore, not every system uses UTF-8. Some may even need to send data over a *gasp* network! Good luck getting every other computer in the world to start using UTF-8 immediately.

The problem with UTF-16 is you cannot losslessly convert a string that *might* be UTF-8 to UTF-16 and then back again. This is because any illegal UTF-8 byte sequences will be lost or altered.

If you try to convert bytes that aren't in UTF-8 using a UTF-8 codec, an error will be raised. This behavior is proper -- if you don't know what format your input is in, there's no way to perform text-based operations on it.

This has been a real pain so far in our use of Python, and I am quite alarmed to see that they are changing the meaning of plain quotes in 3.0 to "Unicode".

Every developer I know uses Unicode strings already. The new behavior is just one less character to type in front of literals.

This is really a serious step backwards, as we will be forced to tell anybody using our system to put 'b' before all their string constants

Otherwise said as: "We're too stupid to fix the glaring encoding errors in our product, so we'll just use bytes everywhere and pretend it's all working". Also, Unicode strings in Python are implemented with either UTF-16 or UCS-4 depending on platform.

Slashdot Top Deals

The typical page layout program is nothing more than an electronic light table for cutting and pasting documents.

Working...