
Comment: Re:I am amazed (Score 1) 204

by spitzak (#49788071) Attached to: A Text Message Can Crash An iPhone and Force It To Reboot

Actually I think "Unicode strings" should be avoided completely.

They do not help at all in doing text manipulation, because Unicode code points are *not* "characters" or any other unit that users think about. This is due to combining characters and invisible characters such as bidi indicators. There is a prefix code unit that eats the next 2 letters and turns it into a country flag! It is a huge mess.

Far more important is they all lack the ability to store errors that are in a UTF-8 string in a lossless way. This means you cannot trust arbitrary 8-bit data to survive translation to "Unicode" and back. This has been the source of endless bugs and is the reason people can't use Python 3.0.

Comment: Re:I am amazed (Score 1) 204

by spitzak (#49788047) Attached to: A Text Message Can Crash An iPhone and Force It To Reboot

My recommendation is special iterators on std::string. Something like this:

    for (utf16_iterator i = string.begin(); i != string.end(); ++i) {
        int x = *i;
        if (x < 0) error_byte_found();  // negative: an invalid UTF-8 byte, not a code unit
        else utf16_found(x);
    }

There would also be iterators for UTF-32 (probably what you were thinking of as "Unicode", though a lot of Microsoft programmers think "Unicode" means UTF-16), and iterators for other normalization forms. In all cases these would return negative numbers, or some other value that cannot be confused with a code point, for UTF-8 error bytes.

This would be very fast because you can find the next Unicode code unit or whatever in constant time. Any api where you can arbitrarily index a unit using an integer is not going to be constant, it will be linear with that integer. Iterators avoid this.
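To make the idea concrete, here is a rough sketch of the decoding step such an iterator would wrap. It is the UTF-32 flavor (it yields code points), and the names `decode_next` and `decode_all` are made up for illustration. Reporting an error as the negated byte value is just one assumed convention for "a negative number"; anything that cannot be mistaken for a code point would do.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode one unit starting at s[pos] and advance pos.
// Returns a code point (>= 0) for a valid UTF-8 sequence. For anything
// invalid it consumes exactly ONE byte and returns that byte negated (< 0).
int decode_next(const std::string& s, std::size_t& pos) {
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    if (b0 < 0x80) { ++pos; return b0; }  // ASCII

    int len = 0, cp = 0, min = 0;
    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; cp = b0 & 0x1F; min = 0x80; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; min = 0x800; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; min = 0x10000; }
    else { ++pos; return -static_cast<int>(b0); }  // stray continuation or bad lead

    // Sequence runs off the end of the string: the lead byte is an error.
    if (pos + static_cast<std::size_t>(len) > s.size()) {
        ++pos;
        return -static_cast<int>(b0);
    }
    for (int i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) { ++pos; return -static_cast<int>(b0); }
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates, and values past U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        ++pos;
        return -static_cast<int>(b0);
    }
    pos += static_cast<std::size_t>(len);
    return cp;
}

// Walk the whole string front to back; each step looks at no more than 4 bytes.
std::vector<int> decode_all(const std::string& s) {
    std::vector<int> out;
    for (std::size_t pos = 0; pos < s.size(); ) out.push_back(decode_next(s, pos));
    return out;
}
```

Each call inspects at most four bytes, which is where the constant-time step comes from. The string itself is never converted or copied, so the error bytes stay in it untouched.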

Comment: Re:Lol (Score 1) 204

by spitzak (#49788001) Attached to: A Text Message Can Crash An iPhone and Force It To Reboot

No you don't. You are demonstrating the typical moronic attempts to deal with UTF-8.

Here is how you do it:

Go X bytes into the string. If that byte is a continuation byte, back up. Back up a maximum of 3 times. This will find a truncation point that will not introduce more errors into the string than are already there.
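A minimal sketch of that rule, with made-up function names. One detail is my own assumption and not spelled out above: if the byte is still a continuation byte after backing up 3 times, the bytes around X are already garbage, so the sketch cuts at X itself.

```cpp
#include <cstddef>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
static bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: returns a byte offset <= x at which the string can be
// cut without splitting a multi-byte sequence.
std::size_t utf8_truncation_point(const std::string& s, std::size_t x) {
    if (x >= s.size()) return s.size();  // nothing to cut
    std::size_t pos = x;
    int steps = 0;
    // Back up over continuation bytes, at most 3 times.
    while (steps < 3 && pos > 0 && is_continuation(s[pos])) {
        --pos;
        ++steps;
    }
    // Assumption: still on a continuation byte means a run of error bytes,
    // so byte x cannot belong to a valid sequence and cutting at x is safe.
    if (is_continuation(s[pos])) return x;
    return pos;
}

// Convenience wrapper: keep at most x bytes.
std::string utf8_truncate(const std::string& s, std::size_t x) {
    return s.substr(0, utf8_truncation_point(s, x));
}
```

Note that the result is measured in bytes and never decodes anything. Whether the truncated string then fits on the display is a separate question for whatever measures the drawn text.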

BUT BUT BUT I'm sure you are sputtering about how this won't give you exactly X "characters". NOBODY F**KING CARES!!!! If you want the string to "fit" you should be *measuring* it, not saying stuff that has not been true on computers since the 1950's about "N characters fit". I bet you think a combining letter and accent should count as 2, huh?

And your display function should not crash because it was given a string with an error in it! Even if you stupidly inserted the ellipsis all it should do is draw a few error indicators before the ellipsis.

Comment: Re: Lol (Score 1) 204

by spitzak (#49787957) Attached to: A Text Message Can Crash An iPhone and Force It To Reboot

No, the problem is code that pretends that illegal UTF-8 sequences magically don't exist!

For some reason UTF-8 turns otherwise intelligent programmers into complete morons. Here is another example from Apple. Let me state some rules about how to deal with UTF-8:

1. Stop thinking about "characters"!!!! This is a byte stream. The ONLY reason to think about a "character" is because you are DRAWING it on a display designed for a human to read, and humans do think about "characters". All other software either does not care, or is concerned with far more complex patterns (such as regexp and editors that deal with words and sentences), these second ones are not helped at all by an intermediate translation.

2. It is TRIVIAL to detect that the byte sequence you are looking at is not a valid UTF-8 character. In this case draw a replacement for exactly ONE byte and then try the next byte to see if it is a valid sequence. Do not skip more. There must be one error per byte so that the maximum number of good characters is preserved and so that a sequence with errors can be parsed bidirectionally without looking more than a few bytes ahead, and so that it is possible to search for error patterns. It also means there are only 128 different errors, not millions.

3. NEVER "translate to Unicode" (ie UTF-16) because this will be a lossy conversion of these invalid sequences and thus you have not preserved the original data. I'm sorry but Microsoft really screwed us here. Best recommendation is to write a wrapper around the filesystem calls and translate from UTF-8 to UTF-16 at the last moment, using U+DCxx as a translation for the error bytes (this is lossy but filenames already are, due to case independence, Apple's normalization, and even on Unix where "./foo" and "foo" are the same file).
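The last-moment conversion in rule 3 could look roughly like this. The sketch assumes some decoder has already turned the UTF-8 into code points and reports each invalid byte as that byte negated; that convention and the name `to_utf16_lossy` are assumptions for illustration, not an existing API.

```cpp
#include <string>
#include <vector>

// Hypothetical wrapper step: build the UTF-16 string to hand to a
// filesystem call. Input is a list of decoded units where a value >= 0 is a
// code point and a value < 0 is an invalid UTF-8 byte, negated.
std::u16string to_utf16_lossy(const std::vector<int>& units) {
    std::u16string out;
    for (int u : units) {
        if (u < 0) {
            // Error byte 0x80..0xFF becomes U+DC80..U+DCFF.
            out.push_back(static_cast<char16_t>(0xDC00 + ((-u) & 0xFF)));
        } else if (u >= 0x10000) {
            // Code points above the BMP take a surrogate pair.
            const int v = u - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(u));
        }
    }
    return out;
}
```

The point of doing this only inside the wrapper is that the program's own copy of the name stays as the original bytes; only the temporary passed to the API has been through the conversion.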

This is blatantly obvious if you substitute "words" for "characters" and imagine how you would write a program to deal with text strings. Words are also composed of multiple bytes in a row. For some reason nobody seems to crash on misspelled words, and they manage to concatenate and split strings and make whole file systems and diff programs and all kinds of other fancy text manipulation without having to translate the text so that each word is a fixed-sized integer. Amazing!

Comment: Re:flashy, but risky too. (Score 1) 83

by spitzak (#49579997) Attached to: Uber Testing Massive Merchant Delivery Service

Although I see problems with this I kind of doubt counterfeiting is going to be one. To successfully do this the driver/Uber would have to have access to a huge warehouse of counterfeit goods so they could exchange the real item (chosen by the customer, not the Uber driver) for a matching fake one. I just don't see that as a practical scheme for stealing goods.

Comment: Re:Animator needs three (Score 1) 301

if you and Wacom would embrace Bluetooth

So either the tablet is plugged into the wall or it is thick enough to contain a battery, or it has some thick part near the edge containing the battery? And I have to recharge it or replace the battery? Sorry I don't think so.

Comment: Re:Valve needs to use their clout (Score 2) 309

by spitzak (#49480757) Attached to: NVIDIA's New GPUs Are Very Open-Source Unfriendly

Actually you can change the monitor layout without restarting X now.

And the Gnome control for moving the monitors around somewhat works, though it is unclear whether they are special-casing Nvidia or Nvidia is implementing the necessary parts of xrandr. The Nvidia control works somewhat better.

Comment: Re:Schneier got it right a decade and a half ago (Score 1) 119

by spitzak (#49316223) Attached to: OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab

Yes, Java and Python (3) and Qt are all causing enormous difficulties, as they followed Microsoft down the fantasy road and thought you had to convert strings on input to "unicode" or somehow it was impossible to use them. Since not all 8-bit strings can convert, there must either be a lossy conversion or there must be an error, neither of which is expected, especially if the software is intended to copy data from one point to another without change.

The original poster is correct in saying "stay away from Unicode". This does not mean that Unicode is impossible. It means "treat it as a stream of bytes". Do not try to figure out what Unicode code points are there unless you really really have a reason to. And you will be surprised how little you need to figure this out. In particular you can search for arbitrary regexps (including sets of Unicode code points) with a byte-based regexp interpreter. And you can search for ASCII characters with trivial code.
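As an example of the "trivial code" case: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so an ASCII byte can never turn up in the middle of one. Splitting on an ASCII delimiter is therefore a plain byte search. The function name below is made up for the sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a UTF-8 string on an ASCII delimiter without decoding anything.
// Safe because bytes below 0x80 only ever appear as themselves in UTF-8.
std::vector<std::string> split_on_ascii(const std::string& s, char delim) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t hit = s.find(delim, start);
        if (hit == std::string::npos) {
            parts.push_back(s.substr(start));  // last piece
            return parts;
        }
        parts.push_back(s.substr(start, hit - start));
        start = hit + 1;
    }
}
```

Any invalid bytes in the input pass straight through into the pieces unchanged, which is the behavior you want from code that only copies data around.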

Comment: Re:Type "bush hid the facts" into Notepad. (Score 1) 119

by spitzak (#49316185) Attached to: OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab

Actually Plan 9 and UTF-8 encoding existed well before Microsoft started adding Unicode to Windows.

The reason for 16-bit Unicode was political correctness. It was considered wrong that Americans got the "better" shorter 1-byte encodings for their letters, therefore any solution that did not punish those evil Americans by making them rewrite their software was not going to be accepted. No programmer at that time (including ones that did not speak English) would ever argue for using anything other than a variable-length byte encoding for a system that still had to deal with existing software and data that was ASCII, this was a command from people who did not have to write and maintain the software.

The programmers, who knew damn well that variable-length was the correct solution, were unfortunately not bright enough to avoid making mistakes in their encodings (such as not making them self-synchronizing). UTF-8 fixed that, but these errors also led some of the less-knowledgeable to think there was a problem with variable length.

Unfortunately political correctness at Microsoft won, despite the fact that they had already added variable-length encoding support to Windows. It may also have been seen as a way to force incompatibility with NFS and other networked data so that Microsoft-only servers could be used.

One of the few good things to come out of the "Unix wars" was that commercial Unix development was stopped before the blight of 16-bit characters was introduced (it was well on its way and would have appeared at the same time Microsoft did it). Non-commercial Unix made the incredibly easy decision to ignore "wide characters".

The biggest problem now is that Windows convinced a lot of people who should know better that you need to use UTF-16 to open files by name (all that is really needed is to convert UTF-8 just before the api is called). This led UTF-16 to infect Python, Qt, Java, and a lot of other software and cause problems and headaches and bugs even on Linux. There is some hope that they are starting to realize they made a terrible mistake; Python in particular seems to be backing out by storing a UTF-8 version of the string alongside the UTF-32.

Comment: Re: novice programmer alert! (Score 1) 119

by spitzak (#49316061) Attached to: OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab

The big downside of UTF-8 is using it as an in-memory string. To find the nth character you have to start at the beginning of the string.

And this is important, why? Can you come up with an example where you actually produce "n" by doing anything other than looking at the n-1 characters before it in the string? No, and therefore an offset in bytes can be used just as easily.

C# and Java use UTF16 internally for strings.

And you are aware that UTF-16 is variable-length as well, and therefore you can't "find the nth character" quickly either?

You might want to retake compsci 101.
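For anyone who doubts the UTF-16 point, a small sketch (the function name is made up): code points above U+FFFF take two 16-bit units, so the unit count and the code point count differ, and finding the nth code point means scanning from the start just as with UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string. A high surrogate (D800..DBFF)
// followed by a low surrogate (DC00..DFFF) is one code point in two units.
std::size_t count_code_points_utf16(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the second half of the pair
        }
        ++n;
    }
    return n;
}
```

With `std::u16string s = u"a\U0001F600";` the string holds three 16-bit units but only two code points, so `s[1]` is half of a pair, not "the second character".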

The solution of this problem is trivial and is left as an exercise for the reader.
