
Comment Re:Lol (Score 1) 248

There's no law that says they can't pad the variable length input to fixed length

I'm not sure you quite understand the problem; it's not the input length, it's the encoding of each of the characters. So are you suggesting turning all single-byte encoded characters into a multi-byte encoding of some arbitrary maximum length? If you can already identify the problem at this level, then you would just do that in the parser that is truncating the string.

...and then make sure you're handling combining character sequences and bidirectional text correctly.

Comment Re: Lol (Score 1) 248

It is not hard and it seems really obvious, but for some reason Unicode turns some otherwise really smart programmers into total idiots.

There's a lot to know, and people might not be aware of all of it and all the issues involved.

...and "really smart" might actually be a handicap if it means thinking "I'm smart, I know how to do this, it's easy!" and not bothering to Read The Fine Manual, whereas somebody less smart might find Unicode scary and actually bother to RTFM.

Comment Re: Lol (Score 1) 248

From that description it does sound like the string is still valid. However if the display is crashing on a certain sequence containing an ellipsis, I am not clear why you can't construct that string directly, rather than rely on the insertion of the ellipsis.

Yup.

It does sound like they maybe rely on "sanitizing" but of a far more complex scheme than I was aware of.

Not to me, unless by "sanitizing" you mean "shortening so it'll fit in Notification Center".

This is still wrong, maybe far worse, as they are detecting and rejecting patterns containing ellipsis and some other character

I've seen nothing to indicate that they're doing anything specific with ellipses, other than "sticking them in at the point of truncation to let the user know that the full message isn't being displayed".

About all I'd assume is that certain sequences of characters are not being handled correctly by some part of Core Text; perhaps it's assuming, explicitly or implicitly, that those sequences "can't happen" and, instead of drawing them, it crashes, perhaps in an assert.

In this case their glyph layout should simply not crash on any possible arrangement of bytes or words in the incoming string.

Correct.

It is not hard and it seems really obvious, but for some reason Unicode turns some otherwise really smart programmers into total idiots.

There's a lot to know, and people might not be aware of all of it and all the issues involved.

Comment Re: Lol (Score 1) 248

So you are saying "fix the library". I am saying "sanitize input for library".

Both work, but I would argue that sanitizing for the library is usually a lot fewer problems.

"Programming for international environments is hard, let's go shopping!"

I would argue that you have perhaps not considered all the possible problems and have thus perhaps miscounted the problems with "work around a broken library by transforming perfectly legitimate Unicode character sequences into sequences that might not represent what the person sending the message intended", that being the correct description of the second approach to this problem in the list above.

Yeah, correctly truncating a message that could be an arbitrary sequence of text in multiple languages with combining character sequences and bidirectional text isn't easy, but, well, if you want to be thought of as a company that makes stuff that "just works", you'd better figure out how to make that complicated process "just work".

Maybe iOS 8.3.1 needs to have a quick fix of some sort, but iOS 10, if not iOS 9, should fix the truncation code.

Comment Re: Lol (Score 1) 248

In this case, the illegal UTF-8 sequence is the string after you have blown part of its funny foreign squiggle.

Where has it been proven that the bug is the trashing of a UTF-8 sequence?

First of all, Apple tends to use UTF-16 in the higher-level frameworks, e.g. that's how CFString/NSString work internally.

Second of all, processing entire characters rather than bytes is something I suspect Apple got right fairly early in the process. I suspect the problem is either that 1) when truncating the message for display, they're not processing entire graphemes, they're processing entire characters or 2) they're not taking bidirectionality into account or 3) they're not handling a combination of both issues.

He's saying that thing you call with your newly minted mangled string shouldn't fail.

Which is one way to solve it.

There are multiple things here that should be fixed. That's one of them - the renderer shouldn't crash if handed a bad string, it should fail more softly, e.g. put in a REPLACEMENT CHARACTER for all bad sequences and, if possible, log the error in a way that indicates that routine XXX has handed a bad character sequence to it.
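For illustration only, here's a minimal Python sketch of that "fail softly" idea, assuming the input arrives as UTF-8 bytes. The function name `decode_softly` and the `caller` parameter are made up for this example; this is not how Core Text or any Apple API works.

```python
import logging

log = logging.getLogger("render")

def decode_softly(raw: bytes, caller: str) -> str:
    """Decode UTF-8; on bad input, log who sent it and substitute U+FFFD."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Report the bug so it gets back to the developer of the caller.
        log.warning("%s passed invalid UTF-8 at byte offset %d", caller, err.start)
        # errors="replace" puts U+FFFD REPLACEMENT CHARACTER where decoding failed.
        return raw.decode("utf-8", errors="replace")
```

A string chopped in the middle of a two-byte character, such as `b"ab\xd9"`, comes back as `"ab\ufffd"` instead of raising.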

I would argue, if the thing you call mangles strings, sanitize its inputs so it doesn't get a string with a bad character (a unicode character of whatever format it uses internally, post-mangle).

And I would argue (all the way to the heat death of the universe) that, if you know that the thing you call mangles strings, and if it's produced by somebody else working on the same OS, you get it fixed so that it doesn't do that; you don't mangle user input (which includes text messages from other users) in released software, unless you don't have time to fix the underlying problem for the release.

Comment Re:Lol (Score 1) 248

It's a bad character if the library you call will fuck it up. That's what makes it bad.

If it's a valid character in the character set being used, and a valid representation of the character in the encoding being used for that character set, then it is by definition not a bad character; if the library you call fucks it up, the library is bad.

The fuckup isn't the lack of "sanitization" of perfectly clean strings, the fuckup is the library's inability to handle those strings.

Once you overwrite part of some multibyte character IT IS A BAD CHARACTER!!!

Then the fuckup is the overwriting of part of that character - or the overwriting of a combining character following a base character, or not handling bi-directionality correctly when figuring out where to and how to truncate the string. No, the rendering code shouldn't crash when handed the fucked-up string, but it should report the underlying bug somehow (in a way that gets back to the developer), so that bug doesn't go completely unnoticed.

Comment Re:Lol (Score 1) 248

(And you don't want to split it after N characters, if the goal is to limit the display length of the string you're displaying, as not all characters are the same width - and, of course, a base character followed by several combining characters might just have the width of the base character.)

...and, of course, when you're figuring out where to truncate, remember that some characters go right-to-left, not left-to-right - the string has both Roman-alphabet (left-to-right) and Arabic-alphabet (right-to-left) characters.

Comment Re:Lol (Score 1) 248

No you don't. You are demonstrating the typical moronic attempts to deal with UTF-8.

Here is how you do it:

Go X bytes into the string. If that byte is a continuation byte, back up. Back up a maximum of 3 times. This will find a truncation point that will not introduce more errors into the string than are already there.

As long as you're not splitting a sequence of multiple characters (multiple characters, some of which might be encoded in multiple bytes with UTF-8) some of which are combining characters. Don't split a character from a combining character following it. Splitting a sequence like that can introduce more rendering errors into the string than are already there.

(I suspect that's what the problem is in this bug, given that there are several combining characters in the string as shown in various places.)
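To make the difference concrete, here's a rough Python sketch of the byte-level back-up rule quoted above. The name `truncate_utf8` is invented for the example, and it assumes the input is already valid UTF-8.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a UTF-8 sequence."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut is inside a character, so
    # back up. A UTF-8 sequence has at most 3 continuation bytes.
    for _ in range(3):
        if cut > 0 and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        else:
            break
    return data[:cut]
```

This never produces an invalid byte sequence, but it will still happily cut between a base character and its combining mark: `truncate_utf8("\u0644\u064f".encode("utf-8"), 2)` returns just the encoding of U+0644, with the U+064F that followed it gone.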

(And you don't want to split it after N characters, if the goal is to limit the display length of the string you're displaying, as not all characters are the same width - and, of course, a base character followed by several combining characters might just have the width of the base character.)

Comment Re:Lol (Score 1) 248

Just because it's unlikely with a real text string doesn't mean that any of the text is invalid for a message. The text string should still not need to be changed. The bug only affects notifications, and it's clear that the text can be displayed just fine in conversation view.

This is almost certainly due to splitting multibyte characters on sub-character boundaries.

Or mishandling combining characters; the screenshot geminidomino provided shows several combining characters, as indicated by the dotted-line circles in some of the glyphs (and I suspect some of the marks above the Arabic characters come from combining characters as well).

Comment Re: Lol (Score 1) 248

No, the problem is code that pretends that illegal UTF-8 sequences magically don't exist!

Where's the illegal UTF-8 sequence in the message? Is the actual octet sequence in the message different from what's in this Slashdot posting (once converted to a sequence of octets), which contains no invalid UTF-8 sequences (yes, I went through them all by hand)?
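For anyone who'd rather not check by hand, a strict decode does the same job. This is just a sketch using the octets quoted in the "What is the string?" comments; `is_valid_utf8` is a helper name made up here.

```python
# Octets of the message as quoted in the "What is the string?" comments.
HEX = (
    "50 6f 77 65 72 20 d9 84 d9 8f d9 84 d9 8f d8 b5 d9 91 d8 a8 d9 8f "
    "d9 84 d9 8f d9 84 d8 b5 d9 91 d8 a8 d9 8f d8 b1 d8 b1 d9 8b 20 "
    "e0 a5 a3 20 e0 a5 a3 68 20 e0 a5 a3 20 e0 a5 a3 20 e5 86 97"
)
raw = bytes.fromhex(HEX)

def is_valid_utf8(octets: bytes) -> bool:
    """True if a strict UTF-8 decoder accepts the whole sequence."""
    try:
        octets.decode("utf-8")  # strict by default: raises on any bad sequence
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(raw))
```

Chopping the last octet off (`raw[:-1]`) leaves a truncated three-byte sequence at the end, and the same check then fails.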

Comment Re:What is the string? (Score 1) 248

That's the string encoded as UTF-8, so it's more like

50 6f 77 65 72 20 d9 84 d9 8f d9 84 d9 8f d8 b5 d9 91 d8 a8 d9 8f d9 84 d9 8f d9 84 d8 b5 d9 91 d8 a8 d9 8f d8 b1 d8 b1 d9 8b 20 e0 a5 a3 20 e0 a5 a3 68 20 e0 a5 a3 20 e0 a5 a3 20 e5 86 97

If we turn that into a sequence of (21-bit) Unicode code points, it becomes

000050 00006f 000077 000065 000072 000020 000644 00064f 000644 00064f 000635 000651 000628 00064f 000644 00064f 000644 000635 000651 000628 00064f 000631 000631 00064b 000020 000963 000020 000963

...with 000068 000020 000963 000020 000963 000020 005197 following it (I quit translating too early)

which, encoded as UTF-16, is

0050 006f 0077 0065 0072 0020 0644 064f 0644 064f 0635 0651 0628 064f 0644 064f 0644 0635 0651 0628 064f 0631 0631 064b 0020 0963 0020 0963

...with 0068 0020 0963 0020 0963 0020 5197 following it.

As UTF-16, there are no surrogate pairs, so the bug presumably isn't a problem with handling UTF-16-encoded Unicode characters bigger than 00FFFF.

Still true with the corrections.

Comment Re:What is the string? (Score 1) 248

In hex, the string is:

506f 7765 7220 d984 d98f d984 d98f d8b5 d991 d8a8 d98f d984 d98f d984 d8b5 d991 d8a8 d98f d8b1 d8b1 d98b 20e0 a5a3 20e0 a5a3 6820 e0a5 a320 e0a5 a320 e586 97

That's the string encoded as UTF-8, so it's more like

50 6f 77 65 72 20 d9 84 d9 8f d9 84 d9 8f d8 b5 d9 91 d8 a8 d9 8f d9 84 d9 8f d9 84 d8 b5 d9 91 d8 a8 d9 8f d8 b1 d8 b1 d9 8b 20 e0 a5 a3 20 e0 a5 a3 68 20 e0 a5 a3 20 e0 a5 a3 20 e5 86 97

If we turn that into a sequence of (21-bit) Unicode code points, it becomes

000050 00006f 000077 000065 000072 000020 000644 00064f 000644 00064f 000635 000651 000628 00064f 000644 00064f 000644 000635 000651 000628 00064f 000631 000631 00064b 000020 000963 000020 000963

which, encoded as UTF-16, is

0050 006f 0077 0065 0072 0020 0644 064f 0644 064f 0635 0651 0628 064f 0644 064f 0644 0635 0651 0628 064f 0631 0631 064b 0020 0963 0020 0963

As UTF-16, there are no surrogate pairs, so the bug presumably isn't a problem with handling UTF-16-encoded Unicode characters bigger than 00FFFF.
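The conversion above can be reproduced mechanically. A small Python sketch, assuming the UTF-8 octets listed earlier in this comment are the actual message:

```python
raw = bytes.fromhex(
    "50 6f 77 65 72 20 d9 84 d9 8f d9 84 d9 8f d8 b5 d9 91 d8 a8 d9 8f "
    "d9 84 d9 8f d9 84 d8 b5 d9 91 d8 a8 d9 8f d8 b1 d8 b1 d9 8b 20 "
    "e0 a5 a3 20 e0 a5 a3 68 20 e0 a5 a3 20 e0 a5 a3 20 e5 86 97"
)
text = raw.decode("utf-8")

# One entry per Unicode code point, padded to 6 hex digits as above.
code_points = [f"{ord(ch):06x}" for ch in text]

# Encode as big-endian UTF-16 and split into 16-bit code units.
utf16 = text.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

# Surrogates occupy D800..DFFF; any hit means a code point above U+FFFF.
has_surrogates = any(0xD800 <= u <= 0xDFFF for u in units)

print(" ".join(code_points))
print(" ".join(f"{u:04x}" for u in units))
print("surrogate pairs present:", has_surrogates)
```

Since every code point fits in the Basic Multilingual Plane, the number of UTF-16 code units equals the number of code points.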

I suspect that the string is probably being processed as UTF-16, because that's how CFString/NSString are encoded internally and because code handling UTF-8 that can't handle multi-byte characters couldn't handle anything other than ASCII.

U+0963 is DEVANAGARI VOWEL SIGN VOCALIC LL, which is a nonspacing mark; my guess is that it (or perhaps some other character in that sequence that's a combining character) is getting split, by the ellipsis, from the character with which it's supposed to combine, and that the rendering code is blowing up because of that.

If so, this has nothing to do with UTF-16 being too hard to handle correctly, or with the code not being able to handle characters that are "too many bytes", it has to do with sequences of characters sometimes having to be handled specially, and not just blithely split between characters.
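As a rough illustration of "not just blithely split between characters", here's a Python sketch that truncates on code point boundaries but refuses to separate combining marks from their base character. The names `is_mark` and `truncate_keeping_marks` are made up, and this is a simplification: real grapheme cluster segmentation (UAX #29) covers more cases, and the sketch ignores bidirectional text and display width entirely. It's not a claim about what Apple's code does or should do.

```python
import unicodedata

def is_mark(ch: str) -> bool:
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def truncate_keeping_marks(text: str, max_code_points: int,
                           ellipsis: str = "\u2026") -> str:
    """Truncate text, never cutting between a base character and its marks."""
    if len(text) <= max_code_points:
        return text
    cut = max_code_points
    # text[cut] is the first code point that would be dropped. If it is a
    # mark, it belongs to the character before the cut, so back up until
    # the cut lands in front of a non-mark and the whole sequence goes.
    while cut > 0 and is_mark(text[cut]):
        cut -= 1
    return text[:cut] + ellipsis
```

With `"ab\u0644\u064f"` and a limit of 3, a naive cut would keep the U+0644 and drop its U+064F; this version drops both and returns `"ab\u2026"`.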

It starts with "Power ", but I guess that's not important.

It might make the string long enough that the code displaying it on the main screen would abbreviate it and thus insert an ellipsis.
