
Comment Re:Lol (Score 1) 248

Holy fucking shit, who cares? If this was done by LETTER WIDTH, we wouldn't see the problem-

EXACTLY! That is why you do not want "N characters". I don't understand what your problem is here.

It is true that for this example most programmers would scan from the start, finding the longest string that fits with an ellipsis at the end.

What I was trying to point out is that if you want to be clever, you can guess at an insertion point. But 11 bytes is just as good a guess as 11 "characters", and since counting 11 characters requires scanning the string, you are not saving any time.

You are perfectly correct that after you stick the ellipsis in there you need to test to see if the rendering fits and perhaps try another guess. The idea is that you will do fewer measurements, but that such insertion can be done using byte offsets, and "N characters" is a useless concept that never enters into it.

Comment Re:Lol (Score 1) 248

Do you really think 12 happy faces fit in the same space as 12 letter 'i'?

This is why it is pointless to do such counting.

And what you propose would split between a letter and a combining accent, so it really isn't any better for trashing strings.

Basically, as soon as the words "N characters" come out of your mouth you are wrong. Your whole description just does that, for many paragraphs. Don't feel too bad, however, as there are many, many other people, including ones working for Apple, who are wrong as well.

PS: the surrogate order does not depend on the byte order in UTF-16. You might want to check what you are doing if you thought that.
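(A quick way to check this claim in C++: the same surrogate pair comes out high-surrogate-first on both big- and little-endian platforms; endianness only swaps the bytes inside each 16-bit unit.)

    #include <cstdint>

    int main() {
        // U+1F600 always encodes as the pair D83D, DE00 -- high surrogate
        // first -- regardless of platform byte order. Endianness only
        // affects how the two bytes INSIDE each char16_t are stored.
        char16_t pair[] = u"\U0001F600";
        static_assert(sizeof(pair) == 3 * sizeof(char16_t), "two units + NUL");
        return (pair[0] == 0xD83D && pair[1] == 0xDE00) ? 0 : 1;
    }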

Comment Re:I am amazed (Score 1) 248

A lot of misinformed programmers use the term "Unicode" to mean encodings other than UTF-8, typically UTF-16 or UCS-2. For instance, a function called "toUnicode" is often a translator from UTF-8 to UTF-16. Therefore, when people say "Unicode strings" they almost always mean non-byte strings. I propose that the best solution is to eliminate all such strings. It is true that byte strings would encode UTF-8 and thus be "Unicode", but the hope is that this would be so standard that there would never be a need to specify it, and they would simply be called "strings".

Comment Re:I am amazed (Score 1) 248

I don't think there are any useful algorithms that need random access, except for searches that do not require the index to be "exact". Therefore you certainly do not need a reliable string[i]. You can make it return a pointer to byte i, which will work for most uses (i.e. searching for ASCII and replacing one ASCII byte with another).
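For instance (a minimal sketch; the point is that ASCII bytes can never occur inside a multi-byte UTF-8 sequence, so plain byte operations are already correct):

    #include <cassert>
    #include <string>

    int main() {
        // "naïve/ascii" -- the 'ï' is the two bytes 0xC3 0xAF in UTF-8,
        // and neither byte can be mistaken for the ASCII '/' we search for.
        std::string s = "na\xC3\xAFve/ascii";
        size_t slash = s.find('/'); // plain byte search, no decoding needed
        assert(slash == 6);         // a byte offset, not a "character" count
        s[slash] = ':';             // replacing one ASCII byte is equally safe
        return 0;
    }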

Some complex searches do want to jump arbitrary amounts ahead, but they do not require any "exact" location; they want "a character near the middle" and so forth. In this case it might be useful to produce an iterator that points near byte i. A simple UTF-32 one would jump to byte i and then back up to the nearest non-continuation byte, unless there are 3 or more continuation bytes, in which case it would stay where it first landed (as that is the start of an error). Ones that return normalization forms would be much more complex. Something like this:


    utf32_iterator i(string, string.length()/2); // i now points near the middle of the string

Comment Re:Lol (Score 1) 248

No, you are wrong. What I propose does not fail any worse than what I think you are proposing, which is "search N Unicode code units forward and put the ellipsis there".

My scheme will not add an error. Either it will find the start of a character, or, if there are enough trailing continuation bytes, it will know that the string ends with an error and add the ellipsis after that (thus neither adding an error nor removing one). As other posters here point out, there is absolutely no need to count Unicode code points, as that has nothing to do with how many "characters" there are.

A better scheme would be to actually measure the rendered string to see if it fits, and then do a weighted binary search for the location that yields the longest string, ellipsis included, that fits. This still assumes that shorter strings render in a shorter area, which is not strictly true, but it is true enough that I think this would work.
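Something like this sketch, where measure_width() stands in for whatever the rendering library actually provides (the name is made up), and the search is a plain rather than weighted binary search, for brevity:

    #include <string>

    // Stand-in for the real rendering measurement (hypothetical).
    double measure_width(const std::string& s);

    // Back a candidate byte offset out of the middle of a UTF-8 sequence:
    // back up over at most 3 continuation bytes (0x80-0xBF).
    static size_t safe_offset(const std::string& s, size_t i) {
        for (int k = 0; k < 3 && i > 0 &&
                        (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80; ++k)
            --i;
        return i;
    }

    // Longest prefix of s that still fits in max_width with "…" appended.
    std::string fit_with_ellipsis(const std::string& s, double max_width) {
        if (measure_width(s) <= max_width) return s; // fits, no ellipsis needed
        size_t lo = 0, hi = s.size();                // search over byte offsets
        while (lo < hi) {
            size_t mid = safe_offset(s, lo + (hi - lo + 1) / 2);
            if (mid <= lo) break;                    // cannot split any closer
            if (measure_width(s.substr(0, mid) + "\xE2\x80\xA6") <= max_width)
                lo = mid;                            // fits: try a longer prefix
            else
                hi = mid - 1;                        // too wide: try shorter
        }
        return s.substr(0, lo) + "\xE2\x80\xA6";     // U+2026 HORIZONTAL ELLIPSIS
    }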

Comment Re: Lol (Score 1) 248

What I recommend is that anything that takes text input assume that the input could be any possible arrangement of the data units (i.e. any stream of bytes for UTF-8, and any stream of 16-bit words for UTF-16).

Don't "sanitize", because that is simply a step that produces a new string and feeds it to the next step. You have not fixed anything, because an error in the "sanitizing" will still crash (as quite a few posters here have tried to point out). The work must be done at the point where the data is translated into something other than a string. In this case, that is the glyph layout in their rendering code. That code should assume the input is ANY possible arrangement. Ideally it should draw something visible showing that there was an error, placed between glyphs so that it is clear where in the string the error was.

Relying on previous steps to only produce valid data is not only unsafe (as this bug shows) but also wasteful, because of the extra scanning of the data. And it is either lossy (because errors are translated to some valid sequence, so two different inputs map to the same result) or a denial of service (due to an exception being thrown and the loss of all further processing). Unfortunately, error handling that is completely obvious for most kinds of data somehow becomes confusing to programmers when they are presented with Unicode.
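As a sketch of what "at the very last moment" means (every name here is hypothetical, and decode_one() is the one-error-per-byte decoding described elsewhere in these comments):

    #include <string>

    // Hypothetical renderer hooks and decoder.
    void draw_glyph(int codepoint);
    void draw_error_marker(unsigned char byte);
    // Decodes one UTF-8 sequence at byte offset i: returns true and sets
    // codepoint and len, or returns false and sets len = 1 for a bad byte.
    bool decode_one(const std::string& s, size_t i, int& codepoint, int& len);

    // Layout accepts ANY byte sequence. Errors become visible markers at
    // the exact position of the bad byte; nothing is thrown or rewritten.
    void layout(const std::string& s) {
        for (size_t i = 0; i < s.size();) {
            int cp = 0, len = 1;
            if (decode_one(s, i, cp, len)) draw_glyph(cp);
            else draw_error_marker(static_cast<unsigned char>(s[i]));
            i += len;
        }
    }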

Comment Re: Lol (Score 1) 248

From that description it does sound like the string is still valid. However, if the display is crashing on a certain sequence containing an ellipsis, I am not clear on why you can't construct that string directly, rather than relying on the insertion of the ellipsis.

It does sound like they may rely on "sanitizing", but with a far more complex scheme than I was aware of. This is still wrong, and maybe far worse, as they are detecting and rejecting patterns consisting of an ellipsis plus some other character; that is INSANE! Any such work should be delayed until the VERY LAST moment: in this case, their glyph layout should simply not crash on any possible arrangement of bytes or words in the incoming string. This is very much the same stupidity I was ranting about for UTF-8. Nothing used to crash because you put misspelled words in your text and tried to print it. Apply the same logic to UTF-8 and Unicode. It is not hard, and it seems really obvious, but for some reason Unicode turns some otherwise really smart programmers into total idiots.

Comment Re:I am amazed (Score 1) 248

Actually I think "Unicode strings" should be avoided completely.

They do not help at all in doing text manipulation, because Unicode code points are *not* "characters" or any other unit that users think about. This is due to combining characters and invisible characters such as bidi indicators. Two "regional indicator" code points even combine into a single country flag! It is a huge mess.

Far more important, they all lack the ability to store, in a lossless way, the errors that were in a UTF-8 string. This means you cannot trust arbitrary 8-bit data to survive translation to "Unicode" and back. This has been the source of endless bugs, and it is the reason people can't use Python 3.0.

Comment Re:I am amazed (Score 1) 248

My recommendation is special iterators on std::string. Something like this:


    for (utf16_iterator i = string.begin(); i != string.end(); ++i) {
          int x = *i;
          if (x < 0) error_byte_found();
          else utf16_found(x);
    }

There would also be iterators for UTF-32 (probably what you were thinking of as "Unicode", though a lot of Microsoft programmers think "Unicode" means UTF-16), and iterators for other normalization forms. In all cases these would return, for UTF-8 error bytes, negative numbers or some other value that cannot be confused with a code point.

This would be very fast because you can find the next Unicode code unit (or whatever the iterator yields) in constant time. Any API where you can arbitrarily index a unit using an integer is not going to be constant time; it will be linear in that integer. Iterators avoid this.
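For the UTF-8-to-code-point case, a toy version of such an iterator might look like this (structural checks only; overlong and surrogate encodings are ignored to keep the sketch short):

    #include <string>

    // Toy iterator: yields one code point per step, or a negative value
    // (-1 - byte) for each byte that does not begin a valid sequence, so
    // errors can never be confused with real code points.
    class utf8_iterator {
        const unsigned char *p, *end;

        // Decode at p; n is set to the number of bytes consumed.
        int decode(int& n) const {
            unsigned char c = *p;
            n = 1;                                  // errors consume ONE byte
            if (c < 0x80) return c;                 // ASCII fast path
            int extra = c >= 0xF0 ? 3 : c >= 0xE0 ? 2 : c >= 0xC0 ? 1 : 0;
            if (extra == 0 || c >= 0xF5 || end - p <= extra)
                return -1 - c;                      // bad lead byte or truncated
            int cp = c & (0x3F >> extra);           // value bits of the lead byte
            for (int k = 1; k <= extra; ++k) {
                if ((p[k] & 0xC0) != 0x80) return -1 - c; // missing continuation
                cp = (cp << 6) | (p[k] & 0x3F);
            }
            n = 1 + extra;
            return cp;              // overlong/surrogate checks omitted here
        }

    public:
        explicit utf8_iterator(const std::string& s)
            : p(reinterpret_cast<const unsigned char*>(s.data())),
              end(p + s.size()) {}
        bool done() const { return p >= end; }
        int operator*() const { int n; return decode(n); }
        utf8_iterator& operator++() { int n; decode(n); p += n; return *this; }
    };

It is used the same way as the loop above: "for (utf8_iterator i(s); !i.done(); ++i)", with a "*i < 0" test picking out the error bytes. Each step does a constant, bounded amount of work, which is the whole point.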

Comment Re:Lol (Score 1) 248

No, you don't. You are demonstrating the typical moronic attempts to deal with UTF-8.

Here is how you do it:

Go X bytes into the string. If that byte is a continuation byte, back up. Back up a maximum of 3 times. This will find a truncation point that will not introduce more errors into the string than are already there.
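In code the whole thing is a few lines (a sketch; 0x80-0xBF are the continuation bytes):

    #include <string>

    // Truncation point at or just before byte offset x that does not split
    // a valid multi-byte sequence. Backs up over at most 3 continuation
    // bytes: 4 or more in a row are already an error in the input, so
    // there is nothing to be gained by backing up further.
    size_t truncation_point(const std::string& s, size_t x) {
        if (x >= s.size()) return s.size();
        for (int k = 0; k < 3 && x > 0 &&
                        (static_cast<unsigned char>(s[x]) & 0xC0) == 0x80; ++k)
            --x;
        return x;
    }

    // e.g.: std::string cut = s.substr(0, truncation_point(s, 40)) + "\xE2\x80\xA6";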

BUT BUT BUT, I'm sure you are sputtering, this won't give you exactly X "characters". NOBODY F**KING CARES!!!! If you want the string to "fit", you should be *measuring* it, not relying on "N characters fit", which has not been true on computers since the 1950s. I bet you think a combining letter and accent should count as 2, huh?

And your display function should not crash because it was given a string with an error in it! Even if you stupidly inserted the ellipsis mid-sequence, all it should do is draw a few error indicators before the ellipsis.

Comment Re: Lol (Score 2) 248

No, the problem is code that pretends that illegal UTF-8 sequences magically don't exist!

For some reason UTF-8 turns otherwise intelligent programmers into complete morons. Here is another example from Apple. Let me state some rules about how to deal with UTF-8:

1. Stop thinking about "characters"!!!! This is a byte stream. The ONLY reason to think about a "character" is because you are DRAWING it on a display designed for a human to read, and humans do think about "characters". All other software either does not care, or is concerned with far more complex patterns (such as regexps, or editors that deal with words and sentences); these are not helped at all by an intermediate translation.

2. It is TRIVIAL to detect that the byte sequence you are looking at is not valid UTF-8. In that case, draw a replacement for exactly ONE byte and then try the next byte to see if it starts a valid sequence. Do not skip more. There must be one error per byte, so that the maximum number of good characters is preserved, so that a sequence with errors can be parsed bidirectionally without looking more than a few bytes ahead, and so that it is possible to search for error patterns. It also means there are only 128 different errors, not millions.

3. NEVER "translate to Unicode" (i.e. UTF-16), because this will be a lossy conversion of these invalid sequences, and thus you have not preserved the original data. I'm sorry, but Microsoft really screwed us here. The best recommendation is to write a wrapper around the filesystem calls and translate from UTF-8 to UTF-16 at the last moment, using U+DCxx as the translation for error bytes (this is lossy, but filenames already are, due to case independence, Apple's normalization, and even Unix, where "./foo" and "foo" are the same file).
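A sketch of that translation (Python later standardized the same trick as "surrogateescape": only bytes 0x80-0xFF can ever be errors, so they map onto the 128 lone surrogates U+DC80-U+DCFF, which cannot occur in valid UTF-16 and are therefore unambiguously recoverable):

    #include <string>

    // One invalid UTF-8 byte -> its lossless UTF-16 stand-in, and back.
    char16_t error_byte_to_utf16(unsigned char b) { return char16_t(0xDC00 | b); }

    bool is_smuggled_byte(char16_t u) { return u >= 0xDC80 && u <= 0xDCFF; }

    unsigned char utf16_to_error_byte(char16_t u) {
        return static_cast<unsigned char>(u & 0xFF);
    }

    // Hypothetical use around a UTF-16-only filesystem API:
    //   std::u16string w = to_utf16_dcxx(utf8_name); // error bytes -> U+DCxx
    //   open_file_w(w);                              // e.g. a _wopen wrapper
    //   std::string back = from_utf16_dcxx(w);       // original bytes recovered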

This is blatantly obvious if you substitute "words" for "characters" and imagine how you would write a program to deal with text. Words are also composed of multiple bytes in a row. For some reason nobody's code crashes on misspelled words, and people manage to concatenate and split strings and build whole file systems and diff programs and all kinds of other fancy text manipulation without having to translate the text so that each word becomes a fixed-size integer. Amazing!
