Comment Re:Interoperating with invalid data (Score 1) 196

Stupid software that thinks it has to convert to UTF-16 is about 95% of the problem.

UTF-16 cannot losslessly store invalid UTF-8. It also cannot losslessly store certain sequences of Unicode code points (it cannot store a low surrogate followed by a high surrogate, because that pattern is reserved to mean a non-BMP code point). It also forces a weird cutoff at 0x10FFFF which a lot of programmers get wrong (using either 0x1FFFF or 0x1FFFFF). UTF-16 is also variable sized and has invalid sequences, thus it has NO advantages over UTF-8, so the entire scheme is a waste of time.
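The surrogate arithmetic behind that 0x10FFFF cutoff is easy to sketch (Python, purely for illustration; the helper names here are made up):

```python
def to_surrogates(cp):
    """Encode a code point above the BMP as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)      # top 10 payload bits
    low = 0xDC00 + (cp & 0x3FF)     # bottom 10 payload bits
    return high, low

def from_surrogates(high, low):
    """Reassemble a code point from a high/low surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(tuple(hex(u) for u in to_surrogates(0x1F600)))  # ('0xd83d', '0xde00')
print(hex(from_surrogates(0xDBFF, 0xDFFF)))           # 0x10ffff
```

The pair carries 20 payload bits, so the highest reachable code point is 0x10000 + 0xFFFFF = 0x10FFFF; confusing that limit with 0x1FFFF or 0x1FFFFF is exactly the mistake described above.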

Unfortunately a bunch of people are so enamored with all the work they did to convert everything to 16-bit that they are refusing to admit they made a mistake. One tactic is to make invalid UTF-8 throw errors, and thus make it virtually impossible to manipulate text in UTF-8 form. Note that they don't throw exceptions on invalid UTF-16; care to explain that??? HMM????

UTF-8 can store all possible UTF-16 strings losslessly (including lone surrogates which are considered "invalid" in UTF-16), as well as storing invalid UTF-8. It can encode a continuous range of code points from 0-0x10FFFF, or 0x1FFFFF with a trivial change (it can do up to 0x7FFFFFFF if you use the original UTF-8 design).
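Python's "surrogatepass" error handler illustrates the lone-surrogate point: a code point that the strict codec rejects still has an obvious 3-byte UTF-8 encoding, so it round-trips losslessly (a small sketch, not the proposal itself):

```python
# A lone high surrogate: "invalid" in strict UTF-16/UTF-8 terms, but it
# still has a natural 3-byte UTF-8 encoding, so nothing is lost.
lone = "\ud800"
raw = lone.encode("utf-8", "surrogatepass")
print(raw)                                        # b'\xed\xa0\x80'
assert raw.decode("utf-8", "surrogatepass") == lone

# The strict codec refuses the very same character:
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 rejects a lone surrogate")
```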

PEP 393 does NOT solve the problem. The "ascii" representation is limited to 7-bit characters and thus cannot store arbitrary UTF-8 byte sequences (valid or not).

There is a "utf-8" entry in the PEP 393 strings, but it appears the current design requires the data to be translated to UTF-16 and back to UTF-8 before it is stored there, thus disallowing invalid strings. My proposal is that converting bytes to a string copies the data unchanged to this UTF-8 storage, and that checking for encoding errors be deferred until there actually is a reason to look at Unicode code points, which is VERY VERY RARE, despite the impression of amateur programmers. I also propose some small changes to how the parser interprets "\xNN" and "\uNNNN" in string constants so that it is possible to swap between bytes and "unicode" strings without having to change the contents of the constant.
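A hypothetical sketch of the deferred-validation idea (the class and method names are invented for illustration; this is not PEP 393 code):

```python
class LazyStr:
    """Store raw bytes unchanged; validate only when code points are needed."""

    def __init__(self, raw: bytes):
        self._raw = raw             # block copy, never validated here

    def raw_bytes(self) -> bytes:
        return self._raw            # lossless round-trip, always works

    def code_points(self):
        # Validation happens only now, when someone inspects characters.
        return [ord(c) for c in self._raw.decode("utf-8")]

s = LazyStr(b"ok \xff")             # construction never raises
assert s.raw_bytes() == b"ok \xff"
try:
    s.code_points()                 # the strict decode fails only here
except UnicodeDecodeError:
    print("error deferred until code points were actually needed")
```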

Comment Re:Interoperating with invalid data (Score 1) 196

Aha! Somebody who really does not have a clue.

No, substr() does not require decoding, because offsets can be in code units.

No, replace() does not require decoding, because pattern matching does not require decoding, since UTF-8 is self-synchronizing.

No, split() does not require decoding, because offsets can be in code units.

No, join() does not require decoding (and in fact I cannot think of any reason you would think it does; the ones above at least stem from beginning-programmer mistakes/assumptions).
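The self-synchronization claim is easy to demonstrate on raw bytes (a sketch; the strings here are arbitrary examples):

```python
# Because every UTF-8 multi-byte sequence starts with a lead byte that
# cannot occur inside another sequence, a byte-level match is always a
# character-level match -- so search/replace/split need no decoding.
data = "naïve café".encode("utf-8") + b"\xff"    # valid UTF-8 plus one stray byte

# replace() on raw bytes: no decode step, stray byte passes through untouched.
swapped = data.replace("café".encode("utf-8"), b"tea")
print(swapped)                                   # b'na\xc3\xafve tea\xff'

# split() on an ASCII delimiter is equally safe.
parts = data.split(b" ")
print(parts)                                     # [b'na\xc3\xafve', b'caf\xc3\xa9\xff']
```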

Comment Re:Interoperating with invalid data (Score 1) 196

Well the first thing you need to do to clean up the invalid UTF-8, for instance in filenames, is to detect it.

If reading the filename causes it to immediately throw an exception and dispose of the filename, I think we have a problem. Right now you cannot do this in Python unless you declare it "bytes" and give up on actually looking at the Unicode in the vast majority of filenames that *are* correct.

It is also necessary to pass the incorrect filename to the rename() function, along with the correction. That is impossible with Python 3.0's library, and is probably the more serious problem.

Both of these problems are trivial to fix if Python would just consider arbitrary byte sequences valid values for strings, and defer complaining about incorrect encoding until the string actually needs to be *decoded*, which is really only needed to display it, and sometimes for parsing in the rare cases where non-ASCII has syntactic value and is not just treated as letters.
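Python eventually grew something along these lines in PEP 383's "surrogateescape" error handler; a sketch of how it defuses the filename case (the filename is a made-up example):

```python
raw_name = b"report-\xff.txt"                         # not valid UTF-8
as_str = raw_name.decode("utf-8", "surrogateescape")  # no exception raised
print(repr(as_str))                                   # 'report-\udcff.txt'

# Each bad byte became a lone low surrogate, so the round trip is lossless:
assert as_str.encode("utf-8", "surrogateescape") == raw_name

# And the damage is still detectable at the moment you actually care:
is_clean = all(not 0xDC80 <= ord(c) <= 0xDCFF for c in as_str)
print("valid UTF-8 name?", is_clean)                  # valid UTF-8 name? False
```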

Comment Re:Fuck that guy. (Score 1) 397

I really doubt a majority of people think affirmative action helps Asians. It helps underrepresented minorities, and in most jobs and schools Asians are not underrepresented. It seems incredibly unlikely that 95% of people (whether they approve or disapprove of affirmative action) think it helps Asians.

I suspect you actually made a typo of some sort but am curious what exactly you were trying to say there.

Comment Re: and... (Score 1) 196

I'm arguing against a design that is the equivalent of saying "you can't run cp on this file because it contains invalid XML".

There is nothing wrong with the XML interpreter throwing an error AT THE MOMENT YOU TRY TO READ DATA FROM THE STRING.

There is a serious problem when just saying "this buffer is XML" causes an immediate crash if you put non-XML into it.

Comment Re: and... (Score 1) 196

God damn you people are stupid.

I am trying to PREVENT denial of service bugs. If a program throws an unexpected exception on a byte sequence that it is doing nothing with except reading into a buffer, then that is a denial of service. If you really think that invalid UTF-8 can lead to an exploit, you seem to completely misunderstand how things work. All decoders throw errors when they decode invalid UTF-8, including overlong sequences and all other such patterns. So any code looking at the Unicode code points will still get errors. And if you think there is some exploit that relies on a byte pattern that somehow only works for invalid UTF-8, then you have quite a fantastic imagination but no knowledge of reality.

Comment Re:Interoperating with invalid data (Score 1) 196

The program should produce an error AT THE MOMENT IT TRIES TO EXTRACT A Unicode CODE POINT. Not before, and not after.

If the program reads the invalid string from one file, does not check it, and writes it to another file, I expect, and REQUIRE, that the invalid byte sequence be written to the new file. It should not be considered any more of a problem than the fact that programs don't fix spelling mistakes when copying strings from one place to another.
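That copy-unchanged requirement can be met with the "surrogateescape" handler; a minimal sketch (the payload is a made-up example, and note this handler is not what a plain open() uses by default):

```python
import os
import tempfile

payload = b"good line\nbad \xfe\xff line\n"          # contains invalid UTF-8

src = tempfile.NamedTemporaryFile(delete=False)
src.write(payload)
src.close()

# Reading never raises; each bad byte becomes a lone low surrogate.
with open(src.name, encoding="utf-8", errors="surrogateescape") as f:
    text = f.read()

# Writing restores the original bytes exactly.
dst = src.name + ".copy"
with open(dst, "w", encoding="utf-8", errors="surrogateescape", newline="") as f:
    f.write(text)

with open(dst, "rb") as f:
    assert f.read() == payload                       # byte-for-byte identical

os.unlink(src.name)
os.unlink(dst)
```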

Comment Re: and... (Score 1) 196

The text is 99.9999999% UTF-8.

What I want to do is gracefully handle tiny mistakes in the UTF-8 without having to rewrite every function and every library function it calls to take a "bytes" instead of a "string", and thus completely abandon useful Unicode handling!

Come on, it is blindingly obvious why this is needed, and I cannot figure out why people like you seem to think that physically possible arrangements of bytes will not appear in files. The fact that all serious software cannot use Unicode and has to resort to byte twiddling should be a clue, you know.

Comment Re: and... (Score 2) 196

No, all that means is that EVERYTHING has to be changed to use the bytes type.

I mean every single library function that takes a unicode string, every use of ParseTuple that translates to a string, etc. Pretty much the entire Python library must be rewritten, or a wrapper added around every function that takes a string argument.

Everybody saying that "it's good to catch the error earlier" obviously has ZERO experience programming. Let's see, would it be a good idea if attempting to read a text file failed if there was a spelling error? Or perhaps it might be a good idea to defer this problem until it actually makes a difference?

This crazy belief that somehow some physically possible patterns of bytes will just magically not happen because you said they are "invalid" is inexplicable. No system other than UTF-8 seems to cause this weird brain damage; no other system is so totally unprepared for invalid storage and pretends that all storage will be valid. I cannot explain it, except that it seems like exposure to ASCII, where all byte sequences are always valid, has rotted people's minds so that they dismiss the problem.
 

Comment Re: and... (Score 3, Informative) 196

This exactly.

If your UTF-8 string is not completely valid, Python 3 barfs in useless and unpredictable ways. This is not a problem with Python 2.x.

Until they fix the string so that an arbitrary sequence of bytes can be put into it and pulled out *UNCHANGED* without it throwing an exception, it cannot be used for any serious work. Bonus points if this is actually efficient (i.e. it is done by storing the bytes with a block copy).

Furthermore it would help if "\xNN" produced raw byte values rather than the UTF-8 encoding of "\u00NN", which I can already get by typing (gasp!) "\u00NN".
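For reference, the asymmetry being complained about (this is standard Python behavior, shown for illustration):

```python
s = "\xe9"                    # in a str literal, \xNN means code point U+00NN (é)
b = b"\xe9"                   # in a bytes literal, \xNN is the raw byte 0xNN

print(s.encode("utf-8"))      # b'\xc3\xa9' -- two bytes, the UTF-8 encoding
print(len(b), hex(b[0]))      # 1 0xe9    -- one raw byte
assert s == "\u00e9"          # the two str spellings are the same character
```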

Comment Re:Been terrific for me and my employees (Score 1) 578

Yeah, I would agree that it seems more fair if the company instead made a 50/50 split, so the employee is now paying $100 and the company another $100. The main reason this seems fair is that I'll bet that if the cost went *up* they would not eat all the extra, but would have split the higher cost so the employee paid more.
