Comment Re:Interoperating with invalid data (Score 1) 196

Stupid software that thinks it has to convert to UTF-16 is about 95% of the problem.

UTF-16 cannot losslessly store invalid UTF-8. It also cannot losslessly store certain sequences of Unicode code points (it cannot store a low surrogate followed by a high surrogate, because that pattern is reserved to mean a non-BMP code point). It also forces a weird cutoff at 0x10FFFF which a lot of programmers get wrong (using either 0x1FFFF or 0x1FFFFF). UTF-16 is also variable sized and has invalid sequences, thus it has NO advantages over UTF-8, so the entire scheme is a waste of time.
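The surrogate arithmetic behind that 0x10FFFF cutoff is easy to sketch (Python, purely for illustration; the helper names here are made up):

```python
def to_surrogates(cp):
    """Encode a code point above the BMP as a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)      # top 10 payload bits
    low = 0xDC00 + (cp & 0x3FF)     # bottom 10 payload bits
    return high, low

def from_surrogates(high, low):
    """Reassemble a code point from a high/low surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(tuple(hex(u) for u in to_surrogates(0x1F600)))  # ('0xd83d', '0xde00')
print(hex(from_surrogates(0xDBFF, 0xDFFF)))           # 0x10ffff
```

The pair carries 20 payload bits, so the highest reachable code point is 0x10000 + 0xFFFFF = 0x10FFFF; confusing that limit with 0x1FFFF or 0x1FFFFF is exactly the mistake described above.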

Unfortunately a bunch of people are so enamored with all the work they did to convert everything to 16-bit that they are refusing to admit they made a mistake. One tactic is to make invalid UTF-8 throw errors, and thus make it virtually impossible to manipulate text in UTF-8 form. Note that they don't throw exceptions on invalid UTF-16; care to explain that??? HMM????

UTF-8 can store all possible UTF-16 strings losslessly (including lone surrogates which are considered "invalid" in UTF-16), as well as storing invalid UTF-8. It can encode a continuous range of code points from 0-0x10FFFF, or 0x1FFFFF with a trivial change (it can do up to 0x7FFFFFFF if you use the original UTF-8 design).
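Python's "surrogatepass" error handler illustrates the lone-surrogate point: a code point that the strict codec rejects still has an obvious 3-byte UTF-8 encoding, so it round-trips losslessly (a small sketch, not the proposal itself):

```python
# A lone high surrogate: "invalid" in strict UTF-16/UTF-8 terms, but it
# still has a natural 3-byte UTF-8 encoding, so nothing is lost.
lone = "\ud800"
raw = lone.encode("utf-8", "surrogatepass")
print(raw)                                        # b'\xed\xa0\x80'
assert raw.decode("utf-8", "surrogatepass") == lone

# The strict codec refuses the very same character:
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 rejects a lone surrogate")
```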

PEP 393 does NOT solve the problem. The "ascii" representation is limited to 7-bit characters and thus cannot store arbitrary UTF-8 byte sequences (valid or not).

There is a "utf-8" entry in the PEP 393 strings, but it appears the current design requires the data to be translated to UTF-16 and back to UTF-8 before it is stored there, thus disallowing invalid strings. My proposal is that converting bytes to a string copies the data unchanged to this UTF-8 storage, and that checking for encoding errors be deferred until there actually is a reason to look at Unicode code points, which is VERY VERY RARE, despite the impression of amateur programmers. I also propose some small changes to how the parser interprets "\xNN" and "\uNNNN" in string constants so that it is possible to swap between bytes and "unicode" strings without having to change the contents of the constant.
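A hypothetical sketch of the deferred-validation idea (the class and method names are invented for illustration; this is not PEP 393 code):

```python
class LazyStr:
    """Store raw bytes unchanged; validate only when code points are needed."""

    def __init__(self, raw: bytes):
        self._raw = raw             # block copy, never validated here

    def raw_bytes(self) -> bytes:
        return self._raw            # lossless round-trip, always works

    def code_points(self):
        # Validation happens only now, when someone inspects characters.
        return [ord(c) for c in self._raw.decode("utf-8")]

s = LazyStr(b"ok \xff")             # construction never raises
assert s.raw_bytes() == b"ok \xff"
try:
    s.code_points()                 # the strict decode fails only here
except UnicodeDecodeError:
    print("error deferred until code points were actually needed")
```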

Comment Re:Interoperating with invalid data (Score 1) 196

Aha! Somebody who really does not have a clue.

No, substr() does not require decoding, because offsets can be in code units.

No, replace() does not require decoding, because pattern matching does not require decoding, since UTF-8 is self-synchronizing.

No, split() does not require decoding, because offsets can be in code units.

No, join() does not require decoding (and in fact I cannot think of any reason you would think it does; the ones above at least stem from beginning-programmer mistakes/assumptions).
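The self-synchronization claim is easy to demonstrate on raw bytes (a sketch; the strings here are arbitrary examples):

```python
# Because every UTF-8 multi-byte sequence starts with a lead byte that
# cannot occur inside another sequence, a byte-level match is always a
# character-level match -- so search/replace/split need no decoding.
data = "naïve café".encode("utf-8") + b"\xff"    # valid UTF-8 plus one stray byte

# replace() on raw bytes: no decode step, stray byte passes through untouched.
swapped = data.replace("café".encode("utf-8"), b"tea")
print(swapped)                                   # b'na\xc3\xafve tea\xff'

# split() on an ASCII delimiter is equally safe.
parts = data.split(b" ")
print(parts)                                     # [b'na\xc3\xafve', b'caf\xc3\xa9\xff']
```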

Comment Re:Interoperating with invalid data (Score 1) 196

Well the first thing you need to do to clean up the invalid UTF-8, for instance in filenames, is to detect it.

If reading the filename causes it to immediately throw an exception and dispose of the filename, I think we have a problem. Right now you cannot do this in Python unless you declare it "bytes" and give up on actually looking at the Unicode in the vast majority of filenames that *are* correct.

It is also necessary to pass the incorrect filename to the rename() function, along with the correction. That is impossible with Python 3.0's library, and is probably the more serious problem.

Both of these problems are trivial to fix if Python would just consider arbitrary byte sequences valid values for strings, and defer complaining about incorrect encoding until the string actually needs to be *decoded*, which is really only needed to display it, and sometimes for parsing in the rare cases where non-ASCII has syntactic value and is not just treated as letters.
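Python eventually grew something along these lines in PEP 383's "surrogateescape" error handler; a sketch of how it defuses the filename case (the filename is a made-up example):

```python
raw_name = b"report-\xff.txt"                         # not valid UTF-8
as_str = raw_name.decode("utf-8", "surrogateescape")  # no exception raised
print(repr(as_str))                                   # 'report-\udcff.txt'

# Each bad byte became a lone low surrogate, so the round trip is lossless:
assert as_str.encode("utf-8", "surrogateescape") == raw_name

# And the damage is still detectable at the moment you actually care:
is_clean = all(not 0xDC80 <= ord(c) <= 0xDCFF for c in as_str)
print("valid UTF-8 name?", is_clean)                  # valid UTF-8 name? False
```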

Comment Re:Fuck that guy. (Score 1) 397

I really doubt a majority of people think affirmative action helps Asians. It helps underrepresented minorities, and in most jobs and schools Asians are not underrepresented. It seems incredibly unlikely that 95% of people (whether they approve or disapprove of affirmative action) think it helps Asians.

I suspect you actually made a typo of some sort but am curious what exactly you were trying to say there.

Comment Re: and... (Score 1) 196

I'm arguing against a design that is the equivalent of saying "you can't run cp on this file because it contains invalid XML".

There is nothing wrong with the XML interpreter throwing an error AT THE MOMENT YOU TRY TO READ DATA FROM THE STRING.

There is a serious problem when just saying "this buffer is XML" causes an immediate crash if you put non-XML into it.

Comment Re: and... (Score 1) 196

God damn you people are stupid.

I am trying to PREVENT denial of service bugs. If a program throws an unexpected exception on a byte sequence that it is doing nothing with except reading into a buffer, then that is a denial of service. If you really think that invalid UTF-8 can lead to an exploit, you seem to completely misunderstand how things work. All decoders throw errors when they decode invalid UTF-8, including overlong sequences and all other such patterns. So any code looking at the Unicode code points will still get errors. And if you think there is some exploit that relies on a byte pattern that somehow only works for invalid UTF-8, then you have quite a fantastic imagination but no knowledge of reality.

Comment Re:Interoperating with invalid data (Score 1) 196

The program should produce an error AT THE MOMENT IT TRIES TO EXTRACT A Unicode CODE POINT. Not before, and not after.

If the program reads the invalid string from one file, does not check it, and writes it to another file, I expect, and REQUIRE, that the invalid byte sequence be written to the new file. It should not be considered any more of a problem than the fact that programs don't fix spelling mistakes when copying strings from one place to another.
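That copy-unchanged requirement can be met with the "surrogateescape" handler; a minimal sketch (the payload is a made-up example, and note this handler is not what a plain open() uses by default):

```python
import os
import tempfile

payload = b"good line\nbad \xfe\xff line\n"          # contains invalid UTF-8

src = tempfile.NamedTemporaryFile(delete=False)
src.write(payload)
src.close()

# Reading never raises; each bad byte becomes a lone low surrogate.
with open(src.name, encoding="utf-8", errors="surrogateescape") as f:
    text = f.read()

# Writing restores the original bytes exactly.
dst = src.name + ".copy"
with open(dst, "w", encoding="utf-8", errors="surrogateescape", newline="") as f:
    f.write(text)

with open(dst, "rb") as f:
    assert f.read() == payload                       # byte-for-byte identical

os.unlink(src.name)
os.unlink(dst)
```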

Comment Re: and... (Score 1) 196

The text is 99.9999999% UTF-8.

What I want to do is gracefully handle tiny mistakes in the UTF-8 without having to rewrite every function and every library function it calls to take a "bytes" instead of a "string", and thus completely abandon useful Unicode handling!

Come on, it is blindingly obvious why this is needed, and I cannot figure out why people like you seem to think that physically possible arrangements of bytes will not appear in files. The fact that all serious software cannot use Unicode and has to resort to byte twiddling should be a clue, you know.

Comment Re: and... (Score 2) 196

No, all that means is that EVERYTHING has to be changed to use the bytes type.

I mean every single library function that takes a unicode string, every use of ParseTuple that translates to a string, etc. Pretty much the entire Python library must be rewritten, or a wrapper added around every function that takes a string argument.

Everybody saying that "it's good to catch the error earlier" obviously has ZERO experience programming. Let's see, would it be a good idea if attempting to read a text file failed if there was a spelling error? Or perhaps it might be a good idea to defer this problem until it actually makes a difference?

This crazy belief that somehow some physically possible patterns of bytes will just magically not happen because you said they are "invalid" is inexplicable. No system other than UTF-8 seems to cause this weird brain damage; no other system is so totally unprepared for invalid storage and pretends that all storage will be valid. I cannot explain it, except that it seems like exposure to ASCII, where all byte sequences are always valid, has rotted people's minds so that they dismiss the problem.
 

Comment Re: and... (Score 3, Informative) 196

This exactly.

If your UTF-8 string is not completely valid, Python 3 barfs in useless and unpredictable ways. This is not a problem with Python 2.x.

Until they fix the string so that an arbitrary sequence of bytes can be put into it and pulled out *UNCHANGED* without it throwing an exception, it cannot be used for any serious work. Bonus points if this is actually efficient (i.e. it is done by storing the bytes with a block copy).

Furthermore it would help if "\xNN" produced raw byte values rather than the UTF-8 encoding of "\u00NN", which I can already get by typing (gasp!) "\u00NN".
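For reference, the asymmetry being complained about (this is standard Python behavior, shown for illustration):

```python
s = "\xe9"                    # in a str literal, \xNN means code point U+00NN (é)
b = b"\xe9"                   # in a bytes literal, \xNN is the raw byte 0xNN

print(s.encode("utf-8"))      # b'\xc3\xa9' -- two bytes, the UTF-8 encoding
print(len(b), hex(b[0]))      # 1 0xe9    -- one raw byte
assert s == "\u00e9"          # the two str spellings are the same character
```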

Comment Re:Been terrific for me and my employees (Score 1) 578

Yeah, I would agree that it seems more fair if the company instead made a 50/50 split, so the employee is now paying $100 and the company another $100. The main reason this seems fair is that I'll bet that if the cost went *up* they would not eat all the extra, but would have split the higher cost so the employee paid more.
