spitzak - Slashdot User

Comment Re:Schneier got it right a decade and a half ago (Score 1) 119

by spitzak on Sunday March 22, 2015 @08:19PM (#49316223) Attached to: OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab

Yes, Java and Python (3) and Qt all are causing enormous difficulties as they followed Microsoft down the fantasy road and thought you had to convert strings on input to "unicode" or somehow it was impossible to use them. Since not all 8-bit strings can be converted, there must either be a lossy conversion or an error, neither of which is expected, especially if the software is intended to copy data from one point to another without change.

The original poster is correct in saying "stay away from Unicode". This does not mean that Unicode is impossible. It means "treat it as a stream of bytes". Do not try to figure out what Unicode code points are there unless you really really have a reason to. And you will be surprised how little you need to figure this out. In particular you can search for arbitrary regexps (including sets of Unicode code points) with a byte-based regexp interpreter. And you can search for ASCII characters with trivial code.
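To make the "stream of bytes" point concrete, here is a minimal sketch in C++. The function names find_bytes and count_ascii are made up for illustration. It assumes both the text and the search pattern are valid UTF-8, and it never decodes a single code point.

```cpp
#include <cstddef>
#include <string>

// Byte offset of the first occurrence of `needle` in `haystack`, or
// std::string::npos. Both are treated as opaque bytes. If both are valid
// UTF-8, a match can only start on a code point boundary, because UTF-8
// is self-synchronizing.
std::size_t find_bytes(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}

// Count occurrences of one ASCII character. Bytes below 0x80 never appear
// inside a multi-byte UTF-8 sequence, so a plain byte compare is enough.
std::size_t count_ascii(const std::string& text, char ascii) {
    std::size_t n = 0;
    for (char c : text) {
        if (c == ascii) ++n;
    }
    return n;
}
```

The same reasoning is what lets a byte-based regexp engine match sets of code points: each code point is just a fixed byte sequence in the pattern.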

Comment Re:Type "bush hid the facts" into Notepad. (Score 1) 119

by spitzak on Sunday March 22, 2015 @08:13PM (#49316185) Attached to: OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab

Actually Plan 9 and UTF-8 encoding existed well before Microsoft started adding Unicode to Windows.

The reason for 16-bit Unicode was political correctness. It was considered wrong that Americans got the "better" shorter 1-byte encodings for their letters, therefore any solution that did not punish those evil Americans by making them rewrite their software was not going to be accepted. No programmer at that time (including ones that did not speak English) would ever argue for using anything other than a variable-length byte encoding for a system that still had to deal with existing software and data that was ASCII. This was a command from people who did not have to write and maintain the software.

The programmers, who knew damn well that variable-length was the correct solution, were unfortunately not bright enough to avoid making mistakes in their encodings (such as not making them self-synchronizing). UTF-8 fixed that, but these errors also led some of the less-knowledgeable to think there was a problem with variable length.

Unfortunately political correctness at Microsoft won, despite the fact that they had already added variable-length encoding support to Windows. It may also have been seen as a way to force incompatibility with NFS and other networked data so that Microsoft-only servers could be used.

One of the few good things to come out of the "Unix wars" was that commercial Unix development was stopped before the blight of 16-bit characters was introduced (it was well on its way and would have appeared at the same time Microsoft did it). Non-commercial Unix made the incredibly easy decision to ignore "wide characters".

The biggest problem now is that Windows convinced a lot of people who should know better that you need to use UTF-16 to open files by name (all that is really needed is to convert UTF-8 just before the api is called). This led UTF-16 to infect Python, Qt, Java, and a lot of other software and cause problems and headaches and bugs even on Linux. There is some hope that they are starting to realize they made a terrible mistake; Python in particular seems to be backing out by storing a UTF-8 version of the string alongside the UTF-32.
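A rough sketch of "convert just before the api is called", in portable C++. The name utf8_to_utf16 is hypothetical, and replacing each malformed byte with U+FFFD is my assumption about error handling, not something the API demands. On Windows you would more likely call MultiByteToWideChar with CP_UTF8; this hand-rolled version just shows how little is involved.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Convert UTF-8 bytes to UTF-16 code units. Malformed input is not an
// error here: each offending byte becomes U+FFFD (an assumption of this
// sketch) and decoding resumes at the next byte.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::u16string out;
    const std::size_t n = utf8.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t extra = 0;              // continuation bytes expected
        std::uint32_t cp = 0, min_cp = 0;   // code point and its minimum value
        bool ok = true;
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else                            { ok = false; }
        if (ok && extra > n - i - 1) ok = false;  // truncated sequence
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (ok && (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) ok = false;
        if (!ok) { out.push_back(char16_t(0xFFFD)); ++i; continue; }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const std::uint32_t v = cp - 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

The rest of the program keeps the file name as the original bytes and only calls this at the boundary, so nothing is lost for code that merely copies names around.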

Comment Re: novice programmer alert! (Score 1) 119

by spitzak on Sunday March 22, 2015 @07:42PM (#49316061) Attached to: OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab

The big downside of UTF-8 is using it as an in-memory string. To find the nth character and you have to start at the beginning of the string.

And this is important, why? Can you come up with an example where you actually produce "n" by doing anything other than looking at the n-1 characters before it in the string? No, and therefore an offset in bytes can be used just as easily.

C# and Java use UTF16 internally for strings.

And you are aware that UTF-16 is variable-length as well, and therefore you can't "find the nth character" quickly either?
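A small C++ sketch of why indexing does not help in UTF-16 either. The function names are made up, and the code assumes well-formed UTF-16. Finding the nth code point still means walking from the start, because anything outside the BMP takes two code units.

```cpp
#include <cstddef>
#include <string>

// Number of code points in well-formed UTF-16: count every code unit
// that is not a low (trailing) surrogate.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}

// Code unit index where the nth code point (0-based) starts. This is a
// linear scan, exactly as it would be for UTF-8.
std::size_t unit_index_of_code_point(const std::u16string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size() && n > 0) {
        const bool lead = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        i += (lead && i + 1 < s.size()) ? 2 : 1;
        --n;
    }
    return i;
}
```

With the string u"a\U0001F600b" there are four code units but three code points, so s[2] is the second half of a surrogate pair and not the third character.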

You might want to retake compsci 101.

Comment Re:Type "bush hid the facts" into Notepad. (Score 1) 119

by spitzak on Sunday March 22, 2015 @07:37PM (#49316027) Attached to: OS X Users: 13 Characters of Assyrian Can Crash Your Chrome Tab

Maybe you're willing to accept that ambiguity, and use the rule, "If the file looks like valid UTF-8, then use UTF-8; otherwise use

Yay! You actually got the answer partially correct. However you then badly stumble when you follow this up with:

8-bit ANSI, but under no circumstances UTF-16

The correct answer is "after knowing it is not UTF-8, use your complicated and error-prone encoding detectors".

The problem is a whole lot of stupid code, in particular from Windows programmers, basically tries all kinds of matching against various legacy encodings and UTF-16, and only tries UTF-8 if all of those return false. This is why Unicode support still sucks everywhere.

You try UTF-8 FIRST. This is for two reasons: first because UTF-8 is really popular and thus likely the correct solution (especially if you count all ASCII files as UTF-8, which they are). But the second is that a random byte stream is INCREDIBLY unlikely to be valid UTF-8 (like 2.6% chance for a two-byte file, and geometrically lower for any longer ones). This means your decision of "is this UTF-8" is very very likely to be correct. Just moving this really reliable test to be the first one will improve your detection enormously.
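The "is this UTF-8" test is short enough to write out. This is a sketch of a strict validator in C++; the name is_valid_utf8 is hypothetical, and the strictness rules (no overlong forms, no surrogates, nothing above U+10FFFF) are my reading of what "valid" should mean here.

```cpp
#include <cstddef>
#include <string>

// True only if `data` is well-formed UTF-8. Pure ASCII passes, since
// every ASCII file is also a UTF-8 file.
bool is_valid_utf8(const std::string& data) {
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t extra;      // continuation bytes expected
        unsigned long cp;       // code point being assembled
        unsigned long min_cp;   // smallest value allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;                       // stray continuation or 0xF8..0xFF
        if (extra > n - i - 1) return false;     // sequence runs off the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        i += extra + 1;
    }
    return true;
}
```

A detector would call this before anything else and only fall through to the legacy-encoding guessers when it returns false.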

The biggest help would be to check for UTF-8 first, not last. This would fix "Bush hid the facts" because it would be identified as UTF-8. But a variation on that bug would still exist if you stuck a non-ASCII byte in there, in which case it would still be useful (but much much less important) to not do stupid things in the detector. For instance, requiring UTF-16 to either start with a BOM or to have at least one word with either the high or low byte all zero would be a good idea and indicate you are not an idiot.
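The UTF-16 sanity rule suggested above fits in a few lines of C++. This is only a sketch of that one rule, with a made-up function name, not a full detector: it accepts a buffer as possibly UTF-16 only if it starts with a BOM in either byte order or contains at least one 16-bit word with a zero byte.

```cpp
#include <cstddef>
#include <string>

// Cheap plausibility test for UTF-16. Returns true if the buffer starts
// with a byte order mark (FF FE or FE FF), or if at least one aligned
// 16-bit word has its high or low byte equal to zero.
bool plausibly_utf16(const std::string& bytes) {
    const std::size_t n = bytes.size();
    if (n >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if ((b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF)) return true;
    }
    for (std::size_t i = 0; i + 1 < n; i += 2) {
        if (bytes[i] == '\0' || bytes[i + 1] == '\0') return true;
    }
    return false;
}
```

The 18 ASCII bytes of "Bush hid the facts" have no BOM and no zero bytes, so this check rejects them as UTF-16 even if the UTF-8 test were somehow skipped.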

Comment Re:Write-only code. (Score 1) 757

by spitzak on Monday March 16, 2015 @01:04PM (#49268321) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

I have no idea why you are arguing but saying EXACTLY the same things I am.

I am not saying to make a and b into unique pointers to a copy. I am saying "a and b ARE LOCAL VARIABLES!!!!!" They will be copied to make the lambda, it is NOT POSSIBLE to avoid this!!!! The function can return before the lambda is destroyed. And you seem to think "constructing on the stack" does not involve a copy of a and b, which is wrong. You do mention the "move" which does do another copy (though move semantics could cause a more-efficient version but it is not zero). Actually the lambda data structure is created on the heap because this is more efficient than the move.

The rest of my comments were about how the C++ compiler will actually do better than your attempts at premature optimization by forcing a and b to be on the heap. There will be only a single "shared pointer" to the lambda object, not one to a and another to b. Also what boost calls an "intrusive ptr" will be used, avoiding a lot of overhead of std::shared_ptr. And as my C code shows, it is possible to avoid multiple references to the lambda object, thus a unique_ptr could be used, though I believe this will require the optimizer to have access to the implementation of the thread constructor so it knows the lambda is not copied.

Comment Re:Write-only code. (Score 1) 757

by spitzak on Friday March 13, 2015 @01:02PM (#49250853) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

Above AC is an excellent example of the problems with C++. He has quite a few misconceptions.

a = std::make_shared<T>(x) does make a local shared pointer, but not the data itself, which is allocated on the heap.

The lambda absolutely does use the equivalent of a unique ptr. There is a block of memory allocated and a and b are copied to it (this block also contains a pointer to the actual code, which in the example will be something to further copy or move a and b to the stack and call the do_something function). This is the copy that is unavoidable. This block is freed when the pointer goes out of scope. Since it is passed by value move semantics mean that there is never more than one pointer, so it is conceivable that the optimizer will do a unique ptr to it (though it is likely that something more like a shared ptr is done, or the boost intrusive_ptr).

Yes you can force it to use std::move but this should be an automatic optimization. It is nonsense that I have to type that. But even a move is much less efficient than direct use. The block is freed when no longer needed by the execution of the lambda (in the parallel thread).

I do not want to use a and b in the parent thread. That is the whole point. Way to get completely confused there!

Comment Re: Write-only code. (Score 1) 757

by spitzak on Thursday March 12, 2015 @07:10PM (#49245833) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

C compatibility could be preserved by passing any POD and any structure containing only POD (no member functions and no private or protected data) always by copying. Since any more complex structure or class cannot be part of the C api it should not break compatibility if they were passed by const reference.

Comment Re:Write-only code. (Score 1) 757

by spitzak on Thursday March 12, 2015 @07:07PM (#49245807) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

You mean the caller has to do something like this (not sure of the syntax, and I think the init-capture form requires C++14)?

std::thread([a = std::move(a), b = std::move(b)](){do_something(a,b);}).detach();

Not sure if that is a good advertisement for C++.

It would be nice if this happened automatically when possible, but apparently for complex language rule reasons it cannot. The following code must make a copy of A:

void f() {
ComplexThing A(FunctionReturningComplexThing()); // move
DoSomething(A); // the copy is here
}

While this code, which seems like it should be what the above optimizes to, will only use move:

void f() {
DoSomething(FunctionReturningComplexThing());
}

That is annoying and the fact that such optimizations are not allowed is a good sign that there are problems with the design of C++.
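To see the difference for yourself, here is a self-contained sketch that counts copies and moves. ComplexThing below is a stand-in I made up with instrumented constructors; only the copy counts are deterministic, since how many moves get elided depends on the compiler and language version.

```cpp
#include <utility>

// Stand-in type that records how often it is copied or moved.
struct ComplexThing {
    static int copies;
    static int moves;
    ComplexThing() = default;
    ComplexThing(const ComplexThing&) { ++copies; }
    ComplexThing(ComplexThing&&) noexcept { ++moves; }
};
int ComplexThing::copies = 0;
int ComplexThing::moves = 0;

ComplexThing FunctionReturningComplexThing() { return ComplexThing(); }
void DoSomething(ComplexThing) {}  // takes its argument by value

// A is a named local (an lvalue), so passing it by value must copy it.
void f_named() {
    ComplexThing A(FunctionReturningComplexThing());
    DoSomething(A);
}

// Spelling out std::move turns that copy into a move.
void f_moved() {
    ComplexThing A(FunctionReturningComplexThing());
    DoSomething(std::move(A));
}

// Passing the temporary directly never copies.
void f_temporary() {
    DoSomething(FunctionReturningComplexThing());
}
```

Calling f_named() bumps ComplexThing::copies by one, while f_moved() and f_temporary() leave it untouched, even though A is never used again in any of them.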

Comment Re:Write-only code. (Score 1) 757

by spitzak on Thursday March 12, 2015 @06:55PM (#49245711) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

That will not work out well if a and b are local variables. You will have to make a copy in order to make the std::shared_ptr point at them, so just as many copies are done as before (the second copy is when *a is copied to the argument to do_something, and, as before, can be avoided by making do_something take a const reference).

The basic lambda [=] syntax will work better. First it makes only one pointer to a sort of box containing the copies of both a and b, rather than two pointers. Also it uses something much more like std::unique_ptr which is much more efficient.

Comment Re:C++ is hard (Score 1) 757

by spitzak on Wednesday March 11, 2015 @02:24PM (#49235589) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

I think you are right that placement new could be used to get a block of memory filled with the object without using malloc and without double-indirection when using it. It looks like every method on that static object has to be copied to the dummy object, so I'm not sure if that is a good selling point for C++.

What I was thinking of was some keyword added to the static that causes no change in any code except the destructor is not called. An idea I had was to use '&' without constructor args after it:

static Foo&; // uses default constructor
static Foo&(1,2,3); // uses some other constructor

However I am rather worried that this may collide with some existing syntax.

I never heard of a guarantee that statics are destroyed in the opposite order of creation. In fact this seems to be completely false in cases where a function containing a static variable is first run in a parallel thread. Wrapping statics in functions is useful to guarantee construction order, and I do it all the time, but never used it to control destruction order.
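For anyone who has not seen the trick, this is roughly what wrapping a static in a function looks like. registry and Registrar are hypothetical names. The point is that the function-local static is constructed the first time the function is called, so it is ready even when the caller is the constructor of another static that happens to be initialized earlier.

```cpp
#include <string>
#include <vector>

// The vector is constructed on the first call, not at some unspecified
// point during static initialization.
std::vector<std::string>& registry() {
    static std::vector<std::string> names;
    return names;
}

// A type whose constructor needs the registry to exist already.
struct Registrar {
    explicit Registrar(const std::string& name) { registry().push_back(name); }
};

// Namespace-scope static: its constructor runs before main(), and it is
// safe because registry() builds the vector on demand.
static Registrar register_a("a");
```

This says nothing about the order in which things are torn down at exit, which is the part I would not rely on.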

Even if destruction order could be controlled, it does not fix the real problem where the static object obtains a pointer to an object that was constructed later, generally for caching. An example is an OpenGL resource: you want your destructor to release the resource but that will crash if the OpenGL context has been destroyed. Adding an if statement to the destructor that is only true when your program is exiting is pretty distasteful.

Comment Re:Write-only code. (Score 1) 757

by spitzak on Wednesday March 11, 2015 @02:07PM (#49235437) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

No, the lambda must not take the arguments by reference. This is because the original values can be destroyed before the lambda is run, invalidating the references.

do_something can take them by reference, the lambda calls it and the lambda is not destroyed until after it returns.
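Here is a sketch of that lifetime rule. To keep it free of threading setup it uses std::function as a stand-in for the thread: the closure is returned and run after the locals are gone, which is the same situation as a detached thread running late. make_task is a made-up name.

```cpp
#include <functional>
#include <string>

// do_something takes const references. That is safe because the closure
// that owns the copies stays alive for the whole call.
std::string do_something(const std::string& a, const std::string& b) {
    return a + b;
}

// Builds a task that outlives this function's locals.
std::function<std::string()> make_task() {
    std::string a = "hello, ";
    std::string b = "world";
    // [=] copies a and b into the closure. Capturing with [&] here would
    // leave the closure holding references to destroyed locals.
    return [=]() { return do_something(a, b); };
}
```

Calling make_task()() works because the closure carries its own a and b. The only copies made are the ones into the closure, and do_something reads them in place.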

I think the job of figuring out which is more efficient should belong to the compiler, but this would require C++ to be redefined such that all arguments are possibly const references (i.e. a function cannot modify its own arguments, or perhaps modification forces a copy inside the function).

Comment Re:Ahhhh, C++ (Score 1) 757

by spitzak on Wednesday March 11, 2015 @01:57PM (#49235353) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

Thanks for your comments about UTF-16 and Windows. You are correct that wrapper functions are about the only solution on Windows. Microsoft refuses to make the multibyte api accept UTF-8.

Comment Re:Ahhhh, C++ (Score 1) 757

by spitzak on Wednesday March 11, 2015 @01:56PM (#49235335) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

auto copy = string{mystring}.replace("from", "to"); and move semantics avoids the extra copy.

The string constructor does an unnecessary copy. You are correct that move semantics avoids yet another copy from the result to "copy". I have not found any way to do this except by having two different functions, one which does an in-place modification and another that returns a newly constructed string. This produces questions about how to name them, as only one of them gets the "good" name.
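The two-function arrangement looks something like this sketch. Both names are invented (which rather proves the naming problem), and the search-and-replace loop is my own plain implementation on top of std::string::find and std::string::replace.

```cpp
#include <cstddef>
#include <string>

// In-place version: modifies the caller's string, no new string is made.
void replace_all_in_place(std::string& s, const std::string& from,
                          const std::string& to) {
    if (from.empty()) return;  // avoid an infinite loop on an empty pattern
    std::size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();  // skip past the inserted text
    }
}

// Value-returning version: takes its argument by value, so an lvalue
// argument is copied once and a temporary is moved in. The parameter is
// then moved out on return.
std::string replaced_all(std::string s, const std::string& from,
                         const std::string& to) {
    replace_all_in_place(s, from, to);
    return s;
}
```

Callers who want a modified copy write replaced_all(mystring, "from", "to") and pay for exactly one copy; callers who own the string use the in-place one and pay for none.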

I have seen attempts to make a "modstring" subclass where the methods happen in-place. Not sure if this is a great idea.

I suspect disagreement about how to do this is why all these useful functions have not been added to strings.

In any case I also apologize, your style of in-place modification would not prevent reference-counting implementations. The problem is operator[] only. If in fact you changed characters with a string.replace_char(n,c) it could be done with reference counted strings.
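A sketch of what I mean, as a toy reference-counted string in C++. SharedString is a made-up class and this is single-threaded only. Because every write goes through replace_char, the string knows when to detach from other owners. A mutable operator[] cannot do that, since it hands out a reference and never learns whether the caller writes through it.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

class SharedString {
public:
    explicit SharedString(std::string text)
        : data_(std::make_shared<std::string>(std::move(text))) {}

    // Reads never need to copy.
    char at(std::size_t n) const { return (*data_)[n]; }
    const std::string& str() const { return *data_; }

    // Writes copy the buffer first if anyone else is sharing it.
    void replace_char(std::size_t n, char c) {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::string>(*data_);
        }
        (*data_)[n] = c;
    }

private:
    std::shared_ptr<std::string> data_;  // buffer shared between copies
};
```

Copying a SharedString just bumps the reference count; the character data is duplicated only when one of the copies is actually modified.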

Comment Re: Write-only code. (Score 2) 757

by spitzak on Wednesday March 11, 2015 @01:43PM (#49235197) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

His code is constructing a lambda as a local value, not copying it anywhere, and directly calling it, then destroying it. When the call is done every detail of the lambda is known precisely, and this can be optimized (apparently g++ does so, too).

The original post constructs a lambda and passes it by value to the thread constructor and then exits before the lambda is used. This requires a and b to be copied. Later the forked thread executes the lambda. It is highly unlikely the locations of a and b in the lambda structure are in the memory location that the lambda function looks for them, so they must be copied (and memmove is explicitly not allowed by C++, you must use the copy or move constructor).

The fix is to make do_something take the arguments as const references, which are really pointers, and then the lambda caller can just make the pointers point at where a and b are in the lambda structure.

C++ would have been helped considerably if all arguments to functions were const references (with the compiler allowed to choose whether to copy or make a reference depending on which is more efficient). You could use volatile to make a non-const reference if needed (though most code I have seen uses a pointer for out parameters). This apparently would break too much code, but is by far the biggest source of unexpected inefficiency in C++ and really should be fixed rather than having the code writers decide whether a copy or reference is faster.

Comment Re:Write-only code. (Score 1) 757

by spitzak on Wednesday March 11, 2015 @01:34PM (#49235125) Attached to: Was Linus Torvalds Right About C++ Being So Wrong?

You are correct that the lamda_args object is leaked.
