The slide for UTF-16 clearly says that UTF-16 is the result of "encoding", not "decoding":
Actually, you are correct. Sorry, you confused me a bit. Go back to what I said earlier: a string is conceptually a sequence of Unicode codepoints, and to turn it into a filename you must encode it in a particular encoding. My greater point stands, namely that the Python terminology is completely consistent: you always encode from a string to an encoding, and you always decode from an encoding to a string. I apologize for the mistake.
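To make the terminology concrete, here is a minimal sketch (plain CPython 3, nothing else assumed):

    # str -> bytes is always encode(); bytes -> str is always decode().
    name = "caf\u00e9"                      # a string: a sequence of Unicode codepoints
    as_utf8 = name.encode("utf-8")          # encode: str -> bytes
    as_utf16 = name.encode("utf-16-le")     # same string, different encoding
    print(as_utf8)                          # b'caf\xc3\xa9'
    print(as_utf16)                         # b'c\x00a\x00f\x00\xe9\x00'
    assert as_utf8.decode("utf-8") == name  # decode: bytes -> str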
Also, I did experiments, and the new encoding cannot produce unpaired surrogates; therefore it cannot produce all possible NTFS filenames.
Show me the code. If you really have found a problem, show me how to reproduce it.
Let me remind you that Python lets you write UTF-16 code units directly in string literals (as \uXXXX escapes), which would avoid this issue; let me also remind you that Python lets you choose the error handler on an encode, so if you somehow got a Unicode string containing surrogate characters that you need left alone, you can simply tell the encode not to use surrogateescape and pick a different handler instead.
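For instance, a sketch (this assumes Python 3.4 or later, where the surrogatepass handler works with the UTF-16 codecs):

    lone = "\ud800"                  # an unpaired surrogate: a perfectly legal str
    try:
        lone.encode("utf-16-le")     # the default 'strict' handler refuses it
    except UnicodeEncodeError as err:
        print(err)
    raw = lone.encode("utf-16-le", "surrogatepass")
    print(raw)                       # b'\x00\xd8' -- an unpaired surrogate in UTF-16
    assert raw.decode("utf-16-le", "surrogatepass") == lone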
I know a lot of people like to put filenames in text files. It is kind of useful; in fact, it is supported directly by Python when you write a filename in quotes in a .py file! Yet they have made it impossible to place all possible UTF-8 filenames in a Python script unless a bytes API is used and the programmer writes the UTF-8 code units individually as \xNN sequences, making it unreadable.
Frankly, I don't care. First you said Python is broken because it's possible to make a filename with illegal characters in it; I pointed out that with os.listdir() this case Just Works. Now you say that Python is broken because when you have a filename with illegal characters in it, the only way to write a literal in a program is to use hex escapes. If I have a filename with illegal characters in it, I'm just going to write the hex escapes; I don't have a problem with this. In other news, Python programs are slower than hand-crafted C programs. Python is really good at a lot of stuff, but not perfectly optimal at everything. You have identified a corner case where you must use ugly hex escapes in a filename literal. Okay, if that's a deal-breaker for you, don't use Python.
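For the record, here is roughly the experiment I mean, on a POSIX box (the temp directory and the b"\xff\xfe" name are my own test data, not anything from the Python docs):

    import os, tempfile

    d = tempfile.mkdtemp()
    # Create a file whose name is NOT valid UTF-8 (legal on POSIX).
    open(os.path.join(os.fsencode(d), b"\xff\xfe-bad"), "wb").close()

    for name in os.listdir(d):       # str path in, str names out
        print(repr(name))            # '\udcff\udcfe-bad' via surrogateescape (PEP 383)
        with open(os.path.join(d, name), "rb"):
            pass                     # the surrogates re-encode to the same bytes;
                                     # the round trip opens the same file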
Your suggested solutions are just like all the other ones: basically never use Unicode at all in your Python program and use byte arrays everywhere.
Actually, no. You said that you wanted the ability to hold onto raw UTF-8 and pass it around, and I pointed out that Python lets you do that. I just want to use the provided Python API functions, which Just Work as far as I can see.
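Concretely, the bytes-level API is right there when you want it. A sketch, again on POSIX, with a made-up b"\xff" name of my own:

    import os

    for raw in os.listdir(b"."):     # bytes path in, raw bytes out: no decoding at all
        print(raw)

    weird = b"report-\xff.txt"       # not valid UTF-8, but expressible as a literal
    open(weird, "wb").close()

    # os.fsdecode()/os.fsencode() bridge the two worlds using the
    # filesystem encoding plus surrogateescape, and they round-trip.
    assert os.fsencode(os.fsdecode(weird)) == weird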
And for my own programs, so far all the filename literals have been boring, with no illegal characters in them. I've never had to write a Python program where UTF-8 wasn't adequate for all my filename literals.
Also, we are back to the stupid programmer problem:
This is the underlying problem: the current behavior encourages ASCII-only use and is effectively destroying attempts to migrate to Unicode. They need to make it easy to write a reliable program that uses UTF-8 and UTF-16, which means it must not do something unexpected when a physically possible byte pattern turns up in the data.
You keep saying these things, but I haven't seen any evidence.
And if you really are smarter than all the "incredible morons" working on Python, please contribute your insights on a Python mailing list, rather than just flaming here.
I don't think I'm convincing you of anything, and frankly I'm not the world's greatest expert on Python stuff, so I think I'm done with this thread. If you really care about convincing me that Python is broken, please show me the code. Thank you and have a nice day.