This is still wrong in that you have to pass the special "surrogateescape" and use encode/decode.
In the context of handling filenames, you get this by default. As I said, I used os.listdir() and a file whose name contained a byte sequence invalid in UTF-8 showed up in the results, with surrogate escape codes standing in for the illegal bytes; I was able to open it, rename it, and delete it (I tested all three).
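A minimal sketch of what the surrogate escape mechanism does under the hood (the 0xFF byte and the filename are just arbitrary examples; 0xFF can never appear in valid UTF-8):

```python
# A byte string that is not valid UTF-8 (0xff is illegal in UTF-8).
raw = b"report-\xff.txt"

# With the "surrogateescape" error handler, the bad byte becomes the
# lone surrogate U+DCFF instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", errors="surrogateescape")
assert name == "report-\udcff.txt"

# Encoding back with the same handler restores the original bytes exactly,
# which is why open()/rename()/unlink() can round-trip such filenames.
assert name.encode("utf-8", errors="surrogateescape") == raw
```

This round-trip property is what lets the filename APIs hand you a str and still operate on the exact bytes the filesystem gave them.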
In short, filenames Just Work in Python 3, despite your claims.
I want to be able to store a "Unicode" string that contains a UTF-8 error without an exception being thrown. The exception would be thrown if you attempt to translate the string to UTF-16 or look at the code points (though I also recommend there be ways to avoid the exception).
If you are reading UTF-8 text from a file, you don't get the surrogate escapes by default; instead, a decoding error raises an exception, which you can handle. But it is a simple matter to request the surrogateescape handler, and then you can easily filter the resulting string for the surrogate escape characters. You may disagree with the default behavior in Python 3.x, but I don't think you can claim that it is broken or insane.
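Both behaviors are easy to demonstrate; here is a sketch using a temporary file containing one illegal byte (0xFE, chosen arbitrarily for illustration):

```python
import os
import tempfile

# Write a file containing one byte (0xfe) that is invalid in UTF-8.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"good text\xfebad byte")

# Default behavior: strict decoding raises an exception you can handle.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# Opt in to surrogateescape, then scan for the escape characters,
# which always land in the range U+DC80..U+DCFF.
with open(path, encoding="utf-8", errors="surrogateescape") as f:
    text = f.read()
bad = [ch for ch in text if "\udc80" <= ch <= "\udcff"]
assert bad == ["\udcfe"]

os.unlink(path)
```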
And! I didn't realize this until now, but Python 3 also allows you to use a "bytes" object to hold a raw filename. You can convert a Unicode string representing a directory name to bytes (using the str.encode() method function) and then pass the bytes object to os.listdir(), and the resulting list of filenames will be bytes objects containing the raw, undecoded filename bytes. I believe this is exactly what you said you wanted. (So are the Python guys still "incredible morons"?)
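A short sketch of the str-in/str-out versus bytes-in/bytes-out behavior of os.listdir() (the scratch directory and "data.txt" filename are made up for the example; os.fsencode() is a convenience that encodes with the filesystem encoding):

```python
import os
import shutil
import tempfile

# Hypothetical setup: a scratch directory containing one file.
tmpdir = tempfile.mkdtemp()
open(os.path.join(tmpdir, "data.txt"), "w").close()

# Pass a str path: os.listdir() returns decoded str names.
assert os.listdir(tmpdir) == ["data.txt"]

# Pass a bytes path: os.listdir() returns the raw, undecoded bytes names.
entries = os.listdir(os.fsencode(tmpdir))
assert entries == [b"data.txt"]

shutil.rmtree(tmpdir)
```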
Their problem is that they seem to think that UTF-16 (or perhaps UTF-32) is somehow "decoded" and UTF-8 is "encoded", while in fact it is the opposite, and they seem to be thrashing around trying to hide the fact they got it wrong with this filesystem stuff.
And your problem is that you haven't studied what Python does or why it does it, yet you write long rants about how wrong it is. (See, I can be all judgmental too.)
In Python 3.x, the concept is "all strings are Unicode". This means that from a Python user's point of view, a string is a sequence of Unicode code points, with an associated set of method functions. All else is implementation details. So, if you are reading a file that contains UTF-8, Python must decode the UTF-8 encoded bytes into Unicode and make the string. If you are writing a file that should be encoded as UTF-8, Python must encode the Unicode characters into UTF-8. Despite your claims, Python is completely consistent: converting from any encoding (UTF-8, UTF-16, UTF-32, Latin-1, etc.) to a Unicode string is called "decoding" and converting from a string to any encoding is called "encoding". See the above-linked Unicode HOWTO document.
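The consistency is easy to check at the interpreter; str.encode() goes to bytes in any target encoding, and bytes.decode() comes back from any source encoding:

```python
s = "café"  # a Python 3 str: a sequence of Unicode code points

# str -> bytes is always "encoding", whatever the target encoding:
assert s.encode("utf-8") == b"caf\xc3\xa9"
assert s.encode("latin-1") == b"caf\xe9"
assert s.encode("utf-16-le") == b"c\x00a\x00f\x00\xe9\x00"

# bytes -> str is always "decoding", whatever the source encoding:
assert b"caf\xc3\xa9".decode("utf-8") == s
assert b"caf\xe9".decode("latin-1") == s
assert b"c\x00a\x00f\x00\xe9\x00".decode("utf-16-le") == s
```

Note that UTF-16 gets no special treatment here: it is just another encoding you decode from or encode to.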
You keep saying they "got it wrong," but I actually tested it and it Just Worked for me, so it doesn't look wrong to me.
On Unix at least a filename is a stream of bytes, and changing the "locale" should not change what file it identifies.
If you just use the Python tools for managing files, they will Just Work. If you override the Python tools and tell them to decode with the wrong codec, you will get a bad result. This is a problem because... why, exactly? Would you also say that Python "got it wrong" because if you read a UTF-8 file but tell Python to use the Latin-1 codec it won't work right?
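To make the analogy concrete, here is the mojibake you get when you force the wrong codec on perfectly good UTF-8 data ("café" is just an example string):

```python
data = "café".encode("utf-8")  # b'caf\xc3\xa9'

# Decode with the right codec: correct result.
assert data.decode("utf-8") == "café"

# Tell Python the wrong codec and you get mojibake --
# a user error, not a flaw in the language:
assert data.decode("latin-1") == "cafÃ©"
```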
Even on Windows with NTFS a filename is UTF-16, which is not "decoded" in their terminology,
No, really, it is "decoded" in their terminology.