I meant code point, not code unit, i.e. what you are calling a "character". I typed the wrong thing there, which did not help. You are correct that people thinking they can work in code points rather than bytes (or 16-bit words for UTF-16) is a huge problem, and it is why Unicode is not working yet. I consider anybody who thinks Unicode requires code units larger than 8 bits to be in this category. A further problem is that a lot of people think code points are "characters", which is actually an undefined entity in Unicode.
More carefully, and spelling out the incorrect assumptions in full: there are people who think that a regex of "<character>*" means that the character should be repeated 0 or more times, when in fact it should mean that the last code unit of the character is repeated 0 or more times. That assumption may seem obviously wrong in UTF-8, but it is equally wrong even in UTF-32 (because of combining characters). Actually fixing this would require regex to understand the entire Unicode definition, which is hugely complex and changes over time. It would also have the perverse effect that you could no longer use regex to accurately manage Unicode encodings, since you could no longer deal with the code units themselves.
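To make that concrete, here is a rough sketch of how the quantifier binds. The helper names are made up for illustration, and wchar_t stands in for UTF-32 because the standard only provides regex traits for char and wchar_t. Both code points used are in the BMP, so it should behave the same where wchar_t is 16 bits.

```cpp
#include <regex>
#include <string>

// U+00E9 (e with acute) in UTF-8 is two code units (bytes): 0xC3 0xA9.
const std::string kEAcuteUtf8 = "\xC3\xA9";

// 'e' followed by U+0301 COMBINING ACUTE ACCENT:
// one visible "character", but two code points.
const std::wstring kEAcuteDecomposed = {L'e', wchar_t(0x0301)};

// Hypothetical helper: does the whole byte string match the pattern?
bool matches_bytes(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Hypothetical helper: the same check for wide strings.
bool matches_wide(const std::wstring& text, const std::wstring& pattern) {
    return std::regex_match(text, std::wregex(pattern));
}
```

With these, the pattern kEAcuteUtf8 + "*" does not match two copies of kEAcuteUtf8, but it does match kEAcuteUtf8 followed by a stray 0xA9 byte, because only the last byte is repeated. The wide case behaves the same way: kEAcuteDecomposed + L"*" repeats only the combining accent, not the whole "e plus accent" pair.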
The regex "(<character>)*" does what is wanted for all representations and allows the user to decide exactly what a "character" is. I don't think the burden of putting two parentheses in there is that bad, really.
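A minimal sketch of the grouped form, assuming plain std::regex over UTF-8 bytes. The zero_or_more helper is hypothetical, and it assumes the sequence passed in contains no regex metacharacters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: wrap a sequence of code units in a group and
// repeat the whole group. The caller decides what counts as one
// "character" by choosing which code units go inside the parentheses.
// Assumes unit_sequence contains no regex metacharacters.
std::regex zero_or_more(const std::string& unit_sequence) {
    return std::regex("(" + unit_sequence + ")*");
}
```

For example, zero_or_more("\xC3\xA9") matches any number of precomposed U+00E9 in UTF-8, while zero_or_more("e\xCC\x81") treats 'e' plus the combining accent U+0301 (bytes 0xCC 0x81) as the repeated unit. The regex engine needs no Unicode knowledge for either.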
Therefore the C++11 regex is doing the correct thing: it is the simplest possible design that still performs only valid operations.
(I think it *may* be useful to have the regex ranges understand UTF-8, provided it is always possible to rearrange your ranges so that they do not trigger the UTF-8 matching and instead match ranges of code units. This has to be VERY carefully decided on, and the rules for how it matches UTF-8 have to be very well defined and must never change, even if Unicode changes the UTF-8 rules.)