"No you are wrong."
Pretty sure I'm not. We could just go back and forth claiming that, but let's go over this:
Here's what you said:
"Go X bytes into the string. If that byte is a continuation byte, back up. Back up a maximum of 3 times. This will find a truncation point that will not introduce more errors into the string than are already there."
Here's what I said:
"This only works for UTF-8, and theoretically fails with the older type of UTF-8 (when you could have up to 6 bytes, by spec). So you probably will have to go through it character by character, not byte by byte, exactly as Brons said."
So pretend you have a 12 character display. Your method, for UTF-8:
> Checks to see if the input is 12 or fewer bytes, and displays it fine (this works)
> If not, it goes to that 12th byte, then checks it to see if it is a continuation byte (a byte which, when ANDed with 0xC0, is equal to 0x80)
> If it is a continuation byte, and we haven't seen three in a row yet, increment the number seen, and back up one byte.
> If we found a non-continuation byte or we have seen three continuation bytes in a row, then what we are looking at must be a starter byte.
> Write four bytes, starting at (and overwriting) the starter byte: 0xE2 0x80 0xA6 0x00 (ellipsis, null character)
With this method, you definitely could have left some garbage to the right of the null (if the null landed in the middle of a character), but that's OK because the null ends the stream (if it doesn't, you'll need to pad some more nulls). An alternate method that doesn't stamp the null is vastly worse: if the character you stamped the three byte ellipsis over was only two bytes long, you would have eaten the first byte of the NEXT multibyte sequence, leaving you with an illegal data stream and no null to tell the next guy to stop.
But, anyway, this one works, like I said, but I claimed that it had two problems: "only for UTF-8" and "results in a VERY short message for some inputs". It also trivially fails for the pre-RFC-3629 UTF-8 standard, but I guess we are OK with that (that version can have up to five continuation bytes).
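For reference, here is a rough Python sketch of the steps as I read them. The function name and the choice to return only the bytes before the null are mine, not yours, and it assumes the input is valid UTF-8 with no embedded nulls. Python will happily grow the buffer if the stamp runs past the end; a fixed C buffer would not.

```python
ELLIPSIS_NUL = b"\xE2\x80\xA6\x00"  # U+2026 encoded as UTF-8, then a null


def truncate_by_bytes(data: bytes, limit: int) -> bytes:
    """Byte-offset truncation: returns the bytes that sit before the null."""
    if len(data) <= limit:
        return data  # short enough, display as-is

    buf = bytearray(data)
    pos = limit - 1  # the limit-th byte, zero-indexed
    seen = 0
    # Back up over continuation bytes (10xxxxxx), at most three times.
    while seen < 3 and pos > 0 and (buf[pos] & 0xC0) == 0x80:
        seen += 1
        pos -= 1

    # Stamp ellipsis + null over the starter byte and whatever follows it.
    buf[pos:pos + 4] = ELLIPSIS_NUL
    return bytes(buf[:pos + 3])  # everything up to, not including, the null


print(truncate_by_bytes(b"hello, wide world", 12).decode("utf-8"))
# → hello, wide…
```

For plain ASCII this does the sensible thing: 11 characters plus the ellipsis on a 12 character display.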
If your message was, let's say, 8 of the "smiling face with smiling eyes" emojis:
http://www.fileformat.info/inf...
(or equivalent 4 byte characters)
The algorithm of "go 12 bytes in" will skip past the first two entire "0xF0 0x9F 0x98 0x8A" sequences, landing on the 0x8A at the end of the third one. The algorithm will detect that this is a continuation byte, and back up the max times (through the 0x98 and the 0x9F), landing on (and stamping over) the 0xF0 initial byte. But this means that your output message is:
(happy face)(happy face)(ellipsis)
You took a 12 character display AND LIMITED IT TO TWO CHARACTERS. When in fact, the original message would have fit, if you did what Brons said.
Because you searched in N bytes, instead of doing what Brons said (and that you even fucking called "MORONIC"), you fucked your hypothetical user AND insulted the guy with the right answer at the meeting (or were at least rude to him, brusque, or superior without cause).
But, let's continue.
I also claimed that this "only works for UTF-8". This is pretty trivially true: you explicitly refer to "continuation bytes", which are definitely not present in all encoding methods. UTF-16 encodes each character as either one or two 16-bit words, and these are not "continuation bytes". With such an input, you would go 2*N bytes (N words) forward, and then check whether the word you landed on is a low surrogate (0xDC00 to 0xDFFF), meaning the second half of a pair. The high surrogate always comes first in the stream; byte ordering only changes how each individual word is laid out. If it is a low surrogate, back up one word to find the guaranteed start of character, and then stamp over that with your ellipsis. This is the general equivalent of your UTF-8 solution, but you still dramatically shorten what your user can display, to five happy faces and an ellipsis for their 8 character message that would have fit just fine.
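A hedged sketch of that UTF-16 equivalent, working on 16-bit words instead of bytes. The helper names are made up for illustration, it assumes well-formed UTF-16 (every low surrogate is preceded by its high surrogate), and it builds a new string rather than stamping in place.

```python
import struct

ELLIPSIS_WORD = 0x2026  # the ellipsis is a single 16-bit word in UTF-16


def to_words(text: str) -> list:
    """Encode text as a list of 16-bit UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def from_words(words: list) -> str:
    """Decode a list of 16-bit UTF-16 code units back into text."""
    return struct.pack("<%dH" % len(words), *words).decode("utf-16-le")


def truncate_by_words(text: str, limit: int) -> str:
    words = to_words(text)
    if len(words) <= limit:
        return text
    pos = limit - 1  # the limit-th word, zero-indexed
    # A low surrogate is the second half of a pair: back up one word.
    if pos > 0 and 0xDC00 <= words[pos] <= 0xDFFF:
        pos -= 1
    return from_words(words[:pos] + [ELLIPSIS_WORD])


faces = "\U0001F60A" * 8  # 8 characters, 16 words
print(len(truncate_by_words(faces, 12)))
# → 6  (five faces and the ellipsis)
```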
Unless the stream you are parsing is massive, such that going through it byte by byte would be costly, Brons has the correct solution. And a byte loop that spins over a string that fits in an iPhone text message, stopping when it has seen N characters, for small N, certainly isn't in the unloopable universe.
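Something like the loop below is what I have in mind. This is my own sketch, not Brons's code: it treats "character" as "code point" (every byte that is not a continuation byte starts one), assumes valid UTF-8 and max_chars of at least 1, and ignores things like combining marks and double-width glyphs.

```python
ELLIPSIS = "\u2026".encode("utf-8")


def truncate_by_chars(data: bytes, max_chars: int) -> bytes:
    """Keep at most max_chars characters, the ellipsis counting as one."""
    starts = []  # byte offsets where each character begins
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte: new character
            starts.append(offset)
            if len(starts) > max_chars:
                # Too long: keep max_chars - 1 characters, then the ellipsis.
                return data[:starts[max_chars - 1]] + ELLIPSIS
    return data  # max_chars or fewer characters: fits untouched


faces = ("\U0001F60A" * 8).encode("utf-8")  # 8 characters, 32 bytes
print(truncate_by_chars(faces, 12) == faces)
# → True
```

The loop bails out as soon as it has seen one character too many, so for a 12 character display it never looks at more than 13 characters' worth of bytes.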
You took the correct solution, out of all of Slashdot, and shit in its mouth. So annoying.