Also, please encode in UTF-8 if you can. UTF-16 is dangerous - namely because the vast majority of characters are only two bytes, but in rare cases they're four.
Even that is not quite correct. The vast majority of Unicode code points in common use take only two bytes in UTF-16, but in rarer cases (everything outside the Basic Multilingual Plane, which actually covers significantly more code points than the BMP itself) they're encoded as a surrogate pair: two two-byte code units, 4 bytes in total.
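To make that concrete, here is a minimal Python sketch that counts UTF-16 code units and bytes for a few code points. The helper name `utf16_info` and the sample characters are my own picks for illustration, not anything from the parent comment.

```python
def utf16_info(ch):
    """Return (code units, bytes) for a string encoded as UTF-16."""
    # "utf-16-le" fixes the byte order, so no BOM is prepended to the output.
    data = ch.encode("utf-16-le")
    return len(data) // 2, len(data)

# One ASCII letter, one BMP symbol (euro sign), one code point outside the BMP.
for ch in ["A", "\u20ac", "\U0001F600"]:
    units, nbytes = utf16_info(ch)
    print(f"U+{ord(ch):04X}: {units} code unit(s), {nbytes} bytes")

# U+0041: 1 code unit(s), 2 bytes
# U+20AC: 1 code unit(s), 2 bytes
# U+1F600: 2 code unit(s), 4 bytes
```

Anything at or above U+10000 comes out as two code units, which is exactly the case that code assuming "one character = two bytes" gets wrong.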
Remember these aren't "characters" but code points. There are also combining characters, e.g. U+0065 U+0301, which gives an e with an acute accent. The grapheme (what a reader sees as one character) is formed from two Unicode code points and takes 4 bytes in UTF-16. Others take more: U+0065 U+0302 U+0301 is an e with two accents (circumflex and acute), used in Vietnamese, and takes 6 bytes in UTF-16. There's also Han unification and probably other things I'm forgetting or don't know.
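The byte counts for the two combining sequences can be checked with a short Python sketch. The helper `describe` and the variable names are just illustrative choices.

```python
def describe(s):
    """Return (code points, UTF-16 bytes) for a string."""
    # len() on a Python str counts code points, not graphemes.
    return len(s), len(s.encode("utf-16-le"))

e_acute = "\u0065\u0301"                   # e + combining acute accent
e_circumflex_acute = "\u0065\u0302\u0301"  # e + combining circumflex + combining acute

print(describe(e_acute))             # (2, 4)
print(describe(e_circumflex_acute))  # (3, 6)
```

Both strings render as a single grapheme, yet neither the code point count nor the byte count is 1, so neither is a reliable stand-in for "number of characters".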
Unicode is a tricky beast. There's lots of stuff in there and lots of ways to get it wrong. I would agree that UTF-8 is the best encoding to use: it is compatible with ASCII, has no endianness issues, and doesn't tempt you to assume every code point fits in one code unit.
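A rough Python sketch of those three points, using sample strings I picked for illustration:

```python
ascii_text = "hello"

# ASCII compatibility: pure-ASCII text is byte-for-byte identical in UTF-8.
utf8_bytes = ascii_text.encode("utf-8")
print(utf8_bytes == ascii_text.encode("ascii"))  # True

# Endianness: UTF-16 has two byte orders, so the same text has two encodings.
utf16_le = "A".encode("utf-16-le")
utf16_be = "A".encode("utf-16-be")
print(utf16_le, utf16_be)  # b'A\x00' b'\x00A'

# Variable width is routine in UTF-8: 1 to 4 code units (bytes) per code point.
widths = [len(ch.encode("utf-8")) for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]]
print(widths)  # [1, 2, 3, 4]
```

Because multi-byte sequences show up in UTF-8 as soon as text leaves ASCII, code that mishandles them tends to break early rather than only on rare input.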