Quantum Jim's Journal: URI Help Needed
I have been hacking on a URI encoder, and I need some help. The encoder has to percent-encode every binary octet in a URI that isn't an unreserved character (i.e., anything except ASCII letters, digits, `-`, `.`, `_`, and `~`). The specs also say that the value of an escaped octet depends on the character encoding used.
So say I have the string "Iñtërnâtiônàlizætiøn" in a URI. If it were in UTF-8, after encoding, the ASCII should be:
I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
If it was in Latin-1 (ISO-8859-1) on the other hand:
I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n
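The procedure behind both outputs is octet-based: serialize the string in the chosen character encoding, then escape every octet outside the unreserved set. A minimal sketch in Python (the name `pct_encode` is mine, not a standard function):

```python
def pct_encode(s: str, encoding: str) -> str:
    """Percent-encode per RFC 3986: serialize the string to octets in the
    given character encoding, then escape every non-unreserved octet."""
    unreserved = set(
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        b"abcdefghijklmnopqrstuvwxyz"
        b"0123456789-._~"
    )
    return "".join(
        chr(b) if b in unreserved else "%%%02X" % b
        for b in s.encode(encoding)
    )

print(pct_encode("Iñtërnâtiônàlizætiøn", "utf-8"))
# → I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
print(pct_encode("Iñtërnâtiônàlizætiøn", "latin-1"))
# → I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n
```

The only thing that changes between the two results is the `encode()` step; the escaping rule itself never looks at characters, only at octets.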
My problem is that I need to know what happens if you start with UTF-16 in big-endian, little-endian, or self-identifying (BOM) mode. The main concern is that UTF-16 uses two octets per character, while Latin-1 and ASCII use one (as does UTF-8 for ASCII characters). Straight-encoding the UTF-16 octets (in any mode) produces lots of `%00` escapes, since a character from the Latin-1 range gains a zero byte in UTF-16. So the sequence "a b" could be "a%20b" in Latin-1, but "%00a%00%20%00b" in UTF-16BE. This output confuses me (and lots of software).
So what should I do with all the %00 (NUL) bytes? Keep them in the output? Remove them? What should a UTF-16-encoded "Iñtërnâtiônàlizætiøn" look like, and in which endianness?
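Applying the same octet rule mechanically to UTF-16 shows exactly where the `%00`s land; the two endiannesses only differ in whether the zero octet comes first or second. A sketch (again, `pct_encode` is my own name):

```python
def pct_encode(s: str, encoding: str) -> str:
    """Escape every octet of the encoded string except unreserved ASCII."""
    unreserved = set(
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
    )
    return "".join(
        chr(b) if b in unreserved else "%%%02X" % b
        for b in s.encode(encoding)
    )

# Big-endian puts the zero octet first, little-endian second:
print(pct_encode("a b", "utf-16-be"))  # → %00a%00%20%00b
print(pct_encode("a b", "utf-16-le"))  # → a%00%20%00b%00
# Self-identifying mode would additionally prepend an encoded BOM
# (%FE%FF for big-endian, %FF%FE for little-endian).
print(pct_encode("Iñtërnâtiônàlizætiøn", "utf-16-be"))
```

Note that the UTF-16BE result for "Iñtërnâtiônàlizætiøn" is just the Latin-1 result with `%00` inserted before every character, since all of its code points fit in one byte.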
big-endian (Score:2)
I'm pretty sure that URIs should be represented in 'network byte order', in other words, big-endian.
But take a look at this... http://rf.net/~james/perli18n.html#Q17 [rf.net]
As I poke around more, it looks like there is some more interesting info in section 2.5 of http://www.gbiv.com/protocols/uri/rfc/rfc3986.html#UCS [gbiv.com]
Re:big-endian (Score:2)
Thanks for the heads-up on the Perl, Unicode and i18N FAQ [rf.net]. The endianness of a URI itself is meaningless, since it must be encoded as ASCII, hence one byte per URI character. In fact, it has little to do with my query. I specified the endianness of the UTF-16 encoding since it uses two bytes per character.
The problem is that when the character encoding being escaped doesn't look like ASCII, everything goes crazy. Say I have a Java String:
Strings are stored as UTF-16 (assume big-endian).
UTF-8 (Score:2)
http://bg.wikipedia.org/wiki/%D0%9D%D0%B0%D0%BF%D0%BE%D0%BB%D0%B5%D0%BE%D0%BD_%D0%91%D0%BE%D0%BD%D0%B0%D0%BF%D0%B0%D1%80%D1%82 [wikipedia.org]
which uses one percent-escape for each UTF-8 byte (not each character) in the string. Pretty much every other escaped byte is %D0.
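For what it's worth, the standard library round-trips that path back to the Cyrillic title (a sketch; the URL decodes to the Bulgarian article on Napoleon Bonaparte):

```python
from urllib.parse import unquote

# The percent-encoded path segment from the Wikipedia URL above:
path = ("%D0%9D%D0%B0%D0%BF%D0%BE%D0%BB%D0%B5%D0%BE%D0%BD"
        "_%D0%91%D0%BE%D0%BD%D0%B0%D0%BF%D0%B0%D1%80%D1%82")

# unquote() reassembles the octets and decodes them as UTF-8 by default:
print(unquote(path))  # → Наполеон_Бонапарт
```

Every Cyrillic character in that range is two UTF-8 bytes starting with 0xD0 or 0xD1, which is why the escapes alternate the way they do.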
Also, your string "a b" as "%00a%00%20%00b" is wrong. If anything, it's "a%00%20b", because the 'a' and 'b' are 16-bit already. What's more likely is "a%u20b"="a%u0020b" or similar
Re:UTF-8 (Score:2)
That's what I meant.
I'm not entirely convinced. RFC 3986 defines the encoding procedure in terms of 8-bit octets. The letter 'a' is two bytes in UTF-16, so the %00 byte needs to be encoded (according to the specs I have on my desk). If you didn't, there would be problems encoding non-textual data in a URI.
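That octet-oriented reading is easy to demonstrate with an encoder that takes raw bytes instead of a string (a sketch; `pct_encode_bytes` is my own name). If `%00` were dropped, the two inputs below would collide:

```python
def pct_encode_bytes(data: bytes) -> str:
    """Escape every octet except unreserved ASCII; works on arbitrary bytes."""
    unreserved = set(
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
    )
    return "".join(
        chr(b) if b in unreserved else "%%%02X" % b for b in data
    )

# The UTF-16BE octets of "a b" include zero bytes that must survive:
print(pct_encode_bytes(b"\x00a\x00 \x00b"))  # → %00a%00%20%00b
# Dropping %00 would make this indistinguishable from plain ASCII "a b":
print(pct_encode_bytes(b"a b"))              # → a%20b
```

Since the escaping layer can't know whether a zero octet is UTF-16 padding or meaningful binary data, it has to encode it faithfully and leave interpretation to the decoder.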