Quantum Jim's Journal: URI Help Needed

I have been hacking on a URI encoder, and I need some help. The encoder has to percent-encode every binary octet in a URI that is not an UNRESERVED character (i.e., anything except ASCII letters, digits, `-`, `.`, `_`, and `~`). The specs also say that the encoded value of an escaped octet depends on the character encoding used.

So say I have the string "Iñtërnâtiônàlizætiøn" in a URI. If it is encoded as UTF-8, the percent-encoded ASCII result should be:

I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n

If it is encoded as Latin-1 (ISO-8859-1), on the other hand:

I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n
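
For concreteness, here is a minimal sketch of the kind of encoder I mean (the class and method names are just made up for illustration, not my actual code):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PercentEncoder {
    // Percent-encode every octet that is not in RFC 3986's "unreserved" set:
    // ASCII letters, digits, '-', '.', '_', and '~'.
    static String percentEncode(String s, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            int octet = b & 0xFF;  // treat the byte as an unsigned octet
            if ((octet >= 'A' && octet <= 'Z') || (octet >= 'a' && octet <= 'z')
                    || (octet >= '0' && octet <= '9')
                    || octet == '-' || octet == '.' || octet == '_' || octet == '~') {
                out.append((char) octet);                   // unreserved: copy through
            } else {
                out.append(String.format("%%%02X", octet)); // everything else: %XX escape
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "Iñtërnâtiônàlizætiøn";
        System.out.println(percentEncode(s, StandardCharsets.UTF_8));
        // I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
        System.out.println(percentEncode(s, StandardCharsets.ISO_8859_1));
        // I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n
    }
}
```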

My problem is that I need to know what happens if you start with UTF-16 in big-endian, little-endian, or self-identifying (BOM) mode. The main concern is that UTF-16 uses two octets per character, while Latin-1 and ASCII use one (and UTF-8 uses one for ASCII characters). Straight encoding of UTF-16 octets (in any mode) produces lots of `%00` escapes, since a letter that is one octet in Latin-1 gains an extra zero octet in UTF-16. So the sequence "a b" could be "a%20b" in Latin-1, but "%00a%00%20%00b" in UTF-16BE. This output confuses me (and lots of software).

So what should I do with all the %00 (NUL) escapes? Keep them in the output? Remove them? And what should a UTF-16 encoding of "Iñtërnâtiônàlizætiøn" look like, in which endianness?
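
Here is roughly what that looks like on UTF-16 input, assuming the hypothetical percentEncode sketch above just escapes every non-unreserved octet as it finds it:

```java
// Reusing the hypothetical PercentEncoder.percentEncode(...) sketch from above:
System.out.println(PercentEncoder.percentEncode("a b", StandardCharsets.UTF_16BE));
// %00a%00%20%00b        (big-endian: a zero octet before each ASCII character)
System.out.println(PercentEncoder.percentEncode("a b", StandardCharsets.UTF_16LE));
// a%00%20%00b%00        (little-endian: the zero octets trail instead)
System.out.println(PercentEncoder.percentEncode("a b", StandardCharsets.UTF_16));
// %FE%FF%00a%00%20%00b  (self-identifying: a byte-order mark, then big-endian)
```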


Comments Filter:
  • I'm pretty sure that URIs should be represented in 'network byte order', in other words, 'big-endian'.

    But take a look at this... http://rf.net/~james/perli18n.html#Q17 [rf.net]

    As I poke around more, it looks like there is some more interesting info in section 2.5 of http://www.gbiv.com/protocols/uri/rfc/rfc3986.html#UCS [gbiv.com]

    • Thanks for the heads-up on the Perl, Unicode and i18N FAQ [rf.net]. The endianness of a URI itself is meaningless, since it must be encoded as ASCII, hence one byte per URI character. In fact, it has little to do with my query. I specified the endianness of the UTF-16 encoding since it uses two bytes per character.

      The problem is that when escaping from character encodings that don't look like ASCII, everything goes crazy. Say I have the Java String:

      "http://www.google.com#thing"

      Strings are stored as UTF-16 (assume big-

  • The URL I get for, e.g., the Bulgarian Wikipedia's article on Napoleon Bonaparte is:

    http://bg.wikipedia.org/wiki/%D0%9D%D0%B0%D0%BF%D0%BE%D0%BB%D0%B5%D0%BE%D0%BD_%D0%91%D0%BE%D0%BD%D0%B0%D0%BF%D0%B0%D1%80%D1%82 [wikipedia.org]

    which encodes one byte for each UTF-8 byte (not character) in the string. Pretty much every other byte is a %D0.

    Also, your string "a b" as "%00a%00%20%00b" is wrong. If anything, it's "a%00%20b", because the 'a' and 'b' are 16-bit already. What's more likely is "a%u20b" = "a%u0020b" or similar.
    • which encodes one byte for each UTF-8 byte (not character) in the string.

      That's what I meant.

      Also, your string "a b" as "%00a%00%20%00b" is wrong. If anything, it's "a%00%20b", because the 'a' and 'b' are 16-bit already.

      I'm not entirely convinced. RFC 3986 defines the encoding procedure in terms of 8-bit bytes. The letter 'a' is two bytes in UTF-16, so the %00 byte needs to be encoded (according to the specs I have on my desk). If you didn't, then there would be problems encoding non-textual data i
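
      Here is a tiny sketch of the byte-level view I mean (just Java's standard charset API, nothing fancy):

      ```java
      import java.nio.charset.StandardCharsets;
      import java.util.Arrays;

      public class Utf16Octets {
          public static void main(String[] args) {
              // In UTF-16BE the single letter 'a' really is two octets, 0x00 and 0x61.
              // RFC 3986 defines percent-encoding over octets, not characters, so the
              // 0x00 octet is just as much a candidate for escaping as any other byte.
              byte[] bytes = "a".getBytes(StandardCharsets.UTF_16BE);
              System.out.println(Arrays.toString(bytes)); // prints [0, 97]
          }
      }
      ```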
