Quantum Jim's Journal: URI Help Needed
I have been hacking on a URI encoder, and I need some help. The encoder has to percent-encode every binary octet in a URI that isn't an unreserved character (i.e., anything except ASCII letters, digits, `-`, `.`, `_`, and `~`). The specs also say that the value of an escaped octet depends on the character encoding used.
So say I have the string "Iñtërnâtiônàlizætiøn" in a URI. If it were in UTF-8, after encoding, the ASCII should be:
I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
If it was in Latin-1 (ISO-8859-1) on the other hand:
I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n
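The procedure behind both outputs is octet-based: serialize the string in the chosen character encoding, then escape every octet outside the unreserved set. A minimal sketch in Python (the name `pct_encode` is mine, not a standard function):

```python
def pct_encode(s: str, encoding: str) -> str:
    """Percent-encode per RFC 3986: serialize the string to octets in the
    given character encoding, then escape every non-unreserved octet."""
    unreserved = set(
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        b"abcdefghijklmnopqrstuvwxyz"
        b"0123456789-._~"
    )
    return "".join(
        chr(b) if b in unreserved else "%%%02X" % b
        for b in s.encode(encoding)
    )

print(pct_encode("Iñtërnâtiônàlizætiøn", "utf-8"))
# → I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
print(pct_encode("Iñtërnâtiônàlizætiøn", "latin-1"))
# → I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n
```

The only thing that changes between the two results is the `encode()` step; the escaping rule itself never looks at characters, only at octets.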
My problem is that I need to know what happens if you start with UTF-16 in big-endian, little-endian, or self-identifying (BOM) mode. The main concern is that UTF-16 uses two octets per character, while Latin-1 and ASCII use one (as does UTF-8 for ASCII characters). Straight-encoding the UTF-16 octets (in any mode) produces lots of `%00` escapes, since a character from the Latin-1 range gains a zero byte in UTF-16. So the sequence "a b" could be "a%20b" in Latin-1, but "%00a%00%20%00b" in UTF-16BE. This output confuses me (and lots of software).
So what should I do with all the %00 (NUL) bytes? Keep them in the output? Remove them? What should a UTF-16-encoded "Iñtërnâtiônàlizætiøn" look like, and in which endianness?
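Applying the same octet rule mechanically to UTF-16 shows exactly where the `%00`s land; the two endiannesses only differ in whether the zero octet comes first or second. A sketch (again, `pct_encode` is my own name):

```python
def pct_encode(s: str, encoding: str) -> str:
    """Escape every octet of the encoded string except unreserved ASCII."""
    unreserved = set(
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
    )
    return "".join(
        chr(b) if b in unreserved else "%%%02X" % b
        for b in s.encode(encoding)
    )

# Big-endian puts the zero octet first, little-endian second:
print(pct_encode("a b", "utf-16-be"))  # → %00a%00%20%00b
print(pct_encode("a b", "utf-16-le"))  # → a%00%20%00b%00
# Self-identifying mode would additionally prepend an encoded BOM
# (%FE%FF for big-endian, %FF%FE for little-endian).
print(pct_encode("Iñtërnâtiônàlizætiøn", "utf-16-be"))
```

Note that the UTF-16BE result for "Iñtërnâtiônàlizætiøn" is just the Latin-1 result with `%00` inserted before every character, since all of its code points fit in one byte.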
big-endian (Score:2)
I'm pretty sure that URIs should be represented in 'network byte order', in other words, big-endian.
But take a look at this... http://rf.net/~james/perli18n.html#Q17 [rf.net]
As I poke around more, it looks like there is some more interesting info in section 2.5 of http://www.gbiv.com/protocols/uri/rfc/rfc3986.html#UCS [gbiv.com]
Re:big-endian (Score:2)
Thanks for the heads-up on the Perl, Unicode and i18N FAQ [rf.net]. The endianness of a URI itself is meaningless, since it must be encoded as ASCII, hence one byte per URI character. In fact, it has little to do with my query. I specified the endianness of the UTF-16 encoding since it uses two bytes per character.
The problem is that when the character encoding being escaped doesn't look like ASCII, everything goes crazy. Say I have a Java String:
Strings are stored as UTF-16 (assume big-endian).
UTF-8 (Score:2)
http://bg.wikipedia.org/wiki/%D0%9D%D0%B0%D0%BF%D0%BE%D0%BB%D0%B5%D0%BE%D0%BD_%D0%91%D0%BE%D0%BD%D0%B0%D0%BF%D0%B0%D1%80%D1%82 [wikipedia.org]
which uses one percent-escape for each UTF-8 byte (not each character) in the string. Pretty much every other escaped byte is %D0.
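For what it's worth, the standard library round-trips that path back to the Cyrillic title (a sketch; the URL decodes to the Bulgarian article on Napoleon Bonaparte):

```python
from urllib.parse import unquote

# The percent-encoded path segment from the Wikipedia URL above:
path = ("%D0%9D%D0%B0%D0%BF%D0%BE%D0%BB%D0%B5%D0%BE%D0%BD"
        "_%D0%91%D0%BE%D0%BD%D0%B0%D0%BF%D0%B0%D1%80%D1%82")

# unquote() reassembles the octets and decodes them as UTF-8 by default:
print(unquote(path))  # → Наполеон_Бонапарт
```

Every Cyrillic character in that range is two UTF-8 bytes starting with 0xD0 or 0xD1, which is why the escapes alternate the way they do.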
Also, your string "a b" as "%00a%00%20%00b" is wrong. If anything, it's "a%00%20b", because the 'a' and 'b' are 16-bit already. What's more likely is "a%u20b"="a%u0020b" or similar
Re:UTF-8 (Score:2)
That's what I meant.
I'm not entirely convinced. RFC 3986 defines the encoding procedure in terms of 8-bit octets. The letter 'a' is two bytes in UTF-16, so the %00 byte needs to be encoded (according to the specs I have on my desk). If you didn't, there would be problems encoding non-textual data in a URI.
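That octet-oriented reading is easy to demonstrate with an encoder that takes raw bytes instead of a string (a sketch; `pct_encode_bytes` is my own name). If `%00` were dropped, the two inputs below would collide:

```python
def pct_encode_bytes(data: bytes) -> str:
    """Escape every octet except unreserved ASCII; works on arbitrary bytes."""
    unreserved = set(
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
    )
    return "".join(
        chr(b) if b in unreserved else "%%%02X" % b for b in data
    )

# The UTF-16BE octets of "a b" include zero bytes that must survive:
print(pct_encode_bytes(b"\x00a\x00 \x00b"))  # → %00a%00%20%00b
# Dropping %00 would make this indistinguishable from plain ASCII "a b":
print(pct_encode_bytes(b"a b"))              # → a%20b
```

Since the escaping layer can't know whether a zero octet is UTF-16 padding or meaningful binary data, it has to encode it faithfully and leave interpretation to the decoder.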