Follow Slashdot blog updates by subscribing to our blog RSS feed

 



Forgot your password?
typodupeerror
×
User Journal

Journal arth1's Journal: Letter frequencies in URLs

Doing some maintenance on a few squid cache servers, I decided to look into the letter frequency distributions for URLs, and how it matches normal written text.
Four caches were scanned for the URLs of currently cached content only, constituting around 1.5 million URLs.

In short, the results have some of the same characteristics as normal text, but with notable exceptions. You don't get an etaoin shrdlu; there are a lot of h, t, p, colons and slashes in URLs which skew the results. I'm also surprised that w scored so low, given all the URLs that start with www.

If anyone else finds a use for this, here is the data. Each character in the URL is followed by the number of times it was used in each cache, plus the total for all four caches.

/: 83198 130244 3028097 2929538 6171077
t: 73026 99729 2727455 2641930 5542140
e: 52801 95537 1746624 1753865 3648827
.: 35317 60175 1478231 1467006 3040729
o: 40941 86873 1423124 1448453 2999391
a: 43075 72450 1408451 1384211 2908187
c: 36078 64921 1308435 1295986 2705420
s: 41946 76684 1251987 1278493 2649110
p: 28248 44907 1214805 1190698 2478658
m: 29609 45768 1168769 1195505 2439651
h: 22543 41992 1029463 1019494 2113492
i: 37846 58586 974977 994693 2066102
n: 30006 51596 815477 795344 1692423
r: 26958 53239 801514 774606 1656317
g: 23689 57734 666533 790131 1538087
d: 23304 36637 746244 697523 1503708
:: 15442 27059 639115 649013 1330629
w: 25563 41061 622672 629215 1318511
1: 9697 12580 577523 561429 1161229
l: 21855 32824 560110 542960 1157749
2: 9890 13516 492565 514385 1030356
u: 11878 15246 440808 431176 899108
0: 10333 13106 404229 445998 873666
v: 7450 8415 328991 292590 637446
b: 9980 26743 280533 285767 603023
3: 6296 6905 299391 272352 584944
f: 9866 25830 265685 266037 567418
4: 4738 5931 273161 244104 527934
k: 4202 5641 235501 230456 475800
5: 5957 6920 212941 235172 460990
7: 6497 7333 230677 200956 445463
9: 4327 5215 206613 195295 411450
8: 5363 6697 210689 178565 401314
6: 5761 6487 209092 175203 396543
x: 3853 5755 168401 144265 322274
-: 3516 11325 124398 133481 272720
y: 4348 5272 114803 96971 221394
_: 2301 2683 87749 80901 173634
j: 4436 5058 89043 72567 171104
=: 1555 1437 37342 35214 75548
q: 1494 1538 32910 37861 73803
z: 741 907 29563 30037 61248
,: 3282 2848 21099 14688 41917
&: 493 413 12558 9222 22686
%: 220 460 9640 11420 21740
;: 2878 2254 8281 8281 21694
?: 322 294 4796 9264 14676
+: 45 35 1333 1758 3171
~: 31 7 996 735 1769
$: 0 0 425 670 1095
^: 6 0 420 228 654
*: 27 10 187 188 412
!: 0 2 282 122 406
[: 0 0 292 23 315
]: 0 0 272 23 295
|: 8 8 77 167 260
@: 10 0 113 38 161
(: 0 0 75 55 130
): 0 0 69 55 124
{: 0 0 75 0 75
\: 0 0 6 4 10
': 0 0 1 1 2

Does it have any practical use?
Perhaps. In proxy.pac files, a common method of load balancing based on URLs, known as the Sharp Superproxy script, is to sum the ASCII values of the cache entries, and mod it by the number of servers, to pick a server to use. .pac files are javascript, and javascript does not have an easy method to return the ascii value for a character. So what's generally used is a function like:

function atoi(charstring) {
    if (charstring=="a") return 0x61; if (charstring=="b") return 0x62;
    if (charstring=="c") return 0x63; if (charstring=="d") return 0x64;
//.....
}

This can be speeded up by ordering the list in the order of frequency, starting with "/", "t", "e", ".", "o", "a" - just moving those few to the front, reduces the latency of the script significantly.

Also, hashing in URL history handling can be sped up if the most prevalent buckets are created. This could also be useful for other URL collections, like AV software URL matching. I am unaware of any that work directly with character based lookups, but it is certainly one way to do it.

Other uses?
In pen testing, having a frequency table like this can greatly aid in URL discovery speed.

But all in all, it was a fun exercise. Note that the variations may be great, especially for the bottom half of the list. Also note that the low count for the letter 'x' in the URLs might not match your users.

This discussion has been archived. No new comments can be posted.

Letter frequencies in URLs

Comments Filter:

He has not acquired a fortune; the fortune has acquired him. -- Bion

Working...