User Journal

Journal: Letter frequencies in URLs

Journal by arth1

While doing some maintenance on a few squid cache servers, I decided to look into the letter frequency distribution for URLs, and how it compares with that of normal written text.
Four caches were scanned for the URLs of currently cached content only, constituting around 1.5 million URLs.

In short, the results have some of the same characteristics as normal text, but with notable exceptions. You don't get an etaoin shrdlu; there are a lot of h, t, p, colons and slashes in URLs which skew the results. I'm also surprised that w scored so low, given all the URLs that start with www.

If anyone else finds a use for this, here is the data. Each character in the URL is followed by the number of times it was used in each cache, plus the total for all four caches.

/: 83198 130244 3028097 2929538 6171077
t: 73026 99729 2727455 2641930 5542140
e: 52801 95537 1746624 1753865 3648827
.: 35317 60175 1478231 1467006 3040729
o: 40941 86873 1423124 1448453 2999391
a: 43075 72450 1408451 1384211 2908187
c: 36078 64921 1308435 1295986 2705420
s: 41946 76684 1251987 1278493 2649110
p: 28248 44907 1214805 1190698 2478658
m: 29609 45768 1168769 1195505 2439651
h: 22543 41992 1029463 1019494 2113492
i: 37846 58586 974977 994693 2066102
n: 30006 51596 815477 795344 1692423
r: 26958 53239 801514 774606 1656317
g: 23689 57734 666533 790131 1538087
d: 23304 36637 746244 697523 1503708
:: 15442 27059 639115 649013 1330629
w: 25563 41061 622672 629215 1318511
1: 9697 12580 577523 561429 1161229
l: 21855 32824 560110 542960 1157749
2: 9890 13516 492565 514385 1030356
u: 11878 15246 440808 431176 899108
0: 10333 13106 404229 445998 873666
v: 7450 8415 328991 292590 637446
b: 9980 26743 280533 285767 603023
3: 6296 6905 299391 272352 584944
f: 9866 25830 265685 266037 567418
4: 4738 5931 273161 244104 527934
k: 4202 5641 235501 230456 475800
5: 5957 6920 212941 235172 460990
7: 6497 7333 230677 200956 445463
9: 4327 5215 206613 195295 411450
8: 5363 6697 210689 178565 401314
6: 5761 6487 209092 175203 396543
x: 3853 5755 168401 144265 322274
-: 3516 11325 124398 133481 272720
y: 4348 5272 114803 96971 221394
_: 2301 2683 87749 80901 173634
j: 4436 5058 89043 72567 171104
=: 1555 1437 37342 35214 75548
q: 1494 1538 32910 37861 73803
z: 741 907 29563 30037 61248
,: 3282 2848 21099 14688 41917
&: 493 413 12558 9222 22686
%: 220 460 9640 11420 21740
;: 2878 2254 8281 8281 21694
?: 322 294 4796 9264 14676
+: 45 35 1333 1758 3171
~: 31 7 996 735 1769
$: 0 0 425 670 1095
^: 6 0 420 228 654
*: 27 10 187 188 412
!: 0 2 282 122 406
[: 0 0 292 23 315
]: 0 0 272 23 295
|: 8 8 77 167 260
@: 10 0 113 38 161
(: 0 0 75 55 130
): 0 0 69 55 124
{: 0 0 75 0 75
\: 0 0 6 4 10
': 0 0 1 1 2
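
A table like the one above can be produced with a few lines of script. The following is only a sketch of the counting step, not the script used for these numbers: the function names and the sample URLs are made up, and folding to lowercase is an assumption (the table lists no uppercase letters).

```javascript
// Count how often each character occurs across a list of URLs.
// `urls` is any array of strings.
function countChars(urls) {
    var counts = {};
    for (var i = 0; i < urls.length; i++) {
        // Assumption: case is folded, since the table has no uppercase rows.
        var url = urls[i].toLowerCase();
        for (var j = 0; j < url.length; j++) {
            var ch = url.charAt(j);
            counts[ch] = (counts[ch] || 0) + 1;
        }
    }
    return counts;
}

// Return [character, count] pairs, most frequent first.
function rankChars(counts) {
    return Object.keys(counts)
        .map(function (ch) { return [ch, counts[ch]]; })
        .sort(function (a, b) { return b[1] - a[1]; });
}

// Hypothetical sample input; a real run would read the URLs from the cache.
var sample = ["http://www.example.com/", "http://example.org/a.html"];
var ranked = rankChars(countChars(sample));
for (var k = 0; k < ranked.length; k++) {
    console.log(ranked[k][0] + ": " + ranked[k][1]);
}
```

Running it once per cache and adding up the per-character counts would give the per-cache and total columns.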

Does it have any practical use?
Perhaps. In proxy.pac files, a common method of load balancing based on URLs, known as the Sharp Superproxy script, is to sum the ASCII values of the characters in the URL and take the result modulo the number of servers to pick a server to use. .pac files are JavaScript, and the older JavaScript engines these scripts were written for did not have an easy method to return the ASCII value of a character (current engines provide String.prototype.charCodeAt). So what's generally used is a function like:

function atoi(charstring) {
    if (charstring=="a") return 0x61; if (charstring=="b") return 0x62;
    if (charstring=="c") return 0x63; if (charstring=="d") return 0x64;
//.....
}
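
For context, the sum-and-modulo selection can be sketched as below. This is not the Sharp script itself: the proxy host names are placeholders, urlHash is a made-up name, and charCodeAt is used in place of the lookup function, which assumes an engine that supports it.

```javascript
// Hypothetical proxy list; replace with the real cache hosts.
var PROXIES = ["proxy1.example.com:3128", "proxy2.example.com:3128",
               "proxy3.example.com:3128", "proxy4.example.com:3128"];

// Sum the character codes of the URL and reduce modulo the server count.
function urlHash(url, buckets) {
    var sum = 0;
    for (var i = 0; i < url.length; i++) {
        // Assumption: charCodeAt is available; older scripts call atoi() here.
        sum += url.charCodeAt(i);
    }
    return sum % buckets;
}

// Standard PAC entry point: the same URL always maps to the same proxy.
function FindProxyForURL(url, host) {
    return "PROXY " + PROXIES[urlHash(url, PROXIES.length)];
}
```

On engines without charCodeAt, the atoi() lookup above takes its place inside the loop, and it runs once for every character of every URL requested.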

The lookup can be sped up by putting the comparisons in order of frequency, starting with "/", "t", "e", ".", "o", "a". Just moving those few to the front reduces the latency of the script significantly.
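
To get a feel for the difference, the number of comparisons can be counted for two orderings. This is a rough sketch under simplifying assumptions: only the ten most frequent characters from the table are listed, any other character is charged a full pass through the list, and the function names are made up.

```javascript
// The ten most frequent characters, in descending order of total count
// (first ten rows of the table).
var BY_FREQUENCY = ["/", "t", "e", ".", "o", "a", "c", "s", "p", "m"];

// The same characters sorted by character code, as a naive list might be.
var ALPHABETICAL = BY_FREQUENCY.slice().sort();

// Comparisons needed to find `ch` by linear search in `order`,
// or -1 if the character is not in the list.
function comparisonsFor(ch, order) {
    for (var i = 0; i < order.length; i++) {
        if (order[i] === ch) return i + 1;
    }
    return -1;
}

// Total comparisons to look up every character of `text` in `order`.
// Characters missing from the list cost a full pass.
function totalComparisons(text, order) {
    var total = 0;
    for (var i = 0; i < text.length; i++) {
        var n = comparisonsFor(text.charAt(i), order);
        total += (n === -1) ? order.length : n;
    }
    return total;
}

console.log(totalComparisons("/test/", BY_FREQUENCY));
console.log(totalComparisons("/test/", ALPHABETICAL));
```

For the string "/test/" this comes to 17 comparisons in frequency order against 38 in character-code order. How much that matters in practice depends on the engine and the URLs.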

Also, hashing in URL history handling can be sped up if the buckets for the most prevalent characters are created up front. This could also be useful for other URL collections, like AV software URL matching. I am unaware of any that work directly with character-based lookups, but it is certainly one way to do it.

Other uses?
In pen testing, having a frequency table like this can greatly aid in URL discovery speed.

But all in all, it was a fun exercise. Note that the variations may be great, especially for the bottom half of the list. Also note that the low count for the letter 'x' in the URLs might not match your users.

Security

Journal: Slashdot clandestinely scanning its users 2

Journal by arth1

I just discovered something I'm not sure I like.

Whenever I post something to Slashdot, Slashdot connects back to port 80 on the machine I post from, looking for an open proxy.
This isn't behavior I really like to see. It's unsolicited, and more to the point, it takes advantage of a local firewall possibly being temporarily open for traffic FROM an address for a short while after connecting TO it.
There might be a "good cause", like collecting a list of open proxies for the poor guy behind the Great Firewall of China or something similar, but it's still unsolicited, clandestine and not documented.

Here are a couple of web log entries showing this:
216.34.181.45 - - [10/Sep/2008:15:47:47 -0400] "GET http://news.slashdot.org/ok.txt HTTP/1.0" 404 271 "-" "libwww-perl/5.812"
216.34.181.45 - - [10/Sep/2008:20:32:18 -0400] "GET http://mobile.slashdot.org/ok.txt HTTP/1.0" 404 273 "-" "libwww-perl/5.812"
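
What marks these as proxy probes is the request target: a full http:// URL instead of a plain path, which is the form a client sends when it treats the server as a proxy. A sketch for picking such entries out of a log follows. The function name and the returned field names are made up, and the pattern only assumes the log layout shown above.

```javascript
// Parse an Apache-style access log line and report whether the request
// target is an absolute URL rather than a path.
function parseLogLine(line) {
    var m = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) ([^"]*)" (\d{3}) (\S+)/.exec(line);
    if (!m) return null; // line does not match the assumed layout
    return {
        client: m[1],
        time: m[2],
        method: m[3],
        target: m[4],
        status: parseInt(m[6], 10),
        // True when the target starts with a scheme such as "http://".
        proxyStyle: /^[a-z][a-z0-9+.-]*:\/\//i.test(m[4])
    };
}

var entry = parseLogLine('216.34.181.45 - - [10/Sep/2008:15:47:47 -0400] ' +
    '"GET http://news.slashdot.org/ok.txt HTTP/1.0" 404 271 "-" "libwww-perl/5.812"');
console.log(entry.client + " " + entry.target + " proxy-style: " + entry.proxyStyle);
```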

United States

Journal: New federal "security" regs on hundreds of common chemicals 3

Journal by Ungrounded Lightning

Big brother is at it again. The Department of Homeland Security is issuing new regulations requiring reporting on, and guarding of, hundreds of common chemicals with "terrorist applications" (such as propane, hydrogen peroxide, chlorine, ...). This impacts farms, universities, industries from pool supplies to medicine to janitorial, small business, startups, and the general public.

Wireless Networking

Journal: Total bandwidth with MIMO and "smart antennas" 5

Journal by Ungrounded Lightning

A thread in the Slashdot article The 700MHz Question drifted into a discussion between me and rcw-home/rcw-work on using multiple antennas to synthesize multiple patterns. This allows a particular hunk of bandwidth to be reused to generate several full-bandwidth links simultaneously - either between a base station and several remote stations or even between the base station and a single remote station that itself has multiple antennas.

The thread is beginning to horizon out on my user info history. So this journal entry is a new venue for its continuation after rcw-*'s most recent post.

I'll respond to that after he posts here to indicate that he's also making the move.

Wireless Networking

Journal: 802.11n Good Enough to Replace Ethernet for Enterprise

Journal by anaesthetica
A new report says that the increased speeds and other features of 802.11n should be enough to replace Ethernet on any company's WAN in the next two to three years. While 802.11n speeds still fall far short of those of gigabit Ethernet, The Burton Group believes that they should be good enough for most uses. In fact, in its list of recommendations on when to deploy 802.11n, one of the criteria listed is "when fast Ethernet (100Mbps) throughput is good enough."
United States

Journal: Presidential Candidates' Tech Records

Journal by anaesthetica
Technology Daily features an overview of each major U.S. Presidential candidate's record on technology issues. The information was "compiled from the Congressional Record, speeches and statements on campaign Web sites." Third party candidates are ignored, but eight Democrats and eight Republicans are covered. This information is an important jumping-off point for the informed Slashdot voter.
Patents

Journal: James Madison on Intellectual Property

Journal by anaesthetica
The Volokh Conspiracy, a legal blog, writes about James Madison and his opinion on intellectual property. Madison, "Father of the Constitution" and author of the Bill of Rights, disliked the idea of intellectual "property," viewing it as a dangerous grant of monopoly. While he stressed that such grants ought to be strictly limited, he apparently went further saying that the government ought to be able to buy back the grant of monopoly to protect the public from being fleeced or placed under inconvenient restrictions.
Mozilla

Journal: Camino 1.5 Released

Journal by anaesthetica
It's been a long time coming, but the Camino browser for Mac OS X has reached a new milestone. Camino 1.5 is based on Gecko 1.8.1, and includes a slew of new features including spell checking, session saving, improved pop-up blocking, enhanced plug-in control, and window zooming, among others. All this comes wrapped in a website redesigned by Jon Hicks. You can read more about the release at Camino Planet.
Mozilla

Journal: The State of Mozilla as a Platform

Journal by anaesthetica

A number of stories have recently surfaced asking where Mozilla is going as a platform and whether it risks being outflanked by proprietary rivals. Chris Messina, a former Flock developer and SpreadFirefox volunteer, posted a 50-minute vlog enumerating his concerns about Mozilla. His discussion centered around Mozilla 2 potentially missing the forest for the trees, becoming overly focused on the short-term successes that Firefox has enjoyed, while failing to outrun the proprietary flanking actions being undertaken by Adobe and Microsoft for the next generation internet technologies.

Richard McManus at Read/WriteWeb expands the discussion of Mozilla's direction. His central concern is the adoption of microformats and what that will mean for Mozilla's position. He comments that microformats are already a step in the direction that Messina is pointing toward--a web that remains open.

Mike Shaver, technology strategist for Mozilla, has posted his own discussion of Adobe and Microsoft's proprietary tools intended to close off the web--Apollo and Silverlight--and what this means for Mozilla, if anything.

Finally, concerning Mozilla missing the forest for the trees, Ben Goodger, lead Firefox developer, reports on a Mozilla Corp board member, Brendan Eich, essentially writing off the non-Firefox products offered by Mozilla. Goodger wonders whether it would be a better strategic move for non-Firefox developers to begin seeking greater autonomy from Mozilla Corp, a speculation that Mike Pinkerton, lead developer of Camino, caught flak for a couple of months ago when he opened the possibility of dropping Gecko for WebKit.

Linux Business

Journal: Siracusa: Linux Fails to Think "Across Layers" 521

Journal by anaesthetica
John Siracusa writes a brief article at Ars Technica pointing out an exchange between Andrew Morton, a lead developer of the Linux kernel, and a ZFS developer. Morton accused ZFS of being a "rampant layering violation." Siracusa states that this attitude of refusing to think holistically ("across layers") is responsible for all the current failings of Linux--desktop adoption, user-friendliness, consumer software, and gaming. ZFS is effective because it crosses the lines set by conventional wisdom. Siracusa ultimately believes that the ability to achieve such a break lies more with an authoritative, top-down corporate capacity than with the grassroots, fractious Linux community.
Education

Journal: Schools Ending Laptop Programs 308

Journal by anaesthetica
The New York Times reports that schools are abandoning their laptops-for-students programs. It turns out that the expense of providing laptops, expense of repairing laptops, difficulties of school network management, and discipline problems stemming from pornography, cheating, and cracking more than outweighed the educational benefits. Indeed, a number of schools have concluded that far from improving student achievement, laptops either had no effect or actively hindered academic performance. Apparently, politicians embracing technology as a quick fix for social problems doesn't always work out.
Censorship

Journal: Digg Users Revolt Over HD-DVD Key

Journal by anaesthetica
Social news site Digg has been flooded with stories reprinting the HD DVD processing key covered earlier here. At one point, the entire front page was comprised of stories which in one way or another were related to the hex numbers that were removed by Digg administrators. Digg users quickly pointed to the HD DVD sponsorship of Diggnation, the Digg podcast show. Is this outburst a hissy fit thrown by immature users acting without regard to Digg's legal liability, or is it a legitimate act of civil disobedience?
Movies

Journal: Hollywood vs. Sealand 3

Journal by Ungrounded Lightning

In a slashdot posting titled "Hollywood vs. Sealand" on April 2 2007, I:
  - Made a movie proposal,
  - Asserted copyright,
  - Offered to license it,
  - Threatened possible infringement suits if such a movie is made sans license, and
  - Directed anyone wishing to license it to contact me by leaving a message in my journal. B-)

This journal entry is to receive such messages.
