
alex fernandez's Journal: Is there a terminology war?

Journal by alex fernandez
the discovery

On October 10, I found esr's article about open source terminology. Several flaws immediately stood out: the author was searching for "open source" on sourceforge.net, but the pages there carried a copyright notice, "(c) 2004 Open Source Technology Group", and it appeared on about 95% of them! This introduced a bias into his results; it's why "free software" was apparently used about 20 times less often than "open source".

But how much of a bias? When searching for pages without "open source technology group", only developer-created pages remained (about a hundred thousand from the original million+); this time usage was much more balanced. My preliminary results showed something like 60%-40% in favor of "open source". Mind you, these were pages created by developers, for developers, even if there are some false positives (e.g. GPL license copies appear a lot).

A similar but opposite effect could be noticed in searches on savannah.gnu.org and savannah.nongnu.org: all pages included a copyright notice saying "(c) 200[...] Free Software Foundation". This time it was almost all pages; filtering the copyright notice out of searches leaves you with a handful of pages (13 to be exact). Not meaningful enough for a comparison.

correspondence

At this point I emailed esr to let him know about this. He pointed out some possible mistakes in my analysis, which I rebutted. One was that Yahoo! should not index CGI-generated pages. Well, it does; search for the copyright notice text and you'll soon find out. Another was that the copyright notice might not have been there at the time, since OSDN had just become OSTG. This is not correct either (just look at this blog and a comment on lwn.net's announcement at the time).

the numbers

For the sake of correctness, these are the results from October 13, which were sent to esr (each search term is followed on the next line by its result count):

'site:sourceforge.net (("open source" OR "free software") AND NOT "open source technology group")'
160,000

'site:sourceforge.net (("open source" AND "free software") AND NOT "open source technology group")'
18,100

'site:sourceforge.net ("open source" AND NOT "open source technology group")'
99,000

'site:sourceforge.net ("free software" AND NOT "open source technology group")'
73,600

'site:sourceforge.net (("open source" AND NOT "free software") AND NOT "open source technology group")'
82,100

'site:sourceforge.net (("free software" AND NOT "open source") AND NOT "open source technology group")'
55,300

Here the error margin is about 2.8%. Developer usage of "free software" is 46% non-exclusive, 35% exclusive; for "open source", it is 56% non-exclusive, 51% exclusive. As you can see, this is nothing as earth-shattering as esr's claims.
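
For anyone who wants to redo the arithmetic, here is a minimal sketch in Python. It assumes each percentage is simply a result count divided by the 160,000 pages matching either term, and that the "error margin" is the gap between that total and the sum of the three disjoint counts (only "open source", only "free software", both). Those definitions are my reading of the numbers above, not something stated in the correspondence; the function and key names are made up for illustration. Under these assumptions the non-exclusive share for "open source" comes out nearer 62% than the 56% quoted above, so that figure may have been computed on a different basis.

```python
# Result counts from the October 13 searches listed above
# (sourceforge.net, "open source technology group" pages excluded).
counts = {
    "either": 160_000,             # "open source" OR "free software"
    "both": 18_100,                # "open source" AND "free software"
    "open_source": 99_000,         # non-exclusive
    "free_software": 73_600,       # non-exclusive
    "open_source_only": 82_100,    # exclusive
    "free_software_only": 55_300,  # exclusive
}


def share(part: int, total: int) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


def consistency_gap(c: dict) -> float:
    """Assumed definition of the error margin: how far the three disjoint
    counts are from adding up to the 'either' total, as a percentage."""
    partition = c["open_source_only"] + c["free_software_only"] + c["both"]
    return share(abs(c["either"] - partition), c["either"])


total = counts["either"]
for key in ("free_software", "free_software_only",
            "open_source", "open_source_only"):
    print(f"{key}: {share(counts[key], total):.1f}%")
print(f"consistency gap: {consistency_gap(counts):.1f}%")
```

The same two functions can be pointed at counts from any other site or search engine to see whether its numbers add up before trusting the ratios.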

letters

Once the possible errors had been hashed out, I did not hear from the author again from October 13 to October 24. My natural reaction was to wait for a page update with corrected results; esr is a busy man, so it would not hurt to wait some time. Then yesterday I wrote to inform him that I was ready to publish my findings as a letter to the editor on lwn.net; he replied immediately to say:

Let's write a joint one. I don't want to ignore your concerns, and I don't want an artificial controversy either.

It looked like a reasonable proposition, and an honor too. So I sent him what I had written so far, including new findings.

what? more?

While looking at the results for "open source" on news.com, I noticed a funny sequence in the search results: "... Operating systems. Open source. Standards." It looked like automatically generated text, always the same. But trying to filter out this text did not work. Well, as it happens, "open source" is a menu item; just hover your mouse over "enterprise software" in the navigation bar. If you filter out "open source", only 121,000 pages remain (from a total of 2.1 million pages indexed), and most of the news items are gone. Again, this removes all significance from esr's results on news.com.

It also adds a new interesting effect. In whole-web usage, esr estimated a rate of false positives in the term "free software", but assumed that all occurrences of "open source" were real (and there were some 32 million of them). In my own searches, out of 24 million occurrences of "open source" some 20% of them were false positives from just two sites (sourceforge.net and news.com). It looks like a new source of false positives.

Furthermore, results seem to vary wildly over time. In my last round of searches, on October 24, usage of "free software" was 47% non-exclusive, 36% exclusive; for "open source", it was 65% non-exclusive, 54% exclusive, with an error margin of about 1.2%. Also, different search engines yield different rates. Google and A9 probably cannot be trusted since their numbers don't add up, but Teoma shows a higher rate for "free software". Errors in the estimates seem to be very high.

purpose

I'm waiting for esr to complete his part of the joint letter to lwn.net's editor. My purpose is not to start an artificial controversy either; that's why I disabled comments. This entry is intended to document the process so far.
