Even then it's not a reasonable comparison. The fact that the teams behind the scanned proprietary software get to decide whether to be included feels to me like it would really influence the stats.
Would you expect there to exist any correlation between how shoddy software is and how likely the authors are to share information about how shoddy their software is? I would expect some correlation.
Let's accept the premise that proprietary vendors only submitted what they considered their best code. If the code bases tested were matched against OSS code bases that serve similar functions, then it is a valid comparison of similar types of software.
I don't see how you're answering the grandparent's concern here. Are you assuming that code quality is purely a function of the function of the code? Otherwise, why would you expect an OSS project in category X to be better than average just because a better-than-average proprietary X was submitted?
No, I am saying it would be a valid comparison between OSS and proprietary code if programs that perform the same function were tested in each category. OSS quality has no impact on proprietary quality, and vice versa; but if you compare OSS and proprietary web browsers, then it would be a reasonable comparison of the quality of each vs the other.
Let's make this a bit more concrete: Imagine that the defect density of proprietary projects is normally distributed with a mean of 0.6 and a standard deviation of 0.1. Then you would expect the defect density to range from below 0.5 to above 0.7, with 68% of projects lying between 0.5 and 0.7. If high-quality projects are preferentially submitted for review, then the mean of that subset will obviously be lower than 0.6. Let's say that the projects that were sent in were a web browser (0.55), a pdf reader (0.49) and a video player (0.50). You're saying that it would be fair if we compared this with open source web browsers, pdf readers and video players. But just as these categories happened to fluctuate low in defect density on the proprietary side, these categories may just as well happen to have atypically high defect densities on the OSS side. In fact, unless the quality of a project is strongly correlated with the category it is in, you would expect the mean of the OSS web browsers, pdf readers and video players to be the same as the total mean, since *they were not selected based on their own quality*, unlike the proprietary software.
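To make the selection effect concrete, here is a quick simulation sketch in Python. The 0.6/0.1 distribution, the three categories and the pool of 10 products per category are all invented numbers, and both sides are deliberately drawn from the same distribution:

    import random

    random.seed(1)

    # Hypothetical numbers only: defect densities drawn from the same
    # normal distribution (mean 0.6, sd 0.1) for both camps.
    def defect_density():
        return random.gauss(0.6, 0.1)

    categories = ["web browser", "pdf reader", "video player"]
    trials = 10_000
    proprietary_means, oss_means = [], []

    for _ in range(trials):
        # Assume 10 proprietary products per category, of which only the
        # best one (lowest defect density) is submitted for scanning.
        submitted = [min(defect_density() for _ in range(10)) for _ in categories]
        # OSS projects in the same categories are scanned regardless of quality.
        oss = [defect_density() for _ in categories]
        proprietary_means.append(sum(submitted) / len(submitted))
        oss_means.append(sum(oss) / len(oss))

    print("mean of submitted proprietary:", sum(proprietary_means) / trials)  # well below 0.6
    print("mean of scanned OSS:          ", sum(oss_means) / trials)          # about 0.6

Even though the underlying quality is identical on both sides here, the submitted proprietary subset comes out looking better, simply because it was selected on the very quantity being measured.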
The problem is you are assuming they all have a normal distribution. While that may be true for a large data set (in fact the central limit theorem would say so), we don't know what the distribution is for individual categories. While we don't know exactly what made up the data set, we do know the results for various sizes of code. It seems a reasonable assumption that similar types of programs would have similar-sized code bases, so you don't have really good small proprietary programs skewing results for really large ones, or vice versa; and similarly for OSS code.
Your argument about comparisons of similar types of software would only make sense if there were only one proprietary program of the type in question, and that is the one that was submitted for testing. Otherwise, you would expect to be comparing a better-than-average proprietary foo with an average OSS foo.
Not really, especially at the high end, where there are relatively few products in each category. There aren't that many commercial choices in many of the categories, and relatively few OSS ones either. As a result it is less likely that all the bad ones got left out; if they were, you wouldn't have much of a data set. So, assuming they had similar software in each category, the comparison drawn would be valid. Could we say any one program is equal to another? No, but we could say that for large programs OSS software is as bug-free as proprietary software, and vice versa.
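A rough sketch of that point, reusing the invented 0.6/0.1 distribution from above with made-up pool sizes, just to show the shape of the effect:

    import random

    random.seed(2)

    def mean_of_best(k, n, trials=20_000):
        """Simulated mean defect density when the k best of n products
        (lowest densities) are the ones that end up in the data set."""
        total = 0.0
        for _ in range(trials):
            densities = sorted(random.gauss(0.6, 0.1) for _ in range(n))
            total += sum(densities[:k]) / k
        return total / trials

    # Big pool with heavy cherry-picking vs. a small pool where most
    # products have to be included to have a data set at all.
    print("best 1 of 20:", round(mean_of_best(1, 20), 3))  # strongly biased low
    print("best 3 of 4: ", round(mean_of_best(3, 4), 3))   # much closer to the true 0.6

When the pool is large and only the best product gets submitted, the bias is big; when most of a small pool has to be included for there to be a data set at all, the measured mean stays much closer to the true one.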
I think the grandparent is completely correct that there is a potential bias here. But I have no idea how large it is.
I am not saying there is no bias, and without seeing specifically what comprised the test data we don't know what any bias is or how big it is. I take issue with comments that the results are invalid because people assume all OSS was tested but the proprietary code was cherry-picked. They don't like the results, so the immediate reaction is to claim they are false. My contention is that it is a reasonable comparison, especially for large programs, because the number of different programs that perform similar functions in each category is small enough that, if both data sets are reasonably large, any selection bias is likely to have a small impact on the final numbers. I think for other code-base sizes, if they have a similar mix of functionality, it would be reasonable as well, though there may be more variability in the results if you changed the actual programs tested. In the end, that is why I think you really need a list of what is in the results to reach a more definitive conclusion.
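As a rough illustration of that last point, again with invented numbers: if which program gets scanned in each category is essentially arbitrary, the aggregate figure can move around noticeably from one plausible selection to the next, which is exactly why the actual list matters:

    import random
    import statistics

    random.seed(3)

    # Hypothetical setup: 10 categories, each with a handful of programs whose
    # defect densities come from the same invented 0.6/0.1 distribution as before.
    def one_scan(programs_per_category=4, categories=10):
        picked = []
        for _ in range(categories):
            pool = [random.gauss(0.6, 0.1) for _ in range(programs_per_category)]
            picked.append(random.choice(pool))  # which program gets scanned is arbitrary
        return statistics.mean(picked)

    results = [one_scan() for _ in range(1000)]
    print("measured means range from", round(min(results), 3), "to", round(max(results), 3))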