One reason this kind of problem occurs is that many collaborative filtering algorithms are measured by "root mean squared error" (RMSE): the square root of the mean of the squared differences between what was predicted and what the user actually did.
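As a minimal sketch (the function name and ratings here are invented purely for illustration), the metric is just this:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error: sqrt of the mean of the squared differences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Predicted vs. actual star ratings for a handful of items (made-up numbers).
print(rmse([4.5, 3.0, 2.0], [5.0, 3.0, 1.0]))  # ~0.645
```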
The problem with this metric? It doesn't account for a variety of important things. One is that most users value diversity. Another is that in most recommendation systems, what matters is the relative relevance of the recommendations to each other, whereas RMSE is an absolute measure of prediction error. And a really tricky one is that the recommendation algorithm itself can influence user behavior: for example, users may raise their standards if the algorithm does a better job.
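To make the ranking point concrete, here is a toy example (the numbers are invented): two predictors with identical RMSE can order the same two items in opposite ways, and only the ordering matters when you're choosing what to recommend.

```python
import math

def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted))

actual = [3.0, 2.9]        # item A is truly (slightly) better than item B
predictor_1 = [3.5, 2.4]   # ranks A above B -- the right recommendation
predictor_2 = [2.5, 3.4]   # ranks B above A -- the wrong recommendation

print(rmse(predictor_1, actual))  # 0.5
print(rmse(predictor_2, actual))  # 0.5 -- identical RMSE, opposite ranking
```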
The unfortunate answer is that the only rock-solid way to measure the effectiveness of recommendation algorithms is to test them with real users, perhaps splitting the user population between different algorithms and seeing which does best.
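A minimal sketch of that kind of split, assuming you have a stable user ID to bucket on (the hashing scheme and algorithm names here are just one common approach, not a prescription):

```python
import hashlib

def assign_algorithm(user_id, algorithms):
    """Deterministically bucket a user into one of the candidate algorithms."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return algorithms[int(digest, 16) % len(algorithms)]

algorithms = ["current_algorithm", "candidate_algorithm"]
print(assign_algorithm("user-12345", algorithms))
# After enough traffic accumulates, compare click/purchase rates between buckets.
```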
I'm pretty familiar with this issue as my day job is building a behavioral ad targeting engine. We learned a long time ago that while RMSE has its uses, there is often limited correlation between an algorithm's ability to predict user behavior retrospectively (which ads they will click on and what products they will buy), and how much additional revenue the algorithm will generate in practice.
Our solution is to use RMSE only as a first-blush indication of how good an algorithm is. Next, we take the top, say, 10% of ads with the best predictions and see what the actual click or conversion rate is within that 10%. This requires a higher volume of data, but yields results that are closer to what we find in reality. Lastly, the algorithm has to prove itself in the wild on a small subset of traffic. Only then can we really know whether one algorithm is an improvement over another.
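As a rough sketch of the middle step (the function name, the 10% cutoff, and the toy data are illustrative, not our actual pipeline): rank impressions by predicted score, keep the top slice, and measure the realized click or conversion rate within it.

```python
def top_fraction_conversion_rate(predictions, outcomes, fraction=0.10):
    """Realized click/conversion rate among the impressions the model scored highest."""
    ranked = sorted(zip(predictions, outcomes), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, int(len(ranked) * fraction))
    top = ranked[:cutoff]
    return sum(outcome for _, outcome in top) / len(top)

# Toy data: predicted click probabilities and whether a click actually happened.
preds  = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
clicks = [1,   1,   0,   0,   1,   0,   0,   0,    0,    0]
print(top_fraction_conversion_rate(preds, clicks))  # 1.0 -- only the top-scored impression is kept
```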