It needs a lot more qualifiers than that.
For a start, as with an unfortunate number of academic studies, it appears that the sample population consisted of undergraduates and recent graduates. That alone completely invalidates any conclusions as they might apply to experienced professionals with better judgement about when and how to use refactoring techniques.
Even without that, there seem to be a number of fundamental concerns about the data.
One obvious example is that they consider lines of code to be a metric that tells you anything useful beyond the width you need to allow for the line number margin in your text editor. I doubt most experienced programmers would agree that a LOC count in isolation tells us anything useful about maintainability or that the mere fact that LOC went up or down after a change necessarily meant the code had become better or worse in any useful sense.
Another concern is that they talk about "analysability", but this seems to be measured only by reference to a brief examination of a small code base in one of two versions, unrefactored and refactored. I'd like to know what the actual code looked like before I read anything at all into that data -- what refactoring was performed, what was the motivation for each change, and how do they know those two small code bases were representative of either refactoring in general or the effectiveness of refactoring on larger code bases or code bases that developers have more time to study and work with?
I'm all for empirical data -- goodness knows, we need more objective information about what really works in an industry as hype-driven and accepting of poor quality as ours -- but I'm afraid this particular study seems to be so flawed that it really tells us very little of value.