Comment Re:Can they do it with corporate code? (Score 2) 220
Did you read the part in the article where they're actually doing the matching based on the ASTs (abstract syntax trees), and so are able to identify authors even after the code goes through an obfuscator? Relevant quotes:
Their real innovation, though, was in developing what they call “abstract syntax trees” which are similar to parse trees for sentences, and are derived from language-specific syntax and keywords. These trees capture a syntactic feature set which, the authors wrote, “was created to capture properties of coding style that are completely independent from writing style.” The upshot is that even if variable names, comments or spacing are changed, say in an effort to obfuscate, but the functionality is unaltered, the syntactic feature set won’t change.
Accuracy rates weren’t statistically different when using off-the-shelf C++ code obfuscators. Since these tools generally work by refactoring names and removing spaces and comments, the syntactic feature set wasn’t changed so author identification at similar rates was still possible.
Regarding the first quote: The author of the article probably didn't realize that ASTs aren't a new thing; it's just this application of ASTs that's new. ASTs are as old as the hills. I learned about them from the Dragon Book, and by the time that was written they were old hat.
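If you want to see why renaming and whitespace stripping don't touch the AST, here's a rough sketch. Caveats: the paper worked on C++, and I'm using Python's built-in ast module purely as a stand-in. The feature I'm counting (parent/child node-type pairs) is my own guess at the general kind of thing a syntactic feature set might contain, not the authors' actual features, and node_type_bigrams and the sample snippets are names I made up for illustration.

```python
import ast
from collections import Counter


def node_type_bigrams(source: str) -> Counter:
    """Count (parent node type, child node type) pairs in the AST of `source`.

    Hypothetical feature for illustration only; identifiers, comments and
    whitespace never enter the count, only the shape of the tree does.
    """
    tree = ast.parse(source)
    counts = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            counts[(type(parent).__name__, type(child).__name__)] += 1
    return counts


ORIGINAL = '''
def total(prices, tax_rate):
    # add up the prices, then apply tax
    subtotal = 0
    for price in prices:
        subtotal += price
    return subtotal * (1 + tax_rate)
'''

# Same logic after a naive "obfuscator": names shortened, comment and
# spacing removed, loop body squeezed onto one line.
OBFUSCATED = '''
def a(b,c):
    d=0
    for e in b:d+=e
    return d*(1+c)
'''

# Same behaviour, but structured differently by a different "author".
DIFFERENT_STYLE = '''
def total(prices, tax_rate):
    return sum(prices) * (1 + tax_rate)
'''

if __name__ == "__main__":
    same = node_type_bigrams(ORIGINAL) == node_type_bigrams(OBFUSCATED)
    differs = node_type_bigrams(ORIGINAL) != node_type_bigrams(DIFFERENT_STYLE)
    print("obfuscated copy has identical features:", same)
    print("restructured version has different features:", differs)
```

The obfuscated copy parses to a tree of exactly the same shape, so a name-and-whitespace obfuscator buys you nothing here. Only actually restructuring the code (the third snippet) changes the counts, which lines up with what the second quote says about those tools.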