Comment Re:plagiarism differs in science vs. English Lit. (Score 1) 111
... Yet, syntactic matching appears to be exactly what this program is doing.
What constitutes "plagiarism" in a scientific paper is very different from plagiarism in journalism or English literature. In scientific writing, authors are expected to use the same flat, impersonal style and to repeat definitions and the results of others, saving the reader the trouble of looking them up. So simple pattern matching between science papers will produce a great many false positives. In science (and math) writing, what matters is the new result that the author is claiming, and it seems to me it would be nearly impossible for a computer program to detect that distinction.
Hours of speculation and typing can save one minute of reading TFA. From the article:
"Unlike other plagiarism detectors, it does not use phrases or similar words to check for copying. Helio Text actually looks at the entirety of the text."
So no, it does not. Instead, it uses some sort of similarity metric computed over the entire text. This is possibly similar to the text-distance metrics used in vector space search engine models (see: en.wikipedia.org/wiki/Vector_space_model ). They will be publishing a paper online in PLoS ONE.
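For the curious, here is a minimal sketch of the kind of vector-space similarity metric being described: cosine similarity over raw term-frequency vectors. This is NOT Helio Text's actual algorithm (TFA doesn't describe it); it's just the textbook vector-space-model comparison, and a real system would add stemming, stop-word removal, and tf-idf weighting.

```python
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts.

    Returns a value in [0, 1]: 1.0 for identical word distributions,
    0.0 for texts sharing no terms at all.
    """
    # Crude tokenization: lowercase and split on whitespace.
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    # Dot product over the shared vocabulary.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    # Product of the two vector magnitudes.
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

A high score between two whole documents flags possible copying, which is exactly why such a metric is still syntactic: it measures word-distribution overlap, not whether the claimed result is new.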
I did RTFA. However, there is no code, no algorithm description, no indication whatsoever in TFA of exactly how their program operates. From the vague references in TFA, it appears this is nothing more than a glorified, article-plus-abstract-wide pattern matcher. Perhaps it is a little more clever and uses something akin to Google's page-ranking algorithm, applying distance metrics to textual spaces. However, that is still a form of syntactic analysis rather than contextual analysis. Barring further information on the algorithm, I can't see how your description invalidates my previous point.