Re:Sorry, that's not right (Score 2, Informative)
Pre-coffee fog. Sorry. Typing got ahead of brain. I tripped myself up by confounding the words-as-symbols/bytes-as-symbols distinction with the model's markovity.
You are correct about the order-1 assertion. That should indeed have been order-N, where N is the length of the longest prefix string maintained, explicitly or implicitly, by a Ziv-Lempel dictionary or backpointer set. The Ziv-Lempel engines can be regarded as using shortened N-grams to stand in for classes of longer, not-yet-seen N-grams; and they do use Markov models, just ones where the stationary and transition probabilities are all set equal. In that case the probabilities only matter as zero versus non-zero.
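To make the "longest prefix string" point concrete, here is a minimal LZ78-style dictionary build in Python. It is only a sketch of the general idea, not any particular compressor's implementation; the function name and the toy input are mine. The length of the longest phrase the dictionary ends up holding is the N in the order-N claim above.

def lz78_dictionary(data: bytes):
    """Return the prefix dictionary built while parsing `data`."""
    dictionary = {b"": 0}              # phrase -> index; empty prefix is entry 0
    phrase = b""
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate         # keep extending the longest known prefix
        else:
            dictionary[candidate] = len(dictionary)   # new phrase = known prefix + 1 byte
            phrase = b""
    return dictionary

d = lz78_dictionary(b"abababababab")
N = max(len(p) for p in d)             # longest prefix the model effectively conditions on
print(sorted(d, key=len), "effective order N =", N)

On that toy input the dictionary grows phrases a, b, ab, aba, ba, bab, so the effective order is 3; on real data N keeps growing as longer prefixes recur.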
A "Bayesian Spam Filter" is order-0 if it relies only on token frequencies, where the tokens are complete strings, and not conditional occurrences of word pairs. The assertion is that a spam filter mechanism would be improved if it relied on a higher-order underlying model, and if the symbols were taken to be bytes and not words. The probability of a string is thus the product of the probabilities of its symbol sequence under the order-N model. But any higher-order model, even one using within-message word digrams or trigrams, would probably be an improvement.