Bayesian spam filtering is a statistical method of detecting spam that uses Bayes’ theorem to calculate the probability that an email is spam. Many spam filters in use today, such as SpamAssassin, use Bayesian filtering.
Bayes’ theorem
In probability theory and statistics, Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule) describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, since most emails containing the word Viagra are spam, there is a high probability that any given email containing the word Viagra is spam.
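For a single word, the theorem can be written as P(spam | word) = P(word | spam) × P(spam) / P(word). The following is a minimal sketch of that calculation. The function name and the three input probabilities are illustrative assumptions, not measured values.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' theorem.

    All three inputs are assumed to be known in advance; in a real
    filter they would be estimated from training mail.
    """
    p_ham = 1.0 - p_spam
    # P(word) expands over the two possible classes (spam and non-spam).
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Hypothetical figures: the word appears in 40% of spam and 1% of
# legitimate mail, and half of all incoming mail is spam.
p = posterior_spam(0.4, 0.01, 0.5)
print(round(p, 3))  # about 0.976
```

With these assumed figures, an email containing the word would be judged very likely to be spam, even though the word appears in fewer than half of all spam messages.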
Accuracy of spam detection
The main issue with Bayesian filtering is that it requires prior data, such as words that are associated with spam or with legitimate email. This means the filter must first be trained on a large quantity of emails before it can determine whether an email is spam or not. With continual training, the accuracy of the filter improves over time.
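The training step can be sketched as a small word-counting classifier. This is a simplified illustration, not how any particular product is implemented: the class name, the tokenizer, the "spam" and "ham" labels, and the add-one smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Toy naive Bayes spam filter trained on labelled messages."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase runs of letters and digits.
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, text, label):
        """Record one message; label must be "spam" or "ham"."""
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Class prior, smoothed so an untrained filter returns 0.5.
            score = math.log(
                (self.message_counts[label] + 1) / (total_messages + 2)
            )
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Convert the two log scores into a probability of spam.
        diff = min(log_scores["ham"] - log_scores["spam"], 700.0)
        return 1.0 / (1.0 + math.exp(diff))


spam_filter = NaiveBayesFilter()
spam_filter.train("buy cheap viagra now", "spam")
spam_filter.train("cheap viagra offer", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday", "ham")

print(spam_filter.spam_probability("cheap viagra"))    # well above 0.5
print(spam_filter.spam_probability("monday meeting"))  # well below 0.5
```

Four messages are far too few for a usable filter. The point of the sketch is only that every call to train shifts the word statistics, which is why accuracy depends on how much labelled mail the filter has seen.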
Advantages
Each user trains their own filter with their own email data, so what User A considers spam may be legitimate email to User B. The risk of false positives is therefore reduced over time as each filter is fine-tuned with individual data. Most Bayesian spam filters can train themselves automatically while also incorporating input from users who mark emails as spam. This combination makes the Bayesian spam filter a powerful tool for weeding out spam.
Disadvantages
Spammers are always looking for ways to get around spam filters. Just as spam filters adapt over time, so do spammers. One technique they use is called Bayesian poisoning. By including a large number of words from a legitimate source, such as a news site, a spammer causes the Bayesian spam filter to calculate a lower probability that the email is spam. Another common method is to use alternative spellings to confuse the spam filter. For example, Viagra could become Viaagra or V!agra.
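Both techniques can be illustrated with a short sketch. The first function combines per-word spam probabilities using a common naive Bayes combination rule and shows how padding a message with innocent words lowers the result. The second is a crude, hypothetical normalizer that undoes the two alternative spellings mentioned above. The function names, the per-word probabilities, and the substitution table are all assumptions made for the example.

```python
import math
import re


def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities into one message score.

    Each value is assumed to lie strictly between 0 and 1.
    """
    spam_score = math.prod(word_probs)
    ham_score = math.prod(1.0 - p for p in word_probs)
    return spam_score / (spam_score + ham_score)


# Two strongly spammy words on their own.
original = combined_spam_probability([0.95, 0.9])

# The same words padded with four innocent-looking words, as in
# Bayesian poisoning. The 0.1 values are made up for illustration.
poisoned = combined_spam_probability([0.95, 0.9, 0.1, 0.1, 0.1, 0.1])

print(original > 0.99)  # True
print(poisoned < 0.5)   # True

# Assumed look-alike substitutions; a real filter would need a far
# larger and more careful table.
SUBSTITUTIONS = str.maketrans({"!": "i", "1": "i", "0": "o", "@": "a", "$": "s"})


def normalize_token(token):
    """Map common obfuscations back to a canonical spelling."""
    token = token.lower().translate(SUBSTITUTIONS)
    # Collapse repeated characters. This also damages legitimate
    # words such as "meeting", so it is only a rough heuristic.
    return re.sub(r"(.)\1+", r"\1", token)


print(normalize_token("Viaagra"))  # viagra
print(normalize_token("V!agra"))   # viagra
```

Normalization of this kind trades one error for another: it catches simple misspellings but can merge distinct legitimate words, so it would have to be tuned against real mail before use.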