
Spam e-mail

July 22, 2003

A public radio commentary

Right now America Online filters out 780 million unsolicited e-mail advertisements, commonly called spam. To put that in perspective: That's 100 million more than they actually deliver. Yet our in-boxes still fill with spam. Why is this spam so hard to filter?

Getting a computer to filter spam automatically is difficult because spam deals with the essence of what makes us human - and different from anything else on earth - communication by words. It is the content of the message that determines whether or not we want to read it. That's why the most common filters fail.

Existing spam filters look for specific words in the message. For example, I get offers for inkjet cartridges, so I exclude the word "inkjet". But now the spammers have started spelling the word with dashes: i-n-k-j-e-t. And what if a friend writes to me about his inkjet printer? I'd miss his e-mail.
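
To make the inkjet problem concrete, here is a minimal sketch of a word-matching filter. The function name keyword_filter and the three sample messages are made up for illustration; real filters of this kind vary, but the weakness is the same.

```python
def keyword_filter(message, banned_words):
    """Return True (block the message) if it contains any banned word."""
    text = message.lower()
    return any(word in text for word in banned_words)

# Hypothetical messages, invented for this illustration.
banned = ["inkjet"]
spam_plain = "Cheap INKJET cartridges, buy now!"
spam_dashed = "Cheap i-n-k-j-e-t cartridges, buy now!"
friend_mail = "My inkjet printer jammed again, can I borrow yours?"

blocked_plain = keyword_filter(spam_plain, banned)    # the spam is caught
blocked_dashed = keyword_filter(spam_dashed, banned)  # the dashed spelling slips through
blocked_friend = keyword_filter(friend_mail, banned)  # the friend's note is lost

print(blocked_plain, blocked_dashed, blocked_friend)
```

The filter catches the plain advertisement, misses the dashed one, and throws away the friend's e-mail: both failures described above.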

The key, then, is to assess the actual content of an e-mail - and determine whether to keep or toss out the message. There is hope of having your computer do this because of an 18th-century English minister.

Very little is known about Reverend Thomas Bayes except that he left a single scientific paper with a revolutionary explanation of how to estimate the probability of an unknown event - an event like an e-mail arriving at your mail box that you want to read, or don't want to read.

Over 200 years later, a Harvard-trained computer scientist, Paul Graham, wrote a paper called "A Plan for Spam" showing how to use Bayes's work to rid the world of spam.

Graham wrote a program to rate the "spamminess" of each word in an e-mail. If 90% of the spam, for example, has the word "viagra" in it, then viagra gets a negative rating of 90 per cent. To prevent excluding messages he wants, his program rates words that turn up in regular messages too. If his friends often e-mail about going to a movie, for example, the program takes into account the words likely to be used in such a message.
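
A rough sketch of such a word rating might look like the following. It is a simplification, not Graham's actual program: the names tokenize and word_spamminess, the sample messages, and the choice to keep every score between 0.01 and 0.99 are assumptions made for this illustration, and both piles of mail are assumed to be non-empty.

```python
import re

def tokenize(message):
    """Split a message into a set of lowercase words."""
    return set(re.findall(r"[a-z']+", message.lower()))

def word_spamminess(spam_messages, good_messages):
    """Rate each word from near 0 (seen only in wanted mail) to near 1 (seen only in spam)."""
    spam_sets = [tokenize(m) for m in spam_messages]
    good_sets = [tokenize(m) for m in good_messages]
    vocabulary = set().union(*spam_sets, *good_sets)
    scores = {}
    for word in vocabulary:
        # Fraction of each pile of mail that contains the word.
        spam_rate = sum(word in s for s in spam_sets) / len(spam_sets)
        good_rate = sum(word in g for g in good_sets) / len(good_sets)
        score = spam_rate / (spam_rate + good_rate)
        # Assumption: clamp so that no single word is ever treated as certain.
        scores[word] = min(0.99, max(0.01, score))
    return scores

# Hypothetical training mail, invented for this illustration.
spam = ["Buy viagra now", "Cheap viagra online", "Cheap inkjet cartridges"]
good = ["Want to see a movie tonight", "The movie starts at eight", "Lunch now or later"]

scores = word_spamminess(spam, good)
print(scores["viagra"], scores["movie"], scores["now"])
```

In this toy example "viagra" rates close to 1, "movie" close to 0, and "now", which turns up equally often in both piles, sits in the middle at 0.5.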

So, when Graham's program sees a new e-mail, it calculates how likely it is to be spam or desired mail, based on all the words in it. Graham's methods can keep up to 99% of spam out of an in-box, with rarely a personal message lost.
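
One common way to combine the word ratings into a single verdict is sketched below. It assumes the words count as independent pieces of evidence and that spam and wanted mail are equally likely beforehand. The function name spam_probability and the ratings in word_scores are hypothetical; the ratings are assumed to lie strictly between 0 and 1, and words with no rating are simply skipped.

```python
import math

def spam_probability(word_scores, words):
    """Combine per-word ratings into one probability that the message is spam."""
    known = [word_scores[w] for w in words if w in word_scores]
    if not known:
        return 0.5  # no evidence either way
    spam_product = math.prod(known)                   # evidence for spam
    good_product = math.prod(1.0 - p for p in known)  # evidence for wanted mail
    return spam_product / (spam_product + good_product)

# Hypothetical word ratings, invented for this illustration.
word_scores = {"viagra": 0.99, "cheap": 0.90, "movie": 0.01, "tonight": 0.05, "now": 0.5}

p_advert = spam_probability(word_scores, ["cheap", "viagra", "now"])
p_friend = spam_probability(word_scores, ["movie", "tonight", "now"])
p_unknown = spam_probability(word_scores, ["hello"])

print(p_advert, p_friend, p_unknown)
```

With these made-up ratings the advertisement scores well above 0.99 and the friend's note well below 0.01, while a message of entirely unfamiliar words lands at a neutral 0.5.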

This filtering is unlikely to be defeated because it assesses the content of a message. To evade the filter, spammers would need to change their message to be like one from your regular correspondents, and that would be one with no sales message in it.

I think Reverend Thomas Bayes would approve of this use of his mathematical work. He once wrote: "So far as Mathematics do not tend to make men more sober and rational thinkers, wiser and better men, they are only to be considered as an amusement, which ought not to take us off from serious business." Surely he would approve using his mathematical work so I can get fewer e-mails about viagra.

Copyright 2003 William S. Hammack Enterprises