Sciencetext Tips & Tricks

Blogging tips, browsing tricks and computing hacks

Spam Analysis

June 25th, 2008 · by David Bradley

Anyone who says they have never had a problem with email spam is either my Dad, who has never touched a computer in his life (bless him), or someone with staff to read their emails. Spam is ubiquitous in the online world: it is everywhere, and it is omnipresent.

If you’re using Google Mail you may not see much; the spam filters on that system are very good (at least in my experience). Moreover, if you then download your Gmail via POP3 into a desktop email client with Bayesian statistical filtering, you may see even less. Forward to your Linux-based server and employ SpamAssassin and you may well see only very rare spam emails. However, just take a look at your space-draining spam folders and you will realize that, although you may not see much spam, it’s still a problem.

Computer scientists in France think they may have come up with a new answer to finding the perfect spam filter. Writing in the International Journal of Web and Grid Services recently (2008, vol 4), they describe how they can filter spam very effectively using a process known as Kolmogorov complexity analysis. This approach works, not by analyzing the headers or the body of an incoming email, but by classifying it based on how well it can be compressed (akin to WinZip or StuffIt compression) and then comparing this compression ratio to that of previously whitelisted or blacklisted emails.
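To get a feel for what "how well it can be compressed" means, here is a minimal Python sketch. It uses the standard zlib module as a stand-in for whatever compressor the researchers used, which is not specified here; the function name compression_ratio and the two sample strings are made up for illustration.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by original size; lower means more compressible."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress, treat as incompressible
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical sample texts: one highly repetitive, one ordinary prose.
repetitive = "Buy cheap replica watches now! " * 40
varied = (
    "Hi Dad, the train gets in at half past six on Friday, "
    "so could you pick me up from the station if it is not raining?"
)

print(f"repetitive: {compression_ratio(repetitive):.2f}")
print(f"varied:     {compression_ratio(varied):.2f}")
```

The compressed size acts as a practical, computable approximation of Kolmogorov complexity, which cannot be computed exactly: the more regular the text, the smaller the ratio.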

Andrei Nikolaevich Kolmogorov (1903-1987) was a Soviet mathematician, considered one of the most pre-eminent of the twentieth century. He made major advances in probability theory, topology, intuitionistic logic, turbulence, classical mechanics and computational complexity. It is within Kolmogorov’s work on complexity that Gilles Richard and Andrei Doncescu of the University of Toulouse hope to find a solution to spam filtering, as they explain:

The main idea is to give a formal meaning to the notion of ‘information content’ and to provide a measure of this content. Using such a quantitative approach, it becomes possible to define a distance, which is a major tool for classification purposes.

The researchers have validated their approach by proceeding in two steps:

First, they used the classical compression distance over a mix of spam and legitimate emails to determine if they can be properly clustered without any supervision. This step could then show whether there is an underlying structure to spam emails that might be exploited in filtering.
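The compression distance in this first step can be sketched along the following lines. I am assuming the widely used normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size; the researchers' exact formula and compressor may differ, and zlib, the helper names and the sample messages are purely illustrative.

```python
import zlib
from itertools import combinations


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (our stand-in for C)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: small for similar texts, near 1 for unrelated ones."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    return (compressed_size(bx + by) - min(cx, cy)) / max(cx, cy)


# Hypothetical messages, invented for this example.
messages = {
    "spam_a": "Cheap replica watches, best prices, order now and save big "
              "on luxury replica watches today",
    "spam_b": "Luxury replica watches, best prices, order now and save big "
              "on cheap replica watches today",
    "ham": "Hi Dad, the train gets in at half past six on Friday, "
           "so could you pick me up from the station?",
}

# Print the pairwise distances; a clustering algorithm would take these as input.
for (name_a, text_a), (name_b, text_b) in combinations(messages.items(), 2):
    print(f"{name_a} vs {name_b}: {ncd(text_a, text_b):.2f}")
```

Messages that share a lot of text compress well together, so their distance is small. An unsupervised clustering method needs nothing more than this table of pairwise distances to group the emails.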

In the second step, they implemented a simple machine-learning system, a so-called k-nearest neighbors algorithm, which classifies each email according to how closely it resembles others in the queue. The approach requires no deep analysis of the header or body of the incoming email, as is necessary with SpamAssassin-type systems and Bayesian filtering. Instead, it works by simply measuring how much the email’s compressibility differs from that of known legitimate and spam emails.
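As a rough illustration of this second step, the sketch below labels a new message by majority vote among its k nearest labelled neighbours, using a compression distance as the only measure of similarity. It is not the researchers' implementation: the choice of zlib, the normalized compression distance, the function names and the tiny training set are all my own assumptions.

```python
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def classify(message: str, labelled: list[tuple[str, str]], k: int = 3) -> str:
    """Return the majority label among the k labelled messages closest to `message`."""
    target = message.encode("utf-8")
    nearest = sorted(
        labelled, key=lambda item: ncd(target, item[0].encode("utf-8"))
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled examples standing in for whitelisted and blacklisted mail.
training = [
    ("Replica watches at the lowest prices! Order your luxury replica watch today and save 80 percent.", "spam"),
    ("Luxury replica watches on sale! Order your replica watch today at the lowest prices and save big.", "spam"),
    ("Save 80 percent on luxury replica watches! Lowest prices, order your replica watch today.", "spam"),
    ("Hi David, are we still meeting for lunch on Thursday? Let me know which cafe suits you best.", "ham"),
    ("The quarterly report is attached; please send your comments before the end of next week.", "ham"),
    ("Mum says the garden fence blew down in the storm, can you come round and help fix it?", "ham"),
]

incoming = "Order your luxury replica watch today! Replica watches at the lowest prices, save 80 percent."
print(classify(incoming, training, k=3))
```

With a real mailbox the labelled set would be far larger, and the value of k would need tuning; an odd k avoids tied votes when there are only two labels.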

Using this approach alone, the researchers were able to filter spam with 85% accuracy. However, its real strength will lie in turning to a more powerful classification technique (Support Vector Machines, for instance) and in coupling it to another anti-spam technique, such as Bayesian analysis, Richard told me.

4 responses so far ↓

  • andrew // Jun 25, 2008 at 4:51 pm

    I read an article about a new technology called ReceiverNet from Abaca. ReceiverNet technology characterizes each protected user based on the percentage of spam they receive and then uses those reputations to rate the incoming message flow. I changed my spam filtering system to Abaca’s Email Protection Gateway and it blocked Replica watches spam mails, Subpoena Phishing mails and many more. I found that Abaca’s ReceiverNet service has 99% efficiency in blocking spam mails and they guarantee their results. For more information, log on to http://abaca.com/.

  • David Bradley // Jun 25, 2008 at 6:18 pm

    Sounds like an interesting approach that saves on all this mathematical analysis. Anyone else got a good system in place that works as well as Abaca?

  • Phil Whelan // Jun 26, 2008 at 8:14 pm

    The Abaca approach sounds interesting. 99% is quite amazing! I’m going to check it out.

    David, yes, we have an approach that uses even less mathematical analysis, using the idea that spammers are impatient. We slow down connections of unknown senders, and in doing so have found that most zombie machines sending the spam disconnect within a few seconds.

  • David Bradley // Aug 19, 2008 at 7:04 pm

    One of the most peculiar spam subject lines I saw recently read: “Heather to have other leg amputated”.

    Nasty people.
