Spam or Ham?
May 13th, 2009 by David Bradley
A new approach to spam filtering could use your web browsing habits to help your email program filter out spam and find the ham.
A computer desktop system that follows your web surfing habits and then uses this behavior to filter out spam from your email is being developed by researchers in Japan.
Taiki Takashita, Tsuyoshi Itokawa, Teruaki Kitasuka, and Masayoshi Aritsugi of the Department of Computer Science and Communication Engineering at Kumamoto University, explain how the system finds “ham” words based on the way a user browses the web and differentiates between these ham words and the “spam” words found in the user’s incoming email. “The method reduces troublesome maintenance of a spam filter,” the researchers say, which normally involves the user confirming a false negative or blacklisting a particular spam email that has not been filtered.
“Our method can detect some spam which is hard to classify correctly using the existing Bayesian statistics filters,” the team says. “We show that a combination of a Bayesian filter and our method reduces the number of false negatives.”
In 2001, just 5% of email traffic across the internet was unsolicited marketing messages, known as “spam” after the homogenized pork product made famous by the British comedy team Monty Python in a humorous song entitled “Spam, spam, spam…” Today, it is estimated that between 90 and 95% of all emails are spam or carry some form of malicious payload. Some observers suggest that email spam could have serious environmental consequences given the huge amount of computer and user time wasted on managing such a vast flux of internet traffic.
There are two main classes of anti-spam technology. Sender-side technologies operate at the earliest stage and are designed to prevent malicious users from sending spam in the first place. Given the massively distributed nature of spam sources and the existence of spam-bot networks built from compromised computers across the globe that can send millions of messages each day, this is the more difficult of the two to implement.
Thus, spam management is usually addressed using receiver-side technologies, which operate at the level of the email company (as in the case of Google Mail), at the internet service provider (ISP) level, or in the user’s email program. To provide a novel approach to spam control, Takashita and colleagues have focused on the receiver side: filtering.
There are numerous approaches available to filter spam. The simplest involves creating a blacklist of spam words. If these words are found in an incoming email, it is labeled as spam. Additional filters might look for web addresses embedded in an incoming email and assign the spam tag if there are more than a threshold number of URLs in the email or if those URLs point to blacklisted sites or are obfuscated in some way. This URL filtering approach also helps filter out fraudulent phishing messages.
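As a rough illustration of the rule-based filters just described, here is a minimal Python sketch. The word list, the blacklisted host, and the URL threshold are all made-up values for the example, and the function name `is_spam` is hypothetical; real filters use far larger lists and also check for obfuscated links, which this sketch does not attempt.

```python
import re

# Hypothetical values, chosen only for illustration.
SPAM_WORDS = {"viagra", "lottery", "winner"}
BLACKLISTED_HOSTS = {"spam.example.com"}
MAX_URLS = 3

# Captures the host part of each http(s) URL in the text.
URL_RE = re.compile(r"https?://([^/\s]+)[^\s]*", re.IGNORECASE)


def is_spam(body: str) -> bool:
    """Flag an email body using a word blacklist and simple URL rules."""
    words = set(re.findall(r"[a-z']+", body.lower()))
    if words & SPAM_WORDS:
        return True  # at least one blacklisted word is present

    hosts = [host.lower() for host in URL_RE.findall(body)]
    if len(hosts) > MAX_URLS:
        return True  # more links than the threshold allows

    # Any link pointing at a blacklisted site also triggers the spam tag.
    return any(host in BLACKLISTED_HOSTS for host in hosts)
```

A message such as "You are a WINNER" trips the word rule, while a note containing a single link to an unlisted site passes through.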
At any point, an email user can manually flag an email as spam or de-flag a ham email. Bayesian statistics has been used to augment and automate this filtering approach: the filter “learns” from the emails that have been blacklisted or whitelisted which statistical combinations of words in a new email are likely to suggest spam or ham.
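To make the Bayesian idea concrete, here is a minimal naive Bayes sketch in Python. It is a textbook version with add-one smoothing, not the filter used by the researchers or by any particular email program; the class name `NaiveBayesFilter` and its methods are hypothetical.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy naive Bayes spam filter trained on user-flagged emails."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one email that the user flagged as 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that a new email is spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_msgs = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "ham"):
            # Smoothed prior: how common this class is among flagged emails.
            log_p = math.log((self.message_counts[label] + 1) / (total_msgs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # Add-one smoothing, with one extra slot for unseen words.
                log_p += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = log_p
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

Each manual flag becomes a call to `train`, and `spam_probability` then scores new mail; a threshold such as 0.5 would turn that score into a spam or ham decision.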
One thing most email users also do with their computer is browse the web. Takashita and colleagues have used this fact to help develop a filtering algorithm that extracts a user’s preferences based on their web browsing habits and applies them to filtering out email spam in combination with conventional Bayesian email filtering. Their approach raises no privacy issues as it is done entirely client side and the browsing data is simply fed to a desktop tool. By necessity the tool would run with, or within, the web browser and email programs.
The method consists of three stages: the first stage creates a ham words list from browsed web pages and applies a statistical analysis to this list; the second stage filters received emails using the ham words list; and the third stage, which is optional, allows the user to intervene and whitelist or blacklist emails that have been flagged incorrectly.
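The three stages can be sketched in Python as follows. This is only a guess at the general shape, not the researchers’ algorithm: the article does not say which statistical analysis they apply or how the ham words list is combined with the Bayesian score. The sketch assumes that a word counts as “ham” if it appears on at least two browsed pages, and that an email sharing almost no words with that list is treated as spam even when the Bayesian filter would let it through. All function names, the stopword list, and the thresholds are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on"}


def tokenize(text):
    """Lowercase a text and split it into words, dropping stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def build_ham_words(pages, min_pages=2):
    """Stage 1: keep words seen on at least min_pages browsed pages."""
    page_freq = Counter()
    for page in pages:
        page_freq.update(set(tokenize(page)))  # count each word once per page
    return {word for word, n in page_freq.items() if n >= min_pages}


def ham_overlap(email, ham_words):
    """Fraction of the email's words that are on the ham words list."""
    tokens = tokenize(email)
    if not tokens:
        return 0.0
    return sum(token in ham_words for token in tokens) / len(tokens)


def classify(email, ham_words, bayes_spam_prob, min_overlap=0.1, spam_threshold=0.5):
    """Stage 2: combine a Bayesian score with the ham words list (assumed rule)."""
    if bayes_spam_prob >= spam_threshold:
        return "spam"
    # The Bayesian filter would pass this email; flag it anyway if it shares
    # almost nothing with the user's browsing vocabulary.
    if ham_overlap(email, ham_words) < min_overlap:
        return "spam"
    return "ham"


def apply_feedback(ham_words, email, correct_label):
    """Stage 3 (optional): adjust the list after a user correction."""
    tokens = set(tokenize(email))
    if correct_label == "ham":
        return ham_words | tokens  # whitelist: treat these words as ham
    return ham_words - tokens      # blacklist: stop trusting these words
```

Here `bayes_spam_prob` stands in for the output of whatever Bayesian filter is already in place, so the ham words check acts only as a second opinion on mail that filter would otherwise accept.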
In their preliminary tests of the approach, they managed to halve the number of false negatives in filtering several thousand emails compared with a filtering test that used Bayesian statistics alone.
It all sounds rather clever, but I can think of an immediate problem with this approach in that an interest in visiting certain niche sites mentioning particular body parts would not necessarily translate to a desire to read emails concerned with the enlargement of said body parts. Or more seriously, just because you are searching for information about a particular medical disorder does not mean you would want to receive endless marketing emails offering you drugs for that particular disease. I could think of several other examples of where my journalistic browsing habits would likely lead to almost no spam being filtered at all!
Taiki Takashita, Tsuyoshi Itokawa, Teruaki Kitasuka, & Masayoshi Aritsugi (2008). Extracting user preference from Web browsing behaviour for spam filtering. Int. J. Advanced Intelligence Paradigms, 1 (2), 126-138.