Self Organizing Maps Take Control of Your Email
August 29th, 2007 by David Bradley >> 3 Comments

Sig Fig Exclusive Research News: Most of us want to be able to categorize, organize and generally filter the vast quantities of email we receive as automatically as possible. There are various tools built into most email programs that allow some degree of filtering, but none is perfect. When you consider that a typical business email account may handle thousands if not tens of thousands of emails every week, that is an awful lot of email organization that has to be done to prevent information overload.
Now, Helmut Berger and Michael Dittenbach, both senior researchers in the iSpaces research group at the E-Commerce Competence Center (EC3) in Vienna, Austria, working with Associate Professor Dieter Merkl of the Technical University of Vienna, have reviewed the various technical solutions to data preparation for email categorization.
Text categorization can be used to identify document type and route messages into an appropriate folder or to a particular expert recipient, to attribute authorship, and to identify priority emails based on sender. It can also allow the collation and analysis of standardized responses to a survey or questionnaire, for instance, handle various other kinds of incoming message and, of course, filter out spam.
The researchers have studied various supervised and unsupervised machine learning techniques that could carry out the task, including support vector machines, decision tree learners, instance-based classifiers, naïve Bayes classification approaches and self-organizing maps, all of which can be implemented as straightforward algorithms across an email system. They used either a word-based or a character “n-gram” representation of email documents in order to assess the performance of each of these approaches.
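To give a flavour of what one of these techniques looks like in practice, here is a minimal sketch of a word-based naïve Bayes classifier in Python. It is not the researchers' implementation; the class name `TinyNaiveBayes`, the add-one smoothing and the four toy training messages are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Hypothetical multinomial naive Bayes over whitespace-split words."""

    def __init__(self):
        self.class_counts = Counter()            # messages seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()                       # every word seen in training

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        words = doc.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus log likelihood of each word, with add-one smoothing.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
model = TinyNaiveBayes().fit(
    ["cheap pills buy now", "win money now",
     "meeting agenda for monday", "project report attached"],
    ["spam", "spam", "work", "work"],
)
spam_guess = model.predict("buy cheap money")
work_guess = model.predict("monday project meeting")
```

A real mail filter would train on thousands of messages and a richer document representation; the sketch only shows the shape of the supervised approach, in which labelled examples are needed before anything can be classified.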
The “n-gram” approach should help any categorization system to handle the noisy nature of email messages, where misspellings, special characters and abbreviations are common, as is incorrect transliteration from format to format. Anyone who has ever seen dozens and dozens of strings like “=A30” and “=20” and HTML coding in between every word and at the start and end of every line of a forwarded email, or of an email from MS Outlook arriving in a differently compliant email program, will know what a headache that kind of noise can be.
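As a rough illustration of why character n-grams tolerate noise, the Python sketch below compares a word-based and a character trigram representation of a clean phrase and a misspelled one. The function names, the choice of n = 3 and the use of Jaccard overlap as a similarity measure are assumptions for this example, not details taken from the study.

```python
import re
from collections import Counter


def word_tokens(text):
    """Lowercased word tokens: a simple stand-in for a word-based representation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def char_ngrams(text, n=3):
    """Overlapping character n-grams over lowercased, whitespace-normalised text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def overlap(a, b):
    """Jaccard overlap between the distinct features of two Counters."""
    keys_a, keys_b = set(a), set(b)
    if not keys_a and not keys_b:
        return 1.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)


clean = "meeting tomorrow"
noisy = "meetting tomorow"  # two misspellings

word_score = overlap(word_tokens(clean), word_tokens(noisy))
ngram_score = overlap(char_ngrams(clean), char_ngrams(noisy))
```

With these two phrases the word-based overlap is zero, because neither misspelled word matches its correct form, while 11 of the 17 distinct trigrams are shared, so the character representation still sees the two messages as closely related.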
The key to success, they found, was in the specific analysis of email header information as part of the document representation. After all, say the researchers, besides the body content of an email, the headers contain invaluable information that might be exploited in classifying the incoming message. Surprisingly, they found that organization was affected to a much lesser degree by whether the word-based document representation was used rather than the n-gram character analysis. Perhaps categorizing based on real word analysis counters the presence of noise just as effectively as the character approach. Their main conclusion is that support vector machines (SVMs), rather than the commonly used Bayesian and other approaches, are apparently the most successful at organizing email. Unsupervised self-organizing maps lagged only a little behind the SVM approach, surprisingly perhaps, given that they need no labelled training examples from the user.
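One way to fold header information into the document representation is to tag each token with the part of the message it came from. The sketch below uses Python's standard `email` package; the sample message, the helper name `header_and_body_tokens` and the `from:`/`subject:`/`body:` prefixes are hypothetical choices for illustration and not the scheme used by the researchers.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# An invented sample message used only for this example.
RAW = """\
From: Alice Example <alice@example.org>
To: bob@example.com
Subject: Survey response Q3
Content-Type: text/plain

Please find my answers attached.
"""


def header_and_body_tokens(raw_message):
    """Return tokens prefixed by their origin so headers and body stay distinct."""
    msg = message_from_string(raw_message)
    tokens = []

    # Sender address and domain, taken from the From header.
    _, sender = parseaddr(msg.get("From", ""))
    if sender:
        tokens.append("from:" + sender.lower())
        tokens.append("from_domain:" + sender.rsplit("@", 1)[-1].lower())

    # Words from the Subject header.
    for word in re.findall(r"[a-z0-9]+", (msg.get("Subject") or "").lower()):
        tokens.append("subject:" + word)

    # Words from the body; multipart messages are skipped in this sketch.
    body = "" if msg.is_multipart() else msg.get_payload()
    for word in re.findall(r"[a-z0-9]+", body.lower()):
        tokens.append("body:" + word)

    return tokens


tokens = header_and_body_tokens(RAW)
```

Keeping the prefixes means a classifier can weigh “survey” in a subject line differently from “survey” buried in the body, and can learn from the sender's domain directly.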
That said, all six approaches tested showed at least 90% accuracy. However, with tens of thousands of emails, having up to 10% of them wrongly classified, as spam, for example, when they are not, could cause almost as big a headache as the information overload the filtering aims to tackle.
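To put that in concrete terms, here is a back-of-the-envelope calculation in Python. The weekly volume of 20,000 messages is an assumed figure chosen to match “tens of thousands”, and the function name is made up for this example.

```python
def expected_misclassified(emails_per_week, accuracy):
    """Number of messages expected to be filed wrongly at a given accuracy."""
    return round(emails_per_week * (1.0 - accuracy))


weekly_volume = 20_000  # assumed volume, not a figure from the study
errors_at_90 = expected_misclassified(weekly_volume, 0.90)
errors_at_99 = expected_misclassified(weekly_volume, 0.99)
```

At 90% accuracy that is 2,000 misfiled messages a week; even at 99% it would still be 200.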
The team reports details of their study in the International Journal of Intelligent Information and Database Systems, 2007, 1, 91-121.

Kannan.M.S. // Aug 30, 2007 at 9:47 am
A good article for self-disciplined personnel who believe in the systems they and others build.
Structured approach pays off after all.
DNA Networks // Sep 3, 2007 at 7:46 pm
I’m not sure what kind of accuracy I get with Google mail, but it has to be very high! It is rare that I find something in Spam that isn’t spam.
Google mail for your domain works great.
David Bradley // Sep 3, 2007 at 10:13 pm
I probably see about 100-200 spams a day in my main Google account and roughly 1-2 of those messages are false positives. Other people’s mileage varies…