Self-organizing maps as a control for your email
August 29, 2007 · by David Bradley

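Exclusive research news: Most of us would like to be able to categorize, organize and generally filter the vast amounts of email we receive as automatically as possible. There are various tools built into most email programs that allow some degree of filtering, but nothing is perfect. When you consider that a typical corporate email system might handle tens if not hundreds of thousands of emails each week, that is an awful lot of email organization that must be done to prevent information overload.
Now, Helmut Berger and Michael Dittenbach, two senior researchers in the iSpaces research group at the E-Commerce Competence Center (EC3) in Vienna, Austria, working with associate professor Dieter Merkl of the Vienna University of Technology, have reviewed the various technical solutions to data preparation for email categorization.
Text classification can be used to identify document type, of course, which allows email to be filtered into appropriate folders or prioritized for specific expert recipients. It can support authorship attribution and verification based on the sender, and it could allow standardized responses to a survey or questionnaire, for example, to be collated and analyzed along with various other incoming messages. It can also help filter out messages posted identically to multiple newsgroups.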
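The researchers studied various supervised and unsupervised machine learning techniques that could perform the task and that can be implemented as straightforward algorithms across email systems, including support vector machines, decision tree learners, instance-based classifiers, naïve Bayes classifiers and self-organizing maps. They used a word-based or character "n-gram" representation of the email documents to evaluate how each of these approaches performed.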
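To make the two representations concrete, here is a minimal sketch of word-based and character n-gram counting. It is not the researchers' code: the function names, the choice of n = 3 and the sample strings are assumptions made purely for illustration.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams (n=3 is an illustrative choice)."""
    normalized = " ".join(text.lower().split())  # lower-case, collapse whitespace
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))

def word_tokens(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())

# Hypothetical sample: the same phrase, typed cleanly and with two typos.
clean, noisy = "meeting tomorrow", "meetnig tomorow"

# Counter & Counter keeps the minimum count of each shared key.
word_shared = sum((word_tokens(clean) & word_tokens(noisy)).values())
ngram_shared = sum((char_ngrams(clean) & char_ngrams(noisy)).values())

print("shared words:", word_shared)
print("shared character trigrams:", ngram_shared)
```

In this toy case the misspelled phrase shares no whole words with the clean one but still shares several character trigrams, which is the intuition behind using n-grams on noisy text.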
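The "n-gram" approach should help any categorization system cope with the noisy nature of email, in which misspellings, idiosyncrasies and abbreviations are common and text does not always translate correctly from format to format. Anyone who has seen dozens of strings such as "=A30" and "=20" encoded between every word and at the start and end of every line of a forwarded email, or HTML email arriving from MS Outlook in an otherwise compliant email program, will know what kind of headache noise can be.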
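Strings of that "=XX" kind are typical of quoted-printable transfer encoding that was never decoded. As a small sketch of what such noise stands for, Python's standard `quopri` module can reverse it. The sample bytes are invented, and the ISO-8859-1 charset is an assumption; a real message declares its charset in its Content-Type header.

```python
import quopri

# Hypothetical raw fragment: "=20" encodes a space, "=A3" a pound sign in
# ISO-8859-1, and a trailing "=" before the newline is a soft line break.
raw = b"Total:=20=A330=20per=20per=\nson"

# Assumption: the text is ISO-8859-1 (Latin-1) encoded.
decoded = quopri.decodestring(raw).decode("latin-1")

print(decoded)
```

A classifier that sees the raw form treats "=20" and "=A3" as part of the words around them, which is one reason undecoded mail looks so noisy.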
The key to success, they found, lies in the specific analysis of email header information as part of the text representation. After all, the researchers say, in addition to the body content of an email, the headers contain invaluable information that might be exploited in the classification of incoming messages. Surprisingly, they found that classification was affected to a much lesser degree by whether the word-based document representation was used rather than the character n-gram analysis. Perhaps categorizing based on real word analysis counters the presence of noise just as effectively as the character approach. Their main conclusion is that support vector machines (SVMs), rather than the commonly used Bayesian and other approaches, are apparently the most successful at organizing email. Unsupervised self-organizing maps lagged only a little behind the SVM approach, surprisingly perhaps, given that no user input or training is needed.
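One way to picture header information becoming part of the document representation is to turn header fields into tokens alongside the body words. The sketch below uses Python's standard `email` package; the field list, the "field:word" prefix scheme and the sample message are all assumptions for illustration, not the scheme used in the study.

```python
from email import message_from_string

# Hypothetical message; addresses and text are made up.
RAW = """\
From: alice@example.org
To: team@example.org
Subject: Budget review

Please find the numbers attached.
"""

def header_and_body_tokens(raw, fields=("From", "To", "Subject")):
    """Return tokens tagged with their origin, e.g. 'subject:budget'."""
    msg = message_from_string(raw)
    tokens = []
    for field in fields:
        value = msg.get(field, "")  # empty string if the header is absent
        tokens += [f"{field.lower()}:{word}" for word in value.lower().split()]
    body = msg.get_payload()  # a plain string for a non-multipart message
    tokens += [f"body:{word}" for word in body.lower().split()]
    return tokens

tokens = header_and_body_tokens(RAW)
print(tokens)
```

Tagging each token with its origin lets a downstream classifier weigh a word in the subject line differently from the same word in the body.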
That said, all six approaches tested showed at least 90% accuracy. However, with tens of thousands of emails, having up to 10% wrongly classified as something they are not (spam, for example) could cause almost as big a headache as the information overload the filtering aims to tackle.
The team reports details of their study in the International Journal of Intelligent Information and Database Systems, 2007, 1, 91-121.

3 responses so far ↓
Kannan.M.S. // Aug 30, 2007 at 9:47 am
A good article for self-disciplined personnel and who believe in systems they and others build.
Structured approach pays off after all.
DNA Networks // Sep 3, 2007 at 7:46 pm
I’m not sure what kind of accuracy I get with Google mail, but it has to be very high! It is rare that I find something in Spam that isn’t spam.
Google mail for your domain works great.
David Bradley // Sep 3, 2007 at 10:13 pm
I probably see about 100-200 spams a day in my main google account and roughly 1-2 of those messages are false positives. Other people’s mileage varies…