Significant Figures
Blogging, search, and tech tips to help you

Self-organizing maps take control of your email

August 29, 2007 · by David Bradley

Organizing email

Sig Figs exclusive research news: Most of us would like to be able to categorize, filter, and generally organize the vast quantities of email we receive as automatically as possible. Most email programs have various built-in tools that allow some degree of filtering, but none is perfect. When you consider that a typical business email account might handle thousands, if not tens of thousands, of emails every week, that is an awful lot of email organization that has to be done to prevent information overload.

Now, Helmut Berger and Michael Dittenbach, both senior researchers in the iSpaces research group at the E-Commerce Competence Center (EC3), Vienna, Austria, working with Dieter Merkl, an associate professor at the Vienna University of Technology, have reviewed the various technical solutions to data preparation for email classification.

Text classification tasks can be used to identify document genres and for authorship attribution and identification. They can allow the collation and analysis of standardized responses to a survey or questionnaire, for example, and of various other incoming messages. And, of course, they can filter out spam messages or route priority email to an appropriate folder, or to a particular expert recipient, based on the sender.

The researchers studied various supervised and unsupervised machine learning methods that could carry out the task, including support vector machines, decision tree learners, instance-based classifiers, naïve Bayes classification approaches, and self-organizing maps, all of which can be implemented as straightforward algorithms across an email system. They used word-based or character "n-gram" representations of email documents to assess the performance of each of these approaches.

The "n-gram" approach should help any classification system deal with the noisy nature of typical email messages, with their typos, stray characters, and abbreviations, as well as inaccurate transliteration from format to format. Anyone who has ever seen an email sent from MS Outlook arrive in an otherwise compliant email program with dozens and dozens of strings like "=A30" and "=20" and html code in the middle of each word and at the start and end of each line will know what a headache that kind of noise can be.

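To make the noise problem concrete, here is a minimal Python sketch. It first undoes quoted-printable residue such as "=20" with the standard-library quopri module, then compares how much of a correctly spelled phrase survives typos under a word-based representation and under character trigrams. The sample strings, the function names, and the choice of n = 3 are illustrative assumptions, not details taken from the study.

```python
import quopri
import re
from collections import Counter


def decode_quoted_printable(raw_bytes):
    """Undo quoted-printable escapes such as '=20' (space) and soft line breaks."""
    return quopri.decodestring(raw_bytes).decode("latin-1")


def word_features(text):
    """Count lowercase word tokens (a simple word-based representation)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def char_ngram_features(text, n=3):
    """Count overlapping character n-grams after collapsing whitespace."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def shared_fraction(reference, other):
    """Fraction of the distinct features in `reference` that also occur in `other`."""
    if not reference:
        return 0.0
    return len(set(reference) & set(other)) / len(set(reference))


# Hypothetical fragment with a soft line break and an encoded space.
raw = b"Meeting tomor=\nrow at=20noon"
decoded = decode_quoted_printable(raw)

# Hypothetical clean and misspelled versions of the same phrase.
clean = "meeting tomorrow"
noisy = "meetting tommorow"

word_overlap = shared_fraction(word_features(clean), word_features(noisy))
ngram_overlap = shared_fraction(char_ngram_features(clean), char_ngram_features(noisy))
```

With these made-up strings, neither misspelled word matches its clean form, so the word overlap is zero, while 10 of the 14 distinct trigrams in the clean phrase (about 71%) still appear in the noisy one. That is the intuition behind expecting n-grams to tolerate noise. The study itself reportedly found the choice of representation mattered less than expected.
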
The key to success, they found, lay in specific analysis of the email header information as part of the document representation. After all, say the researchers, in addition to the body content of an email, the headers contain invaluable information that might be used in the classification of incoming messages.

Surprisingly, they found that organization was affected to a much lesser degree by whether the word-based document representation or the character n-gram representation was used. Perhaps categorizing based on real-word analysis counters the presence of noise just as effectively as the character approach. Their main conclusion is that support vector machines (SVMs), rather than the commonly used Bayesian and other approaches, are apparently the most successful at organizing email. Unsupervised self-organizing maps lagged only a little behind the SVM approach, surprisingly perhaps, given that no user input or supervised training is needed.
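As a rough illustration of treating header information as part of the document representation, the following Python sketch turns a message into a single list of tokens in which header-derived features carry a prefix. The sample message, the function name, and the prefix scheme ("from:", "from_domain:", "subject:") are hypothetical; the article does not describe how the researchers actually encoded headers.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical message used only for illustration.
RAW = """\
From: Alice Example <alice@example.org>
To: team@example.org
Subject: Quarterly budget review

Please find the budget figures attached.
"""


def header_and_body_tokens(raw_message):
    """Return prefixed header tokens followed by plain body tokens."""
    msg = message_from_string(raw_message)
    _, sender = parseaddr(msg.get("From", ""))
    tokens = []
    if sender:
        sender = sender.lower()
        tokens.append("from:" + sender)
        # Everything after the last '@' is treated as the sender's domain.
        tokens.append("from_domain:" + sender.rpartition("@")[2])
    tokens += ["subject:" + word for word in msg.get("Subject", "").lower().split()]
    # For a simple, non-multipart message the payload is the body text.
    tokens += msg.get_payload().lower().split()
    return tokens


tokens = header_and_body_tokens(RAW)
```

Prefixing keeps a word in the subject line distinct from the same word in the body, so a classifier can weight them differently. Whether that helps in practice would depend on the data.
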

That said, all six approaches tested showed at least 90% accuracy. However, with tens of thousands of emails, up to 10% of messages wrongly classified, whether as false positives or false negatives (flagged as spam when they are not, for example), could cause almost as big a headache as the information overload the filtering aims to tackle.
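To put that in numbers, here is a small Python calculation of how many messages would be misclassified at a given accuracy. The message volumes are example values chosen for illustration, not figures from the study.

```python
def expected_misclassified(n_messages, accuracy):
    """Expected number of wrongly classified messages at a given accuracy (0 to 1)."""
    return round(n_messages * (1 - accuracy))


# Example weekly volumes (assumed) at the 90% accuracy floor quoted above.
at_ten_thousand = expected_misclassified(10_000, 0.90)
at_fifty_thousand = expected_misclassified(50_000, 0.90)
```

At 90% accuracy, 10,000 messages a week would leave about 1,000 to check by hand, and 50,000 would leave about 5,000.
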

The team reports details of their study in the International Journal of Intelligent Information and Database Systems, 2007, 1, 91-121.

3 responses so far ↓

  • Kannan.M.S. // Aug 30, 2007 at 9:47 am

    A good article for self-disciplined people who believe in the systems they and others build.
    Structured approach pays off after all.

  • DNA Networks // Sep 3, 2007 at 7:46 pm

    I'm not sure what kind of accuracy I get with Google mail, but it has to be very high! It is rare that I find something in Spam that isn't spam.

    Google mail for your domain works great.

  • David Bradley // Sep 3, 2007 at 10:13 pm

    I probably see about 100-200 spams a day in my main Google account and roughly 1-2 of those messages are false positives. Other people's mileage varies…
