namics Weblog
Personal voices and opinions of our employees.

Spam Experiment - Guilty, until proven innocent - really!
Posted on 11.02.2007 at 01:07

A week ago, I had a little schnapps idea: maybe it would be better to consider email Spam until proven otherwise. Perhaps we shouldn't concentrate on identifying Spam, but rather on identifying real mail.

As it happens, my private email a) gets a lot of Spam, because it receives messages addressed to several different accounts which are published on the net, and b) is delivered to two separate computers, each running its own copy of the latest version of Thunderbird. This makes it possible to compare the traditional "Mark Spam" method with the "Mark Good" method. The tool is the same; the only difference is which mail gets put into the filter.

Mark Good is really simple: if the email is one that I want, I click on the "Junk" button. Thunderbird marks it as junk and deletes it. It also learns to recognize this kind of message and automatically deletes similar ones in the future. I then recover the good messages from the Trash and really delete everything left in my Inbox.

In some ways, the results of the two approaches are quite similar. In others, the differences are dramatic, and given a choice, I much prefer the "Mark Good" approach.

As I have not used my home computer in a while (that it runs at 500 MHz and my notebook at 2 GHz might explain the infrequent usage), there were 2310 messages waiting for download yesterday. So, using Thunderbird's Bayesian Spam filter in its current training state, I downloaded all the messages and let the filter do its thing. Per thousand messages, there were 124 "really good" messages (i.e. those that I, using my human judgement, care about). Of those, 60, nearly half, were classified as bad (Spam), and 19 Spam messages were left in the Inbox. All in all, 79 of 1000 messages were put in the wrong category.

A week ago, I started using the Mark Good approach on my laptop. Before starting, I reset the learning database and used a day's worth of messages to train the Spam filter by identifying good email as Spam in the UI (good mail gets put in the Junk folder, everything else stays in the Inbox). During the 6 1/2 days since, I have received 935 messages. Normalizing again, 102 per 1000 were "really good". Of those, 98 were recognized properly and only 4 were left together with the Spam. Another 59 Spam messages were filed with the good mail, so 63 per 1000 were misclassified.

I'm not sure if the difference between 6.3% and 7.9% is really that significant, but what is dramatic is how the two approaches handle their mistakes differently.
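One rough way to check whether 6.3% versus 7.9% is more than noise is a pooled two-proportion z-test. The sketch below is only an illustration, not part of the original experiment: it assumes each message is classified independently, and it uses the per-1000 error rates together with the message totals of the two runs (935 and 2310). The function name `two_proportion_z` is made up for this example.

```python
import math


def two_proportion_z(rate1, n1, rate2, n2):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    # Pooled error rate across both runs.
    pooled = (rate1 * n1 + rate2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (rate1 - rate2) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Error rates from the post: 63 and 79 wrongly filtered per 1000 messages.
z, p_value = two_proportion_z(0.063, 935, 0.079, 2310)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Under these assumptions |z| comes out at roughly 1.6, below the usual 1.96 threshold for the 5% level, which fits the suspicion that the difference in overall error rate is not clearly significant.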

Mark Good put 59 Spam mails together with the good ones. Mark Spam put 60 good emails together with the Spam. Which is preferable? To my eyes, no contest: it is much easier to sort 59 Spams out of a folder of 156 messages than to find 60 Goods among 936 messages classified as Spam.
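The difference in clean-up effort can be made concrete with a little arithmetic. This is a minimal sketch using only the per-1000 figures from the post; the `Run` class and its field names are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    folder_size: int  # messages in the folder that has to be checked by hand (per 1000)
    misfiled: int     # messages in that folder that do not belong there

    def misfiled_share(self) -> float:
        """Fraction of the checked folder that is misfiled."""
        return self.misfiled / self.folder_size

    def scanned_per_hit(self) -> float:
        """Messages to look at, on average, per misfiled message found."""
        return self.folder_size / self.misfiled


# Mark Good: 59 Spam messages sit among the 156 considered good.
mark_good = Run("Mark Good", folder_size=156, misfiled=59)
# Mark Spam: 60 good messages sit among the 936 considered Spam.
mark_spam = Run("Mark Spam", folder_size=936, misfiled=60)

for run in (mark_good, mark_spam):
    print(f"{run.name}: {run.misfiled_share():.1%} misfiled, "
          f"{run.scanned_per_hit():.1f} messages scanned per misfiled one")
```

With Mark Good, the folder to check is small and contains mail that gets read anyway; with Mark Spam, the same number of mistakes is spread across a folder six times as large.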

Only one thing troubles me about the Mark Good approach: what about the 4 in 1000 real mails that get classified among the Spam? Would I really make the effort to go find these mails? Or would they just get lost in the Spam?

                      Experiment     Control
                      "Mark Good"    "Mark Spam"
Messages processed    935            2310

Per 1000 messages:    1000           1000
Really Good           102            124
Considered Good       156            64
Considered Spam       844            936
False Good            59             19
False Spam            4              60
Wrongly filtered      63             79

What's next? Well, it would be great if someone would verify these findings. Even better if someone from the Thunderbird group would look into this, and maybe build this approach into a beta version?



COMMENTS

Even though I do not use Thunderbird, I find your experiment and the outcomes very interesting. Having used GMail for quite a while, I can confirm that because of the quality of the spam filter I do not check my spam for false detections anymore. I only had some newsletters classified as spam in the beginning. The risk of losing important information is just too low compared to the time it takes to scan through hundreds of spams every day.

Posted by Andi on 11.02.07 12:55
