Unofficial empeg BBS

#125195 - 08/11/2002 05:17 Bayesian email filters
andy
carpal tunnel

Registered: 10/06/1999
Posts: 5914
Loc: Wivenhoe, Essex, UK
Has anyone had any experience with the various Bayesian email filters that are around? I am thinking of implementing one of them on my Linux mail server. Helpfully, I have 5+ years of spam/non-spam messages to feed it (though I probably need to clean up the data a bit; I think there is still some spam lurking in the non-spam folders).

The filter that I am leaning towards at the moment is this one: http://www.lbreyer.com/unix.html, but I haven't tried any of the filters yet.
_________________________
Remind me to change my signature to something more interesting someday

#125196 - 17/11/2002 17:26 Re: Bayesian email filters [Re: andy]
andy
carpal tunnel

Registered: 10/06/1999
Posts: 5914
Loc: Wivenhoe, Essex, UK
Time for an update I guess...

I've experimented with two systems so far.

The first one that I tried was CRM114 (http://sourceforge.net/projects/crm114), which has the important benefit that it comes with pre-populated spam/non-spam databases, meaning you don't need to train it with your own data before it starts becoming useful.

It's not just its name that makes CRM114 very odd (it is named after a comms system in Dr. Strangelove). It has a bizarre language for its configuration that makes obscure Perl look normal...

CRM114 was very effective at spotting spam. In the 12 hours or so that I used it, only a handful of the hundreds of spam messages I get a day slipped through (and that was without me training it at all with my own archive of messages). More importantly, it didn't falsely identify any real messages as spam.

I had to stop using it though, because it has one major flaw as it comes out of the box. The problem is that it scans all the data in every attachment in every message. I happened to be sending myself a couple of messages with 1 MB+ attachments when I realised that CRM114 was spending five minutes trying to scan these large messages before procmail timed it out and delivered the message anyway. In the process it was holding up the mail queue for that user.

It is probably possible to tweak CRM114 to ignore huge attachments, but I have no desire to add another very obscure language to my existing list of obscure languages just so that I can filter my email...
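For anyone wanting to dodge the large-attachment problem without touching the filter itself, here is a minimal sketch of a size guard that could sit in front of any classifier. It is not CRM114 configuration; the function name, the 256 KB cut-off and the "unscanned" label are all assumptions made up for illustration.

```python
from typing import Callable

# Hypothetical cut-off: messages larger than this are delivered unscanned.
MAX_SCAN_BYTES = 256 * 1024


def classify_with_size_limit(
    raw_message: bytes,
    classify: Callable[[bytes], str],
    limit: int = MAX_SCAN_BYTES,
) -> str:
    """Run `classify` only on messages small enough to scan quickly."""
    if len(raw_message) > limit:
        # Too big: skip the filter so the mail queue is not held up.
        return "unscanned"
    return classify(raw_message)
```

procmail can make a similar decision on its own, since a recipe condition such as `* < 256000` only matches messages below that size.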

So I moved on to bogofilter (http://bogofilter.sourceforge.net/).

Bogofilter does not come with any pre-filled databases, so the first thing I did was feed it with my archive of spam/non-spam messages. Then I fed it some messages to filter.

The good news is that bogofilter doesn't struggle with large messages, because it ignores encoded attachments (on the basis that spam messages will get caught by the data in the headers/any plain text anyway).
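To illustrate the idea of looking only at headers and plain text, here is a rough Python sketch using the standard `email` module. It is not bogofilter's actual code; `scannable_text` is a hypothetical helper, and treating every part with a filename as an attachment is an assumption.

```python
from email import message_from_bytes


def scannable_text(raw_message: bytes) -> str:
    """Return the headers plus inline text parts, skipping attachments."""
    msg = message_from_bytes(raw_message)
    # Top-level headers, one per line.
    chunks = [f"{name}: {value}" for name, value in msg.items()]
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skips multipart containers and binary attachments
        if part.get_filename():
            continue  # assumption: a named text part is an attachment
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset name: fall back to a codec that never fails.
            chunks.append(payload.decode("latin-1"))
    return "\n".join(chunks)
```

The encoded attachment data never reaches the classifier, so the amount of text to scan stays small however big the message is.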

When it started out it wasn't as accurate as CRM114; about 10% of my messages were getting classified wrongly. But I kept training it, telling it which messages it had classified wrongly, and it quickly got better.
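For anyone curious why correcting mistakes helps so quickly, here is a toy naive Bayes filter in Python. It is only a sketch of the general technique, not how bogofilter works internally; the class name, the tokenizer and the add-one smoothing are all my own assumptions.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy word-presence naive Bayes spam filter (illustrative only)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.msgs = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per message.
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def train(self, text, label):
        """Record a message under its correct label ("spam" or "ham")."""
        self.words[label].update(self.tokens(text))
        self.msgs[label] += 1

    def spam_probability(self, text):
        # Work in log-odds, with add-one smoothing for unseen words.
        log_odds = math.log((self.msgs["spam"] + 1) / (self.msgs["ham"] + 1))
        for word in self.tokens(text):
            p_spam = (self.words["spam"][word] + 1) / (self.msgs["spam"] + 2)
            p_ham = (self.words["ham"][word] + 1) / (self.msgs["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so the exponential below cannot overflow.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

Correcting a mistake is then just a call to `train` with the right label: the words in the misclassified message shift towards that label, so similar messages score differently next time.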

After a couple of days it is now only letting about 5% of my spam slip through, and it hasn't put a real message in the spam bin in the last two days. Hopefully it will improve as I continue to train it, but it is looking very promising already. It is certainly working far better than SpamAssassin did when I tried it; I was getting far too many false positives with it.

I didn't try dbacl in the end, I can't remember why, so I'll have to give it a go at some point.

P.S. I'm not running this directly on my normal email at the moment; I have an alias set up to forward all my email to another account that I can experiment with. I won't start using it for real until I have run with it for a few weeks and am comfortable that it is not throwing away real messages, and even then I'll check the spam bin before hitting delete...

#125197 - 07/01/2003 16:38 Re: Bayesian email filters [Re: andy]
tonyc
carpal tunnel

Registered: 27/06/1999
Posts: 7058
Loc: Pittsburgh, PA
Found a little gem... For those of us using Windows, it looks like a quasi-Bayesian mail filtering proxy has arrived in the form of POPFile. Looks good, haven't had a chance to train it yet, but it works with Eudora, and with my multiple accounts. With some hacking of Eudora's ini file, I even got it so that I can check my Yahoo email by proxying through POPFile, which then proxies through YahooPOPS, which checks my Yahoo web mail. How sick is that???
_________________________
- Tony C
my empeg stuff

#125198 - 07/01/2003 17:07 Re: Bayesian email filters [Re: tonyc]
tonyc
carpal tunnel

Registered: 27/06/1999
Posts: 7058
Loc: Pittsburgh, PA
Hahaha! With even more trial and error I've now got Eudora checking my five traditional POP accounts, my Yahoo! mail (through YahooPOPS) and my Hotmail through HotPOP3. All proxied through POPFile.

The Yahoo! and Hotmail accounts should allow me to train my spam filters really quick!!!


Edited by yn0t_ (07/01/2003 17:08)

#125199 - 08/01/2003 07:34 Re: Bayesian email filters [Re: andy]
tonyc
carpal tunnel

Registered: 27/06/1999
Posts: 7058
Loc: Pittsburgh, PA
So after only one evening of spam training, it already is doing a respectable job. Last night out of 12 emails I had one false negative and one false positive. I'm expecting the results will get better as it gets more data.

One confusing thing in my original post: I said "for those of us using Windows." This program is actually written in straight Perl, and should therefore run on anything that runs Perl. I just mentioned Windows because there didn't look to be many solutions for Windows users, short of switching POP clients. The thing I like about this is that it works with your existing POP clients, and flags the messages before your POP client downloads them (for easy filtering by the POP client). It's got a really slick HTTP-based configuration interface as well.

The things keeping me from going with one of these spam programs before were the need to switch clients and the difficulty of configuring them, but this program really has the right approach. Y'all should check it out if you're sick of spam. Oh yeah, it's free (but the author accepts donations).
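To give a feel for what "flagging the messages" could look like, here is a rough Python sketch of the tagging step a filtering proxy might apply to each message before handing it to the client. This is not POPFile's code (POPFile is Perl); the `X-Filter-Bucket` header name, the `[spam]` subject prefix and the `tag_message` function are hypothetical.

```python
from email import message_from_bytes


def tag_message(raw_message: bytes, bucket: str) -> bytes:
    """Mark a message with its classification so a mail client can filter on it."""
    msg = message_from_bytes(raw_message)
    # Hypothetical header name; a real proxy defines its own.
    msg["X-Filter-Bucket"] = bucket
    if bucket == "spam":
        subject = msg.get("Subject", "")
        # Assigning a header appends rather than replaces, so delete first.
        del msg["Subject"]
        msg["Subject"] = f"[spam] {subject}"
    return msg.as_bytes()
```

A client-side rule in Eudora (or any other POP client) can then match on the header or the subject prefix, which is why no change of client is needed.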
