Friday, January 25, 2013

Improving Bayesian Filters

Last time we looked at a Simple Introduction to Naive Bayesian Filters (or classifiers), and saw how they work using the example of a box of chocolates. It turns out there's a simple trick you can use to make Bayesian filters more effective. I haven't seen it applied before, but it seems like an obvious technique in certain situations, and I found that it improves the efficacy of filtering by up to 10% in my test cases.

Bayesian filters are used for things like spam filters, where they look for certain words and phrases and rate the "spamminess" of emails. The presence of certain words tips off the spam filter and sends spam to your junk folder. When you're filtering spam, you don't really care about the non-spammy words that an email contains, and you don't care about words that are missing. There are over 600,000 words in the English language. Most of them will not be in an email. You simply don't care about the words that aren't there, if you're filtering email messages for spam. Applications like spam filtering care specifically about the presence of tokens (words) with a high probability of spamminess.

There are other use-cases, though, where you might care not only about tokens that are present, but also about tokens that are missing. An elephant, for example, always has a trunk. Big fat and gray you may be, but no trunk? Not an elephant. To make use of this information, we calculate the probabilities of a token not being present -- which is just the inverse of the probability for the token being present.

Going back to the box of chocolates example, suppose that every chocolate with nuts has fluted edges, always. No fluted edges, no nuts. In this case, we don't just care about the characteristics that are present (wrapper, no wrapper, etc). We care about the characteristics that are missing, too (like no fluted edges). This techniques works best when the total number of characteristics (aka the "vocabulary") is small enough to consider the presence and the absence of all possible tokens -- a few hundred or maybe a few thousand words.

The dclassify module for node is a simple implementation of a Bayesian filter that uses this "apply inverse" trick. In testing, using the apply-inverse option has improved results by 5 ~ 10% over conventional Bayesian filtering, when working with a vocabulary of approximately 20,000 unique tokens and a training data set of around 4000 documents.

No comments: