Friday, January 25, 2013

Improving Bayesian Filters

Last time we looked at a Simple Introduction to Naive Bayesian Filters (or classifiers), and saw how they work using the example of a box of chocolates. It turns out there's a simple trick you can use to make Bayesian filters more effective. I haven't seen it applied before, but it seems like an obvious technique in certain situations, and I found that it improves the efficacy of filtering by up to 10% in my test cases.

Bayesian filters are used for things like spam filters, where they look for certain words and phrases and rate the "spamminess" of emails. The presence of certain words tips off the spam filter and sends spam to your junk folder. When you're filtering spam, you don't really care about the non-spammy words that an email contains, and you don't care about words that are missing. There are over 600,000 words in the English language. Most of them will not be in an email. You simply don't care about the words that aren't there, if you're filtering email messages for spam. Applications like spam filtering care specifically about the presence of tokens (words) with a high probability of spamminess.

There are other use-cases, though, where you might care not only about tokens that are present, but also about tokens that are missing. An elephant, for example, always has a trunk. Big fat and gray you may be, but no trunk? Not an elephant. To make use of this information, we calculate the probabilities of a token not being present -- which is just the inverse of the probability for the token being present.

Going back to the box of chocolates example, suppose that every chocolate with nuts has fluted edges, always. No fluted edges, no nuts. In this case, we don't just care about the characteristics that are present (wrapper, no wrapper, etc). We care about the characteristics that are missing, too (like no fluted edges). This techniques works best when the total number of characteristics (aka the "vocabulary") is small enough to consider the presence and the absence of all possible tokens -- a few hundred or maybe a few thousand words.

The dclassify module for node is a simple implementation of a Bayesian filter that uses this "apply inverse" trick. In testing, using the apply-inverse option has improved results by 5 ~ 10% over conventional Bayesian filtering, when working with a vocabulary of approximately 20,000 unique tokens and a training data set of around 4000 documents.

Monday, January 21, 2013

A Simple Introduction to Naive Bayesian Filters

In the field of data analysis and machine learning algorithms, naive Bayesian filters remain popular since they're relatively easy to implement yet surprisingly effective.  Most introductions to Bayesian filtering dive straight into mathematics, but Bayesian filters are simple enough that you can intuitively see how they work by stepping through an example.

Let's say we've got a box of chocolates, and some of them have nuts inside, and we don't like nuts. We'd like to be able to tell which chocolates have nuts in them.  The chocolates are all different shapes, sizes and colors; some are wrapped and some aren't. The only way to tell if a chocolate contains nuts is to take a bite and see. By sampling the box of chocolates and keeping track of the shape, size, color and wrapping of all the ones we tried, we could come up with a pretty good idea of which ones are most likely to have nuts.

You could make a list of characteristics: size, color, shape, and wrapping. Then start eating chocolates. Every time you find one with nuts, you put an 'X' beside each of the matching characteristics (large, round, dark, etc). Every time you got a chocolate without nuts, put an "O" beside it's matching characteristics.  By the time you got through half the box, the list might look something like this:

Small:       OOOOO
Medium:      XXOOO
Large:       XXXOO
---
Round:       XXXOO
Square:      XXOOO
Long:        OOOOO
---
Dark:        XXXXX
Light:       OOOOO
White:       OOOOO
---
Wrapping:    XOOO
No Wrapping: XXXX OOOOOOO

The X's tell you where the nuts were.  You can see that large, round, dark chocolates with no wrapping tend to contain nuts.  For each of the characteristics, we can count up the X's and come up with a score for how often they contained nuts. The scores would look something like this:

Small:       0/5 (0%)
Medium:      2/5 (40%)
Large:       3/5 (60%)
---
Round:       3/5 (60%)
Square:      2/5 (40%)
Long:        0/5 (0%)
---
Dark:        5/5 (100%)
Light:       0/5 (0%)
White:       0/5 (0%)
---
Wrapping:    1/4 (25%)
No Wrapping: 4/11 (~36%)

You can see that none of the small chocolates had nuts in them.  At least, not the ones we tasted. But that doesn't necessarily mean small chocolates never contain nuts. Look at the dark chocolates, for example. All the dark chocolates had nuts. So, how about a small, dark chocolate? What are the chances it has nuts? 0%? 100%? Or somewhere in between? What if its round? What if it had a wrapper?

Looking at our chart for small, round, dark chocolates with a wrapper, we see 0%, 60%, 100%, and 25%. How do we take this set of probabilities and turn it into a final score?  As you might guess, when dealing with probabilities, we just multiply them together.  The "0" value presents a problem, though. No matter how high the other values are, if there's a zero anywhere in the list, when we multiply it out, the whole result just goes straight to zero. So we fudge the numbers a bit. Instead of zero, we use a really small positive number (i.e. 0.01).

You've probably noticed that the more probabilities you multiply together, the smaller the final result becomes. This leads to weird-looking results. Even if a chocolate has all the hallmarks of nuts (large, round, dark and unwrapped) the final score comes out to only 0.13 (0.6 * 0.6 * 1 * 0.36 = 0.13).  That doesn't sound right at all! Intuitively, we're almost certain it contains nuts.  It ought to have a really high probability, right? Are we really saying it's only got a 13% chance of containing nuts? Not really. If we calculate a final score for this same chocolate not containing nuts (by multiplying the inverses) it's only 0.1% (one tenth of one percent)!  We get this value by looking up the probabilities for nuts and subtracting from 1, to get the probability of not having nuts.

Here's our table of probabilities again, this time with an extra column for "No Nuts".

             P(Nuts)     P(No Nuts)
Small:       0/5 (0%)    5/5 (100%)
Medium:      2/5 (40%)   3/5 (60%)
Large:       3/5 (60%)   2/5 (40%)
---
Round:       3/5 (60%)   2/5 (40%)
Square:      2/5 (40%)   3/5 (60%)
Long:        0/5 (0%)    5/5 (100%)
---
Dark:        5/5 (100%)  0/5 (0%)
Light:       0/5 (0%)    5/5 (100%)
White:       0/5 (0%)    5/5 (100%)
---
Wrapping:    1/4 (25%)   3/4 (75%)
No Wrapping: 4/11(36%)   7/11(64%)

The probabilities for "no nuts" give us 0.4 * 0.4 * 0.01 * 0.64 = 0.001, or 0.1%.  Comparing the odds of 13 to 0.1, it suddenly seems a lot more likely that this chocolate contains nuts, than not.

Instead of looking at a final value like 13% or 0.1%, we could ask "how much more likely is it that this particular chocolate contains nuts, than not?" In this case, the chocolate is 130 times more likely to contain nuts, than not.

At the end of the day, what we really want is a way to pick any chocolate and automatically label it "nuts" or "no nuts".  If we get this correct most of the time, we'll be happy.  The procedure is simple: pick a chocolate, look at it's characteristics, and calculate a value for "nuts" vs. "no nuts". Whichever value is higher, that's how we label it. That, basically, is a naive Bayesian binary classifier.

Kenka Matsuri

Participants in the Kenka Masturi (Fight Festival) carrying a shrine, Himeji, Japan


© 2012 Darren DeRidder

Translate