I think there’s a rule somewhere that says “You can’t call yourself a data scientist until you’ve used a Naive Bayes classifier”. It’s extremely useful, yet beautifully simplistic. This article is my attempt at laying the groundwork for Naive Bayes in a practical and intuitive fashion.

Let’s start with a problem to motivate our formulation of Naive Bayes. (Feel free to follow along using the Python script or R script found here.)

Suppose we own a professional networking site similar to LinkedIn. Users sign up, type some information about themselves, and then roam the network looking for jobs/connections/etc. Until recently, we only required users to enter their current job title, but now we’re asking them what industry they work in. New users are supplying this info as they sign up, but old users aren’t bothering to update their information. So, we need to build a text classification model to do it for them. Our data looks like this:

job_title | job_category |
---|---|
underwriter manager | finance |
mortgage data analyst | finance |
junior underwriter | finance |
sales manager | sales |
junior medical sales associate | sales |
senior sales manager | sales |
sales manager | sales |
senior data analyst | technology |
data analyst | technology |
data manager | technology |
data analyst manager | NA |
junior data analyst | NA |
Our goal is to estimate the probability that each of those two unlabeled job titles should be categorized as Technology, Sales, or Finance, at which point we can make our best guess for the user. Formalizing this a bit, we want to find \(p(C_k \vert \textrm{job\textunderscore title})\) where \(C_1\), \(C_2\), and \(C_3\) are the classes Technology, Sales, and Finance. (Note: this type of problem is called Document Classification.)

How about that first unlabeled title, “data analyst manager”? We should probably label it as Technology, but how do we train a computer to figure that out? If we had trillions of training samples we might be able to estimate \(p(C_k \vert \textrm{job\textunderscore title}=\textrm{“data analyst manager”})\) empirically (i.e. by measuring the relative frequency of each class for samples where job_title = “data analyst manager”). Unfortunately we only have 10 training samples (none of which have the title “data analyst manager”) so we’ll have to be a little more creative in our approach.

The word “data” seems pretty important. It occurs in all of the Technology samples, none of the Sales samples, and only one of the Finance samples. On the other hand, the word “manager” appears in every single category, so it’s probably not as useful. The big takeaway here is that we can use word occurrences to build a probabilistic model. Let’s start tracking words then…

### Finance

job_title | analyst | associate | data | junior | manager | medical | mortgage | sales | senior | underwriter |
---|---|---|---|---|---|---|---|---|---|---|
underwriter manager | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
mortgage data analyst | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
junior underwriter | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

### Sales

job_title | analyst | associate | data | junior | manager | medical | mortgage | sales | senior | underwriter |
---|---|---|---|---|---|---|---|---|---|---|
sales manager | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
junior medical sales associate | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
senior sales manager | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
sales manager | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |

### Technology

job_title | analyst | associate | data | junior | manager | medical | mortgage | sales | senior | underwriter |
---|---|---|---|---|---|---|---|---|---|---|
senior data analyst | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
data analyst | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
data manager | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

### Unlabeled

job_title | analyst | associate | data | junior | manager | medical | mortgage | sales | senior | underwriter |
---|---|---|---|---|---|---|---|---|---|---|
data analyst manager | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
junior data analyst | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |

Updating our model a bit, we want to find \(p(C_k \vert \mathbf{x})\) where \(\mathbf{x}\) is our feature vector of word occurrences. In the case of “data analyst manager”, \(\mathbf{x} = (x_1, x_2, \ldots, x_{10}) = (1,0,1,0,1,0,0,0,0,0)\).
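The word-tracking tables above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the `vocab` and `vectorize` names are mine, not from the article’s companion script (which, judging by the comments below, uses scikit-learn’s CountVectorizer to the same effect):

```python
titles = [
    "underwriter manager", "mortgage data analyst", "junior underwriter",  # finance
    "sales manager", "junior medical sales associate",
    "senior sales manager", "sales manager",                               # sales
    "senior data analyst", "data analyst", "data manager",                 # technology
]

# vocabulary: every distinct word, sorted for a stable column order
vocab = sorted({w for t in titles for w in t.split()})

def vectorize(title):
    """Binary feature vector: 1 if the vocabulary word occurs in the title."""
    words = set(title.split())
    return [1 if w in words else 0 for w in vocab]

print(vocab)
print(vectorize("data analyst manager"))  # [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
```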

Before we continue, let’s review some probability rules and Bayes’ Theorem:

\(P(A\mid B) = \frac{P(A \cap B)}{P(B)}, \text{ if } P(B) \neq 0\)

\(P(B\mid A) = \frac{P(A \cap B)}{P(A)}, \text{ if } P(A) \neq 0\)

\(\Rightarrow P(A \cap B) = P(A\mid B)\, P(B) = P(B\mid A)\, P(A)\)

\(\Rightarrow P(A\mid B) = \frac{P(B\mid A)\,P(A)}{P(B)}, \text{ if } P(B) \neq 0 \text{ (this is Bayes’ Theorem)}\)

For us this means

\(p(C_k \vert \mathbf{x})=\) \(\frac{p(\mathbf{x} \vert C_k)p(C_k)}{p(\mathbf{x})}\)

Let’s break down those pieces:

\(p(C_k)\) = frequency of class \(C_k\) / total number of samples

\(p(\mathbf{x} \vert C_k)\) = (frequency of \(\mathbf{x}\) / number of samples) where class = \(C_k\)

\(p(\mathbf{x})\) = frequency of \(\mathbf{x}\) / total number of samples

### \(p(C_k)\)

This is the easy part. To calculate \(p(C_k)\) we can use empirical relative frequencies given by our training data.

\(p(\text{Technology}) = \frac{3}{10} = 0.3\), \(p(\text{Sales}) = \frac{4}{10} = 0.4\), \(p(\text{Finance}) = \frac{3}{10} = 0.3\)

These probabilities are called “priors”. Our method of estimating them using the training data is common, but not necessary. Suppose we have reason to believe that the true priors for Technology, Sales, and Finance are 0.2, 0.4, and 0.4 respectively. With a Naive Bayes model, we can just plug those in for \(p(C_k)\) and our model will adjust accordingly. By contrast, incorporating priors into tree-based models like Random Forest is not nearly as easy.
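In code, the empirical priors are just relative class frequencies (a sketch, with the labels hard-coded in the order of the data table above):

```python
from collections import Counter

labels = ["finance"] * 3 + ["sales"] * 4 + ["technology"] * 3

# empirical priors: relative frequency of each class in the training data
counts = Counter(labels)
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors)  # {'finance': 0.3, 'sales': 0.4, 'technology': 0.3}
```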

### \(p(\mathbf{x} \vert C_k)\)

Now let’s consider \(p(\mathbf{x} \vert C_k)\). Using our training data we would calculate \(p(\mathbf{x} \vert \text{Technology})\) to be 0. (Remember, \(p(\mathbf{x} \vert \text{Technology})\) represents the probability that only and all of the words “data”, “analyst”, and “manager” appear in a random job title, given that it’s in the Technology class. In our training samples this never occurred, so our empirical probability estimate is 0.) This is a problem. We know \(p(\mathbf{x} \vert \text{Technology})\) should be greater than 0, but we don’t have enough (or in this case, any) samples to accurately estimate it. The way we’ll get around this problem is to make a **naive assumption: we’ll assume that the features (i.e. the occurrences of words in a job title) are independent variables**. Obviously this is not true. When the word “data” appears in a job title, it immediately increases the probability that the word “analyst” appears in the title. However, let’s assume the assumption is valid (or close to being valid). Then

\(p(\mathbf{x} \vert C_k) = p(x_1 \vert C_k)p(x_2 \vert C_k) \cdots p(x_{10} \vert C_k)\)

For our example “data analyst manager” this means

\({p(\mathbf{x}|\text{Technology}) = p(x_1=1|\text{Tech})p(x_2=0|\text{Tech}) \ldots p(x_{10}=0|\text{Tech})=}\) \(\frac{2}{3}\frac{3}{3} \ldots \frac{3}{3} \) \(= 0.1481\)

Similarly, we can calculate \(p(\mathbf{x}|\text{Sales})=0\) and \(p(\mathbf{x}|\text{Finance})=0.005487\).
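These likelihoods can be computed under the naive assumption with a product over per-word empirical probabilities. A sketch with my own helper names (`cond_prob`, `likelihood`), not the article’s script:

```python
titles = [
    "underwriter manager", "mortgage data analyst", "junior underwriter",
    "sales manager", "junior medical sales associate",
    "senior sales manager", "sales manager",
    "senior data analyst", "data analyst", "data manager",
]
labels = ["finance"] * 3 + ["sales"] * 4 + ["technology"] * 3
vocab = sorted({w for t in titles for w in t.split()})

def cond_prob(value, word, cls):
    """Empirical p(x_i = value | C_k): fraction of class-cls titles in which
    `word` does (value=1) or does not (value=0) occur."""
    rows = [t for t, c in zip(titles, labels) if c == cls]
    hits = sum((word in t.split()) == bool(value) for t in rows)
    return hits / len(rows)

def likelihood(title, cls):
    """p(x | C_k) under the naive assumption: product over all vocabulary words."""
    words = set(title.split())
    p = 1.0
    for w in vocab:
        p *= cond_prob(1 if w in words else 0, w, cls)
    return p

print(round(likelihood("data analyst manager", "technology"), 4))  # 0.1481
print(likelihood("data analyst manager", "sales"))                 # 0.0
```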

### \(p(\mathbf{x})\)

We run into the same issue as before when we try to estimate \(p(\mathbf{x})\). According to our training data \(p(\mathbf{x}) = 0\), but the *true* value of \(p(\mathbf{x})\) should obviously be > 0. However, consider our end-goal which is to determine \(p(\text{Technology} \vert \mathbf{x})\), \(p(\text{Sales} \vert \mathbf{x})\), and \(p(\text{Finance} \vert \mathbf{x})\). Per Bayes’ Theorem, these probabilities are equivalent to

\(p(\text{Technology} \vert \mathbf{x}) =\) \(\frac{p(\mathbf{x} \vert \text{Technology})p(\text{Technology})}{p(\mathbf{x})}\)

\(p(\text{Sales} \vert \mathbf{x}) =\) \(\frac{p(\mathbf{x} \vert \text{Sales})p(\text{Sales})}{p(\mathbf{x})}\)

\(p(\text{Finance} \vert \mathbf{x}) =\) \(\frac{p(\mathbf{x} \vert \text{Finance})p(\text{Finance})}{p(\mathbf{x})}\)

Notice that \(p(\mathbf{x})\) is just a scaling factor: it affects the values of \(p(\text{Technology} \vert \mathbf{x})\), \(p(\text{Sales} \vert \mathbf{x})\), and \(p(\text{Finance} \vert \mathbf{x})\) equally. And since these probabilities must sum to 1, we can solve for \(p(\mathbf{x})\):

\(1 = p(\text{Technology} \vert \mathbf{x}) + p(\text{Sales} \vert \mathbf{x}) + p(\text{Finance} \vert \mathbf{x})\)

\(\Rightarrow 1 = \frac{p(\text{Technology})p(\mathbf{x} \vert \text{Technology})}{p(\mathbf{x})} + \frac{p(\text{Sales})p(\mathbf{x} \vert \text{Sales})}{p(\mathbf{x})} + \frac{p(\text{Finance})p(\mathbf{x} \vert \text{Finance})}{p(\mathbf{x})}\)

\(\Rightarrow p(\mathbf{x}) = p(\text{Technology})p(\mathbf{x} \vert \text{Technology}) + p(\text{Sales})p(\mathbf{x} \vert \text{Sales}) + p(\text{Finance})p(\mathbf{x} \vert \text{Finance})\)

So, in our example we can calculate \(p(\mathbf{x})\) to be 0.04609.

### Putting it all together

Now we can generate probability estimates:

\(p(\text{Technology} \vert \mathbf{x}) =\) \(\frac{p(\mathbf{x} \vert \text{Technology})p(\text{Technology})}{p(\mathbf{x})}\) \(= \frac{0.3 \times 0.1481}{0.04609}\) \(= 0.9643\)

\(p(\text{Sales} \vert \mathbf{x}) =\) \(\frac{p(\mathbf{x} \vert \text{Sales})p(\text{Sales})}{p(\mathbf{x})}\) \(= \frac{0.4 \times 0}{0.04609}\) \(= 0\)

\(p(\text{Finance} \vert \mathbf{x}) =\) \(\frac{p(\mathbf{x} \vert \text{Finance})p(\text{Finance})}{p(\mathbf{x})}\) \(= \frac{0.3 \times 0.005487}{0.04609}\) \(= 0.03571\)

Finally, if we want to turn this model into a classifier we just need to use a decision rule like *Label the sample using the class with the highest probability*.
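The whole calculation for “data analyst manager” fits in a few lines. The likelihoods are plugged in as the exact fractions behind 0.1481 and 0.005487 (that is, 4/27 and 4/729); the variable names are mine:

```python
priors = {"technology": 0.3, "sales": 0.4, "finance": 0.3}
likelihoods = {"technology": 4 / 27, "sales": 0.0, "finance": 4 / 729}

# numerators of Bayes' Theorem; p(x) is their sum, the scaling factor
unnormalized = {c: priors[c] * likelihoods[c] for c in priors}
p_x = sum(unnormalized.values())
posteriors = {c: v / p_x for c, v in unnormalized.items()}

print(round(p_x, 5))                        # 0.04609
print(round(posteriors["technology"], 4))   # 0.9643
print(max(posteriors, key=posteriors.get))  # technology  (the decision rule)
```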

Awesome! But there’s still a nagging issue… Suppose we try to classify the title “junior data analyst”. With our current model, we’d get

\({p(\mathbf{x}|\text{Technology}) = p(x_1=1|\text{Tech})p(x_2=0|\text{Tech}) \ldots p(x_{4}=1|\text{Tech}) \ldots =}\) \(\frac{2}{3}\frac{3}{3} \ldots \frac{0}{3} \ldots \) \(= 0\)

\({p(\mathbf{x}|\text{Sales}) = p(x_1=1|\text{Sales})p(x_2=0|\text{Sales}) \ldots p(x_{4}=1|\text{Sales}) \ldots =}\) \(\frac{0}{4}\frac{3}{4} \ldots \frac{1}{4} \ldots \) \(= 0\)

\({p(\mathbf{x}|\text{Finance}) = p(x_1=1|\text{Finance})p(x_2=0|\text{Finance}) \ldots p(x_{4}=1|\text{Finance}) \ldots =}\) \(\frac{1}{3}\frac{3}{3} \ldots \frac{1}{3} \ldots \) \(= 0.005487\)

Our intuition tells us that “junior data analyst” is more likely to be a Technology job than a Finance job, but since the word “junior” never appeared in any of the Technology training samples, \(p(x_{4}=1|\text{Technology}) = 0\) which will ultimately cause \(p(\text{Technology} \vert \mathbf{x})\) to be 0. In order to deal with this issue, we’ll introduce something called Additive Smoothing which will ensure that \(p(x_i|C_k)\) is never 0 by making

\(p(x_i|C_k) = \frac{\text{frequency}(x_i \vert C_k) + \alpha}{\text{frequency}(C_k) + \alpha n_i}\) where \(n_i\) is the number of possible values for \(x_i\) (in our example, 2).

If \(\alpha = 1\) this is known as Laplace Smoothing. In our example, Laplace Smoothing produces the following:

\(p(\mathbf{x}|\text{Technology}) =\) \(\frac{2+1}{3+1 \cdot 2}\frac{3+1}{3+1 \cdot 2} \ldots \frac{0+1}{3+1 \cdot 2} \ldots \) \(= 0.01132 \Rightarrow p(\text{Technology}|\mathbf{x}) = 0.7431\)

\(p(\mathbf{x}|\text{Sales}) =\) \(\frac{2+1}{4+1 \cdot 2}\frac{3+1}{4+1 \cdot 2} \ldots \frac{1+1}{4+1 \cdot 2} \ldots \) \(= 0.0001058 \Rightarrow p(\text{Sales}|\mathbf{x}) = 0.00926\)

\(p(\mathbf{x}|\text{Finance}) =\) \(\frac{2+1}{3+1 \cdot 2}\frac{3+1}{3+1 \cdot 2} \ldots \frac{1+1}{3+1 \cdot 2} \ldots \) \(= 0.003775 \Rightarrow p(\text{Finance}|\mathbf{x}) = 0.2477\)
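Repeating the classification of “junior data analyst” with Laplace smoothing baked in (\(\alpha = 1\), \(n_i = 2\) since each feature is binary). Again a sketch with my own helper names:

```python
titles = [
    "underwriter manager", "mortgage data analyst", "junior underwriter",
    "sales manager", "junior medical sales associate",
    "senior sales manager", "sales manager",
    "senior data analyst", "data analyst", "data manager",
]
labels = ["finance"] * 3 + ["sales"] * 4 + ["technology"] * 3
vocab = sorted({w for t in titles for w in t.split()})
ALPHA, N_I = 1, 2  # Laplace smoothing; each x_i takes one of 2 values

def smoothed_cond_prob(value, word, cls):
    """p(x_i = value | C_k) with additive smoothing, so it is never 0."""
    rows = [t for t, c in zip(titles, labels) if c == cls]
    hits = sum((word in t.split()) == bool(value) for t in rows)
    return (hits + ALPHA) / (len(rows) + ALPHA * N_I)

def posterior(title, priors):
    """Normalized p(C_k | x) for every class, dividing out p(x)."""
    words = set(title.split())
    scores = {}
    for cls in priors:
        p = priors[cls]
        for w in vocab:
            p *= smoothed_cond_prob(1 if w in words else 0, w, cls)
        scores[cls] = p
    z = sum(scores.values())  # this sum is p(x)
    return {c: v / z for c, v in scores.items()}

post = posterior("junior data analyst",
                 {"technology": 0.3, "sales": 0.4, "finance": 0.3})
print({c: round(v, 4) for c, v in post.items()})
# {'technology': 0.7431, 'sales': 0.0093, 'finance': 0.2477}
```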

### Generalizing this

Recall our model, which is a direct translation of Bayes’ Theorem:

\(p(C_k \vert \mathbf{x})=\) \(\frac{p(\mathbf{x} \vert C_k)p(C_k)}{p(\mathbf{x})}\)

From our naive assumption, we transformed \(p(\mathbf{x} \vert C_k)\) into the product \(\prod_{i=1}^n p(x_i \vert C_k)\). Then we calculated \(p(x_i \vert C_k)\) using relative frequencies. The underlying argument for using relative frequencies is that \(x_i\) was a Bernoulli Random Variable (because \(x_i\) is either 0 or 1 for our example). However, in the general case \(x_i\) can be from any distribution and our Naive Bayes model will still work as long as we can closely estimate \(p(x_i \vert C_k)\). This includes multinomial random variables (commonly used in document classification problems like ours, but where word-repeats can occur) as well as continuous random variables. That means we can include things like a user’s bio (word frequencies) and age as additional features in our model.
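For instance, a continuous feature like age can be folded in by modeling \(p(x_i \vert C_k)\) with a Gaussian fit per class (the idea behind Gaussian Naive Bayes). The ages below are made-up numbers purely for illustration, not data from this article:

```python
import math

# hypothetical ages per class -- illustrative only
ages = {"technology": [28, 34, 30], "sales": [45, 38, 50, 41], "finance": [33, 29, 37]}

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def age_likelihood(age, cls):
    """p(age | C_k) from a Gaussian fit to the class's training ages."""
    data = ages[cls]
    mean = sum(data) / len(data)
    var = sum((a - mean) ** 2 for a in data) / (len(data) - 1)  # sample variance
    return gaussian_pdf(age, mean, math.sqrt(var))

# under the naive assumption this factor simply multiplies the word factors
print(age_likelihood(29, "technology") > age_likelihood(29, "sales"))  # True
```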


What kind of variables (not frequencies) can you apply this on? How would you modify it for continuous variables for instance?

You can use any kind of random variable as long as you can estimate p(xi|Ck). However, if I were mixing multinomial AND continuous variables, I think I’d build two independent models and ensemble them. I’ve seen discussions where people talk about mixing continuous and discrete random variables in a single Naive Bayes classifier, but I would be hesitant to do this myself. Honestly, I need to research this more.

For a little more on this topic see https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Gaussian_naive_Bayes


Nice presentation.

I think it’s great you start off with an exercise that could be followed with pencil and paper. Too often illustrations of data science topics jump right to the R/Python library.


Thank you very much – this is a very interesting post. I don’t completely understand the example of additive smoothing, specifically the frequency(xi|Ck). In p(X|Technology), how numerator of p(X10=0) = 0+1?

That’s p(x4=1), not p(x10=0). (Notice the ellipses … after the fractions.) I put x4 (which represents the word “junior”) there so it aligns with the calculations in the section just above it.

Many thanks Ben (shame on me I need new glasses or am too tired!).

I’m very glad, all is clear and this is very helpful.

Nice article, I am curious about the syntax:

y = list(map(lambda x: {'finance': 0, 'sales': 1, 'technology': 2}[x], categories[:10]))

Can you tell me what the [x] is called? It seems very functional with a pattern matching like syntax?

I’m working through this example with a pandas DataFrame that has ~200k records, and each record averages 3-4 words.

When I get to:

df = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names())

I pop a memory error.

Thanks
