In three months (as of June 2016) the New Orleans Saints will play a football game against the Atlanta Falcons. I want to know who will win. I ask my friend and he says the Saints. Technically this is a predictive model, but it’s probably not worth much. I can improve upon this model by asking other people who they think will win. Someone might pick the Saints because we have a better quarterback.
I think there’s a rule somewhere that says “You can’t call yourself a data scientist until you’ve used a Naive Bayes classifier”. It’s extremely useful, yet beautifully simplistic. This article is my attempt at laying the groundwork for Naive Bayes in a practical and intuitive fashion.
Motivating Problem Let’s start with a problem to motivate our formulation of Naive Bayes. (Feel free to follow along using the Python script or R script found here.
The purpose of this guide is to bridge the gap between understanding what a regular expression is and how to use them in Python. If you’re brand new to regular expressions, I highly recommend checking out RegexOne.
For this guide, we’ll use Python’s re module which makes using regular expressions a breeze.
Setup import re # import the re module sentence = "We bought our Golden Retriever, Snuggles, for $30 on 1/1/2015 at 1017 Main St.
The purpose of this guide is to bridge the gap between understanding what a regular expression is and how to use them in R. If you’re brand new to regular expressions, I highly recommend checking out RegexOne.
Hadley Wickham’s stringr package makes using regular expressions in R a breeze. I use it to avoid the complexity of base R’s regex functions grep, grepl, regexpr, gregexpr, sub and gsub where even the function names are cryptic.
Here’s a practical guide for calculating customer retention and churn from transaction data.
Preface The general idea of customer retention is self explanatory; It’s a measure of how well a business retainins their customers. Unfortunately the specifics of how to calculate a retention metric are not so clear. Likewise, customer churn is the complement of retention; It’s a measure of how many customers end their relationship with a business - i.
Logistic regression is a generalized linear model most commonly used for classifying binary data. It’s output is a continuous range of values between 0 and 1 (commonly representing the probability of some event occurring), and its input can be a multitude of real-valued and discrete predictors.
Motivating Problem Suppose you want to predict the probability someone is a homeowner based solely on their age. You might have a dataset like