## The Problem

You sell software that helps stores manage their inventory. You collect leads on thousands of potential customers, and your strategy is to cold-call them and pitch your product. You can only make 100 phone calls per day, so you want to identify leads with a high probability of converting to a sale. By calling leads randomly, you only generate about two sales per day (a 2% hit ratio). If only you could be smarter about who you target, you could increase your sales with no extra resources…

## The Solution

Machine Learning. You tracked important data on your leads – attributes like Facebook likes, phone number, type of business, etc. Now you can use machine learning techniques to build a model that separates the strong leads from the weak ones. In this article I'll present an example dataset with example solutions to illustrate how this works. If you'd like to download the data and solutions, you can find them in the Rank Sales Leads directory of the Machine Learning Problem Bible on GitHub.

## The Data

### train

### test

## The Objective

This is important, so pay attention. The simplest and most common objective function for binary classification algorithms is accuracy rate. This is not what we want. In our sample training dataset, just 35% of leads convert to a sale, and in practice this number can be much lower (less than 1%). With rates like that, the most accurate model is often the one that predicts every lead will *not* become a sale. Instead, we're interested in accurately **ranking** the likelihood that each lead becomes a sale. Then we can pick out the best leads to pursue, with confidence that our overall conversion rate will be higher than pursuing a set of random leads. A very easy and common objective function for this is Area Under the ROC Curve. There's a lot of information about it, so I encourage you to Google it. But in short:

Suppose we randomly sort our leads and predict that every lead converts to a sale. Then we can measure the cumulative *True Positive Rate* (proportion of sales correctly identified up to position i) and the cumulative *False Positive Rate* (proportion of non-sales incorrectly identified up to position i). Plotting the cumulative True Positive Rate vs the cumulative False Positive Rate will, in theory, generate a y=x line from 0 to 1. So, for a point (x, y) like (0.5, 0.5), the interpretation is that we have to pursue 50% of the total non-converting leads in order to catch 50% of the converting leads.

Now suppose we build a model that predicts the probability each lead will convert. We sort the data from the strongest lead to the weakest lead and remeasure the cumulative True Positive Rate and False Positive Rate. Plotting this over our previous example yields

For reference, the point (0.25, 1.0) means that our model exhausted 25% of the bad leads in order to identify 100% of the good leads. This curve is called the Receiver Operating Characteristic (ROC curve). Measuring the area under the ROC curve gives an indication of how well the leads are ranked. A perfect model will have AUC ROC = 1, and a random guess model should have AUC ROC = 0.5.
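To make the ranking interpretation of AUC concrete, here's a minimal pure-Python sketch (the labels and scores are made up for illustration). It uses the equivalent rank-based definition: AUC is the probability that a randomly chosen sale is scored higher than a randomly chosen non-sale, with ties counted as half.

```python
def auc(labels, scores):
    """AUC = probability a randomly chosen sale (label 1) outranks
    a randomly chosen non-sale (label 0). Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 1, 0, 0, 0, 1]                    # 1 = converted to a sale
scores = [0.9, 0.3, 0.8, 0.4, 0.5, 0.2, 0.1, 0.7]    # hypothetical model scores

print(auc(labels, scores))  # → 0.9375
```

A perfect ranking puts every sale above every non-sale and returns 1.0; shuffled scores hover around 0.5, matching the y=x line above.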

One last bit about our objective – it doesn't matter how well we rank the weakest leads, because we only have the resources to call the top leads. For this reason, partial AUC ROC would be an even better objective function, but since it's a little more complex we'll stick to maximizing AUC ROC.
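If you do want to try partial AUC, scikit-learn's `roc_auc_score` supports it via the `max_fpr` parameter, which restricts the area to the low-false-positive region (the leads you'd actually call) and standardizes it so 0.5 still means random. A rough sketch on synthetic data (labels and scores invented here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)           # hypothetical sale / no-sale labels
scores = y * 0.5 + rng.random(200)    # scores loosely correlated with the labels

full_auc = roc_auc_score(y, scores)
# Standardized partial AUC over the region FPR <= 0.25,
# i.e. how well we rank before exhausting 25% of the bad leads
partial_auc = roc_auc_score(y, scores, max_fpr=0.25)
print(full_auc, partial_auc)
```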

## The Models

I built three models to solve this problem. Each model has some benefits and drawbacks which I’ll discuss, but I’m not going to display their results. This would be counterproductive for a few reasons:

1) There’s not enough data to make confident model comparisons

2) I made up the data

3) I didn't spend a lot of time tuning each model's hyperparameters

Instead, think of this as a starter kit/guide to solving your own problem.

### Logistic Regression

This is probably the *purest* model of the three, with the strongest statistical foundation. It can work really well, but it'll quickly fall apart when important assumptions are violated (as they often are when working with real-world data).

*Pros*

– Can work extremely well (if the data is very clean and satisfies a bunch of nice properties)

– Fast

– Simple. You could build it in an Excel spreadsheet

*Cons*

– Requires imputation for missing values

– Poorly handles non-monotonic relationships and spikes (e.g. imputing -1 for NA values where all other values are >= 0)

– Requires one-hot-encoding for unordered categorical features

– Can be difficult to measure variable importance

– R implementation doesn’t directly provide a regularization term

**Implementations:** rank_leads_logreg.R | rank_leads_logreg.py
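For the full solution, see the scripts above; the following is just a rough Python sketch of the workflow (the column names and values are invented for illustration). Note how it has to do the two chores from the cons list by hand: impute missing numerics and one-hot encode the unordered categorical feature.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical leads data (column names and values made up for this sketch)
train = pd.DataFrame({
    "FacebookLikes": [120, None, 45, 800, 10, 300, None, 60],
    "TypeOfBusiness": ["grocery", "clothing", "grocery", "auto",
                       "clothing", "auto", "grocery", "auto"],
    "Sale": [1, 0, 0, 1, 0, 1, 0, 1],
})

# Logistic regression needs complete numeric inputs:
# impute missing values and one-hot encode the categorical feature
X = pd.get_dummies(train.drop(columns="Sale"), columns=["TypeOfBusiness"])
X["FacebookLikes"] = X["FacebookLikes"].fillna(X["FacebookLikes"].median())

model = LogisticRegression(max_iter=1000).fit(X, train["Sale"])
train["PredictedSale"] = model.predict_proba(X)[:, 1]  # rank leads by this
print(roc_auc_score(train["Sale"], train["PredictedSale"]))
```

Sorting `train` by `PredictedSale` descending gives the call list.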

### Random Forest

This is probably my favorite solution. It’s easy to set up and it almost always works. Plus it’s pretty darn cool.

*Pros*

– Almost always performs well

– Easy to understand the model’s logic

– Can handle complex interactions of multiple variables

– Does a good job of modeling non-monotonic relationships and spikes in the data

– Easy to measure the importance of each predictor

– Easy to tune the model's hyperparameters

*Cons*

– Can be difficult to explain exactly why a certain prediction was made on test data

– Can be difficult to diagnose issues

– Requires imputation for missing values

– Scikit-Learn requires one-hot encoding for unordered categorical features, which can hurt performance and slow down training when categories are numerous

– R implementation only allows categorical features with up to 53 unique categories

**Implementations:** rank_leads_rf1.R | rank_leads_rf1.py
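Again, see the scripts above for the real thing; here's a hedged Python sketch on the same made-up leads data. Scikit-learn's random forest still needs imputation and one-hot encoding, but trees tolerate sentinel values like -1, and `feature_importances_` makes variable importance easy to read off.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical leads data (column names and values made up for this sketch)
train = pd.DataFrame({
    "FacebookLikes": [120, None, 45, 800, 10, 300, None, 60],
    "TypeOfBusiness": ["grocery", "clothing", "grocery", "auto",
                       "clothing", "auto", "grocery", "auto"],
    "Sale": [1, 0, 0, 1, 0, 1, 0, 1],
})

# Trees can split around a sentinel, so -1 imputation is less harmful here
X = pd.get_dummies(train.drop(columns="Sale"), columns=["TypeOfBusiness"])
X["FacebookLikes"] = X["FacebookLikes"].fillna(-1)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, train["Sale"])
scores = model.predict_proba(X)[:, 1]  # rank leads by this

# Variable importance, one of the big pros of this model
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False))
print(roc_auc_score(train["Sale"], scores))
```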

### Gradient Boosting (XGBoost)

In terms of predictive performance, this one's probably the best. XGBoost has powered a large share of winning Kaggle solutions. However, it's fairly complex, can be difficult to tune, and its predictions are even more mystical than random forest's.

*Pros*

– Usually performs *extremely* well

– Very fast

– Smart handling of missing values

– Easy to measure variable importance

*Cons*

– Can be difficult to explain exactly why a certain prediction was made on test data

– Difficult to prepare dataset for model training and predictions

– Difficult to tune hyperparameters

– Requires one-hot encoding categorical features

**Implementations:** rank_leads_xgb.R | rank_leads_xgb.py

Questions? Leave a comment or drop me a line – bgorman@GormAnalysis.com

**Q:** Is this real, anonymized-but-real, or synthetic data?

**A:** Synthetic. I usually make small toy datasets to illustrate concepts in my blog posts.