Logistic Regression Fundamentals

Logistic regression is a generalized linear model most commonly used for binary classification.  Its output is a continuous value between 0 and 1 (commonly interpreted as the probability of some event occurring), and its inputs can be any mix of real-valued and discrete predictors.

Motivation

Suppose you want to predict the probability someone is a homeowner based solely on their age.  You might have a dataset like

ID Age Homeowner
0 13 0
1 13 0
… … …
39 75 1
40 79 1
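
To make the later snippets concrete, here’s one way this dataset might be represented in Python.  The column names come from the table, and only the rows shown above are included (the elided records stay elided).

```python
import pandas as pd

# Illustrative reconstruction of the homeowner dataset. Only the rows
# shown in the table above are included; the remaining records (IDs 2-38)
# are omitted here.
df = pd.DataFrame({
    "Age":       [13, 13, 75, 79],
    "Homeowner": [ 0,  0,  1,  1],
})
```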

As with any binary variable, it makes sense to code True values as 1s and False values as 0s.  Then we can plot the data.  Our homeowner dataset looks like this:

[Figure: scatter plot of Homeowner (0 or 1) versus Age]

There are clearly more positive samples as age increases, which makes sense.  If we group the data into equal-width bins, we can calculate the proportion of positive samples in each bin.

LeftBound RightBound MedianAge PcntHomeowner
10 24 17 0.125
24 38 34 0.25
38 52 47 0.5
52 66 54 0.75
66 80 70.5 0.875
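
As a sketch, this binning can be reproduced with pandas.  The bin boundaries [10, 24, 38, 52, 66, 80] come from the LeftBound/RightBound columns above, and df is the hypothetical DataFrame from earlier.

```python
import pandas as pd

# Cut ages into five equal-width bins, then compute the median age and the
# proportion of positive (Homeowner = 1) samples within each bin.
bins = [10, 24, 38, 52, 66, 80]
df["Bin"] = pd.cut(df["Age"], bins=bins)

summary = df.groupby("Bin", observed=True).agg(
    MedianAge=("Age", "median"),
    PcntHomeowner=("Homeowner", "mean"),
)
print(summary)
```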

Notice the data is starting to take an S shape.  This is a common and natural occurrence for a variety of random processes, particularly when the explanatory variable and the response variable have a monotonic relationship.  One such S-shaped function is the logistic function.

\(\sigma (t) = \frac{1}{1+e^{-t}}\)

[Figure 3: the logistic function \(\sigma(t)\)]

Writing t as B0 + B1x lets us change the horizontal position of the curve by varying B0 and the steepness of the curve by varying B1.

\(F(x) = \frac {1}{1+e^{-(\beta_0 + \beta_1 x)}}\)
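
Here’s a minimal sketch of this parameterized curve in Python, handy for experimenting with how B0 shifts the curve and B1 steepens it:

```python
import numpy as np

def logistic(x, b0, b1):
    """F(x) = 1 / (1 + exp(-(b0 + b1 * x)))"""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

x = np.linspace(10, 80, 8)
print(logistic(x, -3.5, 0.06))  # varying b0 shifts the S curve; b1 steepens it
```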

Fitting a model to the data

At this point we’d like to fit a logistic curve to our data.  There are two distinct ways to do this depending on the type of data to be fitted.

Method 1

First we’ll look at fitting a logistic curve to binned, or grouped, data. For example, suppose we didn’t have individual Yes/No responses for whether someone was a homeowner, but instead had an aggregated dataset like

MedianAge PcntHomeowner
17 0.125
34 0.25
47 0.5
54 0.75
70.5 0.875

Recall our last form of the logistic function

\(F(x) = \frac {1}{1+e^{-(\beta_0 + \beta_1 x)}}\)

where we interpret F(x) to be the probability that someone is a homeowner.  We can rearrange this equation as follows

\(\ln (\frac{F(x)}{1 - F(x)}) = \beta_0 + \beta_1 x\)
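
This rearrangement is straightforward algebra:

\(F = \frac{1}{1+e^{-t}} \Rightarrow e^{-t} = \frac{1-F}{F} \Rightarrow e^{t} = \frac{F}{1-F} \Rightarrow t = \ln (\frac{F}{1-F})\)

with \(t = \beta_0 + \beta_1 x\).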

Notice that the modified function is linear in x.  So, if we take our sample data and create a transformed column Y’ equal to ln(PcntHomeowner / (1 - PcntHomeowner)), then we can fit Y’ = B0 + B1x using ordinary least squares.

MedianAge PcntHomeowner YPrime
17 0.125 -1.946
34 0.25 -1.099
47 0.5 0
54 0.75 1.099
70.5 0.875 1.946
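
Here’s a minimal sketch of this transform-then-fit approach, using the binned values above; np.polyfit performs the ordinary least squares fit.

```python
import numpy as np

median_age    = np.array([17, 34, 47, 54, 70.5])
pct_homeowner = np.array([0.125, 0.25, 0.5, 0.75, 0.875])

# Transform the proportions to log-odds, then fit Y' = b0 + b1 * x by OLS.
y_prime = np.log(pct_homeowner / (1 - pct_homeowner))
b1, b0 = np.polyfit(median_age, y_prime, deg=1)  # returns [slope, intercept]

# Transform back: the fitted logistic curve.
fitted = 1.0 / (1.0 + np.exp(-(b0 + b1 * median_age)))
print(b0, b1, fitted)
```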

From here we can transform the fitted linear model to a logistic model.  We have

\(Y' = \ln (\frac{F(x)}{1 - F(x)}) = \beta_0 + \beta_1x \Rightarrow F(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1x)}}\)

(Nothing special here – just the logistic function.)

Finally, we plot the fitted model against our data.

[Figure: fitted logistic curve overlaid on the binned proportions]

Before we wrap up Method 1, let’s take another look at our linear model

\(\ln (\frac{F(x)}{1 - F(x)}) = \beta_0 + \beta_1 x\)

First of all, this is the inverse of the logistic function.  Secondly, notice the F/(1-F) part.  Remember, F is the probability of success.  So, F/(1-F) is the odds of success.

The odds of something happening are the probability of it happening divided by the probability of it not happening.  If the probability of a horse winning the Kentucky Derby is .2, then the odds of that horse winning are .2/.8 = .25 (or, as more commonly stated, the odds of the horse losing are 4, i.e. “4 to 1”).

Recapping, for some probability of success p, the odds of success are p/(1-p) and the log-odds are ln(p/(1-p)).  The function ln(p/(1-p)) is special enough to warrant its own name – the logit function.  Notice it’s equivalent to the linear model we fit, ln(F/(1-F)) = B0 + B1x.  In other words, we fit a logistic regression to our data by fitting a linear model to the log-odds of our sample data.
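
SciPy ships both directions of this transform: scipy.special.logit computes ln(p/(1-p)), and scipy.special.expit is the logistic function, its inverse.

```python
from scipy.special import expit, logit

p = 0.2
print(logit(p))         # log-odds: ln(0.2 / 0.8) ≈ -1.386
print(expit(logit(p)))  # round-trips back to 0.2
```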

Method 2

In Method 1 we were able to use linear regression to fit our data because our dataset had probabilities of homeownership.  On the other hand, if our data just has {0, 1} response values we’ll have to use a more sophisticated technique – maximum likelihood estimation.

First, recall the Bernoulli distribution.  A Bernoulli random variable X is just a binary random variable (0 or 1) with probability of success p. (I.e. P(X=1) = p). Thus the Bernoulli distribution is defined by a single parameter, p.  Furthermore, its expected value equals p.
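
As a quick illustration, a Bernoulli draw is just a binomial draw with n=1:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.binomial(n=1, p=0.8, size=10)  # ten Bernoulli(0.8) draws
print(samples, samples.mean())  # the sample mean estimates p, the expected value
```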

Next, let’s take another look at our plotted data.

Now pick an x value, say 50, and imagine slicing the data in a small neighborhood around 50. 

Age Homeowner
49 0
49 1
51 1
51 0

Looking at the data we find 2 positive and 2 negative samples.  In this case we can think of each response variable near Age=50 as a random variable from some Bernoulli distribution whose p value is somewhere in the neighborhood of 0.5.

Now let’s slice the data in the neighborhood of 70. 

Age Homeowner
68 1
68 1
68 1
70 0
70 1

We can think of these samples as random variables drawn from some other Bernoulli distribution whose p value is close to 0.80 (4 of the 5 samples are positive). This coincides with our intuition that the probability of someone being a homeowner generally increases with their age.

Generalizing this idea, we can assume that at each point, x, the samples close to x follow a Bernoulli distribution whose expected value is some function of x, p(x).  Since we want to model our data with the logistic function

\(F(x) = \frac {1}{1+e^{-(\beta_0 + \beta_1 x)}}\),

we can treat F(x) and p(x) as the same.  In other words, we can think of our logistic function as defining an infinite family of Bernoulli distributions, one for each value of x.

Now suppose we guess some parameters B0 and B1 which appear to fit the data well, say B0=-3.5 and B1=0.06.

Assuming our guessed model is the true model for whether or not someone is a homeowner based on their age, what is the probability of our sampled data occurring?  In other words, what is the probability that a random 13-year-old isn’t a homeowner AND another random 13-year-old isn’t a homeowner … AND a random 75-year-old is a homeowner AND a random 79-year-old is a homeowner?

According to our model, the probability that a random 13-year-old isn’t a homeowner is P(Y=0 | x=13) = 1-F(13) ≈ .94.  The probability that a random 75-year-old is a homeowner is F(75) ≈ .73, and so on.  If we assume each of these observations is independent, then the probability of all of them occurring is (1-F(13)) * (1-F(13)) … * F(75) * F(79).
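
Here’s that product evaluated under the guessed model, as a sketch using only the four illustrative rows from the first table (the elided records would multiply in exactly the same way):

```python
import numpy as np

def logistic(x, b0, b1):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

b0, b1 = -3.5, 0.06
ages      = np.array([13, 13, 75, 79])
homeowner = np.array([ 0,  0,  1,  1])

# P(Y=1) = F(x) for positive samples, P(Y=0) = 1 - F(x) for negative ones.
p = logistic(ages, b0, b1)
per_sample = np.where(homeowner == 1, p, 1 - p)
print(per_sample)         # ≈ [0.94, 0.94, 0.73, 0.78]
print(per_sample.prod())  # the joint probability, assuming independence
```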

What we just described is calculating the probability or “likelihood” of our samples having their response values according to our model F(x, B0=-3.5, B1=0.06). For a set of samples assumed to be from some logistic regression model, we can define the likelihood of the specific model with parameters B0 and B1 as

\(\mathcal{L}(\beta_0, \beta_1; \boldsymbol{samples}) = \prod_{i=1}^n P[Y_i=y_i \mid p_i=F(x_i; \beta_0, \beta_1)] = \prod_{i=1}^n F(x_i; \beta_0, \beta_1)^{y_i}(1-F(x_i; \beta_0, \beta_1))^{1-y_i}\)

where \((x_i, y_i)\) is the ith observation in our sample data.

Plugging in F gives

\(\mathcal{L}(\beta_0, \beta_1; \boldsymbol{samples}) = \prod_{i=1}^n \left(\frac{1}{1+e^{-(\beta_0 + \beta_1 x_i)}}\right)^{y_i}\left(1-\frac{1}{1+e^{-(\beta_0 + \beta_1 x_i)}}\right)^{1-y_i}\)

This is called the likelihood function.  The parameters (B0, B1) that yield the largest value of L are exactly the parameters we want to use for our logistic regression model. The process of finding those optimum parameters is called maximum likelihood estimation.
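
In practice one maximizes the logarithm of \(\mathcal{L}\) instead: the argmax is the same, but the numerics are far better behaved, since a product of many small probabilities underflows.  A sketch, again using the illustrative rows from earlier:

```python
import numpy as np

ages      = np.array([13, 13, 75, 79])  # illustrative rows; full data elided
homeowner = np.array([ 0,  0,  1,  1])

def log_likelihood(b0, b1, x, y):
    """log L = sum_i [ y_i*ln(F(x_i)) + (1 - y_i)*ln(1 - F(x_i)) ]"""
    f = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    return np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

print(log_likelihood(-3.5, 0.06, ages, homeowner))  # our guessed parameters
```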

With that said, maximum likelihood estimation is a deep topic and probably warrants its own separate article. For that reason, I’ll leave out the gritty details and just say

[magic].

Fortunately for the practitioners out there, a number of maximum likelihood estimation methods have been implemented in open-source statistical libraries. Using scikit-learn with our sample data yields B0=-3.23 and B1=0.0723.
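
Here’s a sketch of that scikit-learn call.  Note that LogisticRegression regularizes by default, so approximating a plain maximum likelihood fit means effectively turning the penalty off (e.g. a very large C, the inverse regularization strength).  The variable names are illustrative, and the full 41-row dataset is elided.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

ages      = np.array([13, 13, 75, 79])  # stand-in for the full dataset
homeowner = np.array([ 0,  0,  1,  1])

# A very large C approximates an unpenalized maximum likelihood fit.
model = LogisticRegression(C=1e9)
model.fit(ages.reshape(-1, 1), homeowner)

b0 = model.intercept_[0]  # ≈ -3.23 on the full dataset
b1 = model.coef_[0, 0]    # ≈ 0.0723 on the full dataset
```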