Logistic regression is a generalized linear model most commonly used for classifying binary data. Its output is a continuous value between 0 and 1 (commonly representing the probability of some event occurring), and its inputs can be any mix of real-valued and discrete predictors.

### Motivation

Suppose you want to predict the **probability** someone is a homeowner based solely on their age. You might have a dataset like

ID | Age | Homeowner |
---|---|---|
0 | 13 | 0 |
1 | 13 | 0 |
… | … | … |
39 | 75 | 1 |
40 | 79 | 1 |

As with any binary variable, it makes sense to code True values as 1s and False values as 0s. Then you can plot the data. Our homeowner dataset looks like this

There are definitely more positive samples as age increases, which makes sense. If we group the data into equal-sized bins, we can calculate the proportion of positive samples within each bin.

LB | RB | MedianAge | PcntHomeowner |
---|---|---|---|
10 | 24 | 17 | 0.125 |
24 | 38 | 34 | 0.25 |
38 | 52 | 47 | 0.5 |
52 | 66 | 54 | 0.75 |
66 | 80 | 70.5 | 0.875 |
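The binning step can be sketched in a few lines of Python. Note the data and bin edges below are a small toy sample chosen for illustration, not the article's actual dataset:

```python
# Group (age, homeowner) samples into equal-width bins and compute the
# proportion of positive samples per bin. Toy data, illustrative only.

samples = [(13, 0), (17, 1), (20, 0), (30, 0), (34, 1),
           (47, 0), (50, 1), (55, 1), (60, 1), (75, 1)]

bin_edges = list(range(10, 81, 14))   # left-inclusive bins: [10,24), [24,38), ...

proportions = {}
for lb, rb in zip(bin_edges, bin_edges[1:]):
    responses = [y for age, y in samples if lb <= age < rb]
    if responses:
        proportions[(lb, rb)] = sum(responses) / len(responses)

print(proportions)
```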

Notice the data starting to take an S shape. This is a common and natural occurrence for a variety of random processes, particularly where the explanatory variable and the response variable have a monotonic relationship. One such S-shaped function is the logistic function.

\(\sigma (t) = \frac{1}{1+e^{-t}}\)

Writing t as B0 + B1x lets us change the horizontal position of the curve by varying B0 and the steepness of the curve by varying B1.

\(F(x) = \frac {1}{1+e^{-(\beta_0 + \beta_1 x)}}\)

### Fitting a model to the data

At this point we’d like to fit a logistic curve to our data. There are two distinct ways to do this depending on the type of data to be fitted.

#### Method 1

First we’ll look at fitting a logistic curve to binned, or grouped data. For example, suppose we didn’t have individual Yes/No responses of whether someone was a homeowner, but instead had an aggregated data set like

MedianAge | PcntHomeowner |
---|---|
17 | 0.125 |
34 | 0.25 |
47 | 0.5 |
54 | 0.75 |
70.5 | 0.875 |

Recall our last form of the logistic function

\(F(x) = \frac {1}{1+e^{-(\beta_0 + \beta_1 x)}}\)

where we interpret F(x) to be the probability that someone is a homeowner. We can rearrange this equation as follows

\(\ln (\frac{F(x)}{1 - F(x)}) = \beta_0 + \beta_1 x\)

Notice that the modified function is **linear** in terms of x. So, if we take our sample data and create a transformed column Y’ equal to ln(PcntHomeowner / (1 - PcntHomeowner)), then we can fit Y’ = B0 + B1x using ordinary least squares.

MedianAge | PcntHomeowner | YPrime |
---|---|---|
17 | 0.125 | -1.946 |
34 | 0.25 | -1.099 |
47 | 0.5 | 0 |
54 | 0.75 | 1.099 |
70.5 | 0.875 | 1.946 |

From here we can transform the fitted linear model to a logistic model. We have

\(Y’ = \ln (\frac{F(x)}{1 - F(x)}) = \beta_0 + \beta_1x \Rightarrow F(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1x)}}\)

(Nothing special here – just the logistic function.)
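As a sanity check, Method 1 can be sketched end to end in Python: transform the grouped data to log-odds, fit a line with the closed-form ordinary least squares formulas, then map the line back through the logistic function:

```python
import math

# Grouped data from the table above
ages = [17, 34, 47, 54, 70.5]
pcts = [0.125, 0.25, 0.5, 0.75, 0.875]

# Transformed column: Y' = ln(p / (1 - p))
yprime = [math.log(p / (1 - p)) for p in pcts]

# Ordinary least squares fit of Y' = b0 + b1*x (closed-form formulas)
n = len(ages)
xbar = sum(ages) / n
ybar = sum(yprime) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(ages, yprime)) / \
     sum((x - xbar) ** 2 for x in ages)
b0 = ybar - b1 * xbar

def F(x):
    # Transform the linear fit back to a logistic curve
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(b0, b1, F(47))
```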

Finally, we can plot the fitted model against our data.

Before we wrap up Method 1, let’s take another look at our linear model

\(\ln (\frac{F(x)}{1 - F(x)}) = \beta_0 + \beta_1 x\)

First of all, this is the inverse of the logistic function. Secondly, notice the F/(1-F) part. Remember, F is the probability of success. So, F/(1-F) is the *odds* of success.

The odds of something happening is the probability of it happening divided by the probability of it not happening. If the probability of a horse winning the Kentucky Derby is .2, then the odds of that horse winning are .2/.8 = .25 (or more commonly stated, the odds of the horse *losing* are 4 or “4 to 1”).

Recapping, for some probability of success p, the odds of success are p/(1-p) and the log-odds are ln(p/(1-p)). The function ln(p/(1-p)) is special enough to warrant its own name – the logit function. Notice it’s equivalent to the linear model we fit, ln(F/(1-F)) = B0 + B1x. In other words, we fit a logistic regression to our data by fitting a linear model to the log-odds of our sample data.
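A quick sketch confirms that the logit and logistic functions are inverses, and reproduces the odds arithmetic from the horse example:

```python
import math

def logit(p):
    # log-odds of probability p
    return math.log(p / (1 - p))

def sigmoid(t):
    # logistic function, the inverse of logit
    return 1 / (1 + math.exp(-t))

# The two functions undo each other
for p in [0.1, 0.2, 0.5, 0.8, 0.99]:
    print(p, sigmoid(logit(p)))

# The Kentucky Derby example: p = .2 gives odds p/(1-p) = .25
print(math.exp(logit(0.2)))
```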

#### Method 2

In Method 1 we were able to use linear regression to fit our data because our dataset had probabilities of homeownership. On the other hand, if our data just has {0, 1} response values we’ll have to use a more sophisticated technique – maximum likelihood estimation.

First, recall the Bernoulli distribution. A Bernoulli random variable X is just a binary random variable (0 or 1) with probability of success p. (I.e. P(X=1) = p). Thus the Bernoulli distribution is defined by a single parameter, p. Furthermore, its expected value equals p.
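A tiny simulation (illustrative only) shows the sample mean of Bernoulli(p) draws approaching p:

```python
import random

# A Bernoulli(p) draw is 1 with probability p, else 0; its expected value is p.
random.seed(0)
p = 0.3
draws = [1 if random.random() < p else 0 for _ in range(100_000)]
print(sum(draws) / len(draws))  # close to 0.3
```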

Next, let’s take another look at our plotted data.

Now pick an x value, say 50, and imagine slicing the data in a small neighborhood around 50.

Age | Homeowner |
---|---|

49 | 0 |

49 | 1 |

51 | 1 |

51 | 0 |

Looking at the data we find 2 positive and 2 negative samples. In this case we can think of each response variable near Age=50 as a random variable from some Bernoulli distribution whose p value is somewhere in the neighborhood of 0.5.

Now let’s slice the data in the neighborhood of 70.

Age | Homeowner |
---|---|

68 | 1 |

68 | 1 |

68 | 1 |

70 | 0 |

70 | 1 |

We can think of these samples as random variables sampled from some *other* Bernoulli distribution whose p value is close to 0.80. This coincides with our intuition that the probability of someone being a homeowner generally increases with their age.

Generalizing this idea, we can assume that at each point, x, the samples close to x follow a Bernoulli distribution whose expected value is some **function of x**, p(x). Since we want to model our data with the logistic function

\(F(x) = \frac {1}{1+e^{-(\beta_0 + \beta_1 x)}}\),

we can treat F(x) and p(x) as the same. In other words, we can think of our logistic function as defining an infinite set of Bernoulli distributions.

Now suppose we guess some parameters B0 and B1 which appear to fit the data well, say B0=-3.5 and B1=0.06.

Assuming our guessed model is the *true* model for whether or not someone is a homeowner based on their age, what is the probability of our sampled data occurring? In other words, what is the probability that a random 13-year-old *isn’t* a homeowner AND another random 13-year-old *isn’t* a homeowner … AND a random 75-year-old *is* a homeowner AND a random 79-year-old *is* a homeowner?

According to our model, the probability that a random 13-year-old *isn’t* a homeowner is p(Y=0 | x=13) = 1-F(13) = .93. The probability that a random 75-year-old *is* a homeowner is F(75) = .73, etc. If we assume each of these instances are independent, then the probability of all of them occurring is (1-F(13)) * (1-F(13)) … * F(75) * F(79).
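We can verify those two probabilities directly from the guessed model:

```python
import math

# The guessed model from the text: B0 = -3.5, B1 = 0.06
def F(x):
    return 1 / (1 + math.exp(-(-3.5 + 0.06 * x)))

print(round(1 - F(13), 3))  # P(13-year-old is NOT a homeowner) → 0.938
print(round(F(75), 3))      # P(75-year-old IS a homeowner)     → 0.731
```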

What we just described is calculating the probability, or “likelihood,” of our samples having their response values according to our model F(x, B0=-3.5, B1=0.06). For a set of samples assumed to come from *some* logistic regression model, we can define the likelihood of *the specific model* with parameters B0 and B1 as

\(\mathcal{L}(\beta_0, \beta_1; \boldsymbol{samples}) = \prod_{i=1}^n F(x_i)^{y_i}(1-F(x_i))^{(1-y_i)}\)

where \((x_i, y_i)\) is the ith observation in our sample data.

Plugging in F gives

\(\mathcal{L}(\beta_0, \beta_1; \boldsymbol{samples}) = \prod_{i=1}^n (\frac {1}{1+e^{-(\beta_0 + \beta_1 x_i)}})^{y_i}(1-\frac {1}{1+e^{-(\beta_0 + \beta_1 x_i)}})^{(1-y_i)}\)

This is called the likelihood function. The parameters (B0, B1) that yield the largest value of L are exactly the parameters we want to use for our logistic regression model. The process of finding those optimum parameters is called maximum likelihood estimation.
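A direct, if naive, implementation of the likelihood function on toy data (the samples below are made up for illustration) shows that a sensible parameter guess scores higher than a flipped-sign one:

```python
import math

def F(x, b0, b1):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def likelihood(b0, b1, samples):
    # Product over samples of F(x)^y * (1 - F(x))^(1 - y)
    L = 1.0
    for x, y in samples:
        p = F(x, b0, b1)
        L *= p ** y * (1 - p) ** (1 - y)
    return L

# Toy samples consistent with "homeownership rises with age"
samples = [(13, 0), (20, 0), (35, 0), (50, 1), (60, 1), (75, 1), (79, 1)]

# The guess from the text vs. a hypothetical flipped-sign model
good = likelihood(-3.5, 0.06, samples)
bad = likelihood(3.5, -0.06, samples)
print(good, bad)
```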

With that said, maximum likelihood estimation is a **deep** topic and probably warrants its own separate article. For that reason, I’ll leave out the gritty details and just say

*[magic]*.

Fortunately for the practitioners out there, a number of maximum likelihood estimation methods have been implemented in open-source statistical libraries. Using scikit-learn with our sample data yields B0=-3.23 and B1=0.0723.
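For intuition about what those solvers do, here is a pure-Python sketch of maximum likelihood estimation by gradient ascent on the log-likelihood. The data is synthetic, drawn from an assumed “true” model rather than the article’s dataset, and the age column is standardized only so that plain gradient ascent converges quickly; real libraries use more robust optimizers:

```python
import math
import random

# Toy maximum likelihood estimation via gradient ascent. Illustrative only.
random.seed(1)
TRUE_B0, TRUE_B1 = -3.5, 0.06   # assumed "true" model for data generation

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# Draw synthetic (age, homeowner) samples from the true model
data = []
for _ in range(500):
    age = random.uniform(13, 80)
    y = 1 if random.random() < sigmoid(TRUE_B0 + TRUE_B1 * age) else 0
    data.append((age, y))

def log_likelihood(b0, b1):
    # Sum of log Bernoulli probabilities under the model (b0, b1)
    return sum(y * math.log(sigmoid(b0 + b1 * x)) +
               (1 - y) * math.log(1 - sigmoid(b0 + b1 * x))
               for x, y in data)

start_ll = log_likelihood(0.0, 0.0)

# Standardize age so plain gradient ascent converges quickly
xs = [x for x, _ in data]
mean = sum(xs) / len(xs)
std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
scaled = [((x - mean) / std, y) for x, y in data]

w0 = w1 = 0.0
lr = 0.05
for _ in range(2000):
    # Gradient of the (average) log-likelihood: residual (y - p) times feature
    g0 = sum(y - sigmoid(w0 + w1 * z) for z, y in scaled) / len(scaled)
    g1 = sum((y - sigmoid(w0 + w1 * z)) * z for z, y in scaled) / len(scaled)
    w0 += lr * g0
    w1 += lr * g1

# Convert coefficients back to the original age scale
b1 = w1 / std
b0 = w0 - w1 * mean / std
print(b0, b1, log_likelihood(b0, b1))
```

The recovered coefficients should land near the generating values, and the log-likelihood of the fit is strictly higher than at the starting guess.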

### Comments

What exactly are LB and RB supposed to represent?

That’s the “Left Bound” and “Right Bound” of each bin used to group the samples. (Left bound inclusive, right bound exclusive).

Still, Ben, isn’t monotonicity necessary but not sufficient (this applies to Method 2, BTW)? There is the issue of separation, whereby there may be some input value X0 such that all trials with X > X0 are successes (or all failures). I suspect the underlying issue is that an S-curve must bend abruptly to match a constant curve (either f(x)=0 or f(x)=1). But somehow, in practice, there don’t seem to be problems when you have 3 or more levels in an ordinal response; maybe the S-curve smooths out more easily with 3 levels.

Very good article, please keep me posting with new articles.

Hi, nice article. Is it fair to say that if you get a good fit when doing a standard regression of Y (dependent) against X (independent) on your data, then a logistic regression is _not_ likely to describe the log-odds accurately, i.e., the probability of a success in Y for different cases of X cannot be effectively described?

Thanks Guero. Hmm, I think you mean Y’, the transformed version of Y. If so, I would disagree with your statement. If you really did mean Y, then your statement is correct.

Thanks, how do you mean Y’ as the transformed version of X? Do you mean the “discretization” of Y into 2 or more states, e.g., High, Medium, Low?

No. Notice this line in my post “Y’ equal to ln(PcntHomeowner/ (1-PcntHomeowner))”

Thanks for your patience, Ben. I am basically trying to understand the meaning of a lack of fit (Chi-Squared has a P of 0.000) that I obtained in a multilevel logistic regression (5 states: Very High, High, …). Can I conclude that lack of fit means that the distribution of the data, or the odds P(x)/(1-P(x)), cannot be well-approximated by a logit curve?

Responding to your last post (WordPress won’t let me reply below)… Yes.

Thanks for everything, Ben.