Logistic Regression is a common Supervised Machine Learning Classification algorithm that attempts to estimate the probability that a given set of data can be classified as a positive example of some category.   Examples of some question logistic regression can help answer include:

  • What is the probability a costumer will purchase a product given their recent on-sight behavior?
  • What is the probability that a recipient of a loan will default based on their most recent credit data?
  • What is the probability a given wine is red based on its chemical properties?

These questions can be changed to a goal such as: “Classify this person as a purchaser/non-purchaser based on their on-sight behavior.”  Logistic Regression does not do this directly.  It is done by setting a classification threshold for the probability, say 0.75 (or 75%).   All examples with an estimated probability of being a positive example of the classification above 75% will be classified as a positive example.   What to set the threshold depends on the problem/ goal of the classification, and is not inherit to Logistic Regression.  I will defer discussing this to another time.

Why Implement It Ourselves?

There are great libraries already written in all common data science languages that already implement Logistic Regression.  They are faster and much more robust than what I am going to write here.   I also use them over my own code.  I recommend that for everyone.

The point of implementation is partly to have a deeper understanding of the packages/libraries/algorithms, and partly for the sheer join of it.  There is also a habit of mind that I want to establish for myself: rigorous understanding.  Most implementations have design decisions and choice that might not be be the best choice for all problems.  There are also bugs that exist that might inhibit your work.   Plus, there is also the fact that there are many very good and specific machine learning methods and algorithms that are not in the common or best packages out there.   You might find yourself wanting to use one, and not willing to wait.

My last reason:  cost function manipulation is fun and the results can be surprising.   I will also defer this to a post later in the series.

If you want to skip the theory you can jump to the code.

Logistic Regression Theory (Math o’clock)

Logistic Regression is an algorithm that attempts to estimate the probability (p) that a given the data (x) is an example of given classification (y = 1).  This can be written as:

p( y=1|x)


{p( y=1|x) +p( y=0|x) = 1}

The odds ratio for a given example is:

{\textrm{odds} = p( y=1|x)/p( y=0|x) }

The Logistic Regression algorithm requires/assumes that the natural log of the odds ( ln( \frac{p( y=1|x)}{p( y=0|x)})) is a linear function of the variables:

{ln( \frac{p( y=1|x,w)}{p( y=0|x,w)}) = w_{0} + \sum_{i=1}^N w^{i} x_{i}}

Notice that the weights (w) slipped into the equation.  The index i represents the i^{\textrm{th}} variable in the example set (x,y).  The result is now both a condition of the data and the chosen weights because we have decided to model the probability with a specific function.

The above equation than requires that the probability is represented as a Sigmoid function:

{p( y=1|x,w) = \frac{e^{w_{0} + \sum_{i=1}^N w^{i} x_{i}}}{1+e^{w_{0} + \sum_{i=1}^N w^{i} x_{i}}} = \frac{1}{1+e^{-(w_{0} +\sum_{i=1}^N w^{i} x_{i})}}}

We now have a set of examples of (x_i,y_i) and a model to represent the probabilities from each example.  The weights or coefficients in the model are not specified.   The goal now is to choose the weights that best reflect the truth presented in the data.

Gradient Decent – How to Choose the Best Weights.

The best always depends on the definition of the problem, so we have to define it.   Lets consider one example (x_{0},y_{0}) and ask what is the probability of being correct.  If y=1 then the probability of being correct is:


And if y=0, the probability of being correct is:

{p(y=0|x,w) = 1 - p(y=1|x,w)}

We because y = 1 or y=0, we write the probability/likelihood of being correct (\mathcal{L}(w)) as:

{\mathcal{L}(w) = p(y_{0}=1|x_{0},w)^{y_{0}} (1-p(y_{0}=1|x_{0},w))^{1-y_{0}}}

We can multiply the above equation together for each example if the assumption that each example is independent holds.  The likelihood of being correct for all the examples can be written as:

{\mathcal{L}(w) = \prod_{k} p(y_{k}=1|x_{k},w)^{y_{k}} (1-p(y_{k}=1|x_{k},w))^{1-y_{k}}}

Where the index k is the k^{\textrm{th}} example from our set of training examples.

We want to maximize the likelihood of being correct, we we need to choose the weights that produce the maximum likelihood of being correct.  We have two problems with the current form of the likelihood that makes it difficult to solve for the weights

  1. Machine Precisions – the product of small numbers is a smaller number.   At some point zero is the only representation of the result in the computer
  2. Algebriac complexity of the product rule for finding the direction to step in.

Thankfully, there is an acceptable way around these problems: by taking the natural log of the likelihood of being correct.  By the property of logs, the product turns into a sum.  There is also convention of taking the negative sign so that the problem is transformed from finding the weights that maximum likelihood of being correct to finding the weights the minimizes the negative log likelihood of being correct.

{-ln\left(\mathcal{L}(w)\right) = -\sum_{k} ln\left( p(y_{k}=1|x_{k},w)^{y} (1-p(y_{k}=1|x_{k},w))^{1-y_{k}}\right)}

{-ln\left(\mathcal{L}(w)\right) = -\sum_{k} y_{k} ln\left( p(y_{k}=1|x_{k},w)\right) - \Sigma_{k} (1-y_{k}) ln\left(1-p(y_{k}=1|x_{k},w)\right)}

Before we move on to the implementation, there is one last math value we need to derive, and that is the gradient of the negative log-likelihood function from above.

{-\frac{\partial}{\partial w^{i}} ln\left(\mathcal{L}(w)\right) = \sum_{k} \left( \frac{y_{k}}{p(y_{k}=1|x_{k},w)} -\frac{1-y_{k}}{1-p(y_{k}=1|x_{k},w)} \right) \frac{\partial}{\partial w^{j}} p(y_{k}=1|x_{k},w)}


{\frac{\partial}{\partial w^{j}} p(y_{k}=1|x_{k},w) = p(y_{k}=1|x_{k},w) \left( 1 - p(y_{k}=1|x_{k},w) \right) x^k_{i}}

As before the indexes i and  k represent the i^{\textrm{th}} variable in the k^{\textrm{th}} training example.

Combining these two equations leads to:

{-\frac{\partial}{\partial w^{i}} ln\left(\mathcal{L}(w)\right) = -\sum_{k} \left( y_{k}\left(1-p(y_{k}=1|x_{k},w)\right) - \left(1-y_{k}\right) p(y_{k}=1|x_{k},w) \right) x^i_k}

{-\frac{\partial}{\partial w^{i}} ln\left(\mathcal{L}(w)\right) = -\sum_{k} \left( y_{k} - p(y_{k}=1|x_{k},w) \right) x^i_k}

Because we have a cost function ( - ln\left(\mathcal{L}(w) \right)) we want to minimize, we can use gradient decent.

The algorithm follows:

  1. Set initial weight/coefficient values
  2. While the change cost function is larger than some threshold
    1. Find the gradient of the cost function
    2. Change the weight values in the direction of the gradient

Though simply described, there is a lot of nuance in practice.  One is that we must decide the right amount to change the weights in the direction of the gradient.   We will use a constant proportion of the gradient, known as the learning rate (\alpha), for our implementation.

The final weight update equation than becomes:

{w^{i} = w^{i} + \alpha \sum_{k} \left( y_{k} - p(y_{k}=1|x_{k},w) \right) x^k_i}

Pseudo-Code Logistic Regression

Before we get to the python code, lets write out the steps we are planning to implement.

  1. Setup
    1. Set stopping criteria
    2. Add an incept (w_{0}) to the training data
  2. Run Gradient Decent
    1. Get Log Likelihood Gradient
    2. Update Weights
    3. Get new Log Likelihood Value
    4. Stop if difference between old and new Log Likelihood is below threshold
  3. Use Model

Now lets implement Logistic Regression in python in the next post.

Thank You

I appreciate you taking the time to read this post.  I hope you gained something from it.  Feel free to reach out with any questions in the comments or directly through the email link in the menu.


Join the Conversation


  1. Hey Bryan

    Very thorough article! I thought dividing your work into two separate portions – theory and example – was useful.

    I had two comments:
    1. Explain we can’t find the optimal weights analytically – thus, we find them numerically.
    2. This pertains to the 2nd article – provide a quick background on overfitting and thus the need for CV techniques.

    Great work! It’s obvious you spent a considerable amount of time on it.

    1. Hi Jimmy,

      I really appreciate the feedback and suggestions.

      I agree that I glossed over the fact the weights can’t be found analytically for logistic regressions.

      Cross validation is a really useful/necessary process in practice, but I do not think it is inherit to logistic regression.

      I have post plans where I will show that the sampling data from distributions can lead to cost functions that do not have the same minimum as the cost function based on the underlying distributions
      minimum. I will also show how cross validation helps finding minimums closer to the true minimum.

      That post, however, is planned outside the context of logistic regression.

  2. Hi Bryan, thanks for sharing. But …. I can’t see any images (equations, formula) on this webpage. Nor do my friends. Is it possible get it to fixed?

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: