# Introduction

I wrote about the theory of Logistic Regression in the previous article where I highlighted that the algorithm is attempting to find the weights/coefficients that maximizes the likelihood that the estimates of the probability/labels are correct.  Through algebriac manipulation, the problem is reframed as finding the weights that produce minimum negative log-likelihood.   One way to approach this problem is to pick some weights, then move along this function by changing the weights in a way that tend to lower the value of the function for the new weights.   In theory this will eventually converge, but in practice we stop when it is good enough.

Note:  What follows is a naive implementation of gradient descent and stocastic gradient descent.  I do not recommend using these methods in practice because they are not very robust, require tuning, and are slower than standard implementation that take advantage of well developed optimization libraries.  Follow up articles will attempt to address these, but for now we will just move along with the naive approach.

That being said here is the final model I develop for predicting the color of wine compared to sklearn’s Logistic Regression.

# Github

I have a Github repo with some of the code from this series:  https://github.com/bryantravissmith/Logistic-Regression-From-Scratch

# Data

In order to test and compare our implementation of Logistic Regression we will need data.   I will use two data sets for this article.  One is a toy dataset with two variables and an integer label.  The other is a the combining of two wine datasets from the UCI Machine Learning Repository

The gradient descent algorithm assumes you have some function, and we will assume that it is some some data ($x$) and weight ($w$) variables where the data if fixed but the weights are to be determined.  The function $f(x,w)$ gives us some value for a given data, weight combination and we want to find the weights that gives us the lowest value.  Assuming the function is appropriately well-behaved, we can choose some weights then continue to update the weights using the following update rule:

${w^{i} = w^{i} - \alpha \frac{\partial}{\partial w^{i}} f(x,w)}$

The index $i$ is the index of the $i^{\textrm{th}}$ feature in $x$.  The negative sign is because we are trying to descend down the cost function.  If the slope/derivative is positive, we want to move in the negative direction.  If the slope/derivative is negative we want to move in the positive direction.

For Logistic Regression we have the following cost function:

${-ln\mathcal{L}(w) = -\sum_{k} \Bigg[y_{k} \ ln\big( p(y_{k}=1|x_{k},w)\big) + (1-y_{k}) ln\big(1-p(y_{k}=1|x_{k},w)\big)\Bigg]}$

The variables $x$ and $y$ are the features and labels for the training data, and $k$ is the $k^{\textrm{th}}$ example in the training data.

The gradient of the Logistic cost function is:

${-\frac{\partial}{\partial w^{i}} ln\mathcal{L}(w) = -\sum_{k} \Big( y_{k} - p(y_{k}=1|x_{k},w) \Big) x^i_k}$

And our update rule will be:

${w^{i} = w^{i} + \alpha \sum_{k} \Big( y_{k} - p(y_{k}=1|x_{k},w) \Big) x^k_i}$

You might have noticed the negative sign switched to a positive value, and that is because of the negative sign is the gradient of the negative log-likelihood.

# Code

I am going to implement the Logistic Regression as a class, and start with the following template.  The methods are broken up in a way to highlight the fact that I will be implementing two forms of gradient decent.

2. Stochastic Gradient Descent (Using data point at a time for weight updates)

There are tradeoffs for both methods.  There is also in between implementations where you can use subsets/batches of the data for weight updates.

## Template Code

import numpy as np
class LogisticRegression:

def __init__(self,X,y,tolerance=1e-5):
&quot;&quot;&quot;Initializes Class for Logistic Regression

Parameters
----------
X : ndarray(n-rows,m-features)
Numerical training data.

y: ndarray(n-rows,)
Interger training labels.

tolerance : float (default 1e-5)
Stopping threshold difference in the loglikelihood between iterations.

&quot;&quot;&quot;
self.tolerance = tolerance
self.labels = y.reshape(y.size,1)
#create weights equal to zero with an intercept coefficent at index 0
self.w = np.zeros((X.shape[1]+1,1))
#Add Intercept Data Point of 1 to each row
self.features = np.ones((X.shape[0],X.shape[1]+1))
self.features[:,1:] = X
self.shuffled_features = self.features
self.shuffled_labels = self.labels
self.likelihood_history = []

def probability(self):
&quot;&quot;&quot;Computes the logistic probability of being a positive exampe

Returns
-------
out : ndarray (1,)
Probablity of being a positive example
&quot;&quot;&quot;
pass

def log_likelihood(self):
&quot;&quot;&quot;Calculate the loglikelihood for the current set of weights and features.

Returns
-------
out : float
&quot;&quot;&quot;
pass

&quot;&quot;&quot;Calculate the loglikelihood gradient for the current set of weights and features.

Returns
-------
out : ndarray(n features, 1)
&quot;&quot;&quot;
pass

Parameters
----------
alpha : float
The learning rate for the algorithm

max_iterations : int
The maximum number of iterations allowed to run before the algorithm terminates

&quot;&quot;&quot;
pass

def row_probability(self,row):
&quot;&quot;&quot;Computes the logistic probability of being a positive example for a given row

Parameters
----------
row : int
Row from feature matrix with to calculate the probablity.

Returns
-------
out : ndarray (1,)
Probablity of being a positive example
&quot;&quot;&quot;
pass

&quot;&quot;&quot;Computes the loglikelihood gradient for a given row

Parameters
----------
row : int
Row from feature matrix with to calculate the probablity.

Returns
-------
out : ndarray(n features, 1)
&quot;&quot;&quot;
pass

Parameters
----------
alpha : float
The learning rate for the algorithm

max_iterations : int
The maximum number of iterations allowed to run before the algorithm terminates

&quot;&quot;&quot;
pass

def predict_probabilty(self,X):
&quot;&quot;&quot;Computes the logistic probability of being a positive example

Parameters
----------
X : ndarray (n-rows,n-features)
Test data to score using the current weights

Returns
-------
out : ndarray (1,)
Probablity of being a positive example
&quot;&quot;&quot;
pass


This implementation requires that you initialize the class with the training data.  The features are in a 2D numeric numpy array, and the labels are in a 1D numeric numpy array.  Additionally I am adding an intercept to the data so that the  zero-indexed weight is the intercept, and the follow weights correspond to the order columns in the feature matrix X.

I will implement method by method for the standard gradient decent algorithm in the following order:

1. Probability Calculation of the Training Data
2. Log-Likelihood of the Training Data
3. Log-Likelihood Gradient of Training Data

### Probability Calculation

def probability(self):
&quot;&quot;&quot;Computes the logistic probability of being a positive example

Returns
-------
out : ndarray (1,)
Probablity of being a positive example
&quot;&quot;&quot;
return 1/(1+np.exp(-self.features.dot(self.w)))


### Log-Likelihood Calculation

def log_likelihood(self):
&quot;&quot;&quot;Calculate the loglikelihood for the current set of weights and features.

Returns
-------
out : float
&quot;&quot;&quot;
#Get Probablities
p = self.probability()
#Get Log Likelihood For Each Row of Dataset
loglikelihood = self.labels*np.log(p+1e-24) + (1-self.labels)*np.log(1-p+1e-24)
#Return Sum
return -1*loglikelihood.sum()


def log_likelihood_gradient(self):
&quot;&quot;&quot;Calculate the loglikelihood gradient for the current set of weights and features.

Returns
-------
out : ndarray(n features, 1)
&quot;&quot;&quot;
error = self.labels-self.probability()
product = error*self.features
return product.sum(axis=0).reshape(self.w.shape)


def gradient_decent(self,alpha=1e-7,max_iterations=1e4):

Parameters
----------
alpha : float
The learning rate for the algorithm

max_iterations : int
The maximum number of iterations allowed to run before the algorithm terminates

&quot;&quot;&quot;
previous_likelihood = self.log_likelihood()
difference = self.tolerance+1
iteration = 0
self.likelihood_history = [previous_likelihood]
while (difference gt; self.tolerance) and (iteration lt; max_iterations):
temp = self.log_likelihood()
difference = np.abs(temp-previous_likelihood)
previous_likelihood = temp
self.likelihood_history.append(previous_likelihood)
iteration += 1


I will be copying the implementation above in the method by method fashion except for the log-likelihood (already done!):

1. Probability Calculation  for a single row of the Training Data
2. Log-Likelihood Gradient for a single row of Training Data

### Probability Calculation by Row

def row_probability(self,row):
&quot;&quot;&quot;Computes the logistic probability of being a positive example for a given row

Parameters
----------
row : int
Row from feature matrix with to calculate the probablity.

Returns
-------
out : ndarray (1,)
Probablity of being a positive example
&quot;&quot;&quot;
return 1/(1+np.exp(-self.shuffled_features[row,:].dot(self.w)))


def row_log_likelihood_gradient(self,row):
&quot;&quot;&quot;Computes the loglikelihood gradient for a given row

Parameters
----------
row : int
Row from feature matrix with to calculate the probablity.

Returns
-------
out : ndarray(n features, 1)
&quot;&quot;&quot;
error = self.shuffled_labels[row]-self.row_probability(row)
product = self.shuffled_features[row,:]*error
return product.reshape(self.features.shape[1],1)




Parameters
----------
alpha : float
The learning rate for the algorithm

max_iterations : int
The maximum number of iterations allowed to run before the algorithm terminates

&quot;&quot;&quot;
difference = self.tolerance+1.
previous_likelihood = self.log_likelihood()
rows = range(len(self.features))
np.random.shuffle(rows)
self.shuffled_features = self.shuffled_features[rows,:]
self.shuffled_labels = self.shuffled_labels[rows]
iteration = 0
self.likelihood_history = [previous_likelihood]

while (difference gt; self.tolerance)  (iteration lt; max_iterations):
for i in xrange(len(self.features)):
temp = self.log_likelihood()
difference = np.abs(temp - previous_likelihood)

#print previous_likelihood, temp, difference

previous_likelihood = temp

np.random.shuffle(rows)
self.shuffled_features = self.shuffled_features[rows,:]
self.shuffled_labels = self.shuffled_labels[rows]
iteration += 1
self.likelihood_history.append(previous_likelihood)



## Review

I have implemented the logistic regression algorithm using both gradient descent and stochastic gradient descent.  Now I am going to compare this with the sklearn implementation using the Toy and Wine datasets from above.

# Comparing to sklearn

In this section I will be comparing my implementation of logistic regression with sklearn.  Let us start with the toy dataset.

## Toy Data

from LogisticRegression import LogisticRegression as LogisticRegressionBTS
from sklearn.linear_model import LogisticRegression as LogisticRegressionSKL
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.figure(figsize=(16,8))
ax=plt.gca()
toy.loc[toy.label==0,:].plot(kind='scatter',x='var1',y='var2',label=quot;Negative Examplequot;,color='#AA0000',ax=ax,s=100)
toy.loc[toy.label==1,:].plot(kind='scatter',x='var1',y='var2',label=quot;Positive Examplequot;,color='#00AA00',ax=ax,s=100)
plt.legend(loc='best')
plt.title(quot;Toy Dataquot;)
plt.show()


The positive examples are well separated, and the first variable does not do much for informing if an example is positive or negative.

Running gradient descent on the toy dataset using the following code in an iPython notebook

X = toy.values[:,:2]
y = toy.values[:,2]
logBTS = LogisticRegressionBTS(X,y,tolerance=1e-6)
print logBTS.w.T,logBTS.log_likelihood()


produces the following results:

CPU times: user 549 ms, sys: 5.1 ms, total: 554 ms
Wall time: 552 ms
[[-26.92062149 -0.18584114 8.64366537]] 8.30463361158

This produces an intercept of -26.9, a coefficient for var1 of -0.189, and coefficient for var2 of 8.64.   The negative log-likelihood for these coefficients is 8.3.

For the stochastic gradient descent we get the following results

logBTS = LogisticRegressionBTS(X,y,tolerance=1e-6)
print logBTS.w.T,logBTS.log_likelihood()


Which produces the following results:

CPU times: user 23.8 s, sys: 89.6 ms, total: 23.9 s
Wall time: 23.9 s
[[-27.67652405  -0.15754055   8.89458409]] 8.33243660654

The weights/coefficients and the negative log-likelihood are similar in value as the regular gradient decent, but the time is almost 400x longer to computer on this data set.

Now lets compare to sklearn’s implementation.  Because sklearn implements regularization (which I will do in the next post, we have to set the ‘C’ parameter to a large value to reduce its effect.

logSKL = LogisticRegressionSKL(tol=1e-6,C=1e15)
%time logSKL.fit(X,y)
coef = np.append(logSKL.intercept_,logSKL.coef_).reshape(3,1)
temp = LogisticRegressionBTS(X,y,tolerance=1e-6)
temp.w = coef
print coef.T, temp.log_likelihood()


This produces the following results:

CPU times: user 1.02 ms, sys: 552 µs, total: 1.58 ms
Wall time: 1.12 ms
[[-27.7534122   -0.19149536   8.92554005]] 8.30156580374

This produces an intercept of -27.7 (-26.9 of my implementation), a coefficient for var1 of -0191 (-0.189 of my implementation), and coefficient for var2 of 8.93 (8.64 for my implementation).   The negative log-likelihood for these coefficients is the same 8.3.

The weights/coefficients and negative log-likelihood produce by sklearn are very similar to my implementation.  The speed, however, is very much faster than my implementation.   As well it should be, because under the hood it is using LIBLINEAR (though that is not the whole story – to be continued).

## Wine Data

Now we are going to run the same tests on the wine dataset.  This is going to really highlight the difficulty in implementing your own version of a machine learning algorithm.

wine = pd.read_csv(quot;wine.csvquot;)
y = (wine.Type=='Red').values.astype(int)
X = wine.loc[:,wine.columns[0:11]].values
logBTS = LogisticRegressionBTS(X,y,tolerance=1e-6)
print logBTS.w.T,logBTS.log_likelihood()


This produces the following results:

CPU times: user 14min 44s, sys: 1min 1s, total: 15min 46s
Wall time: 16min 3s
[[ -8.08877274   0.84694802  12.70627186   0.02073947  -0.16034602
6.11128278   0.0561014   -0.06567519  -7.46192366   3.65361126
10.6676526   -0.88305311]] 373.208256779

The gradient descent algorithm ran for 15 minutes and did not converge.   In contrast, we can look at the look at sklearns results.

logSKL = LogisticRegressionSKL(tol=1e-6,C=1e15)
%time logSKL.fit(X,y)
coef = np.append(logSKL.intercept_,logSKL.coef_).reshape(12,1)
temp = LogisticRegressionBTS(X,y,tolerance=1e-6)
temp.w = coef
print coef.T, temp.log_likelihood()


This produces the following results:

CPU times: user 188 ms, sys: 2.02 ms, total: 190 ms
Wall time: 192 ms [[ -1.73626291e+03 -2.91001543e-01 6.47063319e+00 -2.45603390e+00 -9.03523744e-01 2.24928967e+01 6.54734512e-02 -5.30980797e-02 1.73329397e+03 -1.16165896e+00 3.33314373e+00 1.76592277e+00]] 214.601181992

The results converge, and do so in less than 200 ms.  The coefficients are very different between the two implementations.

In gradient descent we have to naively set the learning rate, but there are algorithms that adjust the learning rate appropriately.   I believe LIBLINEAR uses a coordinate descent algorithm.

More importantly, we are applying the same learning rate to all the weights, but the weights in the wine dataset have very different scales (means) and distributions (standard deviations).   More on this later.

Since my implementation did not converge, I can adjust the tolerance of the sklearn implementation to a higher tolerance and see if the results are consistent.

logSKL = LogisticRegressionSKL(tol=5.4e-4,C=1e15)
%time logSKL.fit(X,y)
coef = np.append(logSKL.intercept_,logSKL.coef_).reshape(12,1)
temp = LogisticRegressionBTS(X,y,tolerance=1e-6)
temp.w = coef
print coef.T, temp.log_likelihood()


This produces the following results:

CPU times: user 44 ms, sys: 1.02 ms, total: 45 ms
Wall time: 44.6 ms
[[ -9.53934274   0.8029794   12.85814315   0.58228827  -0.148062
6.85134198   0.05505118  -0.06961054  -8.8251745    4.66868577
10.31003786  -0.91916414]] 352.331151996

The coefficients are very similar now that the negative log-likelihood values are similar.

# Scaling Variables

My implementation of Logistic Regression took an unreasonable amount of time.   Even though it will eventually find the minimum, we can help the algorithm find the minimum, as usually, by redefining the problem.   The data in the wine quality dataset has very different scales and distributions, so the gradient will be biased by both the scale and distribution of the variables.  It will still converge, but in slower time.  The slow time is because of how the data is represented, but not in the information contained in the data.

One way around this to scale the variables to similar sizes and ranges:

${x^{'} = \frac{x - \bar{x}}{\sigma_{x}}}$

Where $x^{'}$ is the scaled variable, $\bar{x}$ is the average of the unscaled variable, and $\sigma_{x}$ is the standard deviation of the unscaled variable.

The algorithm should find the same minimum, and produce the same estimate of probabilities, so the following equation must be true:

${w^{'}_{0} + \sum^{N}_{i=1} w^{i'} x^{'}_{i} =w_{0} + \sum^N_{i=0} w^{i} x_{i} }$

In this case the primed variables are the scaled variables and the resulting weights from using the scaled variables, and the unprimed variables are the unscaled variables and the corresponding resulting weights.

If we plug in the scaled equation into the left hand side of the above equation, we can produce the following conversion between the weights produced by gradient decent applied to the scaled variables, and gradient decent applied to the unscaled variables.

${w^{'}_{0} + \sum^{N}_{i=1} w^{i'}\frac{x_{i} - \bar{x_{i}}}{\sigma_{x_{i}}} = w_{0} + \sum^N_{i=0} w^{i} x_{i} }$

${\Big(w^{'}_{0} - \sum^{N}_{i=1} \frac{w^{i'} \bar{x_{i}}}{\sigma_{x_{i}}} \Big) + \sum^{N}_{i=1} \frac{w^{i'}}{\sigma_{x_{i}}} x_{i} = w_{0}+ \sum^{N}_{i=1} w^{i} x_{i}}$

So that the unscaled intercept is:

${w_{0} = \Big(w^{'}_{0} - \sum^{N}_{i=1} \frac{w^{i'}\bar{x_{i}}}{\sigma_{x_{i}}} \Big)}$

And the unscaled weights are:

${w_{i} = \frac{w^{i'}}{\sigma_{x_{i}}}}$

I am going to scale the wine dataset into a new scaled dataset, fit the new dataset, and transform the coefficients.

mean = X.mean(axis=0)
std = X.std(axis=0)
XP = (X-mean)/std
logBTS = LogisticRegressionBTS(XP,y,tolerance=1e-6)
new_coef = logBTS.w.T[0]/np.hstack((1,std))
new_coef[0] = logBTS.w.T[0][0]-(mean*logBTS.w.T[0][1:]/std).sum()
new_coef,logBTS.log_likelihood()


CPU times: user 452 ms, sys: 29.9 ms, total: 482 ms
Wall time: 468 ms
[ -1.84343985e+03  -3.86566692e-01   6.22415134e+00  -2.60396655e+00
-9.46160941e-01   2.22425572e+01   6.69735358e-02  -5.32335410e-02
1.84259358e+03  -1.72842890e+00   3.09482351e+00   1.90221249e+00] 214.434914995

The scaled data only took 3x the sklearn learn fitting time for the same initial tolerance.  The log-likelihood for these two methods are almost identical.   Here are the comparison between the models as shown at the beginning of the post.

# Review

In this post I implemented Logistic Regression and reproduce results similar to sklearn’s implementation.   The goal was to show how the algorithm works, but also how implementation of machine learning algorithms require more than just algorithms.  The algorithm must be smartly applied.  Not having an intuition gradient decent and the effect of variable scaling can lead to sub-optimal performance, for instance.  The difference on the wine dataset was over a 2500x increase in training time for results that did not converge.

I personally enjoy the process of reproducing the sklearn method.  In the next post in the series I will add regularization to the algorithm.

# Thank You

I appreciate you taking the time to read this post.  I hope you gained something from it.  Feel free to reach out with any questions in the comments or directly through the email link in the menu.

## Join the Conversation

1. martin says:

I think Log-Likelihood Calculation code is missing in your post since you pasted in the same code as for log_likelihood function. Other than that do you have your code available on github or other source? Thanks. Great articles, happy to learn from them.

2. mat says:

Thanks for a great article. Can you make the code for the log_likelihood function available? Like the earlier poster said, it was miss-copied.

Thanks again

3. Yeah – I’ll get everything in a github soon. Sorry about the mistake. Can you pinpoint it so I can fix it quickly in the post?

1. mat says:

it is in Implementing Standard Gradient Descent, Log-Likelihood Calculation.

2. function for log likelihood is missing can you include the code block for this.

4. sujay says:

If we have x1 and x2 in features as for the Toy dataset, why do you initialize an extra column in features and an extra row in weights?
This is in __init__ method.

1. The extra column in the data set is to add an intercept to data (x1, x2) -> (1, x1, x2). This is (intercept, feature1, feature2). The extra row in the weights is to fit the intercept (w1, w2) -> (b, w1, d2). This would be (intercept value, feature1 weight, feature2 weight)

1. sujay says:

Got it. Thanks

5. sujay says:

I had another question. When I used the value of alpha(learning rate) from the ‘toy’ dataset while training the ‘wine’ dataset, then I get an overflow error while calculating the probability. Is there any reason for this?
Basically wanted to understand why you have taken different learning rates for both the datasets.

1. Are you using this on the un-normalized data? If so, the scales of the variables are different. Large variable values in a dataset will likely need a smaller learning rate to converge. Let me know if that doesn’t make sense.