This is the third part in a series of posts where I implement logistic regression from scratch. In the first post I discussed the theory of logistic regression, and in the second post I implemented it in Python and compared the results to sklearn. In this post I will show how I implemented regularization from scratch.

In my previous post I also discussed the theory of regularization in the context of logistic regression, where I showed how different prior beliefs on the weights in the model lead to different update equations. This post will take advantage of the weight updates derived in that post.

### Update for LASSO Regularization (Laplace Prior)

$w^i = w^i + \alpha \ \Bigg ( \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) x_k^i - \lambda_1 \thinspace \textrm{sign}(w^i) \Bigg )$

### Update for Ridge Regularization (Gaussian Prior)

$w^i = w^i + \alpha \ \Bigg ( \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) x_k^i - \lambda_2 \thinspace w^i \Bigg )$
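Concretely, one pass of either update can be sketched in NumPy. This is an illustrative fragment, not the class code from this post; the toy `X`, `y`, learning rate, and penalty values are made-up placeholders:

```python
import numpy as np

def regularized_update(w, X, y, alpha, l1=0.0, l2=0.0):
    """One gradient ascent step on the loglikelihood with optional
    LASSO (l1) and Ridge (l2) penalty gradients subtracted."""
    p = 1.0/(1.0 + np.exp(-X.dot(w)))      #predicted P(y=1) per row
    grad = X.T.dot(y - p)                  #unregularized gradient
    grad -= l1*np.sign(w) + l2*w           #penalty gradients
    return w + alpha*grad

#toy data: 4 rows, an intercept column plus one feature
X = np.array([[1., 0.5], [1., -1.2], [1., 2.0], [1., -0.3]])
y = np.array([1., 0., 1., 0.])
w = regularized_update(np.zeros(2), X, y, alpha=0.1, l2=1.0)
#starting from w = 0, both penalty terms vanish on the first step
```

Starting from zero weights, sign(w) and w are both zero, so the first step is identical with or without regularization; the penalties only start pulling the weights back toward zero on later iterations.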

The simplest solution is to take the code from the previous implementation post and change the gradient function to include the regularization terms.

# Starter Code

This is the template code for logistic regression, similar to what I implemented in the previous post:

```python
import numpy as np

class LogisticRegression:

    def __init__(self,X,y,tolerance=1e-5,l1=0.,l2=0.):
        """Initializes Class for Logistic Regression

        Parameters
        ----------
        X : ndarray(n-rows,m-features)
            Numerical training data.

        y : ndarray(n-rows,)
            Integer training labels.

        tolerance : float (default 1e-5)
            Stopping threshold difference in the loglikelihood between iterations.

        l1 : float (default 0)
            LASSO regularization value

        l2 : float (default 0)
            Ridge regularization value
        """
        self.tolerance = tolerance
        self.labels = y.reshape(y.size,1)
        #One weight per feature plus an intercept
        self.w = np.zeros((X.shape[1]+1,1))
        self.features = np.ones((X.shape[0],X.shape[1]+1))
        self.features[:,1:] = X
        self.shuffled_features = self.features
        self.shuffled_labels = self.labels
        self.l1 = l1
        self.l2 = l2
        self.likelihood_history = []

    def log_likelihood(self):
        """Calculate the negative loglikelihood for the current set of weights and features.

        Returns
        -------
        out : float
        """
        #Get Probabilities
        p = self.probability()
        #Get Log Likelihood For Each Row of Dataset
        loglikelihood = self.labels*np.log(p+1e-24) + (1-self.labels)*np.log(1-p+1e-24)
        loglikelihood = -1*loglikelihood.sum()
        #Return Sum
        return loglikelihood

    def probability(self):
        """Computes the logistic probability of being a positive example

        Returns
        -------
        out : ndarray (n-rows,1)
            Probability of being a positive example
        """
        return 1/(1+np.exp(-self.features.dot(self.w)))

    def log_likelihood_gradient(self):
        """Calculate the loglikelihood gradient for the current set of weights and features.

        Returns
        -------
        out : ndarray(n features, 1)
            gradient of the loglikelihood
        """
        error = self.labels-self.probability()
        product = error*self.features
        return product.sum(axis=0).reshape(self.w.shape)

    def gradient_decent(self,alpha,max_iterations=1e4):
        """Runs the gradient decent algorithm

        Parameters
        ----------
        alpha : float
            The learning rate for the algorithm

        max_iterations : int
            The maximum number of iterations allowed to run before the algorithm terminates
        """
        previous_likelihood = self.log_likelihood()
        difference = self.tolerance+1
        iteration = 0
        self.likelihood_history = [previous_likelihood]
        while (difference > self.tolerance) and (iteration < max_iterations):
            self.w = self.w + alpha*self.log_likelihood_gradient()
            temp = self.log_likelihood()
            difference = np.abs(temp-previous_likelihood)
            previous_likelihood = temp
            self.likelihood_history.append(previous_likelihood)
            iteration += 1

    def predict_probabilty(self,X):
        """Computes the logistic probability of being a positive example

        Parameters
        ----------
        X : ndarray (n-rows,n-features)
            Test data to score using the current weights

        Returns
        -------
        out : ndarray (n-rows,1)
            Probability of being a positive example
        """
        features = np.ones((X.shape[0],X.shape[1]+1))
        #Score the raw features, matching how the model was trained
        features[:,1:] = X
        return 1/(1+np.exp(-features.dot(self.w)))

    def get_coefficients(self):
        return self.w.T
```


I will be using the same data as in the regularization post:

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
%matplotlib inline

def generate_1D_data(loc_positive=1,loc_negative=-1,size=100):
    X = np.hstack((np.random.normal(loc=loc_positive,size=size),
                   np.random.normal(loc=loc_negative,size=size)))
    X = X.reshape(X.size,1)
    y = np.hstack((np.ones(size),np.zeros(size)))
    return X,y

X,y = generate_1D_data()
plt.figure(figsize=(16,8))
plt.hist(X[y==1],alpha=0.5,color='steelblue',label='Positive Example')
plt.hist(X[y==0],alpha=0.5,color='darkred',label='Negative Example')
plt.xlabel('Variable Value')
plt.ylabel('Count')
plt.title('Histogram of Simulated 1D Data')
plt.show()
```

Using the data generated above, my implementation produces the following results:

```python
log = LogisticRegression(X,y,tolerance=1e-6,l1=0,l2=0)
print 'Cost Before Fit:', log.log_likelihood()
#fit with gradient decent (learning rate hand-tuned for this toy data)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
print 'Cost After Fit:', log.log_likelihood()
print 'Intercept: ',coef[0,0]
print 'Slope: ',coef[0,1]
```


```
Cost Before Fit: 138.629436112
Cost After Fit: 62.6436146605
Intercept: -0.0574759097926
Slope: 2.31981998386
```

We can compare this to sklearn’s implementation of logistic regression to make sure the results still match.

```python
from sklearn.linear_model import LogisticRegression as SKLogisticRegression
logSK = SKLogisticRegression(tol=1e-6,C=1e35)
logSK.fit(X.reshape(X.size,1),y)
print "Intercept: ",logSK.intercept_[0]
print "Slope: ",logSK.coef_[0,0]
```


```
Intercept: -0.0576359266481
Slope: 2.32334711964
```

The results are very close. Let's update the code to include regularization.

# Update Code

I have to update two functions to implement regularization: log_likelihood and log_likelihood_gradient.

## Log Likelihood Code

```python
def log_likelihood(self):
    """Calculate the negative loglikelihood for the current set of weights and features.

    Returns
    -------
    out : float
    """
    #Get Probabilities
    p = self.probability()
    #Get Log Likelihood For Each Row of Dataset
    loglikelihood = self.labels*np.log(p+1e-24) + (1-self.labels)*np.log(1-p+1e-24)
    loglikelihood = -1*loglikelihood.sum()
    #Add the LASSO and Ridge penalty terms
    loglikelihood += self.l1*np.abs(self.w).sum()
    loglikelihood += self.l2*np.power(self.w, 2).sum()/2
    #Return Sum
    return loglikelihood
```


Ideally we would check that only one of l1 or l2 is set, but this is only toy code. The first added term implements the LASSO regularization in the cost function. $\textrm{LASSO Cost} = \lambda_1 \thinspace \sum_k |w^k|$

The second added term implements the Ridge regularization in the cost function. $\textrm{Ridge Cost} = \frac{1}{2} \thinspace \lambda_2 \thinspace \sum_k |w^k|^2$
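A quick numeric sanity check of these two penalty terms, using made-up weights and λ values:

```python
import numpy as np

#hypothetical weights and penalty strengths, for illustration only
w = np.array([0.5, -2.0, 1.5])
l1, l2 = 10.0, 100.0

lasso_cost = l1*np.abs(w).sum()            #lambda_1 * sum |w^k|
ridge_cost = l2*np.power(w, 2).sum()/2     #(lambda_2 / 2) * sum (w^k)^2

assert lasso_cost == 40.0    #10 * (0.5 + 2.0 + 1.5)
assert ridge_cost == 325.0   #100 * (0.25 + 4.0 + 2.25) / 2
```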

The next code update is to the gradient equations, as shown at the beginning of the post.

## Log Likelihood Gradient Code

```python
def log_likelihood_gradient(self):
    """Calculate the loglikelihood gradient for the current set of weights and features.

    Returns
    -------
    out : ndarray(n features, 1)
        gradient of the loglikelihood
    """
    error = self.labels-self.probability()
    product = error*self.features
    #Subtract the LASSO and Ridge penalty gradients
    return product.sum(axis=0).reshape(self.w.shape) \
        - self.l1*np.sign(self.w) \
        - self.l2*self.w
```


The first subtracted term is the gradient of the LASSO regularization. $\frac{\partial}{\partial w^k} \textrm{LASSO Cost} = \lambda_1 \thinspace \textrm{sign}(w^k)$

The second subtracted term is the gradient of the Ridge regularization. $\frac{\partial}{\partial w^k} \textrm{Ridge Cost} = \lambda_2 \thinspace w^k$
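These analytic penalty gradients can be checked against a central finite-difference approximation of the penalty itself (a standalone sketch with arbitrary nonzero weights):

```python
import numpy as np

def penalty(w, l1, l2):
    """Total penalty added to the cost: l1*sum|w| + (l2/2)*sum w^2."""
    return l1*np.abs(w).sum() + l2*np.power(w, 2).sum()/2

def penalty_grad(w, l1, l2):
    """Analytic penalty gradient: l1*sign(w) + l2*w."""
    return l1*np.sign(w) + l2*w

w = np.array([0.8, -1.3, 2.1])   #arbitrary nonzero weights
l1, l2 = 10.0, 100.0
eps = 1e-6

#central finite differences, one coordinate at a time
numeric = np.zeros_like(w)
for i in range(w.size):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (penalty(w+step, l1, l2) - penalty(w-step, l1, l2))/(2*eps)

assert np.allclose(numeric, penalty_grad(w, l1, l2), atol=1e-4)
```

Note the check only holds away from w = 0, where the absolute value is not differentiable; at zero the sign(w) term is a subgradient choice.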

# Checking the Implementation

I will compare this implementation to sklearn on a couple of toy examples.
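One wrinkle when comparing: sklearn's LogisticRegression takes C, the inverse of the regularization strength, so λ = 100 corresponds to C = 0.01 and λ = 10 to C = 0.1. A trivial helper makes the mapping explicit:

```python
def sklearn_C_from_lambda(lam):
    """sklearn's C is the inverse regularization strength: C = 1/lambda."""
    return 1.0/lam

assert sklearn_C_from_lambda(100) == 0.01
assert sklearn_C_from_lambda(10) == 0.1
```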

## Ridge (Gaussian Prior)

### Lambda = 100, C = 0.01

```python
from LogisticRegressionRegularization import LogisticRegression as BTSLogisticRegression
log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=0,l2=100)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
logSK = SKLogisticRegression(tol=1e-9,C=0.01)
logSK.fit(X.reshape(X.size,1),y)
print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0,0]
print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[0,1]
```


```
Sklearn Intercept, Bryan Intercept: -0.00217832261688 -0.00217735345805
Sklearn Slope, Bryan Slope: 0.548644577533 0.54864036825
```

### Lambda = 10, C = 0.1

```python
log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=0,l2=10)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
logSK = SKLogisticRegression(tol=1e-9,C=0.1)
logSK.fit(X.reshape(X.size,1),y)
print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0,0]
print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[0,1]
```


```
Sklearn Intercept, Bryan Intercept: -0.021489688722 -0.0214850033177
Sklearn Slope, Bryan Slope: 1.40733200454 1.40730512612
```

## LASSO (Laplace Prior)

### Lambda = 100, C = 0.01

```python
log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=100,l2=0)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
logSK = SKLogisticRegression(tol=1e-9,C=0.01,penalty='l1')
logSK.fit(X.reshape(X.size,1),y)
print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0,0]
print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[0,1]
```


```
Sklearn Intercept, Bryan Intercept: 0.0 -0.0524505475518
Sklearn Slope, Bryan Slope: 0.0892273998403 0.089274906091
```

### Lambda = 10, C = 0.1

```python
log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=10,l2=0)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
logSK = SKLogisticRegression(tol=1e-9,C=0.1,penalty='l1')
logSK.fit(X.reshape(X.size,1),y)
print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0,0]
print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[0,1]
```


```
Sklearn Intercept, Bryan Intercept: 0.0 -0.00436123552719
Sklearn Slope, Bryan Slope: 1.58593771765 1.58595857714
```

# Review

In this post I made minimal changes to the logistic regression code to implement Ridge and LASSO regularization, and I showed that the results from this implementation are consistent with sklearn on a simple toy model. The simple gradient descent algorithm, with its manual tuning of the learning rate, keeps this method from being production ready. Attempting to use this code on larger and more nuanced data will lead to difficulties with training time and with finding the ‘optimal’ learning rate.

In the next and last post in this series, I am going to alter the cost functions/regularization to encode priors that reflect experience, data, or conclusions that are not captured in the dataset.

# Thank You

I appreciate you taking the time to read this post. I hope you gained something from it. Feel free to reach out with any questions in the comments or directly through the email link in the menu.
