This is a third part in a series of posts where I am implementing logistic regression from scratch. In the first post I discussed the theory of logistic regression, and in the second post I implemented it in python providing comparison to sklearn. In this post I will be showing how I implemented regularization from scratch.

In my previous post I also discussed the theory of regularization in the context of logistic regression where I show how different prior beliefs on the weights in the model lead to different update equations. This post will take advantage of the weight updates showed in that post.

### Update for LASSO Regularization (Laplace Prior)

### Update for Ridge Regularization (Gaussian Prior)

The simplest solution is to take the code from the previous post on implementing the code and just change the gradient function to include the regularized terms.

# Starter Code

This is the template code of logistic regression similar to what I implemented in the previous post

import numpy as np class LogisticRegression: def __init__(self,X,y,tolerance=1e-5,l1=0.,l2=0.): """Initializes Class for Logistic Regression Parameters ---------- X : ndarray(n-rows,m-features) Numerical training data. y: ndarray(n-rows,) Interger training labels. tolerance : float (default 1e-5) Stopping threshold difference in the loglikelihood between iterations. l1: float (default 0) LASSO regularization value l2: float (default 0) Ridge regularization value """ self.tolerance = tolerance self.labels = y.reshape(y.size,1) self.w = np.zeros((X.shape[1]+1,1)) self.features = np.ones((X.shape[0],X.shape[1]+1)) self.features[:,1:] = X self.shuffled_features = self.features self.shuffled_labels = self.labels self.l1=l1 self.l2=l2 self.likelihood_history = [] def log_likelihood(self): """Calculate the loglikelihood for the current set of weights and features. Returns ------- out : float """ #Get Probablities p = self.probability() #Get Log Likelihood For Each Row of Dataset loglikelihood = self.labels*np.log(p+1e-24) + (1-self.labels)*np.log(1-p+1e-24) loglikelihood = -1*loglikelihood.sum() #Return Sum return loglikelihood def probability(self): """Computes the logistic probability of being a positive example Returns ------- out : ndarray (1,) Probablity of being a positive example """ return 1/(1+np.exp(-self.features.dot(self.w))) def log_likelihood_gradient(self): """Calculate the loglikelihood gradient for the current set of weights and features. Returns ------- out : ndarray(n features, 1) gradient of the loglikelihood """ error = self.labels-self.probability() product = error*self.features grad = product.sum(axis=0).reshape(self.w.shape) return grad def gradient_decent(self,alpha=1e-7,max_iterations=1e4): """Runs the gradient decent algorithm Parameters ---------- alpha : float The learning rate for the algorithm max_iterations : int The maximum number of iterations allowed to run before the algorithm terminates """ previous_likelihood = self.log_likelihood() difference = self.tolerance+1 iteration = 0 self.likelihood_history = [previous_likelihood] while (difference > self.tolerance) and (iteration < max_iterations): self.w = self.w + alpha*self.log_likelihood_gradient() temp = self.log_likelihood() difference = np.abs(temp-previous_likelihood) previous_likelihood = temp self.likelihood_history.append(previous_likelihood) iteration += 1 def predict_probabilty(self,X): """Computes the logistic probability of being a positive example Parameters ---------- X : ndarray (n-rows,n-features) Test data to score using the current weights Returns ------- out : ndarray (1,) Probablity of being a positive example """ features = np.ones((X.shape[0],X.shape[1]+1)) features[:,1:] = (X-self.mean_x)/self.std_x return 1/(1+np.exp(-features.dot(self.w))) def get_coefficients(self): return self.w.T[0]

I will be using the same data from the the regularization post:

import numpy as np import matplotlib.pyplot as plt import scipy as sp %matplotlib inline def generate_1D_data(loc_positive=1,loc_negative=-1,size=100): X = np.hstack((np.random.normal(loc=loc_positive,size=size),np.random.normal(loc=loc_negative,size=size))) X = X.reshape(X.size,1) y = np.hstack((np.ones(size),np.zeros(size))) return X,y X,y = generate_1D_data() plt.figure(figsize=(16,8)) plt.hist(X[y==1],alpha=0.5,color='steelblue',label='Positive Example') plt.hist(X[y==0],alpha=0.5,color='darkred',label='Negative Example') plt.xlabel('Variable Value') plt.ylabel('Count') plt.title('Histogram of Simulated 1D Data') plt.show()

Using the data generated from above, my implementation generates the following results:

log = LogisticRegression(X,y,tolerance=1e-6,l1=0,l2=0) print 'Cost Before Fit:', log.log_likelihood() log.gradient_decent(alpha=1e-3) coef = log.get_coefficients() print 'Cost After Fit:', log.log_likelihood() print 'Intercept: ',coef[0] print 'Slope: ',coef[1]

Cost Before Fit: 138.629436112

Cost After Fit: 62.6436146605

Intercept: -0.0574759097926

Slope: 2.31981998386

We can compare this to sklearn’s implementation of logistic regression to make sure the results still match.

from sklearn.linear_model import LogisticRegression as SKLogisticRegression logSK = SKLogisticRegression(tol=1e-6,C=1e35) logSK.fit(X.reshape(X.size,1),y) print "Intercept: ",logSK.intercept_[0] print "Slope: ",logSK.coef_[0,0]

Intercept: -0.0576359266481

Slope: 2.32334711964

The results are very close. Lets update the code to include regularization.

# Update Code

I have to update two functions to implement regularization. One is the log_likelihood, and the other is the log_likelihood_gradient.

## Log Likelihood Code

def log_likelihood(self): """Calculate the loglikelihood for the current set of weights and features. Returns ------- out : float """ #Get Probablities p = self.probability() #Get Log Likelihood For Each Row of Dataset loglikelihood = self.labels*np.log(p+1e-24) + (1-self.labels)*np.log(1-p+1e-24) loglikelihood = -1*loglikelihood.sum() loglikelihood += self.l1*np.abs(self.w).sum() loglikelihood += self.l2*np.power(self.w, 2).sum()/2 #Return Sum return loglikelihood

Ideal we should probably check that only l1 or l2 is set, but this is only toy code. The first term implements the addition of the LASSO regularization to the cost function.

The second term implements the Ridge regularization to the cost function.

The next code update is to update the gradient equations as shown at the beginning of the post.

## Log Likelihood Gradient Code

def log_likelihood_gradient(self): """Calculate the loglikelihood gradient for the current set of weights and features. Returns ------- out : ndarray(n features, 1) gradient of the loglikelihood """ error = self.labels-self.probability() product = error*self.features grad = product.sum(axis=0).reshape(self.w.shape) grad -= self.l1*np.sign(self.w) grad -= self.l2*self.w return grad

The first term implements subtracting the gradient for the LASSO regularization of the gradient.

The second term implements subtracting the Ridge regularization of the gradient.

# Checking the implementation

I will compare a couple of toy examples in this implementation compared to sklearn.

## Ridge (Gaussian Prior)

### Lambda = 100, C = 0.01

from LogisticRegressionRegularization import LogisticRegression as BTSLogisticRegression log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=0,l2=100) log.gradient_decent(alpha=1e-3) coef = log.get_coefficients() logSK = SKLogisticRegression(tol=1e-9,C=0.01) logSK.fit(X.reshape(X.size,1),y) print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0] print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[1]

Sklearn Intercept, Bryan Intercept: -0.00217832261688 -0.00217735345805

Sklearn Slope, Bryan Slope: 0.548644577533 0.54864036825

### Lambda = 10, C = 0.1

log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=0,l2=10) log.gradient_decent(alpha=1e-3) coef = log.get_coefficients() logSK = SKLogisticRegression(tol=1e-9,C=0.1) logSK.fit(X.reshape(X.size,1),y) print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0] print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[1]

Sklearn Intercept, Bryan Intercept: -0.021489688722 -0.0214850033177

Sklearn Slope, Bryan Slope: 1.40733200454 1.40730512612

## LASSO (Laplace Prior)

### Lambda = 100, C = 0.01

from LogisticRegressionRegularization import LogisticRegression as BTSLogisticRegression log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=100,l2=0) log.gradient_decent(alpha=1e-3) coef = log.get_coefficients() logSK = SKLogisticRegression(tol=1e-9,C=0.01,penalty='l1') logSK.fit(X.reshape(X.size,1),y) print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0] print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[1]

Sklearn Intercept, Bryan Intercept: 0.0 -0.0524505475518

Sklearn Slope, Bryan Slope: 0.0892273998403 0.089274906091

### Lambda = 10, C = 0.1

from LogisticRegressionRegularization import LogisticRegression as BTSLogisticRegression log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=10,l2=0) log.gradient_decent(alpha=1e-3) coef = log.get_coefficients() logSK = SKLogisticRegression(tol=1e-9,C=0.1,penalty='l1') logSK.fit(X.reshape(X.size,1),y) print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0] print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[1]

Sklearn Intercept, Bryan Intercept: 0.0 -0.00436123552719

Sklearn Slope, Bryan Slope: 1.58593771765 1.58595857714

# Review

In this post I made minimal changes to the logistic regression code to implement Ridge and LASSO regularization. I also showed that the results between this implementation and sklearn on a simple toy model are consistent. The simple gradient decent algorithm, with its manual tuning of the learning rate, keeps this method from being ready for production. Attemping to use my code on larger and more nuanced data will lead to difficulties with training time and finding the ‘optimal’ learning rate.

In the next and last post in this series, I am going to alter the cost functions/regularization to choose set priors that reflect prior experience, data, or conclusions that are not reflected in the dataset.

# Thank You

I appreciate you taking the time to read this post. I hope you gained something from it. Feel free to reach out with any questions in the comments or directly through the email link in the menu.