## Implementing Logistic Regression From Scratch – Part 3 : Regularization

This is the third part in a series of posts where I implement logistic regression from scratch. In the first post I discussed the theory of logistic regression, and in the second post I implemented it in Python with a comparison to sklearn. In this post I will show how I implemented regularization from scratch.

In my previous post I also discussed the theory of regularization in the context of logistic regression, where I showed how different prior beliefs about the model weights lead to different update equations. This post takes advantage of the weight updates derived in that post.

### Update for LASSO Regularization (Laplace Prior)

$w^i = w^i + \alpha \ \Bigg ( \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \lambda_1 \thinspace sign(w^i) \Bigg )$

### Update for Ridge Regularization (Gaussian Prior)

$w^i = w^i + \alpha \ \Bigg ( \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \lambda_2 \thinspace w^i \Bigg )$

The simplest solution is to take the code from the previous post and change the gradient function to include the regularization terms.
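Before touching the class, the two update rules can be sketched as a standalone NumPy helper (an illustrative function of my own naming, not part of the post's class; the feature matrix is assumed to carry a leading column of ones for the intercept):

```python
import numpy as np

def regularized_update(w, X, y, alpha, l1=0.0, l2=0.0):
    # Illustrative helper, not the post's class. X carries a leading
    # column of ones for the intercept term.
    p = 1.0/(1.0 + np.exp(-X.dot(w)))   # p(y = 1 | X, w)
    grad = X.T.dot(y - p)               # likelihood part of the gradient
    grad = grad - l1*np.sign(w)         # Laplace-prior (LASSO) term
    grad = grad - l2*w                  # Gaussian-prior (Ridge) term
    return w + alpha*grad

w = np.zeros(2)
X = np.array([[1.0, 1.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
print(regularized_update(w, X, y, 0.1))
```

Setting `l1 = l2 = 0` recovers the unregularized update from the earlier posts.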

# Starter Code

This is the template code for logistic regression, similar to what I implemented in the previous post:

```python
import numpy as np

class LogisticRegression:

    def __init__(self, X, y, tolerance=1e-5, l1=0., l2=0.):
        """Initializes Class for Logistic Regression

        Parameters
        ----------
        X : ndarray(n-rows,m-features)
            Numerical training data.

        y: ndarray(n-rows,)
            Integer training labels.

        tolerance : float (default 1e-5)
            Stopping threshold difference in the loglikelihood between iterations.

        l1: float (default 0)
            LASSO regularization value

        l2: float (default 0)
            Ridge regularization value
        """
        self.tolerance = tolerance
        self.labels = y.reshape(y.size, 1)
        #create weights equal to zero with an intercept coefficient at index 0
        self.w = np.zeros((X.shape[1]+1, 1))
        #add intercept column of ones to each row
        self.features = np.ones((X.shape[0], X.shape[1]+1))
        self.features[:, 1:] = X
        self.shuffled_features = self.features
        self.shuffled_labels = self.labels
        self.l1 = l1
        self.l2 = l2
        self.likelihood_history = []

    def log_likelihood(self):
        """Calculate the loglikelihood for the current set of weights and features.

        Returns
        -------
        out : float
        """
        #Get Probabilities
        p = self.probability()
        #Get Log Likelihood For Each Row of Dataset
        loglikelihood = self.labels*np.log(p+1e-24) + (1-self.labels)*np.log(1-p+1e-24)
        loglikelihood = -1*loglikelihood.sum()
        #Return Sum
        return loglikelihood

    def probability(self):
        """Computes the logistic probability of being a positive example

        Returns
        -------
        out : ndarray (1,)
            Probability of being a positive example
        """
        return 1/(1+np.exp(-self.features.dot(self.w)))

    def log_likelihood_gradient(self):
        """Calculate the loglikelihood gradient for the current set of weights and features.

        Returns
        -------
        out : ndarray(n features, 1)
        """
        error = self.labels - self.probability()
        product = error*self.features
        return product.sum(axis=0).reshape(self.w.shape)

    def gradient_decent(self, alpha=1e-7, max_iterations=1e4):
        """Runs the gradient descent algorithm

        Parameters
        ----------
        alpha : float
            The learning rate for the algorithm

        max_iterations : int
            The maximum number of iterations allowed to run before the algorithm terminates
        """
        previous_likelihood = self.log_likelihood()
        difference = self.tolerance + 1
        iteration = 0
        self.likelihood_history = [previous_likelihood]
        while (difference > self.tolerance) and (iteration < max_iterations):
            #step the weights along the gradient
            self.w = self.w + alpha*self.log_likelihood_gradient()
            temp = self.log_likelihood()
            difference = np.abs(temp - previous_likelihood)
            previous_likelihood = temp
            self.likelihood_history.append(previous_likelihood)
            iteration += 1

    def predict_probabilty(self, X):
        """Computes the logistic probability of being a positive example

        Parameters
        ----------
        X : ndarray (n-rows,n-features)
            Test data to score using the current weights

        Returns
        -------
        out : ndarray (1,)
            Probability of being a positive example
        """
        #same intercept construction as in __init__
        features = np.ones((X.shape[0], X.shape[1]+1))
        features[:, 1:] = X
        return 1/(1+np.exp(-features.dot(self.w)))

    def get_coefficients(self):
        return self.w.T[0]
```


I will be using the same data from the regularization post:

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
%matplotlib inline

def generate_1D_data(loc_positive=1, loc_negative=-1, size=100):
    X = np.hstack((np.random.normal(loc=loc_positive, size=size),
                   np.random.normal(loc=loc_negative, size=size)))
    X = X.reshape(X.size, 1)
    y = np.hstack((np.ones(size), np.zeros(size)))
    return X, y

X, y = generate_1D_data()
plt.figure(figsize=(16,8))
plt.hist(X[y==1], alpha=0.5, color='steelblue', label='Positive Example')
plt.hist(X[y==0], alpha=0.5, color='darkred', label='Negative Example')
plt.xlabel('Variable Value')
plt.ylabel('Count')
plt.title('Histogram of Simulated 1D Data')
plt.show()
```



Using the data generated above, my implementation produces the following results:

```python
log = LogisticRegression(X,y,tolerance=1e-6,l1=0,l2=0)
print 'Cost Before Fit:', log.log_likelihood()
#fit with plain gradient descent (learning rate chosen by hand)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
print 'Cost After Fit:', log.log_likelihood()
print 'Intercept: ',coef[0]
print 'Slope: ',coef[1]
```


```
Cost Before Fit: 138.629436112
Cost After Fit: 62.6436146605
Intercept: -0.0574759097926
Slope: 2.31981998386
```

We can compare this to sklearn’s implementation of logistic regression to make sure the results still match.

```python
from sklearn.linear_model import LogisticRegression as SKLogisticRegression
logSK = SKLogisticRegression(tol=1e-6,C=1e35)
logSK.fit(X.reshape(X.size,1),y)
print "Intercept: ",logSK.intercept_[0]
print "Slope: ",logSK.coef_[0,0]
```


```
Intercept: -0.0576359266481
Slope: 2.32334711964
```

The results are very close. Let's update the code to include regularization.

# Update Code

I have to update two functions to implement regularization: `log_likelihood` and `log_likelihood_gradient`.

## Log Likelihood Code

```python
def log_likelihood(self):
    """Calculate the loglikelihood for the current set of weights and features.

    Returns
    -------
    out : float
    """
    #Get Probabilities
    p = self.probability()
    #Get Log Likelihood For Each Row of Dataset
    loglikelihood = self.labels*np.log(p+1e-24) + (1-self.labels)*np.log(1-p+1e-24)
    loglikelihood = -1*loglikelihood.sum()
    #Add LASSO and Ridge penalty terms
    loglikelihood += self.l1*np.abs(self.w).sum()
    loglikelihood += self.l2*np.power(self.w, 2).sum()/2
    #Return Sum
    return loglikelihood
```


Ideally we would check that only one of l1 or l2 is set, but this is only toy code. The first added term implements the LASSO regularization in the cost function.

$\textrm{LASSO Cost} = \lambda_1 \thinspace \sum_k |w^k|$

The second term implements the Ridge regularization to the cost function.

$\textrm{Ridge Cost} = \frac{1}{2} \thinspace \lambda_2 \thinspace \sum_k |w^k|^2$

The next code update changes the gradient to match the equations shown at the beginning of the post.

```python
def log_likelihood_gradient(self):
    """Calculate the loglikelihood gradient for the current set of weights and features.

    Returns
    -------
    out : ndarray(n features, 1)
    """
    error = self.labels - self.probability()
    product = error*self.features
    gradient = product.sum(axis=0).reshape(self.w.shape)
    #Subtract LASSO (Laplace prior) term
    gradient -= self.l1*np.sign(self.w)
    #Subtract Ridge (Gaussian prior) term
    gradient -= self.l2*self.w
    return gradient
```


The first added term subtracts the gradient of the LASSO regularization.

$\frac{\partial}{\partial w^k} \textrm{LASSO Cost} = \lambda_1 \thinspace sign(w^k)$

The second added term subtracts the gradient of the Ridge regularization.

$\frac{\partial}{\partial w^k} \textrm{Ridge Cost} = \thinspace \lambda_2 \thinspace w^k$
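As a quick sanity check (not in the original post), we can verify both penalty derivatives against centered finite differences of the penalty costs:

```python
import numpy as np

# Check the penalty gradients at an arbitrary nonzero weight.
w, l1, l2 = 0.7, 10.0, 100.0
eps = 1e-6

lasso_cost = lambda v: l1*np.abs(v)
ridge_cost = lambda v: 0.5*l2*v**2

# centered finite differences of the two penalty terms
lasso_fd = (lasso_cost(w+eps) - lasso_cost(w-eps))/(2*eps)
ridge_fd = (ridge_cost(w+eps) - ridge_cost(w-eps))/(2*eps)

print(lasso_fd, l1*np.sign(w))   # both approximately 10.0
print(ridge_fd, l2*w)            # both approximately 70.0
```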

# Checking the implementation

I will compare this implementation to sklearn on a couple of toy examples.

## Ridge (Gaussian Prior)

### Lambda = 100, C = 0.01

```python
from LogisticRegressionRegularization import LogisticRegression as BTSLogisticRegression
log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=0,l2=100)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
logSK = SKLogisticRegression(tol=1e-9,C=0.01)
logSK.fit(X.reshape(X.size,1),y)
print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0]
print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[1]
```


```
Sklearn Intercept, Bryan Intercept: -0.00217832261688 -0.00217735345805
Sklearn Slope, Bryan Slope: 0.548644577533 0.54864036825
```

### Lambda = 10, C = 0.1

```python
log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=0,l2=10)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
logSK = SKLogisticRegression(tol=1e-9,C=0.1)
logSK.fit(X.reshape(X.size,1),y)
print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0]
print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[1]
```


```
Sklearn Intercept, Bryan Intercept: -0.021489688722 -0.0214850033177
Sklearn Slope, Bryan Slope: 1.40733200454 1.40730512612
```

## LASSO (Laplace Prior)

### Lambda = 100, C = 0.01

```python
from LogisticRegressionRegularization import LogisticRegression as BTSLogisticRegression
log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=100,l2=0)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
logSK = SKLogisticRegression(tol=1e-9,C=0.01,penalty='l1')
logSK.fit(X.reshape(X.size,1),y)
print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0]
print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[1]
```


```
Sklearn Intercept, Bryan Intercept: 0.0 -0.0524505475518
Sklearn Slope, Bryan Slope: 0.0892273998403 0.089274906091
```

### Lambda = 10, C = 0.1

```python
from LogisticRegressionRegularization import LogisticRegression as BTSLogisticRegression
log = BTSLogisticRegression(X,y,tolerance=1e-9,l1=10,l2=0)
log.gradient_decent(alpha=1e-2,max_iterations=1e5)
coef = log.get_coefficients()
logSK = SKLogisticRegression(tol=1e-9,C=0.1,penalty='l1')
logSK.fit(X.reshape(X.size,1),y)
print "Sklearn Intercept, Bryan Intercept: ",logSK.intercept_[0], coef[0]
print "Sklearn Slope, Bryan Slope: ",logSK.coef_[0,0], coef[1]
```


```
Sklearn Intercept, Bryan Intercept: 0.0 -0.00436123552719
Sklearn Slope, Bryan Slope: 1.58593771765 1.58595857714
```

# Review

In this post I made minimal changes to the logistic regression code to implement Ridge and LASSO regularization. I also showed that the results of this implementation and sklearn's are consistent on a simple toy model. The simple gradient descent algorithm, with its manual tuning of the learning rate, keeps this method from being production-ready. Attempting to use my code on larger and more nuanced data will lead to difficulties with training time and with finding the 'optimal' learning rate.

In the next and last post in this series, I am going to alter the cost functions/regularization to choose priors that reflect prior experience, data, or conclusions that are not reflected in the dataset.

# Thank You

I appreciate you taking the time to read this post. I hope you gained something from it.  Feel free to reach out with any questions in the comments or directly through the email link in the menu.

# Regularization

Regularization is a mathematical method to add constraints to an expression of a problem. In the context of machine learning, it is a way to encourage the reduction of model complexity, reduce overfitting, and/or establish priors on the weights/coefficients of the model. I want to focus on the last part about priors for this post.

## Bayes' Rule

Bayes' rule is a relationship between conditional probabilities. It states that the probability that some statement A is true given that B is true is related to the probability that B is true given that A is true:

$P( A | B ) P( B ) = P ( B | A ) P (A)$

$P( A | B ) = \frac{P(B|A)}{P(B)} P(A)$

In the context of model fitting, we want to know the probability of certain weights given the data we observed: $P( w | y, X )$. According to Bayes' rule this is equal to:

$P(w|y,X) = \frac{P(y,X|w)}{P(y,X)} P(w)$

$\mathcal{P} (\mathrm{Posterior}) = \mathcal{L} (\mathrm{Likelihood}) \times \mathcal{P}_r (\mathrm{Prior})$

$P(w)$ corresponds to our beliefs about the weights before we see the data. This is our prior. $P(w|y,X)$ is our belief about the weights after we have seen the data, or our posterior. $\frac{P(y,X|w)}{P(y,X)}$ is the likelihood of the weights, given that we have seen the data $(X,y)$. In the posts on implementing logistic regression from scratch we were finding the weights that maximized the likelihood.
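A tiny grid sketch (illustrative curves, not the post's data) shows how multiplying a likelihood by a zero-centered prior shifts the maximum:

```python
import numpy as np

# Toy illustration: posterior ∝ likelihood × prior on a grid of candidate
# weights. The specific curves are made up for shape only.
w_grid = np.linspace(-3, 3, 601)
likelihood = np.exp(-(w_grid - 1.5)**2)     # pretend the data favor w ≈ 1.5
prior = np.exp(-5*np.abs(w_grid))           # zero-centered prior belief
posterior = likelihood*prior
posterior = posterior/posterior.sum()       # normalize to a distribution

# The prior pulls the posterior mode from 1.5 back toward zero.
print(w_grid[likelihood.argmax()], w_grid[posterior.argmax()])
```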

## Maximizing Posterior (Not Likelihood)

If we want to include priors in fitting our model, we have to choose the weights that maximize the posterior distribution $\mathcal{P}$ instead of the weights that maximize the likelihood. We can take the negative natural log of both sides of the posterior equation and get the following equation.

$-ln \mathcal{P} = - ln \mathcal{L} - ln \mathcal{P}_{r}$

And for logistic regression that is:

$-ln \mathcal{P} = - \sum_{k}^{N} \Big[ y_{k} \ ln \big( p(y_{k}=1| x_{k}, w) \big) + (1 - y_{k}) \ ln \big( 1 - p(y_{k}=1| x_{k}, w) \big) \Big] - ln \mathcal{P}_{r}$

For the weight update rules we will have to take the derivative of both sides.

$- \frac{\partial}{\partial w^i} \ln \mathcal{P} = - \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) {X^i_k} + \frac{\partial}{\partial w^i} \big( - ln \mathcal{P}_{r} \big)$

So the new weight update rule is:

$w^i = w^i + \alpha \thinspace \sum_k \big( y_k- p(y_k = 1 |X,w) \big) \thinspace {X^i_k} - \alpha\thinspace \frac{\partial}{\partial w^i} \thinspace ln\mathcal{P}_{r}$

## Gaussian Prior (Ridge Regularization)

A common prior assumes the weights are centered at zero with some standard deviation that is the same for all the weights.

$\mathcal{P}_{r} = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2 \sigma^2}\sum_i {w^i}^2}$

The negative log of the Gaussian prior is

$- ln \mathcal{P}_{r} = - ln\big(\frac{1}{\sqrt{2 \pi \sigma^2}}\big) + \frac{1}{2 \sigma^2} \sum_i {w^i}^2$

We can now include this in our new cost function for the posterior:

$-ln \mathcal{P} = - ln \mathcal{L}_{\mathrm{logistic}} + \frac{1}{2 \sigma^2} \sum_i {w^i}^2 - ln\big(\frac{1}{\sqrt{2 \pi \sigma^2}}\big)$

The last term is a constant, and can be ignored. If we define $\lambda$ as $\frac{1}{\sigma^2}$, then we can put this in the common Ridge regularization form.

$-ln \mathcal{P} = - ln \mathcal{L}_{\mathrm{logistic}} + \frac{1}{2} \lambda \sum_i {w^i}^2$

### New Update Rule

The derivative of the negative log of the prior is

$\frac{\partial}{\partial w^i} - ln \mathcal{P}_{r} = \frac{w^i}{\sigma^2} = \lambda w^i$

This makes the new weight update rule for the posterior, based on a Gaussian prior:

$w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \alpha \frac{1}{\sigma^2} w^i$

$w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \alpha \lambda w^i$

A large lambda corresponds to a small sigma, and a strong belief that the weights should be close to zero.

## Laplace Prior (LASSO Regularization)

Another common prior assumes the weights are centered at zero with a Laplace (double exponential) distribution.

$\mathcal{P}_{r} = \frac{1}{2 b} e^{-\frac{1}{b}\sum_i |w^i|}$

The negative log of the Laplace prior is

$- ln \mathcal{P}_{r} = - ln\big(\frac{1}{2b}\big) + \frac{1}{b} \sum_i |w^i|$

We can now include this in our new cost function for the posterior:

$-ln \mathcal{P} = - ln \mathcal{L}_{\mathrm{logistic}} + \frac{1}{b} \sum_i |w^i| - ln\big(\frac{1}{2b}\big)$

The last term is a constant, and can be ignored. If we define $\lambda$ as $\frac{1}{b}$, then we can put this in the common LASSO regularization form.

$-ln \mathcal{P} = - ln \mathcal{L}_{\mathrm{logistic}} + \lambda \sum_i |w^i|$

### New Update Rule

The derivative of the negative log of the prior is

$\frac{\partial}{\partial w^i} - ln \mathcal{P}_{r} = \frac{sign(w^i)}{b} = \lambda sign(w^i)$

This makes the new weight update rule for the posterior, based on a Laplace prior:

$w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \alpha \ \frac{1}{b} sign(w^i)$

$w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \alpha \ \lambda \ sign(w^i)$

A large lambda corresponds to a small $b$, and a strong belief that the weights should be close to zero. Since the Laplace distribution is much narrower than the Gaussian distribution, it corresponds to a much stronger belief/prior that the coefficients are zero. This is the reason that LASSO is used for variable selection. With a large lambda (small b), it takes very strong evidence to overcome the Laplace prior that the weights are close to zero.
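To make this concrete, here is a small comparison (my own sketch, using scipy, not the post's code) of how much probability mass each prior places near zero when both use $\lambda = 100$, i.e. $\sigma = 1/\sqrt{\lambda}$ and $b = 1/\lambda$ as defined above:

```python
import numpy as np
from scipy import stats

# Same lambda = 100 for both priors, per the definitions in the text.
lam = 100.0
gauss = stats.norm(loc=0, scale=1/np.sqrt(lam))
laplace = stats.laplace(loc=0, scale=1/lam)

# Probability mass each prior assigns to |w| < 0.01
gauss_mass = gauss.cdf(0.01) - gauss.cdf(-0.01)
laplace_mass = laplace.cdf(0.01) - laplace.cdf(-0.01)
print(gauss_mass, laplace_mass)
```

The Laplace prior packs far more mass into $|w| < 0.01$ (roughly 63% versus 8% here), which is the variable-selection pressure described above.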

# Examples

I always like to examine some toy examples to get a feel for a concept. In this case I am going to generate Gaussian data that is well separated.

The code is below:

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import scipy.stats
%matplotlib inline

def generate_1D_data(loc_positive=1, loc_negative=-1, size=100):
    X = np.hstack((np.random.normal(loc=loc_positive, size=size),
                   np.random.normal(loc=loc_negative, size=size)))
    y = np.hstack((np.ones(size), np.zeros(size)))
    return X, y

X, y = generate_1D_data()
w = np.array([0, 0])

def prob(w, X):
    return 1/(1+np.exp(-w[0]-w[1]*X))

def cost(w, X, y, l2=0, l1=0, include_intercept=1):
    return -1*np.sum(y*np.log(prob(w,X))+(1-y)*np.log(1-prob(w,X))) + \
        l2*w[1]*w[1] + include_intercept*l2*w[0]*w[0] + \
        l1*np.abs(w[1]) + include_intercept*l1*np.abs(w[0])

plt.figure(figsize=(16,8))
plt.hist(X[y==1], alpha=0.5, color='steelblue', label='Positive Example')
plt.hist(X[y==0], alpha=0.5, color='darkred', label='Negative Example')
plt.xlabel('Variable Value')
plt.ylabel('Count')
plt.title('Histogram of Simulated 1D Data')
plt.show()
```


This data can be fit with logistic regression, but before we do that, let's examine some of the underlying concepts. In the absence of priors we want to choose weights that minimize the negative log-likelihood. We can produce a contour plot of this in our weight space (intercept and weight for our only variable):

```python
delta = .05
t1 = np.arange(-1.0, 1.0, delta)
t2 = np.arange(0.0, 4.0, delta)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        Z[j,i] = cost([w0,w1], X, y)
plt.figure(figsize=(16,8))
CS = plt.contour(T1, T2, Z, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15, cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')
plt.show()
```


This suggests that the minimum of the log-likelihood should produce an intercept close to zero and a slope close to 2.0. I will perform a fit with sklearn:

```python
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(C=1e35)
log.fit(X.reshape(X.size,1),y)
print "Intercept: ",log.intercept_[0]
print "Slope: ",log.coef_[0,0]
```


```
Intercept: 0.0390729281934
Slope: 1.78817685475
```

The results are very close to the center of the innermost contour.

## Gaussian Prior

Now let's apply a Gaussian prior with $\lambda = 100$:

```python
delta = .05
t1 = np.arange(-1.0, 1.0, delta)
t2 = np.arange(0.0, 4.0, delta)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        Z[j,i] = cost([w0,w1], X, y, l2=100)
plt.figure(figsize=(16,8))
CS = plt.contour(T1, T2, Z, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15, cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')
plt.show()
```


The minimum of the cost function is now at an intercept closer to 0 and a slope closer to 0.5. Sklearn uses a C parameter where $C = \frac{1}{\lambda}$, so $\lambda = 100$ corresponds to $C = 0.01$. Fitting the data again with a Gaussian prior produces:

```python
log = LogisticRegression(C=.01,penalty='l2')
log.fit(X.reshape(X.size,1),y)
print "Intercept: ",log.intercept_[0]
print "Slope: ",log.coef_[0,0]
```


```
Intercept: 0.00605127827799
Slope: 0.514997242718
```

Since we know that the intercept should be zero, we can examine the likelihood, posterior, and prior of the first weight.

```python
ws = np.linspace(-1, 2.5, 1000)
l = 100
li = ws.copy()
for i, wi in enumerate(ws):
    li[i] = np.exp(-1*cost([0,wi], X, y))
pr = sp.stats.norm(loc=0, scale=1/np.sqrt(l)).pdf(ws)
li = li*pr.max()/li.max()
po = li*pr
po = po/po.sum()
po = po*li.max()/po.max()
plt.figure(figsize=(16,8))
plt.plot(ws, li, label='Likelihood (From Data)', color='steelblue')
plt.plot(ws, pr, label='Prior (Gaussian)', color='#AA3333', linestyle='--')
plt.plot(ws, po, label='Posterior (After Data + Prior)', color='#3333AA')
plt.legend()
plt.show()
```


I have rescaled the curves for display purposes. Here we see how regularization can keep us from overfitting. The maximum-likelihood weight is around 1.5, but we have decided to believe that the weight should be within some distance of zero. This bias keeps us from moving all the way to the maximum-likelihood solution (based only on the data). We end up with a solution that is close to the data, but not perfectly fit to it.

Let's look at the position of the maximum posterior.

```python
print "Slope: ", ws[po.argmax()]
```


```
Slope: 0.513513513514
```

Very close to the sklearn solution!

## Laplace Prior

Now let's apply a Laplace prior with $\lambda = 100$:

```python
delta = .05
t1 = np.arange(-1.0, 1.0, delta)
t2 = np.arange(0.0, 4.0, delta)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        Z[j,i] = cost([w0,w1], X, y, l1=100)
plt.figure(figsize=(16,8))
CS = plt.contour(T1, T2, Z, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15, cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')
plt.show()
```


The minimum of this posterior cost function is close to 0 for both the intercept and the slope. The kink in the cost function comes from the absolute value. The sklearn solution is:

```python
log = LogisticRegression(C=.01,penalty='l1')
log.fit(X.reshape(X.size,1),y)
print "Intercept: ",log.intercept_[0]
print "Slope: ",log.coef_[0,0]
```


```
Intercept: 0.0
Slope: 0.0
```

The Laplace prior is overwhelming the information from the data. I want to visualize this along the direction of the slope.

```python
li = ws.copy()
for i, wi in enumerate(ws):
    li[i] = np.exp(-1*cost([0,wi], X, y))
pr = sp.stats.laplace(loc=0, scale=0.01).pdf(ws)
po = li*pr
po = po*pr.max()/po.max()
li = li*pr.max()/li.max()
plt.figure(figsize=(16,8))
plt.plot(ws, li, label='Likelihood (From Data)', color='steelblue')
plt.plot(ws, pr, label='Prior (Laplace)', color='#AA3333', linestyle='--')
plt.plot(ws, po, label='Posterior (After Data + Prior)', color='#3333AA')
plt.legend()
plt.show()
```


The Laplace prior is very strong in this example, and this illustrates how it can be used for variable selection. The evidence has to be very strong to overwhelm a strong prior. This can be done with more extreme data, or with more data. Instead of using 200 points, let's use 20,000 simulated points.

```python
X, y = generate_1D_data(size=10000)
delta = .05
t1 = np.arange(-1.0, 1.0, delta)
t2 = np.arange(0.0, 4.0, delta)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        Z[j,i] = cost([w0,w1], X, y, l1=100)
plt.figure(figsize=(16,8))
CS = plt.contour(T1, T2, Z, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15, cmap=plt.cm.Blues)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')
plt.show()
```


The minimum cost of the posterior function is close to where the original likelihood was. More data provided enough evidence to step away from the strong prior. Fitting the larger dataset with sklearn produces supporting results.

```python
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(C=0.01,penalty='l1')
log.fit(X.reshape(X.size,1),y)
print "Intercept: ", log.intercept_[0]
print "Slope: ",log.coef_[0,0]
```


```
Intercept: 0.0
Slope: 1.92902347294
```
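A back-of-the-envelope calculation (illustrative, not the post's simulation) shows why more data overwhelms a fixed penalty: the data term of the cost scales with the number of points, while $\lambda |w|$ does not.

```python
import numpy as np

def likelihood_gain(w, n):
    # decrease in the negative log-likelihood from moving one weight from
    # 0 to w, for n identical positive examples at x = 1 (intercept ignored)
    p = lambda v: 1.0/(1.0 + np.exp(-v))
    return n*(np.log(p(w)) - np.log(p(0.0)))

lam = 100.0
print(likelihood_gain(1.0, 200))    # smaller than the penalty lambda*|w| = 100
print(likelihood_gain(1.0, 20000))  # far larger than the penalty
```

With 200 points the fit improvement is smaller than the penalty, so the prior wins; with 20,000 points it is not even close.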

# Review

I wanted to explore regularization, and specifically how having a prior belief on the weights of a model can change the results. We have the option of maximizing the likelihood, or maximizing the posterior, which is proportional to the product of the likelihood and the prior.

I also wanted to illustrate how LASSO and Ridge regularization result from Laplace and Gaussian priors on the weights, respectively. These lead to the cost function updates found in most books and online resources.

I then played with a couple of toy models to show how the priors and the likelihood interact to produce the results of regularization in logistic regression models. The goal is to build up an intuition of regularization that can carry over to more complex datasets and models.

# Thank You

I appreciate you taking the time to read this post.  I hope you gained something from it.  Feel free to reach out with any questions in the comments or directly through the email link in the menu.

# Introduction

I wrote about the theory of Logistic Regression in the previous article, where I highlighted that the algorithm attempts to find the weights/coefficients that maximize the likelihood that the estimates of the probability/labels are correct. Through algebraic manipulation, the problem is reframed as finding the weights that produce the minimum negative log-likelihood. One way to approach this problem is to pick some weights, then move along this function by changing the weights in a way that tends to lower the value of the function. In theory this will eventually converge, but in practice we stop when it is good enough.

Note: What follows is a naive implementation of gradient descent and stochastic gradient descent. I do not recommend using these methods in practice because they are not very robust, require tuning, and are slower than standard implementations that take advantage of well-developed optimization libraries. Follow-up articles will attempt to address these issues, but for now we will just move along with the naive approach.

That being said, here is the final model I developed for predicting the color of wine, compared to sklearn's Logistic Regression.

# Github

I have a Github repo with some of the code from this series:  https://github.com/bryantravissmith/Logistic-Regression-From-Scratch

# Data

In order to test and compare our implementation of Logistic Regression we will need data. I will use two datasets for this article. One is a toy dataset with two variables and an integer label. The other is the combination of two wine datasets from the UCI Machine Learning Repository.

The gradient descent algorithm assumes we have some function $f(x,w)$ of data ($x$) and weight ($w$) variables, where the data is fixed but the weights are to be determined. The function gives us some value for a given data and weight combination, and we want to find the weights that give the lowest value. Assuming the function is appropriately well-behaved, we can choose some starting weights and then continue to update them using the following update rule:

${w^{i} = w^{i} - \alpha \frac{\partial}{\partial w^{i}} f(x,w)}$

The index $i$ is the index of the $i^{\textrm{th}}$ feature in $x$.  The negative sign is because we are trying to descend down the cost function.  If the slope/derivative is positive, we want to move in the negative direction.  If the slope/derivative is negative we want to move in the positive direction.
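A minimal sketch of this rule on an assumed toy function $f(w) = (w-3)^2$, whose derivative is $2(w-3)$:

```python
# Gradient descent on f(w) = (w - 3)^2 (a made-up function for illustration).
w = 0.0
alpha = 0.25
for _ in range(50):
    w = w - alpha*2*(w - 3)   # step against the slope
print(w)   # converges toward the minimum at w = 3
```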

For Logistic Regression we have the following cost function:

${-ln\mathcal{L}(w) = -\sum_{k} \Bigg[y_{k} \ ln\big( p(y_{k}=1|x_{k},w)\big) + (1-y_{k}) ln\big(1-p(y_{k}=1|x_{k},w)\big)\Bigg]}$

The variables $x$ and $y$ are the features and labels for the training data, and $k$ is the $k^{\textrm{th}}$ example in the training data.

The gradient of the Logistic cost function is:

${-\frac{\partial}{\partial w^{i}} ln\mathcal{L}(w) = -\sum_{k} \Big( y_{k} - p(y_{k}=1|x_{k},w) \Big) x^i_k}$

And our update rule will be:

${w^{i} = w^{i} + \alpha \sum_{k} \Big( y_{k} - p(y_{k}=1|x_{k},w) \Big) x^i_k}$

You might have noticed the negative sign switched to a positive one; that is because of the negative sign in the gradient of the negative log-likelihood.

# Code

I am going to implement Logistic Regression as a class, and start with the following template. The methods are broken up in a way to highlight the fact that I will be implementing two forms of gradient descent:

1. Gradient Descent (Using all of the data for each weight update)
2. Stochastic Gradient Descent (Using one data point at a time for weight updates)

There are tradeoffs for both methods. There are also in-between implementations where you can use subsets/batches of the data for weight updates.
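A schematic contrast of the two update styles, using made-up data rather than the class below:

```python
import numpy as np

# Sketch only: contrast one batch update with one pass of stochastic updates.
X = np.array([[1.0, 2.0], [1.0, -1.0]])   # column 0 is the intercept
y = np.array([1.0, 0.0])
alpha = 0.1

# Gradient descent: one update uses every row at once.
w_batch = np.zeros(2)
p = 1/(1 + np.exp(-X.dot(w_batch)))
w_batch = w_batch + alpha*X.T.dot(y - p)

# Stochastic gradient descent: one update per row, applied in sequence.
w_sgd = np.zeros(2)
for k in range(X.shape[0]):
    p_k = 1/(1 + np.exp(-X[k].dot(w_sgd)))
    w_sgd = w_sgd + alpha*(y[k] - p_k)*X[k]

print(w_batch, w_sgd)   # similar direction, slightly different steps
```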

## Template Code

```python
import numpy as np

class LogisticRegression:

    def __init__(self, X, y, tolerance=1e-5):
        """Initializes Class for Logistic Regression

        Parameters
        ----------
        X : ndarray(n-rows,m-features)
            Numerical training data.

        y: ndarray(n-rows,)
            Integer training labels.

        tolerance : float (default 1e-5)
            Stopping threshold difference in the loglikelihood between iterations.
        """
        self.tolerance = tolerance
        self.labels = y.reshape(y.size, 1)
        #create weights equal to zero with an intercept coefficient at index 0
        self.w = np.zeros((X.shape[1]+1, 1))
        #Add Intercept Data Point of 1 to each row
        self.features = np.ones((X.shape[0], X.shape[1]+1))
        self.features[:, 1:] = X
        self.shuffled_features = self.features
        self.shuffled_labels = self.labels
        self.likelihood_history = []

    def probability(self):
        """Computes the logistic probability of being a positive example

        Returns
        -------
        out : ndarray (1,)
            Probability of being a positive example
        """
        pass

    def log_likelihood(self):
        """Calculate the loglikelihood for the current set of weights and features.

        Returns
        -------
        out : float
        """
        pass

    def log_likelihood_gradient(self):
        """Calculate the loglikelihood gradient for the current set of weights and features.

        Returns
        -------
        out : ndarray(n features, 1)
        """
        pass

    def gradient_decent(self, alpha=1e-7, max_iterations=1e4):
        """Runs the gradient descent algorithm

        Parameters
        ----------
        alpha : float
            The learning rate for the algorithm

        max_iterations : int
            The maximum number of iterations allowed to run before the algorithm terminates
        """
        pass

    def row_probability(self, row):
        """Computes the logistic probability of being a positive example for a given row

        Parameters
        ----------
        row : int
            Row from feature matrix with which to calculate the probability.

        Returns
        -------
        out : ndarray (1,)
            Probability of being a positive example
        """
        pass

    def row_log_likelihood_gradient(self, row):
        """Computes the loglikelihood gradient for a given row

        Parameters
        ----------
        row : int
            Row from feature matrix with which to calculate the probability.

        Returns
        -------
        out : ndarray(n features, 1)
        """
        pass

    def stochastic_gradient_decent(self, alpha, max_iterations):
        """Runs the stochastic gradient descent algorithm

        Parameters
        ----------
        alpha : float
            The learning rate for the algorithm

        max_iterations : int
            The maximum number of iterations allowed to run before the algorithm terminates
        """
        pass

    def predict_probabilty(self, X):
        """Computes the logistic probability of being a positive example

        Parameters
        ----------
        X : ndarray (n-rows,n-features)
            Test data to score using the current weights

        Returns
        -------
        out : ndarray (1,)
            Probability of being a positive example
        """
        pass
```


This implementation requires that you initialize the class with the training data. The features are in a 2D numeric numpy array, and the labels are in a 1D numeric numpy array. Additionally, I am adding an intercept to the data so that the zero-indexed weight is the intercept, and the following weights correspond to the ordered columns of the feature matrix X.
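A minimal illustration of the intercept-column construction described above, with made-up numbers:

```python
import numpy as np

# Two rows, two features; column 0 stays 1 for the intercept weight.
X = np.array([[2.0, 5.0],
              [3.0, 7.0]])
features = np.ones((X.shape[0], X.shape[1] + 1))
features[:, 1:] = X
print(features)
```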

I will implement the standard gradient descent algorithm method by method in the following order:

1. Probability Calculation of the Training Data
2. Log-Likelihood of the Training Data
3. Log-Likelihood Gradient of Training Data

### Probability Calculation

```python
def probability(self):
    """Computes the logistic probability of being a positive example

    Returns
    -------
    out : ndarray (1,)
        Probability of being a positive example
    """
    return 1/(1+np.exp(-self.features.dot(self.w)))
```


### Log-Likelihood Calculation

def log_likelihood(self):
    """Calculate the negative loglikelihood for the current set of weights and features.

    Returns
    -------
    out : float
    """
    #Get Probabilities
    p = self.probability()
    #Get Log Likelihood For Each Row of Dataset
    loglikelihood = self.labels*np.log(p+1e-24) + (1-self.labels)*np.log(1-p+1e-24)
    #Return Negative Sum
    return -1*loglikelihood.sum()
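The small constant `1e-24` guards against taking `log(0)` when a predicted probability saturates at exactly 0 or 1. A quick standalone illustration (the probabilities and labels here are made up; any tiny epsilon would serve the same purpose):

```python
import numpy as np

p = np.array([0.0, 0.5, 1.0])  # probabilities, two of them fully saturated
y = np.array([0.0, 1.0, 1.0])  # labels

with np.errstate(divide="ignore", invalid="ignore"):
    raw = y * np.log(p) + (1 - y) * np.log(1 - p)  # contains -inf / nan terms

eps = 1e-24
guarded = y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)

print(np.isfinite(raw).all())      # False: log(0) produced non-finite terms
print(np.isfinite(guarded).all())  # True: the epsilon keeps every term finite
```

Without the guard a single saturated row would turn the whole negative log-likelihood into `nan` and break the convergence check.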


def log_likelihood_gradient(self):
    """Calculate the loglikelihood gradient for the current set of weights and features.

    Returns
    -------
    out : ndarray(n features, 1)
    """
    error = self.labels-self.probability()
    product = error*self.features
    return product.sum(axis=0).reshape(self.w.shape)
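The row-wise broadcast `error*self.features` followed by a column sum is the same computation as the single matrix product $X^T(y - p)$. A small standalone sketch with arbitrary made-up data showing the two forms agree:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))                  # 5 rows, 3 columns (intercept included)
labels = rng.integers(0, 2, size=(5, 1)).astype(float)
w = rng.normal(size=(3, 1))

p = 1 / (1 + np.exp(-features.dot(w)))              # logistic probabilities
error = labels - p

# Broadcast-then-sum form used in the class above
grad_broadcast = (error * features).sum(axis=0).reshape(w.shape)
# Equivalent matrix-product form
grad_matmul = features.T.dot(error)

print(np.allclose(grad_broadcast, grad_matmul))     # True
```

Either form is fine here; the matrix-product form is a common alternative when profiling shows the broadcast to be a bottleneck.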


def gradient_decent(self,alpha=1e-7,max_iterations=1e4):
    """Runs the gradient descent algorithm

    Parameters
    ----------
    alpha : float
        The learning rate for the algorithm

    max_iterations : int
        The maximum number of iterations allowed to run before the algorithm terminates

    """
    previous_likelihood = self.log_likelihood()
    difference = self.tolerance+1
    iteration = 0
    self.likelihood_history = [previous_likelihood]
    while (difference > self.tolerance) and (iteration < max_iterations):
        self.w = self.w + alpha*self.log_likelihood_gradient()
        temp = self.log_likelihood()
        difference = np.abs(temp-previous_likelihood)
        previous_likelihood = temp
        self.likelihood_history.append(previous_likelihood)
        iteration += 1


For stochastic gradient descent I will follow the implementation above in the same method-by-method fashion, except for the log-likelihood (already done!):

1. Probability Calculation  for a single row of the Training Data
2. Log-Likelihood Gradient for a single row of Training Data

### Probability Calculation by Row

def row_probability(self,row):
    """Computes the logistic probability of being a positive example for a given row

    Parameters
    ----------
    row : int
        Row from the feature matrix with which to calculate the probability.

    Returns
    -------
    out : ndarray (1,)
        Probability of being a positive example
    """
    return 1/(1+np.exp(-self.shuffled_features[row,:].dot(self.w)))


def row_log_likelihood_gradient(self,row):
    """Computes the loglikelihood gradient for a given row

    Parameters
    ----------
    row : int
        Row from the feature matrix with which to calculate the gradient.

    Returns
    -------
    out : ndarray(n features, 1)
    """
    error = self.shuffled_labels[row]-self.row_probability(row)
    product = self.shuffled_features[row,:]*error
    return product.reshape(self.features.shape[1],1)




def stochastic_gradient_decent(self,alpha=1e-7,max_iterations=1e4):
    """Runs the stochastic gradient descent algorithm

    Parameters
    ----------
    alpha : float
        The learning rate for the algorithm

    max_iterations : int
        The maximum number of iterations allowed to run before the algorithm terminates

    """
    difference = self.tolerance+1.
    previous_likelihood = self.log_likelihood()
    rows = range(len(self.features))
    np.random.shuffle(rows)
    self.shuffled_features = self.shuffled_features[rows,:]
    self.shuffled_labels = self.shuffled_labels[rows]
    iteration = 0
    self.likelihood_history = [previous_likelihood]

    while (difference > self.tolerance) and (iteration < max_iterations):
        for i in xrange(len(self.features)):
            self.w = self.w + alpha*self.row_log_likelihood_gradient(i)

        temp = self.log_likelihood()
        difference = np.abs(temp - previous_likelihood)

        #print previous_likelihood, temp, difference

        previous_likelihood = temp

        np.random.shuffle(rows)
        self.shuffled_features = self.shuffled_features[rows,:]
        self.shuffled_labels = self.shuffled_labels[rows]
        iteration += 1
        self.likelihood_history.append(previous_likelihood)



## Review

I have implemented the logistic regression algorithm using both gradient descent and stochastic gradient descent.  Now I am going to compare this with the sklearn implementation using a toy dataset and the wine dataset.

# Comparing to sklearn

In this section I will be comparing my implementation of logistic regression with sklearn.  Let us start with the toy dataset.

## Toy Data

from LogisticRegression import LogisticRegression as LogisticRegressionBTS
from sklearn.linear_model import LogisticRegression as LogisticRegressionSKL
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.figure(figsize=(16,8))
ax = plt.gca()
toy.loc[toy.label==0,:].plot(kind='scatter',x='var1',y='var2',label="Negative Example",color='#AA0000',ax=ax,s=100)
toy.loc[toy.label==1,:].plot(kind='scatter',x='var1',y='var2',label="Positive Example",color='#00AA00',ax=ax,s=100)
plt.legend(loc='best')
plt.title("Toy Data")
plt.show()


The positive examples are well separated, and the first variable does not do much for informing if an example is positive or negative.

Running gradient descent on the toy dataset using the following code in an IPython notebook

X = toy.values[:,:2]
y = toy.values[:,2]
logBTS = LogisticRegressionBTS(X,y,tolerance=1e-6)
%time logBTS.gradient_decent()
print logBTS.w.T,logBTS.log_likelihood()


produces the following results:

CPU times: user 549 ms, sys: 5.1 ms, total: 554 ms
Wall time: 552 ms
[[-26.92062149 -0.18584114 8.64366537]] 8.30463361158

This produces an intercept of -26.9, a coefficient for var1 of -0.186, and a coefficient for var2 of 8.64.   The negative log-likelihood for these coefficients is 8.3.

For the stochastic gradient descent we get the following results

logBTS = LogisticRegressionBTS(X,y,tolerance=1e-6)
%time logBTS.stochastic_gradient_decent()
print logBTS.w.T,logBTS.log_likelihood()


Which produces the following results:

CPU times: user 23.8 s, sys: 89.6 ms, total: 23.9 s
Wall time: 23.9 s
[[-27.67652405  -0.15754055   8.89458409]] 8.33243660654

The weights/coefficients and the negative log-likelihood are similar in value to those from regular gradient descent, but it took over 40x longer to compute on this dataset.

Now let’s compare to sklearn’s implementation.  Because sklearn implements regularization (which I will cover in the next post), we have to set the ‘C’ parameter to a large value to reduce its effect.

logSKL = LogisticRegressionSKL(tol=1e-6,C=1e15)
%time logSKL.fit(X,y)
coef = np.append(logSKL.intercept_,logSKL.coef_).reshape(3,1)
temp = LogisticRegressionBTS(X,y,tolerance=1e-6)
temp.w = coef
print coef.T, temp.log_likelihood()


This produces the following results:

CPU times: user 1.02 ms, sys: 552 µs, total: 1.58 ms
Wall time: 1.12 ms
[[-27.7534122   -0.19149536   8.92554005]] 8.30156580374

This produces an intercept of -27.8 (-26.9 in my implementation), a coefficient for var1 of -0.191 (-0.186 in my implementation), and a coefficient for var2 of 8.93 (8.64 in my implementation).   The negative log-likelihood for these coefficients is the same 8.3.

The weights/coefficients and negative log-likelihood produced by sklearn are very similar to my implementation’s.  The speed, however, is much faster than my implementation.   As well it should be, because under the hood it is using LIBLINEAR (though that is not the whole story – to be continued).

## Wine Data

Now we are going to run the same tests on the wine dataset.  This is going to really highlight the difficulty in implementing your own version of a machine learning algorithm.

wine = pd.read_csv("wine.csv")
y = (wine.Type=='Red').values.astype(int)
X = wine.loc[:,wine.columns[0:11]].values
logBTS = LogisticRegressionBTS(X,y,tolerance=1e-6)
%time logBTS.gradient_decent()
print logBTS.w.T,logBTS.log_likelihood()


This produces the following results:

CPU times: user 14min 44s, sys: 1min 1s, total: 15min 46s
Wall time: 16min 3s
[[ -8.08877274   0.84694802  12.70627186   0.02073947  -0.16034602
6.11128278   0.0561014   -0.06567519  -7.46192366   3.65361126
10.6676526   -0.88305311]] 373.208256779

The gradient descent algorithm ran for 15 minutes and did not converge.   In contrast, we can look at sklearn's results.

logSKL = LogisticRegressionSKL(tol=1e-6,C=1e15)
%time logSKL.fit(X,y)
coef = np.append(logSKL.intercept_,logSKL.coef_).reshape(12,1)
temp = LogisticRegressionBTS(X,y,tolerance=1e-6)
temp.w = coef
print coef.T, temp.log_likelihood()


This produces the following results:

CPU times: user 188 ms, sys: 2.02 ms, total: 190 ms
Wall time: 192 ms
[[ -1.73626291e+03  -2.91001543e-01   6.47063319e+00  -2.45603390e+00
   -9.03523744e-01   2.24928967e+01   6.54734512e-02  -5.30980797e-02
    1.73329397e+03  -1.16165896e+00   3.33314373e+00   1.76592277e+00]] 214.601181992

The results converge, and do so in less than 200 ms.  The coefficients, however, are very different between the two implementations.

In gradient descent we naively set a fixed learning rate, but there are algorithms that adjust the learning rate appropriately.   I believe LIBLINEAR uses a coordinate descent algorithm.

More importantly, we are applying the same learning rate to all the weights, but the variables in the wine dataset have very different scales (means) and distributions (standard deviations).   More on this later.

Since my implementation did not converge, I can loosen the tolerance of the sklearn implementation and see if the results are consistent.

logSKL = LogisticRegressionSKL(tol=5.4e-4,C=1e15)
%time logSKL.fit(X,y)
coef = np.append(logSKL.intercept_,logSKL.coef_).reshape(12,1)
temp = LogisticRegressionBTS(X,y,tolerance=1e-6)
temp.w = coef
print coef.T, temp.log_likelihood()


This produces the following results:

CPU times: user 44 ms, sys: 1.02 ms, total: 45 ms
Wall time: 44.6 ms
[[ -9.53934274   0.8029794   12.85814315   0.58228827  -0.148062
6.85134198   0.05505118  -0.06961054  -8.8251745    4.66868577
10.31003786  -0.91916414]] 352.331151996

The coefficients are very similar now that the negative log-likelihood values are similar.

# Scaling Variables

My implementation of Logistic Regression took an unreasonable amount of time.   Even though it will eventually find the minimum, we can help the algorithm find it, as usual, by redefining the problem.   The data in the wine quality dataset has very different scales and distributions, so the gradient will be biased by both the scale and distribution of the variables.  It will still converge, just more slowly.  The slowness comes from how the data is represented, not from the information contained in the data.

One way around this is to scale the variables to similar sizes and ranges:

${x^{'} = \frac{x - \bar{x}}{\sigma_{x}}}$

Where $x^{'}$ is the scaled variable, $\bar{x}$ is the average of the unscaled variable, and $\sigma_{x}$ is the standard deviation of the unscaled variable.

The algorithm should find the same minimum, and produce the same estimate of probabilities, so the following equation must be true:

${w^{'}_{0} + \sum^{N}_{i=1} w^{i'} x^{'}_{i} = w_{0} + \sum^{N}_{i=1} w^{i} x_{i} }$

In this case the primed variables are the scaled variables and the resulting weights from using the scaled variables, and the unprimed variables are the unscaled variables and the corresponding resulting weights.

If we plug the scaled equation into the left-hand side of the above equation, we can produce the following conversion between the weights produced by gradient descent applied to the scaled variables and gradient descent applied to the unscaled variables.

${w^{'}_{0} + \sum^{N}_{i=1} w^{i'}\frac{x_{i} - \bar{x_{i}}}{\sigma_{x_{i}}} = w_{0} + \sum^{N}_{i=1} w^{i} x_{i} }$

${\Big(w^{'}_{0} - \sum^{N}_{i=1} \frac{w^{i'} \bar{x_{i}}}{\sigma_{x_{i}}} \Big) + \sum^{N}_{i=1} \frac{w^{i'}}{\sigma_{x_{i}}} x_{i} = w_{0}+ \sum^{N}_{i=1} w^{i} x_{i}}$

So that the unscaled intercept is:

${w_{0} = \Big(w^{'}_{0} - \sum^{N}_{i=1} \frac{w^{i'}\bar{x_{i}}}{\sigma_{x_{i}}} \Big)}$

And the unscaled weights are:

${w^{i} = \frac{w^{i'}}{\sigma_{x_{i}}}}$
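We can sanity-check the conversion numerically: pick arbitrary scaled weights, convert them with the two formulas above, and confirm that the scaled and unscaled linear predictors agree on every row. (All of the data and weights below are made up purely for the check.)

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(10, 4))   # unscaled variables
mean, std = X.mean(axis=0), X.std(axis=0)
XP = (X - mean) / std                              # scaled variables

wp = rng.normal(size=5)                            # [w'_0, w'_1, ..., w'_4]
w = np.empty(5)
w[1:] = wp[1:] / std                               # unscaled slopes
w[0] = wp[0] - (wp[1:] * mean / std).sum()         # unscaled intercept

lhs = wp[0] + XP.dot(wp[1:])                       # predictor on scaled data
rhs = w[0] + X.dot(w[1:])                          # predictor on unscaled data
print(np.allclose(lhs, rhs))                       # True
```

Since the two predictors are identical functions of the raw data, the fitted probabilities, and therefore the minimum, are unchanged by scaling.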

I am going to scale the wine dataset into a new scaled dataset, fit the new dataset, and transform the coefficients.

mean = X.mean(axis=0)
std = X.std(axis=0)
XP = (X-mean)/std
logBTS = LogisticRegressionBTS(XP,y,tolerance=1e-6)
%time logBTS.gradient_decent()
new_coef = logBTS.w.T[0]/np.hstack((1,std))
new_coef[0] = logBTS.w.T[0][0]-(mean*logBTS.w.T[0][1:]/std).sum()
new_coef,logBTS.log_likelihood()


CPU times: user 452 ms, sys: 29.9 ms, total: 482 ms
Wall time: 468 ms
[ -1.84343985e+03  -3.86566692e-01   6.22415134e+00  -2.60396655e+00
-9.46160941e-01   2.22425572e+01   6.69735358e-02  -5.32335410e-02
1.84259358e+03  -1.72842890e+00   3.09482351e+00   1.90221249e+00] 214.434914995

Fitting the scaled data only took about 3x the sklearn fitting time for the same initial tolerance.  The log-likelihoods for the two methods are almost identical.   Here is the comparison between the models, as shown at the beginning of the post.

# Review

In this post I implemented Logistic Regression and reproduced results similar to sklearn’s implementation.   The goal was to show how the algorithm works, but also that implementing machine learning algorithms requires more than just the algorithm: it must be smartly applied.  Not having an intuition for gradient descent and the effect of variable scaling, for instance, can lead to sub-optimal performance.  The difference on the wine dataset was over a 2500x increase in training time for results that did not converge.

I personally enjoy the process of reproducing the sklearn method.  In the next post in the series I will add regularization to the algorithm.

# Thank You

I appreciate you taking the time to read this post.  I hope you gained something from it.  Feel free to reach out with any questions in the comments or directly through the email link in the menu.

## Implementing Logistic Regression From Scratch – Part 1: Theory

Logistic Regression is a common Supervised Machine Learning Classification algorithm that attempts to estimate the probability that a given set of data is a positive example of some category.   Examples of some questions logistic regression can help answer include:

• What is the probability a customer will purchase a product given their recent on-site behavior?
• What is the probability that a recipient of a loan will default based on their most recent credit data?
• What is the probability a given wine is red based on its chemical properties?

These questions can be changed to a goal such as: “Classify this person as a purchaser/non-purchaser based on their on-site behavior.”  Logistic Regression does not do this directly.  It is done by setting a classification threshold on the probability, say 0.75 (or 75%).   All examples with an estimated probability of being a positive example above 75% will be classified as positive examples.   Where to set the threshold depends on the problem/goal of the classification, and is not inherent to Logistic Regression.  I will defer discussing this to another time.

# Why Implement It Ourselves?

There are great libraries already written in all common data science languages that implement Logistic Regression.  They are faster and much more robust than what I am going to write here.   I use them over my own code, and I recommend that for everyone.

The point of implementing it ourselves is partly to gain a deeper understanding of the packages/libraries/algorithms, and partly for the sheer joy of it.  There is also a habit of mind that I want to establish for myself: rigorous understanding.  Most implementations have design decisions and choices that might not be the best for all problems.  There are also bugs that might inhibit your work.   Plus, there are many very good and specific machine learning methods and algorithms that are not in the common or best packages out there.   You might find yourself wanting to use one, and not willing to wait.

My last reason:  cost function manipulation is fun and the results can be surprising.   I will also defer this to a post later in the series.

If you want to skip the theory you can jump to the code.

# Logistic Regression Theory (Math o’clock)

Logistic Regression is an algorithm that attempts to estimate the probability ($p$) that given data ($x$) is an example of a given classification ($y = 1$).  This can be written as:

$p( y=1|x)$

where

${p( y=1|x) +p( y=0|x) = 1}$

The odds ratio for a given example is:

${\textrm{odds} = p( y=1|x)/p( y=0|x) }$

The Logistic Regression algorithm requires/assumes that the natural log of the odds ( $ln( \frac{p( y=1|x)}{p( y=0|x)})$) is a linear function of the variables:

${ln( \frac{p( y=1|x,w)}{p( y=0|x,w)}) = w_{0} + \sum_{i=1}^N w^{i} x_{i}}$

Notice that the weights ($w$) slipped into the equation.  The index $i$ represents the $i^{\textrm{th}}$ variable in the example set $(x,y)$.  The probability is now conditioned on both the data and the chosen weights because we have decided to model it with a specific function.

The above equation then requires that the probability is represented as a Sigmoid function:

${p( y=1|x,w) = \frac{e^{w_{0} + \sum_{i=1}^N w^{i} x_{i}}}{1+e^{w_{0} + \sum_{i=1}^N w^{i} x_{i}}} = \frac{1}{1+e^{-(w_{0} +\sum_{i=1}^N w^{i} x_{i})}}}$

We now have a set of examples $(x_{k},y_{k})$ and a model to represent the probabilities from each example.  The weights or coefficients in the model are not specified.   The goal now is to choose the weights that best reflect the truth presented in the data.
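Concretely, for a chosen set of weights, the model's probability is just the sigmoid of the linear combination. A quick sketch with arbitrary weights and one made-up example:

```python
import numpy as np

def sigmoid(z):
    # p(y=1|x,w) = 1 / (1 + exp(-z)), where z = w0 + sum_i w_i * x_i
    return 1 / (1 + np.exp(-z))

w = np.array([-1.0, 2.0, 0.5])   # [w0, w1, w2], arbitrary weights
x = np.array([1.0, 0.5, 2.0])    # example with a 1 prepended for the intercept

z = w.dot(x)                     # -1*1 + 2*0.5 + 0.5*2 = 1.0
p = sigmoid(z)
print(round(p, 4))               # 0.7311, and p(y=0|x,w) = 1 - p
```

Changing the weights changes this probability for every example, which is exactly the knob the fitting procedure below will turn.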

## Gradient Descent – How to Choose the Best Weights.

“Best” always depends on the definition of the problem, so we have to define it.   Let’s consider one example $(x_{0},y_{0})$ and ask what the probability of being correct is.  If $y=1$ then the probability of being correct is:

${p(y=1|x,w)}$

And if $y=0$, the probability of being correct is:

${p(y=0|x,w) = 1 - p(y=1|x,w)}$

Because $y = 1$ or $y=0$, we can write the probability/likelihood of being correct ($\mathcal{L}(w)$) as:

${\mathcal{L}(w) = p(y_{0}=1|x_{0},w)^{y_{0}} (1-p(y_{0}=1|x_{0},w))^{1-y_{0}}}$

If the assumption that each example is independent holds, we can multiply the above expression together over all examples.  The likelihood of being correct for all the examples can be written as:

${\mathcal{L}(w) = \prod_{k} p(y_{k}=1|x_{k},w)^{y_{k}} (1-p(y_{k}=1|x_{k},w))^{1-y_{k}}}$

Where the index $k$ is the $k^{\textrm{th}}$ example from our set of training examples.

We want to maximize the likelihood of being correct, so we need to choose the weights that produce the maximum likelihood.  There are two problems with the current form of the likelihood that make it difficult to solve for the weights:

1. Machine precision – the product of many small numbers is a smaller number.   At some point zero is the only representation of the result in the computer.
2. Algebraic complexity of the product rule for finding the direction to step in.

Thankfully, there is an acceptable way around these problems: taking the natural log of the likelihood of being correct.  By the property of logs, the product turns into a sum.  There is also a convention of taking the negative so that the problem is transformed from finding the weights that maximize the likelihood of being correct to finding the weights that minimize the negative log-likelihood of being correct.

${-ln\left(\mathcal{L}(w)\right) = -\sum_{k} ln\left( p(y_{k}=1|x_{k},w)^{y_{k}} (1-p(y_{k}=1|x_{k},w))^{1-y_{k}}\right)}$

${-ln\left(\mathcal{L}(w)\right) = -\sum_{k} y_{k} ln\left( p(y_{k}=1|x_{k},w)\right) - \sum_{k} (1-y_{k}) ln\left(1-p(y_{k}=1|x_{k},w)\right)}$
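The machine-precision issue is easy to demonstrate: the product of a few thousand per-example probabilities underflows to exactly zero, while the sum of their logs stays perfectly representable. (The probabilities below are random stand-ins for per-example likelihood terms.)

```python
import numpy as np

rng = np.random.default_rng(2)
probs = rng.uniform(0.1, 0.9, size=5000)  # per-example probabilities of being correct

likelihood = np.prod(probs)               # underflows to exactly 0.0
neg_loglike = -np.sum(np.log(probs))      # finite, well-behaved number

print(likelihood == 0.0)                  # True: the raw product is not representable
print(np.isfinite(neg_loglike))           # True
```

This is why every implementation works with the (negative) log-likelihood rather than the likelihood itself.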

Before we move on to the implementation, there is one last math value we need to derive, and that is the gradient of the negative log-likelihood function from above.

${-\frac{\partial}{\partial w^{i}} ln\left(\mathcal{L}(w)\right) = -\sum_{k} \left( \frac{y_{k}}{p(y_{k}=1|x_{k},w)} -\frac{1-y_{k}}{1-p(y_{k}=1|x_{k},w)} \right) \frac{\partial}{\partial w^{i}} p(y_{k}=1|x_{k},w)}$

where

${\frac{\partial}{\partial w^{i}} p(y_{k}=1|x_{k},w) = p(y_{k}=1|x_{k},w) \left( 1 - p(y_{k}=1|x_{k},w) \right) x^i_{k}}$

As before the indexes $i$ and  $k$ represent the $i^{\textrm{th}}$ variable in the $k^{\textrm{th}}$ training example.

Combining these two equations leads to:

${-\frac{\partial}{\partial w^{i}} ln\left(\mathcal{L}(w)\right) = -\sum_{k} \left( y_{k}\left(1-p(y_{k}=1|x_{k},w)\right) - \left(1-y_{k}\right) p(y_{k}=1|x_{k},w) \right) x^i_k}$

${-\frac{\partial}{\partial w^{i}} ln\left(\mathcal{L}(w)\right) = -\sum_{k} \left( y_{k} - p(y_{k}=1|x_{k},w) \right) x^i_k}$
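We can verify the closed-form gradient above against a finite-difference approximation of the negative log-likelihood. The data below is random and purely for the check:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # intercept column prepended
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

def neg_loglike(w):
    p = 1 / (1 + np.exp(-X.dot(w)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Closed form: -d/dw^i ln L = -sum_k (y_k - p_k) x^i_k
p = 1 / (1 + np.exp(-X.dot(w)))
grad_closed = -X.T.dot(y - p)

# Central finite differences, one weight at a time
h = 1e-6
grad_fd = np.array([
    (neg_loglike(w + h * e) - neg_loglike(w - h * e)) / (2 * h)
    for e in np.eye(3)
])

print(np.allclose(grad_closed, grad_fd, atol=1e-5))  # True
```

A check like this is cheap insurance before wiring a hand-derived gradient into a descent loop.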

Because we have a cost function ( $- ln\left(\mathcal{L}(w) \right)$) we want to minimize, we can use gradient descent.

The algorithm follows:

1. Set initial weight/coefficient values
2. While the change in the cost function is larger than some threshold:
    1. Find the gradient of the cost function
    2. Change the weight values in the direction of the gradient

Though simply described, there is a lot of nuance in practice.  One is that we must decide how much to change the weights in the direction of the gradient.   We will use a constant proportion of the gradient, known as the learning rate ($\alpha$), in our implementation.

The final weight update equation then becomes:

${w^{i} = w^{i} + \alpha \sum_{k} \left( y_{k} - p(y_{k}=1|x_{k},w) \right) x^i_{k}}$

# Pseudo-Code Logistic Regression

Before we get to the python code, lets write out the steps we are planning to implement.

1. Setup
    1. Set stopping criteria
    2. Add an intercept ($w_{0}$) to the training data
2. Gradient Descent Loop
    1. Update Weights
    2. Get new Log-Likelihood Value
    3. Stop if difference between old and new Log-Likelihood is below threshold
3. Use Model
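The pseudo-code above can be sketched as a standalone numpy loop. The toy data, learning rate, and tolerance here are placeholders, not the values used later in the series:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 1))])  # 1. setup: add intercept
y = (X[:, 1] > 0).astype(float)

w = np.zeros(2)
alpha, tolerance = 0.1, 1e-8                                  # 1. stopping criteria

def neg_loglike(w):
    p = 1 / (1 + np.exp(-X.dot(w)))
    return -np.sum(y * np.log(p + 1e-24) + (1 - y) * np.log(1 - p + 1e-24))

previous = neg_loglike(w)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X.dot(w)))
    w = w + alpha * X.T.dot(y - p)        # 2. update weights along the gradient
    current = neg_loglike(w)              # 3. new log-likelihood value
    if abs(previous - current) < tolerance:
        break                             # 4. stop when the change is below threshold
    previous = current

print(neg_loglike(w) < neg_loglike(np.zeros(2)))  # True: the fit improved
```

The class in the next post wraps exactly this loop in methods, with history tracking and a prediction interface on top.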

Now let’s implement Logistic Regression in Python in the next post.


## From Teacher to Data Scientist in 2 Years

As I write this it is Christmas Day, 2015.   I am working as a Data Scientist at Payoff in Irvine, CA, but last year I was a physics teacher.  It took a lot to make the change.  I wanted to take the time to reflect and share that transition that started in January of 2014 when I started taking my first course in the Online Masters of Computer Science (OMSCS) with the goal of leaving teaching and becoming a data scientist.

# How I started and stopped Teaching

I started teaching at The Buckley School (private school in Sherman Oaks) in Aug of 2007. I accepted the job impulsively right after finishing up my Ph.D in Theoretical Particle Physics and Cosmology.   The idea of job insecurity as a postdoc for 3 to 9 years, working on topics that no one cared about (or understood), and having to leave my then girlfriend/now wife was too much. I received an email about an open teaching position that offered me a chance of something new and to stay with my now wife. So I went in for an interview, and they offered me the job before I left the campus.

The one thing that surprised me about teaching was how difficult and exhausting it was.  I was also surprised with how rewarding it was.  The problem was that the exhaustion was winning out in the long run.   By 2012, the school’s politics taxed me.   It was a high-end school where questions like the following were posed to me:

How can I ask Mrs. XXX to donate money for our new building phase while her daughter has a B- in physics?

It seems like a bad line from a movie, but it stems from real concerns in institutions that rely heavily on donations to execute their mission.  It is leadership’s role to strike the balance in the relationship between giver and receiver.  Every leadership team will take a different stance.

This event motivated me to seek renewed joy by switching to a school more aligned with a social mission and making a difference in a larger community.  This is when I transitioned to New Roads.  This place was more wonderful, more meaningful, but more exhausting than Buckley.  It was here that I finally burned out and needed a different career.

My PhD in Physics was under-utilized in teaching, and I wanted to take advantage of these skills.  I enjoyed executing on these skills.  I also wanted to work on developing insights (like in graduate school) that lead to actions (unlike in graduate school).  Somewhere online I found this mythical job called “Data Scientist”, and I got a tingling all over my body:  This is it!

# Choosing to Become a Data Scientist

I found my calling, but in my mind it was not as easy as applying for a data science job.  The reason is that I categorized the skills/requirements for the positions I found online into a couple of buckets:

1. What is that?
2. I’ve heard of that.
3. I’ve played with that.
4. I’ve done a project involving that.
5. I’ve done multiple projects involving that.
6. I’ve lived and breathed that.

Most of the buckets were 1, 2, & 3.   I know other PhDs who would have made the argument that they could learn it–they have a PhD after all.  The issue to me is not the truth of this statement, but the relevancy of it. Academia hires you for what you can learn; industry hires you for what you can do.   I did not want to find myself in a position where I cannot deliver on what was needed, or only get hired by people who cannot adequately decide that I am not the right person for the job.   Fit was and continues to be one of my major concerns for employment.  I want my skills and abilities to fit the needs of the company that hires me.

So in January of 2014 I took the following actions to facilitate my transition to becoming a data scientist:

1. Enrolled in OMSCS
    1. Machine Learning
    2. Artificial Intelligence for Robotics
2. Put our house up for sale
3. Applied for teaching jobs near my in-laws
4. Planned to live with my in-laws for at least 1 year
5. Committed to being unemployed starting the Summer of 2015

# The Long Slog

At this point the story becomes uninteresting: it is just me putting my nose down for months on end, working on anything that got me to produce deliverables.   I found that I learn best when I have to give something to someone.  That’s actually the reason for this blog.

What I did was integrate this work into my life.  I committed every weekday morning from 4:30am to 6:30am to data science or masters in computer science related work.   I also committed most Saturday mornings and all day Sunday to this work. It is not hyperbole to say I spent more than 20 hours/week working towards this transition.  I also reduced my social life quite a bit during this time.  That probably goes without saying.

# Job Applications

I started applying for data science jobs in January of 2015.  At first I received zero responses, so I hired a freelancer who had worked as an HR person at a tech company to re-write my resume.  She did an excellent job.   I immediately started getting responses after using her version.  Recruiters (whom I now will not work with) started reaching out to me.   It turns out there is a keyword and format game that must be played to get results.  It also turns out that you do not want to say you were a teacher.  My teacher-less resumes received more traction.  All in all I applied to well over 100 jobs, did dozens of phone interviews, and had tens of on-site interviews.

I learned that start-up companies are the quickest to respond and the quickest to dismiss.   I learned that established companies can take up to 3 months to get back to you, and sometimes that has nothing to do with you or their interest in you.  This does not seem to be true in San Francisco, however.   People there are very quick to respond if interested.

# Udacity

In the winter of 2014 Udacity started their nano-degrees.  There was one for Data Science (later changed to Data Analyst).  It was perfectly aligned with my goals: project-based learning focused on Data Science.  It was affordable, online, and used the same platform as OMSCS.  For me it was a no-brainer.

The value-add for me was the projects and feedback.   They were open ended, so you had the option to do as much or as little as you wanted.  There was also very good feedback with each submission.  My one recommendation: if you use Udacity, invest time into summarizing the projects into a pitch deck that can be presented in less than 5 minutes.  This can be useful to show off to potential employers, and is much better than showing them a 20+ page report.  Plus it has the added benefit of working on your communication skills.

This also led to the best decision I made in my career transition.  The 5th part of the nanodegree is Data Visualization with D3.js.  If you click on the link you will see that it was built with Zipfian Academy.  I was curious about them: who they were, what they did.   It turned out they were a data science boot-camp in San Francisco that was acquired by Galvanize.  After reading about them on Quora, I decided to apply.  My job search was not going well at this point.

# Galvanize Data Science Immersive

The application process for Galvanize was a series of programming and statistics related questions and an interview.   The boot-camp is immersive, and the more prepared you are the more you will get out of it.  It is also a financial commitment.  The tuition was 16k when I went, and after living in San Francisco for 3 months I was out about 24k.

After I was accepted into the program, but before I committed, I was offered a Data Science job.  This was a pivotal moment for me because I was being offered my goal/dream.   I could accept the job, not spend the money at Galvanize, and work as a full-time Data Scientist, though I would have to move.  The problem for me was that it didn’t feel right.  It was not a ‘hell yes!’ decision.  I did not feel like I would be accepting the position from a place of strength, or with full knowledge that I fit the position.  I also did not align with the mission of the company.   I thankfully opted for Galvanize.

The Galvanize Data Science Immersive was the hardest fun I ever had.  More importantly, it moved almost all the data science skills/abilities I started listing in 2014 into the 5 & 6 buckets from earlier.  I was also, in Seth Godin’s terminology, shown my tribe.  It allowed me to build a network that will continue to help me professionally, gave me exposure to a wide range of companies and interviews, and let me know for certain whether a given position was a correct fit.  I cannot give this program enough credit for transforming me into a high-quality Data Scientist.

# Payoff

I accepted a job at Payoff in August of this year, a full 20 months after I decided to become a data scientist.   There were other options and different paths I could have taken, but I think the path I took was near optimal in terms of long-term rewards.  I get to work at a company whose mission is to be a financial institution that helps people transition from being borrowers to being investors.  Science and compassion are integrated into Payoff, so we get a treasure trove of data and a mission to find actionable insights that make people’s (financial) lives better.   And because of the choices I made, I have learned and mastered the data science skills necessary to contribute on a day-to-day basis.

# Summary

It was a difficult two years, with a number of big and stressful decisions.   All in all I spent 31k transitioning from teaching to becoming a data scientist.  I am happy to say that this will be recouped in less than two years (even after taxes) by the difference between what I was making and what I am currently making.  Also, I go to work every day looking forward to what I am doing.   I do not think you can put a price on that.

There were dozens of times I wondered if it was worth it, or thought about going back to just teaching.  It is a certain, stable, and noble profession.  But I am glad I stuck with the transition, because I know too many teachers who continued teaching after they lost the passion.   It’s not good for anyone.

It took about 2 years, but both my wife and I are happier than we ever have been.  We are glad that I decided to take the long hard road.
