Regularization

Regularization is a mathematical method for adding constraints to the formulation of a problem. In the context of machine learning, it is a way to encourage reduced model complexity, reduce overfitting, and/or establish priors on the weights/coefficients of the model. I want to focus on the last part, priors, in this post.

Bayes' Rule

Bayes' rule is a relationship between conditional probabilities. It states that the probability that some statement A is true given that B is true is related to the probability that B is true given that A is true:

P( A | B ) P( B ) = P( B | A ) P( A )

P( A | B ) = \frac{P(B|A)}{P(B)} P(A)
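
As a quick numeric illustration (with made-up numbers): if P(A) = 0.3, P(B|A) = 0.5, and P(B) = 0.4, then

P( A | B ) = \frac{0.5}{0.4} \times 0.3 = 0.375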

In the context of model fitting, we want to know the probability that we have certain weights given the data we observed: P( w | y, X ). According to Bayes' rule this is equal to:

P(w|y,X) = \frac{P(y,X|w)}{P(y,X)} P(w)

\mathcal{P} (\mathrm{Posterior}) = \mathcal{L} (\mathrm{Likelihood}) \times \mathcal{P}_r (\mathrm{Prior})

P(w) corresponds to our beliefs about the weights before we see the data. This is our prior. P(w|y,X) represents our beliefs about the weights after we have seen the data, or our posterior. \frac{P(y,X|w)}{P(y,X)} is the likelihood of the weights given that we have seen the data (X,y). In the posts on implementing logistic regression from scratch we were finding the weights that maximized the likelihood.

Maximizing Posterior (Not Likelihood)

If we want to include priors when fitting our model, we have to choose the weights that maximize the posterior distribution \mathcal{P} instead of the weights that maximize the likelihood. We can take the negative natural log of both sides of the posterior equation and get the following:

-ln \mathcal{P} = - ln \mathcal{L} - ln \mathcal{P}_{r}

And for logistic regression that is:

-ln \mathcal{P} = - \sum_{k=1}^{N} \Big[ y_{k} \ ln \big( p(y_{k}=1| x_{k}, w) \big) + (1 - y_{k}) \ ln \big( 1 - p(y_{k}=1| x_{k}, w) \big) \Big] - ln \mathcal{P}_{r}

For the weight update rules we will have to take the derivative of both sides.

- \frac{\partial}{\partial w^i} \ln \mathcal{P} = - \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) {X^i_k} + \frac{\partial}{\partial w^i} \big( - ln \mathcal{P}_{r} \big)

So the new weight update rule is:

w^i = w^i + \alpha \thinspace \sum_k \big( y_k- p(y_k = 1 |X,w) \big) \thinspace {X^i_k} - \alpha\thinspace \frac{\partial}{\partial w^i} \thinspace ln\mathcal{P}_{r}
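
To make the pieces concrete, here is a minimal numpy sketch of this update rule. This is my own illustration, not the implementation from the earlier from-scratch posts, and the names fit_logistic, sigmoid, prior_grad, and alpha are mine. prior_grad returns the derivative of the negative log prior, so passing a function that returns zero recovers plain maximum likelihood.

import numpy as np

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

def fit_logistic(X, y, prior_grad=lambda w: 0.0, alpha=0.01, n_iter=1000):
    # Gradient ascent on the log posterior: likelihood gradient minus prior gradient.
    # X is an (n_samples, n_features) design matrix (include a column of ones for the intercept).
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X.dot(w))              # p(y_k = 1 | x_k, w)
        grad_likelihood = X.T.dot(y - p)   # sum_k (y_k - p_k) * x_k^i
        w = w + alpha*grad_likelihood - alpha*prior_grad(w)
    return w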

Gaussian Prior (Ridge Regularization)

A common prior assumes the weights are centered at zero with some standard deviation that is the same for all the weights.

\mathcal{P}_{r} = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2 \sigma^2}\sum_i {w^i}^2}

The negative log of the Gaussian prior is

- ln \mathcal{P}_{r} = - ln\big(\frac{1}{\sqrt{2 \pi \sigma^2}}\big) + \frac{1}{2 \sigma^2} \sum_i {w^i}^2

We can now include this in our new cost function for the posterior:

-ln \mathcal{P} = - ln \mathcal{L}_{\mathrm{logistic}} + \frac{1}{2 \sigma^2} \sum_i {w^i}^2 - ln\big(\frac{1}{\sqrt{2 \pi \sigma^2}}\big)

The last term is a constant and can be ignored. If we define \lambda as \frac{1}{\sigma^2}, then we can put this in the common ridge regularization form. For example, \sigma = 0.1 corresponds to \lambda = 100, the value used in the examples below.

-ln \mathcal{P} = - ln \mathcal{L}_{\mathrm{logistic}} + \frac{1}{2} \lambda \sum_i {w^i}^2

New Update Rule

The derivative of the negative log of the prior is

\frac{\partial}{\partial w^i} - ln \mathcal{P}_{r} = \frac{w^i}{\sigma^2} = \lambda w^i

This gives the new weight update rule for the posterior based on a Gaussian prior:

w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \alpha \frac{1}{\sigma^2} w^i

w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \alpha \lambda w^i

A large lambda corresponds to a small sigma, and a strong belief that the weights should be close to zero.
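
In code, the Gaussian prior only adds the -\alpha \lambda w^i term to each update. A small sketch, continuing the hypothetical fit_logistic helper from above (lam and gaussian_prior_grad are my own names):

lam = 100.0                                  # lambda = 1/sigma^2
gaussian_prior_grad = lambda w: lam*w        # derivative of -ln(prior) for a Gaussian
# w = fit_logistic(X_design, y, prior_grad=gaussian_prior_grad)   # X_design is hypothetical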

Laplace Prior (LASSO Regularization)

Another common prior assumes the weights are centered at zero with a Laplace (double exponential) distribution.

\mathcal{P}_{r} = \frac{1}{2 b} e^{-\frac{1}{b}\sum_i |w^i|}

The negative log of the Laplace prior is

- ln \mathcal{P}_{r} = - ln\big(\frac{1}{2b}\big) + \frac{1}{b} \sum_i |w^i|

We can now include this in our new cost function for the posterior:

-ln \mathcal{P} = - ln \mathcal{L}_{\mathrm{logistic}} + \frac{1}{b} \sum_i |w^i| - ln\big(\frac{1}{2b}\big)

The last term is a constant and can be ignored. If we define \lambda as \frac{1}{b}, then we can put this in the common LASSO regularization form.

-ln \mathcal{P} = - ln \mathcal{L}_{\mathrm{logistic}} + \lambda \sum_i |w^i|

New Update Rule

The derivative of the negative log of the prior is

\frac{\partial}{\partial w^i} - ln \mathcal{P}_{r} = \frac{\mathrm{sign}(w^i)}{b} = \lambda \ \mathrm{sign}(w^i)

This gives the new weight update rule for the posterior based on a Laplace prior:

w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \alpha \ \frac{1}{b} \ \mathrm{sign}(w^i)

w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i_k - \alpha \ \lambda \ \mathrm{sign}(w^i)

A large lambda corresponds to a small b, and a strong belief that the weights should be close to zero. Since the Laplace distribution is much more sharply peaked at zero than the Gaussian distribution, it corresponds to a much stronger belief/prior that the coefficients are zero. This is the reason LASSO is used for variable selection. With a large lambda/small b, it takes very strong evidence to overcome the Laplace prior that the weights are close to zero.
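
The only change from the Gaussian case is the prior gradient, which becomes \lambda \ \mathrm{sign}(w^i). Continuing the hypothetical sketch from above (lam and laplace_prior_grad are my own names):

import numpy as np

lam = 100.0                                       # lambda = 1/b
laplace_prior_grad = lambda w: lam*np.sign(w)     # derivative of -ln(prior) for a Laplace
                                                  # np.sign(0) = 0, a common choice for the subgradient at zero
# w = fit_logistic(X_design, y, prior_grad=laplace_prior_grad)   # X_design is hypothetical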

Examples

I always like to examine some toy examples to get a feel for a concept. In this case I am going to generate Gaussian data that is well separated.

The code is below:

import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import scipy.stats  # makes sp.stats available for the prior pdfs below
%matplotlib inline

def generate_1D_data(loc_positive=1, loc_negative=-1, size=100):
    # One feature: positive examples centered at loc_positive, negatives at loc_negative
    X = np.hstack((np.random.normal(loc=loc_positive, size=size),
                   np.random.normal(loc=loc_negative, size=size)))
    y = np.hstack((np.ones(size), np.zeros(size)))
    return X, y

X, y = generate_1D_data()
w = np.array([0, 0])

def prob(w, X):
    # p(y = 1 | x, w) with intercept w[0] and slope w[1]
    return 1/(1 + np.exp(-w[0] - w[1]*X))

def cost(w, X, y, l2=0, l1=0, include_intercept=1):
    # Negative log likelihood plus the penalties derived above:
    # (1/2)*l2*w^2 for a Gaussian prior and l1*|w| for a Laplace prior
    nll = -1*np.sum(y*np.log(prob(w, X)) + (1 - y)*np.log(1 - prob(w, X)))
    ridge = 0.5*l2*(w[1]*w[1] + include_intercept*w[0]*w[0])
    lasso = l1*(np.abs(w[1]) + include_intercept*np.abs(w[0]))
    return nll + ridge + lasso

plt.figure(figsize=(16, 8))
plt.hist(X[y == 1], alpha=0.5, color='steelblue', label='Positive Example')
plt.hist(X[y == 0], alpha=0.5, color='darkred', label='Negative Example')
plt.legend()
plt.xlabel('Variable Value')
plt.ylabel('Count')
plt.title('Histogram of Simulated 1D Data')
plt.show()

[Figure: Histogram of Simulated 1D Data]

This data can be fit with logistic regression, but before we do that let's examine some of the underlying concepts. In the absence of priors we want to choose weights that minimize the negative log likelihood. We can produce a contour plot of this in our weight space (intercept and weight for our only variable).

delta = .05
t1 = np.arange(-1.0, 1.0, delta)   # intercept values
t2 = np.arange(0.0, 4.0, delta)    # slope values
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        Z[j, i] = cost([w0, w1], X, y)
plt.figure(figsize=(16, 8))
CS = plt.contour(T1, T2, Z, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')
plt.show()

[Figure: Contour plot of the negative log likelihood over the intercept and slope]

This suggests that the minimum of the negative log likelihood should be at an intercept close to zero and a slope close to 2.0. I will perform a fit with sklearn:

from sklearn.linear_model import LogisticRegression
log = LogisticRegression(C=1e35)  # very large C, effectively unregularized
log.fit(X.reshape(X.size, 1), y)
print("Intercept: ", log.intercept_[0])
print("Slope: ", log.coef_[0, 0])

Intercept: 0.0390729281934
Slope: 1.78817685475

The results are very close to the center of the innermost contour.

Gaussian Prior

Now let's apply a Gaussian prior with \lambda = 100.

delta = .05
t1 = np.arange(-1.0, 1.0, delta)
t2 = np.arange(0.0, 4.0, delta)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        Z[j, i] = cost([w0, w1], X, y, l2=100)
plt.figure(figsize=(16, 8))
CS = plt.contour(T1, T2, Z, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Posterior (Gaussian Prior)')
plt.show()

[Figure: Contour plot of the cost with a Gaussian prior, lambda = 100]

The cost function values have changed, and the minimum of the cost function is now at an intercept closer to 0 and a slope closer to 0.5. Sklearn uses a C parameter where C = \frac{1}{\lambda}, so \lambda = 100 corresponds to C = 0.01. Fitting the data again with a Gaussian prior produces:

log = LogisticRegression(C=.01, penalty='l2')
log.fit(X.reshape(X.size, 1), y)
print("Intercept: ", log.intercept_[0])
print("Slope: ", log.coef_[0, 0])

Intercept: 0.00605127827799
Slope: 0.514997242718

Since we know that the intercept should be zero, we can examine the likelihood, prior, and posterior of the first weight.

ws = np.linspace(-1,2.5,1000)
l = 100
li = ws.copy()
for i, wi in enumerate(ws):
    li[i] = np.exp(-1*cost([0,wi],X,y))
pr = sp.stats.norm(loc=0,scale=1/np.sqrt(l)).pdf(ws)
li = li*pr.max()/li.max()
po = li*pr
po = po/po.sum()
po = po*li.max()/po.max()
plt.figure(figsize=(16,8))
plt.plot(ws,li,label='Likelihood (From Data)',color='steelblue')
plt.plot(ws,pr,label='Prior (Gaussian)',color='#AA3333',linestyle='--')
plt.plot(ws,po,label='Posterior (After Data + Prior)',color='#3333AA')
plt.legend()
plt.show()

[Figure: Likelihood, Gaussian prior, and posterior for the slope]

I have rescaled the curves for display purposes. Here we see how regularization can keep us from overfitting. The maximum likelihood weight is around 1.5, but we have decided to believe that the weight should be within some distance of zero. This bias keeps us from moving all the way to the maximum likelihood estimate (based only on the data). We end up with a solution that is close to the data, but not perfectly fit to it.

Let's look at the position of the maximum of the posterior.

print "Slope: ", ws[po.argmax()]

 

Slope: 0.513513513514

Very close to the sklearn solution!

Laplace Prior

Now let's apply a Laplace prior with \lambda = 100.

delta = .05
t1 = np.arange(-1.0, 1.0, delta)
t2 = np.arange(0.0, 4.0, delta)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        Z[j, i] = cost([w0, w1], X, y, l1=100)
plt.figure(figsize=(16, 8))
CS = plt.contour(T1, T2, Z, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Posterior (Laplace Prior)')
plt.show()

[Figure: Contour plot of the cost with a Laplace prior, lambda = 100]

The minimum of this posterior cost function is close to 0 for the intercept and 0 for the slope. The kink from the absolute value is visible in the cost function contours. The sklearn solution is:

log = LogisticRegression(C=.01, penalty='l1', solver='liblinear')  # liblinear supports the l1 penalty
log.fit(X.reshape(X.size, 1), y)
print("Intercept: ", log.intercept_[0])
print("Slope: ", log.coef_[0, 0])

Intercept: 0.0
Slope: 0.0

The Laplace prior is overwhelming the information from the data. I want to visualize this along the direction of the slope.

li = ws.copy()
for i, wi in enumerate(ws):
    li[i] = np.exp(-1*cost([0, wi], X, y))
pr = sp.stats.laplace(loc=0, scale=0.01).pdf(ws)  # Laplace prior with b = 1/lambda = 0.01
po = li*pr
po = po*pr.max()/po.max()   # rescale for display only
li = li*pr.max()/li.max()   # rescale for display only
plt.figure(figsize=(16, 8))
plt.plot(ws, li, label='Likelihood (From Data)', color='steelblue')
plt.plot(ws, pr, label='Prior (Laplace)', color='#AA3333', linestyle='--')
plt.plot(ws, po, label='Posterior (After Data + Prior)', color='#3333AA')
plt.legend()
plt.show()

[Figure: Likelihood, Laplace prior, and posterior for the slope]

The Laplace prior is very strong in this example, and this illustrates how it can be used for variable selection. The evidence has to be very strong to overwhelm a strong prior. That evidence can come from more extreme data, or simply from more data. Instead of using 200 points, let's use 20,000 simulated points.

X, y = generate_1D_data(size=10000)   # 10,000 points per class, 20,000 total
delta = .05
t1 = np.arange(-1.0, 1.0, delta)
t2 = np.arange(0.0, 4.0, delta)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        Z[j, i] = cost([w0, w1], X, y, l1=100)
plt.figure(figsize=(16, 8))
CS = plt.contour(T1, T2, Z, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Posterior (Laplace Prior)')
plt.show()

[Figure: Contour plot of the cost with a Laplace prior, lambda = 100, using 20,000 points]

The minimum of the posterior cost function is now close to where the original likelihood minimum was. More data provided enough evidence to pull away from the same prior distribution. Fitting the larger dataset with sklearn produces supporting results.

from sklearn.linear_model import LogisticRegression
log = LogisticRegression(C=0.01, penalty='l1', solver='liblinear')  # liblinear supports the l1 penalty
log.fit(X.reshape(X.size, 1), y)
print("Intercept: ", log.intercept_[0])
print("Slope: ", log.coef_[0, 0])

Intercept: 0.0
Slope: 1.92902347294

Review

I wanted to explore regularization, and specifically how having a prior belief about the weights of a model can change the results. We have the option of maximizing the likelihood, or maximizing the posterior, which is proportional to the product of the likelihood and the prior.

I also wanted to illustrate how LASSO and ridge regularization result from Laplace and Gaussian priors on the weights, respectively. These lead to the cost functions and update rules found in most books and online resources.

I then played with a couple of toy models to show how the priors and the likelihood interact to produce the results of regularization in logistic regression models. The goal is to build up an intuition of regularization that can carry over to more complex datasets and models.


Thank You

I appreciate you taking the time to read this post.  I hope you gained something from it.  Feel free to reach out with any questions in the comments or directly through the email link in the menu.
