This is the final post in my series on implementing logistic regression from scratch. The initial motivation for the series was to cover the theory and basic implementation of logistic regression, but also to show that there are cases where one might not want to use the default models provided by a package. In this post I give a toy example of such a scenario using a wine data set.

My thought process is that there are contexts within companies where multiple analyses, built on separate datasets, lead to baseline expectations. In my company there are multiple distinct sources of information that can be used to build the same metrics or models, and I imagine this is the case for most data-driven companies. Context improves the usefulness of information and decision making, and finding ways to incorporate context into models can improve how well the results generalize.

In this post I am considering a scenario where a previous analysis sets a baseline that I want incorporated into the model, but the training data cannot be joined with the data used for that analysis. In this context, the previous analysis is setting a prior. I want the prior to change the logistic cost function so that the results include information about the baseline.

Results of a Previous Analysis

Imagine the contrived example that someone looked at a worldwide sample of wines and found that the amount of free sulfur dioxide (a preservative) in a wine was related to how likely the wine was to be red rather than white: the more free sulfur dioxide, the more likely it was to be white. It turned out that the log odds of being red was approximately linear with respect to the concentration of free sulfur dioxide.

ln \big( \frac{p_{\textrm{red}}}{p_{\textrm{white}}} \big) \approx w_1 \big(\textrm{Free Sulfur Dioxide}\big) + w_0

Using this worldwide sample, the coefficient estimate was found to be approximately Gaussian, with the following results:

w_1 = -0.14
SE_{w_1} = 0.005
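
As a quick back-of-the-envelope sketch of what a slope of -0.14 means on the odds scale (the snippet below just restates the estimate; no other assumptions):

import numpy as np

w1 = -0.14
print np.exp(w1)       # ~0.87: each extra unit of free sulfur dioxide multiplies the odds of red by ~0.87
print np.exp(10 * w1)  # ~0.25: ten extra units cut the odds of red to roughly a quarter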

These results can be used to construct a prior on this coefficient (but not on the other weights):

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

# Plot the Gaussian prior implied by the previous analysis
x = np.linspace(-0.10,-0.20,1000)
y = norm.pdf(x,loc=-0.14,scale=0.005)
plt.figure(figsize=(16,8))
plt.plot(x,y,color="#AA3333")
plt.ylim(y.min(),y.max()+1)
plt.xlabel('Coefficient Value')
plt.ylabel('Probability Density')
plt.show()

[Figure: probability density of the Gaussian prior on the free sulfur dioxide coefficient]

Ridge Regularization

In the previous posts I covered the theory of regularization and implemented LASSO (Laplace prior) and Ridge (Gaussian prior) regularization. In both cases the assumption was that all coefficients have the same prior distribution, centered around zero. In this case we have a prior on only one coefficient, and it is not centered at zero. We can generalize the prior(s) to include this.

Updating Equations

The prior we used previously assumed each weight was centered at zero and that all weights had the same distribution:

\mathcal{P} \propto e^{-\frac{1}{2 \sigma^2}\sum_i {w_i}^2}

If we allow each weight to have its own expected value and its own standard deviation, this changes to:

\mathcal{P} \propto e^{-\frac{1}{2} \sum_i {\big(\frac{w_i-\bar{w_i}}{\sigma_i}\big)}^2}

That means the negative log prior is:

- ln \mathcal{P} = \frac{1}{2} \sum_i {\big(\frac{w_i-\bar{w_i}}{\sigma_i}\big)}^2 + C

To get an update rule that includes this prior (or priors), we take the derivative:

\frac{\partial}{\partial w_i} \big(- ln \mathcal{P}\big) = \frac{1}{\sigma_i^2} \big(w_i-\bar{w_i}\big)

Combining this with the logistic weight update rule, we get the following equation:

w_i = w_i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |x_k,w)\big) x_{ki} - \alpha \frac{1}{\sigma_i^2} \big(w_i-\bar{w_i}\big)

The convention is the same as in the previous posts: w_i is the coefficient for the i-th feature, w_0 is the intercept, x_{ki} is the value of the i-th feature for the k-th sample, \alpha is the learning rate, \bar{w_i} is the expected value of w_i, and \sigma_i is the standard deviation of the estimate of \bar{w_i}.
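
Before building the full cost function, here is a minimal numpy sketch of just the generalized penalty and its gradient from the equations above. The weight, mean, and standard-deviation values are hypothetical; an infinite standard deviation encodes "no prior" on a weight, which plays the same role as the zero regularization strength used later.

import numpy as np

w_hyp = np.array([1.5, -0.10])      # hypothetical current weights (intercept, slope)
w_bar = np.array([0.0, -0.14])      # prior expected values
sigma = np.array([np.inf, 0.005])   # prior standard deviations; inf = no prior on the intercept

neg_log_prior      = 0.5 * np.sum(((w_hyp - w_bar) / sigma)**2)   # penalty added to the cost
neg_log_prior_grad = (w_hyp - w_bar) / sigma**2                   # term subtracted in the weight update
print neg_log_prior, neg_log_prior_grad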

Our Data

I have a sample of wines from South America, and I want to build a model that estimates the probability that a wine is red based on the concentration of free sulfur dioxide, for any wine, not just South American ones. Since I have a biased sample, I want to include the prior in an effort to generalize. There are a number of issues that I am going to ignore:

  1. In estimating the out-of-sample performance, the model without the prior will likely look better. Both the train and test splits are biased South American samples, not samples that reflect the worldwide distribution of wines.

  2. I have no way to test on a general population. I would have to release the model into the wild and monitor its performance.

  3. With enough data, the model will fit to the sample and ignore the prior (a rough sketch of this effect follows below).
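
To sketch point 3 with purely hypothetical numbers: the pull of the prior on a coefficient is fixed, while the pull of the data grows with the sample size, so a large enough sample will eventually swamp the prior.

# Hypothetical illustration only: compare the fixed prior pull on w_1
# to a data pull that scales with the number of samples n.
prior_pull = (1/0.005**2) * 0.02   # prior gradient if w_1 sits 0.02 away from -0.14
per_sample_pull = 2.0              # assumed typical per-sample contribution to the gradient
for n in (500, 50000):
    print n, n * per_sample_pull, prior_pull   # data pull grows with n, prior pull does not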

Here’s the data:

import pandas as pd

wine = pd.read_csv("wine.csv")[['free sulfur dioxide','Type']]
wine.head()
   free sulfur dioxide   Type
0                   67  White
1                   32  White
2                   31  White
3                   15    Red
4                   25    Red

In the previous posts in this series I wrote classes for logistic regression, but in this post I am going to keep it simple with three functions. The first function is the probability estimate using the weights, the second is the value of the cost function (the negative log likelihood plus the prior penalty), and the third is the gradient used in the weight update (the negative of the cost function's gradient, so the update climbs the log posterior).

def prob(w,X):
    # Logistic probability estimate p(y = 1 | X, w)
    return 1/(1+np.exp(-X.dot(w)))

def cost(w,X,y,l2,ec):
    # Negative log likelihood plus the Gaussian prior penalty
    return -np.sum(y*np.log(prob(w,X))+(1-y)*np.log(1-prob(w,X)))+l2.dot(np.power(w-ec,2))/2

def grad(w,X,y,l2,ec):
    # Gradient used in the weight update (the negative of the cost gradient)
    return X.T.dot(y-prob(w,X))-l2.reshape(l2.size,1)*(w-ec)

The weight vector w is a column vector where the zeroth position is the intercept and the other entries are the slopes for the features. X is a matrix whose first column is all ones, for the intercept. y is a column vector of labels. l2 is a row vector of regularization parameters, which allows a different parameter for each feature. ec is a column vector of the expected values from the prior.
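
As a small usage sketch with made-up numbers, here is how the pieces fit together dimensionally:

# Quick shape check with tiny hypothetical data: 3 samples, a column of ones plus one feature
X_demo = np.array([[1., 10.], [1., 40.], [1., 70.]])
y_demo = np.array([[1], [0], [0]])
w_demo = np.zeros((2, 1))
ec_demo = np.array([[0.], [-0.14]])     # expected values (intercept, slope)
l2_demo = np.array([0., 1/0.005**2])    # penalty strengths (none on the intercept)
print cost(w_demo, X_demo, y_demo, l2_demo, ec_demo)   # length-1 array with the cost value
print grad(w_demo, X_demo, y_demo, l2_demo, ec_demo)   # (2,1) column of gradient values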

For our case, I am going to use the prior to set l2 and ec:

ec = np.array([[0],[-0.14]])    # expected values: no prior pull on the intercept, -0.14 on the slope
l2 = np.array([0,1/0.005**2])   # penalty strengths: 0 on the intercept, 1/SE^2 on the slope

In a previous post on regularization, I examined the cost function in the parameter space, which made it easy to see where the parameter estimates would end up; I confirmed this with sklearn in that post. In this post, let's look at what would happen to the solution if we did traditional ridge regularization with a regularization parameter of 4000 on every weight (a strong penalty in the spirit of \frac{1}{SE_{w_1}^2}).

I want to load a random sample of 500 wines:

indexes = np.random.choice(wine.shape[0],500)          # note: samples row indexes with replacement
data = wine.loc[indexes,['free sulfur dioxide']].values
labels = (wine.loc[indexes,'Type'].values=='Red').astype(int)
X = np.hstack((np.ones((data.size,1)),data))           # prepend a column of ones for the intercept
y = labels.reshape((labels.size,1))                    # labels as a column vector

Here is the code; the contour plots of the cost function follow it:

t1 = np.linspace(0,5,100)
t2 = np.linspace(-.3,.0,200)
ecz = np.zeros((2,1))                  # zero-centered prior for traditional ridge
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
plt.figure(figsize=(16,8))

# Left panel: unregularized negative log likelihood
plt.subplot(121)
l2_temp = np.zeros(2)
for i,w0 in enumerate(t1):
    for j,w1 in enumerate(t2):
        w = np.array([w0,w1]).reshape(2,1)
        Z[j,i] = cost(w,X,y,l2_temp,ecz)
levs = np.linspace(np.round(Z.min()+50),np.round(Z.max()-50),6)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')

# Right panel: negative posterior with a zero-centered ridge penalty of 4000 on both weights
plt.subplot(122)
l2_temp = np.array([4000.,4000.])
for i,w0 in enumerate(t1):
    for j,w1 in enumerate(t2):
        w = np.array([w0,w1]).reshape(2,1)
        Z[j,i] = cost(w,X,y,l2_temp,ecz)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Reds_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Posterior')
plt.show()

Here is the resulting plot:

[Figure: contour plots of the negative log likelihood (left) and the negative posterior with a zero-centered ridge penalty (right)]

The left plot is a contour plot of the negative log likelihood, and the right plot is a contour plot of the negative posterior after applying a Gaussian prior centered at zero to the weights. The traditional regularization pulls the coefficients from about (2, -0.12) to (0, -0.05). Running this data through sklearn should confirm this:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1e25)      # effectively unregularized (C is the inverse penalty strength)
model.fit(data,labels)
print "Unregularized Sklearn: ", model.intercept_[0],model.coef_[0,0]
model = LogisticRegression(C=1/4000.)   # ridge penalty comparable to l2 = 4000
model.fit(data,labels)
print "Regularized Sklearn: ", model.intercept_[0],model.coef_[0,0]
Unregularized Sklearn:  1.72872081295 -0.123757551858
Regularized Sklearn:  0.00515437416919 -0.0546835703845

Instead of the traditional zero-centered prior, I want to apply the experimental prior to the likelihood and find the minimum of the resulting negative log posterior. First, let's look at the cost function and the posterior function.

t1 = np.linspace(0,5,100)
t2 = np.linspace(-.3,.0,200)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
plt.figure(figsize=(16,8))

# Left panel: unregularized negative log likelihood (same as before)
plt.subplot(121)
l2_temp = np.array([0,0])
for i,w0 in enumerate(t1):
    for j,w1 in enumerate(t2):
        w = np.array([w0,w1]).reshape(2,1)
        Z[j,i] = cost(w,X,y,l2_temp,ec)
levs = np.linspace(np.round(Z.min()+50),np.round(Z.max()-50),6)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')

# Right panel: negative log posterior using the experimental prior on the slope only
plt.subplot(122)
for i,w0 in enumerate(t1):
    for j,w1 in enumerate(t2):
        w = np.array([w0,w1]).reshape(2,1)
        Z[j,i] = cost(w,X,y,l2,ec)
CS = plt.contour(T1, T2, Z,levels=levs,cmap=plt.cm.Reds_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Posterior')
plt.show()

[Figure: contour plots of the negative log likelihood (left) and the negative log posterior with the experimental prior on the free sulfur dioxide coefficient (right)]

The left plot is the same as above, but the right plot is now the posterior after applying the prior to only the free sulfur dioxide coefficient. Sklearn and other packages do not have a built-in way to handle this situation, which is why I wrote my own functions above.

Solving the Custom Cost Function

I have shown gradient descent numerous times in this series. This will be the final time!

w = np.zeros((2,1))
i = 0
# Step uphill on the log posterior (equivalently, downhill on the cost) until the gradient is tiny
while np.linalg.norm(grad(w,X,y,l2,ec)) > 1e-6:
    if i % 10000 == 0:
        print cost(w,X,y,l2,ec)[0],np.linalg.norm(grad(w,X,y,l2,ec))
    w = w + 1e-5*grad(w,X,y,l2,ec)
    i += 1
print "Custom Results: ", w[0,0],w[1,0]
738.57359028 11753.013914
183.894939846 1.61330803289
183.861493402 0.033069210393
183.861479334 0.000678928106008
183.861479328 1.39392185238e-05
Custom Results:  1.99719535168 -0.138002425824

These weights represent coefficients that cannot be found using standard logistic regression on the data provided, with or without regularization. Instead, they reflect the optimal weights given the combination of the previous analysis and the data provided.
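
As an optional cross-check (a minimal sketch, assuming scipy is available), the same cost and gradient can be handed to a general-purpose optimizer, which should land on essentially the same weights:

from scipy.optimize import minimize

# Wrappers: scipy works with flat parameter vectors, and minimize() descends,
# while grad() above points uphill on the log posterior, so we negate it.
f = lambda w_flat: cost(w_flat.reshape(-1,1), X, y, l2, ec)[0]
g = lambda w_flat: -grad(w_flat.reshape(-1,1), X, y, l2, ec).ravel()

res = minimize(f, np.zeros(2), jac=g, method='BFGS')
print "Scipy cross-check: ", res.x[0], res.x[1]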

Review

In this post I showed how a simple alteration to the cost function allows priors centered at non-zero values and different regularization strengths for each coefficient. Using that altered cost function, I used the results of a previous analysis as the prior on the negative log likelihood constructed from the data. This leads to coefficients that cannot be generated by packages like sklearn, but that may generalize better.

Thank You

I appreciate you taking the time to read this post. I hope you gained something from it.  Feel free to reach out with any questions in the comments or directly through the email link in the menu.

 
