This is the final post in my series on implementing logistic regression from scratch. My initial motivation for the series was to cover the theory and basic implementation of logistic regression, but also to show that there are cases where one might not want to use the default models provided by a package. In this post I am going to give a toy case for this scenario using a wine data set.
My thought process is that there are contexts within companies where multiple analyses on separate datasets lead to baseline expectations. In my company there are multiple distinct sources of information that can be used to build the same metrics or models, and I imagine this is the case for most data-driven companies. Context always improves the usefulness of information and decision making, and finding ways to incorporate context into models improves the generalization of the results.
In this post I am considering a scenario where the results of another analysis set a baseline that I want incorporated into the model, but the training data cannot be joined with the data used for that previous analysis. In this context, the previous analysis is setting a prior. I want the prior to change the logistic cost function so that the results include information about the baseline.
Results of a Previous Analysis
Imagine the contrived example that someone looked at a worldwide sample of wines and found that the amount of free sulfur dioxide (a preservative) in a wine was related to how likely the wine was to be red rather than white: the more free sulfur dioxide, the more likely it was to be white. It turned out that the log odds of being red were approximately linear with respect to the concentration of free sulfur dioxide.
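To unpack "log odds ... approximately linear": under a logistic model the log odds are exactly the linear term, as this minimal sketch shows (the intercept and slope here are made-up illustrative values, not the study's):

```python
import numpy as np

b0, b1 = 2.0, -0.14                  # hypothetical intercept and slope
x = np.array([10.0, 30.0, 50.0])     # free sulfur dioxide concentrations

log_odds = b0 + b1 * x               # linear in the concentration
p_red = 1 / (1 + np.exp(-log_odds))  # probability of being red

# Recovering the log odds from the probabilities gives back the same line
recovered = np.log(p_red / (1 - p_red))
print(np.allclose(recovered, log_odds))  # True
```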
Using the worldwide sample, the estimate of the coefficient was found to be approximately Gaussian, with a mean of -0.14 and a standard deviation of 0.005.
These results can be used to construct a prior on the coefficient for this feature (but not the other weights):
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-0.10, -0.20, 1000)
y = norm.pdf(x, loc=-0.14, scale=0.005)
plt.figure(figsize=(16, 8))
plt.plot(x, y, color="#AA3333")
plt.ylim(y.min(), y.max() + 1)
plt.xlabel('Coefficient Value')
plt.ylabel('Probability Density')
plt.show()
```
In previous posts I covered the theory of regularization and implemented LASSO (Laplace prior) and ridge (Gaussian prior) regularization. In both cases the assumption is that all coefficients have the same prior distribution, centered around zero. In this case we have a prior on only one coefficient, and it is not centered at zero. We can generalize the prior(s) to include this.
The prior we used previously assumed each weight was centered at zero and that all weights had the same distribution:

$$ w_i \sim \mathcal{N}\left(0, \sigma^2\right) $$
If we allow both the expected value and the standard deviation to differ for each weight, this changes to:

$$ w_i \sim \mathcal{N}\left(\mu_i, \sigma_i^2\right) $$
That means the negative log prior is:

$$ -\log p(w) = \sum_i \frac{\left(w_i - \mu_i\right)^2}{2\sigma_i^2} + \text{const} $$
To get an update rule that includes this prior, we take the derivative with respect to each weight:

$$ \frac{\partial}{\partial w_i}\left[-\log p(w)\right] = \frac{w_i - \mu_i}{\sigma_i^2} $$
Combining this with the logistic weight update rule, we get the following equation:

$$ w_i \leftarrow w_i + \eta\left[\sum_j \left(y_j - p_j\right)x_{ji} - \frac{w_i - \mu_i}{\sigma_i^2}\right] $$
The convention is the same as in similar posts: $w_i$ is the coefficient for the $i$th feature, $w_0$ is the intercept, $\eta$ is the learning rate, $\mu_i$ is the expected value and $\sigma_i$ the standard deviation of the estimate of $w_i$, $y_j$ is the label of the $j$th sample, $p_j$ is its predicted probability, and $x_{ji}$ is its value for the $i$th feature.
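A minimal NumPy sketch of one step of this update (the function and variable names are mine, not the ones used later in the post):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def update_step(w, X, y, eta, mu, sigma):
    """One gradient step on the log posterior.

    w     : (d, 1) weights, with the intercept in position 0
    X     : (n, d) design matrix with a leading column of ones
    y     : (n, 1) column vector of 0/1 labels
    eta   : learning rate
    mu    : (d, 1) prior means
    sigma : (d, 1) prior standard deviations (np.inf means no prior)
    """
    residual = y - sigmoid(X @ w)        # (n, 1)
    grad_ll = X.T @ residual             # gradient of the log likelihood
    grad_prior = -(w - mu) / sigma**2    # gradient of the log prior
    return w + eta * (grad_ll + grad_prior)
```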
I have a sample of wines from South America, and I want to build a model that estimates the probability that a wine is red based on its concentration of free sulfur dioxide. Since I have a biased sample, in an effort to generalize I want to include the prior. There are a number of issues that I am going to ignore:
- In estimating out-of-sample performance, the model without the prior will likely look better. A train/test split will both be drawn from the biased South American sample, not from the worldwide distribution of wines.
- I have no way to test on a generalized population. I would have to release the model into the wild and monitor its performance.
- With enough data, the model will fit to the sample and overwhelm the prior.
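The last point can be made concrete: the gradient of the log likelihood grows with the sample size, while the prior's pull stays constant, so with enough data the likelihood term swamps the prior. A rough sketch with synthetic data (the "true" model, intercept of 2.0, and concentration range are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
w, mu, sigma = -0.10, -0.14, 0.005  # current weight, prior mean, prior sd

for n in (100, 100_000):
    x = rng.uniform(0, 60, n)                    # hypothetical concentrations
    p_true = 1 / (1 + np.exp(-(2.0 + mu * x)))   # assumed "true" model
    y = (rng.random(n) < p_true).astype(float)   # simulated labels
    p_model = 1 / (1 + np.exp(-(2.0 + w * x)))
    grad_ll = np.sum((y - p_model) * x)          # grows with n
    grad_prior = -(w - mu) / sigma**2            # fixed at about -1600
    print(n, abs(grad_ll), abs(grad_prior))
```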
Here’s the data:
```python
import pandas as pd

wine = pd.read_csv("wine.csv")[['free sulfur dioxide', 'Type']]
wine.head()
```
| free sulfur dioxide | Type |
| --- | --- |
In the previous posts in this series I wrote classes for logistic regression, but in this post I will keep it simple with three functions: the first computes the probability estimate from the weights, the second computes the value of the cost function (the negative log likelihood plus the prior penalty), and the third computes the gradient used in the update.
```python
def prob(w, X):
    # Predicted probability of the positive class
    return 1 / (1 + np.exp(-X.dot(w)))

def cost(w, X, y, l2, ec):
    # Negative log likelihood plus the Gaussian prior penalty
    p = prob(w, X)
    return -np.sum(y*np.log(p) + (1-y)*np.log(1-p)) + l2.dot(np.power(w-ec, 2))/2

def grad(w, X, y, l2, ec):
    # Gradient of the log posterior (the direction of ascent)
    return X.T.dot(y - prob(w, X)) - l2.reshape(l2.size, 1)*(w - ec)
```
The weight column vector w holds the intercept in the zeroth position, with the remaining entries being the slopes for the features. X is a matrix whose first column is all ones, for the intercept. y is a column vector of labels. l2 is a row vector of regularization parameters, which allows a different parameter for each feature, and ec is a column vector of the expected values from the prior.
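To make these shape conventions concrete, here is a tiny worked example with two samples and one feature (the numbers are arbitrary):

```python
import numpy as np

def prob(w, X):
    return 1 / (1 + np.exp(-X.dot(w)))

# Two samples, one feature; the first column of X is all ones for the intercept
X = np.array([[1.0, 10.0],
              [1.0, 40.0]])        # shape (2, 2)
y = np.array([[1], [0]])           # (2, 1) column vector of labels
w = np.array([[0.0], [0.0]])       # (2, 1): intercept first, then the slope
l2 = np.array([0.0, 1/0.005**2])   # (2,) row vector: no prior on the intercept
ec = np.array([[0.0], [-0.14]])    # (2, 1) prior means

print(prob(w, X).ravel())  # both 0.5, since all weights are zero
```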
For our case I am going to use the following values for l2 and ec, taken from the prior:
```python
ec = np.array([[0], [-0.14]])
l2 = np.array([0, 1/0.005**2])
```
In a previous post on regularization, I examined the cost function in the parameter space, which made it easy to see the solution the parameters would eventually converge to, and I confirmed it with sklearn. In this post, let's look at what would happen to the solution if we did traditional ridge regularization with a regularization parameter of 4000 (equivalent to a zero-centered Gaussian prior with a standard deviation of 1/√4000 ≈ 0.016).
I want to load a random sample of 500 wines:
```python
indexes = np.random.choice(wine.shape[0], 500)
data = wine.loc[indexes, ['free sulfur dioxide']].values
labels = (wine.loc[indexes, 'Type'].values == 'Red').astype(int)
X = np.hstack((np.ones((data.size, 1)), data))
y = labels.reshape((labels.size, 1))
```
Here is the code for the contour plot of the cost function:
```python
t1 = np.linspace(0, 5, 100)
t2 = np.linspace(-.3, .0, 200)
ecz = np.zeros((2, 1))
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
plt.figure(figsize=(16, 8))

# Left: unregularized negative log likelihood
plt.subplot(121)
l2_temp = np.zeros(2)
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        w = np.array([w0, w1]).reshape(2, 1)
        Z[j, i] = cost(w, X, y, l2_temp, ecz)
levs = np.linspace(np.round(Z.min()+50), np.round(Z.max()-50), 6)
CS = plt.contour(T1, T2, Z, levels=levs, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')

# Right: posterior with a zero-centered Gaussian prior on both weights
plt.subplot(122)
l2_temp = np.array([4000., 4000.])
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        w = np.array([w0, w1]).reshape(2, 1)
        Z[j, i] = cost(w, X, y, l2_temp, ecz)
CS = plt.contour(T1, T2, Z, levels=levs, cmap=plt.cm.Reds_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Posterior')
plt.show()
```
Here is the resulting plot:
The left plot is a contour plot of the negative log likelihood, and the right plot is a contour plot of the posterior after applying a Gaussian prior on the weights centered at zero. The traditional regularization pulls the coefficients from about (2, -0.12) to (0, -0.05). Running this data through sklearn should confirm this:
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=1e25)  # effectively unregularized
model.fit(data, labels)
print("Unregularized Sklearn: ", model.intercept_, model.coef_[0, 0])

model = LogisticRegression(C=1/4000.)  # C is the inverse of the penalty strength
model.fit(data, labels)
print("Regularized Sklearn: ", model.intercept_, model.coef_[0, 0])
```
```
Unregularized Sklearn:  1.72872081295 -0.123757551858
Regularized Sklearn:  0.00515437416919 -0.0546835703845
```
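As a side note, sklearn's `C` is the inverse of the regularization strength, so the values above can be translated into the prior notation used here (a minimal sketch):

```python
import numpy as np

# The ridge strength used in the contour plots above
lam = 4000.0
C = 1.0 / lam                # the value passed to LogisticRegression
sigma = 1.0 / np.sqrt(lam)   # implied zero-centered prior sd, ~0.0158

# Conversely, the experimental prior sd of 0.005 implies a much stronger penalty
lam_prior = 1.0 / 0.005**2   # 40000
C_prior = 1.0 / lam_prior    # the C needed to match that strength
```

Note that this conversion only matches the strength of the penalty; sklearn still cannot center the prior away from zero or restrict it to a single coefficient.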
Instead of applying the traditional prior, I want to apply the experimental prior to the likelihood and find the minimum of the resulting posterior function. First, let's look at the cost function and the posterior function side by side.
```python
t1 = np.linspace(0, 5, 100)
t2 = np.linspace(-.3, .0, 200)
T1, T2 = np.meshgrid(t1, t2)
Z = T1.copy()
plt.figure(figsize=(16, 8))

# Left: unregularized negative log likelihood
plt.subplot(121)
l2_temp = np.array([0, 0])
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        w = np.array([w0, w1]).reshape(2, 1)
        Z[j, i] = cost(w, X, y, l2_temp, ec)
levs = np.linspace(np.round(Z.min()+50), np.round(Z.max()-50), 6)
CS = plt.contour(T1, T2, Z, levels=levs, cmap=plt.cm.Blues_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Likelihood')

# Right: posterior using the experimental prior on the slope only
plt.subplot(122)
for i, w0 in enumerate(t1):
    for j, w1 in enumerate(t2):
        w = np.array([w0, w1]).reshape(2, 1)
        Z[j, i] = cost(w, X, y, l2, ec)
CS = plt.contour(T1, T2, Z, levels=levs, cmap=plt.cm.Reds_r)
plt.clabel(CS, inline=1, fontsize=15)
plt.xlabel('Intercept')
plt.ylabel('First Coef')
plt.title('Contour Plot of Negative Log Posterior')
plt.show()
```
The left plot is the same as above, but the right plot is now the posterior after applying regularization to only the free sulfur dioxide coefficient. Sklearn and other packages do not have a way to handle this situation, which is why I wrote my own functions above.
Solving the Custom Cost Function
I have shown gradient descent numerous times in this series. This will be the final time!
```python
w = np.zeros((2, 1))
i = 0
while np.linalg.norm(grad(w, X, y, l2, ec)) > 1e-6:
    if i % 10000 == 0:
        print(cost(w, X, y, l2, ec), np.linalg.norm(grad(w, X, y, l2, ec)))
    w = w + 1e-5*grad(w, X, y, l2, ec)
    i += 1
print("Custom Results: ", w[0, 0], w[1, 0])
```
```
738.57359028 11753.013914
183.894939846 1.61330803289
183.861493402 0.033069210393
183.861479334 0.000678928106008
183.861479328 1.39392185238e-05
Custom Results:  1.99719535168 -0.138002425824
```
These weights are coefficients that could not be found using logistic regression on the provided data alone, with or without traditional regularization. Instead, they reflect the optimal weights given the combination of the previous analysis and the data provided.
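As a quick check of what the fitted model implies, the posterior-mode weights printed above (rounded here) can be used to score new concentrations:

```python
import numpy as np

w0, w1 = 1.997, -0.138  # intercept and slope from the run above, rounded

def p_red(free_so2):
    """Estimated probability that a wine is red, given free sulfur dioxide."""
    z = w0 + w1 * np.asarray(free_so2, dtype=float)
    return 1 / (1 + np.exp(-z))

print(p_red([5.0, 30.0, 60.0]))  # probability falls as free SO2 rises
```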
In this post I showed how a simple alteration to the cost function allows priors with non-zero means and different regularization coefficients for each weight. Using that altered function, I used the results of a previous analysis as the prior for the negative log likelihood constructed from the data. This leads to coefficients that cannot be generated by packages like sklearn, but that could generalize better.
I appreciate you taking the time to read this post. I hope you gained something from it. Feel free to reach out with any questions in the comments or directly through the email link in the menu.