I am currently taking Big Data For Health Informatics for OMSCS. We have been working with data summarization in hive, feature engineering with pig, and mapping and reducing model fitting with hadoop. In the last homework assignment we were referred to a paper titled Lazy Sparse Stochastic Gradient Descent for Regularized Mutlinomial Logistic Regression for which this post is based.
While I was reading it i had done of those forehead slap, “No Duh! I’ve been an idiot.” moments that forecast improvement in both execution and understanding when working with sparse datasets. Still very much a noob in data science!
In this post I wanted to just show the update rules for SGD that take advantage of sparse data structures with regularization.
Weight Update Functions
As I have gone through the derivation of the logistic cost function and weight updates with Ridge and LASSO regularization at length in previous posts, so I will just post the update rule here:
Ridge Weight Update
LASSO Weight Update
The is the weight for the ith feature, is the probability that the feature is a positive example, is the learning rate for the update, and is the regularization coefficient.
Stochastic Gradient Decent
Stochastic Gradient Decent (SGD) updates the cost function as each data point comes in, instead of in an iterative batch way. The summation in the above equations is going to go away, and each data point is going to individually update the weights. The new update rules follow.
Ridge SGD Weight Update
LASSO SGD Weight Update
I have rearranged the regularization term to the front of the equation to highlight how this changes when the data set is sparse. When there are several data points with a particular feature zero, the regularization is going to decrease the weight.
Ridge SGD Sparse Weight Update
While a particular feature is zero, its weight should be decaying to zero for each new observation. During this process the equations is the following.
If we have 300,000 features, we would need to perform 300,000 decays each round, even if only 10 or so features are non-zero. The above version highlights that we can lazily update the weights when we see a feature is nonzero as long as we keep track of how long it has been since it was updated.
LASSO SGD Sparse Weight Update
Similarly for LASSO regression we see the following update.
If then the weight is set to zero. This is because weight should not change sign from regularization, and is the clipping that is referred to in the paper.
SVM Light Format
To take advantage of the sparseness in the dataset we need to represent the data in a sparse format. The format of the data is
<line> .=. <target> <feature>:<value> … <feature>:<value> # <info>
An example of a line from a file is:
1 10:1.000000 39:1.000000 431:0.960630 951:1.000000 981:0.296029 989:0.721311 1001:0.400000
This row is a positive training example (leading 1) with the 10th feature having a non-zero value of 1.0, the 39th feature having a non-zero value of 1.0, the 431th feature has a value of 0.96, and so on.
This allows a significant compression of data when the data is sparse. This is a common format for text data, and you can download the Dexter Dataset as an example.
I am going to compare two version of the same LogisticRegressionSGD class. One with lazy evaluation and one without lazy evaluation. The code is very similar between the two classes. I am only going to evaluate ridge regression in this post. LASSO extension is equally as straight forward.
class LogisticRegressionSGD: def __init__(self, learning_rate, l2_coef): &quot;&quot;&quot; Initialization of model parameters &quot;&quot;&quot; self.learning_rate = learning_rate self.weight = collections.defaultdict(float) self.l2_coef = l2_coef self.last_update = collections.defaultdict(int) self.count = 1 def fit(self, X, y): &quot;&quot;&quot; Update model using a pair of training sample &quot;&quot;&quot; if self.l2_coef &amp;amp;amp;gt; 0: for f, v in X: self.weight[f] *= (1-self.l2_coef*self.learning_rate)**(self.count-self.last_update[f]) self.last_update[f] = self.count diff = y - self.predict_prob(X) for f, v in X: self.weight[f] += diff*v*self.learning_rate self.count += 1 def predict(self, X): return 1 if self.predict_prob(X) &amp;amp;amp;gt; 0.5 else 0 def predict_prob(self, X): return 1.0 / (1.0 + math.exp(-math.fsum((self.weight[f]*v for f, v in X))))
import collections import math class LogisticRegressionSGDSlow: def __init__(self,learning_rate, l2_norm): &quot;&quot;&quot; Initialization of model parameters &quot;&quot;&quot; self.learning_rate = learning_rate self.weight = collections.defaultdict(float) #[0.0] * n_feature self.l2_norm = l2_norm self.last_update = collections.defaultdict(int) def fit(self, X, y): &quot;&quot;&quot; Update model using a pair of training sample &quot;&quot;&quot; diff = y - self.predict_prob(X) if self.l2_norm &amp;amp;amp;gt; 0: for f in self.weight: self.weight[f] *= (1-self.l2_norm*self.learning_rate) for f, v in X: self.weight[f] += diff*v*self.learning_rate def predict(self, X): return 1 if self.predict_prob(X) &amp;amp;amp;gt; 0.5 else 0 def predict_prob(self, X): return 1.0 / (1.0 + math.exp(-math.fsum((self.weight[f]*v for f, v in X))))
The first class is keeping track of how many examples it has been fed, and using an additionally dictionary to track when the last update to that weight was. The second class iterates through each non-zero weight for each data point and ‘decaying’ it.
Command Line Training
To show and ensure the SGD component of the training, I am going to use the command line piping to load the data into the model. There is a python script to read in stdin and parse the svmlight data. It then trains the SGD logistic mode.
#!/usr/bin/env python import sys import random import pickle from optparse import OptionParser from LogisticRegressionSGDSlow import LogisticRegressionSGDSlow to_float_tuple = lambda x: (int(x),float(x)) if __name__ == '__main__': parser = OptionParser() parser.add_option(&quot;-l&quot;, &quot;--learning_rate&quot;, action=&quot;store&quot;, dest=&quot;learning_rate&quot;, default=0.01, help=&quot;step size&quot;, type=&quot;float&quot;) parser.add_option(&quot;-c&quot;, &quot;--l2_norm&quot;, action=&quot;store&quot;, dest=&quot;l2_norm&quot;, default=0.0, help=&quot;regularization strength&quot;, type=&quot;float&quot;) parser.add_option(&quot;-m&quot;, &quot;--model-path&quot;, action=&quot;store&quot;, dest=&quot;path&quot;, default=&quot;logSGD.model&quot;, help=&quot;path where trained classifier was saved&quot;) options, args = parser.parse_args(sys.argv) model = LogisticRegressionSGDSlow(options.learning_rate, options.l2_norm) for line in sys.stdin: row = line.split(&quot; &quot;,1) y = int(row) features = row X = [ to_float_tuple(pair.split(&quot;:&quot;)) for pair in features.split() ] model.fit(X,y) with open(options.path, &quot;wb&quot;) as f: pickle.dump(model, f)
Training the model on sparse data with 25k rows and 300k features looks like
time cat data/training.data | python train_data_slow.py -c 0.1 -l 0.1
The time output is:
This took about 30 second to train in the command light. Using the same code but running on lazy evaluation has a significant speed increase.
time cat data/training.data | python train_data_lazy.py -c 0.1 -l 0.1
The speed increase in this case is 30x. The results of the model are very similar, though they never will be exactly the same. This is because the updates are different, so the final model with be different.
In this post I wanted to review a paper I was exposed to in my class that allowed me to understand how taking advantage of the sparse representation of data can lead to performance increase in training a model. Sklearn in Python and glmnet in R both take advantage of this representation, so there is not any need to do it yourself. I, just out of habit of mind, like to deconstruct these ideas when I have a chance.
Thanks for Reading.
As always, I hope you gained something from reading my musings. Feel free to make comments, or let me know any corrections that need to be made. I have found tons already, but have not caught them all.