I am currently taking Big Data For Health Informatics for OMSCS.   We have been working with data summarization in hive, feature engineering with pig, and mapping and reducing model fitting with hadoop.   In the last homework assignment we were referred to a paper titled Lazy Sparse Stochastic Gradient Descent for Regularized Mutlinomial Logistic Regression for which this post is based.

While I was reading it i had done of those forehead slap, “No Duh! I’ve been an idiot.” moments that forecast improvement in both execution and understanding when working with sparse datasets.   Still very much a noob in data science!

In this post I wanted to just show the update rules for SGD that take advantage of sparse data structures with regularization.

# Weight Update Functions

As I have gone through the derivation of the logistic cost function and weight updates with Ridge and LASSO regularization at length in previous posts, so I will just post the update rule here:

## Ridge Weight Update

$w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i - \alpha \lambda w^i$

## LASSO Weight Update

$w^i = w^i + \alpha \ \sum_{k} \big(y_k - p(y_k = 1 |X,w)\big) X^i - \alpha \lambda sign(w^i)$

The $w^i$ is the weight for the ith feature, $p( y_k = 1 | X,w)$ is the probability that the feature is a positive example, $\alpha$ is the learning rate for the update, and $\lambda$ is the regularization coefficient.

Stochastic Gradient Decent (SGD) updates the cost function as each data point comes in, instead of in an iterative batch way.   The summation in the above equations is going to go away, and each data point is going to individually update the weights.  The new update rules follow.

## Ridge SGD Weight Update

$w^i_t = w^i_t \left( 1 - \alpha \lambda \right) + \alpha \left( y_t - p(y_t=1 | X_t, w_t) \right) X^i_t$

## LASSO SGD Weight Update

$w^i_t = w^i_t - \alpha \lambda sign(w^i_t) + \alpha \left( y_t - p(y_t=1 | X_t, w_t) \right) X^i_t$

I have rearranged the regularization term to the front of the equation to highlight how this changes when the data set is sparse.  When there are several data points with a particular feature zero, the regularization is going to decrease the weight.

## Ridge SGD Sparse Weight Update

While a particular feature is zero, its weight should be decaying to zero for each new observation.   During this process the equations is the following.

$w^i_t = w^i_{t-1} \left( 1 - \alpha \lambda \right)$

$w^i_t = w^i_{t-2} \left( 1 - \alpha \lambda \right)^2$

$w^i_t = w^i_{t-3} \left( 1 - \alpha \lambda \right)^3$

$w^i_t = w^i_{t-n} \left( 1 - \alpha \lambda \right)^n$

If we have 300,000 features, we would need to perform 300,000 decays each round, even if only 10 or so features are non-zero.   The above version highlights that we can lazily update the weights when we see a feature is nonzero as long as we keep track of how long it has been since it was updated.

## LASSO SGD Sparse Weight Update

Similarly for LASSO regression we see the following update.

$w^i_t = w^i_{t-1} - sign(w^i_{t-1}) \alpha \lambda$

$w^i_t= w^i_{t-2} - 2 sign( {w^i_{t-2}} ) {\alpha \lambda}$

$w^i_t=w^i_{t-3} - 3 sign( {w^i_{t-3}} ) {\alpha \lambda}$

$w^i_t=w^i_{t-n} - n sign( {w^i_{t-n}} ) {\alpha \lambda}$

$w^i_t = w^i_{t-n} \left( 1 - \frac{ n\alpha \lambda }{|w^i_{t-n}|} \right)$

If $n \lambda \alpha > |w^i_{t-n}|$ then the weight is set to zero.   This is because weight should not change sign from regularization, and is the clipping that is referred to in the paper.

# SVM Light Format

To take advantage of the sparseness in the dataset we need to represent the data in a sparse format.  The format of the data is

<line> .=. <target> <feature>:<value> … <feature>:<value> # <info>

An example of a line from a file is:

1 10:1.000000 39:1.000000 431:0.960630 951:1.000000 981:0.296029 989:0.721311 1001:0.400000

This row is a positive training example (leading 1) with the 10th feature having a non-zero value of 1.0, the 39th feature having a non-zero value of 1.0, the 431th feature has a value of 0.96, and so on.

This allows a significant compression of data when the data is sparse.  This is a common format for text data, and you can download the Dexter Dataset as an example.

# Code

I am going to compare two version of the same LogisticRegressionSGD class.  One with lazy evaluation and one without lazy evaluation.  The code is very similar between the two classes.  I am only going to evaluate ridge regression in this post.  LASSO extension is equally as straight forward.

## Lazy Evaluation

class LogisticRegressionSGD:

def __init__(self, learning_rate, l2_coef):
&amp;quot;&amp;quot;&amp;quot;
Initialization of model parameters
&amp;quot;&amp;quot;&amp;quot;
self.learning_rate = learning_rate
self.weight = collections.defaultdict(float)
self.l2_coef = l2_coef
self.last_update = collections.defaultdict(int)
self.count = 1

def fit(self, X, y):
&amp;quot;&amp;quot;&amp;quot;
Update model using a pair of training sample
&amp;quot;&amp;quot;&amp;quot;
if self.l2_coef &amp;amp;amp;amp;gt; 0:
for f, v in X:
self.weight[f] *= (1-self.l2_coef*self.learning_rate)**(self.count-self.last_update[f])
self.last_update[f] = self.count

diff = y - self.predict_prob(X)

for f, v in X:
self.weight[f] += diff*v*self.learning_rate
self.count += 1

def predict(self, X):
return 1 if self.predict_prob(X) &amp;amp;amp;amp;gt; 0.5 else 0

def predict_prob(self, X):
return 1.0 / (1.0 + math.exp(-math.fsum((self.weight[f]*v for f, v in X))))



## Standard Evaluation


import collections
import math

class LogisticRegressionSGDSlow:

def __init__(self,learning_rate, l2_norm):
&amp;quot;&amp;quot;&amp;quot;
Initialization of model parameters
&amp;quot;&amp;quot;&amp;quot;
self.learning_rate = learning_rate
self.weight = collections.defaultdict(float)  #[0.0] * n_feature
self.l2_norm = l2_norm
self.last_update = collections.defaultdict(int)

def fit(self, X, y):
&amp;quot;&amp;quot;&amp;quot;
Update model using a pair of training sample
&amp;quot;&amp;quot;&amp;quot;

diff = y - self.predict_prob(X)

if self.l2_norm &amp;amp;amp;amp;gt; 0:
for f in self.weight:
self.weight[f] *= (1-self.l2_norm*self.learning_rate)

for f, v in X:
self.weight[f] += diff*v*self.learning_rate

def predict(self, X):
return 1 if self.predict_prob(X) &amp;amp;amp;amp;gt; 0.5 else 0

def predict_prob(self, X):
return 1.0 / (1.0 + math.exp(-math.fsum((self.weight[f]*v for f, v in X))))



The first class is keeping track of how many examples it has been fed, and using an additionally dictionary to track when the last update to that weight was.   The second class iterates through each non-zero weight for each data point and ‘decaying’ it.

# Command Line Training

To show and ensure the SGD component of the training, I am going to use the command line piping to load the data into the model. There is a python script to read in stdin and parse the svmlight data. It then trains the SGD logistic mode.

#!/usr/bin/env python

import sys
import random
import pickle
from optparse import OptionParser
from LogisticRegressionSGDSlow import LogisticRegressionSGDSlow

to_float_tuple = lambda x: (int(x[0]),float(x[1]))

if __name__ == '__main__':
parser = OptionParser()
dest=&amp;quot;learning_rate&amp;quot;, default=0.01, help=&amp;quot;step size&amp;quot;, type=&amp;quot;float&amp;quot;)
dest=&amp;quot;l2_norm&amp;quot;, default=0.0, help=&amp;quot;regularization strength&amp;quot;, type=&amp;quot;float&amp;quot;)
default=&amp;quot;logSGD.model&amp;quot;, help=&amp;quot;path where trained classifier was saved&amp;quot;)

options, args = parser.parse_args(sys.argv)

model = LogisticRegressionSGDSlow(options.learning_rate, options.l2_norm)

for line in sys.stdin:
row = line.split(&amp;quot; &amp;quot;,1)
y = int(row[0])
features = row[1]
X = [ to_float_tuple(pair.split(&amp;quot;:&amp;quot;)) for pair in features.split() ]
model.fit(X,y)

with open(options.path, &amp;quot;wb&amp;quot;) as f:
pickle.dump(model, f)



Training the model on sparse data with 25k rows and 300k features looks like

time cat data/training.data | python train_data_slow.py -c 0.1 -l 0.1


The time output is:

real      0m27.605s

user     0m27.442s

sys       0m0.070s

This took about 30 second to train in the command light.   Using the same code but running on lazy evaluation has a significant speed increase.

time cat data/training.data | python train_data_lazy.py -c 0.1 -l 0.1


real      0m0.959s

user     0m0.844s

sys       0m0.044s

The speed increase in this case is 30x.   The results of the model are very similar, though they never will be exactly the same.   This is because the updates are different, so the final model with be different.

# Final Thoughts

In this post I wanted to review a paper I was exposed to in my class that allowed me to understand how taking advantage of the sparse representation of data can lead to performance increase in training a model.   Sklearn in Python and glmnet in R both take advantage of this representation, so there is not any need to do it yourself.   I, just out of habit of mind, like to deconstruct these ideas when I have a chance.